CSV: remove rows where the second column has fewer than X words

I have a large CSV (35 GB) in the following format:

"id","text","other_info"
"1","this is some text, a news report citing more than 30 sources, including investors","some other info"
"11","this is some text","sme other info"
"111","","sme other info"
"21","this is some real text Language models with hundreds of billions of parameters","sme other info"

Result:

"id","text","other_info"
"1","this is some text, a news report citing more than 30 sources, including investors","some other info"
"21","this is some real text Language models with hundreds of billions of parameters","sme other info"

The problem is that the text field sometimes has fewer than 10 words, and I want to remove such rows.
What is a fast way to do this, please?

I can write a loop in Python:

import csv

lst = []
with open('file.csv') as f:
    for row in csv.reader(f):
        if len(row[1].split()) >= 10:  # keep rows whose text field has at least 10 words
            lst.append(row)

but this would be dead slow on my 35 GB file.

  • please update the question with sample inputs (3-5 lines, some with <10 words, some with >= 10 words), the code you’ve tried, the (wrong) output generated by your code and the (correct) expected output

  • I have done this @markp-fuso

  • @EdMorton I have fixed all that you have mentioned.

  • Can your quoted fields contain newlines, commas, or escaped quotes (\" or "")?

  • please update the question to show some rows that contain newlines, commas, or escaped quotes

Based on this example file, test.csv:

1,a,other_info
2,a b,other_info
3,a b a b ab ab ab ab ab ab ab ab ab ab ab ab ab,other_info
4,a c,other_info
5,a b a b ab ab ab ab ab ab ab ab ab ab ab ab ab,other_info
6,a b a b ab ab ab ab ab ab ab ab ab ab ab ab ab,other_info

Running the following awk command:

awk -F ',' '{ split($2, words, " "); if (length(words) >= 10) print }' test.csv

yields:

3,a b a b ab ab ab ab ab ab ab ab ab ab ab ab ab,other_info
5,a b a b ab ab ab ab ab ab ab ab ab ab ab ab ab,other_info
6,a b a b ab ab ab ab ab ab ab ab ab ab ab ab ab,other_info
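Note that length(words) on an array is a GNU awk extension (some other awks support it too); POSIX only defines length() for strings. If portability matters, a variant of the same logic (my sketch, not part of the original command) can use split's return value, which is the number of fields produced:

awk -F ',' 'split($2, words, " ") >= 10' test.csv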

UPDATE

Now, in your provided example each field is quoted, so you can leverage GNU awk's FPAT:

awk 'BEGIN {FPAT = "(\"[^\"]*\")"} {split($2,words," "); if (length(words) >= 10) print}' my_file.csv

It will still use whitespace as the word separator inside the field, though. So if the text field contains “this is some text, for example: a,b,c,d,e,f”, it would be counted as only 7 words.
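If you would rather treat commas as word separators too, one possible tweak (my sketch, still GNU awk with the same FPAT assumption) is to split on a character class of spaces and commas:

awk 'BEGIN { FPAT = "(\"[^\"]*\")" } { if (split($2, words, /[ ,]+/) >= 10) print }' my_file.csv

With this, “a,b,c,d,e,f” counts as six words rather than one.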

I can write a loop in Python:

import csv

lst = []
with open('file.csv') as f:
    for row in csv.reader(f):
        if len(row[1].split()) >= 10:  # keep rows whose text field has at least 10 words
            lst.append(row)

but this would be dead slow on my 35 GB file.

A more serious problem is that you might run out of memory for your huge lst. You should write the output to a file instead. I suggest the following Python 3 solution. Let file.csv content be

"id","text","other_info"
"1","this is some text, a news report citing more than 30 sources, including investors","some other info"
"11","this is some text","sme other info"
"111","","sme other info"
"21","this is some real text Language models with hundreds of billions of parameters","sme other info"

then

import csv

with open('file.csv') as infile, open('file.csv', newline="") as csvfile, open('filtered.csv', 'w') as outfile:
    reader = csv.reader(csvfile)
    for line, row in zip(infile, reader):  # raw line and parsed row stay in step
        if len(row[1].split()) >= 10:      # keep rows whose text field has at least 10 words
            outfile.write(line)

creates filtered.csv with the following content:

"1","this is some text, a news report citing more than 30 sources, including investors","some other info"
"21","this is some real text Language models with hundreds of billions of parameters","sme other info"

Explanation: I open file.csv twice: infile is used for reading the raw lines, csvfile is fed into csv.reader, and outfile is used for writing. Then I create the reader and iterate in unison over the reader and the raw file, so I have access both to each line as-is and to its parsed row. If the second field (row[1]), split on whitespace, has at least 10 words, I write that line as-is to the output file. I do not know whether this solution would be fast enough for your use case, so please test it against a subset of your file and estimate the total running time from that. Disclaimer: this solution assumes that the 10-word rule applies to all lines of the input file, including the first (header) line, and that no quoted field contains an embedded newline; otherwise the raw lines and the parsed rows would fall out of step.

(tested in Python 3.10.12)
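If the header row should be kept regardless of its word count, a minimal variant (my sketch, with the same no-embedded-newlines assumption as above) copies the first raw line through before filtering:

import csv

with open('file.csv') as infile, open('file.csv', newline="") as csvfile, open('filtered.csv', 'w') as outfile:
    reader = csv.reader(csvfile)
    outfile.write(next(infile))  # pass the header line through unchanged
    next(reader)                 # keep the reader in step with infile
    for line, row in zip(infile, reader):
        if len(row[1].split()) >= 10:
            outfile.write(line)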

This might work for you (GNU sed):

sed -E '1b;/,"[^" ]+(\s+[^" ]+){9}/!d' file

The 1b passes the first (header) line through untouched; /…/!d then deletes every other line whose second column does not contain 10 or more words. In the regex, ," matches the comma and opening quote of the second field, [^" ]+ matches one word (a run of characters that are neither quotes nor spaces), and (\s+[^" ]+){9} matches nine more whitespace-separated words.
