
I have a list of strings that I want to filter out of a CSV, and I'm trying to figure out a Pythonic way to do it. For example, this is what I'm doing:

with open('output.csv', 'wb') as outf:
    with open('input.csv', 'rbU') as inf:
         read = csv.reader(inf)
         outwriter = csv.writer(outf)
         notstrings = ['and', 'or', '&', 'is', 'a', 'the']
         for row in read:
             (if none of notstrings in row[3])
                 outwriter.writerow(row)

I don't know what to put in the parentheses (or if there's a better overall way to go about this).

  • You mean you want to exclude a row if column 4 contains any of those words? Commented Apr 6, 2015 at 17:19
  • What kind of values are there in row[3]? Is it a sentence? Is there punctuation? Should only whole words be matched? Commented Apr 6, 2015 at 17:22
  • No, just column 3. Also, row[3] is supposed to be a name, but I'm slowly building a list of filters to weed out non-names (better too zealous than not zealous enough). However, I'm using this more to learn the best methods than to solve this one application. Commented Apr 6, 2015 at 17:33
  • I was counting from 1, row[0] is the 1st column, etc. Commented Apr 6, 2015 at 17:44

2 Answers

2

You can use the any() function to test each of the words in your list against a column:

if not any(w in row[3] for w in notstrings):
    # none of the strings are found, write the row

This will be true if none of those strings appear in row[3]. It matches substrings, however, so 'a' would produce a false positive on a value such as 'false-positive', for example.
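
To make the substring caveat concrete, here is a minimal sketch in Python 3. The helper name contains_any and the sample names are made up for illustration; only notstrings comes from the question.

```python
notstrings = ['and', 'or', '&', 'is', 'a', 'the']

def contains_any(text, needles):
    """Return True if any needle occurs anywhere in text (plain substring match)."""
    return any(needle in text for needle in needles)

# Hypothetical sample values for row[3].
samples = ['Jane Doe', 'Boris Lee', 'Tom & Jerry', 'Lee Wu']
results = {name: contains_any(name, notstrings) for name in samples}

for name, hit in results.items():
    print(name, '-> filtered out' if hit else '-> kept')
```

'Jane Doe' and 'Boris Lee' are rejected here only because 'a', 'or' and 'is' occur inside longer words, which is probably not what you want for a name column.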

Put into context:

import csv

with open('output.csv', 'wb') as outf:
    with open('input.csv', 'rbU') as inf:
        read = csv.reader(inf)
        outwriter = csv.writer(outf)
        notstrings = ['and', 'or', '&', 'is', 'a', 'the']
        for row in read:
            # keep the row only if none of the strings occur in row[3]
            if not any(w in row[3] for w in notstrings):
                outwriter.writerow(row)
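
The file modes above are the Python 2 ones. As a rough sketch of the same loop for Python 3, assuming you wrap it in a function (filter_rows is a name I made up) and using in-memory buffers with invented data in place of input.csv and output.csv:

```python
import csv
import io

NOTSTRINGS = ['and', 'or', '&', 'is', 'a', 'the']

def filter_rows(inf, outf, column=3, banned=NOTSTRINGS):
    """Copy CSV rows from inf to outf, skipping rows whose column contains a banned string."""
    writer = csv.writer(outf)
    for row in csv.reader(inf):
        if not any(w in row[column] for w in banned):
            writer.writerow(row)

# In-memory stand-ins for the real files (made-up rows).
source = io.StringIO('1,x,y,Bob Smith\n2,x,y,Tom & Jerry\n3,x,y,Lee Wu\n', newline='')
target = io.StringIO(newline='')
filter_rows(source, target)
kept = target.getvalue().splitlines()
print(kept)
```

With real files in Python 3 you would open them in text mode instead, e.g. open('input.csv', newline='') and open('output.csv', 'w', newline=''), as the csv module documentation recommends.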

If you need to honour word boundaries then a regular expression is going to be a better idea here:

notstrings = re.compile(r'(?:\b(?:and|or|is|a|the)\b)|(?:\B&\B)')
if not notstrings.search(row[3]):
    # none of the words are found, write the row

I created a Regex101 demo for the expression to demonstrate how it works. It has two branches:

  • \b(?:and|or|is|a|the)\b - matches any of the words in the list provided they are at the start, end, or between non-word characters (punctuation, whitespace, etc.)
  • \B&\B - matches the & character if at the start, end, or between non-word characters. You can't use \b here as & is itself not a word character.
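
A quick way to check both branches is to run the pattern against a few values. This is only a sketch: the is_clean helper and the sample names are mine, and the pattern is the one from the answer, unchanged.

```python
import re

notstrings = re.compile(r'(?:\b(?:and|or|is|a|the)\b)|(?:\B&\B)')

def is_clean(text):
    """Return True if none of the filtered words (or a standalone &) occur in text."""
    return notstrings.search(text) is None

# Hypothetical sample values for row[3].
samples = ['Jane Doe', 'Boris Lee', 'Tom & Jerry', 'Jack and Jill', 'AT&T']
checks = {name: is_clean(name) for name in samples}

for name, ok in checks.items():
    print(name, '-> kept' if ok else '-> filtered out')
```

Note that 'AT&T' is kept, because the & there sits between word characters and \B does not match at a word boundary. The pattern is also case-sensitive as written; passing re.IGNORECASE to re.compile() would be one way to catch 'The' as well, if that matters for your data.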

2 Comments

how could I use \ba\b to avoid false positive for a? Is r'\ba\b' sufficient?
@Xodarap777: that won't work for the & as it is not a word character. For the rest it'd be sufficient. You can make it one regular expression to do the test without any(), in one step: r'\b(and|or|is|a|the)\b'. I'll look into mixing in the & there.
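
Following up on the comment above: one possible way to handle words and & with a single rule is to swap \b and \B for lookarounds, (?<!\w) and (?!\w), which only require that the match is not glued to a word character on either side. This is an alternative to the two-branch pattern, not what the answer itself uses, and build_pattern is a hypothetical helper name.

```python
import re

notstrings = ['and', 'or', '&', 'is', 'a', 'the']

def build_pattern(tokens):
    """Compile one regex that matches any token not adjacent to a word character."""
    # re.escape keeps characters such as & or . literal.
    alternatives = '|'.join(re.escape(t) for t in tokens)
    return re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')

pattern = build_pattern(notstrings)

for value in ['Jane Doe', 'Tom & Jerry', 'Jack and Jill', 'AT&T']:
    print(value, '-> filtered out' if pattern.search(value) else '-> kept')
```

Building the pattern from the list also means the filter list can keep growing without editing the regex by hand.
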
1

You can use sets. In this code, I transform your list into a set, split row[3] into a set of words, and check the intersection between the two sets. If there is no intersection, none of the words in notstrings are in row[3].

Using sets, you make sure that you match only words and not parts of words.

import csv

with open('output.csv', 'wb') as outf:
    with open('input.csv', 'rbU') as inf:
        read = csv.reader(inf)
        outwriter = csv.writer(outf)
        notstrings = set(['and', 'or', '&', 'is', 'a', 'the'])
        for row in read:
            # an empty intersection means no filtered word appears in row[3]
            if not notstrings.intersection(set(row[3].split(' '))):
                outwriter.writerow(row)

3 Comments

Does this method avoid the false positives of any()?
This requires that there are only spaces and words in row[3]; it won't work if punctuation is involved.
@MartijnPieters That's correct. If there is punctuation, the string needs to be split by multiple delimiters or parsed with regex
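
As a sketch of the "parsed with regex" option mentioned in the last comment, the value can be tokenized with re.findall() before the set test. The tokens and is_clean helpers and the token pattern (runs of word characters, or a lone &) are assumptions of mine, not part of the answer.

```python
import re

notstrings = {'and', 'or', '&', 'is', 'a', 'the'}

def tokens(text):
    """Split text into a set of tokens: runs of word characters, or a lone &."""
    return set(re.findall(r'\w+|&', text))

def is_clean(text):
    """Return True if no token of text is in notstrings."""
    return notstrings.isdisjoint(tokens(text))

for value in ['Smith, John', 'Jack (and) Jill', 'Tom & Jerry', 'AT&T']:
    print(value, '-> kept' if is_clean(value) else '-> filtered out')
```

Punctuation no longer hides a word, so 'Jack (and) Jill' is filtered out where split(' ') would have produced '(and)' and missed it. One difference to be aware of: this tokenizer treats every & as its own token, so 'AT&T' is filtered out too.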
