No item in list in string Python

Question

I have a list of things I want to filter out of a csv, and I'm trying to figure out a pythonic way to do it. For example, this is what I'm doing:

with open('output.csv', 'wb') as outf:
    with open('input.csv', 'rbU') as inf:
         read = csv.reader(inf)
         outwriter = csv.writer(outf)
         notstrings = ['and', 'or', '&', 'is', 'a', 'the']
         for row in read:
             (if none of notstrings in row[3])
                 outwriter(row)

I don't know what to put in the parentheses (or if there's a better overall way to go about this).

You mean you want to exclude a row if column 4 contains any of those words?
– Martijn Pieters, Commented Apr 6, 2015 at 17:19
What kind of values are there in row[3]? Is it a sentence? Is there punctuation? Should only whole words be matched?
– Martijn Pieters, Commented Apr 6, 2015 at 17:22
No, just column 3. Also, row[3] is supposed to be a name, but I'm slowly creating a list of filters to avoid non-names (better too zealous than not zealous enough). However, I'm using this more to learn the best methods than to solve this one application.
– Xodarap777, Commented Apr 6, 2015 at 17:33

Martijn Pieters · Accepted Answer · 2015-04-06 17:38:25Z

2

You can use the any() function to test each of the words in your list against a column:

if not any(w in row[3] for w in notstrings):
    # none of the strings are found, write the row
    outwriter.writerow(row)

This will be true if none of those strings appear in row[3]. It matches substrings, however, so a value such as 'false-positive' would be rejected because it contains 'a'.
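The substring behaviour is easy to check on a few sample values. This is only a minimal sketch: the helper name contains_any and the sample names are made up for illustration and are not part of the question's code.

```python
notstrings = ['and', 'or', '&', 'is', 'a', 'the']

def contains_any(text, needles):
    """Return True if any needle occurs anywhere in text (substring match)."""
    return any(w in text for w in needles)

# 'Bob' contains none of the strings, so its row would be kept.
print(contains_any('Bob', notstrings))
# 'Sarah' is rejected because it contains the letter 'a'.
print(contains_any('Sarah', notstrings))
# 'Victor' is rejected because 'or' is a substring of it.
print(contains_any('Victor', notstrings))
```

Short entries such as 'a' and 'or' are what make the plain substring test reject so many ordinary names.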

Put into context:

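import csv

# Python 2 file modes; on Python 3 use open('output.csv', 'w', newline='')
# and open('input.csv', newline='') instead.
with open('output.csv', 'wb') as outf:
    with open('input.csv', 'rbU') as inf:
        read = csv.reader(inf)
        outwriter = csv.writer(outf)
        notstrings = ['and', 'or', '&', 'is', 'a', 'the']
        for row in read:
            # keep the row only if none of the strings occur in row[3]
            if not any(w in row[3] for w in notstrings):
                outwriter.writerow(row)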
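The file modes 'wb' and 'rbU' are Python 2 idioms. On Python 3 the csv module expects text-mode files opened with newline=''. The following is a sketch of the same loop for Python 3, assuming that is the target. The function name filter_rows and the in-memory sample data are hypothetical and stand in for the real input.csv and output.csv.

```python
import csv
import io

notstrings = ['and', 'or', '&', 'is', 'a', 'the']

def filter_rows(inf, outf, column=3):
    """Copy rows from inf to outf, skipping rows whose column contains a listed string."""
    reader = csv.reader(inf)
    writer = csv.writer(outf)
    for row in reader:
        if not any(w in row[column] for w in notstrings):
            writer.writerow(row)

# In-memory stand-ins for the files, so the sketch runs without touching disk.
inf = io.StringIO("1,x,y,Bob\n2,x,y,Sarah\n3,x,y,Tom\n")
outf = io.StringIO()
filter_rows(inf, outf)
print(outf.getvalue())
```

With real files the calls would be open('input.csv', newline='') for reading and open('output.csv', 'w', newline='') for writing.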

If you need to honour word boundaries then a regular expression is going to be a better idea here:

import re

notstrings = re.compile(r'(?:\b(?:and|or|is|a|the)\b)|(?:\B&\B)')
if not notstrings.search(row[3]):
    # none of the words are found, write the row
    outwriter.writerow(row)

I created a Regex101 demo for the expression to demonstrate how it works. It has two branches:

\b(?:and|or|is|a|the)\b - matches any of the words in the list provided they are at the start, end, or between non-word characters (punctuation, whitespace, etc.)
\B&\B - matches the & character if at the start, end, or between non-word characters. You can't use \b here as & is itself not a word character.
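The two branches can be tried on a handful of strings. This is a small sketch using the same pattern; the helper name is_filtered and the sample values are invented for illustration.

```python
import re

# Same pattern as in the answer above.
notstrings = re.compile(r'(?:\b(?:and|or|is|a|the)\b)|(?:\B&\B)')

def is_filtered(text):
    """Return True if text contains one of the words, or a free-standing &."""
    return notstrings.search(text) is not None

samples = ['Sarah', 'Tom and Jerry', 'Smith & Sons', 'AT&T', 'The Who']
for s in samples:
    print(s, is_filtered(s))
```

'Sarah' and 'AT&T' are not filtered: the 'a' in 'Sarah' sits between word characters, and the & in 'AT&T' touches word characters on both sides. 'The Who' is not filtered either, because the pattern is case-sensitive; passing re.IGNORECASE to re.compile() would change that, if case-insensitive matching is wanted.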

edited Apr 6, 2015 at 17:38

answered Apr 6, 2015 at 17:19

Martijn Pieters


2 Comments

Xodarap777 Over a year ago

how could I use \ba\b to avoid false positive for a? Is r'\ba\b' sufficient?

Martijn Pieters Over a year ago

@Xodarap777: that won't work for the & as it is not a word character. For the rest it'd be sufficient. You can make it one regular expression to do the test without any(), in one step: r'\b(and|or|is|a|the)\b'. I'll look into mixing in the & there.

Julien Spronck · Accepted Answer · 2015-04-06 17:28:25Z

1

You can use sets. In this code, I transform your list into a set. I transform your row[3] into a set of words and I check the intersection between the two sets. If there is no intersection, that means none of the words in notstrings are in row[3].

Using sets, you make sure that you match only words and not parts of words.

import csv

with open('output.csv', 'wb') as outf:
    with open('input.csv', 'rbU') as inf:
        read = csv.reader(inf)
        outwriter = csv.writer(outf)
        notstrings = set(['and', 'or', '&', 'is', 'a', 'the'])
        for row in read:
            # split on single spaces and compare whole words only
            if not notstrings.intersection(set(row[3].split(' '))):
                outwriter.writerow(row)

edited Apr 6, 2015 at 17:28

answered Apr 6, 2015 at 17:22

Julien Spronck


3 Comments

Xodarap777 Over a year ago

Does this method avoid the false positives of any()?

Martijn Pieters Over a year ago

This requires that there are only spaces and words in row[3]; it won't work if punctuation is involved.

Julien Spronck Over a year ago

@MartijnPieters That's correct. If there is punctuation, the string needs to be split by multiple delimiters or parsed with regex
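One way to make the set approach tolerate punctuation is to extract tokens with a regular expression instead of splitting on spaces. This is a sketch only: the helper names tokens and is_filtered, the token pattern [\w&]+, and the sample strings are assumptions for illustration, not something either answer specifies.

```python
import re

notstrings = {'and', 'or', '&', 'is', 'a', 'the'}

def tokens(text):
    """Return the set of runs of word characters and/or '&' found in text."""
    return set(re.findall(r'[\w&]+', text))

def is_filtered(text):
    """Return True if any token of text is in notstrings."""
    return not notstrings.isdisjoint(tokens(text))

# Splitting on spaces leaves '(the' and 'younger)' as tokens, so 'the' is missed.
print(set('Jones (the younger)'.split(' ')).isdisjoint(notstrings))
# The regex tokenizer drops the parentheses and finds 'the'.
print(is_filtered('Jones (the younger)'))
print(is_filtered('Smith & Sons'))
print(is_filtered('AT&T'))
```

Keeping & inside the token pattern means 'AT&T' stays a single token and is not filtered, while a free-standing & still is.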
