Classifying List Entries in Python

The first part worked well when badWords consisted of, say, two entries. When I tried using it with more entries, however, I wasn't able to finish compiling it.

@Alpine: what traceback does it show if you try to interrupt it with <kbd>Ctrl + C</kbd>? Don't put it in the comments, update your question or ask a new one.

Where should I be placing that line?

@Alpine: Ctrl + C is not a line, it is a keyboard shortcut that you could use to interrupt a script running at the command-line.

I tried using <kbd>Ctrl + C</kbd>, although I received a Syntax Error using it.


lvc · Accepted Answer · 2014-03-26 23:25:07Z

2

This

if badWord in txtEntry:

tests whether badWord equals any substring of txtEntry. Since badWord is a list, it doesn't and can't; what you need to do instead is check each string in badWord separately. The easiest way to do this is with the built-in function any. You do need to normalise txtEntry, though, because (as mentioned in the comments) you care about matching exact words, not just substrings (which is what string in string tests for), and you (probably) want the search to be case-insensitive:

import re

for txtEntry in txtList:
    # Split into lower-cased whole words so that `word in contents`
    # doesn't give false positives for substrings, e.g. 'ass' in 'class'.
    contents = [w.lower() for w in re.split(r'\W+', txtEntry)]

    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

Note that, like other answers, I've used the list.append method instead of += to add the string to the list. If you use +=, your list would end up looking like this: ['g', 'o', 'o', 'd', 'b', 'a', 'd'] instead of ['good', 'bad'].
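
To see the difference concretely, here is a small sketch. The two list names are made up for illustration and are not from the question:

```python
# Hypothetical lists, used only to compare the two operations.
with_append = []
with_append.append('good')
with_append.append('bad')

with_plus_equals = []
with_plus_equals += 'good'  # a str is iterable, so += extends the list character by character
with_plus_equals += 'bad'

print(with_append)       # ['good', 'bad']
print(with_plus_equals)  # ['g', 'o', 'o', 'd', 'b', 'a', 'd']
```

Writing `with_plus_equals += ['good']` instead would add the whole string as a single item, the same as append.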

Per the comments on the question, if you want this to check the file's content when you're only storing its name, you need to adjust this slightly - you need a call to open, and you need to then test against the contents - but the test and the normalisation stay the same:

import re

for txtEntry in txtList:
    with open(txtEntry) as f:
        # Split into lower-cased whole words so that `word in contents`
        # doesn't give false positives for substrings, e.g. 'ass' in 'class'.
        contents = [w.lower() for w in re.split(r'\W+', f.read())]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

These loops both assume that, as in your sample data, all of the strings in badWord are in lower case.
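
If that assumption might not hold, one option is to lower-case the bad words once before the loop. The following is only a sketch under that idea: the function name `classify`, the variable `bad_words_lower`, and the sample strings are all invented here, and it uses set intersection in place of any.

```python
import re

def classify(entries, bad_words):
    """Return 'bad' or 'good' for each entry, matching whole words case-insensitively."""
    # Lower-case the bad words once; a set makes the overlap check cheap.
    bad_words_lower = {word.lower() for word in bad_words}
    labels = []
    for entry in entries:
        # Split on runs of non-word characters so punctuation doesn't stick to words.
        words = {w.lower() for w in re.split(r'\W+', entry)}
        # A non-empty intersection means at least one bad word appears as a whole word.
        labels.append('bad' if words & bad_words_lower else 'good')
    return labels

# Made-up sample data.
print(classify(['A Darn shame', 'a fine class'], ['DARN', 'ass']))  # ['bad', 'good']
```

The second entry is classified as good because 'ass' only occurs inside 'class', not as a word of its own.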

edited Mar 26, 2014 at 23:25

answered Mar 26, 2014 at 8:28


20 Comments

word in contents matches all substrings, e.g., it finds ass in class, i.e., it mistakenly classifies class as a bad word

.split() won't catch ass, (note: comma)

I'm getting an IOError at with open(txtEntry) as f:; IOError: [Errno 2] No such file or directory: 'Text from one of the text files here.'

@J.F.Sebastian ... right. Edge cases are fun. Updated again.

@Alpine the second version of the code assumes each txtEntry is a filename (while the first assumes each one is a word). If you've put the file contents into your list directly, but not split it into words, you can remove the with ...: line completely and replace f.read() with txtEntry (and unindent that line).


bingorabbit · Accepted Answer · 2014-03-26 08:59:11Z

0

You should be looping over badWord items too, and for each item you should check if it exists in txtEntry.

for txtEntry in txtList:
    if any(word in txtEntry for word in badWord):
        # append() adds the whole string as one item; += would add every
        # letter of "bad" as a separate item (or use myClassifier += ['bad']).
        myClassifier.append("bad")
    else:
        myClassifier.append("good")

Thanks to @lvc's comment.

edited Mar 26, 2014 at 8:59

answered Mar 26, 2014 at 7:05


3 Comments

This doesn't meet the OP's spec. You need to append "bad" or "good" once for every text entry, so that len(myClassifier) == len(txtList). This code will give len(myClassifier) == len(txtList)*len(badWord).
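
The comment above presumably refers to an earlier version of the answer that looped over badWord inside the loop over txtList. A sketch with made-up data (the names `nested` and `per_entry` are hypothetical) shows how the two approaches differ in the number of labels produced:

```python
# Made-up sample data.
txtList = ['a fine day', 'a darn shame', 'nothing here']
badWord = ['darn', 'heck']

# Nested loops append once per (entry, word) pair.
nested = []
for txtEntry in txtList:
    for word in badWord:
        nested.append('bad' if word in txtEntry else 'good')

# any() collapses the inner loop into a single decision per entry.
per_entry = ['bad' if any(word in txtEntry for word in badWord) else 'good'
             for txtEntry in txtList]

print(len(nested))  # 6, i.e. len(txtList) * len(badWord)
print(per_entry)    # ['good', 'bad', 'good']
```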