0

In Python I am trying to create a list (myClassifier) that appends a classification ('bad'/'good') for each text file (txtEntry) stored in a list (txtList), based on whether or not it contains a bad word stored in a list of bad words (badWord).

txtList = ['mywords.txt', 'apple.txt', 'banana.txt', ... , 'something.txt']
badWord = ['pie', 'vegetable', 'fatigue', ... , 'something']

txtEntry is merely a placeholder; really, I just want to iterate through every entry in txtList.

I've produced the following code in response:

for txtEntry in txtList:
    if badWord in txtEntry:
        myClassifier += 'bad'
    else:
        myClassifier += 'good'

However, I'm receiving TypeError: 'in <string>' requires string as left operand, not list as a result.

I'm guessing that badWord needs to be a string as opposed to a list, though I'm not sure how I can get this to work otherwise.

How could I otherwise accomplish this?

6
  • Can you please post your input data sample? Commented Mar 26, 2014 at 7:07
  • OK, what are the types of badWord and txtEntry? From the error, I am assuming badWord is a list and txtEntry is a string. Commented Mar 26, 2014 at 7:19
  • So if txtEntry is a string and badWord is a list, then you probably need to swap the operands in the if statement, i.e.: if txtEntry in badWord: Commented Mar 26, 2014 at 7:34
  • To clarify: do you want to find bad words in a file name or in its content (open('file name').read())? Commented Mar 26, 2014 at 8:39
  • @J.F. Sebastian I'd like to find bad words in a file's content. Commented Mar 26, 2014 at 8:47

4 Answers

2

To find which files have bad words in them, you could:

import re
from pprint import pprint

filenames = ['mywords.txt', 'apple.txt', 'banana.txt', 'something.txt']
bad_words = ['pie', 'vegetable', 'fatigue', 'something']

classified_files = {}  # filename -> good/bad
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search
for filename in filenames:
    with open(filename) as file:
        for line in file:
            if has_bad_words(line):
                classified_files[filename] = 'bad'
                break  # go to the next file
        else:  # no bad words
            classified_files[filename] = 'good'

pprint(classified_files)
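
To see what the pattern does and does not match without creating any files, here is a minimal sketch that applies the same regex to in-memory strings. The helper name classify_text and the sample sentences are assumptions made up for illustration; they are not part of the question.

```python
import re

bad_words = ['pie', 'vegetable', 'fatigue']  # sample words from the question

# Same construction as above: escaped words joined into one alternation,
# wrapped in \b anchors so that only whole words match, ignoring case.
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)), re.I)


def classify_text(text):
    """Hypothetical helper: 'bad' if text has a bad word as a whole word."""
    return 'bad' if pattern.search(text) else 'good'


# Hypothetical strings standing in for file contents.
print(classify_text('I had PIE for lunch.'))   # whole word, different case
print(classify_text('The magpie flew away.'))  # 'pie' appears only inside a word
```

The first string is classified as 'bad' and the second as 'good', because \b requires a word boundary on both sides of the match.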

If you also want to mark as 'bad' the different inflected forms of a word, e.g., if cactus is in bad_words and you also want to catch cacti (its plural), then you might need stemmers or, more generally, lemmatizers, e.g.:

from nltk.stem.porter import PorterStemmer # $ pip install nltk

stemmer = PorterStemmer()
print(stemmer.stem("pies")) 
# -> pie

Or

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cacti'))
# -> cactus

Note: you might need import nltk; nltk.download() to download wordnet data.

It might be simpler just to add all the possible forms, such as pies and cacti, to the bad_words list directly.
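
A rough sketch of that simpler route, assuming you keep a hand-written mapping from each bad word to its extra forms. The name extra_forms and its contents are hypothetical; you would have to maintain such a mapping yourself.

```python
import re

bad_words = ['pie', 'cactus']  # hypothetical base list

# Hand-written inflected forms (an assumption for illustration only).
extra_forms = {
    'pie': ['pies'],
    'cactus': ['cacti', 'cactuses'],
}

# Merge the base words and all their listed forms into one set.
expanded = set(bad_words)
for forms in extra_forms.values():
    expanded.update(forms)

# Build the same whole-word, case-insensitive pattern from the expanded set.
pattern = re.compile(
    r'\b(?:%s)\b' % '|'.join(map(re.escape, sorted(expanded))), re.I)

print(bool(pattern.search('Two cacti on the shelf')))
print(bool(pattern.search('Three pies, please')))
```

This avoids the nltk dependency, at the cost of listing every form by hand.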


8 Comments

The first part worked well for when badWords consisted of say, two entries. When I tried using it with more entries however, I wasn't able to finish compiling it.
@Alpine: what traceback does it show if you try to interrupt it with Ctrl + C? Don't put it in the comments, update your question or ask a new one.
Where should I be placing that line?
@Alpine: Ctrl + C is not a line, it is a keyboard shortcut that you could use to interrupt a script running at the command-line.
I tried using Ctrl + C, although I received a Syntax Error using it.
2

This

if badWord in txtEntry:

tests whether badWord equals any substring of txtEntry. Since badWord is a list rather than a string, it doesn't and can't, which is why Python raises the TypeError. What you need to do instead is check each string in badWord separately. The easiest way to do this is with the built-in function any. You do need to normalise txtEntry, though, because (as mentioned in the comments) you care about matching exact words, not just substrings (which is what string in string tests for), and you (probably) want the search to be case-insensitive:

import re

myClassifier = []
for txtEntry in txtList:
    # Split into lower-cased words so that `word in contents` doesn't give
    # false positives for substrings, e.g. 'ass' in 'class'
    contents = [w.lower() for w in re.split(r'\W+', txtEntry)]

    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

Note that, like other answers, I've used the list.append method instead of += to add the string to the list. If you use +=, your list would end up looking like this: ['g', 'o', 'o', 'd', 'b', 'a', 'd'] instead of ['good', 'bad'].
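
A short demonstration of that difference, using made-up list names purely for illustration:

```python
# += on a list extends it with each element of the right-hand iterable;
# a string is an iterable of characters, so every letter becomes an item.
with_plus_equals = []
with_plus_equals += 'good'
with_plus_equals += 'bad'

# append() adds its argument as a single item.
with_append = []
with_append.append('good')
with_append.append('bad')

# += with a one-item list also adds exactly one item.
with_list_literal = []
with_list_literal += ['good']

print(with_plus_equals)   # ['g', 'o', 'o', 'd', 'b', 'a', 'd']
print(with_append)        # ['good', 'bad']
print(with_list_literal)  # ['good']
```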

Per the comments on the question, if you want this to check the file's content when you're only storing its name, you need to adjust this slightly - you need a call to open, and you need to then test against the contents - but the test and the normalisation stay the same:

import re

myClassifier = []
for txtEntry in txtList:
    with open(txtEntry) as f:
        # Split into lower-cased words so that `word in contents` doesn't give
        # false positives for substrings, e.g. 'ass' in 'class'
        contents = [w.lower() for w in re.split(r'\W+', f.read())]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

These loops both assume that, as in your sample data, all of the strings in badWord are in lower case.
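
If you can't rely on that assumption, one option is to lower-case badWord once up front. The sketch below does that and also uses set intersection in place of any; the helper name classify, the mixed-case list, and the sample texts are assumptions for illustration, and the texts are plain strings rather than files.

```python
import re

badWord = ['Pie', 'VEGETABLE', 'fatigue']  # hypothetical mixed-case list

# Normalise the bad words once, so the comparison is case-insensitive.
bad_set = {word.lower() for word in badWord}


def classify(text):
    """Hypothetical helper: 'bad' if any whole word of text is a bad word."""
    # Split on runs of non-word characters and drop empty strings.
    words = {w.lower() for w in re.split(r'\W+', text) if w}
    return 'bad' if bad_set & words else 'good'


# Hypothetical sample texts standing in for file contents.
samples = ['Apple pie, please.', 'The class is over.']
myClassifier = [classify(text) for text in samples]
print(myClassifier)  # ['bad', 'good']
```

Using sets makes each membership check cheap, which may help when badWord is long.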

20 Comments

word in contents matches all substrings e.g., it finds ass in class i.e., it mistakenly classifies class as a bad word
.split() won't catch ass, (note: comma)
I'm getting an IOError at with open(txtEntry) as f:; IOError: [Errno 2] No such file or directory: 'Text from one of the text files here.'
@J.F.Sebastian ... right. Edge cases are fun. Updated again.
@Alpine the second version of the code assumes each txtEntry is a filename (while the first assumes each one is a word). If you've put the file contents into your list directly, but not split it into words, you can remove the with ...: line completely, and replace f.read() with txtEntry (and unindent that line).
0

You should be looping over badWord items too, and for each item you should check if it exists in txtEntry.

for txtEntry in txtList:
    if any(word in txtEntry for word in badWord):
        # append() adds the whole string as one item; += 'bad' would add each
        # letter separately (alternatively, use myClassifier += ['bad'])
        myClassifier.append("bad")
    else:
        myClassifier.append("good")

Thanks to @lvc's comment.

3 Comments

This doesn't meet the OP's spec. Need to append "bad" or "good" once for every text entry - so, len(myClassifier) == len(txtList). This code will give len(myClassifier) == len(txtList)*len(badWord).
Still won't work. It's now equivalent to if badWord[0] in txtEntry (except it's a no-op rather than an error when badWord is empty). If the second or third badWord is in txtEntry but the first isn't, this will append "good".
@lvc Yup, that's another catch, used any() .
-2

try this code:

    myClassifier.append('bad') 

2 Comments

@Lafexlos what makes you think that lst += is a valid operation in python?
@ManojAwasthi += is a valid operator, but it won't give him the right output, as it will add each letter in the word as a list item.
