6

I have a list of strings like this:

list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]

How do I iterate through the list and group partially matching strings when the keywords are not given in advance? The result should look like this:

list 1 = [["I love cat","I love dog","I love fish"],["I hate banana","I hate apple","I hate orange"]]

Thank you so much.

3
  • What have you already tried? Some starter code so others know what you've already attempted and where you've gotten stuck is helpful in framing answers. Commented Oct 21, 2016 at 3:00
  • 1
    itertools groupby will be helpful for this. Commented Oct 21, 2016 at 3:02
  • 1
    how do you define a partial match? Commented Oct 21, 2016 at 3:08

4 Answers

6

difflib's SequenceMatcher can do this for you. Tune the similarity threshold (0.5 in the code below) for better results.

Try this:

from difflib import SequenceMatcher

sentence_list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]
threshold = 0.5  # minimum similarity ratio to join an existing group

result = []
for sentence in sentence_list:
    # Compare against the first sentence of each existing group.
    for group in result:
        if SequenceMatcher(None, sentence, group[0]).ratio() >= threshold:
            group.append(sentence)
            break  # stop at the first matching group
    else:
        # No group was similar enough, so start a new one.
        result.append([sentence])

print(result)

Output:

[['I love cat', 'I love dog', 'I love fish'], ['I hate banana', 'I hate apple', 'I hate orange']]
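
To see how the threshold changes the grouping, here is a minimal sketch that wraps the same greedy idea in a function. The name group_similar and the threshold values tried are my own choices for illustration, not something difflib provides:

```python
from difflib import SequenceMatcher


def group_similar(sentences, threshold=0.5):
    """Greedy grouping: put each sentence into the first group whose first
    member is at least `threshold` similar, otherwise start a new group."""
    groups = []
    for sentence in sentences:
        for group in groups:
            if SequenceMatcher(None, sentence, group[0]).ratio() >= threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups


sentences = ["I love cat", "I love dog", "I love fish",
             "I hate banana", "I hate apple", "I hate orange"]

# Try a loose, a medium and a strict threshold and compare the group counts.
for threshold in (0.3, 0.5, 0.8):
    groups = group_similar(sentences, threshold)
    print(threshold, len(groups), groups)
```

A loose threshold merges everything and a strict one splits almost everything apart, so it is worth printing the groups for a few values on your own data before settling on one.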

3

You can try this approach. Although it is not the best approach, it is helpful for understanding the problem in a more methodical way.

from itertools import groupby

my_list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]

# Split each sentence into words and sort, so that sentences with the
# same leading words end up next to each other (groupby needs this).
each_word = sorted(x.split() for x in my_list)

# I assumed the keywords would be everything except the last word
grouped = [list(value) for key, value in groupby(each_word, lambda x: x[:-1])]

# Join the words of each sentence back into a string.
result = [[" ".join(words) for words in group] for group in grouped]

print(result)

Output:

[['I hate apple', 'I hate banana', 'I hate orange'], ['I love cat', 'I love dog', 'I love fish']]

3 Comments

You should probably ensure the iterable is sorted before using itertools.groupby().
Yeah that's true @wwii. Thanks for the suggestion, I will fix that. I also realised that half the code is not necessary, and it can be improved.
Also, what do you consider as a partial match?
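
To illustrate the point about sorting: itertools.groupby only merges items that are adjacent, so unsorted input can split one logical group into several. A small sketch (the three sample sentences and the variable names are just for illustration):

```python
from itertools import groupby

sentences = ["I love cat", "I hate apple", "I love dog"]


def key(sentence):
    # Group by everything except the last word.
    return sentence.split()[:-1]


# Without sorting, the two "I love" sentences are not adjacent,
# so they end up in separate groups.
unsorted_groups = [list(g) for _, g in groupby(sentences, key)]

# Sorting by the same key first makes equal keys adjacent.
sorted_groups = [list(g) for _, g in groupby(sorted(sentences, key=key), key)]

print(unsorted_groups)
print(sorted_groups)
```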
3

Try building an inverted index, and then you can pick whichever keywords you like. This approach ignores word order:

index = {}
for sentence in sentence_list:
    for word in set(sentence.split()):
        index.setdefault(word, set()).add(sentence)

Or this approach, which keys the index by all possible full-word phrase prefixes:

index = {}
for sentence in sentence_list:
    number_of_words = len(sentence.split())
    for i in range(1, number_of_words):
        key_phrase = sentence.rsplit(maxsplit=i)[0]
        index.setdefault(key_phrase, set()).add(sentence)

And then if you want to find all of the sentences that contain a keyword (or start with a phrase, if that's your index):

match_sentences = index[key_term]

Or a given set of keywords:

from functools import reduce
matching_sentences = reduce(lambda x, y: x & index[y], list_of_keywords[1:], index[list_of_keywords[0]])
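
Putting the word index and the multi-keyword lookup together, a self-contained sketch might look like this. The sample data comes from the question; the query in list_of_keywords is an assumed example, and index.get is used so that an unknown keyword gives an empty result instead of a KeyError:

```python
from functools import reduce

sentence_list = ["I love cat", "I love dog", "I love fish",
                 "I hate banana", "I hate apple", "I hate orange"]

# Inverted index: word -> set of sentences containing that word.
index = {}
for sentence in sentence_list:
    for word in set(sentence.split()):
        index.setdefault(word, set()).add(sentence)

# Assumed example query: sentences that contain every one of these words.
list_of_keywords = ["I", "hate"]
matching_sentences = reduce(
    lambda found, word: found & index.get(word, set()),
    list_of_keywords[1:],
    index.get(list_of_keywords[0], set()),
)

print(sorted(matching_sentences))
```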

Now you can generate a list grouped by pretty much any combination of terms or phrases by writing a list comprehension over those indices. E.g., if you built the phrase prefix index and want everything grouped by the first two-word phrase:

grouped = [list(index[k]) for k in index if len(k.split()) == 2]
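
For reference, here is a runnable sketch of the phrase prefix variant from start to finish. It builds the keys by joining the leading words rather than using rsplit, which is my own simplification, and two_word_groups is a name chosen for illustration:

```python
sentence_list = ["I love cat", "I love dog", "I love fish",
                 "I hate banana", "I hate apple", "I hate orange"]

# Map every proper word prefix ("I", "I love", ...) to the set of
# sentences that start with it.
index = {}
for sentence in sentence_list:
    words = sentence.split()
    for n in range(1, len(words)):
        index.setdefault(" ".join(words[:n]), set()).add(sentence)

# Keep only the two-word prefixes; sort each set for a stable order.
two_word_groups = {
    prefix: sorted(matches)
    for prefix, matches in index.items()
    if len(prefix.split()) == 2
}

print(two_word_groups)
```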

0

Avoid names like list for your variables, since they shadow Python built-ins. Also, list 1 is not a valid Python variable name.

Try this:

from itertools import groupby

# Assuming you group by the first two words in each string, e.g. 'I love', 'I hate'.
L = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]

# groupby only merges adjacent items, so sort first.
L = sorted(L)

result = []
for key, group in groupby(L, lambda x: x.split(' ')[:2]):
    result.append(list(group))

print(result)
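
If you would rather keep the original order of the list (so the "I love" group comes first, as in the question), one alternative is to collect the groups in a dictionary instead of sorting. This is only a sketch under the same "first two words" assumption; group_by_prefix and n_words are hypothetical names:

```python
from collections import defaultdict


def group_by_prefix(sentences, n_words=2):
    """Group sentences by their first `n_words` words, keeping input order."""
    groups = defaultdict(list)
    for sentence in sentences:
        # Tuples are hashable, so they can serve as dictionary keys.
        groups[tuple(sentence.split()[:n_words])].append(sentence)
    return list(groups.values())


L = ["I love cat", "I love dog", "I love fish",
     "I hate banana", "I hate apple", "I hate orange"]

result = group_by_prefix(L)
print(result)
```

Because dictionaries preserve insertion order, the groups come out in the order their prefixes first appear, and no sorting step is needed.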

1 Comment

sorted returns a value but you don't assign it to anything. Maybe use list.sort() instead for in-place sorting.
