Python - Iterate through a list of strings and group partial matching strings

Question

So I have a list of strings as below:

list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]

How do I iterate through the list and group partially matching strings when the keywords are not given in advance? The result should look like this:

list 1 = [["I love cat","I love dog","I love fish"],["I hate banana","I hate apple","I hate orange"]]

Thank you so much.

What have you already tried? Some starter code helps others see what you've attempted and where you got stuck, which makes it easier to frame answers.
– TheF1rstPancake, Commented Oct 21, 2016 at 3:00

Nishanth Duvva · Accepted Answer · 2017-08-24 09:45:45Z

6

SequenceMatcher from difflib will do the task for you. Tune the ratio threshold (0.5 in the code below) for better results.

Try this:

from difflib import SequenceMatcher

sentence_list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]
result = []
for sentence in sentence_list:
    for group in result:
        # Compare against the first sentence of each existing group
        score = SequenceMatcher(None, sentence, group[0]).ratio()
        if score >= 0.5:
            # Skip exact duplicates of the group's first sentence
            if sentence != group[0]:
                group.append(sentence)
            break
    else:
        # No existing group was similar enough, so start a new one
        result.append([sentence])

print(result)

Output:

[['I love cat', 'I love dog', 'I love fish'], ['I hate banana', 'I hate apple', 'I hate orange']]
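When tuning the threshold, it may help to look at the raw ratios first. The following is a minimal sketch; the helper name similarity is made up for illustration and is not part of difflib:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() is 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings
    return SequenceMatcher(None, a, b).ratio()

reference = "I love cat"
for other in ["I love dog", "I love fish", "I hate banana", "I hate apple"]:
    print(f"{other!r}: {similarity(reference, other):.2f}")
```

For this data the "I love ..." sentences score above 0.5 against the reference and the "I hate ..." sentences score below it, which is why 0.5 separates the two groups. Other data will likely need a different cutoff.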

answered Aug 24, 2017 at 9:45

Nishanth Duvva


RoadRunner · Accepted Answer · 2016-10-21 04:48:13Z

3

You can try this approach. Although it is not the best approach, it is helpful for understanding the problem in a more methodical way.

from itertools import groupby

my_list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]

# Split each sentence into words and sort, since groupby only merges adjacent items
each_word = sorted(x.split() for x in my_list)

# I assumed the keywords would be everything except the last word
grouped = [list(value) for key, value in groupby(each_word, key=lambda x: x[:-1])]

# Join the words of each sentence back into a single string
result = [[" ".join(words) for words in group] for group in grouped]

print(result)

Output:

[['I hate apple', 'I hate banana', 'I hate orange'], ['I love cat', 'I love dog', 'I love fish']]
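If the original order matters, a dictionary can collect the groups without sorting. This is a different technique from the groupby code above, sketched under the same assumption that the key is everything except the last word; the function name group_by_prefix is hypothetical:

```python
from collections import defaultdict

def group_by_prefix(sentences):
    # Key each sentence by all of its words except the last one
    groups = defaultdict(list)
    for sentence in sentences:
        key = tuple(sentence.split()[:-1])
        groups[key].append(sentence)
    # Dicts keep insertion order (Python 3.7+), so groups come out
    # in the order their key was first seen
    return list(groups.values())

my_list = ["I love cat", "I love dog", "I love fish",
           "I hate banana", "I hate apple", "I hate orange"]
print(group_by_prefix(my_list))
# [['I love cat', 'I love dog', 'I love fish'], ['I hate banana', 'I hate apple', 'I hate orange']]
```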

edited Oct 21, 2016 at 4:48

answered Oct 21, 2016 at 4:04

RoadRunner


3 Comments

wwii Over a year ago

You should probably ensure the iterable is sorted before using itertools.groupby().

RoadRunner Over a year ago

Yeah that's true @wwii. Thanks for the suggestion, I will fix that. I also realised that half the code is not necessary, and it can be improved.

RoadRunner Over a year ago

Also, what do you consider as a partial match?

Tore Eschliman · Accepted Answer · 2016-10-21 20:16:57Z

Try building an inverted index, and then you can pick whichever keywords you like. This approach ignores word order:

index = {}
for sentence in sentence_list:
    for word in set(sentence.split()):
        index.setdefault(word, set()).add(sentence)

Or this approach, which keys the index by all possible full-word phrase prefixes:

index = {}
for sentence in sentence_list:
    number_of_words = len(sentence.split())
    for i in range(1, number_of_words):
        # Drop the last i words to get a full-word prefix
        key_phrase = sentence.rsplit(maxsplit=i)[0]
        index.setdefault(key_phrase, set()).add(sentence)

And then if you want to find all of the sentences that contain a keyword (or start with a phrase, if that's your index):

match_sentences = index[key_term]

Or a given set of keywords:

from functools import reduce
matching_sentences = reduce(lambda x, y: x & index[y], list_of_keywords[1:], index[list_of_keywords[0]])
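Put together, a keyword lookup over the word index might look like the sketch below. The function name sentences_with_all is made up for illustration, and treating unknown keywords as matching nothing is an assumption rather than something stated above:

```python
from functools import reduce

sentence_list = ["I love cat", "I love dog", "I love fish",
                 "I hate banana", "I hate apple", "I hate orange"]

# Inverted index: word -> set of sentences containing that word
index = {}
for sentence in sentence_list:
    for word in set(sentence.split()):
        index.setdefault(word, set()).add(sentence)

def sentences_with_all(keywords):
    # Intersect the sentence sets of every keyword;
    # a keyword missing from the index contributes an empty set
    sets = [index.get(word, set()) for word in keywords]
    return reduce(lambda x, y: x & y, sets) if sets else set()

print(sorted(sentences_with_all(["I", "hate"])))
# ['I hate apple', 'I hate banana', 'I hate orange']
```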

Now you can generate a list grouped by pretty much any combination of terms or phrases by writing a list comprehension over the index. E.g., if you built the phrase prefix index and want everything grouped by the first two-word phrase:

grouped = [list(index[k]) for k in index if len(k.split()) == 2]
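As a runnable sketch of the prefix-index grouping on the question's data: this version builds each prefix by joining words instead of using rsplit, which should give the same keys as long as words are separated by single spaces. Each group is sorted only because sets are unordered and a stable printout is easier to read:

```python
sentence_list = ["I love cat", "I love dog", "I love fish",
                 "I hate banana", "I hate apple", "I hate orange"]

# Prefix index: each proper full-word prefix -> sentences starting with it
index = {}
for sentence in sentence_list:
    words = sentence.split()
    for i in range(1, len(words)):
        key_phrase = " ".join(words[:i])
        index.setdefault(key_phrase, set()).add(sentence)

# Group by every two-word prefix
groups = [sorted(index[k]) for k in index if len(k.split()) == 2]
print(groups)
# [['I love cat', 'I love dog', 'I love fish'], ['I hate apple', 'I hate banana', 'I hate orange']]
```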

user671150 · Accepted Answer · 2016-10-21 04:47:36Z

0

Avoid names like list for your variables, since they shadow Python built-ins. Also, list 1 is not a valid Python variable name.

Try this:

from itertools import groupby

# Assuming you group by the first two words in each string, e.g. 'I love', 'I hate'.

L = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"]

# groupby only merges adjacent items, so sort first
L = sorted(L)

result = []

for key, group in groupby(L, lambda x: ' '.join(x.split(' ')[:2])):
    result.append(list(group))

print(result)
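The sort matters because itertools.groupby only merges consecutive items with equal keys. A small sketch of the difference, using a shuffled list that is made up for illustration (first_two_words is a hypothetical helper):

```python
from itertools import groupby

def first_two_words(s):
    # Key function: the first two whitespace-separated words
    return " ".join(s.split()[:2])

mixed = ["I love cat", "I hate banana", "I love dog", "I hate apple"]

# Without sorting, each change of key starts a new group
unsorted_groups = [list(g) for _, g in groupby(mixed, key=first_two_words)]
# After sorting, equal keys are adjacent and merge as intended
sorted_groups = [list(g) for _, g in groupby(sorted(mixed), key=first_two_words)]

print(len(unsorted_groups))  # 4
print(sorted_groups)
# [['I hate apple', 'I hate banana'], ['I love cat', 'I love dog']]
```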

edited Oct 21, 2016 at 4:47

answered Oct 21, 2016 at 3:27

user671150

1 Comment

wwii Over a year ago

sorted returns a value but you don't assign it to anything. Maybe use list.sort() instead for in-place sorting.
