Group items by string pattern in python

Question

Suppose this list:

list1=["House of Mine (1293) Item 21",
       "House of Mine (1292) Item 24",
       "The yard (1000) Item 1 ",
       "The yard (1000) Item 2 ",
       "The yard (1000) Item 4 "]

I want to add each item to a group (a list inside a list, in this case) if the substring up to the (XXXX) is the same.

So, in this case, I am expecting to have:

[["House of Mine (1293) Item 21",
  "House of Mine (1292) Item 24"],

 ["The yard (1000) Item 1 ",
  "The yard (1000) Item 2 ",
  "The yard (1000) Item 4 "]]

The following code is what I was able to make, but it's not working:

def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)

    return group

Thanks to another Stack Overflow topic, I've read the page on regular expressions at http://www.diveintopython3.net/regular-expressions.html

I know the answer lies there, but I'm having difficulty understanding some of its concepts.

Jan Vlcinsky · Accepted Answer · 2014-06-19 17:30:37Z

7

Set up the list to group:

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Define a function used to sort and group items (this time using the number in parentheses):

>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'

Sort the list (in place here):

>>> list1.sort(key=keyf)  # sort by the grouping key so equal keys end up adjacent

Take groupby from itertools

>>> from itertools import groupby

Check the concept:

>>> for gr, items in groupby(list1, key = keyf):
...     print "gr", gr
...     print "items", list(items)
...
>>> list1
['The yard (1000) Item 1 ',
 'The yard (1000) Item 2 ',
 'The yard (1000) Item 4 ',
 'House of Mine (1292) Item 24',
 'House of Mine (1293) Item 21']

Note that we had to call list on items, because each items is an iterator over the members of one group.
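To see why the list(items) call matters, here is a minimal sketch (written for Python 3; the sample data and the names first_letter, stored, late and eager are made up for illustration). Every group that groupby yields reads from the same underlying iterator, so a group that is not consumed before groupby moves on comes back empty:

```python
from itertools import groupby

data = ["a1", "a2", "b1"]  # illustrative data, already sorted by first letter


def first_letter(text):
    return text[0]


# Keep the raw group iterators and only read them after groupby has finished.
stored = [(key, group) for key, group in groupby(data, key=first_letter)]
late = [(key, list(group)) for key, group in stored]

# Materialize each group right away, as the answer does with list(items).
eager = [(key, list(group)) for key, group in groupby(data, key=first_letter)]

print(late[0])  # the "a" group was skipped over before it was read
print(eager)
```

The first entry of late has an empty list, while eager holds the full groups.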

Now using list comprehension:

>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 '],
 ['House of Mine (1292) Item 24'],
 ['House of Mine (1293) Item 21']]

and we are done.

If you want to group by all the text before the first "(", the only change is:

>>> keyf = lambda text: text.split("(")[0]

Short version answering OP

>>> from itertools import groupby
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
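The sorted() call is what makes this work: groupby only merges neighbouring items with equal keys. A small sketch of the difference, assuming Python 3 (the names prefix, items, unsorted_groups and sorted_groups are illustrative, not from the answer):

```python
from itertools import groupby


def prefix(text):
    # Same idea as keyf above: everything before the first "(".
    return text.split("(")[0]


items = [
    "The yard (1000) Item 1 ",
    "House of Mine (1293) Item 21",
    "The yard (1000) Item 2 ",
]

# Without sorting, the two "The yard" entries are not adjacent,
# so they land in two separate groups.
unsorted_groups = [list(g) for _, g in groupby(items, key=prefix)]

# After sorting, equal prefixes sit next to each other and are merged.
sorted_groups = [list(g) for _, g in groupby(sorted(items), key=prefix)]

print(len(unsorted_groups), len(sorted_groups))
```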

Variation using `re.findall`

The solution above assumes that "(" is the delimiter and ignores the requirement that four digits follow it. That requirement can be handled with re.

>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '

But it raises IndexError: list index out of range if the text does not have the expected content (we are trying to access the item at index 0 of an empty list).

>>> text = "nothing here"
>>> keyf(text)
IndexError: list index out of range

A simple trick avoids this: append the original text to the result, so that there is always something at index 0:

>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'

Final solution using re

>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
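The regex key matters once a title contains another "(" before the four-digit part. The following is a sketch for Python 3, not part of the original answer: the titles are invented, and split_key, regex_key and YEAR_PREFIX are illustrative names. It uses re.search with an explicit fallback instead of the list-append trick.

```python
import re
from itertools import groupby

# Everything before a parenthesized four-digit number, e.g. "(1000)".
YEAR_PREFIX = re.compile(r".+(?=\(\d{4}\))")


def split_key(text):
    # Stops at the first "(", whatever follows it.
    return text.split("(")[0]


def regex_key(text):
    # Falls back to the whole string when no "(dddd)" is present.
    match = YEAR_PREFIX.search(text)
    return match.group(0) if match else text


titles = [
    "The (unexpected) yard (1000) Item 44",
    "The (unexpected) yard (1000) Item 45",
    "The (other) yard (1000) Item 1",
]

by_split = [list(g) for _, g in groupby(sorted(titles), key=split_key)]
by_regex = [list(g) for _, g in groupby(sorted(titles), key=regex_key)]

print(len(by_split))  # every title shares the key "The "
print(len(by_regex))  # the two different titles are kept apart
```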

edited Jun 19, 2014 at 17:30

answered Jun 19, 2014 at 16:02

Jan Vlcinsky


7 Comments

Adam Smith Over a year ago

OP is actually asking to group by the substring in front of the (####). Note that groupby only works if the original list is sorted.

Jan Vlcinsky Over a year ago

@AdamSmith Thanks for the correction. Anyway, the first version of my answer does sort the list (but it is worth a reminder). I added a second version of keyf to sort and group by the string the OP was asking for.

Adam Smith Over a year ago

I always wonder how optimal it is to perform a sort just so we can groupby but honestly I've never tested it on any sort of "real world" data. I normally default to the method I used in my answer (defaultdict(list) with keys matching the groupings) if I think I'm going to run into unsorted data. Honestly, since you're matching the first part of the string, you can just use sorted(list1) and sort it lexicographically and not perform the extra str.split

Jan Vlcinsky Over a year ago

@AdamSmith I like pair programming with you. Simplified sorting added. Regarding efficiency - it is really difficult to evaluate, defaultdict has also some costs, but on longer unsorted list I would really expect defaultdict to be faster.

BrunoSXS Over a year ago

Is it possible to use regular expressions with this one? I was not able to do so while using string.split(). The pattern (XXXX) is the year, so it will always have those four characters. Splitting on the first "(" can cut the string in the wrong place, since there will be strings that have "(" before the "(XXXX)" year pattern.


Dan Lenski · Accepted Answer · 2014-06-19 16:21:45Z

4

I'd use a collections.defaultdict and re.findall up to the paren with a lookahead.

import collections
import re

def groupitems(lst):
    groups = collections.defaultdict(list)

    for item in lst:
        try:
            head = re.findall(r".+(?=\(\d{4}\))", item)[0]
        except IndexError: # there is no (\d{4})
            head = item # so take the whole string
        groups[head].append(item)

    return groups.values()
    # if you ABSOLUTELY MUST return a list, cast it here like this:
    #   return list( groups.values() )
    # however a dict_values object is list-like and should quack nicely.
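For reference, a self-contained usage sketch of the same idea, assuming Python 3.7 or later (where dicts keep insertion order). The names HEAD and group_by_head are illustrative, and re.search with a fallback stands in for the try/except around re.findall:

```python
import collections
import re

# Text in front of a parenthesized four-digit number.
HEAD = re.compile(r".+(?=\(\d{4}\))")


def group_by_head(items):
    groups = collections.defaultdict(list)
    for item in items:
        match = HEAD.search(item)
        # No "(dddd)" in the string: use the whole string as its own key.
        head = match.group(0) if match else item
        groups[head].append(item)
    # Groups come out in the order their keys were first seen.
    return list(groups.values())


list1 = [
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 ",
]

grouped = group_by_head(list1)
print(grouped)
```

No sort is needed here, so the items inside each group keep their original order.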

edited Jun 19, 2014 at 16:21

Dan Lenski


answered Jun 19, 2014 at 16:02

Adam Smith


5 Comments

Jon Clements Over a year ago

Not a fantastic idea to have the imports inside the function, and for robustness this'll break if something doesn't have something bracketed...

Adam Smith Over a year ago

@JonClements I know, generally in a project that would need this I would have imported both collections and re at module level, but there's no amazing way to show that in a single function :). OP specifically asked how to grab a substring that ends in a (####) pattern. You could always try:except for IndexError (or honestly re.match) and take the whole string in the except

Jon Clements Over a year ago

I know that... just pointing it out for others doesn't hurt though :p

Jan Vlcinsky Over a year ago

@AdamSmith replacing re.findall by string.split would save one line on import and the try/except block. And I expect string.split to be faster than re.findall

Adam Smith Over a year ago

@JanVlcinsky It would, but I'm concerned about input like "Here is the (substring and) after this comes (1234) Part 1"

nOOb cODEr · Accepted Answer · 2014-06-19 16:22:38Z

2

I would go with something a little simpler. Demo here http://dbgr.cc/8

import re

list1=[
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 "
]

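def group_items(lst):
    res = {}
    # Capture everything before the last parenthesized run of digits.
    reg = re.compile(r"^(.*)\(\d+\).*$")
    for item in lst:  # iterate over the argument, not the global list1
        match = reg.match(item)
        # Fall back to the whole string when there is no "(digits)" part.
        key = match.group(1) if match else item
        res.setdefault(key, []).append(item)

    return list(res.values())

print(group_items(list1))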

With the output being:

[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'], ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']]
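Since dict.setdefault can be unfamiliar, here is a minimal side-by-side sketch of it and collections.defaultdict doing the same grouping job. The word list and variable names are made up for illustration:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana"]

# setdefault returns the existing list for the key, or stores and
# returns the given empty list if the key is missing.
by_letter_plain = {}
for word in words:
    by_letter_plain.setdefault(word[0], []).append(word)

# defaultdict(list) creates the empty list automatically on first access.
by_letter_default = defaultdict(list)
for word in words:
    by_letter_default[word[0]].append(word)

print(by_letter_plain)
print(dict(by_letter_default))
```

Both loops build the same mapping; the choice between them is mostly a matter of taste.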

answered Jun 19, 2014 at 16:22

nOOb cODEr


4 Comments

Adam Smith Over a year ago

Note that Python caches recently used compiled regexes, so your compile here is not strictly necessary. re.match(patt, item) would be equivalent.

BrunoSXS Over a year ago

Although it's still a little difficult for me to understand because of concepts like "setdefault" (I'm searching and messing with it right now), I think your answer is the best for my case, since the files I'm going to go over to get the names for the list can have "(" before the pattern "(XXXX)".

Adam Smith Over a year ago

@BrunoSXS dict.setdefault(K, n) is dict[K] if that exists, else it's dict[K] = n; return n. It's useful, but in this case is only being used to simulate a defaultdict (as in my solution) which handles this more cleanly.

BrunoSXS Over a year ago

I was playing with it and reading over at python.org. It's a really useful feature. If I had known about this one, I would have used a dictionary instead of lists.

Jan Vlcinsky · Accepted Answer · 2014-06-19 16:39:00Z

Based on my other answer and the use of defaultdict as proposed by Adam Smith, here is an alternative method.

It uses text.split to detect the grouping key

It uses map to loop over values to assign them to proper key in defaultdict

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Here are the 4 lines of code:

>>> from collections import defaultdict
>>> groups = defaultdict(list)
>>> map(lambda itm: groups[itm.split("(")[0]].append(itm), list1)
[None, None, None, None, None]
>>> groups.values()
[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

Anyway, this assumes that the first "(" is the delimiter. If there is a value like "The (unexpected) yard (1000) Item 44", the grouping would not meet expectations, and using re would be the way to go.
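One caveat if you run this on Python 3: map returns a lazy iterator there, so the append calls do not happen until something consumes it, and the session above reflects Python 2 behaviour. A sketch of the same grouping with a plain loop (the variable names lazy, seen_before_loop and result are illustrative):

```python
from collections import defaultdict

list1 = [
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 ",
]

groups = defaultdict(list)

# In Python 3 this only builds a lazy map object; nothing is appended yet.
lazy = map(lambda itm: groups[itm.split("(")[0]].append(itm), list1)
seen_before_loop = len(groups)

# An explicit loop performs the appends regardless of Python version.
for itm in list1:
    groups[itm.split("(")[0]].append(itm)

result = list(groups.values())
print(seen_before_loop, result)
```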
