3

Suppose I have this list:

list1=["House of Mine (1293) Item 21",
       "House of Mine (1292) Item 24",
       "The yard (1000) Item 1 ",
       "The yard (1000) Item 2 ",
       "The yard (1000) Item 4 "]

I want to add each item to a group (a list inside a list, in this case) if the substring up to the (XXXX) is the same.

So, in this case, I am expecting to have:

[["House of Mine (1293) Item 21",
  "House of Mine (1292) Item 24"],

 ["The yard (1000) Item 1 ",
  "The yard (1000) Item 2 ",
  "The yard (1000) Item 4 "]]

The following code is what I was able to make, but it's not working:

def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)

    return group

Thanks to another Stack Overflow topic, I have read the page on regular expressions at http://www.diveintopython3.net/regular-expressions.html

I know the answer lies there, but I'm having difficulty understanding some of its concepts.

4 Answers

7

Set up the list to group:

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Define a function that produces the key used to sort and group the items (this time using the number in parentheses):

>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'

Sort the list (in place here):

>>> list1.sort(key=keyf)  # sort by the grouping key; when grouping by the leading text, a plain alphabetical sort is good enough (as Adam Smith noted)

Take groupby from itertools

>>> from itertools import groupby

Check the concept:

>>> for gr, items in groupby(list1, key=keyf):
...     print("gr", gr)
...     print("items", list(items))
...
>>> list1
['The yard (1000) Item 1 ',
 'The yard (1000) Item 2 ',
 'The yard (1000) Item 4 ',
 'House of Mine (1292) Item 24',
 'House of Mine (1293) Item 21']

Note that we had to call list on items, because items is an iterator over the members of the group, not a list.
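
A small sketch of why the list(...) call matters, assuming a throwaway list named words (not part of the question). The groups that groupby yields share one underlying iterator, so a group that was not materialised while it was current has nothing left once groupby moves on:

```python
from itertools import groupby

# Hypothetical sample data, already sorted by the grouping key (first letter).
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Keep the raw group iterators without calling list() on them.
lazy = [(key, items) for key, items in groupby(words, key=lambda w: w[0])]
# groupby has already advanced past these groups, so the stored
# iterators for the earlier groups yield nothing.
exhausted = [list(items) for _, items in lazy]

# Materialise each group while it is still the current one.
eager = [(key, list(items)) for key, items in groupby(words, key=lambda w: w[0])]

print(exhausted[0])  # []
print(eager[0])      # ('a', ['apple', 'avocado'])
```

This is why the loop above wraps items in list(...) before printing.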

Now using list comprehension:

>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 '],
 ['House of Mine (1292) Item 24'],
 ['House of Mine (1293) Item 21']]

and we are done.
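
The sort step is not optional: groupby only merges items that are adjacent. A minimal sketch of what happens without it, assuming a hypothetical helper named prefix that groups by the text before the first "(" and a shortened, deliberately shuffled input:

```python
from itertools import groupby

def prefix(text):
    """Grouping key: everything before the first "(" (hypothetical helper)."""
    return text.split("(")[0]

# Hypothetical input where equal prefixes are NOT adjacent.
unsorted_items = [
    "The yard (1000) Item 1 ",
    "House of Mine (1293) Item 21",
    "The yard (1000) Item 2 ",
]

# Without sorting, "The yard " shows up as two separate groups.
fragmented = [list(g) for _, g in groupby(unsorted_items, key=prefix)]

# Sorting by the same key first makes equal keys adjacent.
merged = [list(g) for _, g in groupby(sorted(unsorted_items, key=prefix), key=prefix)]

print(len(fragmented))  # 3
print(len(merged))      # 2
```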

If you want to group by all the text before the first "(", the only change is to the key function:

>>> keyf = lambda text: text.split("(")[0]

Short version answering OP

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

Variation using re.findall

The solution above assumes that "(" is the delimiter and ignores the requirement of having four digits inside the parentheses. That requirement can be handled using re.

>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '

But it raises IndexError: list index out of range if the text does not have the expected content (we are trying to access the item at index 0 of an empty list).

>>> text = "nothing here"
>>> keyf(text)
Traceback (most recent call last):
  ...
IndexError: list index out of range

To survive this, we can use a simple trick: append the original text to the result, so that the list is never empty:

>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'

Final solution using re

>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
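
For reference, here is the same idea as a self-contained script. This is only a sketch: the name YEAR_PREFIX and the extra sample strings are assumptions added for illustration, and re.match with an explicit fallback replaces the "+ [text]" trick. Sorting by keyf instead of alphabetically keeps items with the same key in their original relative order:

```python
import re
from itertools import groupby

# Everything before a four-digit number in parentheses, e.g. "(1293)".
YEAR_PREFIX = re.compile(r".+(?=\(\d{4}\))")

def keyf(text):
    """Return the text before "(dddd)", or the whole text if there is none."""
    match = YEAR_PREFIX.match(text)
    return match.group(0) if match else text

items = [
    "House of Mine (1293) Item 21",
    "A (small) house (1500) Item 3",   # hypothetical: extra parentheses
    "House of Mine (1292) Item 24",
    "A (small) house (1500) Item 1",
    "no year here",                    # hypothetical: no "(dddd)" at all
]

# sorted() is stable, so equal keys keep their input order.
grouped = [list(g) for _, g in groupby(sorted(items, key=keyf), key=keyf)]
for group in grouped:
    print(group)
```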

7 Comments

OP is actually asking to group by the substring in front of the (####). Note that groupby only works if the original list is sorted.
@AdamSmith Thanks for the correction. Anyway, the first version of my answer does sort the list (but it is worth a reminder). I added a second version of keyf to sort and group by the string OP was asking for.
I always wonder how optimal it is to perform a sort just so we can groupby but honestly I've never tested it on any sort of "real world" data. I normally default to the method I used in my answer (defaultdict(list) with keys matching the groupings) if I think I'm going to run into unsorted data. Honestly, since you're matching the first part of the string, you can just use sorted(list1) and sort it lexicographically and not perform the extra str.split
@AdamSmith I like pair programming with you. Simplified sorting added. Regarding efficiency - it is really difficult to evaluate, defaultdict has also some costs, but on longer unsorted list I would really expect defaultdict to be faster.
Is it possible to use regular expressions with this one? I was not able to do so while using string.split(). The pattern (XXXX) is the year, so it will always have those four characters. Splitting at the first "(" can cut the string in the wrong place, since some strings have a "(" before the "(XXXX)" year pattern.
4

I'd use a collections.defaultdict and re.findall up to the paren with a lookahead.

import collections
import re

def groupitems(lst):
    groups = collections.defaultdict(list)

    for item in lst:
        try:
            head = re.findall(r".+(?=\(\d{4}\))", item)[0]
        except IndexError: # there is no (\d{4})
            head = item # so take the whole string
        groups[head].append(item)

    return groups.values()
    # if you ABSOLUTELY MUST return a list, cast it here like this:
    #   return list( groups.values() )
    # however a dict_values object is list-like and should quack nicely.

5 Comments

Not a fantastic idea to have the imports inside the function, and for robustness this'll break if something doesn't have something bracketed...
@JonClements I know, generally in a project that would need this I would have imported both collections and re at module level, but there's no amazing way to show that in a single function :). OP specifically asked how to grab a substring that ends in a (####) pattern. You could always try:except for IndexError (or honestly re.match) and take the whole string in the except
I know that... just pointing it out for others doesn't hurt though :p
@AdamSmith Replacing re.findall with string.split would save one import line and the try/except block. I also expect string.split to be faster than re.findall.
@JanVlcinsky It would, but I'm concerned about input like "Here is the (substring and) after this comes (1234) Part 1"
2

I would go with something a little simpler. Demo here http://dbgr.cc/8

import re

list1=[
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 "
]

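def group_items(lst):
    res = {}
    reg = re.compile(r"^(.*)\(\d+\).*$")
    for item in lst:  # iterate over the argument, not the global list1
        match = reg.match(item)
        # fall back to the whole string when there is no "(digits)" part
        key = match.group(1) if match else item
        res.setdefault(key, []).append(item)

    return list(res.values())

print(group_items(list1))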

With the output being:

[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'], ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']]

4 Comments

Note that Python interns recently used regexes, so your compile here is not strictly necessary. re.match(patt, item) would be equivalent
Although it's still a little difficult for me to understand because of some concepts like "setdefault" (I'm searching and messing with it right now), I think your answer is the best for my case, since the files I'm going to go over to get the names for the list can have "(" before the pattern "(XXXX)".
@BrunoSXS dict.setdefault(K, n) is dict[K] if that exists, else it's dict[K] = n; return n. It's useful, but in this case is only being used to simulate a defaultdict (as in my solution) which handles this more cleanly.
I was playing with it and reading about it over at python.org. It's a really useful feature. If I had known about it, I would have used a dictionary instead of lists.
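
The setdefault behaviour discussed in these comments can be sketched side by side with defaultdict. The pairs data below is a made-up example, not taken from the question:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3)]  # hypothetical (key, value) data

# dict.setdefault: return d[key] if present, otherwise store the default first.
plain = {}
for key, value in pairs:
    plain.setdefault(key, []).append(value)

# defaultdict(list): a missing key is created by calling list() on first access.
auto = defaultdict(list)
for key, value in pairs:
    auto[key].append(value)

print(plain)       # {'a': [1, 3], 'b': [2]}
print(dict(auto))  # {'a': [1, 3], 'b': [2]}
```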
0

Based on my other answer and the use of defaultdict proposed by Adam Smith, here is an alternative method.

It uses text.split to detect the grouping key

It uses map to loop over values to assign them to proper key in defaultdict

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Here are the 4 lines of code:

>>> from collections import defaultdict
>>> groups = defaultdict(list)
>>> list(map(lambda itm: groups[itm.split("(")[0]].append(itm), list1))  # list() forces the lazy map to run in Python 3
[None, None, None, None, None]
>>> list(groups.values())
[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

Note that this assumes the first "(" is the delimiter. With a value like "The (unexpected) yard (1000) Item 44", it would pick the wrong key, and using re would be the way to go.
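
A quick sketch of that failure case, comparing the split-based key with a regex-based one. The pattern here is one possible choice, assuming the marker is always four digits in parentheses:

```python
import re

text = "The (unexpected) yard (1000) Item 44"

# Splitting on the first "(" cuts the name short.
split_key = text.split("(")[0]

# Anchoring on a four-digit number in parentheses keeps the full name.
match = re.match(r"(.+)\(\d{4}\)", text)
regex_key = match.group(1) if match else text

print(repr(split_key))  # 'The '
print(repr(regex_key))  # 'The (unexpected) yard '
```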
