Most efficient way to identify substrings in a string in Python?

Question

I need to search a fairly lengthy string for CPV (Common Procurement Vocabulary) codes.

At the moment I'm doing this with a simple for loop and str.find().

The problem is that if a CPV code is listed in a slightly different format, this approach won't find it.

What's the most efficient way of searching for all the different variations of a code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and calling str.find() for each variation?

Examples of the different formats:

30124120-1 
301241201 
30124120 - 1
30124120 1
30124120.1

etc.

Thanks :)

Fred Foo · Accepted Answer · 2011-01-12 19:09:59Z

4

Try a regular expression:

>>> import re
>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']

(Modify until it matches the CPVs in your data closely.)
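As one possible way of tightening the pattern above, here is a rough sketch. It assumes a CPV code is always eight digits plus a single check digit, as in the question's examples, and that only spaces, tabs, a hyphen or a period can separate the two parts. The names CPV_RE and find_cpv_codes are made up for illustration.

```python
import re

# Assumption: eight digits, an optional separator, then one check digit.
# [ \t]* (rather than \s*) keeps a match from running across a line break.
CPV_RE = re.compile(r'\b(\d{8})[ \t]*[-.]?[ \t]*(\d)\b')

def find_cpv_codes(text):
    """Return every CPV-like code in text, rewritten as 'dddddddd-d'."""
    return ["{0}-{1}".format(body, check)
            for body, check in CPV_RE.findall(text)]

sample = "foo 30124120-1 bar 21966823.1 baz 30124120 - 1 qux 301241201"
print(find_cpv_codes(sample))
```

Because findall returns the two groups separately, every variant comes back in the same canonical form, whichever separator was used in the text.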

answered Jan 12, 2011 at 19:09


Rafe Kettler · Accepted Answer · 2011-01-12 19:08:05Z

1

Try using any of the functions in re (regular expressions for Python). See the docs for more info.

You can craft a regular expression to accept a number of different formats for these codes, and then use re.findall or something similar to extract the information. I'm not certain what a CPV is so I don't have a regular expression for it (though maybe you could see if Google has any?)
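To make that concrete, here is a hedged sketch of the extract-then-look-up idea: scan the text once with a regular expression, normalize each hit, and test it against a set of known codes instead of calling str.find() once per code. The pattern assumes eight digits plus one check digit, and the names known_codes_in and known are hypothetical.

```python
import re

# Assumption: eight digits, an optional separator, then one check digit.
CPV_RE = re.compile(r'\b(\d{8})[ \t]*[-.]?[ \t]*(\d)\b')

def known_codes_in(text, known_codes):
    """Return (code, start_offset) for each match that is in known_codes."""
    hits = []
    for m in CPV_RE.finditer(text):
        code = m.group(1) + "-" + m.group(2)
        # One set lookup per match, instead of one str.find() per known code
        if code in known_codes:
            hits.append((code, m.start()))
    return hits

# Hypothetical subset of the full CPV list, stored in 'dddddddd-d' form
known = {"30124120-1", "21966823-1"}
text = "Lots: 30124120 - 1, 99999999.9 and 219668231"
print(known_codes_in(text, known))
```

With this shape the text is scanned once, so the cost depends on the length of the string rather than on the number of known codes.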

answered Jan 12, 2011 at 19:08


Hugh Bothwell · Accepted Answer · 2011-01-13 13:15:24Z

1

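import re

# Sample data from the question, one format per line
ex = "30124120-1\n301241201\n30124120 - 1\n30124120 1\n30124120.1"

# The hyphen goes last in the character class so it is literal, not a range
cpv = re.compile(r'(\d{8})(?:[ .\t/\\-]*)(\d{1}\b)')

for m in cpv.finditer(ex):
    cpval, chk = m.groups()
    print("{0}-{1}".format(cpval, chk))

applied to your sample data returns

30124120-1
30124120-1
30124120-1
30124120-1
30124120-1

The regular expression can be read as

(\d{8})         # eight digits

(?:             # followed by a sequence which does not get returned
  [ .\t/\\-]*   #   consisting of 0 or more
)               #   spaces, periods, tabs, forward- or backslashes, hyphens

(\d{1}\b)       # followed by one digit, ending at a word boundary
                #   (i.e. not followed by another digit, letter or underscore)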
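If it helps, the commented reading can also be used directly as the pattern via re.VERBOSE. The sketch below is a variant rather than the exact expression from this answer: the hyphen is placed last in the character class so that it is read literally, and the names CPV_VERBOSE and normalize are made up for illustration.

```python
import re

# In verbose mode, whitespace and # comments outside a character class are
# ignored, so the pattern can carry its own explanation.
CPV_VERBOSE = re.compile(r"""
    (\d{8})         # eight digits
    [ .\t/\\-]*     # zero or more separators; hyphen last so it is literal
    (\d)\b          # one check digit, ending at a word boundary
""", re.VERBOSE)

def normalize(text):
    """Rewrite every code found in text as 'dddddddd-d'."""
    return CPV_VERBOSE.sub(r"\1-\2", text)

print(normalize("30124120 - 1 and 30124120.1"))
```

Normalizing the whole string first means a plain str.find() or "in" test for the canonical form would then work on the result.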

Hope that helps!

edited Jan 13, 2011 at 13:15

answered Jan 13, 2011 at 2:27


2 Comments

Fred Foo Over a year ago

+1 for the normalizing. I do recommend using the r string prefix instead of \\\t, though.

Hugh Bothwell Over a year ago

@larsman: thank you, I have changed it to a raw string and reordered the character-list for easier comprehension.
