I'm trying to match some strings using Python's re module, but I can't get it to work correctly. The strings I have to deal with look like this (example):

XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
asdf_1980_2234a_2
XY_abcd_5098_2270_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH

The pattern is not constant; it can vary to some degree. These are the parts that matter to me:

  • Forget about the start of the string, up to the first numeric value. It's not important, I don't need it, and it should be stripped from any result.

  • Then there are always four digits, which can be followed by up to three alphabetical characters. I need this part extracted.

  • Then, after some underscores (there might be a minus in there, too), comes another set of numeric values I need. It's always two to four digits (and might be followed by up to three alphabetical characters, too).

  • Right after this section, separated by further underscores, there could be further numeric values which are important and belong to the previous values. There might be alphabetical characters in them, too.

  • The end of the string might contain something like "NC" and maybe further characters. That part is not important and should be stripped.

So, according to the previous example, this is what I need to work with:

('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')

I've never done this kind of if-and-if-not matching in a regex, and it's driving me crazy. Here is what I've got so far, but it's not exactly what I need:

pattern     = re.compile(
              r"""
              ([0-9]{4}
              [A-Z]{0,3})
              [_-]{1,3}
              ([0-9]{2,4}
              [0-9A-Z_-]{0,16})
              """,
              re.IGNORECASE | 
              re.VERBOSE
              )

if re.search(pattern, string):
    print re.findall(pattern, string)

When I use this on the last example string, this is what I get:

[(u'7659Ae', u'1450sp_rev_2_1_NC_GR')]

...almost what I need - but I don't know how to exclude this _NC_GR at the end, and this simple method of limiting the characters by count is just not good.

Does anyone have a nice and working solution to this case?

2 Answers

You need to use a negative lookahead to match characters that are not followed by NC. Reformatting your regular expression a little to show off the groupings:

pattern     = re.compile(r"""
              ( [0-9]{4} [A-Z]{0,3} )
              [_-]{1,3}
              ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
              """, re.IGNORECASE | re.VERBOSE)

with the {0,16} replaced with a bold * quantifier, results in:

>>> for match in pattern.findall(inputtext):
...     print match
... 
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('1980', '2234a_2')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')

So the (non-capturing) group (?:[0-9A-Z_-](?!NC)) matches any digit, letter, underscore or dash that is not followed by the characters NC.
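
For readers on Python 3, here is a minimal sketch of the same pattern applied line by line to the six strings from the question. The pattern itself is unchanged; the `samples` and `results` names and the inline comments are only illustrative additions.

```python
import re

# Same pattern as above; only the comments are new.
pattern = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )                # four digits, up to three letters
    [_-]{1,3}                              # one to three separators
    ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )   # digits, then chars not followed by NC
    """, re.IGNORECASE | re.VERBOSE)

# Sample strings taken from the question.
samples = [
    "XY_efgh_1234_0040_rev_2_1_NC_asdf",
    "XY_abcd_1122Ae_1150_rev2_1_NC",
    "XY_efgh_0124e_50_NC",
    "asdf_1980_2234a_2",
    "XY_abcd_5098_2270_2_1_NC",
    "PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH",
]

# search() returns the first match in each string; groups() yields both parts.
results = [pattern.search(s).groups() for s in samples]
for r in results:
    print(r)
```

Because the lookahead is checked after each consumed character, the match stops on the last character that does not sit directly in front of NC, so the underscore before NC is never included for these inputs.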

2 Comments

Thank you very much! This is exactly what I was looking for, and it works perfectly. I wasn't able to use this lookahead as described in the docs, so thank you for this solution!
While this works really well, the regex eyquem provided behaves a little differently when it comes to recognizing the end of the second part, the _NC thingy. That difference makes his regex more usable for my needs. Thank you anyway!
Martijn's solution doesn't work for me, so here is mine.

Note that I don't use re.IGNORECASE.
Hence, my regex captures the ending of
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof
as part of the result. I don't know whether that is really what you want in this case.

inputtext = """XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
asdf_1980_2234a_2
XY_abcd_5098_2270_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof"""
print inputtext

import re

print """\n----------------------------------------
WANTED
('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')"""
print '----------- eyquem ----------------------'
# raw strings keep the backslashes in \D and \d from being read as string escapes
ri = re.compile(r'^\D+'
                r'(\d{4}[a-zA-Z]{0,3})'
                r'[_-]+'
                r'(.+?)'
                r'(?:[_-]+NC.*)?$',
                re.MULTILINE)

for match in ri.findall(inputtext):
    print match
    
print '----------- Martijn ----------------------'
ro     = re.compile(
              r"""
              ([0-9]{4}
              [A-Z]{0,3})
              [_-]{1,3}
              ([0-9]{2,4}
              [0-9A-Z_-]{0,16}?)
              (?:[-_]NC)?
              """,
              re.IGNORECASE | re.VERBOSE)

for match in ro.findall(inputtext):
    print match

result

----------------------------------------
WANTED
('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
----------- eyquem ----------------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('1980', '2234a_2')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_nc_woof')
----------- Martijn ----------------------
('1234', '0040')
('1122Ae', '1150')
('0124e', '50')
('1980', '2234')
('5098', '2270')
('7659Ae', '1450')
('7659Ae', '1450')

My regex can also be used on individual lines:

for s in inputtext.splitlines(True):
    print ri.match(s).groups()

same result
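
One caveat with the per-line form: `ri.match(s)` returns `None` for a line that doesn't fit, and calling `.groups()` on it raises an `AttributeError`. A minimal Python 3 sketch of a guarded version follows; the `extract` helper is a hypothetical name introduced here, not part of the original code.

```python
import re

# Same regex as above, written with raw strings.
ri = re.compile(r'^\D+'
                r'(\d{4}[a-zA-Z]{0,3})'
                r'[_-]+'
                r'(.+?)'
                r'(?:[_-]+NC.*)?$',
                re.MULTILINE)

def extract(line):
    """Return the two captured parts, or None if the line does not match."""
    m = ri.match(line)
    return m.groups() if m else None

print(extract("XY_efgh_1234_0040_rev_2_1_NC_asdf"))
print(extract("PC_bos_7659Ae_1450sp_rev_2_1_nc_woof"))
print(extract("no digits here"))
```

The second call shows the case-sensitive behaviour discussed above: the lowercase `_nc_woof` is kept in the captured part because only an uppercase `NC` ends it.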

EDIT

import re

inputtext = """XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
XY_efgh_0228e_66-__NC
asdf_1980_2234a_2   
asdf_2999_133a
XY_abcd_5098_2270_2_1_NC
XY_abcd_6099_33370_2_1_NC
XY_abcd_6099_3370abcd_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1___NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof_NC
PC_bos_7659Ae_1450sp_rev_2_1_anc_woof_NC
PC_bos_7659Ae_1450sp_rev_2_1_abNC_woof_NC"""

print '----------- Martijn 2 ------------'
ruu     = re.compile(r"""
              ( [0-9]{4} [A-Z]{0,3} )
              [_-]{1,3}
              ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
              """, re.IGNORECASE | re.VERBOSE)
for match in ruu.findall(inputtext):
    print match
print '----------- eyquem 2 ------------'
# raw strings keep the backslashes in \d from being read as string escapes
rii = re.compile(r'[_-]'
                r'(\d{4}[A-Z]{0,3})'
                r'[_-]{1,3}'
                r'('
                  r'(?=\d{2,4}[A-Z]{0,3}(?![\dA-Z]))'
                  r'(?:[0-9A-Z_-]+?)'
                 r')'
                r'(?:[-_]+NC.*)?'
                r'(?![0-9A-Z_-])',
                re.IGNORECASE)
for m in rii.findall(inputtext):
    print m

result

----------- Martijn 2 ------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('0228e', '66-_')
('1980', '2234a_2')
('2999', '133a')
('5098', '2270_2_1')
('6099', '33370_2_1')
('6099', '3370abcd_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1__')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_')
('7659Ae', '1450sp_rev_2_1_a')
----------- eyquem 2 ------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('0228e', '66')
('1980', '2234a_2')
('2999', '133a')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_anc_woof')
('7659Ae', '1450sp_rev_2_1_abNC_woof')

Remarks:

  • my regex doesn't catch '33370_2_1' or '3370abcd_2_1' because they don't respect the pattern "2 to 4 digits possibly followed by at most 3 letters",
    whereas Martijn's solution catches them

  • the ends of the portions caught by my regex are clean; in Martijn's code they aren't (see '66-_' and '1450sp_rev_2_1__' above)

  • Martijn's regex stops in front of every sequence NC or nc, even if it isn't preceded by an underscore, that is to say even when these letters are part of the wanted portion.
    If this characteristic of my regex isn't desired, tell me and I will modify it
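
To make the last two remarks concrete, here is a small Python 3 sketch comparing the lookahead from Martijn's answer with a hypothetical variant that only stops in front of a run of separators followed by NC. The `strict` pattern and both variable names are assumptions made for illustration; it is neither of the regexes posted above.

```python
import re

# Lookahead as in Martijn's answer: a character is rejected whenever the
# next two characters are "NC" (any case), with or without a separator.
loose = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )
    [_-]{1,3}
    ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
    """, re.IGNORECASE | re.VERBOSE)

# Hypothetical variant: before consuming each character, check that we are
# not standing at the start of "one or more separators, then NC".
strict = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )
    [_-]{1,3}
    ( [0-9]{2,4} (?: (?![_-]+NC) [0-9A-Z_-] )* )
    """, re.IGNORECASE | re.VERBOSE)

for line in ("XY_efgh_0228e_66-__NC",
             "PC_bos_7659Ae_1450sp_rev_2_1_abNC_woof_NC"):
    print(loose.search(line).groups(), strict.search(line).groups())
```

This variant only changes how the end of the second part is detected. Unlike my regex, it does not enforce the "2 to 4 digits, at most 3 letters" rule, so it would still accept '33370_2_1'.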

14 Comments

Hi, thank you for your effort, but your regex isn't really what I was looking for. I really have no need for "nc_woof" or such things ;-) Btw, Martijn's regex above is a little different from the one you use in your code, which leads to different results. Thank you anyway!
I don't understand what you mean by "no need for "nc_woof" or such things". I notice that there are only digits at the end of the portion you want, separated from NC by _ or touching the end of the string when there is no NC. Do you mean that the case of letters at the end of the searched portion doesn't occur in your data?
Also, you've put [_-]{1,3} in your code. Could these cases happen: PC_bos_7659Ae_1450sp_rev_2_1___NC_GRAPH or PC_bos_7659Ae_1450sp_rev_2_1_--NC_GRAPH?
One of your results is '1450sp_rev_2_1_nc_woof'. That contains the nc... part at the end, which I need excluded from the match. I can't say where exactly case sensitivity is important, but to be safe I'd like to match all alphabetical characters case-insensitively; that way I won't get into trouble when the data is not consistent. And while most of the data has its values separated by underscores like value_value, some of the data could be separated like this: value_-_value. That's why I wrote it like this.
I crafted my regex precisely to catch an ending _nc_woof, imagining that could be what you would like in such a case. But as I said, it was just a hypothesis, and it was meant to show the danger of using re.IGNORECASE. My regex can easily be changed. But I'd like to understand thoroughly: is the case _2_nc_woof possible, and if so, must nc be treated as NC? Or is _NC the only (cased) portion that signals the end of the portion you want to catch?