Python regex format

Question

I'm trying to match some strings using Python's re module, but I can't get it right. The strings I have to deal with look like this (example):

XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
asdf_1980_2234a_2
XY_abcd_5098_2270_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH

The pattern is not constant; it can vary to some extent. This is what matters to me:

Forget about the start of the string, up to the first numeric value. It's not important, I don't need it, and it should be stripped from any result.
Then there are always four digits, which can be followed by up to three alphabetical characters. I need this part extracted.
Then, after one or more underscores (there might be a minus in there, too), comes another set of numeric values I need. It's always two to four digits, and it might also be followed by up to three alphabetical characters.
Right after this section, separated by further underscores, there could be further numeric values which are important and belong to the previous values. They might contain alphabetical characters, too.
The end of the string might contain something like "NC" and maybe further characters; this is not important and should be stripped.

So, according to the previous example, this is what I need to work with:

('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')

...I've never done such if-and-if-not things in regex, and it's driving me crazy. Here is what I've got so far, but it's not exactly what I need:

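import re

pattern = re.compile(
    r"""
    ([0-9]{4}           # group 1: four digits ...
    [A-Z]{0,3})         # ... plus up to three letters
    [_-]{1,3}           # one to three separators
    ([0-9]{2,4}         # group 2: two to four digits ...
    [0-9A-Z_-]{0,16})   # ... plus up to 16 further characters
    """,
    re.IGNORECASE | re.VERBOSE)

if re.search(pattern, string):
    print re.findall(pattern, string)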

When I use this on the last example above, this is what I get:

[(u'7659Ae', u'1450sp_rev_2_1_NC_GR')]

...almost what I need - but I don't know how to exclude this _NC_GR at the end, and this simple method of limiting the characters by count is just not good.

Does anyone have a nice and working solution to this case?

Martijn Pieters · Accepted Answer · 2013-03-20 12:05:11Z

3

You need to use a negative lookahead to match characters that are not followed by NC. Reformatting your regular expression a little to show off the groupings:

pattern     = re.compile(r"""
              ( [0-9]{4} [A-Z]{0,3} )
              [_-]{1,3}
              ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
              """, re.IGNORECASE | re.VERBOSE)

with the {0,16} replaced with a bold * quantifier, results in:

>>> for match in pattern.findall(inputtext):
...     print match
... 
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('1980', '2234a_2')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')

So the (non-capturing) group (?:[0-9A-Z_-](?!NC)) matches any digit, letter, underscore or dash that is not followed by the characters NC.
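To see the effect of the lookahead in isolation, here is a minimal sketch in Python 3 syntax (the code above is Python 2). It runs the question's {0,16} pattern and the lookahead pattern on the last sample string; the names FLAGS, counted, lookahead and sample are only illustrative.

```python
import re

FLAGS = re.IGNORECASE | re.VERBOSE

# Pattern from the question: the tail is limited by character count only.
counted = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )
    [_-]{1,3}
    ( [0-9]{2,4} [0-9A-Z_-]{0,16} )
    """, FLAGS)

# Same pattern, but each tail character must not be followed by "NC".
lookahead = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )
    [_-]{1,3}
    ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
    """, FLAGS)

sample = "PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH"

counted_result = counted.findall(sample)
lookahead_result = lookahead.findall(sample)

print(counted_result)    # [('7659Ae', '1450sp_rev_2_1_NC_GR')]
print(lookahead_result)  # [('7659Ae', '1450sp_rev_2_1')]
```

The underscore in front of NC is the character that fails the lookahead, so it is left out of the second group as well.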

edited Mar 20, 2013 at 12:05

answered Mar 19, 2013 at 19:17


2 Comments

xph Over a year ago

Thank you very much! This is exactly what I was looking for - works perfectly! I wasn't able to use this lookahead as described in the docs, thank you for this solution!

xph Over a year ago

While this works really well, the regex eyquem provided does a slightly different job when it comes to recognizing the end of the second part, the _NC-thingy. That difference makes his regex more usable for my needs. Thank you anyway!

Community · Accepted Answer · 2020-06-20 09:12:55Z

2

Martijn's solution doesn't work for me, so here is mine.

Note that I don't use re.IGNORECASE. As a result, my regex keeps the end of
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof
in the match, because the lowercase nc is not treated as NC. I don't know whether that is really what you want in this case.

inputtext = """XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
asdf_1980_2234a_2
XY_abcd_5098_2270_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof"""
print inputtext

import re

print """\n----------------------------------------
WANTED
('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')"""
print '----------- eyquem ----------------------'
ri = re.compile(r'^\D+'
                r'(\d{4}[a-zA-Z]{0,3})'
                r'[_-]+'
                r'(.+?)'
                r'(?:[_-]+NC.*)?$',
                re.MULTILINE)

for match in ri.findall(inputtext):
    print match
    
print '----------- Martijn ----------------------'
ro     = re.compile(
              r"""
              ([0-9]{4}
              [A-Z]{0,3})
              [_-]{1,3}
              ([0-9]{2,4}
              [0-9A-Z_-]{0,16}?)
              (?:[-_]NC)?
              """,
              re.IGNORECASE | re.VERBOSE)

for match in ro.findall(inputtext):
    print match

result

----------------------------------------
WANTED
('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
----------- eyquem ----------------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('1980', '2234a_2')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_nc_woof')
----------- Martijn ----------------------
('1234', '0040')
('1122Ae', '1150')
('0124e', '50')
('1980', '2234')
('5098', '2270')
('7659Ae', '1450')
('7659Ae', '1450')
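
The extra last line in the eyquem output comes purely from case sensitivity. A minimal sketch (Python 3 syntax; PATTERN, strict and relaxed are illustrative names) that compiles the same expression with and without re.IGNORECASE:

```python
import re

# Same expression as the "eyquem" regex above, kept as one raw string.
PATTERN = (r'^\D+'
           r'(\d{4}[a-zA-Z]{0,3})'
           r'[_-]+'
           r'(.+?)'
           r'(?:[_-]+NC.*)?$')

strict = re.compile(PATTERN, re.MULTILINE)                   # NC only
relaxed = re.compile(PATTERN, re.MULTILINE | re.IGNORECASE)  # NC, nc, Nc, nC

line = "PC_bos_7659Ae_1450sp_rev_2_1_nc_woof"

strict_groups = strict.match(line).groups()
relaxed_groups = relaxed.match(line).groups()

print(strict_groups)   # ('7659Ae', '1450sp_rev_2_1_nc_woof')
print(relaxed_groups)  # ('7659Ae', '1450sp_rev_2_1')
```

Which of the two behaviours is right depends on whether a lowercase nc in the data should count as the end marker.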

My regex can be used on individual lines:

for s in inputtext.splitlines(True):
    print ri.match(s).groups()

This gives the same result.
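
One caveat with the per-line loop above: ri.match(s) returns None for a line that doesn't fit the pattern, and calling .groups() on None raises AttributeError. A hedged sketch of a guarded version (Python 3 syntax; the helper name extract_pairs and the non-matching sample line are made up for illustration):

```python
import re

ri = re.compile(r'^\D+'
                r'(\d{4}[a-zA-Z]{0,3})'
                r'[_-]+'
                r'(.+?)'
                r'(?:[_-]+NC.*)?$',
                re.MULTILINE)


def extract_pairs(text):
    """Return the two captured groups for every line that matches."""
    pairs = []
    for line in text.splitlines():
        m = ri.match(line)
        if m is None:
            # The line doesn't follow the expected layout; skip it.
            continue
        pairs.append(m.groups())
    return pairs


sample = "XY_efgh_0124e_50_NC\nno_digits_here\nasdf_1980_2234a_2"
pairs = extract_pairs(sample)
print(pairs)  # [('0124e', '50'), ('1980', '2234a_2')]
```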

EDIT

import re

inputtext = """XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
XY_efgh_0228e_66-__NC
asdf_1980_2234a_2   
asdf_2999_133a
XY_abcd_5098_2270_2_1_NC
XY_abcd_6099_33370_2_1_NC
XY_abcd_6099_3370abcd_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1___NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof_NC
PC_bos_7659Ae_1450sp_rev_2_1_anc_woof_NC
PC_bos_7659Ae_1450sp_rev_2_1_abNC_woof_NC"""

print '----------- Martijn 2 ------------'
ruu     = re.compile(r"""
              ( [0-9]{4} [A-Z]{0,3} )
              [_-]{1,3}
              ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
              """, re.IGNORECASE | re.VERBOSE)
for match in ruu.findall(inputtext):
    print match
print '----------- eyquem 2 ------------'
rii = re.compile(r'[_-]'
                 r'(\d{4}[A-Z]{0,3})'
                 r'[_-]{1,3}'
                 r'('
                   r'(?=\d{2,4}[A-Z]{0,3}(?![\dA-Z]))'
                   r'(?:[0-9A-Z_-]+?)'
                 r')'
                 r'(?:[-_]+NC.*)?'
                 r'(?![0-9A-Z_-])',
                 re.IGNORECASE)
for m in rii.findall(inputtext):
    print m

result

----------- Martijn 2 ------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('0228e', '66-_')
('1980', '2234a_2')
('2999', '133a')
('5098', '2270_2_1')
('6099', '33370_2_1')
('6099', '3370abcd_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1__')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_')
('7659Ae', '1450sp_rev_2_1_a')
----------- eyquem 2 ------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('0228e', '66')
('1980', '2234a_2')
('2999', '133a')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_anc_woof')
('7659Ae', '1450sp_rev_2_1_abNC_woof')

Remarks:

my regex doesn't catch '33370_2_1' nor '3370abcd_2_1' because they don't fit the pattern "2 to 4 digits, possibly followed by at most 3 letters", whereas Martijn's solution catches them
the ends of the portions caught by my regex are clean; in Martijn's code they aren't
Martijn's regex stops in front of every sequence NC or nc, even if it isn't preceded by an underscore, that is to say even when these letters are part of the wanted portion.
If this behaviour of my regex isn't desired, let me know and I will modify it
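
The first remark can be checked in isolation. The sketch below (Python 3 syntax; HEAD and fits_head are illustrative names, not part of the code above) applies only the body of the lookahead from the second regex to the start of each candidate portion:

```python
import re

# Body of the lookahead used in the second regex: 2 to 4 digits, then at
# most 3 letters, and no further digit or letter directly afterwards.
HEAD = re.compile(r'\d{2,4}[A-Z]{0,3}(?![\dA-Z])', re.IGNORECASE)


def fits_head(portion):
    """True if the portion starts with a valid digits-plus-letters token."""
    return HEAD.match(portion) is not None


checks = {p: fits_head(p)
          for p in ('50', '133a', '2234a_2', '33370_2_1', '3370abcd_2_1')}

for portion, ok in checks.items():
    print(portion, ok)
```

The first three portions pass, while '33370_2_1' (five digits) and '3370abcd_2_1' (four letters) are rejected, which matches the two lines missing from the "eyquem 2" output.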

edited Jun 20, 2020 at 9:12

answered Mar 19, 2013 at 21:10

eyquem

14 Comments

xph Over a year ago

Hi, thank you for your effort - but your regex isn't really what I was looking for. I really have no need for "nc_woof" or such things ;-) Btw, Martijn's regex above is a little different from the one you use in your code - it leads to different results. Thank you anyway!

eyquem Over a year ago

I don't understand what you mean by "no need for "nc_woof" or such things". I notice there are only digits at the end of the portion you want, separated from NC by _ or touching the end of the string when there is no NC. Do you mean that the case where there are letters at the end of this searched portion doesn't occur in your data?

eyquem Over a year ago

Also, you've put [_-]{1,3} in your code. Could these cases happen : PC_bos_7659Ae_1450sp_rev_2_1___NC_GRAPH or PC_bos_7659Ae_1450sp_rev_2_1_--NC_GRAPH ?

xph Over a year ago

One of your results is '1450sp_rev_2_1_nc_woof' - that contains the nc... part at the end, which I need to be excluded from the match. I can't say where exactly case-sensitivity is important, but to be safe I'd like to match all alphabetical characters case-insensitively; that way I won't get into trouble when the data is not consistent. And while most of the data has its values separated by underscores like value_value, some of the data could be separated like this: value_-_value - that's why I wrote it like this.

eyquem Over a year ago

I crafted my regex precisely to catch an ending _nc_woof , imagining that could be what you would like in such a case. But as I said, it was just hypothesis, and it was to show the danger of using re.IGNORECASE. Now my regex can be easily changed. But I'd like to understand thoroughly: may the case _2_nc_woof be possible and in this case nc having to be considered as NC ? Or is _NC the only (=cased) portion that signals the end of the portion you want to catch ?
