I have the following input data:

INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]

I would like to convert it to the following output format:

OUTPUT = [
('ABCD, Jun/14/1999'),
('EFGH, Jan/10/1998'),
('IJKL, Jul/15/1985'),
('MNOP, Dec/21/1999'),
('QRST, Apr/1/2000'),
('UVWX, Feb/11/2001')
]

I tried the following code, which partly works, but I cannot get the result into the desired OUTPUT format:

import re

INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]


def formatted_def(input):
    for n in input:
        t = re.sub('[^a-zA-Z0-9 ]+','',n).split('DOB')
        print(t)


formatted_def(INPUT)

Output:

['ABCD  ', '  Jun141999']
['EFGH  ', '  Jan101998']
['IJKL  ', '  Jul151985']
['MNOP  ', '  Dec211999']
['QRST  ', '  Apr012000']
['UVWX  D O B  Feb112001 ']

Any pointers will be very helpful. Thanks in advance!

4 Answers

import re
re.findall(r'(\w+)\s+,.*?-\s+([^., ]*)', ' '.join(INPUT))
# [('ABCD', 'Jun/14/1999'), ('EFGH', 'Jan/10/1998'), ('IJKL', 'Jul/15/1985'), ('MNOP', 'Dec/21/1999'), ('QRST', 'Apr/01/2000'), ('UVWX', 'Feb/11/2001')]
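
The call above returns (name, date) tuples rather than the 'NAME, DATE' strings shown in the question. A minimal sketch of one way to get the strings, applying the same pattern line by line; the helper name `extract` is only illustrative and not part of the original answer:

```python
import re

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 '
]

# Same pattern as above: the name, then anything up to a "-" that is
# followed by whitespace, then the date up to the next '.', ',' or space.
PATTERN = re.compile(r'(\w+)\s+,.*?-\s+([^., ]*)')


def extract(line):
    """Return 'NAME, DATE' for one line, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    name, date = match.groups()
    return '{}, {}'.format(name, date)


formatted = [extract(line) for line in INPUT]
print(formatted)
```

Searching each line separately also avoids joining the whole list into one string, so a malformed line shows up as None instead of shifting later matches.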

You can use re.findall:

import re
l = ['ABCD , D.O.B: - Jun/14/1999.', 'EFGH , DOB; - Jan/10/1998,', 'IJKL , D-O-B - Jul/15/1985..', 'MNOP , (DOB)* - Dec/21/1999,', 'QRST , *DOB* - Apr/01/2000.', 'UVWX , D O B, - Feb/11/2001 ']
final_data = [', '.join(re.findall(r'^\w+|[a-zA-Z]+/\d+/\d+(?=\W)', i)) for i in l]

Output:

['ABCD, Jun/14/1999', 'EFGH, Jan/10/1998', 'IJKL, Jul/15/1985', 'MNOP, Dec/21/1999', 'QRST, Apr/01/2000', 'UVWX, Feb/11/2001']
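
One caveat: the lookahead (?=\W) requires a non-word character after the date, so a line that ends right after the year would lose its date. This is a sketch of a possible variation, assuming such lines can occur in your data (the sample input never has one); it swaps the lookahead for a word boundary, and the function name `name_and_date` is made up for the example:

```python
import re

# \b also matches at the very end of the string, unlike (?=\W),
# which needs one more character after the year.
PATTERN = re.compile(r'^\w+|[a-zA-Z]+/\d+/\d+\b')


def name_and_date(line):
    """Join the leading name and the date found in one line."""
    return ', '.join(PATTERN.findall(line))


print(name_and_date('ABCD , D.O.B: - Jun/14/1999.'))
print(name_and_date('UVWX , D O B, - Feb/11/2001'))  # no trailing character
```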

In addition to the other answer, you can also use re.sub:

import re

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 '
]

pattern = r'(?i)^([a-z]+).*([a-z]{3}/\d{2}/\d{4}).*$'

OUTPUT = [re.sub(pattern, r'\1, \2', x) for x in INPUT]

# OUTPUT:

[
    'ABCD, Jun/14/1999',
    'EFGH, Jan/10/1998',
    'IJKL, Jul/15/1985',
    'MNOP, Dec/21/1999',
    'QRST, Apr/01/2000',
    'UVWX, Feb/11/2001'
]
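
The desired output in the question shows 'QRST, Apr/1/2000', with the day's leading zero dropped, while the result above keeps 'Apr/01/2000'. If that difference is intentional (an assumption, it may simply be a typo in the question), a replacement function can rebuild the date. A minimal sketch; the function name `reformat` and the split of the date into three groups are illustrative choices:

```python
import re

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 '
]

# Capture name, month, day and year separately so the day can be rewritten.
PATTERN = re.compile(r'^([a-z]+).*?([a-z]{3})/(\d{2})/(\d{4}).*$', re.IGNORECASE)


def reformat(match):
    """Build 'NAME, Mon/D/YYYY'; int() drops a leading zero from the day."""
    name, month, day, year = match.groups()
    return '{}, {}/{}/{}'.format(name, month, int(day), year)


OUTPUT = [PATTERN.sub(reformat, line) for line in INPUT]
print(OUTPUT)
```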

The main difficulty is reproducing the ('ABCD, Jun/14/1999'), entries literally.

They cannot be single-element tuples, because Python would print such a tuple as ('ABCD, Jun/14/1999',) (note the extra comma before the closing parenthesis). In source code, ('ABCD, Jun/14/1999') is just a parenthesized string.

So, to reproduce the parenthesized layout you asked for, I print the result with a series of print calls.

The whole script (Python 3) could look like this:

import re
# Named "lines" rather than "input" to avoid shadowing the built-in input().
lines = [
  'ABCD , D.O.B: - Jun/14/1999.',
  'EFGH , DOB; - Jan/10/1998,',
  'IJKL , D-O-B - Jul/15/1985..',
  'MNOP , (DOB)* - Dec/21/1999,',
  'QRST , *DOB* - Apr/01/2000.',
  'UVWX , D O B, - Feb/11/2001 '
]
# Keep the leading name and the date that follows " - "; drop everything else.
result = [ re.sub(r'^([a-z]+).*? - ([a-z]{3}/\d{2}/\d{4}).*',
                  r'\1, \2', txt, flags = re.IGNORECASE) for txt in lines ]
print('OUTPUT = [')
for txt in result:
    print(" ('{}')".format(txt))
print(']')
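
The loop above prints each entry without the separating commas that appear in the question's OUTPUT listing. If those commas matter, one possible variation is to build the text with str.join; this is only a sketch, and the helper name `render` is hypothetical:

```python
import re

lines = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 '
]

# Same substitution as in the script above, compiled once.
pattern = re.compile(r'^([a-z]+).*? - ([a-z]{3}/\d{2}/\d{4}).*', re.IGNORECASE)
result = [pattern.sub(r'\1, \2', txt) for txt in lines]


def render(items):
    """Format items as a listing with a comma after every entry but the last."""
    body = ',\n'.join("('{}')".format(item) for item in items)
    return 'OUTPUT = [\n' + body + '\n]'


print(render(result))
```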
