Using regular expressions to match a portion of a string (Python)

Question

What regular expression can I use to match the genes (the semicolon-separated names after GENE_LIST:) in this gene list string?

GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8

I tried GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.

It appears that the genes are separated by semicolons. You can use this fact to build a regex to meet your requirements.
– Code-Apprentice, Commented Aug 11, 2016 at 18:15
We can help you better if you post the complete Python program that you have tried.
– Code-Apprentice, Commented Aug 11, 2016 at 18:16

dawg · Accepted Answer · 2016-08-11 18:34:00Z

1

Given:

>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

You can do this with Python string methods:

>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

For a regex:

(?<=[:;]\s)([^\s;]+)

Or, in Python:

>>> import re
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
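
As for why the attempt in the question returned only one gene: when a capturing group sits inside a repetition, Python's re keeps only the text from the group's final iteration. Using findall with one match per gene avoids that. The sketch below illustrates the difference; the pattern assigned to repeated is a simplified stand-in for the one in the question, not the original.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Simplified stand-in for the question's pattern: one group repeated with +.
repeated = re.match(r"GENE_LIST:(?: ([^\s;]+);?)+", s)
last_only = repeated.group(1)  # only the final iteration of the group survives

# One match per gene instead: findall returns every capture.
all_genes = re.findall(r"(?<=[:;]\s)([^\s;]+)", s)

print(last_only)
print(all_genes)
```

Here last_only holds just 'F01D4.8', while all_genes holds all eight names.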

edited Aug 11, 2016 at 18:34

answered Aug 11, 2016 at 18:20

5 Comments

Sean Sadykoff Over a year ago

In my case I am searching through a large XML file to find gene information, so I actually need to search for "Gene_list" first and then pull each gene out to add it to the list of all genes.

dawg Over a year ago

Have you considered using an XML parser such as ElementTree?

Sean Sadykoff Over a year ago

Yes, I am actually using it. However, the gene information is not stored as a node of its own. It is listed in the node "notes", which has a lot of other information (other than the gene names) that I am not currently concerned about.

Sean Sadykoff Over a year ago

This is more what I was looking for: (?<=[(GENE_LIST):;]\s)([^\s;]+)

dawg Over a year ago

@SeanSadykoff: The regex (?<=[(GENE_LIST):;]\s)([^\s;]+) is not what you think. The [] around (GENE_LIST):; makes a character class that matches any single one of those characters, so in this case it is functionally no different from (?<=[:;]\s)([^\s;]+). If you want to add the string to the lookbehind, you would do something like (?<=(?:^GENE_LIST: )|(?:; ))([^\s;]+), but note that Python's re module rejects that form because its lookbehinds must be fixed-width. In Python, just eliminate the lookbehind and use (?:(?:^GENE_LIST: )|(?:; ))([^\s;]+)
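
For the case discussed in these comments, where the gene list is buried in a larger block of note text, one option is to find the GENE_LIST line first and then split it. This is only a sketch: the sample notes string and the helper name extract_genes are made up for illustration, and it assumes the list sits on a single line and is labelled exactly GENE_LIST:.

```python
import re

def extract_genes(notes):
    """Return the genes listed after 'GENE_LIST:' in notes, or [] if absent."""
    # Assumption: the gene list occupies the rest of one line ('.' stops at a newline).
    match = re.search(r"GENE_LIST:[ \t]*(.*)", notes)
    if match is None:
        return []
    # Split on semicolons and drop surrounding whitespace and empty pieces.
    return [gene.strip() for gene in match.group(1).split(";") if gene.strip()]

# Hypothetical notes text; real notes will look different.
notes = "SOURCE: example\nGENE_LIST: F59A7.7; T25D3.3; cysl-1; F01D4.8\nOTHER: ignored"
genes = extract_genes(notes)
print(genes)
```

Anchoring on the label first keeps the other text in the notes from being mistaken for gene names.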

heemayl · Accepted Answer · 2016-08-11 18:36:35Z

1

You can use the following:

\s([^;\s]+)

The captured group, ([^;\s]+), will contain the desired substrings, each of which is preceded by whitespace (\s).

>>> import re
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

edited Aug 11, 2016 at 18:36

answered Aug 11, 2016 at 18:21

7 Comments

Ruben Pirotte Over a year ago

This will capture the whitespace in front too, so the first gene captured would be ' F59A7.7'.

heemayl Over a year ago

@RubenPirotte I am not sure if there are multiple whitespace characters before each one. Anyway, updated.

Ruben Pirotte Over a year ago

I'm sorry, I think you misunderstood: the \s captures the whitespace, while we only want the gene behind that whitespace.

Ruben Pirotte Over a year ago

So either a trim should be done, or the regex has to be modified.

heemayl Over a year ago

@RubenPirotte Did you check the modified answer?

Ruben Pirotte · Accepted Answer · 2016-08-11 18:27:27Z

0

UPDATE

It's in fact much simpler:

[^\s;]+

However, first take the substring you need (the genes, without the GENE_LIST: label), then apply the regex to that substring.
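
Those two steps could look something like the following. It is a sketch rather than the exact code behind this answer: using str.partition to take the substring is an assumption made here for illustration.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Step 1: keep only the text after the first colon (drops the GENE_LIST label).
_, _, gene_part = s.partition(":")

# Step 2: every run of characters that is neither whitespace nor ';' is a gene.
genes = re.findall(r"[^\s;]+", gene_part)
print(genes)
```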

edited Aug 11, 2016 at 18:27

answered Aug 11, 2016 at 18:17

2 Comments

Sean Sadykoff Over a year ago

Yes, the last gene is not followed by a semicolon

Ruben Pirotte Over a year ago

Is the last gene always preceded by a space?

jcxu · Accepted Answer · 2016-08-13 03:03:05Z

0

import re

string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)

The output is:

['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']

answered Aug 13, 2016 at 3:03
