0

What regular expression can I use to match each individual gene in the gene list string below?

GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8

I tried: GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.
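
For illustration, here is a minimal sketch of the "only the last gene" symptom, assuming Python's `re` module. The pattern is a simplified stand-in for the attempt above, not the original expression: a capturing group under a quantifier is overwritten on every repetition, whereas `re.findall` returns one result per match.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4"

# Simplified stand-in pattern (an assumption, not the attempt quoted above).
# The group sits inside a repeated (?: ... )+, so each repetition overwrites
# the previous capture and only the final one survives.
m = re.match(r"GENE_LIST:(?: ([^\s;]+);?)+", s)
last_only = m.group(1)

# findall() scans for every non-overlapping match and returns them all.
all_genes = re.findall(r"(?<=[:;]\s)([^\s;]+)", s)

print(last_only)
print(all_genes)
```

This is why the answers in this thread reach for `re.findall` or `str.split` rather than a single repeated group.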

2
  • It appears that the genes are separated by semi-colons. You can use this fact to build a regex to meet your requirements. Commented Aug 11, 2016 at 18:15
  • We can help you better if you post a complete python program which you have tried. Commented Aug 11, 2016 at 18:16

4 Answers

1

Given:

>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

You can use Python string methods to do:

>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

For a regex:

(?<=[:;]\s)([^\s;]+)

Or, in Python:

>>> import re
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

5 Comments

In my case I am searching through a large XML file to find gene information, so I actually need to search for "Gene_list" first and then pull each gene out to add it to the list of all genes.
Have you considered using an XML parser such as ElementTree?
Yes, I am actually using it. However, the gene information is not stored as a node of its own; it is listed in the node "notes", which has a lot of other information (other than the gene names) that I am not currently concerned about.
This is more what I was looking for: (?<=[(GENE_LIST):;]\s)([^\s;]+)
@SeanSadykoff: The regex (?<=[(GENE_LIST):;]\s)([^\s;]+) is not what you think. The [] around (GENE_LIST):; makes it a character class, so you are looking for any one of the individual characters of the string GENE_LIST rather than the string itself. In this case that is functionally no different from the regex (?<=[:;]\s)([^\s;]+). If you want to add the string to the lookbehind, you would do something like (?<=(?:^GENE_LIST: )|(?:; ))([^\s;]+), but note that Python's re module rejects it because a lookbehind there must be fixed-width. In Python, just eliminate the lookbehind and use (?:(?:^GENE_LIST: )|(?:; ))([^\s;]+)
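
A rough sketch of the two-step approach discussed in these comments: find the GENE_LIST line inside a larger block of text first, then split it into genes. The `notes` string and the `extract_genes` helper are hypothetical, made up for illustration; the real layout of the "notes" node is not shown in this thread, so the label spelling and one-list-per-line layout are assumptions.

```python
import re

# Hypothetical contents of a "notes" node; only the GENE_LIST line matters.
notes = (
    "SOURCE: example\n"
    "GENE_LIST: F59A7.7; T25D3.3; cysl-1; F01D4.8\n"
    "COMMENT: unrelated; text here\n"
)


def extract_genes(text):
    """Return the genes on the first GENE_LIST line, or [] if there is none."""
    # MULTILINE lets ^ and $ anchor to individual lines within the text.
    m = re.search(r"^GENE_LIST:[ \t]*(.*)$", text, flags=re.MULTILINE)
    if m is None:
        return []
    # Split on semicolons and drop surrounding whitespace and empty pieces.
    return [g.strip() for g in m.group(1).split(";") if g.strip()]


genes = extract_genes(notes)
print(genes)
```

Anchoring on the label first keeps semicolons elsewhere in the notes from being mistaken for gene separators.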
1

You can use the following:

\s([^;\s]+)

  • The captured group, ([^;\s]+), will contain the desired substrings, each of which is preceded by whitespace (\s)

>>> import re
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

7 Comments

This will capture the whitespace in front too, so the first gene capture would be ' F59A7.7'.
@RubenPirotte I am not sure whether there are multiple whitespace characters before each one. Anyway, updated.
I'm sorry, I think you misunderstood: the \s captures the whitespace, while we only want the gene behind that whitespace.
So either a trim should be done, or the regex has to be modified.
@RubenPirotte Did you check the modified answer?
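
To make the whitespace question concrete, here is a small sketch, assuming Python's `re` module and the pattern as it appears in the answer above. The whole match does include the leading whitespace, but the parenthesised group does not, and `re.findall` returns only the group when the pattern has exactly one.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3"

m = re.search(r"\s([^;\s]+)", s)
whole_match = m.group(0)  # includes the whitespace consumed by \s
group_only = m.group(1)   # only the text inside the parentheses

# With exactly one capturing group, findall() returns the group contents.
found = re.findall(r"\s([^;\s]+)", s)

print(repr(whole_match), repr(group_only), found)
```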
0

UPDATE

It's in fact much simpler:

[^\s;]+

However, first take a substring containing only the part you need (the genes, without the GENE_LIST: label); otherwise the label itself is matched as well.
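
A minimal sketch of that idea in Python, assuming the label is separated from the genes by the first colon in the string:

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Keep only the text after the first colon, so the label is dropped.
_, _, gene_part = s.partition(":")

# Every run of characters that is neither whitespace nor ';' is one gene.
genes = re.findall(r"[^\s;]+", gene_part)
print(genes)
```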


2 Comments

Yes, the last gene is not followed by a semicolon
is the last gene always preceded by a space?
0
import re

string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)

The output is:

['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']
