Reading regexes from file, in Python

Question

I am trying to read a bunch of regexes from a file, using Python.

The regexes are in a file regexes.csv, one pair per line, with the two members of each pair separated by a comma, e.g.

<\? xml([^>]*?)>,<\? XML$1>
peter,Peter

I am doing

detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]

This is not producing the right result. If I print detergent, I get

[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]

It seems to be escaping the backslashes.

Moreover, in a file containing, say

<? xml ........>

a call to re.sub(search_term, replace_term, file_content) further down in the script turns it into

<\? XML$1>

So, the $1 is not recovering the first capture group in the first regex of the pair.

What is the proper way to input regexes from a file to be later used in re.sub?

When I've had the regexes inside the script, I wrote them as r'...', but I am not sure what issues arise when reading them from a file.

donkopotamus · Accepted Answer · 2015-11-18 20:23:28Z

2

There are no issues or special requirements for reading regexes from a file. The doubled backslashes are simply how Python displays the repr of a string containing them, and printing a list shows the repr of each element. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it at the interactive prompt and you'll see it displayed the same way:

>>> r"\?"
'\\?'
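As a small illustration of the point above, the sketch below compares the repr of a string with what print shows for it. The variable name rgx follows the answer; everything else is only for demonstration.

```python
# A raw string holding a backslash followed by a question mark.
rgx = r"\?"

# The string really contains two characters: '\' and '?'.
print(len(rgx))

# repr() shows the string as a Python literal, so the backslash is doubled.
print(repr(rgx))

# print() shows the actual characters, with a single backslash.
print(rgx)

# Printing a list uses repr() for each element, which is why the
# question's output showed doubled backslashes.
print([rgx])
```

The string read from the file is therefore the same as the one written with r'...' in the script; only the display differs.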

The reason your $1 is not being replaced is that $1 is not Python's syntax for group references in a replacement string. The correct syntax is \1 (or, equivalently, \g<1>).
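Putting the two points together, here is a minimal sketch of loading search/replace pairs and applying them with \1 as the group reference. The helper names load_pairs and apply_pairs are hypothetical, and sample_lines stands in for the lines of regexes.csv so that the snippet runs on its own; with a real file you would pass the open file object instead.

```python
import re

def load_pairs(lines):
    """Parse 'search,replace' lines into (compiled_pattern, replacement) tuples."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")  # drop only the newline, keep other whitespace
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, as in the question's code.
        search_term, replace_term = line.split(",", 1)
        pairs.append((re.compile(search_term), replace_term))
    return pairs

def apply_pairs(pairs, text):
    """Apply each substitution in order and return the resulting text."""
    for pattern, replacement in pairs:
        text = pattern.sub(replacement, text)
    return text

# Stand-in for the file contents (an assumption for this demo).
# Note \1 rather than $1, and no backslash before ? in the replacement.
sample_lines = [r"<\?xml([^>]*?)>,<?XML\1>" + "\n", "peter,Peter\n"]

pairs = load_pairs(sample_lines)
result = apply_pairs(pairs, '<?xml version="1.0"> peter')
print(result)
```

Splitting on the first comma only works while the search pattern itself contains no comma; a pattern such as a{1,3} would need a different separator or the csv module.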


5 Comments

myfirsttime1 Over a year ago

I see. I think I have been confused about this a couple of times. For some reason regexr.com uses $1 instead of \1. That is where I usually go to learn and test my regexes.

myfirsttime1 Over a year ago

Sorry, but I am seeing an issue independent of the $1 vs \1. When the replacement regex is <\? xml\1>, it is writing <\? xml ...> in the file. That backslash is appearing in the file instead of only the ?.

donkopotamus Over a year ago

@myfirsttime1 ... exactly what are you expecting \? to do in the replacement pattern other than return \? ? Are you wanting to print whether there was a `\` in the input or not? In that case you need to be capturing it in a group as well

myfirsttime1 Over a year ago

Sorry, maybe I am confused about regex. I thought ? needs to be escaped since it is also used to make quantifiers lazy. But maybe it only needs to be escaped when a ? right after a quantifier should mean a literal question mark rather than modify the quantifier.

donkopotamus Over a year ago

@myfirsttime1 A replacement pattern does not have "quantifiers"
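To make the exchange above concrete, this sketch compares a replacement string with and without the backslash before ?. It assumes CPython's re module, where an unknown escape of a non-letter character in a replacement is left as is; the sample text and variable names are made up for the demo.

```python
import re

# In the search pattern, ? must be escaped to match a literal question mark.
pattern = r"<\?xml([^>]*?)>"
text = "<?xml version='1.0'>"

# In the replacement, \? is not a recognised escape, so the backslash
# is copied to the output unchanged.
escaped = re.sub(pattern, r"<\?XML\1>", text)

# Without the backslash, only the question mark is written.
plain = re.sub(pattern, r"<?XML\1>", text)

print(escaped)
print(plain)
```

Only backslashes and group references are special in the replacement, so characters such as ?, * or + can be written there without escaping.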
