I am trying to read a bunch of regexes from a file using Python.

The regexes come in a file regexes.csv, one pair per line, with the search pattern and the replacement separated by a comma, e.g.

<\? xml([^>]*?)>,<\? XML$1>
peter,Peter

I am doing

detergent = []
with open('regexes.csv', 'r') as infile:
    for line in infile:
        line = line.strip()
        # Split on the first comma only; everything after it is the replacement
        search_term, replace_term = line.split(',', 1)
        detergent.append([search_term, replace_term])

This is not producing the result I expect. If I print detergent I get

[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]

It seems to be escaping the backslashes.

Moreover, in a file containing, say

<? xml ........>

a call to re.sub(search_term, replace_term, file_content) further down in the script replaces it with

<\? XML$1>

So, the $1 is not recovering the first capture group in the first regex of the pair.

What is the proper way to input regexes from a file to be later used in re.sub?

When I've had the regexes inside the script I would write them as r'...', but I am not sure what the issues are when reading them from a file.

1 Answer

There are no issues or special requirements for reading regexes from a file. The doubled backslashes are simply how Python displays a string that contains a backslash; the string itself still holds a single one. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it at the interactive prompt and you'll see it displayed the same way:

>>> r"\?"
'\\?'
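
To make the difference concrete, here is a minimal sketch comparing the displayed form with the actual characters. The variable names are only illustrative.

```python
# The pattern as it would arrive from a line of the file:
# one backslash followed by a question mark.
pattern = r"\?"

# repr() is what the interactive prompt and list printing use;
# it shows the backslash doubled.
shown = repr(pattern)

# The string itself still has only two characters.
length = len(pattern)

print(shown)    # '\\?'
print(pattern)  # \?
print(length)   # 2
```

Printing a list calls repr() on each element, which is why the backslashes look doubled in the list output even though the stored strings are unchanged.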

The reason your $1 is not being replaced is that this is not Python's syntax for group references in a replacement string. The correct syntax is \1 (or \g<1>).
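
As a rough sketch of the whole round trip, assuming the file uses \1 in its replacements: the example below substitutes an in-memory io.StringIO for regexes.csv so it runs on its own, and the file contents and sample text are made up for illustration. Note that the replacement column writes ? without a backslash.

```python
import io
import re

# Stand-in for open('regexes.csv'); the contents are hypothetical.
fake_file = io.StringIO("\n".join([
    r"<\?xml([^>]*?)>,<?XML\1>",
    "peter,Peter",
]))

rules = []
for line in fake_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Split on the first comma only; the rest is the replacement.
    search_term, replace_term = line.split(",", 1)
    rules.append((search_term, replace_term))

text = '<?xml version="1.0"?> hello peter'
for search_term, replace_term in rules:
    text = re.sub(search_term, replace_term, text)

print(text)  # <?XML version="1.0"?> hello Peter
```

One caveat with this layout: because the line is split on the first comma, a search pattern that itself contains a comma would be cut in the wrong place.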

5 Comments

I see. I think I have been confused about this a couple of times. For some reason regexr.com uses $1 instead of \1, and that is where I usually go to learn and test my regexes.
Sorry, but I am seeing an issue independent of $1 vs \1. When the replacement is <\? xml\1>, it writes <\? xml ...> to the file. The backslash appears in the file instead of only the ?.
@myfirsttime1 ... exactly what are you expecting \? to do in the replacement pattern other than return \? ? Are you wanting to print whether there was a `\` in the input or not? In that case you need to be capturing it in a group as well
Sorry, maybe I am confused about regex. I thought ? needs to be escaped since it is also used to make quantifiers lazy. But maybe it only needs to be escaped if one needs a ? right after a quantifier to mean a question mark and not modifying the quantifier.
@myfirsttime1 A replacement pattern does not have "quantifiers"
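
The point in these comments can be checked with a short sketch. The sample input is invented; the two replacement strings differ only in the backslash before the ?.

```python
import re

source = "<?xml version='1.0'>"
pattern = r"<\?xml([^>]*?)>"

# In the search pattern, \? means a literal question mark.
# In the replacement, \? is not a recognised escape, so the
# backslash is kept and ends up in the output.
with_backslash = re.sub(pattern, r"<\?XML\1>", source)

# Writing a plain ? in the replacement gives the intended result.
without_backslash = re.sub(pattern, r"<?XML\1>", source)

print(with_backslash)     # <\?XML version='1.0'>
print(without_backslash)  # <?XML version='1.0'>
```

Only the search side is a regular expression; the replacement side is a template in which backslash sequences such as \1 and \g<1> are the special forms.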
