Correctly parsing string literals with python's re module

Question

I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.

For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.

The regex is: (?<!\\)\"[^\"]+\"

Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.

To break it down:

(?<!\\)\"   # The group should begin with a quotation mark that is not escaped
[^\"]+      # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\"          # and end at the first quotation mark it finds

That being said, I (obviously) start hitting problems with things like this:

"This is a string with an \"escaped quote\" inside it"

I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:

([^\"]|\\\")+     # a group of anything but a quote or an escaped quote

, but that led to very strange results.
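For reference, a minimal reproduction of both patterns on the sample string above. It assumes the attempted group simply replaces `[^\"]+` in the full regex, and it uses a non-capturing group `(?:...)` so that `re.findall` returns whole matches rather than group contents; the variable names are only for illustration.

```python
import re

sample = r'"This is a string with an \"escaped quote\" inside it"'

# Original pattern from the question.
original = r'(?<!\\)"[^"]+"'
# Attempted fix, assumed to be dropped into the same surrounding pattern.
attempt = r'(?<!\\)"(?:[^"]|\\")+"'

original_matches = re.findall(original, sample)
attempt_matches = re.findall(attempt, sample)

print(original_matches)
print(attempt_matches)
```

Both patterns stop at the first escaped quote. In the attempted version, `[^"]` is tried first and happily consumes the backslash on its own, so the `\\"` alternative never gets a chance to match the pair.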

I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.

Thanks for any assistance.

EDIT

As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using unutbu's pattern below):

import re

STRLIT = r'''(?x)   # verbose mode
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    .*?        # non-greedy 0-or-more characters
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    '''
with open("jquery.min.js", "r") as f:
    jq = f.read()
literals = re.findall(STRLIT, jq)

The answer below fixes almost all issues. The ones that remain are within jQuery's own regular expressions, which is very much an edge case. The solution no longer misidentifies valid JavaScript as markdown links, which was really the goal.

Is there a reason you're trying to write your own Markdown parser instead of using one that's already debugged?
– kindall, Commented Jan 16, 2013 at 19:45

Eevee · Accepted Answer · 2013-01-16 20:08:56Z

6

I think I first saw this idea in... Jinja2's source code? Later transplanted it to Mako.

r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''

Which does the following:

(\"\"\"|\'\'\'|\"|\') matches a Python opening quote, because this happens to be taken from code for parsing Python. You probably don't need all those quote types.
((?<!\\)\\\1|.) matches: EITHER a matching quote that was escaped ONLY ONCE, OR any other character. So \\" will still be recognized as the end of the string.
*? non-greedily matches as few of those as it can get away with.
And \1 is just the closing quote.
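A quick sketch of the pattern in use, to make the breakdown above concrete. The helper name `literals` and the sample inputs are made up for this example; `finditer` with `group(0)` is used because `findall` would return the capture groups instead of the whole literal.

```python
import re

STRING_RE = re.compile(r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1''')

def literals(text):
    """Return each whole string literal the pattern finds in text."""
    return [m.group(0) for m in STRING_RE.finditer(text)]

# An escaped quote inside the string does not end it.
escaped_quote = literals(r'x = "a \"b\" c"; y = "d"')

# A quote preceded by an escaped backslash (\\") does end it.
escaped_backslash = literals(r'p = "a\\" + "b"')

print(escaped_quote)
print(escaped_backslash)
```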

Alas, \\\" will still incorrectly be detected as the end of the string. (The template engines only use this to check if there is a string, not to extract it.) This is a problem very poorly suited for regular expressions; short of doing insane things in Perl, where you can embed real code inside a regex, I'm not sure it's possible even with PCRE. Though I'd love to be proven wrong. :) The killer is that (?<!...) has to be constant-length, but you want to check that there's an even number of backslashes before the closing quote.
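To see that failure concretely, here is a small sketch; the input string is invented for the example. The literal contains an escaped backslash followed by an escaped quote, so the whole thing is one string, but the pattern cuts it short.

```python
import re

STRING_RE = re.compile(r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1''')

# One literal: a, an escaped backslash, an escaped quote, b.
source = r'"a\\\"b"'

match = STRING_RE.match(source)
found = match.group(0)
print(found)
```

The lookbehind sees a backslash before the third backslash and refuses to treat `\"` as an escape, so the quote is taken as the closing one.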

If you want to get this correct, and not just mostly-correct, you might have to use a real parser. Have a look at parsley, pyparsing, or any of these tools.
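If pulling in a parsing library feels heavy, a hand-written scanner is one alternative. The sketch below is not taken from any of the tools mentioned; `find_string_literals` is a hypothetical helper that only handles double-quoted literals and knows nothing about comments, single quotes, or JavaScript regex literals. It sidesteps the even-number-of-backslashes problem by skipping whatever character follows a backslash.

```python
def find_string_literals(source, quote='"'):
    """Return every quote-delimited literal in source, honouring backslash escapes."""
    literals = []
    i = 0
    n = len(source)
    while i < n:
        if source[i] != quote:
            i += 1
            continue
        start = i
        i += 1
        # Walk to the closing quote; a backslash escapes the next character,
        # so step over both of them together.
        while i < n and source[i] != quote:
            i += 2 if source[i] == "\\" else 1
        if i < n:
            literals.append(source[start:i + 1])
            i += 1
        # Otherwise the literal is unterminated and is dropped.
    return literals


print(find_string_literals(r'a = "x\\\"y"; b = "z\\"'))
```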

edit: By the way, there's no need to check that the opening quote doesn't have a backslash before it. That's not valid syntax outside a string in JS (or Python).

answered Jan 16, 2013 at 20:08

Eevee


1 Comment

mechatroner Over a year ago

I think adding (\\\\)* before \\\1 should solve the even-number-of-backslashes problem, so the full regex will be: r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)(\\\\)*\\\1|.)*?\1'''
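A quick check of that suggestion on two inputs, as a sketch rather than a proof that it covers every case. The names `PATCHED_RE`, `odd` and `even` are made up for the example.

```python
import re

PATCHED_RE = re.compile(r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)(\\\\)*\\\1|.)*?\1''')

# Three backslashes before the quote: the quote is escaped, the string continues.
odd = PATCHED_RE.match(r'"a\\\"b"').group(0)

# Two backslashes before the quote: the quote closes the string.
even = PATCHED_RE.match(r'"a\\" + "b"').group(0)

print(odd)
print(even)
```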

unutbu · Accepted Answer · 2013-01-16 20:13:38Z

5

Perhaps use two negative look behinds:

import re

text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r)    :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''

for match in (re.findall(r'''(?x)   # verbose mode
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    .*?        # non-greedy 0-or-more characters
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    ''', text)):
    print(match)

yields

"This is a string with an \"escaped quote\" inside it"
""
"data"

The question mark in .*? makes the pattern non-greedy. The non-greediness causes the pattern to stop matching at the first unescaped double quotation mark it encounters.

edited Jan 16, 2013 at 20:13

answered Jan 16, 2013 at 19:46

unutbu


7 Comments

Thomas Thorogood Over a year ago

While that does work in the simple example, it's still giving me troubles. When getting string literals in the raw jquery.min.js, for instance, it's finding most of the script as a string literal, which it's clearly not (I'm using that as a test for the regex).

unutbu Over a year ago

Are you using the non-greedy pattern I just posted?

Thomas Thorogood Over a year ago

yes, I copy-pasted that pattern directly (though I didn't know you could use that method of non-greediness!)

Thomas Thorogood Over a year ago

I edited my answer with more information about how I'm testing, if it helps at all.

unutbu Over a year ago

Can you post a (hopefully short) example of text on which the pattern is failing?


zcb · Accepted Answer · 2016-06-21 03:58:50Z

0

Using Python, the correct regex for matching a double-quoted string is:

pattern = r'"(\\.|[^"])*"'

It describes a string that starts and ends with ". Each item between the two double quotes is either an escaped character (a backslash followed by any character) OR any character except ".

unutbu's answer is wrong because the valid string "\\\\" cannot be matched by that pattern.
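A side-by-side sketch of the two approaches on that input. The escape-aware pattern is spelled here with a non-capturing group, so `re.findall` returns whole matches, and with the backslash excluded from the character class; both are choices made for this example, and the sample text is invented.

```python
import re

# Two-lookbehind pattern from unutbu's answer, without verbose mode.
LOOKBEHIND = r'(?<!\\)".*?(?<!\\)"'
# Escape-aware pattern: a backslash plus any character, or a plain character.
ESCAPE_AWARE = r'"(?:\\.|[^"\\])*"'

# First literal holds two escaped backslashes, so it ends right after them.
source = r'a = "\\\\"; b = "x"'

lookbehind_matches = re.findall(LOOKBEHIND, source)
escape_aware_matches = re.findall(ESCAPE_AWARE, source)

print(lookbehind_matches)
print(escape_aware_matches)
```

The lookbehind version rejects the real closing quote because a backslash precedes it, then runs on into the next literal.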

answered Jun 21, 2016 at 3:58

zcb
