Regex version of python string [duplicate]

Question

I was wondering what regular expression can be used to parse a Python string literal. After several failed attempts I arrived at a regex that can parse one of the most common string formats, for example:

"this is \"my string\", which ends here"

This is my regex-"code":

"([^"\\]|(\\.))*"

I asked this question because I couldn't find anything like it on the Internet. Can I work with that expression and "develop" it to parse all kinds of Python strings? If you find this question interesting, I recommend an online regex tester, where you can check your expressions very quickly.
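For readers who want to try the pattern from the question directly in Python, here is a minimal sketch. The names `STRING_RE` and `sample` are only illustrative, and the pattern is copied unchanged from the question, so it covers double-quoted literals only.

```python
import re

# The pattern from the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
STRING_RE = re.compile(r'"([^"\\]|(\\.))*"')

# Illustrative input: one double-quoted literal with escaped quotes inside.
sample = r'x = "this is \"my string\", which ends here" + y'

match = STRING_RE.search(sample)
print(match.group(0))
```

The match spans the whole literal, including the escaped inner quotes. Single-quoted, triple-quoted, and prefixed strings are not matched by this pattern.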

Don't forget that you need to handle prefixes also (unicode/raw strings). For example, u"ª unicode string", r"\I have 3 literal backslashes\\", UR'unícode and no\e\s\c\a\p\e characters'. Also delimiters: '/"/'''/""". And even though you can't escape delimiters in a raw string, you still can't end a raw string with a backslash. Lots of edge cases involved in doing what you want.
– GVH, Commented Feb 12, 2014 at 20:42
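Some of these edge cases can be checked against the interpreter itself. The sketch below assumes Python 3 and uses a hypothetical helper, `is_valid_literal`, which is not part of the original discussion. It asks `ast.literal_eval` whether a piece of source text parses as a literal.

```python
import ast

def is_valid_literal(source):
    """Hypothetical helper: True if `source` parses as a Python literal."""
    try:
        ast.literal_eval(source)
    except (SyntaxError, ValueError):
        return False
    return True

# A raw string may end with an even number of backslashes...
print(is_valid_literal('r"ends with two backslashes\\\\"'))
# ...but not with a single one, because it would swallow the closing quote.
print(is_valid_literal('r"ends with one backslash\\"'))
# Triple-quoted strings may span several lines.
print(is_valid_literal("'''triple\nquoted'''"))
# The u prefix is accepted on Python 3, the combined ur prefix is not.
print(is_valid_literal('u"unicode"'))
print(is_valid_literal('ur"raw unicode"'))
```

On Python 3 the combined `ur` prefix from the comment is rejected, so a regex aimed at Python 3 source would need a different prefix list than one aimed at Python 2.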

Casimir et Hippolyte · Accepted Answer · 2014-02-12 20:12:54Z

2

Your regex pattern (and the one in @thebjorn's link) will fail if there is an odd number (greater than 1) of backslashes before the quote. I suggest using this pattern (in singleline mode):

"(?:[^"\\]|\\{2}|\\.)*"

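An optimised version, which uses a lookahead plus a backreference to emulate an atomic group and so limits backtracking: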

"(?:(?=([^"\\]+|\\{2}|\\.))\1)*"

to deal with single quotes too:

(["'])(?:[^"'\\]|\\{2}|\\.|(?!\1)["'])*\1

or

(["'])(?:(?=([^"'\\]+|\\{2}|\\.|(?!\1)["']))\2)*\1

(Note that the last characters of the four patterns fall exactly on the same vertical line. A sign?)
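As a rough illustration, the third pattern above can be compiled in Python with `re.DOTALL`, which is Python's name for singleline mode. The names `QUOTED_RE` and `text` and the sample input are assumptions made for this sketch.

```python
import re

# Third pattern from this answer: group 1 captures the opening quote and
# \1 requires the same quote character to close the string.
QUOTED_RE = re.compile(
    r'''(["'])(?:[^"'\\]|\\{2}|\\.|(?!\1)["'])*\1''',
    re.DOTALL,  # singleline mode: "." also matches a newline
)

# Illustrative input with one double-quoted and one single-quoted string.
text = r'''say "it's \"fine\"" and 'a "b" c' end'''

# finditer plus group(0) is used because findall would return only
# the captured opening quote.
matches = [m.group(0) for m in QUOTED_RE.finditer(text)]
print(matches)
```

Both literals are found, and the quote of the other kind inside each one is treated as ordinary content.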

edited Feb 12, 2014 at 20:12

answered Feb 12, 2014 at 20:01

Casimir et Hippolyte


unutbu · Accepted Answer · 2014-02-12 20:38:45Z

2

Here is a different way, which uses tokenize.generate_tokens to identify Python strings. The tokenize module uses regex; so by using tokenize you leave the complex dirty work to Python itself. By using higher-level functions you can be more confident the regex is correct (and avoid reinventing the wheel). Moreover, this will correctly identify Python strings of all sorts (e.g. strings of the single quoted, double quoted, and triple quoted varieties) without being confused by comments.

import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        return token.tok_name[self.num]

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

# io.StringIO (not io.BytesIO): on Python 3, generate_tokens needs a readline that returns str
for tok in tokenize.generate_tokens(io.StringIO(text).readline):
    tok = Token(*tok)            # 1
    if tok.name == 'STRING':     # 2
        print(tok.val)

1. tokenize.generate_tokens returns tuples. The Token class allows you to access the information in the tuple in a nicer way.
2. In particular, each Token has a name, such as 'STRING', 'NEWLINE', 'INDENT', or 'OP'. You can use this to identify Python strings.

Edit: I like using the Token class so I don't have to write token.tok_name[num] in lots of places. However, for the code above, it might be clearer and easier to forget about the Token class and just write the main idea explicitly:

import tokenize
import token
import io

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

# io.StringIO (not io.BytesIO): on Python 3, generate_tokens needs a readline that returns str
for num, val, start, end, line in tokenize.generate_tokens(io.StringIO(text).readline):
    if token.tok_name[num] == 'STRING':
        print(val)
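The claim about comments can be checked with a small sketch. It assumes Python 3, where each token is a named tuple with `type`, `string`, and `start` attributes. The helper `find_string_literals` and the sample source are hypothetical names introduced for illustration.

```python
import io
import token
import tokenize

def find_string_literals(source):
    """Hypothetical helper: return (literal, (row, col)) for each STRING token."""
    readline = io.StringIO(source).readline
    return [
        (tok.string, tok.start)
        for tok in tokenize.generate_tokens(readline)
        if tok.type == token.STRING
    ]

# The first line has a "#" inside a string and a quoted phrase inside a comment.
source = 'x = "a # not a comment"  # "not a string"\ny = \'b\'\n'

for literal, start in find_string_literals(source):
    print(start, literal)
```

Only the two real literals are reported; the quoted text inside the comment is skipped. On Python 3.12 and later, f-strings are tokenized as FSTRING_START, FSTRING_MIDDLE, and FSTRING_END rather than STRING, so this helper would not report them.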

edited Feb 12, 2014 at 20:38

answered Feb 12, 2014 at 20:20

unutbu


6 Comments

Grijesh Chauhan Over a year ago

It looks very technical and excellent, but it is indeed very complex for someone new like me. You say of Token(*tok) that "the Token class allows you to access the information in the tuple in a nicer way", which is not clear to me. I have bookmarked this answer to explore further. Thanks for adding this answer.

Jon Clements Over a year ago

Why the Token class? just Token = collections.namedtuple('Token', 'num val start end line') and then tok = Token._make(tok) will do... (unless I'm missing something)

unutbu Over a year ago

@JonClements: In longer code, I don't like writing token.tok_name[num] all over the place. Replacing that with a name attribute is the purpose of the Token class.

Jon Clements Over a year ago

@unutbu my apologies - I misread the intent of what name was doing there... - ignore me :)

unutbu Over a year ago

@JonClements: No problem; thanks for prompting me to explain, since I forget that it's 10x easier to write code than understand other people's code. :)


georg · Accepted Answer · 2014-02-12 22:00:40Z

1

This seems to handle everything correctly:

rr = r'''(?xi)
        (r|u|ru|ur|)
        (
            ''\' (\\. | [\s\S])*? ''\'
            |
            """ (\\. | [\s\S])*? """
            |
            ' (\\. | [^'\n])* '
            |
            " (\\. | [^"\n])* "
        )
'''

Test: https://ideone.com/DEimLl

Syntax reference: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
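For readers who cannot open the test link, here is a minimal usage sketch. The pattern is copied unchanged from this answer, while `STRING_RE`, `source`, and `literals` are illustrative names. The prefix group covers only the r/u combinations, so b and f prefixes would have to be added for current Python 3 code.

```python
import re

# Pattern copied from the answer: verbose (x) and case-insensitive (i) flags.
rr = r'''(?xi)
        (r|u|ru|ur|)
        (
            ''\' (\\. | [\s\S])*? ''\'
            |
            """ (\\. | [\s\S])*? """
            |
            ' (\\. | [^'\n])* '
            |
            " (\\. | [^"\n])* "
        )
'''
STRING_RE = re.compile(rr)

# Illustrative source text with a prefixed, a raw, and a triple-quoted string.
source = (
    'a = u"x\\"y"\n'
    "b = r'z'\n"
    'c = """two\nlines"""\n'
)

# group(0) is the whole literal, including any prefix.
literals = [m.group(0) for m in STRING_RE.finditer(source)]
print(literals)
```

Unlike the tokenizer-based approach, a plain regex scan like this also matches quoted text inside comments, so it fits best where the input is known to contain only code.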

edited Feb 12, 2014 at 22:00

answered Feb 12, 2014 at 21:54

georg

