
I was wondering what regular expression you can use to parse a Python string literal. After several failed attempts I arrived at a regex that can parse one of the most commonly used string formats, such as

"this is \"my string\", which ends here"

This is my regex:

"([^"\\]|(\\.))*"

I asked this question because I couldn't find anything like it on the Internet. Can I work with this expression and extend it to parse all kinds of Python strings? If you find this question interesting and I recommend you, where you can check your expressions very quickly.
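For anyone who wants to try the pattern from the question, here is a minimal sketch using Python's re module. The names PATTERN and sample are only illustrative, and the text around the example string is an assumption added so that the search has something to skip over.

```python
import re

# The pattern from the question, written as a raw string so that the
# backslashes reach the regex engine unchanged.
PATTERN = re.compile(r'"([^"\\]|(\\.))*"')

# Illustrative input: the example string embedded in some other text.
sample = r'say "this is \"my string\", which ends here" and stop'

match = PATTERN.search(sample)
print(match.group(0))
```

The match starts at the first double quote and stops at the first double quote that is not preceded by an escaping backslash.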

  • stackoverflow.com/questions/14366401/… Commented Feb 12, 2014 at 19:58
  • Don't forget that you need to handle prefixes also (unicode/raw strings.) For example, u"ª unicode string", r"\I have 3 literal backslashes\\", UR'unícode and no\e\s\c\a\p\e characters'. Also delimiters - '/"/'''/""". And even though you can't escape delimiters in a raw string, you still can't end a raw string with a backslash. Lots of edge cases involved in doing what you want. Commented Feb 12, 2014 at 20:42

3 Answers


Your regex pattern (and the one in @thebjorn's link) will fail if there is an odd number (greater than 1) of backslashes before the quote. I suggest using this pattern (with singleline mode):

"(?:[^"\\]|\\{2}|\\.)*"

An optimised version:

"(?:(?=([^"\\]+|\\{2}|\\.))\1)*"

to deal with single quotes too:

(["'])(?:[^"'\\]|\\{2}|\\.|(?!\1)["'])*\1

or

(["'])(?:(?=([^"'\\]+|\\{2}|\\.|(?!\1)["']))\2)*\1

(note that the last characters of the four patterns line up exactly, a sign?)
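A quick way to check the pattern that handles both quote styles is to run it through Python's re module with re.DOTALL, which is Python's name for singleline mode. This is only a sketch: QUOTED and first_literal are hypothetical names chosen for the example, and the inputs are made up.

```python
import re

# Pattern for single- or double-quoted strings, copied from the answer.
# re.DOTALL lets "." match a newline as well.
QUOTED = re.compile(r'''(["'])(?:[^"'\\]|\\{2}|\\.|(?!\1)["'])*\1''', re.DOTALL)

def first_literal(text):
    """Return the first quoted string found in text, or None."""
    m = QUOTED.search(text)
    return m.group(0) if m else None

print(first_literal(r'a "one \" two" b'))     # escaped double quote inside
print(first_literal(r"a 'it\'s' b"))          # escaped single quote inside
print(first_literal(r'''a "it's" b'''))       # other quote type inside
print(first_literal('a "no closing quote'))   # unterminated, so no match
```

The backreference \1 makes the closing quote match whatever quote opened the string, and (?!\1)["'] lets the other quote character appear inside it.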


Here is a different way, which uses tokenize.generate_tokens to identify Python strings. The tokenize module uses regex; so by using tokenize you leave the complex dirty work to Python itself. By using higher-level functions you can be more confident the regex is correct (and avoid reinventing the wheel). Moreover, this will correctly identify Python strings of all sorts (e.g. strings of the single quoted, double quoted, and triple quoted varieties) without being confused by comments.

import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        return token.tok_name[self.num]

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

# On Python 3, generate_tokens needs a readline that returns str, hence StringIO.
for tok in tokenize.generate_tokens(io.StringIO(text).readline):
    tok = Token(*tok)            # 1
    if tok.name == 'STRING':     # 2
        print(tok.val)
  1. tokenize.generate_tokens returns tuples. The Token class allows you to access the information in the tuple in a nicer way.
  2. In particular, each Token has a name, such as 'STRING', 'NEWLINE', 'INDENT', or 'OP'. You can use this to identify Python strings.

Edit: I like using the Token class so I don't have to write token.tok_name[num] in lots of places. However, for the code above, it might be clearer and easier to forget about the Token class and just write the main idea explicitly:

import tokenize
import token
import io

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

# On Python 3, generate_tokens needs a readline that returns str, hence StringIO.
for num, val, start, end, line in tokenize.generate_tokens(io.StringIO(text).readline):
    if token.tok_name[num] == 'STRING':
        print(val)
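The same idea can be wrapped in a small helper that collects every string literal in a piece of source code. This is a sketch that assumes Python 3, where generate_tokens yields named tuples with type and string fields. The function name string_literals and the sample source are made up for illustration. f-strings are left out on purpose, since Python 3.12 and later tokenize them differently.

```python
import io
import tokenize

def string_literals(source):
    """Return the source text of every string literal found in source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.STRING]

# Illustrative input: a comment containing quotes, a raw-string prefix,
# and a triple-quoted string that spans two lines.
source = (
    'x = "double"  # "not a string", just a comment\n'
    "y = 'single' + r'\\d+'\n"
    'z = """triple\nquoted"""\n'
)

for literal in string_literals(source):
    print(repr(literal))
```

The quoted text inside the comment is reported as part of a COMMENT token, so it does not show up in the result.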

6 Comments

It looks very technical and excellent, but it is very complex for a newcomer like me. About Token(*tok), you say "The Token class allows you to access the information in the tuple in a nicer way." That is not clear to me. I have bookmarked this answer to explore further. Thanks for adding this answer.
Why the Token class? just Token = collections.namedtuple('Token', 'num val start end line') and then tok = Token._make(tok) will do... (unless I'm missing something)
@JonClements: In longer code, I don't like writing token.tok_name[num] all over the place. Replacing that with a name attribute is the purpose of the Token class.
@unutbu my apologies - I misread the intent of what name was doing there... - ignore me :)
@JonClements: No problem; thanks for prompting me to explain, since I forget that it's 10x easier to write code than understand other people's code. :)

This seems to handle everything correctly:

rr = r'''(?xi)
        (r|u|ru|ur|)
        (
            ''\' (\\. | [\s\S])*? ''\'
            |
            """ (\\. | [\s\S])*? """
            |
            ' (\\. | [^'\n])* '
            |
            " (\\. | [^"\n])* "
        )
'''

Test: https://ideone.com/DEimLl

Syntax reference: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
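In case the test link stops working, here is a small local check of the pattern. It is a sketch only: STRING_RE and the list of sample literals are names and inputs invented for this example. Reading the pattern, it only knows the r and u prefixes, so other prefixes such as b would not be captured as part of a match.

```python
import re

# The pattern from the answer, copied unchanged and compiled.
STRING_RE = re.compile(r'''(?xi)
        (r|u|ru|ur|)
        (
            ''\' (\\. | [\s\S])*? ''\'
            |
            """ (\\. | [\s\S])*? """
            |
            ' (\\. | [^'\n])* '
            |
            " (\\. | [^"\n])* "
        )
''')

# Illustrative literals: a unicode prefix, a raw string, a triple-quoted
# string spanning two lines, and an escaped quote.
literals = ['u"abc"', "r'\\d+'", '"""two\nlines"""', "'it\\'s'"]
source = " + ".join(literals)

found = [m.group(0) for m in STRING_RE.finditer(source)]
print(found == literals)
```

The triple-quoted alternatives come first in the pattern so that """ is not mistaken for an empty "" string followed by another quote.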
