Here is a different way, which uses tokenize.generate_tokens to identify Python strings. The tokenize module itself uses regexes under the hood, so by using it you leave the complex, dirty work to Python. Relying on these higher-level functions lets you be more confident the parsing is correct (and avoids reinventing the wheel). Moreover, this correctly identifies Python strings of all sorts (single quoted, double quoted, and triple quoted) without being confused by comments.
import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        # Map the numeric token type to its name, e.g. 'STRING' or 'OP'
        return token.tok_name[self.num]

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

# generate_tokens expects a readline that returns str, hence StringIO
for tok in tokenize.generate_tokens(io.StringIO(text).readline):
    tok = Token(*tok)         # 1
    if tok.name == 'STRING':  # 2
        print(tok.val)
1. tokenize.generate_tokens returns tuples. The Token class allows you to access the information in each tuple in a nicer way.
2. In particular, each Token has a name, such as 'STRING', 'NEWLINE', 'INDENT', or 'OP'. You can use this to identify Python strings.
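To see the claim about comments and triple-quoted strings in action, here is a quick check reusing the Token class from above (the source snippet is a made-up example of mine):

source = '''a = 'single'  # a comment with a "quote" in it
b = "double"
c = """triple
quoted"""
'''

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    tok = Token(*tok)
    if tok.name == 'STRING':
        print(tok.val)

This prints the three string literals and nothing else; the quoted text inside the comment arrives as a COMMENT token, so it is skipped.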
Edit: I like using the Token class so I don't have to write token.tok_name[num] in lots of places. However, for the code above, it might be clearer and easier to forget about the Token class and just write the main idea explicitly:
import tokenize
import token
import io

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

for num, val, start, end, line in tokenize.generate_tokens(io.StringIO(text).readline):
    if token.tok_name[num] == 'STRING':
        print(val)
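As a minor variation (not part of the original answer), on a modern Python 3, where generate_tokens yields TokenInfo namedtuples, you can skip the name lookup entirely and compare the numeric token type against the constants in the token module:

import tokenize
import token
import io

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''

for tok in tokenize.generate_tokens(io.StringIO(text).readline):
    if tok.type == token.STRING:  # numeric comparison instead of tok_name lookup
        print(tok.string)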