Escaping regex unicode string in Python

Question

I have a user-defined string that I want to use in a regex, with one small improvement: the search should match any of three apostrophes instead of just one. For example:

APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])

It works well for Latin text, but for Unicode the list comprehension gives the following string: "[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"

It looks like it finds the backslashes in both strings and then substitutes APOSTROPHES for each of them.

Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].

How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"

Why not just replace "'" with ['\\u2019\\u02bc] after applying re.escape?
– Wiktor Stribiżew, Commented Nov 16, 2016 at 7:49

Laurent LAPORTE · Accepted Answer · 2016-11-16 10:08:59Z


As I understand it, you want to create a regular expression that matches a given word written with any of the apostrophes.

A regex that matches any apostrophe can be defined as a character class:

APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'

For instance, you have this (Ukrainian?) word which contains a single quote:

word = "п'ять"

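EDIT: If your word contains another kind of apostrophe, you can normalize it first, like this (the replacement is a plain "'", because a replacement of r"\'" would leave a literal backslash in the word):

word = re.sub(APOSTROPHES_REGEX, "'", word, flags=re.UNICODE)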
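
As a small illustration of the normalization step, the sketch below maps any of the three apostrophes to the plain ASCII one. The helper name normalize_apostrophes is made up for this example and is not part of the answer; it assumes Python 3, where str patterns are Unicode-aware by default.

```python
import re

# Character class from the answer: ASCII apostrophe, U+2019, U+02BC.
APOSTROPHES_REGEX = r"['\u2019\u02bc]"


def normalize_apostrophes(word):
    """Replace every apostrophe variant with the plain ASCII apostrophe.

    Hypothetical helper, not part of the original answer.
    """
    # The replacement is a plain "'" (no backslash), so no stray
    # backslash ends up in the normalized word.
    return re.sub(APOSTROPHES_REGEX, "'", word)


normalized = normalize_apostrophes("п\u2019ять")
print(normalized == "п'ять")
```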

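To build the regex, escape the word first, since it may contain characters that are special in a regex, such as punctuation. Before Python 3.7, re.escape turns the single quote "'" into an escaped single quote, r"\'"; from Python 3.7 on, it leaves "'" unchanged.

You can then replace the (possibly escaped) single quote with your apostrophe regex:

import re
word_regex = re.escape(word)
# re.escape("'") is r"\'" before Python 3.7 and "'" from 3.7 on,
# so this replacement works on either version.
word_regex = word_regex.replace(re.escape("'"), APOSTROPHES_REGEX)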

The new RegEx can then be used to match the same word with any apostrophe:

assert re.match(word_regex, "п'ять")  # '
assert re.match(word_regex, "п’ять")  # \u2019
assert re.match(word_regex, "пʼять")  # \u02bc
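
The steps above can be put together in one self-contained sketch. The function name apostrophe_insensitive_regex is an assumption made for illustration, and the word is assumed to already use the plain ASCII apostrophe. The sketch also uses re.fullmatch for whole-word checks, because re.match only anchors at the start of the string.

```python
import re

APOSTROPHES_REGEX = r"['\u2019\u02bc]"


def apostrophe_insensitive_regex(word):
    """Build a pattern matching `word` with any of the three apostrophes.

    Hypothetical helper; assumes `word` uses the plain ASCII apostrophe.
    """
    escaped = re.escape(word)
    # re.escape("'") yields r"\'" before Python 3.7 and "'" afterwards,
    # so the lookup below adapts to the running version.
    return escaped.replace(re.escape("'"), APOSTROPHES_REGEX)


pattern = re.compile(apostrophe_insensitive_regex("п'ять"))

for candidate in ("п'ять", "п\u2019ять", "п\u02bcять"):
    assert pattern.fullmatch(candidate)

# re.match only anchors at the start, so it also accepts longer strings;
# fullmatch rejects them.
assert pattern.match("п'ятьох")
assert pattern.fullmatch("п'ятьох") is None
```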

Note: on Python 2, don’t forget to use the re.UNICODE flag; it affects some regex character classes such as r"\w". On Python 3, str patterns are Unicode-aware by default, so the flag is redundant there.
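
To see what Unicode awareness changes for r"\w", here is a short sketch assuming Python 3. It compares the default behaviour with re.ASCII, which restricts \w to [a-zA-Z0-9_]. The variable names are chosen only for this example.

```python
import re

word = "п'ять"

# On Python 3, str patterns are Unicode-aware by default, so \w matches
# Cyrillic letters even without re.UNICODE. The apostrophe is not a word
# character, so the word splits into two tokens.
unicode_tokens = re.findall(r"\w+", word)
print(unicode_tokens)  # ['п', 'ять']

# With re.ASCII, \w only covers [a-zA-Z0-9_], so nothing matches here.
ascii_tokens = re.findall(r"\w+", word, flags=re.ASCII)
print(ascii_tokens)  # []
```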

edited Nov 16, 2016 at 10:08

answered Nov 16, 2016 at 8:03


2 Comments

Paul R Over a year ago

This works only when user enters п'ять, it doesn't work when user enters п’ять.

Laurent LAPORTE Over a year ago

@AndrewFount: OK, so you can "normalize" the word before escaping.
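
For the case raised in these comments, where the user may type any of the three apostrophes, one alternative to normalizing is to build the pattern character by character. The sketch below tests each character against the unescaped apostrophes and escapes everything else individually. The names APOSTROPHES_CLASS and regex_for_any_apostrophe are hypothetical.

```python
import re

APOSTROPHES = "'\u2019\u02bc"
# Inside a character class none of these characters is special,
# so the class can be built from the raw characters.
APOSTROPHES_CLASS = "[" + APOSTROPHES + "]"


def regex_for_any_apostrophe(word):
    """Hypothetical helper: works whichever apostrophe the user typed."""
    return "".join(
        # Compare against the raw apostrophes, not an escaped string,
        # and escape every other character on its own.
        APOSTROPHES_CLASS if ch in APOSTROPHES else re.escape(ch)
        for ch in word
    )


# The user typed the typographic apostrophe U+2019 here.
pattern = re.compile(regex_for_any_apostrophe("п\u2019ять"))
print(bool(pattern.fullmatch("п'ять")))
```

Because the membership test runs on single characters of the original word, backslashes added by re.escape never take part in the comparison.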
