I have a user-defined string. I want to use it in a regex with one small improvement: match any of three apostrophe characters wherever the string contains one. For example:

APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])

It works well for Latin text, but for Unicode text the list comprehension gives the following string: "[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"

It looks like it finds the backslashes in both strings and then substitutes APOSTROPHES for each of them.

Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].

How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"

  • Why not just replace "'" with ['\\u2019\\u02bc] after applying re.escape? Commented Nov 16, 2016 at 7:49

1 Answer

As I understand it, you want to create a regular expression that matches a given word written with any kind of apostrophe.

A RegEx that matches any apostrophe can be defined as a character class:

APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'

For instance, you have this (Ukrainian?) word which contains a single quote:

word = "п'ять"

EDIT: If your word contains another kind of apostrophe, you can normalize it, like this:

word = re.sub(APOSTROPHES_REGEX, "'", word, flags=re.UNICODE)
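
The normalization step can be checked on its own. The following is a minimal sketch, assuming Python 3 (where string literals are Unicode); the helper name normalize_apostrophes is hypothetical. It uses a plain "'" as the replacement text, because re.sub keeps a backslash that precedes a non-letter in the replacement string.

```python
import re

# Character class matching the three apostrophe variants
APOSTROPHES_REGEX = r"[\'\u2019\u02bc]"


def normalize_apostrophes(word):
    """Map every apostrophe variant to the plain ASCII single quote."""
    return re.sub(APOSTROPHES_REGEX, "'", word)


# The same word typed with each of the three apostrophes
variants = ["п'ять", "п\u2019ять", "п\u02bcять"]
normalized = [normalize_apostrophes(w) for w in variants]
```

After this step every variant is the same string, so the rest of the answer only has to deal with the ASCII quote.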

To build the RegEx, escape the word with re.escape, because it may contain characters that are special in a RegEx, such as punctuation. Before Python 3.7, re.escape turns the single quote "'" into an escaped single quote, r"\'"; from Python 3.7 on, it leaves the quote unescaped.

You can then replace that quote with your apostrophe RegEx:

import re
word_regex = re.escape(word)
# re.escape("'") is r"\'" before Python 3.7 and "'" from 3.7 on, so this works on both
word_regex = word_regex.replace(re.escape("'"), APOSTROPHES_REGEX)

The new RegEx can then be used to match the same word with any apostrophe:

assert re.match(word_regex, "п'ять")  # '
assert re.match(word_regex, "п’ять")  # \u2019
assert re.match(word_regex, "пʼять")  # \u02bc
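
An alternative that does not depend on how re.escape treats the quote, and that needs no normalization, is to build the pattern one character at a time. This is only a sketch under the assumption of Python 3; the function name apostrophe_insensitive_pattern and the constant names are hypothetical.

```python
import re

APOSTROPHES = "'\u2019\u02bc"
# Character class that matches any one of the three apostrophes
APOSTROPHE_CLASS = "[" + re.escape(APOSTROPHES) + "]"


def apostrophe_insensitive_pattern(word):
    """Escape the word, but let every apostrophe match all three variants."""
    parts = []
    for ch in word:
        if ch in APOSTROPHES:
            parts.append(APOSTROPHE_CLASS)
        else:
            # Escape each remaining character individually
            parts.append(re.escape(ch))
    return "".join(parts)


# The user typed the word with U+2019; the pattern still accepts all variants
pattern = re.compile(apostrophe_insensitive_pattern("п\u2019ять"))
matches = [bool(pattern.fullmatch(s))
           for s in ("п'ять", "п\u2019ять", "п\u02bcять")]
```

Because the apostrophe check runs on the raw word before any escaping, the result is the same whichever apostrophe the user typed.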

Note: don’t forget the re.UNICODE flag; it affects some RegEx character classes such as r"\w". On Python 3 it is already the default for str patterns.
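
To illustrate what Unicode matching changes for r"\w", here is a small sketch, assuming Python 3, where Unicode matching is the default for str patterns and re.ASCII switches it off. The variable names are only for illustration.

```python
import re

word = "п'ять"

# Default on Python 3: \w matches Cyrillic letters, so the word splits at the quote
unicode_tokens = re.findall(r"\w+", word)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so nothing in this word qualifies
ascii_tokens = re.findall(r"\w+", word, flags=re.ASCII)
```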

2 Comments

This works only when user enters п'ять, it doesn't work when user enters п’ять.
@AndrewFount: OK, so you can "normalize" the word before escaping.
