Using unicode char code in regular expression

Question

Simplifying my task, let's say I want to find any words written in Hebrew on some web page. I know that the Hebrew letters' code points run from U+05D0 to U+05EA. I want to write something like:

import re
import urllib2

expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"

website_handle = urllib2.urlopen(url)
website_text = website_handle.read()
matches = re.findall(expr, website_text)
for item in matches:
    print item

The output I would expect is:

עברית

But instead the output is a lot of Chinese/Japanese characters.

@stribizhev It won't find anything. Maybe I should use HTML codes instead?
– Sanich, Commented Sep 15, 2015 at 17:03

Kasravnd · Accepted Answer · 2015-09-15 16:41:47Z

1

You can just use Python's standard Unicode escapes within a character class:

re.findall(u"[\u05D0-\u05EA]", website_text, re.U)

answered Sep 15, 2015 at 16:41

Kasravnd


4 Comments

Sanich Over a year ago

It won't find anything. Maybe I should use HTML codes instead? &#1488; to &#1514;

Kasravnd Over a year ago

@Sanich No, that doesn't work in Python. Maybe you need to decode your text; it depends on what your text actually contains.

Sanich Over a year ago

The text is an HTML web page.

Kasravnd Over a year ago

@Sanich Can you add some sample data to your question, along with your expected output?
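
On the HTML-codes idea raised in these comments: if a page happened to write its Hebrew as numeric character references such as &#1488; rather than as literal characters, a Unicode character class would not match until those references were turned back into characters. The following is only a sketch under that assumption; the `escaped` sample string is made up for illustration and is not taken from the page in the question.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape

# Unicode pattern for one or more Hebrew letters (alef through tav).
HEBREW_WORD = re.compile(u"[\u05D0-\u05EA]+")

# Hypothetical markup in which the Hebrew word is written as numeric references.
escaped = u"<li>&#1506;&#1489;&#1512;&#1497;&#1514;</li>"

# The raw markup contains only ASCII, so the character class finds nothing.
before = HEBREW_WORD.findall(escaped)

# After unescaping, the references become real Hebrew code points.
after = HEBREW_WORD.findall(unescape(escaped))
```

Here `before` is empty and `after` holds the single Hebrew word, so unescaping is a preprocessing step rather than a replacement for the Unicode pattern.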

Sanich · Accepted Answer · 2015-09-15 18:20:24Z

0

The expression should be:

expr = u"[\u05D0-\u05EA]+"

Notice the 'u' at the beginning.
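
The reason the prefix matters: in Python 2 a plain "..." literal is a byte string, and \u05D0 is not an escape sequence there, so the pattern never contains the Hebrew code points. The text being searched also has to be Unicode, which for downloaded bytes means decoding them first. Below is a minimal sketch of both steps together. The helper name `find_hebrew_words` and the sample bytes are hypothetical, and UTF-8 is an assumed encoding; it runs on Python 2 and on Python 3, where the `u` prefix is accepted but redundant.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern: with the u prefix, \u05D0 and \u05EA are single code points.
HEBREW_WORD = re.compile(u"[\u05D0-\u05EA]+")


def find_hebrew_words(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes, then return every run of Hebrew letters."""
    text = raw_bytes.decode(encoding)  # bytes -> Unicode text
    return HEBREW_WORD.findall(text)


# Hypothetical sample standing in for the bytes returned by urlopen(url).read().
sample = u'<a lang="he">\u05e2\u05d1\u05e8\u05d9\u05ea</a> and English'.encode("utf-8")

words = find_hebrew_words(sample)
```

With the question's code, this would amount to calling `find_hebrew_words(website_text)` once the page's real encoding is known.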

answered Sep 15, 2015 at 18:20

Sanich
