To simplify my task, let's say I want to find any words written in Hebrew on a web page. I know that the Hebrew letters occupy the code points U+05D0 to U+05EA, so I want to write something like:

import re
import urllib2

expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"

website_handle = urllib2.urlopen(url)
website_text = website_handle.read()
matches = re.findall(expr, website_text)
for item in matches:
    print item

The output I would expect is:

עברית

But instead the output is a lot of Chinese/Japanese characters.

1
  • @stribizhev It won't find anything. Maybe I should use HTML codes instead? Commented Sep 15, 2015 at 17:03

2 Answers

1

You can use the standard Unicode escape notation in Python inside a character class:

re.findall(u"[\u05D0-\u05EA]", website_text, re.U)

4 Comments

It won't find anything. Maybe I should use HTML codes instead? &#1488 to &#1514
@Sanich No, that doesn't work in Python. You may need to decode your text first; it depends on how your text is encoded.
The text is an HTML web page.
@Sanich Can you add some sample data to your question, along with your expected output?
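
To make the decoding suggestion from the comments concrete, here is a minimal sketch. It assumes the page is UTF-8 encoded and uses a hard-coded byte string in place of a real download. The names `find_hebrew`, `HEBREW_WORD` and `sample_bytes` are hypothetical and chosen only for illustration. Numeric character references such as &#1488; are converted to real characters before matching, which covers the "HTML codes" case as well.

```python
# -*- coding: utf-8 -*-
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    # Python 2 fallback: HTMLParser provides an equivalent method.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Unicode pattern covering the Hebrew letters alef (U+05D0) to tav (U+05EA).
HEBREW_WORD = re.compile(u"[\u05D0-\u05EA]+")


def find_hebrew(raw_bytes, encoding="utf-8"):
    """Decode the raw bytes, expand HTML entities, and return Hebrew words."""
    text = raw_bytes.decode(encoding)  # bytes -> unicode text
    text = unescape(text)              # "&#1513;" -> the actual character
    return HEBREW_WORD.findall(text)


# Assumed sample input: one word as literal characters, one as entities.
sample_bytes = (
    u'<a lang="he">\u05e2\u05d1\u05e8\u05d9\u05ea</a> '
    u'and &#1513;&#1500;&#1493;&#1501;'
).encode("utf-8")

words = find_hebrew(sample_bytes)
for word in words:
    print(word)
```

With real pages the encoding is not always UTF-8, so the `encoding` argument would have to come from the HTTP headers or the page's own declaration.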
0

The expression should be:

expr = u"[\u05D0-\u05EA]+"

Notice the 'u' at the beginning.
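
A small sketch of what the prefix changes may help here. In Python 2, a plain string literal does not interpret `\u` escapes, so it holds the same characters as the raw string below: a backslash, a `u` and four hex digits for each escape. With the `u` prefix, Python turns each escape into a single Hebrew character before the regex engine sees the pattern. The variable names and the sample text are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for the Python 2 literal "[\u05D0-\u05EA]+" without the u prefix:
# the escapes are kept as literal backslash sequences.
raw_pattern = r"[\u05D0-\u05EA]+"

# With the u prefix, each \uXXXX escape becomes one character.
unicode_pattern = u"[\u05D0-\u05EA]+"

print(len(raw_pattern))      # 16
print(len(unicode_pattern))  # 6

# The unicode pattern matches Hebrew letters in already-decoded text.
sample_text = u"abc \u05e2\u05d1\u05e8\u05d9\u05ea xyz"
matches = re.findall(unicode_pattern, sample_text)
for item in matches:
    print(item)
```

The pattern only behaves as intended when the text being searched is also a unicode string, so the page bytes still need to be decoded first.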
