To simplify my task, let's say I want to find any words written in Hebrew on a web page.
I know that the Hebrew letters are the code points U+05D0 to U+05EA.
I want to write something like:
    import re
    import urllib2

    expr = "[\u05D0-\u05EA]+"
    url = "https://en.wikipedia.org/wiki/Category:Countries"
    web_handle = urllib2.urlopen(url)
    website_text = web_handle.read()
    matches = re.findall(expr, website_text)
    for item in matches:
        print item
The output I would expect is:
עברית
But instead the output is a lot of Chinese/Japanese characters.
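
One thing that may be going wrong: in Python 2, `read()` returns raw bytes, and `\u05D0` is not interpreted as a code point inside a plain (non-`u`) string literal, so the pattern and the page text are both byte strings rather than Unicode. Below is a minimal sketch of matching on decoded text instead. It assumes the page is UTF-8 encoded; `find_hebrew_words` and the `sample` bytes are hypothetical names made up for illustration, and the sample stands in for the real page download.

```python
import re

# Hebrew letters alef (U+05D0) through tav (U+05EA), the range from the question.
# The u prefix makes this a Unicode pattern, so \u05D0 is a real code point.
HEBREW_WORD = re.compile(u"[\u05D0-\u05EA]+")


def find_hebrew_words(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes (assumed UTF-8), then return each run of Hebrew letters."""
    text = raw_bytes.decode(encoding)
    return HEBREW_WORD.findall(text)


# Hypothetical stand-in for the bytes that reading the page would return.
sample = u'<a lang="he">\u05e2\u05d1\u05e8\u05d9\u05ea</a> English'.encode("utf-8")
matches = find_hebrew_words(sample)
```

With a real page, the bytes returned by `read()` would be passed in place of `sample`. If the server reports a different charset, that value would need to be passed as `encoding`.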