5

I am writing a bot that checks thousands of websites to determine whether they are in English or not.

I am using Scrapy (a Python 2.7 framework) to crawl the first page of each website.

Can someone suggest the best way to check a website's language?

Any help would be appreciated.

8 Answers

4

Since you are using Python, you can try out NLTK. More precisely, you can look into NLTK's language detection.

More information and the exact code snippet are here: NLTK and language detection
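A minimal sketch of the stopword-overlap recipe that link describes (detect_language is my own helper name, and the stopwords corpus has to be downloaded once with nltk.download('stopwords')):

    import nltk
    from nltk.corpus import stopwords

    def detect_language(text):
        # Score each language's stopword list by how many of its words
        # appear in the text; return the best-scoring language.
        tokens = set(w.lower() for w in nltk.wordpunct_tokenize(text))
        best_lang, best_score = None, 0
        for lang in stopwords.fileids():        # 'english', 'french', ...
            score = len(tokens & set(stopwords.words(lang)))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    print(detect_language("This page is clearly written in plain English."))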



2

You can use the response headers, specifically the Content-Language header, to find out:

Wikipedia
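A rough sketch of how this could look inside a Scrapy spider callback (names are illustrative, self.logger needs a reasonably recent Scrapy, and many sites simply do not send the header, as the comments below point out):

    def parse(self, response):
        # Content-Language is optional; headers.get() returns None if absent.
        lang = response.headers.get('Content-Language')
        if lang and b'en' in lang.lower():
            self.logger.info("Declared English: %s", response.url)
        else:
            self.logger.info("No usable Content-Language header: %s", response.url)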

6 Comments

Does every website have a Content-Language attribute? I don't have much exposure to websites.
Likely; it's part of the HTTP protocol, and it's the easiest way to meet your requirement without other dependencies. If it doesn't suit your needs, one can always extend to other measures. You might want a fallback pipeline, for instance.
Can you please explain what you mean by "you might want a fallback pipeline"?
You could create a chain of options to determine the language, starting with the least resource-costly one and moving on to something more robust each time the previous method fails (sketched below).
-1 The HTTP header is not very reliable. Many page authors don't mark up the language they write in, many web page authoring tools won't let them, many admins don't let users set this for individual pages, etc.; and when people do try to specify this information, they sometimes get it wrong (for example, many Swedish pages have the country code for Sweden, se, instead of the language code for Swedish, sv).
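To make the fallback idea concrete, here is a small sketch; check_header, check_meta_lang and detect_statistically are hypothetical helpers standing in for the cheap-to-expensive techniques discussed in the other answers:

    def is_english(response):
        # Try the cheapest signal first and only fall back to heavier
        # detection when the previous step could not decide.
        for detector in (check_header, check_meta_lang, detect_statistically):
            result = detector(response)     # returns 'en', another code, or None
            if result is not None:
                return result == 'en'
        return False                        # nothing could decide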
2

If the sites are multilingual, you can send the "Accept-Language: en-US,en;q=0.8" header and expect the response to be in English. If they are not, you can inspect the response.headers dictionary and see if you can find any information about the language.

If you are still unlucky, you can try mapping the IP to the country and then to the language in some way. As a last resort, try detecting the language from the text itself (I don't know how accurate this is).
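A sketch of sending that header from a Scrapy spider and then inspecting the response headers (spider skeleton only; the IP-to-country and text-detection fallbacks are left out):

    import scrapy

    class LanguageSpider(scrapy.Spider):
        name = "language_check"
        start_urls = ["http://example.com"]

        def start_requests(self):
            headers = {"Accept-Language": "en-US,en;q=0.8"}
            for url in self.start_urls:
                yield scrapy.Request(url, headers=headers, callback=self.parse)

        def parse(self, response):
            # The server may echo the negotiated language back in its headers.
            self.logger.info("Content-Language: %s",
                             response.headers.get("Content-Language"))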


2

If you are using Python, I highly recommend the standalone langid.py module written by Marco Lui and Tim Baldwin. The model is pre-trained, the detection is highly accurate, and it can also handle XML/HTML documents.
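A minimal usage sketch (pip install langid; classify() returns a (language_code, score) pair):

    import langid

    lang, score = langid.classify(u"This page is written in plain English.")
    if lang == "en":
        print("English, score %s" % score)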


1

Look into Natural Language Toolkit:

NLTK: http://nltk.org/

What you want to look into is using the words corpus to get the default English vocabulary that ships with NLTK:

nltk.corpus.words.words()

Then, compare your text with the above using difflib.

Reference: http://docs.python.org/library/difflib.html

Using these tools, you can create a scale that measures how far your text is from the English vocabulary defined by NLTK.
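A rough sketch of such a scale, using a plain set lookup instead of difflib (the 0.5 threshold is an arbitrary assumption, and the words corpus has to be downloaded once with nltk.download('words')):

    import nltk
    from nltk.corpus import words

    english_vocab = set(w.lower() for w in words.words())

    def english_ratio(text):
        # Fraction of alphabetic tokens that appear in NLTK's English word list.
        tokens = [t.lower() for t in nltk.wordpunct_tokenize(text) if t.isalpha()]
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in english_vocab)
        return float(hits) / len(tokens)

    print(english_ratio("The quick brown fox jumps over the lazy dog") > 0.5)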

2 Comments

In a resource-efficient crawler, this is something I would add somewhere near the bottom of my pipeline, to be honest.
Update: NLTK now offers a module for language identification.
1

You can use the Language Detection API at http://detectlanguage.com. It accepts a text string via GET or POST and provides JSON output with scores. There are free and premium services.
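A sketch of calling it with the requests library; the endpoint and parameter names below are what the service documented at the time of writing, so double-check them and substitute your own API key:

    import requests

    resp = requests.post(
        "https://ws.detectlanguage.com/0.2/detect",
        data={"q": "Buenos dias, senor", "key": "YOUR_API_KEY"},
    )
    print(resp.json())   # e.g. {"data": {"detections": [{"language": "es", ...}]}}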


0

If an HTML website uses non-English characters, this is mentioned in the page's source code in a meta tag; it helps browsers know how to render the page.

Here is an example from an Arabic website, http://www.tanmia.ae, which has both an English page and an Arabic page.

The meta tag in the Arabic page is: <meta http-equiv="X-UA-Compatible" content="IE=edge">

The same page in English has: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Maybe have the bot look into the meta tag: if it is English, then proceed, else ignore?
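One way to have the bot look at language hints in the markup itself is to check the lang attribute on the <html> element or a Content-Language meta tag; a sketch with Scrapy selectors (.get() needs a recent Scrapy, older versions use .extract_first()):

    def parse(self, response):
        # Language hints from the markup; not every page sets them.
        html_lang = response.xpath("//html/@lang").get()
        meta_lang = response.xpath(
            '//meta[@http-equiv="Content-Language"]/@content').get()
        hint = (html_lang or meta_lang or "").lower()
        if hint.startswith("en"):
            self.logger.info("Declared English: %s", response.url)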


0

If you don't want to trust what the webpage tells you but want to check for yourself, you can use a statistical algorithm for language detection. Trigram-based algorithms are robust and should work well with pages that are mostly in another language but have a bit of English (enough to fool heuristics like "check if the words the, and, or with are on the page"). Google "ngram language classification" and you'll find lots of references on how it's done.

It's easy enough to compile your own trigram tables for English, but the Natural Language Toolkit comes with a set for several common languages. They are in NLTK_DATA/corpora/langid. You could use the trigram data without the nltk library itself, but you might also want to look into the nltk.util.trigrams module.
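A minimal sketch of the character-trigram idea (Cavnar & Trenkle style); the English profile here is built from a tiny sample string purely for illustration, whereas a real profile would come from a large English corpus or the pre-built tables mentioned above:

    from collections import Counter
    from nltk.util import trigrams

    def profile(text, size=300):
        # Most frequent character trigrams, most common first.
        counts = Counter(trigrams(text.lower()))
        return [g for g, _ in counts.most_common(size)]

    english_profile = set(profile("the quick brown fox jumps over the lazy dog "
                                  "and this training sample should be far larger"))

    def english_score(text):
        # Fraction of the page's trigram profile shared with the English profile.
        page_profile = profile(text)
        if not page_profile:
            return 0.0
        return len(english_profile & set(page_profile)) / float(len(page_profile))

    print(english_score("another short piece of english text to score"))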

