5

I am writing a bot that checks thousands of websites to determine whether they are in English or not.

I am using Scrapy (a Python 2.7 framework) to crawl the first page of each website.

Can someone suggest the best way to check a website's language?

Any help would be appreciated.

8 Answers

4

Since you are using Python, you can try out NLTK. More precisely, you can look into NLTK's language detection.

More information and the exact code snippet are here: NLTK and language detection
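A minimal sketch of the stopword-overlap recipe that link describes (detect_language is my own helper name, and the stopwords corpus has to be downloaded once with nltk.download('stopwords')):

    import nltk
    from nltk.corpus import stopwords

    def detect_language(text):
        # Score each language's stopword list by how many of its words
        # appear in the text; return the best-scoring language.
        tokens = set(w.lower() for w in nltk.wordpunct_tokenize(text))
        best_lang, best_score = None, 0
        for lang in stopwords.fileids():        # 'english', 'french', ...
            score = len(tokens & set(stopwords.words(lang)))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    print(detect_language("This page is clearly written in plain English."))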



2

You can use the response headers, specifically the Content-Language header, to find out:

Wikipedia
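A rough sketch of how this could look inside a Scrapy spider callback (names are illustrative, self.logger needs a reasonably recent Scrapy, and many sites simply do not send the header, as the comments below point out):

    def parse(self, response):
        # Content-Language is optional; headers.get() returns None if absent.
        lang = response.headers.get('Content-Language')
        if lang and b'en' in lang.lower():
            self.logger.info("Declared English: %s", response.url)
        else:
            self.logger.info("No usable Content-Language header: %s", response.url)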

6 Comments

Does every website have a Content-Language attribute? I don't have much exposure to websites.
Likely; it's part of the HTTP protocol, and it's the easiest way to meet your requirement without other dependencies. If it doesn't suit your needs, one can always extend to other measures. You might want a fallback pipeline, for instance.
Can you please explain what you mean by "you might want a fallback pipeline"?
You could create a chain of options to determine the language, starting with the least resource-costly one and moving on to something more robust each time the previous method fails (sketched below).
-1 The HTTP header is not very reliable. Many page authors don't mark up the language they write in, many web page authoring tools won't let them, many admins don't let users set this for individual pages, etc.; and when people do try to specify this information, they sometimes get it wrong (for example, many Swedish pages have the country code for Sweden, se, instead of the language code for Swedish, sv).
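To make the fallback idea concrete, here is a small sketch; check_header, check_meta_lang and detect_statistically are hypothetical helpers standing in for the cheap-to-expensive techniques discussed in the other answers:

    def is_english(response):
        # Try the cheapest signal first and only fall back to heavier
        # detection when the previous step could not decide.
        for detector in (check_header, check_meta_lang, detect_statistically):
            result = detector(response)     # returns 'en', another code, or None
            if result is not None:
                return result == 'en'
        return False                        # nothing could decide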
2

If the sites are multilingual, you can send the "Accept-Language: en-US,en;q=0.8" header and expect the response to be in English. If they are not, you can inspect the response.headers dictionary and see if you can find any information about the language.

If you are still unlucky, you can try mapping the IP to the country and then to the language in some way. As a last resort, try detecting the language from the text itself (I don't know how accurate this is).
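A sketch of sending that header from a Scrapy spider and then inspecting the response headers (spider skeleton only; the IP-to-country and text-detection fallbacks are left out):

    import scrapy

    class LanguageSpider(scrapy.Spider):
        name = "language_check"
        start_urls = ["http://example.com"]

        def start_requests(self):
            headers = {"Accept-Language": "en-US,en;q=0.8"}
            for url in self.start_urls:
                yield scrapy.Request(url, headers=headers, callback=self.parse)

        def parse(self, response):
            # The server may echo the negotiated language back in its headers.
            self.logger.info("Content-Language: %s",
                             response.headers.get("Content-Language"))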


2

If you are using Python, I highly recommend the standalone langid.py module written by Marco Lui and Tim Baldwin. The model is pre-trained, the detection is highly accurate, and it can also handle XML/HTML documents.
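A minimal usage sketch (pip install langid; classify() returns a (language_code, score) pair):

    import langid

    lang, score = langid.classify(u"This page is written in plain English.")
    if lang == "en":
        print("English, score %s" % score)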


1

Look into Natural Language Toolkit:

NLTK: http://nltk.org/

What you want to look into is using the words corpus to get the default English vocabulary that ships with NLTK:

nltk.corpus.words.words()

Then, compare your text with the above using difflib.

Reference: http://docs.python.org/library/difflib.html

Using these tools, you can create a scale that measures how far your text is from the English vocabulary defined by NLTK.
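A rough sketch of such a scale, using a plain set lookup instead of difflib (the 0.5 threshold is an arbitrary assumption, and the words corpus has to be downloaded once with nltk.download('words')):

    import nltk
    from nltk.corpus import words

    english_vocab = set(w.lower() for w in words.words())

    def english_ratio(text):
        # Fraction of alphabetic tokens that appear in NLTK's English word list.
        tokens = [t.lower() for t in nltk.wordpunct_tokenize(text) if t.isalpha()]
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in english_vocab)
        return float(hits) / len(tokens)

    print(english_ratio("The quick brown fox jumps over the lazy dog") > 0.5)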

2 Comments

In a resource-efficient crawler, this is something I would add somewhere near the bottom of my pipeline, to be honest.
Update: NLTK now offers a module for language identification.
1

You can use the Language Detection API at http://detectlanguage.com. It accepts a text string via GET or POST and provides JSON output with scores. There are free and premium services.
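A sketch of calling it with the requests library; the endpoint and parameter names below are what the service documented at the time of writing, so double-check them and substitute your own API key:

    import requests

    resp = requests.post(
        "https://ws.detectlanguage.com/0.2/detect",
        data={"q": "Buenos dias, senor", "key": "YOUR_API_KEY"},
    )
    print(resp.json())   # e.g. {"data": {"detections": [{"language": "es", ...}]}}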


0

If an HTML website uses non-English characters, this is mentioned in the page's source code in a meta tag; it helps browsers know how to render the page.

Here is an example from an Arabic website, http://www.tanmia.ae, which has both an English page and an Arabic page.

The meta tag in the Arabic page is: <meta http-equiv="X-UA-Compatible" content="IE=edge">

The same page in English has: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Maybe have the bot look into the meta tag: if it is English, then proceed, else ignore?
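One way to have the bot look at language hints in the markup itself is to check the lang attribute on the <html> element or a Content-Language meta tag; a sketch with Scrapy selectors (.get() needs a recent Scrapy, older versions use .extract_first()):

    def parse(self, response):
        # Language hints from the markup; not every page sets them.
        html_lang = response.xpath("//html/@lang").get()
        meta_lang = response.xpath(
            '//meta[@http-equiv="Content-Language"]/@content').get()
        hint = (html_lang or meta_lang or "").lower()
        if hint.startswith("en"):
            self.logger.info("Declared English: %s", response.url)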


0

If you don't want to trust what the webpage tells you but want to check for yourself, you can use a statistical algorithm for language detection. Trigram-based algorithms are robust and should work well with pages that are mostly in another language but have a bit of English (enough to fool heuristics like "check if the words the, and, or with are on the page"). Google "ngram language classification" and you'll find lots of references on how it's done.

It's easy enough to compile your own trigram tables for English, but the Natural Language Toolkit comes with a set for several common languages. They are in NLTK_DATA/corpora/langid. You could use the trigram data without the nltk library itself, but you might also want to look into the nltk.util.trigrams module.
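A minimal sketch of the character-trigram idea (Cavnar & Trenkle style); the English profile here is built from a tiny sample string purely for illustration, whereas a real profile would come from a large English corpus or the pre-built tables mentioned above:

    from collections import Counter
    from nltk.util import trigrams

    def profile(text, size=300):
        # Most frequent character trigrams, most common first.
        counts = Counter(trigrams(text.lower()))
        return [g for g, _ in counts.most_common(size)]

    english_profile = set(profile("the quick brown fox jumps over the lazy dog "
                                  "and this training sample should be far larger"))

    def english_score(text):
        # Fraction of the page's trigram profile shared with the English profile.
        page_profile = profile(text)
        if not page_profile:
            return 0.0
        return len(english_profile & set(page_profile)) / float(len(page_profile))

    print(english_score("another short piece of english text to score"))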

