How to efficiently detect the language of each string in a Python list?

Question

I've been trying langdetect, but my results aren't satisfactory. Please see below:

from langdetect import detect

myText = ['something like this', 'hello, I hope', 'bonjour', 'guten tag', 'hola amigos']

# Detect the language of each string in the list
languages = [detect(text) for text in myText]

The languages variable contains:

['en', 'en', 'hr', 'sv', 'so']

Could someone recommend a more efficient way to detect string language for my scenario above? Thanks!

Your text snippets are too short for detection to perform well.

– user2390182, Commented Apr 28, 2020 at 11:31

rdas · Accepted Answer · 2020-04-28 11:35:12Z

5

You simply don't have enough text to detect the language correctly. Check the probabilities reported by the detect_langs function:

from langdetect import detect, detect_langs
myText = ['something like this', 'hello, I hope', 'bonjour', 'guten tag', 'hola amigos']

languages = []

for text in myText:
    languages.append((text, detect_langs(text)))

print(languages)

Gives:

[('something like this', [en:0.7142843359964415, no:0.2857134272509894]), 
('hello, I hope', [en:0.5714282536622661, it:0.42856936839505744]), 
('bonjour', [hr:0.4285730214431372, sq:0.28571322755605805, fr:0.2857129560702645]),
('guten tag', [sv:0.999995044011124]), 
('hola amigos', [so:0.9999965325258])]

Notice how the results for 'bonjour' are mixed: no language has a clear lead over the others.

Now if I add just a little more text to that example:

from langdetect import detect_langs

print(detect_langs('Bonjour, mon ami'))

That gives:

[fr:0.8571383531700392, sq:0.14285710967856416]

That is a lot more accurate.

So, to answer your question: get more data.
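One way to act on the probabilities above is to accept a detection only when the top candidate clears a threshold, and to treat everything else as unknown. The following is a minimal sketch, not part of langdetect: the helper name confident_language, the Candidate stand-in, and the threshold values are all assumptions made for illustration. The sample probabilities are rounded from the output shown above.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by detect_langs,
# which expose .lang and .prob attributes.
Candidate = namedtuple("Candidate", ["lang", "prob"])


def confident_language(candidates, threshold=0.9):
    """Return the most probable language code, or None if it is below threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= threshold else None


# Probabilities rounded from the detect_langs output above.
bonjour = [Candidate("hr", 0.43), Candidate("sq", 0.29), Candidate("fr", 0.29)]
bonjour_mon_ami = [Candidate("fr", 0.86), Candidate("sq", 0.14)]

print(confident_language(bonjour))                         # None
print(confident_language(bonjour_mon_ami, threshold=0.8))  # fr

# With langdetect installed, the same helper should accept real results:
#   confident_language(detect_langs('Bonjour, mon ami'), threshold=0.8)
```

A threshold only filters out uncertain guesses. It does not catch confident misdetections such as 'guten tag' being reported as sv with a probability near 1.0, so longer input remains the main fix.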


1 Comment

Jongware Over a year ago

Funny how Google Translate refuses to translate "bonjour" if you set the language to Croatian, though. Likely langdetect works with n-grams.

Arkistarvh Kltzuonstev · Accepted Answer · 2020-04-28 11:33:27Z

2

langdetect does give you a result, but as short ISO 639-1 language codes. You can use a dictionary to map these codes to their full language names, like:

language_dict = {'en' : 'english', ...}
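As a rough illustration of that mapping, here is a small sketch. The table name LANGUAGE_NAMES and the helper language_name are hypothetical, and the table lists only the codes that appear in this thread rather than the full ISO 639-1 set.

```python
# Hypothetical lookup table covering only the codes seen in this thread.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "hr": "Croatian",
    "sv": "Swedish",
    "so": "Somali",
}


def language_name(code):
    """Return the full name for an ISO 639-1 code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code, code)


codes = ["en", "en", "hr", "sv", "so"]  # the output reported in the question
print([language_name(c) for c in codes])
# ['English', 'English', 'Croatian', 'Swedish', 'Somali']
```

Falling back to the raw code for unknown entries keeps the lookup from raising a KeyError when the detector returns a language that is missing from the table.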

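For alternatives, you might check out textblob. Note that detect_language() queried the Google Translate web API, so it needed network access, and it has since been deprecated and removed in newer TextBlob releases:

from textblob import TextBlob

# myText is the list from the question
b = TextBlob(myText[2])
b.detect_language()
# output: 'fr'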

For the myText list, the corresponding result is:

['en', 'en', 'fr', 'de', 'es']

edited Apr 28, 2020 at 11:33

answered Apr 28, 2020 at 11:31


1 Comment

user2390182 Over a year ago

I think the OP is less concerned with getting ISO codes instead of full names and more with the fact that, e.g., "bonjour" is detected as Croatian.
