I've been trying langdetect, but the results aren't satisfactory. Please see below:

from langdetect import detect

myText = ['something like this', 'hello, I hope', 'bonjour', 'guten tag', 'hola amigos']

languages = []

# Detect the language of each snippet in turn
for text in myText:
    languages.append(detect(text))

The languages variable then contains:

['en', 'en', 'hr', 'sv', 'so']

Could someone recommend a more accurate way to detect the language of short strings like the ones above? Thanks!

1 Comment

Your text snippets are too short for detection to perform well. Commented Apr 28, 2020 at 11:31

2 Answers

You simply don't have enough text to detect the language correctly. Check the probabilities reported by the detect_langs function:

from langdetect import detect, detect_langs
myText = ['something like this', 'hello, I hope', 'bonjour', 'guten tag', 'hola amigos']

languages = []

for text in myText:
    languages.append((text, detect_langs(text)))

print(languages)

Gives:

[('something like this', [en:0.7142843359964415, no:0.2857134272509894]), 
('hello, I hope', [en:0.5714282536622661, it:0.42856936839505744]), 
('bonjour', [hr:0.4285730214431372, sq:0.28571322755605805, fr:0.2857129560702645]),
('guten tag', [sv:0.999995044011124]), 
('hola amigos', [so:0.9999965325258])]

Notice how the results for 'bonjour' are mixed: no language has a clear lead over the others.

Now if I add just a little more text to that example:

from langdetect import detect_langs

print(detect_langs('Bonjour, mon ami'))

That gives:

[fr:0.8571383531700392, sq:0.14285710967856416]

That is a lot more accurate.

So, to answer your question: get more data.
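
One way to act on those probabilities is to accept a detection only when the top candidate clears a threshold. The following is a minimal sketch, not something langdetect provides: pick_language, detect_confident, and the 0.8 cutoff are hypothetical names and values chosen for illustration. It assumes only that detect_langs returns objects with lang and prob attributes, and it sets DetectorFactory.seed because langdetect is otherwise non-deterministic.

```python
def pick_language(candidates, min_prob=0.8):
    """Return the most probable language code, or None if it is below min_prob.

    candidates is a list of (code, probability) pairs.
    """
    if not candidates:
        return None
    code, prob = max(candidates, key=lambda pair: pair[1])
    return code if prob >= min_prob else None


def detect_confident(text, min_prob=0.8):
    """Hypothetical wrapper around langdetect; the import is lazy so that
    pick_language can be used without the library installed."""
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make results reproducible across runs
    try:
        results = detect_langs(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. an empty string)
        return None
    return pick_language([(r.lang, r.prob) for r in results], min_prob)


# Probabilities copied (rounded) from the outputs above
print(pick_language([("hr", 0.43), ("sq", 0.29), ("fr", 0.29)]))  # None
print(pick_language([("fr", 0.86), ("sq", 0.14)]))                # fr
```

A threshold only filters out uncertain guesses. It would not catch 'guten tag' or 'hola amigos' above, which were misdetected with probabilities close to 1.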

1 Comment

Funny how Google Translate refuses to translate "bonjour" if you set the language to Croatian, though. Likely langdetect works with n-grams.
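
To illustrate why an n-gram approach would struggle with single words, here is a small sketch that extracts character n-grams from a string. It is an illustration only and is not taken from langdetect's code; char_ngrams is a hypothetical helper.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# A single word yields very few features to compare against language profiles
print(char_ngrams("bonjour", 3))
# ['bon', 'onj', 'njo', 'jou', 'our']

# A slightly longer phrase already yields almost three times as many
print(len(char_ngrams("bonjour, mon ami", 3)))
# 14
```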

langdetect does give you a result, but as a short ISO 639-1 language code. You can use a dictionary to map these short codes to their full language names, like:

language_dict = {'en' : 'english', ...}
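
A minimal sketch of such a mapping, assuming only a handful of codes are needed. The names CODE_TO_NAME and language_name are hypothetical, and the table is deliberately incomplete; unknown codes are returned unchanged.

```python
# Partial, illustrative mapping from ISO 639-1 codes to English language names
CODE_TO_NAME = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "hr": "Croatian",
    "sv": "Swedish",
    "so": "Somali",
}


def language_name(code):
    """Return the full name for a code, or the code itself if it is unknown."""
    return CODE_TO_NAME.get(code, code)


codes = ["en", "en", "hr", "sv", "so"]  # the output from the question
print([language_name(c) for c in codes])
# ['English', 'English', 'Croatian', 'Swedish', 'Somali']
```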

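As an alternative, you might check out textblob (note that detect_language called an online Google Translate service and was deprecated and later removed in newer TextBlob releases, so this may only work on older versions):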

from textblob import TextBlob
b = TextBlob(myText[2])
b.detect_language()
# output : 'fr'

For the myText list, the corresponding result is:

['en', 'en', 'fr', 'de', 'es']

1 Comment

I think the OP is less concerned with getting ISO codes instead of full names, and more with results such as "bonjour" being detected as Croatian.
