Friday, August 07, 2009

What Language Is This? 5 Tools to Identify Unknown Languages

이 웹사이트에 환영. 이것은 보기 원본이다
What language is this? Chinese? Japanese?
It’s Korean actually. Detecting this manually would have taken me a lot of time. Fortunately, I found some very accurate tools that can do this automatically. They are all listed below.
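A quick first check that needs no online tool is to look at which writing system the characters belong to. The sketch below is only an illustration, not how any of the tools reviewed here work: it relies on Python's standard `unicodedata` module, and the helper name `dominant_script` is made up for this example. It narrows the guess to a script (Hangul, in this case), which is enough to rule out Chinese and Japanese, but it cannot separate languages that share a script, such as Russian and Ukrainian.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script prefix among the letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        # Unicode character names start with the script, e.g. "HANGUL SYLLABLE I"
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


sample = "이 웹사이트에 환영. 이것은 보기 원본이다"
print(dominant_script(sample))  # HANGUL
```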
The experiment: I tested the websites using sample text (1-2 sentences with 8 words) from the following languages: Portuguese, Russian, Korean, Vietnamese, Italian, Turkish, Polish, Ukrainian, Azerbaijani, Slovenian, Macedonian, Dutch, Filipino (Tagalog), Greek, Galician, Czech, Belorussian, Finnish, Tatar and Norwegian.
Overall, I tested 20 different languages.

3 Tools to Detect Unknown Language Text

1. LangId (passed 18 out of 20 tests, didn’t pass Tatar and Belorussian)
Pros: Overall, great online tool. It offers basic text detection functionality and they also have Twitter and email-detection bots for even quicker results.
Cons: Their engine is built on the Google API, although they seem to get better results than the Google detector described below, so they evidently know how to make good use of it. I didn’t like that they don’t have their own algorithm for detecting languages.
2. Google Language Detector (passed 17 out of 20 tests, didn’t pass Portuguese, Tagalog and Belorussian)
Pros: Google has one of the world’s best APIs for language detection. The good thing is that you can see the probability that the displayed result is correct. It passed most of the sample tests.
Cons: I was quite surprised that it didn’t pass the Portuguese test. It seems to have a (I hope temporary) bug with this language. The page design could also be a lot better.
3. What Language Is This (passed 11 out of 20 tests, didn’t pass Russian, Korean, Ukrainian, Azerbaijani, Macedonian, Tagalog, Greek, Galician and Tatar)
Pros: Some languages, like the South Slavic ones (Serbian, Croatian, Slovenian), are quite similar. If you enter Croatian text, for example, this website will tell you that the text could also be Serbian or Slovenian.
Cons: They need to work on making their detection system more sophisticated. I was thinking of putting Translated.net (another website for language detection) instead of this one, but Translated promised detection of more languages and actually did worse than WhatLanguageIsThis.com.
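Listing near-miss alternatives for similar languages can be sketched in a few lines. This is an assumption about how such a feature might work, not WhatLanguageIsThis.com's actual method: the function name `close_candidates`, the `margin` parameter and the scores are all hypothetical.

```python
def close_candidates(scores, margin=0.1):
    """Return languages whose score is within `margin` of the best one, best first."""
    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [lang for lang, score in ranked if best - score <= margin]


# Hypothetical scores for a short Croatian sentence (illustrative numbers only).
scores = {"Croatian": 0.41, "Serbian": 0.38, "Slovenian": 0.33, "Polish": 0.05}
print(close_candidates(scores))  # ['Croatian', 'Serbian', 'Slovenian']
```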

2 Tools To Detect Websites In Unknown Languages

4. Google Translate with Detect Language as the first option
Passed: 18 out of 20, didn’t pass Belorussian and Tatar.
Pros: This tool does its job very well. The thing I like about Google Translate is that if it doesn’t support a specific language it gives you the following screen:
[Screenshot: Google Translate reporting an unsupported language]
That’s a great language detector if you ask me!
5. Microsoft Bing Translator with Auto-Detect as the first option.
Passed: 8 out of 20, didn’t pass Dutch, Vietnamese, Turkish, Ukrainian, Azerbaijani, Slovenian, Macedonian, Tagalog, Greek, Galician, Czech and Belorussian
Pros: For the limited number of languages it supports, it does its job well.
Cons: I am very disappointed with Microsoft. They support very few languages for detection and translation, and their Auto-Detect feature is terrible. If you enter text in a language they don’t support, you get a wrong result instead of a message saying the language isn’t supported.
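The difference between admitting "I don't know" and returning a wrong guess comes down to a confidence cutoff. The sketch below is hypothetical and does not describe how Google or Microsoft implement detection: the function name `detect_or_abstain`, the `threshold` value and the example scores are assumptions made for illustration.

```python
def detect_or_abstain(scores, threshold=0.6):
    """Return the top-scoring language, or None when confidence is too low."""
    if not scores:
        return None
    lang, confidence = max(scores.items(), key=lambda item: item[1])
    # Abstain rather than report a low-confidence (probably wrong) guess.
    return lang if confidence >= threshold else None


# Hypothetical scores: a confident match, and an unsupported language
# whose text only weakly resembles the languages the detector knows.
print(detect_or_abstain({"Dutch": 0.92, "German": 0.05}))
print(detect_or_abstain({"Russian": 0.35, "Bulgarian": 0.30}))
```

With these made-up numbers the first call returns a language and the second returns `None`, which a site could turn into a "language not supported" message.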

Thoughts

Overall, I think the tools above are heading in a good direction. They are currently the best ones for detecting languages online and do their job pretty well when it comes to popular languages. However, they need to add more obscure languages (none of the tools were able to recognize Tatar), and I’m sure that all of them, especially Google, will go in that direction in the near future.
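For a side-by-side view, the pass counts can be recomputed from the failure lists reported for each tool above. This is just a small tally script; the data comes from this post, while the variable names are arbitrary.

```python
TOTAL = 20  # number of languages in the experiment

# Languages each tool failed, as listed in the reviews above.
failed = {
    "LangId": ["Tatar", "Belorussian"],
    "Google Language Detector": ["Portuguese", "Tagalog", "Belorussian"],
    "What Language Is This": [
        "Russian", "Korean", "Ukrainian", "Azerbaijani", "Macedonian",
        "Tagalog", "Greek", "Galician", "Tatar",
    ],
    "Google Translate": ["Belorussian", "Tatar"],
    "Bing Translator": [
        "Dutch", "Vietnamese", "Turkish", "Ukrainian", "Azerbaijani",
        "Slovenian", "Macedonian", "Tagalog", "Greek", "Galician",
        "Czech", "Belorussian",
    ],
}

passed = {tool: TOTAL - len(langs) for tool, langs in failed.items()}

# Print the tools from best to worst pass rate.
for tool, n in sorted(passed.items(), key=lambda item: item[1], reverse=True):
    print(f"{tool}: {n}/{TOTAL} ({n / TOTAL:.0%})")
```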
