Shruti Rijhwani, Microsoft Research India
RT @HappelStadion: What was your favourite 1D moment at the concert? Was war für euch der schönste Moment? Tweet us!
If you know both English and German, you probably figured out what two languages this tweet uses. Either way, you likely realized that there isn’t just one language in the tweet.
We recognize languages that we are familiar with. The task is second nature to humans – is it just as easy for machines? Why do machines need to identify languages in the first place?
Most Natural Language Processing (NLP) techniques are designed for specific languages. That makes language identification a necessary first step for machines to derive meaning from human language. Computational language identification research began in 1995. Initially, language identification was performed at the document level, that is, each document was assumed to contain a single language. This was only logical: back in 1995, most digital documents had professional or literary content. We didn’t expect to encounter multiple languages within documents!
However, sentence-level language identification (i.e. one language label per sentence) soon became important to understand comments, short posts and similar user-generated data on the internet. Where does code-mixing fit in, though? Let us look at this Spanish-English tweet.
@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro! 😮 Then I started getting all red, I think im allergic a algo
Even sentence-level language identification wouldn’t work when data is code-mixed, as mixing can be intra-sentential! Before we begin to process code-mixing, we need to recognise all languages present in the data. One language per sentence simply isn’t enough – word-level language identification is necessary.
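To make the difference concrete, here is a minimal sketch that compares one label per sentence with one label per word. The labels are hand-assigned for a fragment of the tweet above purely for illustration, and the helper names `sentence_label` and `languages_present` are hypothetical.

```python
from collections import Counter

# Hand-assigned word-level labels for a fragment of the tweet above
# (illustrative only; "en" = English, "es" = Spanish).
word_labels = [
    ("I", "en"), ("think", "en"), ("im", "en"),
    ("allergic", "en"), ("a", "es"), ("algo", "es"),
]

def sentence_label(labels):
    """Collapse word labels into the single most frequent language."""
    return Counter(lang for _, lang in labels).most_common(1)[0][0]

def languages_present(labels):
    """Return every language that appears at the word level."""
    return sorted({lang for _, lang in labels})

print(sentence_label(word_labels))     # a single label hides the Spanish words
print(languages_present(word_labels))  # word-level labels keep both languages
```

A single sentence-level label has to pick one language and discard the other, which is exactly the information we need for code-mixed text.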
Code-mixing is inherently informal and generally occurs in casual communication. The phenomenon traditionally occurred in spoken conversation. Now that speech-like informal conversation happens on social media, we find plenty of code-mixed data in text form as well.
How do we identify the languages in social media data? Is it as simple as looking up words in dictionaries of various languages? Going back to our example tweet,
RT @HappelStadion: What *was* your favourite 1D *moment* at the concert? *Was* war für euch der schönste *Moment*? Tweet us!
There are words (‘was’, ‘moment’) that belong to both languages! And this tweet is grammatically sound, with correct spelling. What about tweets like,
Wat n awesum movie it wazzzz!
Our language dictionaries wouldn’t identify misspelled words (‘wat’), shortened words (‘awesum’) and exaggerated words (‘wazzzz’).
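A naive dictionary lookup shows both problems at once. The sketch below uses two tiny, hypothetical word lists (real dictionaries would be far larger), and the function name `lookup` is our own choice for this illustration.

```python
# Tiny hypothetical word lists standing in for full dictionaries.
ENGLISH = {"what", "was", "your", "favourite", "moment", "concert",
           "an", "awesome", "movie", "it"}
GERMAN = {"was", "für", "euch", "schönste", "moment"}

def lookup(word):
    """Return the set of languages whose word list contains the word."""
    w = word.lower()
    langs = set()
    if w in ENGLISH:
        langs.add("en")
    if w in GERMAN:
        langs.add("de")
    return langs

for token in ["Was", "Moment", "schönste", "Wat", "awesum", "wazzzz"]:
    print(token, lookup(token))
```

Words shared by both lists come back with two candidate languages, while the noisy spellings come back with none, so exact matching alone cannot settle either case.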
Then there is the problem of transliteration. Several languages that are not formally written in the Roman/Latin script are often typed phonetically in the Roman script that computer keyboards generally feature.
Modi ke speech se India inspired ho gaya #namo
Although Hindi uses the Devanagari script, this Hindi-English tweet has transliterated Hindi words.
Looking up words in a dictionary might work in several cases. But the example tweets we’ve just looked at are not outliers! A large amount of social media content isn’t written with perfect grammar and spelling. Solutions to word-level language ID must counter these problems as well.
There has been exciting work on word-level language identification for social media data, including a shared task at EMNLP 2014 (Solorio et al.), the annual FIRE shared task (Sequiera et al.), as well as work on Hindi-English (Gella et al.) and Dutch-Turkish (Nguyen and Dogruoz) mixing.
Most previous work deals with pairwise language identification, i.e. the language pair is already known and words in the input can only be from those two languages. With plenty of annotated training data, supervised machine learning models have performed extremely well under these conditions.
However, such models have two glaring issues –
- They assume that the language pair in the input is already known and the words can only be from those languages. On Twitter, Facebook and other social media, no prior language information is available about posts.
- They use supervised machine learning models, which require plenty of annotated training data. Labelled data is scarce for most language pairs, particularly data with all the quirks of social media.
The Project Mélange team at MSR India is working towards a solution for these issues.
We aim to design a universal word-level language identification technique that works well for both code-mixed and monolingual social media data. It would require no prior information about the languages in the input. Although we have only a minuscule amount of code-mixed training data, labeled monolingual data is much simpler to obtain. We leverage this monolingual data to train a model that can label code-mixed input as well.
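To give a feel for the idea of training on monolingual data and labelling mixed input, here is a deliberately simple sketch. It is not the Project Mélange model: it swaps in a character-bigram scorer with add-one smoothing, and the word lists, the romanised Hindi spellings, the language codes and the function names are all assumptions made for this example.

```python
import math
from collections import Counter

def char_bigrams(word):
    """Character bigrams of a lowercased word, padded with boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(monolingual):
    """Count bigrams per language from monolingual word lists only."""
    counts = {lang: Counter(bg for w in words for bg in char_bigrams(w))
              for lang, words in monolingual.items()}
    vocab = set().union(*counts.values())
    return counts, len(vocab)

def label_word(word, model):
    """Pick the language whose smoothed bigram model scores the word highest."""
    counts, vocab_size = model

    def score(lang):
        total = sum(counts[lang].values())
        return sum(math.log((counts[lang][bg] + 1) / (total + vocab_size))
                   for bg in char_bigrams(word))

    return max(counts, key=score)

# Hypothetical toy training data: monolingual word lists, no code-mixed text.
model = train({
    "en": ["the", "movie", "was", "awesome", "inspired", "speech"],
    "hi": ["ke", "se", "ho", "gaya", "kya", "hai"],
})

# Label each word of a code-mixed input independently.
tokens = "ke speech se inspired ho gaya".split()
print([(t, label_word(t, model)) for t in tokens])
```

Because the scorer looks at character patterns rather than whole words, it can in principle assign a label to spellings it has never seen, although a toy model like this one would need far more training data to do so reliably.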
Watch this space for more on that, soon!
Solorio, Thamar, et al. “Overview for the first shared task on language identification in code-switched data.” Proceedings of The First Workshop on Computational Approaches to Code Switching. 2014.
Sequiera, Royal, et al. “Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval.”
Gella, Spandana, Kalika Bali, and Monojit Choudhury. “‘ye word kis lang ka hai bhai?’ Testing the Limits of Word level Language Identification.” 2014.
Nguyen, Dong, and A. Seza Dogruoz. “Word level language identification in online multilingual communication.” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.