Code-Mixed Language Identification

Shruti Rijhwani, Microsoft Research India

RT @HappelStadion: What was your favourite 1D moment at the concert? Was war für euch der schönste Moment? Tweet us!

If you know both English and German, you probably figured out which two languages this tweet uses. Either way, you likely realized that there isn’t just one language in the tweet.

We recognize languages that we are familiar with. The task is second nature to humans – is it just as easy for machines? Why do machines need to identify languages in the first place?

Most Natural Language Processing (NLP) techniques are designed for specific languages. That makes language identification a necessary first step for machines to derive meaning from human language. Computational language identification research began in 1995. Initially, language identification was performed at the document level, that is, whole documents were assumed to contain a single language. This was only logical: back in 1995, most digital documents contained professional or literary content, and we didn’t expect to encounter multiple languages within a document!

However, sentence-level language identification (i.e., one language label per sentence) soon became important for understanding comments, short posts, and similar user-generated data on the internet. Where does code-mixing fit in, though? Let us look at this Spanish-English tweet.

@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro! 😮 Then I started getting all red, I think im allergic a algo

Even sentence-level language identification wouldn’t work when data is code-mixed, as mixing can be intra-sentential! Before we can begin to process code-mixing, we need to recognize all the languages present in the data. One language per sentence simply isn’t enough; word-level language identification is necessary.
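To make the task concrete, here is roughly what a word-level language identifier should produce for the tweet above, written out as a small Python structure. The tag set (‘en’, ‘es’, plus ‘univ’ for usernames and emoji) follows a common annotation convention, and the labels themselves are our own illustration rather than gold-standard annotations.

# Hypothetical word-level labels for the tweet above.
# 'univ' marks language-independent tokens such as usernames and emoji.
labels = [
    ("@crystal_jaimes", "univ"), ("no", "es"), ("me", "es"), ("lebante", "es"),
    ("ahorita", "es"), ("cuz", "en"), ("I", "en"), ("felt", "en"),
    ("como", "es"), ("si", "es"), ("me", "es"), ("kemara", "es"),
    ("por", "es"), ("dentro", "es"), ("Then", "en"), ("I", "en"),
    ("started", "en"), ("getting", "en"), ("all", "en"), ("red", "en"),
    ("I", "en"), ("think", "en"), ("im", "en"), ("allergic", "en"),
    ("a", "es"), ("algo", "es"),
]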

Code-mixing is inherently informal and generally occurs in casual communication. The phenomenon traditionally occurred in spoken conversation. Now that speech-like informal conversation happens on social media, we find plenty of code-mixed data in text form as well.

How do we identify the languages in social media data? Is it as simple as looking up words in dictionaries of various languages? Going back to our example tweet,

RT @HappelStadion: What *was* your favourite 1D *moment* at the concert? *Was* war für euch der schönste *Moment*? Tweet us!

There are words (‘was’, ‘moment’) that belong to both languages! And this tweet is grammatically sound, with correct spelling. What about tweets like,

Wat n awesum movie it wazzzz!

Our language dictionaries wouldn’t cover misspelled words (‘wat’), shortened words (‘awesum’), or exaggerated words (‘wazzzz’).

Not to mention the problem of transliteration: several languages that are not formally written in the Roman/Latin script are often typed phonetically in the Roman script that computer keyboards generally feature.

Modi ke speech se India inspired ho gaya #namo

Although Hindi uses the Devanagari script, this Hindi-English tweet (roughly, ‘India was inspired by Modi’s speech’) has its Hindi words transliterated into the Roman script.

Looking up words in a dictionary might work in several cases. But the example tweets we’ve just looked at are not outliers! A large amount of social media content isn’t written with perfect grammar and spelling. Solutions to word-level language ID must counter these problems as well.
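To see concretely where pure lookup breaks down, here is a toy dictionary-based tagger in Python. The word lists are tiny stand-ins for real dictionaries; the behaviour on ambiguous and out-of-vocabulary tokens is the point of the sketch, not a serious implementation.

# A minimal sketch of dictionary-based word-level language ID.
english_words = {"what", "was", "your", "favourite", "moment", "at", "the", "concert"}
german_words = {"was", "war", "für", "euch", "der", "schönste", "moment"}

def dictionary_lookup(token):
    token = token.lower()
    in_en, in_de = token in english_words, token in german_words
    if in_en and in_de:
        return "ambiguous"   # 'was' and 'moment' appear in both lists
    if in_en:
        return "en"
    if in_de:
        return "de"
    return "unknown"         # misspellings ('wat', 'awesum') and transliterations land here

for token in ["was", "moment", "schönste", "wat", "awesum"]:
    print(token, "->", dictionary_lookup(token))
# was -> ambiguous, moment -> ambiguous, schönste -> de, wat -> unknown, awesum -> unknown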

There has been exciting work on word-level language identification for social media data, including a shared task in EMNLP 2014 [1], the annual FIRE shared task [2], as well as work on Hindi-English [3] and Dutch-Turkish [4] mixing.

Most previous work deals with pairwise language identification, i.e., the language pair is already known and the words in the input can only come from those two languages. With plenty of annotated training data, supervised machine learning models have performed extremely well under these conditions.
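As a rough illustration of this standard setup, the sketch below trains a per-word classifier over character n-gram features for one known language pair (English-Spanish here). The handful of training words is invented for the example; real systems rely on large annotated corpora and richer features such as surrounding context.

# A minimal supervised word-level classifier for a fixed language pair,
# using scikit-learn. Toy data only; the prediction call is shown purely
# to illustrate the API shape, not expected accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_words = ["started", "getting", "think", "allergic", "felt",
               "ahorita", "dentro", "algo", "como", "kemara"]
train_labels = ["en", "en", "en", "en", "en",
                "es", "es", "es", "es", "es"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # character 1- to 3-grams
    LogisticRegression(max_iter=1000),
)
model.fit(train_words, train_labels)
print(model.predict(["red", "lebante"]))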

However, such models have two glaring issues:

  1. They assume that the language pair in the input is already known and the words can only be from those languages. On Twitter, Facebook and other social media, no prior language information is available about posts.
  2. They use supervised machine learning models, which require plenty of annotated training data. Labelled data is scarce for most language pairs, particularly data with all the quirks of social media.

The Project Mélange team at MSR India is working towards a solution for these issues.

We aim to design a universal word-level language identification technique that works well for both code-mixed and monolingual social media data and requires no prior information about the languages in the input. While code-mixed training data exists only in minuscule quantities, labeled monolingual data is relatively easy to obtain. We leverage this monolingual data to train a model that can label code-mixed input as well.
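Without giving too much away, here is one plausible flavour of such an approach, sketched in Python: estimate a character n-gram distribution per language from monolingual words alone, then tag each token of a code-mixed input with the language whose model scores it highest. The word lists, smoothing, and n-gram order below are placeholders, and this is an illustration of the general idea rather than the actual Project Mélange model.

# Per-language character n-gram models trained on monolingual word lists,
# then used to tag each word of a code-mixed input. All data is illustrative.
import math
from collections import Counter

def char_ngrams(word, n=3):
    padded = "#" + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(corpus_words, n=3):
    counts = Counter(g for w in corpus_words for g in char_ngrams(w, n))
    return counts, sum(counts.values())

def score(word, model, n=3, vocab_size=5000):
    counts, total = model
    # Add-one smoothed log-probability of the word's character n-grams.
    return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in char_ngrams(word, n))

# Tiny word lists stand in for real monolingual corpora.
models = {
    "en": train(["what", "favourite", "moment", "concert", "started", "think"]),
    "es": train(["ahorita", "dentro", "algo", "como", "quemara", "levante"]),
}

for token in ["felt", "kemara", "algo"]:
    best = max(models, key=lambda lang: score(token, models[lang]))
    print(token, "->", best)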

Watch this space for more on that, soon!

References

[1] Solorio, Thamar, et al. “Overview for the first shared task on language identification in code-switched data.” Proceedings of The First Workshop on Computational Approaches to Code Switching. 2014.

[2] Sequiera, Royal, et al. “Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval.” 2015.

[3] Gella, Spandana, Kalika Bali, and Monojit Choudhury. “‘ye word kis lang ka hai bhai?’ Testing the Limits of Word level Language Identification.” 2014.

[4] Nguyen, Dong, and A. Seza Dogruoz. “Word level language identification in online multilingual communication.” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.


3 comments

  1. Good work! That would make for awesome applications.

    Technically speaking though, não creo que der erste Satz nijavaagalu code-mixing use karta hai 🙂 (roughly: ‘I don’t think the first sentence really uses code-mixing’)


  2. Won’t word-level language identification be almost impossible if only one word at a time is considered, without any extra information? The user’s history, other words in the tweet, and words in the comments are some of the basic extra information that may help.
    How about creating a dictionary (or word vectors) for misspelled (or shortened) words by looking at a large corpus, and then mapping them to actual words using edit distance? It seems to be a reasonable way of solving the problem.


    1. Hi Viveka,
      Word-level language ID does not necessarily use features from one word at a time; in fact, as you pointed out, knowing the language of a word in isolation may be difficult. The point here is that we assign a language tag to each word, and we can certainly use context to come up with that tag.

