Project Mélange

Monojit Choudhury, Microsoft Research India

Have you ever come across a Facebook post or a tweet in some unknown language with the translate button (the tiny globe on the top right corner), but when you click it, either it says “could not be translated” or the output seems to be a simple reproduction of the input words, maybe in a slightly different order?


This translation is powered by Microsoft’s Bing. Laugh you may at the system or its output, but to be fair, the example above is a particularly hard one to crack! The tweet is in Bengali, but not written in the native Bengali script (which looks something like এরকম); rather, it is a loose phonetic transcription of the words in the Roman script. Several English words (presstitute, real-life, free) and one Hindi word (nautonki, again in Roman script) have been generously embedded into the Bengali matrix. Ideally, Bing’s output would have been: “Well done. Now the people of the country will enjoy for free the real-life drama of the #presstitutes.”

Unfortunately, the system thought it was a German tweet, presumably due to der and korbe. Those two words were translated into English as NULL and baskets, respectively. It didn’t know what to do with the rest of the words, so they were reproduced verbatim. To better appreciate the system’s plight, imagine that I ask you to translate a Basque sentence written in the Lao script, sprinkled with words from Ojibwe and Önge. Bing doesn’t expect Bengali to be written in the Roman script. Nor does it expect a Bengali sentence to contain English and Hindi words. It doesn’t even have a translator for Bengali!

Interestingly, there is nothing unusual here, either in the user’s behavior or in the system’s response. Multilingual users from all over the world tweet, text or post comments in code-mixed languages, and often use Roman transliterations. NLP systems, on the other hand, almost always assume the input to be in a single language and in the native script.

Project Mélange aims at building NLP techniques and Natural Language Interfaces for mixed-script and mixed-code text and speech. We break this complex problem into three steps:

  1. Detection: Identifying whether the input speech or text is code-mixed, and if so, which languages are being mixed.
  2. Basic processing: Building basic NLP tools and resources that can process code-mixed language. The interesting question here is how we can leverage existing monolingual resources. If you know two languages, say Tamil and Swahili, you would understand a Tamil-Swahili code-mixed conversation even if you had never heard anybody mix these two languages before, wouldn’t you? By the same logic, can’t we make use of Tamil and Swahili monolingual NLP technologies to create a Tamahili system?
  3. End-user Applications: Building on the basic resources and tools for code-mixing, we would finally like to build end-user applications and experiences that cater to multilinguals’ linguistic behavior.
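
As a rough illustration of the detection step, here is a minimal sketch of word-level language tagging. It is not the project's actual method: the function names, the language codes and the tiny lexicons are assumptions made up for this example, with the lexicon entries borrowed from the words mentioned in the tweet discussion above. A realistic detector would need far larger resources and would have to handle spelling variation in Romanized text.

```python
from collections import Counter

# Toy lexicons (an assumption for illustration only): a handful of
# Romanized words taken from the tweet discussed in this post.
LEXICONS = {
    "bn": {"der", "korbe"},                      # Bengali, Roman script
    "hi": {"nautonki"},                          # Hindi, Roman script
    "en": {"free", "real-life", "presstitute"},  # English
}


def tag_tokens(text):
    """Label each whitespace-separated token with a language code or 'unk'."""
    tags = []
    for token in text.lower().split():
        word = token.strip("#.,!?")
        lang = next(
            (code for code, words in LEXICONS.items() if word in words),
            "unk",
        )
        tags.append((word, lang))
    return tags


def detect_code_mixing(text):
    """Return (is_mixed, languages) based on the tokens that were recognized."""
    counts = Counter(lang for _, lang in tag_tokens(text) if lang != "unk")
    return len(counts) > 1, sorted(counts)


print(detect_code_mixing("korbe free nautonki"))
```

The point of tagging per word rather than per sentence is that a sentence-level guess (such as "German") has no way to express that several languages are present at once.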

To build all this technology, we need lots of annotated language data; and of course, a sound theoretical understanding of when, how and why people code-mix.

Though there is very little computational work on code-mixing, linguists have been pondering these questions for more than half a century now. There are theories as well as empirical studies on which grammatical structures are allowable when two languages are mixed, what the motivations behind code-mixing are, whether teens code-mix more than their grannies, and so on. However, all these theories have been built upon small amounts of data obtained either from a small group of speakers or through self-reporting studies and interviews. People code-mix in speech; nobody writes books or legal articles or news in a code-mixed language. Therefore, it has been quite difficult to conduct large-scale studies of code-mixing. Thanks to the growing popularity of social media and other forms of computer-mediated communication, today we have large volumes of code-mixed text data. We are now using this data to understand the grammatical constraints, pragmatic functions and socio-linguistic issues of code-mixing.

Let’s hope that in the near future we will not have to make a conscious effort to avoid Spanish words or phrases in our English while speaking to Cortana or using Skype Translator, and that we will not have to toggle the language selection button to make keyboard prediction work when we move from French to Arabic.

