Sunayana Sitaram, Microsoft Research India
We interact with a variety of systems today that process natural language, both in the form of speech and text, including search engines, conversational agents and websites that automatically translate text from one language to another. Most of these Natural Language Processing (NLP) systems assume that the input is in a single language, and give you a response in a single language as well.
However, it has been found that in bilingual and multilingual societies, people rarely use a single language while communicating with each other. They often mix multiple languages within a conversation, within the same utterance or sometimes even within the same word. This is known as code-switching or code-mixing. The phenomenon is not restricted to speech: most social media posts by users in multilingual communities exhibit code-mixing as well. An additional complication arises when users borrow scripts while writing multilingual text, as in the case of Romanized Indian languages and Arabizi (Arabic written in the Latin script). If you grew up in India in the 90s, you probably remember the expression “Yehi hai right choice, Baby”, which is a mixture of Hindi and English!
What happens to language processing systems when they encounter code-mixing? Try entering the name of this blog (which is a mixture of three languages: Spanish, English and Kannada) into an online translation engine and see what happens! When you type “poco mix maadi” into Bing Translator, it identifies the language of the entire sentence as Spanish and translates it as “little mix maadi”. If you type in only “mix maadi”, it detects the language as English and leaves the word “maadi” untranslated. Google Translate does something similar, and translates “poco mix maadi” to “maadi little mix”.
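To make the difference between labelling a whole sentence and labelling each word concrete, here is a minimal Python sketch. It is only an illustration and is not how Bing Translator or Google Translate work internally: the tiny lexicon, the function names and the language codes are all made up for this example, and a real system would use trained models rather than a lookup table.

```python
from collections import Counter

# Toy, hand-made lexicon for illustration only (an assumption of this sketch).
# Codes: "es" = Spanish, "en" = English, "kn" = Kannada.
TOY_LEXICON = {
    "poco": "es",
    "mix": "en",
    "little": "en",
    "maadi": "kn",
}


def tag_words(text):
    """Label each whitespace-separated token with a language code, or 'unk'."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "unk")) for tok in text.split()]


def sentence_language(text):
    """Force a single label on the whole input: the most common known word label."""
    labels = [lang for _, lang in tag_words(text) if lang != "unk"]
    return Counter(labels).most_common(1)[0][0] if labels else "unk"


# Word-level tags keep all three languages visible.
print(tag_words("poco mix maadi"))
# A single sentence-level label necessarily hides two of them.
print(sentence_language("poco mix maadi"))
```

The word-level tagger keeps one label per word, so a downstream system could in principle handle each word in its own language. The sentence-level function has to collapse the input to one language, which is the kind of assumption that breaks down on code-mixed text.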
Beyond improving the performance of systems that encounter code-mixing, it is also fascinating to learn how, why and for how long human beings have been code-mixing. Code-mixing is not restricted to a single language pair or part of the world, so it is interesting to see where in the world it occurs and which languages get mixed. This blog will cover these topics and many more in future posts. We will also talk about new developments in code-mixing research and systems.
If you are interested in knowing more about how we define code-mixing, you can take a look at this paper:
Kalika Bali, Jatin Sharma, Monojit Choudhury, and Yogarshi Vyas, “I am borrowing ya mixing?” An Analysis of English-Hindi Code Mixing in Facebook, in Proceedings of the First Workshop on Computational Approaches to Code Switching, Association for Computational Linguistics, Doha, Qatar, October 2014.