How Do We Characterize Code-mixing?

Gayatri Bhat, Microsoft Research India

If you are a frequent reader of this blog, you have a fair idea of what code-mixing is. In case you aren’t, it is the practice of going back and forth between two languages in the course of just one ek hi conversation, as jaise I’m doing right now abhi.

Here’s a curious thing about code-mixing. Most people seem to agree that you cannot arbitrarily alternate between languages while uttering a sentence. For instance, if you speak both Hindi and English with a co-worker, you might tell him,

Office aane ke raste main I fell into a basket of machhli.

(On the way to office, I fell into a basket of fish.)

But you definitely will not say –

Office aane ke on the way I giri into a basket of fish.

It just sounds odd.

So, we might say that there are rules for code-mixing. In that case, what are they? Must code-mixers know all the rules? People who code-mix usually do so easily, without speaking slowly so that they can decide when to switch languages and definitely without trying to check whether they’re sticking to the rules. It turns out that unlike, say, writing sonnets, code-mixing is one of those things you can accomplish without consciously knowing the rules you’re using to do it.

There are people, though, who are still trying to figure out the rules for code-mixing: some because they’re just curious, others because they’re trying to teach computers how to participate in a code-mixed conversation. (Machines don’t seem to think code-mixing is any easier than writing sonnets. Tougher, perhaps.) The frustrating bit is that nobody seems to be coming up with the correct rules. For every rule that’s made, there’s a perfectly good code-mixed sentence that violates it.

One major dispute is regarding the roles of the two (or more, but for now, let’s take two) languages being mixed. Some say that one language is in charge and only lets the other peek in here and there, while others maintain that the two languages are equal partners. This is an important debate, because it determines what sort of rules we’re looking for.

Consider the first alternative – Every sentence is originally in a single language (the superhero, or the matrix language). While code-mixing, we essentially pull out clumps of one or more words from this sentence and plug in fragments from the other language (the sidekick, or the embedded language). A fragment might have fewer or more words than the clump it replaces, and might be ordered differently, but always conveys the same information as the original clump. One may not, of course, pull out bits of these sidekick-clumps and replace them with hero-clumps. The catch, though, is that one cannot do this exercise with any group of words one fancies. Take, for instance, the sentence –

Mere kurte pe maine doodh gira diya.

(I spilt milk on my kurta.)

English-Hindi code-mixers might swap ‘mere kurte pe’ out in favour of its English counterpart –

On my kurta maine doodh gira diya.

However, one will not do this with ‘pe‘ to say –

Mere kurta on maine doodh gira diya.

In this paradigm, the matrix-embedding model, the ‘rules’ for code-switching would indicate what sorts of word-groups one can swap out. The example above illustrates a couple of rules suggested in this paper, which say that it is alright to ‘swap’ or ‘embed’ a noun phrase (‘mere kurte pe’), but not a lone postposition (‘pe’). We should note here that not being able to swap postpositions does not mean that you will never encounter a Hindi postposition in an English-hero sentence. It only means that any Hindi postposition in the sentence was swapped in as part of a particular sort of group, perhaps a noun phrase.
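To make the swapping idea concrete, here is a minimal Python sketch. It is only a toy: the chunking of the sentence, the category labels (which follow the wording of this post, not a formal grammar) and the EMBEDDABLE set are all assumptions made for illustration, not rules taken from the paper.

```python
# Categories that may be swapped in as a whole (an assumption for this toy example).
EMBEDDABLE = {"NP"}


def embed(chunks, index):
    """Swap chunk `index` for its embedded-language equivalent, if its category allows it."""
    phrase, category, equivalent = chunks[index]
    if category not in EMBEDDABLE:
        raise ValueError(f"cannot embed a lone {category}: {phrase!r}")
    words = [chunk[0] for chunk in chunks]
    words[index] = equivalent
    return " ".join(words)


# Each chunk: (matrix-language text, category label, embedded-language equivalent).
coarse = [
    ("mere kurte pe", "NP", "on my kurta"),
    ("maine doodh gira diya", "CLAUSE", "I spilt milk"),
]
fine = [
    ("mere", "POSSESSIVE", "my"),
    ("kurte", "NOUN", "kurta"),
    ("pe", "POSTPOSITION", "on"),
    ("maine doodh gira diya", "CLAUSE", "I spilt milk"),
]

print(embed(coarse, 0))  # on my kurta maine doodh gira diya

try:
    embed(fine, 2)
except ValueError as err:
    print(err)  # the lone postposition is refused
```

The whole phrase goes through, while the lone ‘pe’ is rejected. A real set of rules would of course need a proper parse of the sentence rather than hand-made chunks.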

The other idea, which is based on both languages being equal partners, goes like this – To start off with, you have two copies of the same sentence, one in each of two languages. In order to code-mix, you start off with a slice of one of these sentences. Now place a slice of the other sentence next to it. Now another of the first. And so on, until you’ve got a code-mixed sentence that says the same thing as either of the initial single-language sentences.

A simple example in Hindi and English again. You’ve got these two –

Agar main kahoon, mujhe tumse mohobat hai, meri bas yehi chaahat hai, toh kya kahogi?

If I say I am in love with you, that this is my only wish, then what will you say?

We slice and layer to come up with –

Agar main kahoon, I am in love with you, meri bas yehi chaahat hai, toh what will you say?

This model proves a lot trickier to use than the first one. (Check it out here) The ‘rules’ here must ensure that the code-mixed sentence doesn’t include the same fragment twice, once in each language. They also mustn’t allow words that were next to each other in the original sentence to be at opposite ends of the new one, just because we sliced the sentence right between these two words. We need rules to check whether every part of the code-mixed sentence sounds grammatical according to at least one of the two languages, and whether… oh, all sorts of things, far too many things.
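The slicing and layering can be sketched in a few lines of Python. The alignment of the two sentences into five matching segments is done by hand here and is an assumption; the function name mix is ours too. Because every slot is filled exactly once and the slots stay in order, no fragment shows up twice and neighbouring words are never flung apart, but nothing in this sketch checks whether the result sounds grammatical.

```python
from itertools import product

# Hand-aligned segments of the two single-language sentences (the alignment is an assumption).
hindi = ["Agar main kahoon,", "mujhe tumse mohobat hai,", "meri bas yehi chaahat hai,", "toh", "kya kahogi?"]
english = ["If I say", "I am in love with you,", "that this is my only wish,", "then", "what will you say?"]


def mix(choices):
    """For each aligned slot, take the Hindi ('H') or the English ('E') segment."""
    return " ".join(h if c == "H" else e for c, h, e in zip(choices, hindi, english))


print(mix("HEHHE"))
# Agar main kahoon, I am in love with you, meri bas yehi chaahat hai, toh what will you say?

# Every way of filling the five slots, including the two single-language sentences.
candidates = [mix(c) for c in product("HE", repeat=len(hindi))]
print(len(candidates))  # 32
```

Most of the 32 candidates would sound odd to a real code-mixer, which is exactly why this model needs so many extra rules.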

Definitely not something one could work out in one’s head while talking at normal speed, right? 😉


Pronunciation Modeling for Code Mixing

Sunayana Sitaram, Microsoft Research India

Have you ever wanted to have your texts and WhatsApp messages read out to you? Have you ever used a foreign word while using a system like Cortana, only to find that it does not recognize words that are not in the language it is expecting to hear? Speech Recognition and Synthesis of code-mixed utterances is a very challenging problem. Most speech processing systems are designed to be used with a single language. Moreover, people may pronounce words differently when they are speaking multiple languages at the same time, which may confuse such systems.

Let us look at the problem of reading out a recipe on a popular Hindi recipe website, Nishamadhulika. Here’s the link to the recipe, if you want to take a look.

Now as you can see, most of the text in the recipe description is in Hindi, written in the native script (Devanagari). This should be fairly easy for a Hindi Text to Speech system to read out to the user. However, we see some English words in the title, and also numbers in the Roman script to denote quantities.  If you scroll down to the comments, you see that many of the comments are in Hindi, but are not written in native script. Let us look at a couple of comments.

“bahut yammi recipe thi nisha ji ye soup mere baby ne jo ki 15 month ka hai bahut shok se piya hai”

“Nisha ji musroom soup bht acha bna h.mje cooking bhi bht achi lgti h.bus ye btao is e without cream healthi kaise bnaya ja skta h ans jrur dena”

We find that there are many English words in these sentences (“soup”, “yammi”, which is “yummy”, “15 month”, “baby”, “cooking” etc.). We also find that users don’t always follow a standard way of transliterating Hindi into Romanized script. For example, in the first sentence, the word “बहुत” is written as “bahut”, while in the second one, it is shortened to “bht”. Similarly, the word “है” is written as “hai” in the first comment, and only as “h” in the second one!

Now imagine you are a Text to Speech system and you need to read out such text! You need to identify what language each word is in, rectify spelling mistakes, expand contractions and then figure out how you are actually going to pronounce each word. This is made even harder by the fact that the training data for most Text to Speech systems today consists only of clean, well-written, single-language data.
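As a rough illustration of the "expand contractions" step, here is a small Python sketch. The VARIANTS table is hand-made from the spellings pointed out above and is purely an assumption; a real system would need a far larger mapping, probably learned from data, and would also have to decide the language of each word before expanding it.

```python
import re

# Hand-made lookup built from the comments above (an assumption, not a real resource).
VARIANTS = {"bht": "bahut", "h": "hai", "yammi": "yummy"}


def normalize(text):
    """Lowercase, split into letter and digit runs, and expand known shortened spellings."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [VARIANTS.get(token, token) for token in tokens]


print(normalize("soup bht acha bna h.mje"))
# ['soup', 'bahut', 'acha', 'bna', 'hai', 'mje']
```

Notice that ‘bna’ and ‘mje’ pass through untouched because the table does not know them, and that blindly expanding a lone ‘h’ would go wrong in a sentence where it is an English initial.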

In a future post, we will talk more about how we make Text to Speech systems capable of synthesizing mixed language text. Meanwhile, you can read this paper:

‘Speech Synthesis of Code Mixed Text’, Sunayana Sitaram and Alan W Black, in Proceedings of LREC 2016, Portoroz, Slovenia

Word appropriation: To be, or not to be… formalized?

Andrew Cross, Microsoft Research India

English-adapted words, especially around technology use, are increasingly common in other languages. For instance, to tweet in Spanish is often called “tuitear”, taking the original English word and adding a Spanish grammatical ending. Similarly, “le hardware” or “le software” are used in French to describe the rather obvious English-counterparts (for other interesting Franglais phrases, check out an amusing list here). Some words, like “computer”, “bus”, or “phone/mobile” are almost universally understood around the world.

While widespread adoption of these words gives a certain uniformity and intelligibility to global conversations, there are those who lament this trend and think it undermines the original language and therefore culture. Language institutions like the Academie Française or the Real Academia Española regularly wrestle with what words to embrace from other languages, versus promoting more local renderings of the same idea (one example the director of the Real Academia Española gives is his preference to use “auto-photo” instead of “selfie”). One clear goal of defining a unified dictionary of a language as geographically dispersed as Spanish, a majority language in over 20 countries, is not only to protect the language from being infiltrated by outside influence, but also to build an identity and cultural unity for speakers and countries that use the language.

And so emerges a funny paradox that is by no means limited to the human interpretation of “language” – on the one hand you have an organic blend and evolution of language through increasing global travel, business, and media. On the other, you have a need or desire to canonize certain aspects of language both for utility (one needs to be understood), and for preserving a certain culture associated with a language. At one extreme, wholesale adoption of outside languages could lead to the ultimate demise of a language. But at the other extreme, the outright rejection of any word deemed “foreign” undermines the very nature of language dynamics.

Which brings the conversation back to technology. The global world is much more connected which presents more opportunities for languages to interact and evolve. With the near immediacy for interchange available through the internet, one can expect many of these new blends and linguistic evolutions to brew locally, but make their international debut online. How will this debate play out as words like “selfie” or “friend request” or “email” become increasingly common in online forums? Perhaps more importantly for bodies governing the words that are officially part of a language, can (or should) such standardizing efforts keep up with the rapid spread of foreign words in the new era of the internet?

Code-Mixed Language Identification

Shruti Rijhwani, Microsoft Research India

RT @HappelStadion: What was your favourite 1D moment at the concert? Was war für euch der schönste Moment? Tweet us!

If you know both English and German, you probably figured out what two languages this tweet uses. Either way, you likely realized that there isn’t just one language in the tweet.

We recognize languages that we are familiar with. The task is second nature to humans – is it just as easy for machines? Why do machines need to identify languages in the first place?

Most Natural Language Processing (NLP) techniques are designed for specific languages. That makes language identification a necessary first step for machines to derive meaning from human language. Computational language identification research began in 1995. Initially, language identification was performed at the document-level, that is, whole documents were assumed to contain a single language. This was only logical as back in 1995, most digital documents had professional or literary content. We didn’t expect to encounter multiple languages within documents!

However, sentence-level language identification (i.e. one language label per sentence) soon became important to understand comments, short posts and similar user-generated data on the internet. Where does code-mixing fit in, though? Let us look at this Spanish-English tweet.

@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro! 😮 Then I started getting all red, I think im allergic a algo

Even sentence-level language identification wouldn’t work when data is code-mixed, as mixing can be intra-sentential! Before we begin to process code-mixing, we need to recognise all languages present in the data. One language per sentence simply isn’t enough – word-level language identification is necessary.

Code-mixing is inherently informal and generally occurs in casual communication. The phenomenon traditionally occurred in spoken conversation. Now, we have speech-like informal conversation happening on social media and find plenty of code-mixed data in the text form as well.

How do we identify the languages in social media data? Is it as simple as looking up words in dictionaries of various languages? Going back to our example tweet,

RT @HappelStadion: What *was* your favourite 1D *moment* at the concert? *Was* war für euch der schönste *Moment*? Tweet us!

There are words (‘was’, ‘moment’) that belong to both languages! And this tweet is grammatically sound, with correct spelling. What about tweets like,

Wat n awesum movie it wazzzz!

Our language dictionaries wouldn’t identify misspelled words (‘wat’), shortened words (‘awesum’) and exaggerated words (‘wazzzz’).

Not to mention the problem of transliteration. Several languages that are not formally written in the Roman/Latin script are often typed phonetically using the Roman script that computer keyboards generally feature.

Modi ke speech se India inspired ho gaya #namo

Although Hindi uses the Devanagari script, this Hindi-English tweet has transliterated Hindi words.

Looking up words in a dictionary might work in several cases. But the example tweets we’ve just looked at are not outliers! A large amount of social media content isn’t written with perfect grammar and spelling. Solutions to word-level language ID must counter these problems as well.
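Here is a small Python sketch of the dictionary lookup idea, just to show where it breaks. The two word lists are tiny and hand-picked from the example tweets (an assumption; real dictionaries would be far larger), and the tag names are ours.

```python
# Tiny hand-picked word lists (assumptions for this example only).
ENGLISH = {"what", "was", "your", "favourite", "moment", "at", "the", "concert",
           "tweet", "us", "movie", "it", "awesome"}
GERMAN = {"was", "war", "für", "euch", "der", "schönste", "moment"}


def tag(word):
    """Return 'en' or 'de' if exactly one dictionary has the word, else 'ambiguous' or 'unknown'."""
    w = word.lower()
    langs = [name for name, vocab in (("en", ENGLISH), ("de", GERMAN)) if w in vocab]
    if len(langs) == 1:
        return langs[0]
    return "ambiguous" if langs else "unknown"


for word in ["favourite", "schönste", "was", "Moment", "wat", "awesum", "wazzzz"]:
    print(word, tag(word))
```

The lookup handles ‘favourite’ and ‘schönste’, but ‘was’ and ‘Moment’ come out as ambiguous, and the creative spellings are simply unknown.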

There has been exciting work on word-level language identification for social media data, including a shared task at EMNLP 2014 [1], the annual FIRE shared task [2], as well as work on Hindi-English [3] and Dutch-Turkish [4] mixing.

Most previous work deals with pairwise language identification i.e., the language pair is already known, and words in the input can only be from those languages. With plenty of annotated training data, supervised machine learning models have performed extremely well under these conditions.

However, such models have two glaring issues –

  1. They assume that the language pair in the input is already known and the words can only be from those languages. On Twitter, Facebook and other social media, no prior language information is available about posts.
  2. They use supervised machine learning models, which require plenty of annotated training data. Labelled data is scarce for most language pairs, particularly data with all the quirks of social media.

The Project Mélange team at MSR India is working towards a solution for these issues.

We aim to design a universal word-level language identification technique that works well for both code-mixed and monolingual social media data. It would require no prior information about the languages in the input. Although we have a minuscule amount of code-mixed training data, obtaining labeled monolingual data is relatively much simpler. We leverage this monolingual data and train a model that can label code-mixed input as well.
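To give a feel for "train on monolingual data, label code-mixed input", here is a toy Python sketch. It is not a description of the team's model: the character-bigram scoring with add-one smoothing is a stand-in technique chosen for brevity, and the two tiny word lists are made up for the example.

```python
import math
from collections import Counter

# Made-up monolingual word lists (Romanized Hindi and English); real data would be far larger.
MONOLINGUAL = {
    "hi": ["ke", "se", "ho", "gaya", "hai", "bahut", "mere", "maine", "kya"],
    "en": ["speech", "inspired", "the", "what", "will", "you", "say", "with", "this"],
}


def bigrams(word):
    """Character bigrams of the word, with ^ and $ marking its start and end."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words_by_lang):
    """Count character bigrams per language; also return the joint bigram vocabulary size."""
    counts = {lang: Counter(b for w in words for b in bigrams(w))
              for lang, words in words_by_lang.items()}
    vocab = set().union(*counts.values())
    return counts, len(vocab)


def label(word, counts, vocab_size):
    """Pick the language whose add-one-smoothed bigram model gives the word the highest score."""
    def score(lang):
        total = sum(counts[lang].values())
        return sum(math.log((counts[lang][b] + 1) / (total + vocab_size))
                   for b in bigrams(word))
    return max(counts, key=score)


counts, vocab_size = train(MONOLINGUAL)
tweet = "Modi ke speech se India inspired ho gaya"
print([(word, label(word, counts, vocab_size)) for word in tweet.split()])
```

Each word of the code-mixed tweet gets its own label even though the model never saw a mixed sentence. With so little data the labels for unseen words such as names are unreliable, which is one reason the real problem is hard.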

Watch this space for more on that, soon!


[1] Solorio, Thamar, et al. “Overview for the first shared task on language identification in code-switched data.” Proceedings of The First Workshop on Computational Approaches to Code Switching. 2014.

[2] Sequiera, Royal, et al. “Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval.”

[3] Gella, Spandana, Kalika Bali, and Monojit Choudhury. “‘ye word kis lang ka hai bhai?’ Testing the Limits of Word level Language Identification.” (2014).

[4] Nguyen, Dong, and A. Seza Dogruoz. “Word level language identification in online multilingual communication.” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.

Code-mixing in Music

Royal Sequiera, Microsoft Research India

It is a truth universally acknowledged, that Bollywood songs, as we know them, are replete with code-mixed text. But what about songs in other multilingual communities? Let us investigate this curious phenomenon in one of our favourite songs, The Ketchup Song:

Friday night it’s party time
feeling ready looking fine,
viene diego rumbeando,
with the magic in his eyes
checking every girl in sight,
grooving like he does the mambo
he’s the man allí en la disco,
playing sexy feeling hotter,
he’s the king bailando el ritmo ragatanga,
and the DJ that he knows well,
on the spot always around twelve,
plays the mix that diego mezcla con la salsa,
y la baila and he dances y la canta
many think it’s brujeria,
how he comes and disappears,
every move will hypnotize you,
some will call it chuleria,
others say that it’s the real,
rastafari afrogitano

The song mixes Spanish and English, and its first two stanzas are shown above. As you might have already observed, the English and Spanish parts of the song are written in green and red font respectively. The frequent alternation of languages, as in the line “y la baila and he dances y la canta”, adds a speech-like flavour to the song and makes it sound more natural. Perhaps, if the song were monolingual, it wouldn’t have been as catchy.

Here are other language pairs that have been mixed in popular songs.

Spanish – English:

Tengo Tu Love, Sie7e
Feliz Navidad, José Feliciano
La Isla Bonita, Madonna
Las Ketchup
Cha Cha, Chelo
Livin’ la Vida Loca, Ricky Martin
Macarena, Los Del Rio
The cup of life, Ricky Martin
Before the Next Teardrop Falls, Freddy Fender

French – English:

Que sera sera, Doris Day
Aicha, Outlandish
Lady marmalade, LaBelle
Michelle, Beatles
Eyes Without a Face, Billy Idol
Hold on Tight, ELO
Ma Belle Amie, The Tee Set

Italian – English:

Underwater love, Smoke City
Volare, Bobby Rydell

German – English:

Sailor, Lolita
Wooden heart

Portuguese – English:

Corcovado, Stan Getz / Astrud Gilberto

Arabic – English:

Desert Rose, Sting

Do you remember the song Circle of Life from the movie The Lion King? Did you know that the song is actually code-mixed, and that its first stanza is not some gibberish intended to confuse you? Now then, here’s a challenge for you: can you guess the language used in the first stanza of the song?

I hope you enjoyed listening to our “mixed” songs and that, by now, you have picked one of them as your favourite! Do you know of any other code-mixed songs that you would like to share with us? Why wait, then? Please post them in the comment section below!

Functions of code-mixing

Rafiya Begum, Microsoft Research India

So far on this blog, we have seen many examples of code-mixing that occurs frequently among bilingual and multilingual communities. A very interesting question is why people mix two languages (code-mix) or switch between two languages (code-switch).

I have come across school kids whose mother tongue is different from the language they use with friends at school (their second language). Since they spend so much time with friends, they mix their mother tongue with that language even when they are back home. This continues as they grow up, since they picked up the habit of mixing or switching between languages at an early age. It sometimes has a hilarious effect when they bring words or phrases from another language into their native language even though equivalent expressions exist in the native language. See the examples of Hyderabadi Urdu-English sentences below:

ten baje hai (It is ten o’clock)

tum log double meaning dialog bolke sata rai  (You are irritating me by saying double meaning dialogs)

In the above examples, ‘ten’ and the phrase ‘double meaning dialog’ are from English and the rest is in Urdu.

People change their speech in order to fit in with the person they are talking with. They code-switch when they have to talk about a particular topic or to change the context or to convey the identity of the person who is code-switching. People code-switch to show formality or their attitude to the listener and when certain words are lacking in a language they get those words from another language.

Here is an example of code-switching between Hyderabadi Urdu and Telangana Telugu.

Urdu                            Telugu

arey suno miyaa… naaku  ii  pani  iiyaradey?

(Hey, listen Mister …. Can’t you give this work to me?)

In the above example, the speaker switches from Hyderabadi Urdu to Telangana Telugu within the same conversation. The speaker uses Urdu to grab the listener’s attention or to address the listener, and then switches to Telugu to express the actual matter. The location where the switch between the two languages happens is called the switch point, and it carries a lot of significance: the purpose behind the switch is carried by the switch point. Switch points represent various code-switching categories. Looking at code-switched Hindi-English tweets on Twitter, we observed the following categories, which are divided into two types: Pragmatic and Structural.


Fact to opinion switch is where speakers switch languages when they move from expressing facts to expressing opinions. They switch to another language to reinforce a positive or a negative sentiment/opinion expressed in a language. In Sarcasm, a simple opinion about a particular topic is expressed in one language, followed by a switch to another language to express a sarcastic opinion about the same topic. Quotations, which are often employed to express opinions, are stated in the original language, while the context or fact might be stated in another language. Cause-Effect switch is used to express the reason or cause in one language and the effect in another. In Translation, a fact or opinion expressed in one language is translated to the other language, perhaps for reinforcement or wider reach of the tweet.


In Reporting-Speech, we observed that Hindi is often used to quote real conversations which took place in Hindi, while the reporting part is in English. The conversations may be in quotes, and the reporting may contain specific English cue words such as ‘say’, ‘ask’, ‘think’, ‘tell’, etc. Other examples of code-switching are the use of wishes, greetings and forms of address in one language (usually English), followed by a switch to another.

If you want to know more about the functions of code-switching, you can refer to the following paper:

Begum, R., Bali, K., Choudhury, M., Rudra, K., Ganguly, N. (2016). Functions of Code-Switching in Tweets: An Annotation Scheme and Some Initial Experiments. In Proc. LREC.