Word appropriation: To be, or not to be… formalized?

Andrew Cross, Microsoft Research India

English-adapted words, especially around technology use, are increasingly common in other languages. For instance, to tweet in Spanish is often called “tuitear”, taking the original English word and adding a Spanish grammatical ending. Similarly, “le hardware” and “le software” are used in French for their rather obvious English counterparts (for other interesting Franglais phrases, check out an amusing list here). Some words, like “computer”, “bus”, or “phone/mobile”, are almost universally understood around the world.

While widespread adoption of these words gives a certain uniformity and intelligibility to global conversations, there are those who lament this trend and think it undermines the original language and therefore the culture. Language institutions like the Académie française or the Real Academia Española regularly wrestle with which words to embrace from other languages and when to promote more local renderings of the same idea (one example the director of the Real Academia Española gives is his preference for “auto-photo” over “selfie”). One clear goal of defining a unified dictionary of a language as geographically dispersed as Spanish, a majority language in over 20 countries, is not only to protect the language from being infiltrated by outside influence, but also to build an identity and cultural unity for the speakers and countries that use the language.

And so emerges a funny paradox that is by no means limited to the human interpretation of “language” – on the one hand you have an organic blend and evolution of language through increasing global travel, business, and media. On the other, you have a need or desire to canonize certain aspects of language both for utility (one needs to be understood), and for preserving a certain culture associated with a language. At one extreme, wholesale adoption of outside languages could lead to the ultimate demise of a language. But at the other extreme, the outright rejection of any word deemed “foreign” undermines the very nature of language dynamics.

Which brings the conversation back to technology. The world is far more connected than it used to be, which presents more opportunities for languages to interact and evolve. With the near-immediate interchange available through the internet, one can expect many of these new blends and linguistic evolutions to brew locally, but make their international debut online. How will this debate play out as words like “selfie” or “friend request” or “email” become increasingly common in online forums? Perhaps more importantly for bodies governing the words that are officially part of a language, can (or should) such standardizing efforts keep up with the rapid spread of foreign words in the new era of the internet?

Code-Mixed Language Identification

Shruti Rijhwani, Microsoft Research India

RT @HappelStadion: What was your favourite 1D moment at the concert? Was war für euch der schönste Moment? Tweet us!

If you know both English and German, you probably figured out what two languages this tweet uses. Either way, you likely realized that there isn’t just one language in the tweet.

We recognize languages that we are familiar with. The task is second nature to humans – is it just as easy for machines? Why do machines need to identify languages in the first place?

Most Natural Language Processing (NLP) techniques are designed for specific languages. That makes language identification a necessary first step for machines to derive meaning from human language. Computational language identification research began in 1995. Initially, language identification was performed at the document level, that is, whole documents were assumed to contain a single language. This was only logical: back in 1995, most digital documents had professional or literary content. We didn’t expect to encounter multiple languages within documents!

However, sentence-level language identification (i.e. one language label per sentence) soon became important to understand comments, short posts and similar user-generated data on the internet. Where does code-mixing fit in, though? Let us look at this Spanish-English tweet.

@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro! 😮 Then I started getting all red, I think im allergic a algo

Even sentence-level language identification wouldn’t work when data is code-mixed, as mixing can be intra-sentential! Before we begin to process code-mixing, we need to recognise all languages present in the data. One language per sentence simply isn’t enough – word-level language identification is necessary.

Code-mixing is inherently informal and generally occurs in casual communication. The phenomenon traditionally occurred in spoken conversation. Now, we have speech-like informal conversation happening on social media and find plenty of code-mixed data in the text form as well.

How do we identify the languages in social media data? Is it as simple as looking up words in dictionaries of various languages? Going back to our example tweet,

RT @HappelStadion: What *was* your favourite 1D *moment* at the concert? *Was* war für euch der schönste *Moment*? Tweet us!

There are words (‘was’, ‘moment’) that belong to both languages! And this tweet is grammatically sound, with correct spelling. What about tweets like,

Wat n awesum movie it wazzzz!

Our language dictionaries wouldn’t identify misspelled words (‘wat’), shortened words (‘awesum’) and exaggerated words (‘wazzzz’).
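The limits of plain dictionary lookup can be seen in a small sketch. The word lists below are tiny, hand-picked stand-ins for real lexicons, and the names LEXICONS and lookup_languages are made up for illustration; this is not the system described later in this post.

```python
# Toy word lists (illustrative only); a real system would load full lexicons.
ENGLISH = {"what", "was", "your", "favourite", "moment", "at", "the",
           "concert", "tweet", "us", "an", "movie", "it"}
GERMAN = {"was", "war", "für", "euch", "der", "schönste", "moment"}

LEXICONS = {"en": ENGLISH, "de": GERMAN}


def lookup_languages(word):
    """Return every language whose word list contains the word (sorted)."""
    token = word.lower().strip("?!.,:")
    return sorted(lang for lang, words in LEXICONS.items() if token in words)


def tag_by_dictionary(text):
    """Pair each whitespace-separated token with its candidate languages."""
    return [(token, lookup_languages(token)) for token in text.split()]
```

With these toy lists, ‘Was’ and ‘Moment?’ come back with both labels, while ‘Wat’, ‘awesum’ and ‘wazzzz!’ come back with none: lookup alone can be both ambiguous and silent.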

Not to mention the problem of transliteration. Several languages that are not formally written in the Roman/Latin script are often typed phonetically in the Roman script that computer keyboards generally feature.

Modi ke speech se India inspired ho gaya #namo

Although Hindi uses the Devanagari script, this Hindi-English tweet has transliterated Hindi words.

Looking up words in a dictionary might work in several cases. But the example tweets we’ve just looked at are not outliers! A large amount of social media content isn’t written with perfect grammar and spelling. Solutions to word-level language ID must counter these problems as well.

There has been exciting work on word-level language identification for social media data, including a shared task at EMNLP 2014 [1], the annual FIRE shared task [2], as well as work on Hindi-English [3] and Dutch-Turkish [4] mixing.

Most previous work deals with pairwise language identification, i.e., the language pair is already known, and words in the input can only be from those languages. With plenty of annotated training data, supervised machine learning models have performed extremely well under these conditions.

However, such models have two glaring issues –

  1. They assume that the language pair in the input is already known and the words can only be from those languages. On Twitter, Facebook and other social media, no prior language information is available about posts.
  2. They use supervised machine learning models, which require plenty of annotated training data. Labelled data is scarce for most language pairs, particularly data with all the quirks of social media.

The Project Mélange team at MSR India is working towards a solution for these issues.

We aim to design a universal word-level language identification technique that works well for both code-mixed and monolingual social media data. It would require no prior information about the languages in the input. Although we have only a minuscule amount of code-mixed training data, labeled monolingual data is much easier to obtain. We leverage this monolingual data to train a model that can label code-mixed input as well.
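To make the idea of learning from monolingual data concrete, here is a minimal sketch, not the team's actual model. It assumes a very simple technique, character-trigram counts with add-one smoothing, and uses two tiny made-up word lists (TOY_DATA) in place of real monolingual corpora. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical monolingual word lists; real training data would be far larger.
TOY_DATA = {
    "en": ["the", "and", "then", "think", "felt", "started",
           "getting", "red", "all", "what", "movie", "awesome"],
    "es": ["como", "dentro", "algo", "por", "ahora", "quemara",
           "levante", "nada", "pero", "esta", "muy", "casa"],
}


def char_ngrams(word, n=3):
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(monolingual_words, n=3):
    """Count character n-grams separately for each language's word list."""
    return {lang: Counter(g for w in words for g in char_ngrams(w, n))
            for lang, words in monolingual_words.items()}


def label_word(word, models, n=3):
    """Pick the language giving the word the highest smoothed log-probability."""
    vocab_size = len(set().union(*models.values()))

    def score(counts):
        total = sum(counts.values())
        return sum(math.log((counts[g] + 1) / (total + vocab_size))
                   for g in char_ngrams(word, n))

    return max(models, key=lambda lang: score(models[lang]))


def label_text(text, models):
    """Label every whitespace-separated token of a (possibly mixed) input."""
    return [(token, label_word(token, models)) for token in text.split()]
```

Because the model looks at character patterns rather than whole words, it can label words it never saw in training, such as ‘thinking’ or ‘quemando’. With so little data it still stumbles on short words like ‘me’ or ‘si’, which is one reason real systems need much more data and, typically, context from neighbouring words.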

Watch this space for more on that, soon!


[1] Solorio, Thamar, et al. “Overview for the first shared task on language identification in code-switched data.” Proceedings of The First Workshop on Computational Approaches to Code Switching. 2014.

[2] Sequiera, Royal, et al. “Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval.”

[3] Gella, Spandana, Kalika Bali, and Monojit Choudhury. ““ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification.” (2014).

[4] Nguyen, Dong, and A. Seza Dogruoz. “Word level language identification in online multilingual communication.” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.

Code-mixing in Music

Royal Sequiera, Microsoft Research India

It is a truth universally acknowledged, that Bollywood songs, as we know them, are replete with code-mixed text. But what about songs in other multilingual communities? Let us investigate this curious phenomenon in one of our favourite songs, The Ketchup Song:

Friday night it’s party time
feeling ready looking fine,
viene diego rumbeando,
with the magic in his eyes
checking every girl in sight,
grooving like he does the mambo
he’s the man allí en la disco,
playing sexy feeling hotter,
he’s the king bailando el ritmo ragatanga,
and the DJ that he knows well,
on the spot always around twelve,
plays the mix that diego mezcla con la salsa,
y la baila and he dances y la canta
many think it’s brujeria,
how he comes and disappears,
every move will hypnotize you,
some will call it chuleria,
others say that it’s the real,
rastafari afrogitano

The song is Spanish-English code-mixed, and its first two stanzas are shown above. As you might have already observed, the English and Spanish parts of the song are written in green and red font respectively. The frequent alternation of languages, as in the line “y la baila and he dances y la canta”, adds a speech-like flavour to the song and makes it sound more natural. Perhaps, if the song were monolingual, it wouldn’t have been as catchy.

Here are other language pairs that have been mixed in popular songs.

Spanish – English:

Tengo Tu Love, Sie7e – https://www.youtube.com/watch?v=qaShyjYClD8
Feliz Navidad, José Feliciano – https://www.youtube.com/watch?v=RTtc2pM1boE
La Isla Bonita, Madonna – https://www.youtube.com/watch?v=7YzW1nMB9fk
The Ketchup Song, Las Ketchup – https://www.youtube.com/watch?v=AMT698ArSfQ
Cha Cha, Chelo – https://www.youtube.com/watch?v=L4XViIs3CnQ
Livin’ la Vida Loca, Ricky Martin – https://www.youtube.com/watch?v=p47fEXGabaY
Macarena, Los Del Rio – https://www.youtube.com/watch?v=XiBYM6g8Tck
The Cup of Life, Ricky Martin – https://www.youtube.com/watch?v=oP2D-km89pA
Before the Next Teardrop Falls, Freddy Fender – https://www.youtube.com/watch?v=ay5ciplY4Pg

French – English:

Que sera sera, Doris Day – https://www.youtube.com/watch?v=xZbKHDPPrrc
Aicha, Outlandish – https://www.youtube.com/watch?v=f0nFTdKlKLw
Lady Marmalade, LaBelle – https://www.youtube.com/watch?v=t4LWIP7SAjY
Michelle, Beatles – https://www.youtube.com/watch?v=VnuQMsEV8dA
Eyes Without a Face, Billy Idol – https://www.youtube.com/watch?v=9OFpfTd0EIs
Hold on Tight, EOL – https://www.youtube.com/watch?v=UkekqVPIc2M
Ma Belle Amie, The Tee Set – https://www.youtube.com/watch?v=Bioah3q7JOk

Italian – English:

Underwater love, Smoke City – https://www.youtube.com/watch?v=HuLjsW8XhY4
Volare, Bobby Rydell – https://www.youtube.com/watch?v=MprNWH625aw

German – English:

Sailor, Lolita – https://www.youtube.com/watch?v=62muVcjv2a8
Wooden heart – https://www.youtube.com/watch?v=Hlbu6SsjlSE

Portuguese – English:

Corcovado, Stan Getz / Astrud Gilberto – https://www.youtube.com/watch?v=DMX6E68qJAg

Arabic – English:

Desert Rose, Sting – https://www.youtube.com/watch?v=C3lWwBslWqg

Do you remember the song Circle of Life from the movie The Lion King? Did you know that the song is actually code-mixed, and that its first stanza is not some gibberish intended to confuse you? Now then, here’s a challenge for you: can you guess the language used in the first stanza of the song?

I hope you enjoyed listening to our “mixed” songs, and that by now you have picked one of them as your favourite! Do you know of any other code-mixed songs that you would like to share with us? Why wait, then? Please post them in the comment section below!

Functions of code-mixing

Rafiya Begum, Microsoft Research India

So far on this blog, we have seen many examples of code-mixing that occurs frequently among bilingual and multilingual communities. A very interesting question is why people mix two languages (code-mix) or switch between two languages (code-switch).

I have come across school kids whose mother tongue is different from the language (a second language) they use with friends at school. Since they spend so much time with friends, they mix their mother tongue with that language even when they are back home. This continues as they grow up, since they learnt to mix or switch between languages at an early age. It sometimes has a hilarious effect when they use words or phrases from another language in their native language even though equivalent expressions exist in their native language. See the examples of Hyderabadi Urdu-English sentences below:

ten baje hai (It is ten o’clock)

tum log double meaning dialog bolke sata rai (You are irritating me by saying double meaning dialogs)

In the above examples, ten and double meaning dialog (a phrase) are from English and the rest is in Urdu.

People change their speech in order to fit in with the person they are talking with. They code-switch to talk about a particular topic, to change the context, or to convey their own identity. People also code-switch to show formality or their attitude to the listener, and when certain words are lacking in one language, they borrow those words from another.

Here is an example of code-switching between Hyderabadi Urdu and Telangana Telugu.

Urdu                            Telugu

arey suno miyaa… naaku  ii  pani  iiyaradey?

(Hey, listen Mister …. Can’t you give this work to me?)

In the above example, the speaker switches from Hyderabadi Urdu to Telangana Telugu within the same conversation. The speaker uses Urdu to address the listener and grab their attention, and then switches to Telugu to express the actual matter. The location at which the speaker switches between two languages is called the switch point, and it carries a lot of significance. In other words, the switch point carries information about the purpose of the switch. Switch points represent various code-switching categories. Looking at code-switched Hindi-English tweets on Twitter, we observed the following categories, which fall into two types: Pragmatic and Structural.


A Fact-to-Opinion switch is where speakers change languages as they move from expressing facts to expressing opinions. They also switch to another language to reinforce a positive or negative sentiment/opinion already expressed in one language. In Sarcasm, a simple opinion about a particular topic is expressed in one language, and the speaker switches to another to express a sarcastic opinion about the same topic. Quotations, which are often employed to express opinions, are stated in the original language, while the context or fact might be stated in another language. A Cause-Effect switch is used to express the reason or cause in one language and the effect in another. In Translation, a fact or opinion expressed in one language is translated to the other language, perhaps for reinforcement or wider reach of the tweet.


In Reporting-Speech, we observed that Hindi is often used to quote real conversations which took place in Hindi, while the reporting part is in English. The conversations may be in quotes, and the reporting may contain specific English cue words such as ‘say’, ‘ask’, ‘think’, ‘tell’, etc. Other examples of code-switching are the use of wishes, greetings and forms of address in one language (usually English), followed by a switch to another.
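Once every word carries a language label, the switch points discussed above are easy to locate mechanically. The sketch below assumes the labels are already available; the labels “ur” and “te” for the Urdu-Telugu example are assigned by hand here, and the function name is made up for illustration.

```python
def switch_points(tagged_words):
    """Indices of words whose language differs from that of the previous word."""
    return [i for i in range(1, len(tagged_words))
            if tagged_words[i][1] != tagged_words[i - 1][1]]


# The Hyderabadi Urdu / Telangana Telugu example, labelled by hand.
example = [("arey", "ur"), ("suno", "ur"), ("miyaa", "ur"),
           ("naaku", "te"), ("ii", "te"), ("pani", "te"), ("iiyaradey", "te")]

print(switch_points(example))  # [3]: the switch happens at "naaku"
```

Finding where a switch happens is the easy part; deciding which of the categories above it belongs to requires looking at what is said on either side of the switch point.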

If you want to know more about the functions of code-switching, you can refer to the following paper:

Begum, R., Bali, K., Choudhury, M., Rudra, K., Ganguly, N. (2016). Functions of Code-Switching in Tweets: An Annotation Scheme and Some Initial Experiments. In Proc. LREC.

Borrowing Ya Mixing? (Part 1)

Kalika Bali, Microsoft Research India

An English speaker might go to a café and order an egg sandwich made with egg, mustard and mayonnaise. If she stops to think, she might realize that she has the French language to thank for the words café and mayonnaise. However, unless she is a linguistics major with a specific interest in English etymology, she might be surprised that the word mustard, that quintessential ingredient of English cooking, is also of French origin.

A villager from the heart of Hindi-speaking rural India might likewise not realize that when he goes to the station and buys a ticket for the bus, he is actually using English vocabulary.

The historical linguist Hans Hock says that “languages do not exist in vacuum”. Languages and dialects which are in contact or co-exist are continuously being influenced by each other. The extent and the type of influence can vary depending on many socio-political, cultural and linguistic factors, and can range from the borrowing of sounds and words to, sometimes, even entire syntactic structures.

So, when an English-French bilingual says, “Je vais à Nice pour le week-end”, is he code-mixing or is “week-end” a borrowing from English into French?

Even linguists cannot agree on “other language embeddings”. Is it true code-mixing? What is nonce-word borrowing? Do these differ from loanwords that are integrated into the native vocabulary and grammatical structure?

Many linguists believe that loanwords start out as code-mixing or nonce borrowing, but by repeated use and diffusion across the language they gradually convert to native vocabulary and acquire the characteristics of the “borrowing” language. In spoken forms, this would be the adaptation of the loanword to the sound system and the grammar of the native language, that is, phonological and morpho-syntactic convergence.

The problem with this is that in many cases a native accent might be mistaken for phonological convergence, and a morpho-syntactic marking might not be readily visible.

For example, most Hindi speakers of English would pronounce an English alveolar /d/ as a retroflex because an alveolar plosive is not a part of the Hindi phonology. However, this does not imply that the said English word has become a part of the native vocabulary.

Similarly, consider the two sentences:

“sab artists ko bulayaa hai” (all artists have been called)

“sab artist kal aayenge” (all artists will come tomorrow)

In the first sentence, the English inflection –s on the word artist marks it as plural, but in the second, the plural is marked only on the Hindi verb.

Does this imply that the first case is code-mixing and the second a case of borrowing, given that both the forms and the structures are equally acceptable and common in Hindi?

It is not easy to assign these categories, especially for single words, without looking at diachronic data, and the distinction itself is inherently fuzzy. In general, it is believed that there is a continuum between code-mixing and loan vocabulary: the two ends are clearly distinguishable, but the vast majority of cases in the middle, especially single words, are hard to disambiguate.

In a future post, we will look at what this continuum might look like and one possible way we can try to distinguish true code-mixing from loanwords.

In the meantime, you can look at some earlier studies on borrowing, mixing, and what lies in between.

  1. Frederic Field. 2002. Linguistic borrowing in bilingual contexts. Amsterdam: Benjamins.
  2. Carol Myers-Scotton. 2002. Contact linguistics: Bilingual encounters and grammatical outcomes. Oxford University Press.
  3. Pieter Muysken. 2000. Bilingual speech: A typology of code-mixing. Cambridge University Press.
  4. Shana Poplack, D. Sankoff, and C. Miller. 1988. The social correlates and linguistic processes of lexical borrowing and assimilation. Linguistics 26:47-104.
  5. Shana Poplack and Nathalie Dion. 2012. “Myths and facts about loanword development.” in Language Variation and Change 24, 3.
  6. David Sankoff, Shana Poplack, and Swathi Vanniarajan. 1990. The case of the nonce loan in Tamil. Language Variation and Change, 2 (1990), 71-101. Cambridge University Press.

Hinglish conversational chemistry and Bollywood

Indrani Medhi Thies, Microsoft Research India

These days Bollywood songs, as we know, have gone international. There’s an increasing number of English phrases being used in these songs. From Shahrukh’s “You are my Chammak Challo” in Ra.One, to Aamir’s “Love is a waste of time, pyaar vyaar waste of time” in PK, many leading men are mouthing code-mixed Hinglish songs to woo their women. If the older superstars are doing it, can the younger brigade be far behind? Remember Shahid Kapoor’s “Saree ke fall sa kabhi match kiya re, Kabhi chhod diya dil kabhi catch kiya re” in R…Rajkumar?

They say art imitates life. As the numbers of young, aspirational, upwardly mobile movie goers steadily rise, Bollywood film makers are leaving no stone unturned to make their music and movies contemporary. And one of the most effective ways of making things relatable for the young crowds seems to be having the protagonists speak and sing in code-mixed Hinglish.

One of the best recent examples of this is the warm, fuzzy and fresh ‘Jab We Met’ starring Shahid Kapoor as Aditya, and Kareena Kapoor as Geet. Aditya is a depressed young man getting through a miserable breakup when he meets an extremely vivacious, and talkative Geet on a Delhi-bound train. The story is the usual boy meets girl, headed to different destinations, and eventually falling in love. But the treatment of the story is so fantastic, the screenplay so refreshing, that you want to watch the movie over and over again. The characters are very easy to relate to, they look and talk (in Hinglish) like you and me. Initially the dejected Aditya gets irritated by Geet’s overenthusiasm. But as their long, eventful journey progresses, Aditya starts opening up and eventually falls in love with Geet thereby redeeming himself. And as the credits roll you’re left wondering, “I’ve travelled by train so many times. Wish my journey was as exciting and I met someone as interesting”.

What ‘Jab We Met’ really has going for it is its Hinglish dialogue, inspired by everyday conversations. Sample this, where the vivacious, self-loving Geet, who’s dating another guy, Anshuman, acknowledges her chemistry with Aditya…

[Geet] Tum us type ke ho ki – dekho, meri shaadi Anshuman se hone waali hain and all that. But vo agar meri life mein nahin hota, toh you never know, shayad main bhi tumse pat jaati. Just imagine.

(English translation: [Geet] You’re the type who…see, I’m about to be married to Anshuman and all that. But if Anshuman was not in my life, then you never know, even I could have fallen for you. Just imagine.)

[Aditya] Tum apne aap ko bahut pasand karti ho na?

(You like yourself a lot, don’t you?)

[Geet] Bahut… Main apni favourite hoon.

(A lot…I’m my own favourite.)

Or the times when the dejected Aditya is starting to open up to Geet and her ways…

[Aditya] Tu original piece hai

(You are an original piece)

[Aditya] Tum hamesha aise hi bakwas karti ho ya aaj koi special occasion hain?

(Do you always talk nonsense like this or is today some special occasion?)

Kareena, as Geet, and Shahid, as Aditya, are excellent in their performances, but the real hero(ine) of the film is the conversational chemistry between the two. And that is made possible by their natural, believable exchanges. Geet and Aditya talk like you and me, the many Hinglish speakers, who generously mix English words in our Hindi speech and vice versa, in everyday lives. We may not realize it, but like Geet and Aditya, we do it almost all the time.

If you’re interested in anything even remotely related to code-mixing or are just looking for something to uplift your spirits, go watch ‘Jab We Met’! Kyonki kya pata, even if you don’t run into your own Geet, you might feel inspired to take an exciting journey, jo shayad zindagi bhar yaad rahe.

(Because who knows, even if you don’t run into your own Geet, you might feel inspired to take an exciting journey, which will remain with you for a lifetime).

Project Mélange

Monojit Choudhury, Microsoft Research India

Have you ever come across a Facebook post or a tweet in some unknown language with the translate button (the tiny globe on the top right corner), but when you click it, either it says “could not be translated” or the output seems to be a simple reproduction of the input words, maybe in a slightly different order?


This translation is powered by Microsoft’s Bing. Laugh you may at the system or its output, but to be fair, the above one is a particularly hard one to crack! The tweet is in Bengali, but not written in the native Bengali script (which looks something like এরকম); rather it is a loose phonetic transcription of the words in the Roman script. Several English words (presstitute, real-life, free) and one Hindi word, nautonki (again in Roman script), have been generously embedded into the Bengali matrix. And we would wish Bing’s output were: “Well done. Now the people of the country will enjoy for free the real-life drama of the #presstitutes.”

Unfortunately, the system thought it was a German tweet, presumably due to der and korbe. Those words were translated to English, to NULL and baskets respectively. It didn’t know what to do with the rest of the words, so they were reproduced verbatim. To better appreciate the system’s plight, imagine that I ask you to translate a Basque sentence written in the Lao script, sprinkled with words from Ojibwe and Önge. Bing doesn’t expect Bengali to be written in the Roman script. Nor does it expect a Bengali sentence to contain English and Hindi words. It doesn’t even have a translator for Bengali!

Interestingly, there is nothing unusual here, either in the user’s behavior or in the system’s response. Multilingual users from all over the world tweet, text or post comments in code-mixed languages, and often use Roman transliterations. On the other hand, NLP systems almost always assume the input to be in a single language and in the native script.

Project Mélange aims at building NLP techniques and Natural Language Interfaces for mixed-script and mixed-code text and speech. We break this complex problem into three steps:

  1. Detection: Identifying that the input speech or text is code-mixed, and if so, what are the languages that are being mixed.
  2. Basic processing: Building basic NLP tools and resources that can process code-mixed language. The interesting question here is how we can leverage existing monolingual resources. If you know two languages, say Tamil and Swahili, you would understand a Tamil-Swahili code-mixed conversation even if you had never heard anybody mix these two languages before, wouldn’t you? By the same logic, can’t we make use of Tamil and Swahili monolingual NLP technologies to create a Tamahili system?
  3. End-user Applications: Building on the basic resources and tools for code-mixing, we would finally like to build end-user applications and experiences that cater to multilinguals’ linguistic behavior.
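As a rough illustration of the detection step only, the sketch below assumes that word-level language labels have already been produced by some earlier component. The function name and the 20% threshold are arbitrary choices made for this example, not part of Project Mélange.

```python
from collections import Counter


def detect_mixing(word_labels, min_share=0.2):
    """Report which languages are present and whether the input is code-mixed.

    A language counts as present only if it covers at least min_share of the
    words, so that a single stray or mislabelled word does not flip the result.
    """
    counts = Counter(word_labels)
    total = sum(counts.values())
    languages = sorted(lang for lang, c in counts.items()
                       if c / total >= min_share)
    return {"languages": languages, "code_mixed": len(languages) > 1}
```

For instance, labels such as ["en", "en", "es", "es", "es"] would be reported as code-mixed English-Spanish, while nine “en” labels and a single “es” would be treated as monolingual English under this threshold.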

To build all this technology, we need lots of annotated language data; and of course, a sound theoretical understanding of when, how and why people code-mix.

Though there is very little computational work on code-mixing, linguists have been pondering these questions for more than half a century now. There are theories as well as empirical studies on which grammatical structures are allowable when two languages are mixed, what the motivations behind code-mixing are, whether teens code-mix more than their grannies, and so on. However, all these theories have been built upon small amounts of data obtained from either a small group of speakers or through self-reporting studies and interviews. People code-mix in speech; nobody writes books or legal articles or news in a code-mixed language. Therefore, it has been quite difficult to conduct large-scale studies of code-mixing. Thanks to the growing popularity of social media and other forms of computer-mediated communication, today we have large volumes of code-mixed text data. We are now using this data to understand the grammatical constraints, pragmatic functions and socio-linguistic issues of code-mixing.

Let’s hope that in the near future we will not have to make that conscious effort to not use any Spanish words or phrases in our English while speaking to Cortana or using Skype translator, and we will not have to toggle the language selection button in order to make the keyboard prediction work when we move from French to Arabic.