Pronunciation Modeling for Code Mixing

Sunayana Sitaram, Microsoft Research India

Have you ever wanted to have your texts and WhatsApp messages read out to you? Have you ever used a foreign word with a system like Cortana, only to find that it does not recognize words that are not in the language it is expecting to hear? Speech recognition and synthesis of code-mixed utterances are very challenging problems. Most speech processing systems are designed to be used with a single language. Moreover, people may pronounce words differently when they are speaking multiple languages at the same time, which can confuse such systems.

Let us look at the problem of reading out a recipe on Nishamadhulika, a popular Hindi recipe website. Here’s the link to the recipe, if you want to take a look: http://nishamadhulika.com/1064-creamy-mushroom-soup-recipe.html

Now, as you can see, most of the text in the recipe description is in Hindi, written in the native script (Devanagari). This should be fairly easy for a Hindi Text to Speech system to read out to the user. However, we see some English words in the title, and also numerals written in the Roman script to denote quantities. If you scroll down to the comments, you see that many of them are in Hindi but are not written in the native script. Let us look at a couple of comments.

“bahut yammi recipe thi nisha ji ye soup mere baby ne jo ki 15 month ka hai bahut shok se piya hai”

“Nisha ji musroom soup bht acha bna h.mje cooking bhi bht achi lgti h.bus ye btao is e without cream healthi kaise bnaya ja skta h ans jrur dena”

We find that there are many English words in these sentences (“soup”, “yammi”, which is “yummy”, “15 month”, “baby”, “cooking” etc.). We also find that users don’t always follow a standard way of transliterating Hindi into Romanized script. For example, in the first sentence, the word “बहुत” is written as “bahut”, while in the second one, it is shortened to “bht”. Similarly, the word “है” is written as “hai” in the first comment, and only as “h” in the second one!

Now imagine that you are a Text to Speech system and you need to read out such text! You need to identify which languages the words are in, rectify spelling mistakes, expand contractions, and then figure out how you are actually going to pronounce each word. This is made even harder by the fact that the training data for most Text to Speech systems today consists only of clean, well-written, single-language data.
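To make these steps concrete, here is a toy Python sketch of such a front end. The word lists and shorthand expansions below are invented for illustration; a real system would need full lexicons, a trained word-level language classifier, and a pronunciation model.

```python
# Toy sketch of the front-end steps a code-mixed TTS system must take:
# expand shorthand spellings, then guess each word's language.
# All word lists below are tiny, invented examples.

HINDI_WORDS = {"bahut", "hai", "thi", "ye", "mere", "ne", "jo", "ki"}
ENGLISH_WORDS = {"soup", "recipe", "baby", "cooking", "yummy", "month"}
# Shorthand spellings like those in the Romanized comments above.
EXPANSIONS = {"bht": "bahut", "h": "hai", "yammi": "yummy"}

def normalize_and_tag(text):
    """Expand shorthand, then tag each word with a guessed language."""
    tagged = []
    for word in text.lower().split():
        word = EXPANSIONS.get(word, word)   # rectify/expand spellings
        if word in HINDI_WORDS:
            lang = "hi"
        elif word in ENGLISH_WORDS:
            lang = "en"
        else:
            lang = "unk"                    # a real system needs a classifier
        tagged.append((word, lang))
    return tagged

print(normalize_and_tag("bht yammi recipe thi"))
# [('bahut', 'hi'), ('yummy', 'en'), ('recipe', 'en'), ('thi', 'hi')]
```

Even this toy version shows how normalization and language identification feed into each other: “bht” can only be tagged as Hindi after it has been expanded to “bahut”.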

In a future post, we will talk more about how we make Text to Speech systems capable of synthesizing mixed language text. Meanwhile, you can read this paper:

‘Speech Synthesis of Code Mixed Text’, Sunayana Sitaram and Alan W Black, in Proceedings of LREC 2016, Portoroz, Slovenia

Word appropriation: To be, or not to be… formalized?

Andrew Cross, Microsoft Research India

English-adapted words, especially around technology use, are increasingly common in other languages. For instance, to tweet in Spanish is often called “tuitear”, taking the original English word and adding a Spanish grammatical ending. Similarly, “le hardware” or “le software” are used in French to describe their rather obvious English counterparts (for other interesting Franglais phrases, check out an amusing list here). Some words, like “computer”, “bus”, or “phone/mobile”, are almost universally understood around the world.

While the widespread adoption of these words gives a certain uniformity and intelligibility to global conversations, there are those who lament this trend and think it undermines the original language, and therefore the culture. Language institutions like the Académie Française or the Real Academia Española regularly wrestle with which words to embrace from other languages versus promoting more local renderings of the same idea (one example the director of the Real Academia Española gives is his preference for “auto-photo” over “selfie”). One clear goal of defining a unified dictionary for a language as geographically dispersed as Spanish, a majority language in over 20 countries, is not only to protect the language from being infiltrated by outside influence, but also to build an identity and cultural unity for the speakers and countries that use it.

And so emerges a funny paradox that is by no means limited to the human interpretation of “language” – on the one hand you have an organic blend and evolution of language through increasing global travel, business, and media. On the other, you have a need or desire to canonize certain aspects of language both for utility (one needs to be understood), and for preserving a certain culture associated with a language. At one extreme, wholesale adoption of outside languages could lead to the ultimate demise of a language. But at the other extreme, the outright rejection of any word deemed “foreign” undermines the very nature of language dynamics.

Which brings the conversation back to technology. The world is much more connected now, which presents more opportunities for languages to interact and evolve. With the near immediacy of interchange available through the internet, one can expect many of these new blends and linguistic evolutions to brew locally but make their international debut online. How will this debate play out as words like “selfie”, “friend request” or “email” become increasingly common in online forums? Perhaps more importantly for the bodies governing the words that are officially part of a language: can (or should) such standardizing efforts keep up with the rapid spread of foreign words in the new era of the internet?

Code-Mixed Language Identification

Shruti Rijhwani, Microsoft Research India

RT @HappelStadion: What was your favourite 1D moment at the concert? Was war für euch der schönste Moment? Tweet us!

If you know both English and German, you probably figured out which two languages this tweet uses. Either way, you likely realized that there isn’t just one language in the tweet.

We recognize languages that we are familiar with. The task is second nature to humans – is it just as easy for machines? Why do machines need to identify languages in the first place?

Most Natural Language Processing (NLP) techniques are designed for specific languages. That makes language identification a necessary first step for machines to derive meaning from human language. Computational language identification research began in the 1990s. Initially, language identification was performed at the document level, that is, whole documents were assumed to contain a single language. This was only logical: at the time, most digital documents had professional or literary content, and we didn’t expect to encounter multiple languages within a document!

However, sentence-level language identification (i.e. one language label per sentence) soon became important to understand comments, short posts and similar user-generated data on the internet. Where does code-mixing fit in, though? Let us look at this Spanish-English tweet.

@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro! 😮 Then I started getting all red, I think im allergic a algo

Even sentence-level language identification wouldn’t work when data is code-mixed, as mixing can be intra-sentential! Before we begin to process code-mixing, we need to recognise all languages present in the data. One language per sentence simply isn’t enough – word-level language identification is necessary.

Code-mixing is inherently informal and generally occurs in casual communication. The phenomenon traditionally occurred in spoken conversation. Now, we have speech-like informal conversation happening on social media and find plenty of code-mixed data in the text form as well.

How do we identify the languages in social media data? Is it as simple as looking up words in dictionaries of various languages? Going back to our example tweet,

RT @HappelStadion: What *was* your favourite 1D *moment* at the concert? *Was* war für euch der schönste *Moment*? Tweet us!

There are words (‘was’, ‘moment’) that belong to both languages! And this tweet is grammatically sound, with correct spelling. What about tweets like,

Wat n awesum movie it wazzzz!

Our language dictionaries wouldn’t identify misspelled words (‘wat’), shortened words (‘awesum’) and exaggerated words (‘wazzz’).

Not to mention the problem of transliteration. Several languages that are not formally written in the Roman/Latin script are often phonetically typed using the Roman script that computer keyboards generally feature.

Modi ke speech se India inspired ho gaya #namo

Although Hindi uses the Devanagari script, this Hindi-English tweet has transliterated Hindi words.

Looking up words in a dictionary might work in several cases. But the example tweets we’ve just looked at are not outliers! A large amount of social media content isn’t written with perfect grammar and spelling. Solutions to word-level language ID must counter these problems as well.
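To see the problem concretely, here is a toy dictionary lookup in Python. The word lists are deliberately tiny, invented from the example tweets above; the point is how lookup behaves, not coverage.

```python
# Toy dictionary lookup for word-level language ID, showing why it
# breaks down: some words belong to both languages, and noisy
# spellings belong to neither. The word lists are invented examples.

ENGLISH = {"what", "was", "your", "favourite", "moment", "at", "the", "concert"}
GERMAN = {"was", "war", "für", "euch", "der", "schönste", "moment"}

def lookup(word):
    """Return the set of languages a word could belong to."""
    langs = {lang for lang, vocab in [("en", ENGLISH), ("de", GERMAN)]
             if word in vocab}
    return langs or {"unk"}

for w in ["was", "moment", "war", "wat", "awesum"]:
    print(w, sorted(lookup(w)))
```

Running this, “was” and “moment” come back ambiguous between English and German, while the noisy spellings “wat” and “awesum” come back unknown, exactly the two failure modes described above.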

There has been exciting work on word-level language identification for social media data, including a shared task in EMNLP 2014 [1], the annual FIRE shared task [2], as well as work on Hindi-English [3] and Dutch-Turkish [4] mixing.

Most previous work deals with pairwise language identification, i.e., the language pair is already known, and words in the input can only be from those two languages. With plenty of annotated training data, supervised machine learning models have performed extremely well under these conditions.
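As a rough illustration of how such supervised models work (this is a simplified sketch, not the actual systems from the papers cited above), character n-gram statistics are a common signal: Romanized Hindi and English words tend to favour different letter sequences. The tiny training lists below are invented; real systems learn from thousands of labelled words.

```python
# Sketch of pairwise word-level language ID using character bigram
# statistics with add-one smoothing. Training lists are toy examples.
import math
from collections import Counter

def char_ngrams(word, n=2):
    padded = f"#{word}#"                      # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(words):
    counts = Counter(g for w in words for g in char_ngrams(w))
    return counts, sum(counts.values())

def score(word, model):
    counts, total = model
    # Add-one smoothing so unseen n-grams don't zero out the score.
    return sum(math.log((counts[g] + 1) / (total + len(counts)))
               for g in char_ngrams(word))

hindi = ["bahut", "acha", "hai", "nahi", "kya", "achi", "raha"]
english = ["very", "good", "healthy", "cream", "cooking", "answer"]
models = {"hi": train(hindi), "en": train(english)}

def identify(word):
    """Pick the language whose n-gram model scores the word highest."""
    return max(models, key=lambda lang: score(word, models[lang]))

print(identify("achha"))  # 'hi'
```

Note how even the unseen spelling “achha” is handled: its bigrams (“ac”, “ch”, “ha”) are frequent in the Hindi training words, so the Hindi model scores it higher, which is precisely why such models cope with noisy spellings better than dictionaries do.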

However, such models have two glaring issues:

  1. They assume that the language pair in the input is already known and the words can only be from those languages. On Twitter, Facebook and other social media, no prior language information is available about posts.
  2. They use supervised machine learning models, which require plenty of annotated training data. Labelled data is scarce for most language pairs, particularly data with all the quirks of social media.

The Project Mélange team at MSR India is working towards a solution for these issues.

We aim to design a universal word-level language identification technique that works well for both code-mixed and monolingual social media data. It would require no prior information about the languages in the input. Although we have a minuscule amount of code-mixed training data, obtaining labeled monolingual data is relatively much simpler. We leverage this monolingual data and train a model that can label code-mixed input as well.

Watch this space for more on that, soon!

References

[1] Solorio, Thamar, et al. “Overview for the first shared task on language identification in code-switched data.” Proceedings of The First Workshop on Computational Approaches to Code Switching. 2014.

[2] Sequiera, Royal, et al. “Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval.”

[3] Gella, Spandana, Kalika Bali, and Monojit Choudhury. ““ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification.” (2014).

[4] Nguyen, Dong, and A. Seza Dogruoz. “Word level language identification in online multilingual communication.” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.

Code-mixing in Music

Royal Sequiera, Microsoft Research India

It is a truth universally acknowledged that Bollywood songs, as we know them, are replete with code-mixed text. But what about songs in other multilingual communities? Let us investigate this curious phenomenon in one of our favourite songs, The Ketchup Song:

Friday night it’s party time
feeling ready looking fine,
viene diego rumbeando,
with the magic in his eyes
checking every girl in sight,
grooving like he does the mambo
he’s the man allí en la disco,
playing sexy feeling hotter,
he’s the king bailando el ritmo ragatanga,
and the DJ that he knows well,
on the spot always around twelve,
plays the mix that diego mezcla con la salsa,
y la baila and he dances y la canta
many think it’s brujeria,
how he comes and disappears,
every move will hypnotize you,
some will call it chuleria,
others say that it’s the real,
rastafari afrogitano

The song is Spanish-English code-mixed, and its first two stanzas are shown above. (On the original web page, the English and Spanish portions were marked in green and red respectively.) The frequent alternation of languages, as in the line “y la baila and he dances y la canta”, adds a speech-like flavour to the song and makes it sound more natural. Perhaps, if the song were monolingual, it wouldn’t have been as catchy.

Here are other language pairs that have been mixed in popular songs.

Spanish – English:

Tengo Tu Love, Sie7e – https://www.youtube.com/watch?v=qaShyjYClD8
Feliz Navidad, José Feliciano – https://www.youtube.com/watch?v=RTtc2pM1boE
La Isla Bonita, Madonna – https://www.youtube.com/watch?v=7YzW1nMB9fk
Las Ketchup – https://www.youtube.com/watch?v=AMT698ArSfQ
Cha Cha, Chelo – https://www.youtube.com/watch?v=L4XViIs3CnQ
Living La Vida Loca, Ricky Martin – https://www.youtube.com/watch?v=p47fEXGabaY
Macarena, Los Del Rio – https://www.youtube.com/watch?v=XiBYM6g8Tck
The cup of life, Ricky Martin – https://www.youtube.com/watch?v=oP2D-km89pA
Before the Next Teardrop Falls, Freddy Fender – https://www.youtube.com/watch?v=ay5ciplY4Pg

French – English:

Que sera sera, Doris Day – https://www.youtube.com/watch?v=xZbKHDPPrrc
Aicha, Outlandish –  https://www.youtube.com/watch?v=f0nFTdKlKLw
Lady marmalade, LaBelle – https://www.youtube.com/watch?v=t4LWIP7SAjY
Michelle, Beatles – https://www.youtube.com/watch?v=VnuQMsEV8dA
Eyes Without a Face, Billy Idol – https://www.youtube.com/watch?v=9OFpfTd0EIs
Hold on Tight, EOL – https://www.youtube.com/watch?v=UkekqVPIc2M
Ma Belle Amie, The Tee Set – https://www.youtube.com/watch?v=Bioah3q7JOk

Italian – English:

Underwater love, Smoke City – https://www.youtube.com/watch?v=HuLjsW8XhY4
Volare, Bobby Rydell – https://www.youtube.com/watch?v=MprNWH625aw

German – English:

Sailor, Lolita – https://www.youtube.com/watch?v=62muVcjv2a8
Wooden heart – https://www.youtube.com/watch?v=Hlbu6SsjlSE

Portuguese – English:

Corcovado, Stan Getz / Astrud Gilberto – https://www.youtube.com/watch?v=DMX6E68qJAg

Arabic – English:

Desert Rose, Sting – https://www.youtube.com/watch?v=C3lWwBslWqg

Do you remember the song Circle of Life from the movie Lion King? Did you know that the song is actually code-mixed and that the first stanza of the song is not some gibberish intended to confuse you! Now then, here’s a challenge for you: can you guess the language used in the first stanza of the song?

I hope you enjoyed listening to our “mixed” songs, and that by now you have picked one of them as your favourite! Do you know of any other code-mixed songs that you would like to share with us? Why wait, then? Please post them in the comments section below!

Functions of code-mixing

Rafiya Begum, Microsoft Research India

So far on this blog, we have seen many examples of code-mixing that occurs frequently among bilingual and multilingual communities. A very interesting question is why people mix two languages (code-mix) or switch between two languages (code-switch).

I have come across school kids whose mother tongue is different from the medium of instruction (their second language) used among friends at school. Since they spend so much time with friends, they code-mix their mother tongue and the language used among friends even when they are back home. This continues even when they grow up, since they learnt this habit of mixing or switching between languages at an early age. It sometimes has a hilarious effect when they insert words or phrases from another language into their native language, even when their native language has perfectly good translations for those expressions. See the examples of Hyderabadi Urdu-English sentences below:

ten baje hai (It is ten o’clock)

tum log double meaning dialog bolke sata rai  (You are irritating me by saying double meaning dialogs)

In the above examples, “ten” and the phrase “double meaning dialog” are from English, and the rest is in Urdu.

People change their speech in order to fit in with the person they are talking to. They code-switch when they have to talk about a particular topic, to change the context, or to convey the identity of the person who is code-switching. People also code-switch to show formality or their attitude towards the listener, and when certain words are lacking in one language, they borrow those words from another.

Here is an example of code-switching between Hyderabadi Urdu and Telangana Telugu.

arey suno miyaa… (Hyderabadi Urdu) naaku ii pani iiyaradey? (Telangana Telugu)

(Hey, listen, Mister… Can’t you give this work to me?)

In the above example, the speaker switches from Hyderabadi Urdu to Telangana Telugu within the same conversation. The speaker uses Urdu to grab the listener’s attention or address the listener, and then switches to Telugu to express the actual matter. The location of the switch between the two languages is called the switch point, and it carries a lot of significance: the purpose behind the switching is signalled at the switch point. Switch points represent various code-switching categories. Looking at code-switched Hindi-English tweets on Twitter, we observed the following categories, which fall into two types: pragmatic and structural.

Pragmatic:

In a fact-to-opinion switch, speakers switch languages when they move from expressing facts to expressing opinions, or switch to another language to reinforce a positive or negative sentiment/opinion already expressed. In sarcasm, a straightforward opinion about a topic is expressed in one language, with a switch to the other language for a sarcastic opinion about the same topic. Quotations, which are often employed to express opinions, are stated in the original language, while the context or fact may be stated in another. A cause-effect switch expresses the reason or cause in one language and the effect in another. In translation, a fact or opinion expressed in one language is repeated in the other, perhaps for reinforcement or for the wider reach of the tweet.

Structural:

In reported speech, we observed that Hindi is often used to quote real conversations that took place in Hindi, while the reporting part is in English. The quoted conversation may be in quotation marks, and the reporting part may contain specific English cue words such as ‘say’, ‘ask’, ‘think’, ‘tell’, etc. Other examples of code-switching include wishes, greetings and forms of address in one language (usually English), followed by a switch to the other.

If you want to know more about the functions of code-switching, you can refer to the following paper:

Begum, R., Bali, K., Choudhury, M., Rudra, K., Ganguly, N. (2016). Functions of Code-Switching in Tweets: An Annotation Scheme and Some Initial Experiments. In Proc. LREC.

Borrowing Ya Mixing? (Part 1)

Kalika Bali, Microsoft Research India

An English speaker might go to a café and order an egg sandwich made with egg, mustard and mayonnaise. If she stops to think, she might realize that she has the French language to thank for the words café and mayonnaise. However, unless she is a linguistics major with a specific interest in English etymology, she might be surprised to learn that the word mustard, that quintessential ingredient of English cooking, is also of French origin.

A villager from the heart of Hindi-speaking rural India might similarly not realize that when he goes to the station and buys a ticket for the bus, he is actually using English vocabulary.

The historical linguist Hans Hock says that “languages do not exist in a vacuum”. Languages and dialects which are in contact or co-exist continuously influence each other. The extent and type of influence can vary depending on many socio-political, cultural and linguistic factors, and can range from the borrowing of sounds and words to, sometimes, entire syntactic structures.

So, when an English-French bilingual says, “Je vais à Nice pour le week-end” (“I am going to Nice for the weekend”), is he code-mixing, or is “week-end” a borrowing from English into French?

Even linguists cannot agree on how to classify such “other-language embeddings”. Is it true code-mixing? Is it nonce-word borrowing? And do these differ from loanwords that are integrated into the native vocabulary and grammatical structure?

Many linguists believe that loanwords start out as code-mixing or nonce borrowings, but through repeated use and diffusion across the language they gradually become native vocabulary and acquire the characteristics of the “borrowing” language. In speech, this means the adaptation of the loanword to the sound system and grammar of the native language, that is, phonological and morpho-syntactic convergence.

The problem with this is that in many cases a native accent might be mistaken for phonological convergence, and morpho-syntactic marking might not be readily visible.

For example, most Hindi speakers of English pronounce the English alveolar /d/ as a retroflex, because the alveolar plosive is not part of Hindi phonology. However, this does not imply that the English word in question has become part of the native vocabulary.

Similarly, if we look at the two sentences:

“sab artists ko bulayaa hai” (all artists have been called),

and

“sab artist kal aayenge”

(all artists will come tomorrow)

In the first sentence the English inflection –s on the word artist marks it as plural but in the second case, the plural is marked on the Hindi Verb.

Does this imply that the first case is code-mixing and the second a case of borrowing, given that both forms and structures are equally acceptable and common in Hindi?

These categories are not easy to decide, especially for single words, without looking at diachronic data, and the distinction itself is inherently fuzzy. In general, it is believed that there exists a sort of continuum between code-mixing and loan vocabulary: the edges may be clearly distinguishable, but the vast majority of cases in the middle, especially single words, are difficult to disambiguate.

In a future post, we will look at what this continuum might look like and one possible way we can try to distinguish true code-mixing from loanwords.

In the meantime, you can look at some earlier studies on borrowing, mixing, and what lies in between.

  1. Frederic Field. 2002. Linguistic borrowing in bilingual contexts. Amsterdam: Benjamins.
  2. Carol Myers-Scotton. 2002. Contact linguistics: Bilingual encounters and grammatical outcomes. Oxford University Press.
  3. Pieter Muysken. 2000. Bilingual speech: A typology of code-mixing. Cambridge University Press.
  4. Shana Poplack, D. Sankoff, and C. Miller. 1988. The social correlates and linguistic processes of lexical borrowing and assimilation. Linguistics 26:47-104.
  5. Shana Poplack and Nathalie Dion. 2012. “Myths and facts about loanword development.” in Language Variation and Change 24, 3.
  6. David Sankoff, Shana Poplack, and Swathi Vanniarajan. 1990. The case of the nonce loan in Tamil. Language Variation and Change, 2 (1990), 71-101. Cambridge University Press.