Uncategorized

How Do We Characterize Code-mixing?

Gayatri Bhat, Microsoft Research India

If you are a frequent reader of this blog, you have a fair idea of what code-mixing is. In case you aren’t, it is the practice of going back and forth between two languages in the course of just one ek hi conversation, as jaise I’m doing right now abhi.

Here’s a curious thing about code mixing. Most people seem to agree that you cannot arbitrarily alternate between languages while uttering a sentence. For instance, if you speak both Hindi and English with a co-worker, you might tell him,

Office aane ke raste main I fell into a basket of machhli.

(On the way to office, I fell into a basket of fish.)

But you definitely will not say –

Office aane ke on the way I giri into a basket of fish.

It just sounds odd.

So, we might say that there are rules for code-mixing. In that case, what are they? Must code-mixers know all the rules? People who code-mix usually do so easily, without speaking slowly so that they can decide when to switch languages and definitely without trying to check whether they’re sticking to the rules. It turns out that unlike, say, writing sonnets, code-mixing is one of those things you can accomplish without consciously knowing the rules you’re using to do it.

There are people though, who are still trying to figure out the rules for code-switching, some because they’re just curious, others because they’re trying to teach computers how to participate in a code-mixed conversation (Machines don’t seem to think code-mixing is any easier than writing sonnets. Tougher, perhaps.) The frustrating bit is that nobody seems to be coming up with the correct rules. For every rule that’s made, there’s a perfectly good code-mixed sentence that violates it.

One major dispute is regarding the roles of the two (or more, but for now, let’s take two) languages being mixed. Some say that one language is in charge and only lets the other peek in here and there, while others maintain that the two languages are equal partners. This is an important debate, because it determines what sort of rules we’re looking for.

Consider the first alternative – Every sentence is originally in a single language (the superhero, or the matrix language). While code-mixing, we essentially pull out clumps of one or more words from this sentence and plug in fragments from the other language (the sidekick, or the embedded language). A fragment might have fewer or more words than the clump it replaces, and might be ordered differently, but always conveys the same information as the original clump. One may not, of course, pull out bits of these sidekick-clumps and replace them with hero-clumps. The catch, though, is that one cannot do this exercise with any group of words one fancies. Take, for instance, the sentence –

Mere kurte pe maine doodh gira diya.

(I spilt milk on my kurta.)

English-Hindi code-mixers might swap ‘mere kurte pe’ out in favour of its English counterpart –

On my kurta maine doodh gira diya.

However, one will not do this with ‘pe‘ to say –

Mere kurta on maine doodh gira diya.

In this paradigm, the matrix-embedding model, the ‘rules’ for code-switching would indicate what sorts of word-groups one can swap out. The example above illustrates a couple of rules suggested in this paper, which say that it is alright to ‘swap’ or ’embed’ a noun phrase (‘mere kurte pe’), but not a lone postposition (‘pe‘). We should note here that not being able to swap postpositions does not mean that you will never encounter a Hindi postposition in an English-hero sentence. It only means that any Hindi postposition in the sentence was swapped in as part of a particular sort of group, perhaps a noun phrase.

The other idea, which is based on both languages being equal partners, goes like this – To start off with, you have two copies of the same sentence, one in each of two languages. In order to code-mix, you start off with a slice of one of these sentences. Now place a slice of the other sentence next to it. Now another of the first. And so on, until you’ve got a code-mixed sentence that says the same thing as either of the initial single-language sentences.

A simple example in Hindi and English again. You’ve got these two –

Agar main kahoon, mujhe tumse mohobat hai, meri bas yehi chaahat hai, toh kya kahogi?

If I say I am in love with you, that this is my only wish, then what will you say?

We slice and layer to come up with –

Agar main kahoon, I am in love with you, meri bas yehi chaahat hai, toh what will you say?

This model proves a lot trickier to use than the first one. (Check it out here) The ‘rules’ here must ensure that the code-mixed sentence doesn’t include the same fragment twice, once in each language. They also mustn’t allow words that were next to each other in the original sentence to be at opposite ends of the new one, just because we sliced the sentence right between these two words. We need rules to check whether every part of the code-mixed sentence sounds grammatical according to at least one of the two languages, and whether… oh, all sorts of things, far too many things.

Definitely not something one could work out in one’s head while talking at normal speed, right? 😉

Pronunciation Modeling for Code Mixing

Sunayana Sitaram, Microsoft Research India

Have you ever wanted to have your texts and WhatsApp messages read out to you? Have you ever used a foreign word while using a system like Cortana, only to find that it does not recognize words that are not in the language it is expecting to hear? Speech Recognition and Synthesis of code-mixed utterances is a very challenging problem. Most speech processing systems are designed to be used with a single language. Moreover, people may pronounce words differently when they are speaking multiple languages at the same time, which may confuse such systems.

Let us look at the problem of reading out a recipe on a popular Hindi recipe website Nishamadhulika. Here’s the link to the recipe, if you want to take a look http://nishamadhulika.com/1064-creamy-mushroom-soup-recipe.html

Now as you can see, most of the text in the recipe description is in Hindi, written in the native script (Devanagari). This should be fairly easy for a Hindi Text to Speech system to read out to the user. However, we see some English words in the title, and also numbers in the Roman script to denote quantities.  If you scroll down to the comments, you see that many of the comments are in Hindi, but are not written in native script. Let us look at a couple of comments.

“bahut yammi recipe thi nisha ji ye soup mere baby ne jo ki 15 month ka hai bahut shok se piya hai”

“Nisha ji musroom soup bht acha bna h.mje cooking bhi bht achi lgti h.bus ye btao is e without cream healthi kaise bnaya ja skta h ans jrur dena”

We find that there are many English words in these sentences (“soup”, “yammi”, which is “yummy”, “15 month”, “baby”, “cooking” etc.). We also find that users don’t always follow a standard way of transliterating Hindi into Romanized script. For example, in the first sentence, the word “बहुत” is written as “bahut”, while in the second one, it is shortened to “bht”. Similarly, the word “है” is written as “hai” in the first comment, and only as “h” in the second one!

Now imagine if you are a Text to Speech system and you need to read out such text! You need to identify what languages the words are in, rectify spelling mistakes, expand contractions and then figure out how you are actually going to pronounce the word. This is made even harder by the fact that the training data for most Text to Speech systems today only consists of single language, clean, well-written data.

In a future post, we will talk more about how we make Text to Speech systems capable of synthesizing mixed language text. Meanwhile, you can read this paper:

‘Speech Synthesis of Code Mixed Text’, Sunayana Sitaram and Alan W Black, in Proceedings of LREC 2016, Portoroz, Slovenia

Word appropriation: To be, or not to be… formalized?

Andrew Cross, Microsoft Research India

English-adapted words, especially around technology use, are increasingly common in other languages. For instance, to tweet in Spanish is often called “tuitear”, taking the original English word and adding a Spanish grammatical ending. Similarly, “le hardware” or “le software” are used in French to describe the rather obvious English-counterparts (for other interesting Franglais phrases, check out an amusing list here). Some words, like “computer”, “bus”, or “phone/mobile” are almost universally understood around the world.

While widespread adoption of these words gives a certain uniformity and intelligibility to global conversations, there are those who lament this trend and think it undermines the original language and therefore culture. Language institutions like the Academie Française or the Real Academia Española regularly wrestle with what words to embrace from other languages, versus promoting more local renderings of the same idea (one example the director of the Real Academia Española gives is his preference to use “auto-photo” instead of “selfie”). One clear goal of defining a unified dictionary of a language as geographically dispersed as Spanish, a majority language in over 20 countries, is not only to protect the language from being infiltrated by outside influence, but also to build an identity and cultural unity for speakers and countries that use the language.

And so emerges a funny paradox that is by no means limited to the human interpretation of “language” – on the one hand you have an organic blend and evolution of language through increasing global travel, business, and media. On the other, you have a need or desire to canonize certain aspects of language both for utility (one needs to be understood), and for preserving a certain culture associated with a language. At one extreme, wholesale adoption of outside languages could lead to the ultimate demise of a language. But at the other extreme, the outright rejection of any word deemed “foreign” undermines the very nature of language dynamics.

Which brings the conversation back to technology. The global world is much more connected which presents more opportunities for languages to interact and evolve. With the near immediacy for interchange available through the internet, one can expect many of these new blends and linguistic evolutions to brew locally, but make their international debut online. How will this debate play out as words like “selfie” or “friend request” or “email” become increasingly common in online forums? Perhaps more importantly for bodies governing the words that are officially part of a language, can (or should) such standardizing efforts keep up with the rapid spread of foreign words in the new era of the internet?

Code-Mixed Language Identification

Shruti Rijhwani, Microsoft Research India

RT @HappelStadion: What was your favourite 1D moment at the concert? Was war für euch der schönste Moment? Tweet us!

If you know both English and German, you probably figured out what two languages this tweet uses. Either way, you likely realized that there isn’t just one language in the tweet.

We recognize languages that we are familiar with. The task is second nature to humans – is it just as easy for machines? Why do machines need to identify languages in the first place?

Most Natural Language Processing (NLP) techniques are designed for specific languages. That makes language identification a necessary first step for machines to derive meaning from human language. Computational language identification research began in 1995. Initially, language identification was performed at the document-level, that is, whole documents were assumed to contain a single language. This was only logical as back in 1995, most digital documents had professional or literary content. We didn’t expect to encounter multiple languages within documents!

However, sentence-level language identification (i.e. one language label per sentence) soon became important to understand comments, short posts and similar user-generated data on the internet. Where does code-mixing fit in, though? Let us look at this Spanish-English tweet.

@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro! 😮 Then I started getting all red, I think im allergic a algo

Even sentence-level language identification wouldn’t work when data is code-mixed, as mixing can be intra-sentential! Before we begin to process code-mixing, we need to recognise all languages present in the data. One language per sentence simply isn’t enough – word-level language identification is necessary.

Code-mixing is inherently informal and generally occurs in casual communication. The phenomenon traditionally occurred in spoken conversation. Now, we have speech-like informal conversation happening on social media and find plenty of code-mixed data in the text form as well.

How do we identify the languages in social media data? Is it as simple as looking up words in dictionaries of various languages? Going back to our example tweet,

RT @HappelStadion: What *was* your favourite 1D *moment* at the concert? *Was* war für euch der schönste *Moment*? Tweet us!

There are words (‘was’, ‘moment’) that belong to both languages! And this tweet is grammatically sound, with correct spelling. What about tweets like,

Wat n awesum movie it wazzzz!

Our language dictionaries wouldn’t identify misspelled words (‘wat’), shortened words (‘awesum’) and exaggerated words (‘wazzz’).

Not to mention, the problem of transliteration. Several languages that are not formally written in the Roman/Latin script, are often phonetically typed using the Roman script that computer keyboards generally feature.

Modi ke speech se India inspired ho gaya #namo

Although Hindi uses the Devanagari script, this Hindi-English tweet has transliterated Hindi words.

Looking up words in a dictionary might work in several cases. But the example tweets we’ve just looked at are not outliers! A large amount of social media content isn’t written with perfect grammar and spelling. Solutions to word-level language ID must counter these problems as well.

There has been exciting work on word-level language identification for social media data, including a shared task in EMNLP 2014 [1], the annual FIRE shared task [2], as well as work on Hindi-English [3] and Dutch-Turkish [4] mixing.

Most previous work deals with pairwise language identification i.e., the language pair is already known, and words in the input can only be from those languages. With plenty of annotated training data, supervised machine learning models have performed extremely well under these conditions.

However, such models have two glaring issues –

  1. They assume that the language pair in the input is already known and the words can only be from those languages. On Twitter, Facebook and other social media, no prior language information is available about posts.
  2. They use supervised machine learning models, which require plenty of annotated training data. Labelled data is scarce for most language pairs, particularly data with all the quirks of social media.

The Project Mélange team at MSR India is working towards a solution for these issues.

We aim to design a universal word-level language identification technique that works well for both code-mixed and monolingual social media data. It would require no prior information about the languages in the input. Although we have a minuscule amount of code-mixed training data, obtaining labeled monolingual data is relatively much simpler. We leverage this monolingual data and train a model that can label code-mixed input as well.

Watch this space for more on that, soon!

References

[1] Solorio, Thamar, et al. “Overview for the first shared task on language identification in code-switched data.” Proceedings of The First Workshop on Computational Approaches to Code Switching. 2014.

[2] Sequiera, Royal, et al. “Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval.”

[3] Gella, Spandana, Kalika Bali, and Monojit Choudhury. ““ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification.” (2014).

[4] Nguyen, Dong, and A. Seza Dogruoz. “Word level language identification in online multilingual communication.” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2014.