Author: pocomixmaadi

Interspeech Special Session on Code-switching

We are very pleased to announce the Interspeech 2017 Special Session on Speech Technologies for Code-Switching in Multilingual Communities.

Code-switching poses various interesting challenges to the speech community, such as language modeling for mixed languages, acoustic modeling of mixed-language speech, pronunciation modeling and language identification from speech. The special session, which focuses on speech processing of code-switched speech, will include oral presentations and a poster session. We expect participants from academia and industry spanning a wide variety of language pairs and data sets. We also expect discussions on how to create speech and language resources for code-switching and on sharing of data.

For more details about the special session, please see https://www.microsoft.com/en-us/research/event/interspeech-2017-special-session-speech-technologies-for-code-switching-in-multilingual-communities/

Multilinguals cannot help but code-switch!

Linguists have been saying this forever, but now we have empirical evidence from cognitive scientists to support it: If you speak two languages and have ever found this task to be difficult (choosing the “right” tongue based on the context you’re in), it’s because both languages are always “on” in the brains of bilinguals.

In fact, “[exposure to multiple languages] is not confusing them [bilingual babies] or messing them up developmentally; the opposite is true,” says Judith F. Kroll, Professor Emeritus, University of Pennsylvania.


Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?

Koustav Rudra, IIT Kharagpur

Code-switching or code-mixing, a fluid alternation between two or more languages in a single conversation, is very common in chats, conversations, and messages posted on social media such as Twitter and Facebook. Around 125 million people in India speak English, half of whom have Hindi as their mother tongue. A large proportion of the remaining half, especially those residing in metropolitan cities, also know at least some Hindi. This makes Hindi-English code-switching, commonly called Hinglish, extremely widespread in India.

It is observed that users generally switch languages in their conversations or posts for various reasons:

  •  Sometimes users try to post information in one language, and personal opinions in another one, e.g. – “was it a modi wave or just wave? #aapsweep #aapkidilli bjp ki halat congress se jyada achchi ni lag rahi”
  •  Users want to emphasize their opinion, e.g. – “best wishes to indian team tiranga aapke saath hai”
  •  Users want to curse or use abusive language, e.g. – “Seeing the movie I thought ki kis bandar ke haath me camera de do to wo bhi movie banaa le”

In this study, we try to understand the language preference of users for expression of opinions and sentiment. For this, we first build a corpus of 430,000 unique India-specific tweets across four domains (sports, entertainment, politics, and current events) posted by 125,396 unique users and automatically classify the tweets by their language: English, Hindi and Hindi-English code-switched. We then develop an opinion detector for each language class to further categorize the tweets into opinionated and non-opinionated (facts, wishes, questions etc.) tweets. Sentiment detectors further classify the opinionated tweets as positive, negative or neutral.
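The first step of this pipeline, sorting tweets into English, Hindi and code-switched, can be pictured with a toy word-level tagger. This is only a minimal sketch and not the classifier used in the study: the two lexicons, the function names and the "unknown" label are assumptions made up for illustration, and a real system would rely on a trained language identifier rather than word lists.

```python
import re

# Hypothetical toy lexicons, for illustration only (not from the study).
ROMAN_HINDI = {"ki", "hai", "se", "aapke", "saath", "tiranga", "halat", "jyada"}
ENGLISH = {"best", "wishes", "to", "indian", "team", "was", "it", "a", "wave"}

# Any character in the Devanagari Unicode block marks a token as Hindi.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def tag_token(token):
    """Return 'hi', 'en' or 'other' for a single whitespace-separated token."""
    word = token.lower().strip(".,!?\"'")
    if DEVANAGARI.search(word) or word in ROMAN_HINDI:
        return "hi"
    if word in ENGLISH:
        return "en"
    return "other"


def classify_tweet(text):
    """Label a tweet as English, Hindi, code-switched or unknown."""
    tags = {tag_token(token) for token in text.split()}
    if "hi" in tags and "en" in tags:
        return "code-switched"
    if "hi" in tags:
        return "Hindi"
    if "en" in tags:
        return "English"
    return "unknown"


print(classify_tweet("best wishes to indian team tiranga aapke saath hai"))
```

Opinion and sentiment detectors would then be applied separately to each of the three classes, as described above.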

 We observe several interesting phenomena in the language usage pattern by end users:

  •  In online social media, Hindi is seldom written in the Devanagari script. Instead, loose Roman transliteration, or Romanized Hindi, is common, especially when users code-switch between Hindi and English.
  • There is a strong preference towards Hindi (i.e. the native language) over English for expression of negative opinion. The effect is clearly visible in code-switched tweets, where a switch from English to Hindi is often correlated with a switch from a positive to negative sentiment like “#beefban is a right step forward .. me saala bangal ho jaaye toh behtar hoga .. ye bangladeshi ghuspaithiye isi bahane india aate hai ..”. Table 1 shows distribution of opinion/sentiment and non-opinion across different monolingual languages (English, Devanagari Hindi, Romanized Hindi), and Table 2 shows distribution over English and Romanized-Hindi fragments of code-switched tweets.

Table 1: Non-opinion and Sentiment across languages

Category            English   Devanagari-Hindi   Romanized-Hindi
Non-opinion         0.35      0.47               0.39
Negative            0.17      0.17               0.25
Negative/Positive   0.38      3.27               1.60

Table 2: Non-opinion and Sentiment across English and Romanized Hindi fragments of code-switched tweets

Category            English   Romanized-Hindi
Non-opinion         0.45      0.39
Negative            0.14      0.22
Negative/Positive   0.34      2.2

  •  Native language Hindi is preferred for swearing over English. We also measure the distribution of abusive Romanized-Hindi and English fragments for code-switched tweets. Interestingly, over 90% of the swear words occur in Hindi parts. The figure below shows the distribution of abusive tweets across different languages.
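As a reading aid for Tables 1 and 2: if the Negative/Positive row is the ratio of the negative share to the positive share (an assumption, since the tables do not say so explicitly), an approximate positive share can be recovered by dividing the Negative row by it. The short sketch below does this arithmetic for Table 1; the helper name is hypothetical.

```python
# Values copied from Table 1: (negative share, negative/positive ratio).
TABLE_1 = {
    "English": (0.17, 0.38),
    "Devanagari-Hindi": (0.17, 3.27),
    "Romanized-Hindi": (0.25, 1.60),
}


def implied_positive(negative, ratio):
    """Positive share implied by a negative share and a negative/positive ratio.

    Assumes ratio = negative / positive, which is our reading of the table.
    """
    return negative / ratio


for language, (negative, ratio) in TABLE_1.items():
    print(f"{language}: implied positive share {implied_positive(negative, ratio):.2f}")
```

Under that reading, the implied positive share is far smaller for the Hindi columns than for English, which is the same preference for Hindi in negative opinion that the text describes.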

If you want to know more details about this study, please read our EMNLP 2016 paper:

“Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?”, Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury and Niloy Ganguly, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

With ‘Udta Punjab’ there’s more code-mixing in Bollywood

Indrani Medhi Thies, Microsoft Research India

“Saare Gabru toh sooiyan lagake tight hai madam, Ab Ladies ko hi kuch karna padega na”, says Assistant Sub-Inspector Sartaj to Dr. Preet Sahni in ‘Udta Punjab’.

(English: “All the young hunks are wasted from injecting needles, so it’s the women who’ll have to do something”.)

‘Udta Punjab’ is Bollywood’s attempt at creating awareness about the drug menace in Punjab. In recent years drugs have crippled the youth of this state and things are only getting worse. Udta Punjab looks at the drug problem through the three intertwining stories of its four protagonists. The stories have one connection: how drugs are affecting lives across the socio-economic strata and how people are fighting back in their own way. However, another connection that you can’t help noticing is the use of code-mixed language across these stories, mixing Punjabi, Hindi, Bhojpuri, and English.

In Udta Punjab Sartaj’s (Diljit Dosanjh) teenaged brother Balli has turned an addict and the family has no idea about his situation. Sartaj gets sensitized about the drug problem with the help of the beautiful, strong, independent Dr. Preet Sahni (Kareena Kapoor), a crusader against drugs. Together Sartaj and Preet carry out a series of sting operations that will expose the drug racket to the entire nation.

And it’s not just teenaged boys like Balli; the London-born Punjabi rockstar Tommy Singh (Shahid Kapoor) is also a drug addict. Known by his fans as ‘Gabru’, he’s a youth icon. He has composed music about drugs:

Enough is enough

Kutte se bhi tough

Life ho jaaye toh maaro 10 puff

Match fix lagega toh pitch khodenge

Hoke besharam bandi rich khojenge

Emo-panti apne ko to suit na kare

Gabru hi kya jo hip se shoot na kare

(English: Enough is enough

Tougher than a dog

When life gets to you, then take in 10 puffs

If we think that a match is fixed, we will dig up the pitch

We’ll be shameless, we will search for a rich girl

Being emotional doesn’t suit us

One who doesn’t shoot from his hip isn’t a young man)

Powder ki line’o ka rakhega kaun hisaab…

Haan ud-daa Punjab

(English: Who’ll keep track of the lines of powder [cocaine]…

Yes, Punjab’s flying).

Tommy’s street-smart ‘Tayaji’ (Uncle, Satish Kaushik) has used every opportunity to make money off Tommy’s popularity, but is also very loyal and supportive. He speaks in a chaste Punjabi accent:

“Puttar, Honda hai cocaine chaddo toh withdrawal symptom hote hain”.

(English: Son, it happens, when you stop using cocaine, you experience withdrawal symptoms).

The third story is of a migrant Bihari labourer (name not revealed till the end, Alia Bhatt) who gets dragged into drug peddling by accident, and lands in the worst situation of all the film’s protagonists. It’s only later revealed that she’s a district-level hockey champion who was forced to give up her dream to work as a labourer in the fields of Punjab so she can support her family. Tommy and she meet in the most unusual of circumstances. Tommy tells her:

“Agar tujhe koi bole jo nikal gaya woh teri life ka sabse accha time tha….Maal khatam…Party over. Go Home”.

(English: If someone were to tell you, what’s past was the best time of your life… Stuff’s over (now), party over, go home).

Alia’s character doesn’t do much code-mixing; there’s just one English word or two in her utterances:

“Kaise na aya acha time re, sala acha time khojte khojte toh ye hal ho gya…”

(English: How will good times not come, idiot, see what situation searching for good times has landed me in…).

‘Udta Punjab’ exposes its audience to the gut-wrenching reality of the drug problem in Punjab. Time is running out and something needs to be done soon. The situation is bleak, but a ray of hope flickers. Udta Punjab is also a great example of how code-mixing is becoming increasingly commonplace in films. So if you’re interested in some realistic cinema, or just curious how code-mixing overlaps with Bollywood,

Go, aglaa show book karo, jaldi, if you haven’t already!

How Do We Characterize Code-mixing?

Gayatri Bhat, Microsoft Research India

If you are a frequent reader of this blog, you have a fair idea of what code-mixing is. In case you aren’t, it is the practice of going back and forth between two languages in the course of just one ek hi conversation, as jaise I’m doing right now abhi.

Here’s a curious thing about code mixing. Most people seem to agree that you cannot arbitrarily alternate between languages while uttering a sentence. For instance, if you speak both Hindi and English with a co-worker, you might tell him,

Office aane ke raste main I fell into a basket of machhli.

(On the way to office, I fell into a basket of fish.)

But you definitely will not say –

Office aane ke on the way I giri into a basket of fish.

It just sounds odd.

So, we might say that there are rules for code-mixing. In that case, what are they? Must code-mixers know all the rules? People who code-mix usually do so easily, without speaking slowly so that they can decide when to switch languages and definitely without trying to check whether they’re sticking to the rules. It turns out that unlike, say, writing sonnets, code-mixing is one of those things you can accomplish without consciously knowing the rules you’re using to do it.

There are people, though, who are still trying to figure out the rules for code-switching, some because they’re just curious, others because they’re trying to teach computers how to participate in a code-mixed conversation. (Machines don’t seem to think code-mixing is any easier than writing sonnets. Tougher, perhaps.) The frustrating bit is that nobody seems to be coming up with the correct rules. For every rule that’s made, there’s a perfectly good code-mixed sentence that violates it.

One major dispute is regarding the roles of the two (or more, but for now, let’s take two) languages being mixed. Some say that one language is in charge and only lets the other peek in here and there, while others maintain that the two languages are equal partners. This is an important debate, because it determines what sort of rules we’re looking for.

Consider the first alternative – Every sentence is originally in a single language (the superhero, or the matrix language). While code-mixing, we essentially pull out clumps of one or more words from this sentence and plug in fragments from the other language (the sidekick, or the embedded language). A fragment might have fewer or more words than the clump it replaces, and might be ordered differently, but always conveys the same information as the original clump. One may not, of course, pull out bits of these sidekick-clumps and replace them with hero-clumps. The catch, though, is that one cannot do this exercise with any group of words one fancies. Take, for instance, the sentence –

Mere kurte pe maine doodh gira diya.

(I spilt milk on my kurta.)

English-Hindi code-mixers might swap ‘mere kurte pe’ out in favour of its English counterpart –

On my kurta maine doodh gira diya.

However, one will not do this with ‘pe‘ to say –

Mere kurta on maine doodh gira diya.

In this paradigm, the matrix-embedding model, the ‘rules’ for code-switching would indicate what sorts of word-groups one can swap out. The example above illustrates a couple of rules suggested in this paper, which say that it is alright to ‘swap’ or ‘embed’ a noun phrase (‘mere kurte pe’), but not a lone postposition (‘pe’). We should note here that not being able to swap postpositions does not mean that you will never encounter a Hindi postposition in an English-hero sentence. It only means that any Hindi postposition in the sentence was swapped in as part of a particular sort of group, perhaps a noun phrase.
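The kurta example can be turned into a small sketch of the matrix-embedding idea. Everything here is an illustrative assumption rather than the rule set from the paper: the `Chunk` class, the chunk labels and the `EMBEDDABLE` set are hypothetical, and the sentence is chunked by hand instead of by a parser.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    hindi: str    # the clump as it appears in the matrix (hero) sentence
    english: str  # the embedded (sidekick) fragment that could replace it
    kind: str     # hand-assigned label, e.g. "NP", "N", "P", "OTHER"


# Assumption for this sketch: only noun phrases may be swapped out.
EMBEDDABLE = {"NP"}


def embed(chunks, index):
    """Swap one Hindi clump for its English fragment, if its kind allows it."""
    if chunks[index].kind not in EMBEDDABLE:
        raise ValueError(f"cannot embed a lone {chunks[index].kind!r} chunk")
    words = [c.english if i == index else c.hindi for i, c in enumerate(chunks)]
    return " ".join(words)


# Coarse chunking: 'mere kurte pe' is treated as one noun phrase.
coarse = [
    Chunk("Mere kurte pe", "On my kurta", "NP"),
    Chunk("maine doodh gira diya.", "I spilt milk.", "OTHER"),
]

# Fine chunking: the postposition 'pe' stands alone and may not be swapped.
fine = [
    Chunk("Mere kurte", "my kurta", "N"),
    Chunk("pe", "on", "P"),
    Chunk("maine doodh gira diya.", "I spilt milk.", "OTHER"),
]

print(embed(coarse, 0))
```

Calling `embed(fine, 1)` raises a `ValueError`, which mirrors the intuition that a lone postposition is not swapped on its own.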

The other idea, which is based on both languages being equal partners, goes like this – To start off with, you have two copies of the same sentence, one in each of two languages. In order to code-mix, you start off with a slice of one of these sentences. Now place a slice of the other sentence next to it. Now another of the first. And so on, until you’ve got a code-mixed sentence that says the same thing as either of the initial single-language sentences.

A simple example in Hindi and English again. You’ve got these two –

Agar main kahoon, mujhe tumse mohobat hai, meri bas yehi chaahat hai, toh kya kahogi?

If I say I am in love with you, that this is my only wish, then what will you say?

We slice and layer to come up with –

Agar main kahoon, I am in love with you, meri bas yehi chaahat hai, toh what will you say?
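The slice-and-layer step can be sketched in a few lines. This is a simplified illustration under strong assumptions: the two sentences are cut by hand into segments that line up one-to-one and in the same order, and the function names are hypothetical. Taking each aligned segment from exactly one language means no fragment can appear twice, but nothing here checks whether the result sounds grammatical.

```python
from itertools import product

# Hand-aligned segments of the two single-language sentences.
HINDI = ["Agar main kahoon,", "mujhe tumse mohobat hai,",
         "meri bas yehi chaahat hai,", "toh", "kya kahogi?"]
ENGLISH = ["If I say", "I am in love with you,",
           "that this is my only wish,", "then", "what will you say?"]


def mix(hindi, english, choices):
    """Build one code-mixed sentence; choices holds 'H' or 'E' per segment."""
    if not (len(hindi) == len(english) == len(choices)):
        raise ValueError("segments and choices must have the same length")
    return " ".join(h if c == "H" else e
                    for h, e, c in zip(hindi, english, choices))


def all_mixes(hindi, english):
    """Yield every sentence obtainable by choosing a language per segment."""
    for choices in product("HE", repeat=len(hindi)):
        yield mix(hindi, english, choices)


print(mix(HINDI, ENGLISH, "HEHHE"))
```

The choice string "HEHHE" reproduces the code-mixed sentence above. Enumerating `all_mixes` gives 32 candidates for these five segments, many of which a speaker would reject, which hints at how many extra rules a realistic version needs.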

This model proves a lot trickier to use than the first one. (Check it out here) The ‘rules’ here must ensure that the code-mixed sentence doesn’t include the same fragment twice, once in each language. They also mustn’t allow words that were next to each other in the original sentence to be at opposite ends of the new one, just because we sliced the sentence right between these two words. We need rules to check whether every part of the code-mixed sentence sounds grammatical according to at least one of the two languages, and whether… oh, all sorts of things, far too many things.

Definitely not something one could work out in one’s head while talking at normal speed, right? 😉

Pronunciation Modeling for Code Mixing

Sunayana Sitaram, Microsoft Research India

Have you ever wanted to have your texts and WhatsApp messages read out to you? Have you ever used a foreign word while using a system like Cortana, only to find that it does not recognize words that are not in the language it is expecting to hear? Speech Recognition and Synthesis of code-mixed utterances is a very challenging problem. Most speech processing systems are designed to be used with a single language. Moreover, people may pronounce words differently when they are speaking multiple languages at the same time, which may confuse such systems.

Let us look at the problem of reading out a recipe on the popular Hindi recipe website Nishamadhulika. Here’s the link to the recipe, if you want to take a look: http://nishamadhulika.com/1064-creamy-mushroom-soup-recipe.html

As you can see, most of the text in the recipe description is in Hindi, written in the native script (Devanagari). This should be fairly easy for a Hindi Text to Speech system to read out to the user. However, we see some English words in the title, and also numbers in the Roman script to denote quantities. If you scroll down to the comments, you see that many of the comments are in Hindi, but are not written in the native script. Let us look at a couple of comments.

“bahut yammi recipe thi nisha ji ye soup mere baby ne jo ki 15 month ka hai bahut shok se piya hai”

“Nisha ji musroom soup bht acha bna h.mje cooking bhi bht achi lgti h.bus ye btao is e without cream healthi kaise bnaya ja skta h ans jrur dena”

We find that there are many English words in these sentences (“soup”, “yammi”, which is “yummy”, “15 month”, “baby”, “cooking” etc.). We also find that users don’t always follow a standard way of transliterating Hindi into Romanized script. For example, in the first sentence, the word “बहुत” is written as “bahut”, while in the second one, it is shortened to “bht”. Similarly, the word “है” is written as “hai” in the first comment, and only as “h” in the second one!

Now imagine you are a Text to Speech system and you need to read out such text! You need to identify what language each word is in, rectify spelling mistakes, expand contractions and then figure out how you are actually going to pronounce each word. This is made even harder by the fact that the training data for most Text to Speech systems today consists only of single-language, clean, well-written data.
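To make these steps concrete, here is a minimal sketch of such a front end, run on words from the two comments above. It is not the method from the paper mentioned below: the lookup tables and the function name are hypothetical, and a real system would need learned language identification and spelling normalization rather than hand-written dictionaries.

```python
# Hypothetical lookup tables covering a few words from the quoted comments.
ROMAN_HINDI_FORMS = {"bahut": "bahut", "bht": "bahut", "hai": "hai", "h": "hai"}
ENGLISH_FORMS = {"yammi": "yummy", "soup": "soup", "baby": "baby",
                 "cooking": "cooking", "recipe": "recipe", "month": "month"}


def normalize(text):
    """Return (token, language, normalized form) for each whitespace token.

    Language is 'hi', 'en', 'num' or 'unk'; unknown tokens pass through as-is.
    """
    result = []
    for token in text.lower().split():
        if token.isdigit():
            result.append((token, "num", token))
        elif token in ROMAN_HINDI_FORMS:
            result.append((token, "hi", ROMAN_HINDI_FORMS[token]))
        elif token in ENGLISH_FORMS:
            result.append((token, "en", ENGLISH_FORMS[token]))
        else:
            result.append((token, "unk", token))
    return result


for item in normalize("bht yammi soup h"):
    print(item)
```

Each normalized word could then be handed to the pronunciation rules of its own language. Note that even this toy version breaks on the second comment, where “h.mje” is written without a space, so tokenization is itself part of the problem.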

In a future post, we will talk more about how we make Text to Speech systems capable of synthesizing mixed language text. Meanwhile, you can read this paper:

‘Speech Synthesis of Code Mixed Text’, Sunayana Sitaram and Alan W Black, in Proceedings of LREC 2016, Portoroz, Slovenia