Author: pocomixmaadi

Project Mélange @ Interspeech 2017

Interspeech is the annual conference of the International Speech Communication Association (ISCA). This year, Interspeech was held in beautiful Stockholm, Sweden in August 2017 and saw a record participation of over 2000 attendees!

Project Mélange members Sunayana Sitaram and Kalika Bali co-organized a special session on code-switching with Prof. Alan Black from CMU, Prof. Julia Hirschberg from Columbia University, Prof. Thamar Solorio from University of Houston and Prof. Mona Diab from George Washington University. The special session was in the pre and post-lunch sessions on the first day of the conference, and we were very excited to see the room full of people for most of the session!

Topics in the special session ranged from data collection and resource building to applications such as Automatic Speech Recognition and Speech Synthesis. The session covered various language pairs – French-Algerian Arabic, Dutch-Frisian, Hindi-English, isiZulu-English and Spanish-English. At the end of the 9 oral presentations, we had an interesting and engaging panel discussion on techniques, applications, data and resources for code-switching in speech and language.

Take a look at the special session website for the full list of papers and more details!


Code switching and code-mixing in 19th century Bengal

Aniruddha Baul, Jadavpur University

Code switching and code mixing strategies played an important part in the conversation patterns of 19th century Bengal. I would like to highlight the variations of code switching and code mixing in Bangla, drawing on the prahasan (skit) and Rupchand Pokkhir gaan (songs of Rupchand Pokkhi, a renowned poet cum musician). Before discussing the examples and variations, I would like to explain why these two particular literary traditions were chosen as documents of code switching and code mixing in conversation.

The dialogues of prahasans were mainly based on real conversations. The main aim of the prahasans was to show a realistic image of contemporary society, including its language. There was no question of loyalty to the standard language in the dialogues of most prahasans. So, we can take the dialogue of the prahasans as authentic language data. I chose a prahasan named “Ekei ki bole sovvota” by Michael Madhusudan Dutt for initial analysis. The prahasan is primarily based on the lifestyle of young people in 19th century Kolkata who have just learned English. Here are some dialogues –

Shibu: ja bol bhai, kintu ora dujon lekhanpora besh jane

(Whatever you would say but they are very learned persons)

Bolai: between ourselves, emon ki jane ?

(among us, what do they know?)

Mahesh: hya,hya, sokoleri biddye jana ache! se din je nobo ek khana chithi likhechilo ta toh dekheicho, tate Lindley Morer je durdosa ta toh mone ache ?

(I know about their knowledge! I think you can remember the letter written by Nobo where the English grammar was poor)

Naba :kintu gentle man, ekhon e desh amader pokkhe jeno mosto jelkhana;ei griho kebol amader liberty hall orthath amader swadhinotar dalan;ekhane jar je khusi se ti koro|gentleman, in the name of freedom let us enjoy ourselves

(But gentleman, now this country is like a prison to us, this room is our liberty hall, the passage of our freedom ,you can do everything whatever you wish to. Gentleman, in the name of freedom let us enjoy ourselves)

Although the speakers share the same language background, we notice a lot of code mixing and code switching in this dialogue. Bangla vocatives or address terms are sometimes replaced by English words like “gentleman”. Concepts like “liberty hall”, which came from Western thought, remain in English in the dialogue. Prepositions are often replaced, but the syntax follows the structures of the matrix language. Knowledge was measured by skill in the English language, and English was a status symbol for native Bengalis during the colonial period of the 19th century. These kinds of code mixing and code switching established the speakers’ identity as “educated” people.

Sibu : That’s a lie

Naba: What! Tumi amake lair bolo? Tumi jano na ami tomake akhoni shoot korbo?

(What! You call me liar! Don’t you know that I am going to shoot you? )

Chaitan : Ha! jete dao, jete dao, ekta trifling kotha niye miche jhogra keno ?

(Ignore the thing! Why are you fighting meaninglessly for a trifling word?)

Naba: Trifling! – o amake liar bolle –abar trifling? O amake bangala bolle na keno? O amake mitthebaadi bolle na keno? Tate kon shala ragto? Kintu –liar –e ki bordasto hoye?

( trifling! he calls me liar? why did not he tell it in Bangla? If he calls me “mittyebadi” {Bengali word for “liar”} instead of liar I don’t mind anything).

From the examples above, we can see that in sentences like “ami tomake akhoni shoot korbo” (I am going to shoot you), the verb “shoot” acts as a noun, and “korbo”, the Bangla auxiliary verb, is added after it to form the compound verb “shoot korbo”. If we focus on the content, we can see the hierarchy of prestige languages: “liar” and “mitthyebadi” have the same meaning, but the word “liar” was considered superior to its Bangla equivalent “mitthyebadi”. These dialogues show the language hierarchy during the colonial period.

When the British communicated with each other they chose to do so in English, but when they had to communicate with native people they had to use mixed language for the sake of negotiation. It is interesting to observe that when upper class and middle class native people communicated with each other, they used mixed languages too. In this case, there was no question of negotiation. If we follow the songs of Rupchand Pokkhi, who became famous in 19th century Bengal for singing in mixed languages, we can understand the possible reasons why native people chose to use mixed language. There is a story that one day a high-ranking British officer was invited as the chief guest at a party organized by Rupchand’s patron. The British officer wanted to hear Rupchand’s song, but it was impossible for him to understand the lyrics. Rupchand saved the day by singing:

                                                   Let me go Ohe dwari I visit to bongsidhari

                                                  ( Oh gatekeeper)           (flute player, Krishna)

                                                    Esechi brojo hote ami brojer kulonaari

                                  ( I am coming from Braj, I am a respected woman from Braj)

                                         Beg you doorkeeper let me get, I want to see blockhead,

                                                Far whom our radhe dead ,ami search kori

                                                                      (I am searching for)

                               Srimoti radhar kena servant ,ei dakh ache daskhoth agreement,

                            (Servant owned by Srimati Radhe, see we do have signed agreement)

                                                   Ekhoni korbo present ,brojopure lob dhori,

                                       (I shall be presenting now, shall be hijacking to Brajapur)

                                                   Moral character suno or,butterthief nonichor

                                   (Please find the moral character of him, the butter thief)

                                                 Blaggard rakhal poor, chor mothurar dondodhari

                                        ( the poor shepherd, the thief is the authority of Mathura)

                                               Kohe R.C.D Bird king , black nonsense ver cunning


                                                     Flute te kore sing, mojayeche Raikishori

                                          (By singing using a flute, convinced the Raikishori)

(Friend of Radha wanted to meet Krishna and tell him the condition of Radha. But, the doorkeeper of the palace did not allow her inside so she was rebuking Krishna for cheating on Radha )

We can see the nature of code switching very clearly in the song, where the matrix language of some sentences is English and that of others is Bangla. Most of the address terms are from Bangla. We also notice compound verbs like “search kori”, where one item of the verb is in English and one is in Bangla. Apart from this, many lexical codes are mixed in the song. Both lexical mixing and structural mixing occur in the songs of Rupchand. Let us look at another song:

                                        Amare fraud kore kaliya damn tui kotha geli

                                 (To me)          (by doing)             (Where did you go?)

                                       I am for you very sorry, golden body holo kaali

                                                                                                 (Became pale)

                                        Ho my dear dearest , modhupure tui geli kesto

                                                                ( Krishna you went to madhupur)

                                      Oh my dear,how to rest,here dear bonomali

                                                       Soon re shyam tore boli

                                    (Oh Shyam, please listen, what I am saying)

                                      Poor creature milk gerel(girl),tader breast’e marli shel

                                                                         (their)               (by targeting arrow)

                                       Nonsense tor naiko akkel,breach of contract korli

                                                   (You have no sense)                   (did)

                                                        Femalegone fail korli

                                                     (You have failed the female)

                                       Lompot sother fortune khullo,mathura’te king holo

                               (The clever became fortunate, he became the king at Mathura)

                                       Uncle’er pran nashilo,kubujar kuj pele dali

                                          (Killed his uncle……rest not understood)

                                                         Nile dashi re mohishi boli

                                                 (took your maid servant as queen)

                                       Sri nandar boy young lad ,croocked mind hard

                                    (of Sri Nanda)

                                      Kohe R.C.D Bird e pelacard krishnokeli


                                                  Half English half Bangali

(Radha was dumped by Krishna so she was rebuking him and expressing her grief)

In this song, there is not only lexical mixing but also structural mixing. For example, in the word “femalegon”, “gon” is the Bangla plural marker, added here to the English word “female”. Bangla case markers like “e” or “r” are added to English nouns to make words like “breast-e” and “uncle-r”. We also notice syntactic changes in English sentences that are inspired by Bangla syntax. Objects follow the subject, just as in Bangla sentence structure, so here we see “I am for you very sorry” instead of “I am very sorry for you”.

Considering the sociolinguistic aspects of code mixing, we can ask: what are the reasons behind this kind of code switching and code mixing in Rupchand’s songs? First, we should know about the audience of his songs. If there were a few British people in the audience attending the performance, then these kinds of code switching and code mixing were natural. But one could not become famous among the natives by following this policy alone, so there had to be huge acceptance of and demand for this kind of song. We can assume that Rupchand created his songs for the British audience but that the songs gradually became famous among the English-loving natives, who could relate the language of the songs to their own language of conversation, in which code switching and code mixing played a great part. So code mixing and code switching can be related to the identity of the natives in 19th century Bengal.


Bandhopadhyay, Asit Kumar. 1973 (first edition). Bangla Sahittyer Itibrittyo, vol. 4 (History of Bengali Literature). Modern Book Agency Pvt. Ltd., Kolkata.

Chakraborty, Ramakanta (ed.). Bismrito Darpan (Forgotten Mirror). Sanskrito Pustok Bhandar, Kolkata.

Khetrogupto (ed.). 1965 (first edition). Madhusudon Rachanabali (Collected Works of Michael Madhusudan Dutt). Sahittya Samsad, Kolkata.

Lahiri, Durgadas (ed.). 1905 (first edition). Bangalir Gaan (Songs of the Bengalis). Bangabasi Electric Press, Kolkata.

Myers Scotton, C. 1982. ‘The possibility of code switching: Motivation for maintaining multilingualism’ in Anthropological Linguistics, Vol. 24, No. 4, pp. 432-444

How code mixing can be used for education

Dr Dripta Piplai, Jadavpur University

Author, “Nijer bhashaye galpo” (Stories in one’s own tongue)

A close observation of the everyday language use of children in India reveals many instances of code mixing. Children can mix and switch between two or more languages. They acquire more than one set of codes based on different situations in their surroundings. Acquisition of multiple sets of codes is observed in both rural and urban children in India. In reality, no child in this country will be found to be an ideal monolingual. Children regularly get access to multiple codes through school, the market, television and playgrounds. In fact, it can be argued that every child is bilingual or multilingual by default. It can be stated that children use one set of grammar and borrow linguistic items from other known languages. It is also possible to claim that instead of simply borrowing from a language, children utilize the structures and lexical items of two or more languages to use mixed codes. As Tom Roeper (1999) has pointed out, there is a ‘Mini Grammar’ inside every child’s head. Thus, every child is bi/multilingual.

There is a need to understand the nature of this bi/multilingual grammar of children. We can assume that there is a multilingual grammar inside every child’s head. There is an obvious question related to this assumption: how are the different codes arranged inside the head? (Like the different emotions arranged inside Riley’s head in the Disney movie ‘Inside Out’.) There are different possibilities. We can argue that there are different slots for different languages in our mental grammar (Universal Grammar, to put it in a Chomskyan way). As children modify the building blocks of languages (or features), different sets of codes are obtained, and the codes are often mixed.

If one observes the playground talk of children, it becomes clear that children use a lot of mixed codes during play. In reality, code mixing is a strategy for negotiation during play. A detailed understanding of code mixing in child language can be obtained through playground talk.

Why do children negotiate on the playground? How does the negotiation process use code mixing? One important answer, perhaps, is that children mix codes to assert certain identities and deny certain others while interacting with other children.

Code mixing has a direct relationship with language variation. Children use codes that are variants of certain linguistic items. For example, a rural child uses variants from his/her home language and the regional standard (the so-called ‘prestige language’). The same child also uses a variant from the link language (or the language of the village marketplace). There is continuous switching and mixing utilizing these three sets of codes, or three variants of the same linguistic item.

The following sentence was uttered by a Rajbanshi-speaking child from the northern part of Bengal, in India:

  1. EkTa           haS     khacche                  murgiTa           dekhtese

‘One duck is eating and  a hen is watching that’

The sentence above has two verbs. The first verb ‘khacche’ (eats) uses the Bangla verb inflection –cche. The second verb ‘dekhtese’ (watches) uses the inflection –ese, which is neither from the home language nor from the regional language. The child is thus mixing two sets of codes in a single sentence.


  2. Ek hate noukaTi nise ar arek hate ghuRiTa niye dekhche

‘(He/she) has taken the boat in one hand and a kite in the other hand’

The first verb ‘nise’ (has taken) is a so-called non-prestigious verbal form. The second verb ‘dekhche’ (watching), on the contrary, is taken from the regional standard.

Negotiation and assertion of identities through playground talk represent instances from a larger domain. It can be assumed that different sets of codes represent different identities. Thus, when rural children want to identify themselves with a teacher from a city, they tend to use codes from so-called prestigious varieties. When children want to play among close-knit group members, their language use tends to focus on the home language.

The teachers in rural schools (also in urban schools, but I am focusing on rural schools for the present purpose) are often not aware of this default multilingual nature of children’s mental grammar. The teacher mostly assumes that children primarily use the regional standard and their home variety (which is a less prestigious form and thus cannot be used in schools). The fact that children naturally mix codes very often in day-to-day conversation is not considered by many teachers. So, teachers do not utilize the multilingual codes for classroom tasks.

Apart from that, there is an understanding on the teachers’ side that children should always use one language in the classroom. There is a misconception that mixing codes or utilizing multilingual codes can be cognitively ‘bad’ for children. According to Perez and Nordlande (2004): ‘when children switch between or mix their two languages, it may seem that the children do not have good skills in their either language’. But Cummins (2008) has mentioned that multilingual children are cognitively more demanding. It has been found that children naturally tap linguistic resources, using rules and vocabulary from both languages (Genesee, Paradis and Crago, 2004). Ironically, the potential for using multilingual codes or utilizing children’s mixed code utterances is not considered a doable task for the regular classroom.

Children’s code-mixed utterances can be used as a classroom resource. Recorded peer talk narratives comprising different codes can be used to design activities based on various skills, e.g. listening to a text and answering or discussing. Teachers can design tasks such as spontaneous storytelling and retelling, describing an event, and pretend play. Theatre activities using code mixing can also be done by allowing children to create dialogues using code-mixed grammar.

The use of children’s default code-mixed constructions in the classroom has benefits. As the actual utterances of children are the target texts for various classroom uses, there is no fear of an ‘ideal’ text being imposed in these situations. In other words, using the code-mixed grammar, or default grammar, of children in the classroom can also lead to a joyful learning experience for the children.

Interspeech Special Session on Code-switching

We are very pleased to announce the Interspeech 2017 Special Session on Speech Technologies for Code-Switching in Multilingual Communities.

Code-switching poses various interesting challenges to the speech community, such as language modeling for mixed languages, acoustic modeling of mixed language speech, pronunciation modeling and language identification from speech. The special session, focusing on speech processing of code-switched speech, will include oral presentations and a poster session. We expect participants from academia and industry spanning a wide variety of language pairs and data sets. We also expect discussions on how to create speech and language resources for code-switching and on sharing of data.

For more details about the special session, please see

Multilinguals cannot help but code-switch!

Linguists have been saying this for ever, but now we have empirical evidence from cognitive scientists to support this: If you speak two languages and have ever found this task to be difficult—choosing the “right” tongue based on the context you’re in—it’s because both languages are always “on” in the brains of bilinguals.

In fact, “ [exposure to multiple languages] is not confusing them [bilingual babies] or messing them up developmentally—the opposite is true.” says Judith F. Kroll , Professor Emeritus, University of Pennsylvania.

Read more.

Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?

Koustav Rudra, IIT Kharagpur

Code-switching or code-mixing, a fluid alternation between two or more languages in a single conversation, is very common in chats, conversations, and messages posted over social media like Twitter and Facebook. Around 125 million people in India speak English, half of whom have Hindi as their mother tongue. A large proportion of the remaining half, especially those residing in the metropolitan cities, also know at least some Hindi. This makes Hindi-English code-switching, commonly called Hinglish, extremely widespread in India.

It is observed that users generally alternate between languages in their conversations or posts for various reasons:

  •  Sometimes users try to post information in one language, and personal opinions in another one, e.g. – “was it a modi wave or just wave? #aapsweep #aapkidilli bjp ki halat congress se jyada achchi ni lag rahi”
  •  Users want to emphasize their opinion, e.g. – “best wishes to indian team tiranga aapke saath hai”
  •  Users want to curse or use abusive language, e.g. – “Seeing the movie I thought ki kis bandar ke haath me camera de do to wo bhi movie banaa le”

In this study, we try to understand the language preference of users for expression of opinions and sentiment. For this, we first build a corpus of 430,000 unique India-specific tweets across four domains (sports, entertainment, politics, and current events) posted by 125,396 unique users, and automatically classify the tweets by their language: English, Hindi and Hindi-English code-switched. We then develop an opinion detector for each language class to further categorize the tweets into opinionated and non-opinionated (facts, wishes, questions etc.) tweets. Sentiment detectors further classify the opinionated tweets as positive, negative or neutral.
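To make the first stage of this pipeline concrete, here is a minimal Python sketch of tweet-level language classification. It is only an illustration and not the classifier used in the study: the word lists, the function names `tag_tokens` and `classify_language`, and the rule "a tweet with both Hindi and English words is code-switched" are all assumptions made for this example. The later stages (opinion and sentiment detection) are not covered.

```python
from typing import List

# Hypothetical toy lexicons, assumed for illustration only. A real system
# would use a trained word-level language identifier, since many Romanized
# Hindi words (e.g. "to", "me") are spelled like English words.
ENGLISH_WORDS = {"best", "wishes", "to", "indian", "team"}
HINDI_WORDS = {"tiranga", "aapke", "saath", "hai"}


def tag_tokens(tweet: str) -> List[str]:
    """Tag each whitespace-separated token as 'en', 'hi' or 'other'."""
    tags = []
    for token in tweet.lower().split():
        # Hashtags and mentions carry no reliable language signal here.
        if token.startswith("#") or token.startswith("@"):
            tags.append("other")
            continue
        word = token.strip(".,!?\"'")
        if word in HINDI_WORDS:
            tags.append("hi")
        elif word in ENGLISH_WORDS:
            tags.append("en")
        else:
            tags.append("other")
    return tags


def classify_language(tweet: str) -> str:
    """Return 'english', 'hindi', 'code-switched' or 'unknown' for a tweet."""
    tags = tag_tokens(tweet)
    has_english = "en" in tags
    has_hindi = "hi" in tags
    if has_english and has_hindi:
        return "code-switched"
    if has_hindi:
        return "hindi"
    if has_english:
        return "english"
    return "unknown"


print(classify_language("best wishes to indian team tiranga aapke saath hai"))
```

In a fuller version, each language class would then be passed to its own opinion detector and sentiment detector, as described above.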

 We observe several interesting phenomena in the language usage pattern by end users:

  •  In online social media, Hindi is seldom written in the Devanagari script. Instead, loose Roman transliteration, or Romanized Hindi, is common, especially when users code-switch between Hindi and English.
  • There is a strong preference towards Hindi (i.e. the native language) over English for expression of negative opinion. The effect is clearly visible in code-switched tweets, where a switch from English to Hindi is often correlated with a switch from a positive to negative sentiment like “#beefban is a right step forward .. me saala bangal ho jaaye toh behtar hoga .. ye bangladeshi ghuspaithiye isi bahane india aate hai ..”. Table 1 shows distribution of opinion/sentiment and non-opinion across different monolingual languages (English, Devanagari Hindi, Romanized Hindi), and Table 2 shows distribution over English and Romanized-Hindi fragments of code-switched tweets.

Table 1: Non-opinion and Sentiment across languages

                    English   Devanagari-Hindi   Romanized-Hindi
Nonopinion          0.35      0.47               0.39
Negative            0.17      0.17               0.25
Negative/Positive   0.38      3.27               1.60

Table 2: Non-opinion and Sentiment across English and Romanized Hindi fragments of code-switched tweets

                    English   Romanized-Hindi
Nonopinion          0.45      0.39
Negative            0.14      0.22
Negative/Positive   0.34      2.2

  •  Hindi, the native language, is preferred over English for swearing. We also measure the distribution of abusive Romanized-Hindi and English fragments in code-switched tweets. Interestingly, over 90% of the swear words occur in the Hindi parts. The figure below shows the distribution of abusive tweets across different languages.
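As a small worked example of how to read the Negative/Positive rows, the sketch below recovers the positive share implied by Table 1. It assumes that the Negative row and the Negative/Positive ratio are computed over the same set of tweets, which the tables do not state explicitly; the name `implied_positive_share` is introduced here for illustration.

```python
# Values copied from Table 1 (Negative share and Negative/Positive ratio).
TABLE_1 = {
    "English": {"negative": 0.17, "neg_pos_ratio": 0.38},
    "Devanagari-Hindi": {"negative": 0.17, "neg_pos_ratio": 3.27},
    "Romanized-Hindi": {"negative": 0.25, "neg_pos_ratio": 1.60},
}


def implied_positive_share(negative: float, neg_pos_ratio: float) -> float:
    """Positive share implied by ratio = negative / positive.

    Assumes both figures refer to the same population of tweets.
    """
    return negative / neg_pos_ratio


for language, row in TABLE_1.items():
    positive = implied_positive_share(row["negative"], row["neg_pos_ratio"])
    print(f"{language}: implied positive share = {positive:.2f}")
```

Under that assumption, a ratio below 1 (English) means positive opinions outnumber negative ones, while a ratio above 1 (both forms of Hindi) means the reverse.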

If you want to know more details about this study, please read our EMNLP 2016 paper:

“Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?”, Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury and Niloy Ganguly, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

With ‘Udta Punjab’ there’s more code-mixing in Bollywood

Indrani Medhi Thies, Microsoft Research India

“Saare Gabru toh sooiyan lagake tight hai madam, Ab Ladies ko hi kuch karna padega na”, says Assistant Sub-Inspector Sartaj to Dr. Preet Sahni in ‘Udta Punjab’.

(English: “All the young hunks are wasted from injecting needles, so it’s the women who’ll have to do something”.)

‘Udta Punjab’ is Bollywood’s attempt at creating awareness about the drug menace of Punjab. In recent years drugs have crippled the youth of this state and things are only getting worse. Udta Punjab looks at the drug problem from the perspective of three intertwining stories of its four protagonists. The stories have one connection: how drugs are affecting lives across the socio-economic strata and how people are fighting back in their own way. However, another connection that you can’t help noticing is the use of code-mixed language across these stories, of Punjabi, Hindi, Bhojpuri, and English.

In Udta Punjab Sartaj’s (Diljit Dosanjh) teenaged brother Balli has turned an addict and the family has no idea about his situation. Sartaj gets sensitized about the drug problem with the help of the beautiful, strong, independent Dr. Preet Sahni (Kareena Kapoor), a crusader against drugs. Together Sartaj and Preet carry out a series of sting operations that will expose the drug racket to the entire nation.

And it’s not just teenaged boys like Balli: the London-born Punjabi rockstar Tommy Singh (Shahid Kapoor) is also a drug addict. Known to his fans as ‘Gabru’, he’s a youth icon. He has composed music about drugs:

Enough is enough

Kutte se bhi tough

Life ho jaaye toh maaro 10 puff

Match fix lagega toh pitch khodenge

Hoke besharam bandi rich khojenge

Emo-panti apne ko to suit na kare

Gabru hi kya jo hip se shoot na kare

(English: Enough is enough

Tougher than a dog

When life gets to you, then take in 10 puffs

If we think that a match is fixed, we will dig up the pitch

We’ll be shameless, we will search for a rich girl

Being emotional doesn’t suit us

One who doesn’t shoot from his hip isn’t a young man)

Powder ki line’o ka rakhega kaun hisaab…

Haan ud-daa Punjab

(English: Who’ll keep track of the lines of powder [cocaine]…

Yes, Punjab’s flying).

Tommy’s street-smart ‘Tayaji’ (Uncle, Satish Kaushik) has used every opportunity to make money off Tommy’s popularity, but is also very loyal and supportive. He speaks in a chaste Punjabi accent,

“Puttar, Honda hai cocaine chaddo toh withdrawal symptom hote hain”.

(English: Son, it happens, when you stop using cocaine, you experience withdrawal symptoms).

The third story is of a migrant Bihari labourer (name not revealed till the end, Alia Bhatt) who gets dragged into drug peddling by accident, and lands in the worst situation of all the film’s protagonists. It’s only later revealed that she’s a district-level hockey champion who was forced to give up her dream to work as a labourer in the fields in Punjab so she can support her family. Tommy and she meet in the most unusual of circumstances. Tommy tells her:

“Agar tujhe koi bole jo nikal gaya woh teri life ka sabse accha time tha….Maal khatam…Party over. Go Home”.

(English: If someone were to tell you, what’s past was the best time of your life… Stuff’s over (now), party over, go home).

Alia’s character doesn’t do much code-mixing; there’s just one English word or two in her utterances:

“Kaise na aya acha time re, sala acha time khojte khojte toh ye hal ho gya…”

(English: How will good times not come, idiot, see what situation searching for good times has landed me in…).

‘Udta Punjab’ exposes its audience to the gut-wrenching reality of the drug problem in Punjab. Time is running out and something needs to be done soon. The situation is bleak, but a ray of hope flickers. Udta Punjab is also a great example of how code-mixing is becoming increasingly commonplace in films. So if you’re interested in some realistic cinema, or just curious how code-mixing overlaps with Bollywood,

Go, aglaa show book karo, jaldi, if you haven’t already!