Project Mélange

Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?

Koustav Rudra, IIT Kharagpur

Code-switching or code-mixing, a fluid alternation between two or more languages within a single conversation, is very common in chats, conversations, and messages posted on social media platforms like Twitter and Facebook. Around 125 million people in India speak English, half of whom have Hindi as their mother tongue. A large proportion of the remaining half, especially those residing in the metropolitan cities, also know at least some Hindi. This makes Hindi-English code-switching, commonly called Hinglish, extremely widespread in India.

It is observed that users generally switch languages within their conversations or posts for various reasons:

  •  Sometimes users post information in one language and personal opinions in another, e.g., “was it a modi wave or just wave? #aapsweep #aapkidilli bjp ki halat congress se jyada achchi ni lag rahi”
  •  Users want to emphasize their opinion, e.g., “best wishes to indian team tiranga aapke saath hai”
  •  Users want to curse or use abusive language, e.g., “Seeing the movie I thought ki kis bandar ke haath me camera de do to wo bhi movie banaa le”

In this study, we try to understand the language preference of users for expressing opinions and sentiment. We first build a corpus of 430,000 unique India-specific tweets across four domains (sports, entertainment, politics, and current events), posted by 125,396 unique users, and automatically classify the tweets by language: English, Hindi, or Hindi-English code-switched. We then develop an opinion detector for each language class to further categorize the tweets as opinionated or non-opinionated (facts, wishes, questions, etc.). Sentiment detectors then classify the opinionated tweets as positive, negative, or neutral.
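
To make the cascade concrete, here is a minimal, toy sketch of how such a three-stage pipeline could be wired together. The word lists and rules below are invented placeholders, not the classifiers used in the study; they only illustrate how the language, opinion, and sentiment stages feed into one another.

    # Toy version of the cascaded pipeline described above. The cue lists are
    # invented placeholders standing in for the study's trained classifiers.
    HINDI_CUES = {"hai", "nahi", "ki", "ka", "ke", "aur", "toh", "kya"}      # assumed Romanized-Hindi hints
    OPINION_CUES = {"best", "worst", "love", "hate", "achcha", "bakwas"}     # assumed opinion lexicon
    POSITIVE_CUES = {"best", "love", "achcha", "great"}
    NEGATIVE_CUES = {"worst", "hate", "bakwas", "bura"}

    def detect_language(tweet):
        words = set(tweet.lower().split())
        hindi_hits = len(words & HINDI_CUES)
        if hindi_hits == 0:
            return "english"
        return "code-switched" if hindi_hits < len(words) else "romanized-hindi"

    def classify_tweet(tweet):
        lang = detect_language(tweet)                      # stage 1: language class
        words = set(tweet.lower().split())
        if not words & OPINION_CUES:                       # stage 2: opinionated or not
            return (lang, "non-opinion")
        pos, neg = len(words & POSITIVE_CUES), len(words & NEGATIVE_CUES)
        if pos == neg:                                     # stage 3: sentiment polarity
            return (lang, "neutral")
        return (lang, "positive" if pos > neg else "negative")

    print(classify_tweet("best wishes to indian team tiranga aapke saath hai"))
    # -> ('code-switched', 'positive')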

We observe several interesting phenomena in the language usage patterns of end users:

  •  In online social media, Hindi is seldom written in the Devanagari script. Instead, loose Roman transliteration, or Romanized Hindi, is common, especially when users code-switch between Hindi and English.
  •  There is a strong preference for Hindi (i.e., the native language) over English for the expression of negative opinion. The effect is clearly visible in code-switched tweets, where a switch from English to Hindi often coincides with a switch from positive to negative sentiment, as in “#beefban is a right step forward .. me saala bangal ho jaaye toh behtar hoga .. ye bangladeshi ghuspaithiye isi bahane india aate hai ..”. Table 1 shows the fraction of non-opinion and negative tweets, and the negative-to-positive ratio, across the monolingual language classes (English, Devanagari Hindi, Romanized Hindi), and Table 2 shows the same statistics over the English and Romanized-Hindi fragments of code-switched tweets.

Table 1: Non-opinion and sentiment across languages

                     English   Devanagari-Hindi   Romanized-Hindi
Non-opinion             0.35               0.47              0.39
Negative                0.17               0.17              0.25
Negative/Positive       0.38               3.27              1.60

Table 2: Non-opinion and sentiment across English and Romanized-Hindi fragments of code-switched tweets

                     English   Romanized-Hindi
Non-opinion             0.45              0.39
Negative                0.14              0.22
Negative/Positive       0.34              2.20
  •  The native language, Hindi, is also preferred over English for swearing. We measure the distribution of abusive Romanized-Hindi and English fragments in code-switched tweets and find that, interestingly, over 90% of the swear words occur in the Hindi fragments; the overall distribution of abusive tweets across the different language classes shows the same strong preference for Hindi. (A sketch of how such per-language tallies can be computed follows this list.)
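
The fractions and ratios in Tables 1 and 2, as well as the per-language swear-word counts above, are simple label tallies normalized within each language class. Below is a minimal sketch of how such a distribution could be computed once every tweet (or fragment) carries a language label and a sentiment label; the toy data and label names are assumptions for illustration only.

    from collections import Counter

    # Toy (language, label) pairs; in the study these come from the language,
    # opinion, and sentiment classifiers, not from hand labels like these.
    labeled = [
        ("romanized-hindi", "negative"),
        ("romanized-hindi", "non-opinion"),
        ("romanized-hindi", "negative"),
        ("english", "positive"),
        ("english", "non-opinion"),
    ]

    def label_distribution(items):
        """Per-language label fractions plus a negative-to-positive ratio."""
        per_lang = {}
        for lang, label in items:
            per_lang.setdefault(lang, Counter())[label] += 1
        stats = {}
        for lang, counts in per_lang.items():
            total = sum(counts.values())
            fractions = {label: round(n / total, 2) for label, n in counts.items()}
            pos = counts["positive"]
            fractions["neg/pos"] = counts["negative"] / pos if pos else float("inf")
            stats[lang] = fractions
        return stats

    print(label_distribution(labeled))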

If you want to know more details about this study, please read our EMNLP 2016 paper:

“Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?”, Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury and Niloy Ganguly, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.


Project Mélange

Monojit Choudhury, Microsoft Research India

Have you ever come across a Facebook post or a tweet in some unknown language, clicked the translate button (the tiny globe in the top right corner), and found that either it says “could not be translated” or the output seems to be a simple reproduction of the input words, maybe in a slightly different order?

[Screenshot: a Romanized Bengali code-mixed tweet and its Bing translation]

This translation is powered by Microsoft’s Bing. Laugh you may at the system or its output, but to be fair, the one above is a particularly hard one to crack! The tweet is in Bengali, but not written in the native Bengali script (which looks something like এরকম); rather, it is a loose phonetic transcription of the words in the Roman script. Several English words (presstitute, real-life, free) and one Hindi word, again in Roman script (nautonki), have been generously embedded into the Bengali matrix. And we would have wished Bing’s output to be: “Well done. Now the people of the country will enjoy for free the real-life drama of the #presstitutes.”

Unfortunately, the system thought it was a German tweet, presumably because of der and korbe, which it translated into English as NULL and “baskets” respectively. It didn’t know what to do with the rest of the words, so they were reproduced verbatim. To better appreciate the system’s plight, imagine that I asked you to translate a Basque sentence written in the Lao script, sprinkled with words from Ojibwe and Önge. Bing doesn’t expect Bengali to be written in the Roman script. Nor does it expect a Bengali sentence to contain English and Hindi words. It doesn’t even have a translator for Bengali!

Interestingly, there is nothing unusual here, either in the user’s behavior or in the system’s response. Multilingual users from all over the world tweet, text, or post comments in code-mixed languages, and often use Roman transliterations. NLP systems, on the other hand, almost always assume the input to be in a single language and in the native script.

Project Mélange aims at building NLP techniques and Natural Language Interfaces for mixed-script and mixed-code text and speech. We break this complex problem into three steps:

  1. Detection: Identifying whether the input speech or text is code-mixed, and if so, which languages are being mixed (a toy word-level sketch follows this list).
  2. Basic processing: Building basic NLP tools and resources that can process code-mixed language. The interesting question here is how we can leverage existing monolingual resources. If you know two languages, say Tamil and Swahili, you would understand a Tamil-Swahili code-mixed conversation even if you had never heard anybody mix these two languages before, wouldn’t you? By the same logic, can’t we make use of Tamil and Swahili monolingual NLP technologies to create a Tamahili system?
  3. End-user Applications: Building on the basic resources and tools for code-mixing, we would finally like to build end-user applications and experiences that cater to multilinguals’ linguistic behavior.
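
As a deliberately naive illustration of the detection step, here is a toy word-level language identifier that tags each token by lexicon lookup and flags the text as code-mixed when more than one language shows up. The tiny word lists are assumptions made for this sketch; real detectors rely on character n-gram models, large dictionaries, and sequence labeling rather than fixed lists.

    # Toy word-level language identification, for illustration only.
    ENGLISH_WORDS = {"best", "wishes", "to", "team", "the", "movie", "is"}     # assumed toy lexicon
    HINDI_WORDS = {"hai", "ki", "ke", "ka", "aapke", "saath", "nahi", "toh"}   # assumed toy lexicon (Romanized)

    def tag_tokens(text):
        """Label each token as 'en', 'hi', or 'other' (hashtags, names, unknowns)."""
        tags = []
        for token in text.lower().split():
            if token in HINDI_WORDS:
                tags.append((token, "hi"))
            elif token in ENGLISH_WORDS:
                tags.append((token, "en"))
            else:
                tags.append((token, "other"))
        return tags

    def detect_mixing(text):
        """Return 'code-mixed' if more than one language appears, else the single language."""
        langs = {lang for _, lang in tag_tokens(text) if lang != "other"}
        if len(langs) > 1:
            return "code-mixed"
        return langs.pop() if langs else "unknown"

    print(detect_mixing("best wishes to indian team tiranga aapke saath hai"))
    # -> 'code-mixed'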

To build all this technology, we need lots of annotated language data and, of course, a sound theoretical understanding of when, how, and why people code-mix.

Though there is very little computational work on code-mixing, linguists have been pondering these questions for more than half a century now. There are theories as well as empirical studies on which grammatical structures are allowable when two languages are mixed, what the motivations behind code-mixing are, whether teens code-mix more than their grannies, and so on. However, all these theories have been built on small amounts of data obtained either from a small group of speakers or through self-reporting studies and interviews. People code-mix in speech; nobody writes books, legal documents, or news in a code-mixed language, so it has been quite difficult to conduct large-scale studies of code-mixing. Thanks to the growing popularity of social media and other forms of computer-mediated communication, today we have large volumes of code-mixed text data. We are now using this data to understand the grammatical constraints, pragmatic functions, and socio-linguistic aspects of code-mixing.

Let’s hope that in the near future we will not have to make a conscious effort to avoid Spanish words or phrases in our English while speaking to Cortana or using Skype Translator, and that we will not have to toggle the language selection button to make keyboard prediction work when we move from French to Arabic.