Koustav Rudra, IIT Kharagpur
Code-switching, or code-mixing, is the fluid alternation between two or more languages in a single conversation. It is very common in chats, conversations, and messages posted on social media such as Twitter and Facebook. Around 125 million people in India speak English, half of whom have Hindi as their mother tongue. A large proportion of the remaining half, especially those residing in metropolitan cities, also know at least some Hindi. This makes Hindi-English code-switching, commonly called Hinglish, extremely widespread in India.
We observe that users switch between languages in their conversations or posts for several reasons:
- Users sometimes post information in one language and personal opinions in another, e.g. – “was it a modi wave or just wave? #aapsweep #aapkidilli bjp ki halat congress se jyada achchi ni lag rahi”
- Users want to emphasize their opinion, e.g. – “best wishes to indian team tiranga aapke saath hai”
- Users want to curse or use abusive language, e.g. – “Seeing the movie I thought ki kis bandar ke haath me camera de do to wo bhi movie banaa le”
In this study, we try to understand the language preferences of users when they express opinions and sentiment. To do so, we first build a corpus of 430,000 unique India-specific tweets across four domains (sports, entertainment, politics, and current events) posted by 125,396 unique users, and automatically classify the tweets by language: English, Hindi, or Hindi-English code-switched. We then develop an opinion detector for each language class to further categorize the tweets as opinionated or non-opinionated (facts, wishes, questions, etc.). Sentiment detectors then classify the opinionated tweets as positive, negative, or neutral.
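The three-stage cascade described above (language class, then opinion detection, then sentiment) can be sketched roughly as follows. This is only an illustration of the control flow: the names `classify_tweet` and `label_distribution` are hypothetical, and the language tagger and the per-language detectors are assumed to be supplied by the caller, since the study's actual classifiers are not reproduced here.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# All three stages are hypothetical stand-ins supplied by the caller;
# they are not the classifiers used in the study.
LanguageTagger = Callable[[str], str]     # "english", "hindi" or "code-switched"
OpinionDetector = Callable[[str], bool]   # True when the tweet is opinionated
SentimentDetector = Callable[[str], str]  # "positive", "negative" or "neutral"


def classify_tweet(
    tweet: str,
    tag_language: LanguageTagger,
    opinion_detectors: Dict[str, OpinionDetector],
    sentiment_detectors: Dict[str, SentimentDetector],
) -> Tuple[str, str]:
    """Return (language class, label) for a single tweet."""
    # Stage 1: assign the tweet to a language class.
    language = tag_language(tweet)
    # Stage 2: the detector for that class separates opinions from
    # non-opinions (facts, wishes, questions, etc.).
    if not opinion_detectors[language](tweet):
        return language, "non-opinion"
    # Stage 3: only opinionated tweets reach the sentiment detector.
    return language, sentiment_detectors[language](tweet)


def label_distribution(
    tweets: Iterable[str],
    tag_language: LanguageTagger,
    opinion_detectors: Dict[str, OpinionDetector],
    sentiment_detectors: Dict[str, SentimentDetector],
) -> Counter:
    """Count how many tweets fall into each (language class, label) pair."""
    return Counter(
        classify_tweet(t, tag_language, opinion_detectors, sentiment_detectors)
        for t in tweets
    )
```

Keeping one detector per language class in a dictionary mirrors the per-class setup described above, and the resulting counts are the kind of tallies a sentiment-by-language table would be built from.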
We observe several interesting patterns in how users choose their language:
- In online social media, Hindi is seldom written in the Devanagari script. Instead, loose Roman transliteration, or Romanized Hindi, is common, especially when users code-switch between Hindi and English.
- There is a strong preference for Hindi (i.e. the native language) over English when expressing negative opinions. The effect is clearly visible in code-switched tweets, where a switch from English to Hindi is often correlated with a switch from positive to negative sentiment, as in “#beefban is a right step forward .. me saala bangal ho jaaye toh behtar hoga .. ye bangladeshi ghuspaithiye isi bahane india aate hai ..”. Table 1 shows the distribution of opinion/sentiment and non-opinion across the monolingual language classes (English, Devanagari Hindi, Romanized Hindi), and Table 2 shows the distribution over the English and Romanized-Hindi fragments of code-switched tweets.
Table 1: Non-opinion and Sentiment across languages
Table 2: Non-opinion and Sentiment across English and Romanized Hindi fragments of code-switched tweets
- Hindi, the native language, is preferred over English for swearing. We also measure the distribution of abusive Romanized-Hindi and English fragments in code-switched tweets. Interestingly, over 90% of the swear words occur in the Hindi parts. The figure below shows the distribution of abusive tweets across different languages.
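Two of the measurements behind these observations can be approximated with a short sketch. The first function separates Devanagari from Roman text using the Unicode Devanagari block (U+0900 to U+097F); script alone cannot tell Romanized Hindi from English, so the second function assumes word-level language tags are already available. Both function names, the `"hi"`/`"en"` tag values, and the swear-word lexicon are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def script_of(text: str) -> str:
    """Classify text as 'devanagari', 'roman', 'mixed' or 'other' by its characters."""
    # U+0900..U+097F is the Unicode Devanagari block.
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    # ASCII letters stand in for Roman script (English or Romanized Hindi).
    has_roman = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_roman:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    return "roman" if has_roman else "other"


def swear_share_by_language(
    tagged_tokens: Iterable[Tuple[str, str]],
    swear_lexicon: Set[str],
) -> Dict[str, float]:
    """Fraction of swear-word occurrences per language tag.

    tagged_tokens holds (token, language tag) pairs, e.g. ("movie", "en");
    the tags are assumed to come from a separate word-level language tagger.
    """
    counts = Counter(
        lang for token, lang in tagged_tokens if token.lower() in swear_lexicon
    )
    total = sum(counts.values())
    # An empty dict signals that no swear words were found at all.
    return {lang: n / total for lang, n in counts.items()} if total else {}
```

Run over the code-switched tweets, a share of more than 0.9 for the Hindi tag would correspond to the pattern reported in the last observation.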
For more details about this study, please read our EMNLP 2016 paper:
“Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?”, Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury and Niloy Ganguly, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.