Sunayana Sitaram, Microsoft Research India
Have you ever wanted to have your texts and WhatsApp messages read out to you? Have you ever used a foreign word while using a system like Cortana, only to find that it does not recognize words that are not in the language it is expecting to hear? Speech Recognition and Synthesis of code-mixed utterances is a very challenging problem. Most speech processing systems are designed to be used with a single language. Moreover, people may pronounce words differently when they are speaking multiple languages at the same time, which may confuse such systems.
Let us look at the problem of reading out a recipe on a popular Hindi recipe website Nishamadhulika. Here’s the link to the recipe, if you want to take a look http://nishamadhulika.com/1064-creamy-mushroom-soup-recipe.html
Now as you can see, most of the text in the recipe description is in Hindi, written in the native script (Devanagari). This should be fairly easy for a Hindi Text to Speech system to read out to the user. However, we see some English words in the title, and also numbers in the Roman script to denote quantities. If you scroll down to the comments, you see that many of the comments are in Hindi, but are not written in native script. Let us look at a couple of comments.
“bahut yammi recipe thi nisha ji ye soup mere baby ne jo ki 15 month ka hai bahut shok se piya hai”
“Nisha ji musroom soup bht acha bna h.mje cooking bhi bht achi lgti h.bus ye btao is e without cream healthi kaise bnaya ja skta h ans jrur dena”
We find that there are many English words in these sentences (“soup”, “yammi”, which is “yummy”, “15 month”, “baby”, “cooking” etc.). We also find that users don’t always follow a standard way of transliterating Hindi into Romanized script. For example, in the first sentence, the word “बहुत” is written as “bahut”, while in the second one, it is shortened to “bht”. Similarly, the word “है” is written as “hai” in the first comment, and only as “h” in the second one!
Now imagine if you are a Text to Speech system and you need to read out such text! You need to identify what languages the words are in, rectify spelling mistakes, expand contractions and then figure out how you are actually going to pronounce the word. This is made even harder by the fact that the training data for most Text to Speech systems today only consists of single language, clean, well-written data.
In a future post, we will talk more about how we make Text to Speech systems capable of synthesizing mixed language text. Meanwhile, you can read this paper:
‘Speech Synthesis of Code Mixed Text’, Sunayana Sitaram and Alan W Black, in Proceedings of LREC 2016, Portoroz, Slovenia