Sunday, September 2, 2007

A Peek Into Google Labs Indic Transliteration Service

Google recently unveiled its Hindi transliteration service. It is cool, and it amazed me just as most other Google services have. Note that transliteration schemes for the Devanagari script have been available already - ITRANS and INSROT. What Google has done is very different from these. Here are some things I could figure out.

Some frequently used words are pre-loaded. For others, it calls a web service to do the transliteration and adds the result to the local cache.

The transliterator holds an array of all the English words typed, each with a flag for whether it has been transliterated and its transliterated equivalent. It has a limit of 300 words that can be transliterated at one go. Multiple words can be sent to the transliterator at the same time. Whenever transliteration is invoked, it goes through the list of words, accumulates the words that have not yet been transliterated and makes a request to the server.
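Here is a minimal Python sketch of that bookkeeping as I understand it. The names (Entry, WordList, pending, apply) are my own, not Google's, and capping a single batch at 300 words is my reading of the limit rather than something I could confirm from the script.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_WORDS = 300  # the limit mentioned above; applying it per batch is an assumption


@dataclass
class Entry:
    english: str                 # the word as typed
    done: bool = False           # has it been transliterated yet?
    hindi: Optional[str] = None  # transliterated equivalent, once known


@dataclass
class WordList:
    entries: List[Entry] = field(default_factory=list)

    def add(self, word: str) -> None:
        self.entries.append(Entry(word))

    def pending(self) -> List[str]:
        """Words still waiting for transliteration, capped at MAX_WORDS."""
        waiting = [e.english for e in self.entries if not e.done]
        return waiting[:MAX_WORDS]

    def apply(self, results: Dict[str, str]) -> None:
        """Store server results and mark the matching entries as done."""
        for e in self.entries:
            if not e.done and e.english in results:
                e.hindi = results[e.english]
                e.done = True
```

Calling pending() before each request gives the batch to send, and apply() with the server's answers means those words are never sent again.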

It uses a JSON web service at the back to get possible transliterated options. The request parameters seem similar to the Google Translation Service but it must work very differently underneath. The request URL for the Indic Transliterator looks something like this:

http://www.google.com/transliterate/indic?tlqt=1&langpair=en|hi&text=guulaab&tl_app=3
Where
langpair=[language from |language to]
text=[the English text to be transliterated]
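To make the parameters concrete, here is a small Python sketch that assembles such a URL. The function name is hypothetical, and I have simply copied the tlqt=1 and tl_app=3 values from the request I observed without knowing what they mean. How several words are packed into one request is not something I worked out, so this handles a single word only.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/transliterate/indic"


def build_transliterate_url(text, source="en", target="hi"):
    """Build a request URL like the one observed above (hypothetical helper)."""
    params = {
        "tlqt": 1,                          # copied from the observed request
        "langpair": f"{source}|{target}",   # language from | language to
        "text": text,                       # the English text to transliterate
        "tl_app": 3,                        # copied from the observed request
    }
    # urlencode percent-encodes the '|' as %7C, which servers decode back to '|'
    return f"{BASE_URL}?{urlencode(params)}"


print(build_transliterate_url("guulaab"))
```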

The response has the format shown below (wherever a '.' appears in the listings in this post, it stands in for a Devanagari glyph in the actual data). The value of 'ew' is the English word sent for transliteration and 'hws' contains a list of suggestions, the first one being the default.

while(1); [ {
"ew" : "gulaab",
"hws" : [
"गूलाब",
"गउलाब",
"जुलाब",
"गुलआब",
"gulaab"
] }, ]
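A response like this is not strict JSON: it starts with a while(1); guard and has a trailing comma. Below is a rough Python sketch of how a client could read it anyway. The function name is mine, and the comma clean-up is a crude regular expression that assumes no suggestion string contains a comma directly followed by a closing bracket.

```python
import json
import re

PREFIX = "while(1);"


def parse_response(body):
    """Return {english word: [suggestions]} from a response body (a sketch)."""
    body = body.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    # Remove trailing commas before ']' or '}' so json.loads accepts the text
    body = re.sub(r",\s*([\]}])", r"\1", body)
    return {item["ew"]: item["hws"] for item in json.loads(body)}


sample = 'while(1); [ { "ew" : "gulaab", "hws" : [ "गूलाब", "गउलाब", "gulaab" ] }, ]'
suggestions = parse_response(sample)["gulaab"]
default = suggestions[0]  # the first suggestion is the default one
```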


There is a set of words that is loaded by default - the global lookups, as the script calls them. They are loaded through the call http://www.google.com/transliterate/indic/global_en_hi.js?v=1. The JSON response loads two arrays with the variable names 'englishwords' and the corresponding 'hindiwords'.

All transliterations are cached. Cache includes multiple suggestions.
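Putting the global lookups and the cache together, the client-side logic could look roughly like the Python sketch below. The class and method names are my own invention, the fetch callback stands in for the web service call, and the sample words in the usage lines are made up rather than taken from the real global_en_hi.js file.

```python
from typing import Callable, Dict, List


class TransliterationCache:
    """Cache seeded from two parallel arrays; falls back to a fetch function."""

    def __init__(self, english_words: List[str], hindi_words: List[str]):
        if len(english_words) != len(hindi_words):
            raise ValueError("parallel arrays must have equal length")
        # Pre-loaded words get a single suggestion each
        self._store: Dict[str, List[str]] = {
            e: [h] for e, h in zip(english_words, hindi_words)
        }

    def lookup(self, word: str, fetch: Callable[[str], List[str]]) -> List[str]:
        """Return all suggestions, calling fetch only on a cache miss."""
        if word not in self._store:
            self._store[word] = fetch(word)
        return self._store[word]

    def default(self, word: str, fetch: Callable[[str], List[str]]) -> str:
        """The first suggestion is treated as the default."""
        return self.lookup(word, fetch)[0]


# Illustrative usage with made-up data and a stand-in for the server call
cache = TransliterationCache(["hai"], ["है"])
print(cache.default("hai", lambda w: [w]))
```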

The system is designed to learn, or at least remember and keep track of, corrections made by the user. If you pull down the suggestions list and pick a different word than the default suggestion, it attempts to send your choice back to the server. It does a POST request to the server at http://www.google.com/transliterate/indic with the following parameters (again, '.' is actually some Devanagari glyph):

tlqt=3
langpair=en|hi
uv=test:.....?-0-1:,match:.....?-0-1:,cricket:.....?-0-1:,kaisa:.....-0-1:,hai:.....-0-1:,kamath:.....-0-1:,tathagat:.....-0-0::.....-1-0:,tamanna:.....-0-1:;0;0;
sct=null
tl_app=3

I had changed the suggestion for word 'tathagat'. So it has sent out what it had suggested (and didn't match) and what I had chosen instead. Notice that it sends all the words present in my transliteration box, unlike what it does while transliterating. Some kind of contextual analysis?
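Purely as a guess at the layout of the 'uv' value, here is a Python sketch that splits it into words and their candidate strings. I am assuming entries are separated by commas, that each candidate looks like text-number-number followed by a colon, and that everything after the first ';' is a trailer. I do not know what the two numbers mean, so the code just returns them as they are.

```python
from typing import List, Tuple

Candidate = Tuple[str, int, int]  # (text, first number, second number)


def parse_uv(uv: str) -> List[Tuple[str, List[Candidate]]]:
    """Split a 'uv' value into (word, candidates) pairs; the format is a guess."""
    entries_part = uv.split(";", 1)[0]  # ignore the trailing ';0;0;' part
    parsed = []
    for entry in entries_part.split(","):
        word, _, rest = entry.partition(":")
        candidates = []
        for chunk in rest.split(":"):
            if not chunk:
                continue  # skip empty pieces left by '::' and the final ':'
            text, first, second = chunk.rsplit("-", 2)
            candidates.append((text, int(first), int(second)))
        parsed.append((word, candidates))
    return parsed


# H, A and B are placeholders for the Devanagari strings in the real request
print(parse_uv("hai:H-0-1:,tathagat:A-0-0::B-1-0:;0;0;"))
```

With the request above, 'tathagat' is the only word that comes back with two candidates, which matches the one correction I made.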

There seems to be some kind of facility to store a user-specific cache of words - it is called 'user lookup'. The following URL fetches it, though it seems to be disabled for the time being.

http://www.google.com/transliterate/indic?tlqt=2&langpair=en|hi&tl_app=3

The keyboard works on a fixed set of rules - probably a tree of English characters mapping into Hindi Unicode values. Different Unicode values are possible at different points, depending on the preceding characters. The rules are neither ITRANS nor INSROT but a different transliteration scheme.
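To illustrate what a tree of English characters could look like, here is a toy longest-match trie in Python. The four rules are my own examples, not Google's table, and the sketch deliberately ignores vowels, matras and the halant, so it shows only the lookup mechanism and does not produce proper Hindi words.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next English character -> TrieNode
        self.output = None   # Devanagari string if a rule ends here


def build_trie(rules):
    """Build a trie from a {roman: devanagari} rule table."""
    root = TrieNode()
    for roman, glyph in rules.items():
        node = root
        for ch in roman:
            node = node.children.setdefault(ch, TrieNode())
        node.output = glyph
    return root


def transliterate(text, root):
    """Greedy longest-match walk; unmatched characters pass through unchanged."""
    out, i = [], 0
    while i < len(text):
        node, j = root, i
        best, best_end = None, i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.output is not None:
                best, best_end = node.output, j
        if best is None:
            out.append(text[i])
            i += 1
        else:
            out.append(best)
            i = best_end
    return "".join(out)


# Toy rule table (an assumption for illustration only)
TOY_RULES = {"k": "क", "kh": "ख", "g": "ग", "gh": "घ"}
root = build_trie(TOY_RULES)
print(transliterate("kh", root))
```

The point of the tree is that 'k' followed by 'h' resolves to a different Unicode value than 'k' alone, which is the dependence on neighbouring characters described above.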

The transliteration service probably works on a combination of statistical data and rules. For example, the first consonant क can be represented by 'k', 'ka', 'c' or 'ca' depending on context (e.g. kawaa, car). But at the same time the letter 'c' can also be used to represent च or स depending on context (char, cell). So there has to be a strong statistical element apart from the rules.

It is interesting to try seemingly similar representations of the Hindi word for crow - 'kawaa' and 'kauwa'. Also try 'samvidhaan' and 'sambidhaan'. The words sound similar, but they are transliterated differently.

Try 'tarif' and 'tarun'. The combination 'ta' in 'tarif' is interpreted as ता but the same in 'tarun' is interpreted as त. Why?

Try typing just 's'. It transliterated to एस!

We should soon see Google incorporate this in all their products and bring out such transliterators for other languages - at least where Unicode fonts are well defined and available.

2 comments:

Rashmi Talwar said...

the google indic tranliteration has basically only online use and no offline or desktop use as i see it . wish we could download the service to use it offline for other purposes
rashmi talwar
email rashmitalwarno1@gmail.com

Rashmi Talwar said...

hi
it seems google has offered its indic transliteration only for online use basically for social networking as i see it .
it doesnt seem to have any offline or desktop use . wish google would offer the same for offline use . would be gr8
Rashmi Talwar
rashmitalwarno1@gmail.com