Sunday, February 21, 2010

The Impact of Entropy of Indian Languages

Indian languages are often called relatively free word order languages. First of all, it is important to understand what the term "relatively" means here and what its impact on Indian languages is. It means that you cannot scramble the individual words of a sentence randomly, but you can freely scramble word groups, or so-called chunks. English holds information, or meaning, in the structure and syntax of its sentences: if you change the order of words in an English sentence, the meaning changes. For example, "The cat eats the dog" and "The dog eats the cat" have the same set of words, but their meanings are totally different. In Indian languages, you cannot move words randomly, but you can move chunks around and the meaning of the sentence remains the same; for example, "billi_ne chuhe_ko khaya" and "chuhe_ko billi_ne khaya" both mean "The cat ate the rat." At the surface level, the meaning of both sentences is the same. But is it really? There is a slight difference in meaning between the two: in the first sentence, billi (cat) is the point of focus, or topic; in the second sentence, chuhe (rat) is the point of focus. This phenomenon is known as topicalization.
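The words-fixed, chunks-free idea can be sketched in a few lines of Python. This is an illustrative toy (the chunk segmentation is hand-written, not produced by a real chunker): words stay inside their chunk, and every reordering of the chunks is still a grammatical sentence.

```python
from itertools import permutations

# Chunks of the Hindi sentence "billi_ne chuhe_ko khaya" ("The cat ate
# the rat"). Words inside a chunk stay together; only whole chunks move.
chunks = [["billi", "ne"], ["chuhe", "ko"], ["khaya"]]

# Every permutation of the chunks is a valid sentence, although the
# topic/focus shifts with the order.
for order in permutations(chunks):
    print(" ".join(word for chunk in order for word in chunk))
```

Permuting the individual words instead of the chunks would produce mostly ungrammatical strings, which is exactly why the freedom is only "relative".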
Then the next question that comes to mind is what impact this makes on the processing of such relatively free word order languages. For that, we should first understand why Indian languages are relatively free word order languages, and how their sentences encode meaning. Indian languages encode information through morphology, or in simple words, through case-marker inflections or postpositions. Each case marker or postposition on a chunk corresponds to a semantic argument, known as a thematic role. An example of a thematic role is the doer of the action: in the sentence above, the case marker "ne" corresponds to the doer of the action. But one point about Indian languages is still untouched: in spite of being relatively free word order languages, they are claimed to be verb-final languages. The verb chunk comes last in Indian languages; if it does not, the sentence sounds odd.
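The case-marker-to-role idea can be made concrete with a small lookup. The marker inventory and role names below are illustrative assumptions, not a complete or authoritative list:

```python
# Hypothetical mapping from Hindi case markers to thematic roles.
# This inventory is illustrative only, not exhaustive.
CASE_MARKER_ROLES = {
    "ne": "agent (doer of the action)",
    "ko": "patient/recipient",
    "se": "instrument/source",
    "me": "location",
}

def role_of(chunk):
    """Guess the thematic role of a chunk from its trailing case marker.

    Chunks are written as word_marker, e.g. "billi_ne".
    """
    marker = chunk.split("_")[-1]
    return CASE_MARKER_ROLES.get(marker, "unknown")

print(role_of("billi_ne"))  # agent (doer of the action)
print(role_of("chuhe_ko"))  # patient/recipient
```

Because the role travels with the marker rather than with the position, the chunk can move anywhere in the sentence and the lookup still works, which is the whole point of morphological encoding.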

So the next question that comes to mind is how this randomness affects the whole process of language processing. First of all, there are two types of approaches used in natural language processing: 1) the rule-based approach and 2) the statistical approach.

First, I will talk about statistical methods. Since Indian languages are not normalized in terms of word order, statistical methods sometimes need more resources to capture the variations. There are two types of variation: 1) word order variation and 2) morphological or lexical variation. First, word order variation: in an application like statistical machine translation, you need a larger parallel corpus to cover all the word order variants, since chunks can move while the meaning of the sentence stays the same, as shown in the example above. The second type is morphological variation. Indian languages store information in the form of inflections, which attach to words and generate many word forms. To handle this type of variation in a statistical model, you need a more intelligent dictionary. For example, in Indian language machine translation, lexical errors have more impact than word order errors, since word order does not affect the readability of a sentence as much as a lexical error does. These lexical errors are sometimes generated because of morphology.
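A minimal sketch of why morphology inflates the dictionary, and what a "more intelligent" dictionary does about it. The stem table below is a hand-made toy (real systems would use a morphological analyzer); the verb forms listed are common inflections of the Hindi stem kha- ("eat"):

```python
# Toy stem table: one verb stem, many surface forms. A dictionary keyed
# on raw tokens would need an entry per form; mapping forms back to the
# stem keeps the model's vocabulary small.
STEM_FORMS = {
    "kha": ["khaya", "khayi", "khaye", "khata", "khati", "khate"],
}

def stem_of(word, table=STEM_FORMS):
    """Map an inflected form back to its stem; unknown words pass through."""
    for stem, forms in table.items():
        if word in forms:
            return stem
    return word

print(stem_of("khayi"))  # kha
print(stem_of("khate"))  # kha
```

With this normalization, all six surface forms share one dictionary entry instead of six, which is exactly the kind of lexical-variation handling the statistical model needs.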

The second type of method is the rule-based method, which is sometimes more appreciated for Indian languages. With rules, you can take care of word order, and morphology is sometimes an added boon to a rule-based system, since writing the rules is sometimes simpler. There is also some active research going on to learn those rules from data automatically. These approaches are mostly used in statistical parsing, motivated by rule-based systems: rule templates are learned from the data and given to the statistical parser.
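As an example of how simple such a rule can be, here is a sketch of the verb-final constraint mentioned earlier. The chunk tags (`NP`, `VG`) and the tuple representation are my own assumptions for illustration:

```python
# A one-line well-formedness rule: the verb chunk (tagged "VG" here,
# an assumed tag set) must come last in the sentence.
def is_verb_final(tagged_chunks):
    """tagged_chunks: list of (chunk_text, chunk_tag) pairs."""
    return bool(tagged_chunks) and tagged_chunks[-1][1] == "VG"

sent = [("billi_ne", "NP"), ("chuhe_ko", "NP"), ("khaya", "VG")]
print(is_verb_final(sent))  # True

odd = [("khaya", "VG"), ("billi_ne", "NP"), ("chuhe_ko", "NP")]
print(is_verb_final(odd))   # False
```

Note that the rule is indifferent to how the noun chunks are ordered among themselves, so it captures the free-within-limits behavior in one line; this is the kind of simplicity that makes rule writing attractive for these languages.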
The second type of impact is related to decisions about data and corpora: how can this randomness be encoded in data so that it can be processed by machines as well as humans? For fixed word order languages, phrase structure grammar (my computer science friends may have seen it in compiler design and the theory of computation) and phrase structure trees are more successful. In a compiler, the syntax of commands is invariant, and phrase structure grammar works well; likewise, in fixed word order languages, phrase structure works well. But for free word order languages, dependency grammar and dependency trees work well. For European languages, a dependency tree represents the dependencies between the words. For example, for 'Ram is good boy.', the dependency tree shows the relations between Ram, is, good, and boy, as shown below.
Modifier   Modified   Relation
Ram        is         subject
good       boy        noun-modifier
boy        is         patient
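The table above encodes a word-level dependency tree directly; here is a small Python sketch of that representation (the variable names and the head-map encoding are my own):

```python
# The (modifier, modified, relation) triples from the table above.
dependencies = [
    ("Ram",  "is",  "subject"),
    ("good", "boy", "noun-modifier"),
    ("boy",  "is",  "patient"),
]

# In a dependency tree every word except the root has exactly one head;
# the root is the head that is never itself a modifier.
heads = {mod: head for mod, head, _ in dependencies}
root = {head for _, head, _ in dependencies} - set(heads)
print(root)  # {'is'}
```

The computed root is the verb "is", which matches the intuition that the verb anchors the sentence and everything else hangs off it.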

But in Indian languages, chunks act as one unit rather than individual words, so dependencies between chunks work better than dependencies between words.
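For the chunk-level view, the same triple representation works with whole chunks as nodes. The relation labels below (karta and karma, agent-like and patient-like roles from the Paninian tradition often used for Indian language dependency annotation) are my assumption for this sketch:

```python
# Chunk-level dependencies for "billi_ne chuhe_ko khaya": the nodes are
# whole chunks, and both noun chunks attach to the verb chunk.
chunk_dependencies = [
    ("billi_ne", "khaya", "karta (agent)"),
    ("chuhe_ko", "khaya", "karma (patient)"),
]

for dep, head, rel in chunk_dependencies:
    print(f"{dep} --{rel}--> {head}")
```

Because the attachments are to the verb and the labels come from the case markers, this tree stays the same no matter how the chunks are scrambled, which is why chunk-level dependency representations suit these languages.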
