
Sunday, February 21, 2010

The Impact of Entropy of Indian Languages

Indian languages are often called relatively free word order languages. First of all, it is important to understand what the term "relative" means here and what its impact on Indian languages is. It means that you cannot scramble the words of a sentence randomly, but you can scramble word groups, so-called chunks, freely. English holds its information, or meaning, in the structure and syntax of sentences: if you change the order of words in an English sentence, the meaning changes. For example, "The cat eats the dog" and "The dog eats the cat" have the same set of words but totally different meanings. In Indian languages you cannot move individual words randomly, but you can move chunks around and the meaning of the sentence stays the same: for example, "billi_ne chuhe_ko khaya" and "chuhe_ko billi_ne khaya" both mean "The cat ate the rat". At the surface level the meaning of the two sentences is the same. But is it really? There is a slight difference: in the first sentence billi (cat) is the point of focus or topic, while in the second chuha (rat) is. This is known as topicalization.
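To make the chunk idea concrete, here is a toy Python sketch (my own illustration, not from any toolkit) that reorders the chunks of the sentence above; the case markers travel with their chunks, so every order keeps the same predicate-argument meaning:

from itertools import permutations

# Hypothetical chunked form of "billi_ne chuhe_ko khaya"
chunks = ["billi_ne", "chuhe_ko", "khaya"]

# Print every chunk order; who-did-what is preserved because the
# case markers (_ne, _ko) stay attached to their chunks.
for order in permutations(chunks):
    print(" ".join(order))

Some of these orders shift the topic or focus, and some sound more natural than others, which is exactly the topicalization effect described above.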
The next question that comes to mind is what impact this has on the processing of such relatively free word order languages. For that we should first understand why Indian languages are relatively free word order languages and how their sentences encode meaning. Indian languages encode information through morphology, or in simple words, through case-marker inflections and postpositions. Each case marker or postposition on a chunk corresponds to a semantic argument, known as a thematic role; an example of a thematic role is the doer of the action. In the sentence above, the postposition "ne" marks the doer of the action. One point about Indian languages is still untouched: in spite of being free word order languages, they are claimed to be verb-final. The verb chunk comes last, and if it does not, the sentence sounds odd.
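As a minimal sketch of how a processor might exploit this, assume a toy mapping from case markers to thematic roles (the markers, role labels, and chunk format here are simplified for illustration):

# Toy lookup from Hindi case markers to thematic roles.
# The role labels are illustrative, not from any standard tagset.
CASE_MARKER_ROLES = {
    "ne": "agent (doer of the action)",
    "ko": "patient/recipient",
    "se": "instrument/source",
}

def thematic_role(chunk):
    """Guess a chunk's thematic role from its trailing case marker."""
    marker = chunk.rsplit("_", 1)[-1]
    return CASE_MARKER_ROLES.get(marker, "unknown")

for chunk in ["billi_ne", "chuhe_ko"]:
    print(chunk, "->", thematic_role(chunk))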

So the next question is how this freedom of order affects the whole process of language processing. First of all, there are two types of approaches used in natural language processing: 1) the rule-based approach and 2) the statistical approach.

First, I will talk about statistical methods. Since Indian languages are not normalized in terms of word order, statistical methods sometimes need more resources to capture the variation. There are two types of variation: 1) word order variation and 2) morphological or lexical variation. Taking word order variation first: in an application like statistical machine translation, you need a larger parallel corpus to cover all the word order variants, since chunks can move while the meaning of the sentence stays the same, as shown in the example above. The second type is morphological variation. Indian languages store information in the form of inflections; the inflections attach to words and generate many more word forms. To handle this kind of variation in a statistical model you need a more intelligent dictionary, as sketched below. In Indian language machine translation, for example, lexical errors have more impact than word order errors, because word order affects the readability of a sentence less than a wrong word does, and these lexical errors are often generated by morphology.
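Here is a minimal sketch (with made-up Hindi verb forms) of why morphology fragments the statistics, and how a lemma dictionary, the "more intelligent dictionary" above, collapses the variants:

from collections import Counter

# Hypothetical inflected forms of the Hindi verb "kha" (eat).
tokens = ["khaya", "khayi", "khaye", "khata", "khati", "khaya"]

# Surface-form counts are fragmented across the inflections...
print(Counter(tokens))

# ...while mapping each form to its lemma unifies the counts.
LEMMA = {"khaya": "kha", "khayi": "kha", "khaye": "kha",
         "khata": "kha", "khati": "kha"}
print(Counter(LEMMA.get(t, t) for t in tokens))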

The second type of method is the rule-based method, which is sometimes more appreciated for Indian languages. With rules you can take care of word order, and morphology is sometimes an added boon to a rule-based system, since writing the rules is often simpler. There is also active research on learning such rules from data automatically; these approaches are mostly used in statistical parsing, motivated by rule-based systems, where rule templates are learned from the data and given to statistical parsers. A toy example of such a rule appears below.
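As a toy example of such a rule template (my own sketch, not any real MT system), a single hand-written rule reorders English-like SVO chunks into Hindi-like verb-final SOV order:

def reorder_svo_to_sov(chunks):
    """Rule template: [Subject, Verb, Object] -> [Subject, Object, Verb]."""
    s, v, o = chunks
    return [s, o, v]

# "the cat | ate | the rat" -> "the cat | the rat | ate"
print(" | ".join(reorder_svo_to_sov(["the cat", "ate", "the rat"])))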
The second type of impact is related to decisions about data and corpora: how can this freedom of order be encoded in data so that it can be processed by machines as well as humans? For fixed word order languages, phrase structure grammar (my computer science friends may have seen it in compiler design and theory of computation) and phrase structure trees are more successful. In a compiler the syntax of commands is invariant, so phrase structure grammar works well, and the same goes for fixed order languages. But for free word order languages, dependency trees and dependency grammar work better. A dependency tree represents the dependencies between words. For example, for 'Ram is good boy.', the dependency tree relates Ram and boy to is, and good to boy, as shown below.
Modifier   Modified   Relation
Ram        is         subject
good       boy        noun-modifier
boy        is         patient

But in Indian languages, chunks act as single units rather than words, and dependencies between chunks work better than dependencies between individual words, as the sketch below shows.
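A small sketch in my own notation contrasts the two representations: word-level arcs as in the table above, and chunk-level arcs for the Hindi example, where whole chunks attach to the verb chunk and survive scrambling unchanged:

# Word-level arcs for "Ram is good boy": (modifier, modified, relation).
word_deps = [
    ("Ram", "is", "subject"),
    ("good", "boy", "noun-modifier"),
    ("boy", "is", "patient"),
]

# Chunk-level arcs for "billi_ne chuhe_ko khaya": each chunk attaches
# to the verb chunk, so reordering chunks leaves the arcs untouched.
chunk_deps = [
    ("billi_ne", "khaya", "agent"),
    ("chuhe_ko", "khaya", "patient"),
]

for deps in (word_deps, chunk_deps):
    for mod, head, rel in deps:
        print(f"{mod} --{rel}--> {head}")
    print()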

Friday, February 19, 2010

Key Phrase Extraction tools

Key phrase extraction pulls out the most significant words and phrases with respect to an application. It is most frequently used in search engines for advertising, and some analysts also treat key phrases as topics, concepts, or short summaries. A minimal sketch of the core scoring idea follows the tool list. Some of the tools for key phrase extraction are:
1) Carrot2:- A great tool for key phrase extraction. It uses two algorithms, STC and Lingo. STC is a suffix-tree-based clustering algorithm; Lingo searches for complete key phrases under additional constraints and works better than STC. Alongside STC and Lingo it also uses TF-IDF and LSA. If you have a large number of related documents, Carrot2 works great. It is also flexible about input: you can feed it documents indexed by Nabble, Solr, or Google Desktop Search, or index your own documents from XML.
2) KEA:- A standard algorithm for key phrase extraction. It can learn from an RDF dictionary (in SKOS format) containing a hierarchical taxonomy, and it offers machine learning options via Weka. The training documents can be few in number but should be large in size. It is currently available as a GATE plugin. If you don't use an RDF dictionary or large documents for training, this tool will not work well.
3) Maui:- Basically the KEA tool (mentioned above), but with options to boost the taxonomy from Wikipedia.
4) Wikifier:- Like Maui, it uses Wikipedia to boost concepts for key phrase extraction.
5) Stanford Topic Modeling Tool:- Uses LDA for learning topics. It takes input and produces output in CSV format, and it also provides machine learning options.
6) Mallet:- Similar to the Stanford tool; it is also used for learning topic words.
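As promised above, here is a minimal TF-IDF key phrase sketch in Python. It is my own illustration of the core scoring idea, not how Carrot2, KEA, or the other tools work internally:

import math
from collections import Counter

docs = [
    "the cat eats the rat",
    "the rat runs from the cat",
    "dependency grammar suits free word order languages",
]

# Tokenize and drop a tiny stopword list (toy setup).
stopwords = {"the", "from"}
tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]

# Document frequency of each word.
df = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(docs)

# Score each word by TF * IDF and keep the top two per document.
for doc in tokenized:
    tf = Counter(doc)
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    print(sorted(scores, key=scores.get, reverse=True)[:2])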

Anaphora resolution Tools

Anaphora resolution is a core linguistic problem: identifying the antecedents of pronouns and nouns. I still remember the Computational Linguistics-2 class where Dr. Laxmi madam used to teach these typical linguistic problems, sometimes with no answer as to how computer science could solve such hard linguistics. Anyway, many tools today attempt anaphora resolution, but the results are still not very promising. I will talk about some of these tools in this post.
1) GuiTAR:- First of all, sorry to music lovers, this tool is not for you guys. The tool comes in three versions. It takes Charniak's parser output in minimally associated XML format (MAS-XML) and marks up its output in XML with tags. Its generic discourse model uses a partial implementation of Mitkov's approach, one of the oldest approaches to anaphora resolution. It also uses the Vieira & Poesio definite description (DD) algorithm, which gives it a significant improvement, and it produces lexical chains as well.
2) BART:- A Johns Hopkins University tool for anaphora resolution. It comes with machine learning models and is built in Java. Like GuiTAR, BART takes its input and output in XML format. It supports alternative parsers and NER tools for pre-processing: one can use YamCha or Charniak's parser, and Carafe or Stanford NER for named entity recognition. It is similarly flexible about ML models; one can use Weka, MaxEnt, or SVM-light. I suggest this tool if you have a good amount of training corpus for the ML models.
3) MARS:- Built on Mitkov's approach, it uses certain linguistic rules. If your data is noisy, I would not suggest this tool.
4) JavaRAP:- An anaphora resolution tool based on the Lappin and Leass rules. It identifies the antecedents of third-person pronouns. Like BART and GuiTAR, it also uses Charniak's parser. A toy sketch of the general rule-based idea follows.
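To give a flavor of the rule-based idea behind these tools, here is a toy recency-plus-agreement pronoun resolver in Python. It is my own sketch with a hypothetical mini-lexicon, far simpler than what GuiTAR, BART, MARS, or JavaRAP actually do:

# Hypothetical mini-lexicon: noun -> (gender, number).
NOUNS = {"Ram": ("m", "sg"), "Sita": ("f", "sg"), "boys": ("m", "pl")}
PRONOUNS = {"he": ("m", "sg"), "she": ("f", "sg"), "they": (None, "pl")}

def resolve(tokens):
    """Link each pronoun to the most recent agreeing noun."""
    antecedents = []
    for tok in tokens:
        if tok in NOUNS:
            antecedents.append(tok)
        elif tok in PRONOUNS:
            gender, number = PRONOUNS[tok]
            for cand in reversed(antecedents):
                cg, cn = NOUNS[cand]
                if cn == number and (gender is None or cg == gender):
                    print(tok, "->", cand)
                    break

resolve("Ram met Sita and she smiled and he left".split())
# she -> Sita
# he -> Ram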

Wednesday, February 17, 2010

Natural Language Processing tools

Here I will share some open source natural language processing (NLP) tools, including information extraction and NER tools.
  1. NLP Package
  • NLTK:- A Python package that comes with many tools and resources. It is easy to use, as it is written in Python, and it ships with a book and tutorials. Its tools are mainly used for tagging, chunking, sentence splitting, tokenization, and other functions, and it also provides corpora and readers to interpret those resources. A short usage sketch follows this list.
  • GATE:- A Java-based tool mainly used for corpus creation and annotation (the GATE document), chunking, tagging, NER, and information extraction. Many packages and tools plug into GATE. It supports pattern-based grammars written on top of Java, so it can be used for languages that don't have many resources.
  • Stanford package:- The package contains a CRF-based tagger, NER, topic modeling (using LDA), and a classifier. The Stanford parser gives both dependency and phrase structure output.
  • OpenNLP toolkit:- The toolkit is written in Java. It supports chunking, tagging, and other basic NLP applications.
  • MontyLingua package:- The package is used for chunking, tagging, splitting, and tokenizing. Its authors claim that it uses common sense encoded in the form of grammar, but their common sense is hard to understand, at least for me. The package is provided in both Java and Python.
  • Mallet package:- A machine learning toolkit for classification, sequence tagging, and topic modeling. The sequence tagger can be used for any type of tagging application, and the topic modeling tool uses LDA.
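As mentioned under NLTK, here is a short usage sketch. It assumes NLTK is installed and that its tokenizer and tagger models have already been fetched via nltk.download():

import nltk

sentence = "The cat eats the rat."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # POS tagging
print(tagged)
# e.g. [('The', 'DT'), ('cat', 'NN'), ('eats', 'VBZ'), ...]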
2. POS-Tagging
  • MBT Tagger:- A memory-based tagger and tagger generator.
  • TreeTagger:- A decision-tree-based tagger from the University of Stuttgart (Helmut Schmid). It is language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. It also uses lemmas and other morphological information to disambiguate POS tags.
  • SVM Toolkit:- A tagger built on SVM-light (SVM is a machine learning algorithm). It is trained on the Wall Street Journal corpus.
  • ACOPOST:- Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU Public License.
  • MXPOST:- A Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included.
  • FnTbl:- A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
  • MuTBl:- An implementation of a Transformation-Based Learner (a la Brill) by Torbjörn Lager, usable for POS tagging and other tasks.
  • YamCha:-SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task.
  • Lingua Tagger:- Perl POS tagger.
  • TnT POS tagger:- Trainable for various languages; comes with pre-compiled English and German models. The tagger uses the Viterbi algorithm over a second-order Markov model. A minimal Viterbi sketch follows this list.
  • AMALGAM Tagger:- The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
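As promised under the TnT entry, here is a minimal Viterbi sketch for a bigram HMM tagger with made-up probabilities. TnT itself uses a trigram (second-order) model, so treat this as the simplest version of the same idea:

import math

TAGS = ["DT", "NN", "VB"]
# Toy transition and emission probabilities, invented for illustration.
TRANS = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.4,
         ("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
         ("NN", "VB"): 0.8, ("NN", "NN"): 0.2,
         ("VB", "DT"): 0.7, ("VB", "NN"): 0.3}
EMIT = {("DT", "the"): 0.9, ("NN", "cat"): 0.5, ("NN", "rat"): 0.5,
        ("VB", "eats"): 0.9}

def viterbi(words):
    # best[tag] = (log-prob of the best path ending in tag, that path)
    best = {t: (math.log(TRANS.get(("<s>", t), 1e-6))
                + math.log(EMIT.get((t, words[0]), 1e-6)), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            e = math.log(EMIT.get((t, w), 1e-6))
            new[t] = max(
                (best[p][0] + math.log(TRANS.get((p, t), 1e-6)) + e,
                 best[p][1] + [t])
                for p in TAGS)
        best = new
    return max(best.values())[1]

print(viterbi("the cat eats the rat".split()))
# ['DT', 'NN', 'VB', 'DT', 'NN']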
3. Named Entity Recognition and Information Extraction:-
  • Stanford NER:- A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
  • Lingpipe:-Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
  • ANNIE:- A grammar-based NER and information extraction tool. It uses pattern-based grammars written in JAPE (Java Annotation Patterns Engine).
  • OpenCalais:- An automated information extraction web service from Thomson Reuters (free limited version). A quick generic NER sketch follows.
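To close, here is a quick generic NER sketch using NLTK's built-in chunker. It is not a demo of any specific tool above, and it assumes the relevant NLTK models have been downloaded:

import nltk

sentence = "Jenny Finkel works at Stanford University in California."
tokens = nltk.word_tokenize(sentence)
tree = nltk.ne_chunk(nltk.pos_tag(tokens))  # tree with NE subtrees
for subtree in tree:
    if hasattr(subtree, "label"):           # named-entity subtrees only
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)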