
Wednesday, March 31, 2010

Natural Language Generation and Information Extraction: Sisters by chance

Many natural language processing applications use Natural Language Generation (NLG) and Information Extraction (IE) together.
  1. Natural language generation produces a natural language sentence from a given bag of words.
  2. Information extraction fills a predefined template with information from a given text. Many people confuse data extraction with information extraction. A typical information extraction problem is extracting information from news. For example, one may want to fill a scorecard automatically from cricket commentary text. In this problem, the scorecard is the information extraction template: it has a definite structure, with fields such as batsman name, runs scored, number of sixes, and number of fours.
  • Information Extraction as a tagging problem:- Information extraction can be seen as a tagging problem where the tags are the fields of the template. The fields mostly consist of named entities, which is why many information extraction systems (such as Stanford NER and GATE ANNIE) are extensions of named entity recognition systems.
  • Information Extraction as relation extraction:- Information extraction is not simply named entity recognition; it also extracts the relations that hold between entities. A relation can be implicit or explicit. Anaphora resolution falls in this category, since it captures the relation between a referring expression and its entity. Relations can be ontological or domain specific.
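The scorecard example above can be sketched as a tiny template-filling step. This is a hedged illustration only: the regex patterns and field names below are invented for this post, not taken from any real IE system.

```python
import re

# Toy template patterns (hypothetical, not from a real IE system):
# each field of the scorecard template maps to a pattern that
# captures its value from a commentary-style sentence.
TEMPLATE_PATTERNS = {
    "batsman": r"^(\w+)",
    "runs": r"(\d+) runs",
    "fours": r"(\d+) fours?",
    "sixes": r"(\d+) sixes?",
}

def fill_scorecard(text):
    """Fill the scorecard template from one commentary sentence."""
    card = {}
    for field, pattern in TEMPLATE_PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            card[field] = m.group(1)
    return card

print(fill_scorecard("Sachin scored 98 runs with 12 fours and 2 sixes"))
```

A real system would of course use a trained named entity tagger rather than regexes, but the shape of the output, a filled template, is the same.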
 NLG and IE are used together in many natural language processing applications. Some of the applications are
  1. Abstractive summarization:- Summarization is basically of two types, extractive summarization and abstractive summarization. Extractive summarization selects the important sentences or paragraphs from the text, whereas abstractive summarization generates the summary: it uses information extraction to fill a template and natural language generation to produce the summary text.
  2. Question answering systems:- Many question answering systems use IE and NLG together. A question answering system can have the following modules:
  • Question understanding module
  • Information retrieval system to collect relevant text
  • Information extraction system to extract candidate answers
  • Answer generation from the template
For example, common question types are who, whom, and when; the system extracts the relevant information and generates the answer.
      3.   Spoken dialog managers:- You may have encountered spoken dialog managers in many customer care services. They understand your question and answer interactively. A spoken dialog manager can be divided into three parts: 1) automatic speech recognition, 2) an interactive question answering system, and 3) speech synthesis.
Since it has an interactive question answering system in its pipeline, it uses information extraction and natural language generation together.
    4. Machine Translation:- Machine translation is one application where natural language generation can be used on its own. Machine translation has two basic parts, structural translation and lexical translation. The structural translation module converts source language word order into target language word order. Most machine translation systems apply the lexical translation module after structural translation, using bilingual parameters and rules for the structural step. However, if one performs structural translation after lexical translation, then natural language generation can be used for the structural step.
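The question answering pipeline in item 2 can be sketched end to end. All four function names and their logic below are hypothetical stand-ins for real modules; the "IE" step here just grabs capitalized tokens as person candidates, and the NLG step fills a one-slot answer template.

```python
# A hedged sketch of the four QA modules (hypothetical stand-ins).
def understand_question(q):
    # question understanding: classify by wh-word to pick a template field
    wh = q.split()[0].lower()
    return {"who": "person", "when": "date", "where": "place"}.get(wh, "other")

def retrieve(q, corpus):
    # naive retrieval: keep sentences sharing a word with the question
    qwords = set(q.lower().strip("?").split())
    return [s for s in corpus if qwords & set(s.lower().split())]

def extract_candidates(field, sentences):
    # stand-in for a real IE module: capitalized tokens as "person" answers
    if field != "person":
        return []
    return [w for s in sentences
            for w in s.split() if w.istitle() and w != s.split()[0]]

def generate_answer(q, candidates):
    # template-based NLG: fill the answer slot
    return f"The answer is {candidates[0]}." if candidates else "No answer found."

corpus = ["The telephone was invented by Alexander Graham Bell."]
q = "Who invented the telephone?"
field = understand_question(q)
print(generate_answer(q, extract_candidates(field, retrieve(q, corpus))))
```

Each toy function maps to one bullet in the module list above; in practice each would be a full subsystem.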

    Monday, March 29, 2010

    Introduction

    I am Vineet Yadav from IIIT Hyderabad, currently pursuing a Master's degree (M.Tech in Computational Linguistics) in NLP. I worked at the Language Technologies Research Centre, which is one of the largest NLP research groups in India. This is my personal and technical blog. I am currently working at Serendio, a text mining company.

    Saturday, March 27, 2010

    Machine Translation: Divide and Rule

    Divide and rule is not a new paradigm. The technique is popularly associated with the British, who used it to rule India and much of the world in Queen Victoria's time. Later, the technique appeared in computer science as divide and conquer: the problem is first divided into smaller tasks, each task is processed, and the results are combined. The same approach is used in natural language processing: stories are broken into paragraphs, paragraphs into sentences, sentences into clauses, clauses into phrases, and phrases into words. Each sub-part acts as one unit, and one can divide text into these units and combine the results. There are some restrictions, though. For example, a context free grammar can generate sentences of unbounded length, but this capacity is never exercised in practice, since humans cannot understand very long sentences. When understanding text, humans also break it down into smaller units.
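The breaking-down described above can be sketched in a few lines. The splitting below is deliberately naive (blank lines for paragraphs, punctuation for sentences), just to show the divide, process, and combine steps on each unit.

```python
import re

def split_units(text):
    """Divide: text -> paragraphs -> sentences -> words (naive splitting)."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [
        [re.findall(r"\w+", s)
         for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
        for p in paragraphs
    ]

def process_and_combine(text, word_fn):
    """Conquer: apply word_fn to every word, then reassemble bottom-up."""
    units = split_units(text)
    return "\n\n".join(
        " ".join(" ".join(word_fn(w) for w in sent) + "." for sent in para)
        for para in units
    )

print(process_and_combine("Cats sleep. Dogs bark.", str.upper))
```

Any per-word operation (here just uppercasing, but in MT it would be lexical translation) can be slotted in as `word_fn` while the surrounding units are split and recombined the same way.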
    In machine translation too, researchers divide the text into different units, process each unit, and combine the translated results. I will talk about each approach one by one.

    1. Statistical Machine Translation (IBM models):- Statistical machine translation views a source language sentence as a sequence of words. It uses the EM (Expectation Maximization) algorithm over a word-aligned corpus to learn the parameters and initialize the model. The system penalizes reordering, so it does not work well for language pairs belonging to distant families.
    2. Phrase Based Machine Translation:- Phrase based machine translation (Koehn et al., 2003; Och et al., 1999) works at the phrase level. It learns pairs of source language and target language phrases, and it is also capable of translating non-conventional phrases.
    3. Hierarchical Phrase Based Machine Translation:- The hierarchical phrase based system (Chiang, 2005) is the same as the above, except that the phrases it learns are hierarchical.
    4. Syntax Based Machine Translation:- The syntax based system (Yamada and Knight, 2001) is basically a parse tree to string translation system. The sentence is parsed using the CKY algorithm, and then reordering, insertion, and deletion are performed at the node level. Each node of the phrase structure parse tree can be viewed as a hierarchical phrase.
    5. Dependency Treelet Translation:- Dependency treelet translation (Quirk et al., 2005) is Microsoft's approach to machine translation. It is called a treelet system because, in contrast to standard phrase based MT systems that learn phrase pairs, Quirk et al. learn treelet pairs. They use a source dependency parser and word-aligned source and target sentences, project the source dependency structure onto the target side, and learn treelet translation pairs between source and target. They use the maximum likelihood method for extracting treelet translations. The advantage is that they can also learn non-contiguous phrases.
    6. Chunk Based Machine Translation:- Chunk based machine translation (Watanabe, 2003) performs translation over chunked sentences. Chunks and phrases are similar, except that a chunk is not recursive like a phrase: a chunk does not contain another chunk inside it.

    Thursday, March 25, 2010

    The locus of word sense disambiguation

    Word sense disambiguation plays an important part in machine translation, and it matters more for language pairs that belong to different language families.
    For example, word sense disambiguation is more important in English-Hindi MT than in Hindi-Punjabi MT.
    Social and historical aspects of language also play a role: people tend to use words in the same senses in languages from the same region or the same language family.
    In Hindi-Urdu, for example, the word sense disambiguation problem hardly exists, as you can often get the Urdu sentence from the Hindi string by transliteration alone.
    Ambiguity is present in language in two forms.
    1. Lexical Ambiguity:- A word can have more than one sense; for example, 'bank' can be a river bank or a financial bank, depending on the context in which the word is used. But note that 'bank' is a noun in both senses, so word sense disambiguation is more effective when it resolves senses within the same category; in other words, run WSD on a POS-tagged sentence. Word sense disambiguation is used mainly in two kinds of problems: machine translation, and topic or concept expansion. Machine translation pipelines already run POS tagging, chunking, and other processes before the WSD module, so this is handled in the ideal way I described. The second kind of problem covers topic expansion, summarization, and key phrase extraction, which use techniques such as Latent Dirichlet Allocation, Latent Semantic Analysis, and Latent Semantic Indexing, along with lexical chains for context expansion. These approaches use only lexical information for disambiguation. I will talk about WSD in machine translation.
    2. Structural Ambiguity:- Structural ambiguity exists because of the nature of language: the sentence structure itself is responsible for the ambiguity. For example, in 'Ram saw the girl with the telescope', it is possible that Ram used the telescope to see the girl, and it is also possible that the girl had the telescope. Whenever structural ambiguity exists in a sentence, there is more than one possible parse tree.
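The two readings of such a sentence correspond to two parse trees, and they can be counted mechanically. Below is a small CKY-style sketch over a toy grammar (determiners are dropped to keep the grammar in Chomsky normal form); the grammar itself is invented for illustration.

```python
from collections import defaultdict

# Toy CNF grammar for the PP-attachment example (hypothetical).
UNARY = {                       # terminal word -> nonterminals
    "Ram": {"NP"}, "girl": {"NP"}, "telescope": {"NP"},
    "saw": {"V"}, "with": {"P"},
}
BINARY = {                      # (B, C) -> set of A, for rules A -> B C
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("VP", "PP"): {"VP"},       # VP attachment: Ram used the telescope
    ("NP", "PP"): {"NP"},       # NP attachment: the girl had the telescope
    ("P", "NP"): {"PP"},
}

def count_parses(words, start="S"):
    """CKY chart that counts parse trees instead of just recognizing."""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a in UNARY.get(w, ()):
            chart[i][i + 1][a] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, nb in chart[i][k].items():
                    for c, nc in chart[k][j].items():
                        for a in BINARY.get((b, c), ()):
                            chart[i][j][a] += nb * nc
    return chart[0][n][start]

print(count_parses("Ram saw girl with telescope".split()))  # 2
```

The two counted trees are exactly the two readings described above: one attaches the PP to the verb phrase, the other to the noun phrase.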
    Now I will talk about the locus of the ambiguity problem in machine translation. The solution to the word sense disambiguation problem depends on where the ambiguity is present in the language pair.
    1. Source language or target language:- Is word sense disambiguation a property of the source language or of the target language? Ambiguity can be present in both. If it belongs to the source language, then word sense disambiguation is done with a monolingual source language dictionary before lexical translation; otherwise it should be done after lexical translation, using a bilingual dictionary. It also depends on the language pair: if the languages are closely related, you will find only one type of word sense disambiguation problem. One can mark source senses against target senses in parallel sentences to see how ambiguity varies between the language pair and what the default sense is.
    2. Writer, reader, or text:- Ambiguity can lie in the writer's usage, since every writer has a different controlled vocabulary. It can also lie with the reader who is doing the understanding: perhaps the reader is not paying full attention and misses something, or has less knowledge, or a different view, of the topic. In the old days, researchers assumed that meaning resides in the text alone, but that is not the whole story; meaning can vary with the writer and the reader. If the word sense disambiguation problem is writer specific, use monolingual writer text or a writer-specific dictionary, for example through an active learning algorithm: active learning is a semi-supervised machine learning approach that lets you combine plain text with a tagged corpus. If the problem is reader specific, you can use the reader's favorite topics, bookmarked texts, or reading logs. A similar problem is personalized search, where user logs are used.
    3. Media:- Media also affects people, since people generally follow it and sometimes start using non-conventional meanings. Media can be print or cinema. I will give examples of movies that gave non-conventional senses to words. Two years back a movie came out, Dostana, which means friendship. The movie's plot centers on a lie that two of the lead actors are homosexual, and people started commenting 'Dostana' on each other's pictures on social networks, although Dostana does not mean a homosexual relationship. Later another movie, Love Aaj Kal, used 'mango people' as a fancy phrase for 'ordinary people': in the English-Hindi pair, mango => AAM (noun) and ordinary => AAM (adjective), so some people started calling ordinary people mango people. The latest movie, the biggest comedy hit in Indian cinema, is 3 Idiots. The film is not about mad or crazy people but about people who follow their hearts; it changed the definition of 'idiots', and people liked being called idiots. Social networking websites are full of such non-conventional senses.

    Sunday, March 21, 2010

    Singularity in Natural language Processing

    Singularity is a very old phenomenon in science and physics. For example, electromagnetism unified electricity and magnetism. Similarly, Albert Einstein later spent years searching for a unified field theory that would combine the fundamental forces (gravitational, electromagnetic, weak, and strong interactions) and explain them all with one formula, a goal that modern string theory still pursues. Such unification brings simplicity and independence to a system. Physics and natural language are not so different, as researchers view both of them in terms of mathematics: you can treat collocation units and word groups as mass, the attractions between words as the semantics that revolves around them, and physical distance as word or sentence distance, or even syntax.
    Coming to the point: we have any number of natural language tools, each extracting or tagging a different type of information. How intelligently can we use as many of them as possible together? The whole system architecture sometimes depends on the order of execution and on which resources and tools one is going to use. Can't we use them in parallel? Can't we design the perfect data structure that holds all types of information and supports fundamental operations like update, insert, and delete? A data structure that can hold information at the word level, word group level, parse tree level, and sentence level; that can hold output from basic NLP tools like taggers and morph analyzers up to complex ones like information extraction; and onto which we can map different resources like WordNet, FrameNet, PropBank, and others.
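A first sketch of such a data structure might use standoff annotations: each layer stores (start, end, value) spans over one shared text, so word-level tags and sentence-level spans coexist and support insert, update, and delete. The layer names and API below are hypothetical.

```python
from collections import defaultdict

class AnnotationStore:
    """Standoff annotation store: many layers over one shared text."""

    def __init__(self, text):
        self.text = text
        self.layers = defaultdict(list)   # layer name -> [(start, end, value)]

    def insert(self, layer, start, end, value):
        self.layers[layer].append((start, end, value))

    def delete(self, layer, start, end):
        self.layers[layer] = [a for a in self.layers[layer]
                              if (a[0], a[1]) != (start, end)]

    def update(self, layer, start, end, value):
        self.delete(layer, start, end)
        self.insert(layer, start, end, value)

    def at(self, pos):
        # all annotations, from any layer, covering character position pos
        return {layer: [v for s, e, v in anns if s <= pos < e]
                for layer, anns in self.layers.items()}

store = AnnotationStore("Dogs bark")
store.insert("pos", 0, 4, "NNS")    # word-level tag from a tagger
store.insert("chunk", 0, 9, "S")    # sentence-level span
print(store.at(2))                  # {'pos': ['NNS'], 'chunk': ['S']}
```

Because tools only add spans and never rewrite the text, taggers, chunkers, and extractors could in principle run in parallel and merge their layers afterwards, which is exactly the independence the paragraph above asks for.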

    Friday, March 19, 2010

    The Context Free Grammar and Machine Translation


    My friends who worked with me might know all these fancy terms like machine translation; those who don't can refer to my previous posts. Here I will talk about synchronous context free grammar, chart parsing, and machine translation. This is an extension of my previous post 'syntax in SMT'. Some of my computer science friends may also know about context free grammars and chart parsing; some may have seen chart parsing in compiler design and context free grammars in theory of computation. A context free grammar can be looked at as a set of simple conversion rules.
    A context free grammar for infix expressions looks like
    A->A+A|A-A|A*A|A/A|a|b|c


    where A is non-terminal and a,b,c,+,-,*,/ are terminal symbols.
    The parse tree of the expression 'a-b+c' under this context free grammar is shown below.

    The process of obtaining a parse tree from the expression ('a-b+c') is known as parsing. There are many parsing techniques, but they divide into two basic approaches:
    1. Top-down parsing
    2. Bottom-up parsing
    But how is a context free grammar useful for machine translation? The grammar above is for infix expressions. Suppose I want to convert an infix string to a postfix string using context free grammars. The postfix grammar will look like
    A->AA+|AA-|AA*|AA/|a|b|c
    If we combine the infix and postfix grammars so that they can be used to translate infix to postfix, the result is known as a synchronous context free grammar and looks like
    A->A+A;AA+|A-A;AA-|A*A;AA*|A/A;AA/|a;a|b;b|c;c
    where the symbols before the ';' represent the infix side and the symbols after the ';' represent the postfix side.
    So the translation process is shown below.

    In the translation, the rules are used only for reordering: first A+A is converted to AA+ using the rule A->A+A;AA+, and then the second rule, A->A-A;AA-, is applied. Machine translation also supports insertion and deletion, which are done at the node level only.
    We get the target postfix string ab-c+.
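The synchronous rules above can be sketched as a one-pass translator. This sketch assumes the toy grammar's flat, left-associative reading (no operator precedence), matching the derivation in this post.

```python
def infix_to_postfix(expr):
    """Translate a single-letter infix expression to postfix, following
    the synchronous rule A -> A op A ; A A op (left-associative, no
    precedence, as in the toy grammar)."""
    tokens = list(expr)
    out = [tokens[0]]            # first operand: rule A -> a;a etc.
    i = 1
    while i < len(tokens):
        op, operand = tokens[i], tokens[i + 1]
        # the rule A -> A op A ; A A op maps "x op y" to "x y op"
        out.append(operand)
        out.append(op)
        i += 2
    return "".join(out)

print(infix_to_postfix("a-b+c"))  # ab-c+
```

For 'a-b+c' this reproduces the derivation above: the source side is consumed left to right while each rule's target side emits the two operands before their operator, giving ab-c+.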

    Friday, March 12, 2010

    syntax in SMT

    Statistical machine translation is widely used for European languages and Chinese. The idea behind it is pretty simple: how would a human decode or translate an unknown language? If you give a large parallel word-aligned corpus, or a huge dictionary, to a person who is not familiar with the language, how will that person understand or learn it? Similarly, you can give a parallel corpus to a machine, and the machine will learn the word alignments and process it. The machine can read and learn from a huge parallel corpus that would take a human years to digest. But the question arises: who does it better, human or machine, and whether a human learns only a lexical mapping. I still remember the last class of my natural language processing course, when the head of my department, Prof. Rajeev Sangal, asked the students for their last doubts. One of my dear friends, GVS Reddy, asked: why do we build these parse trees? Why does a machine need parse trees? Is this really how humans learn language; do humans also build parse trees in their minds? At the time, Prof. Rajeev Sangal answered in favor of the question. Now I know there is a whole branch of cognitive parsing that deals with this area, and one of my friends, Phani Gadde, is working in it.
    Machine translation researchers realized the same thing, and a new branch of statistical machine translation emerged, known as syntax based machine translation. Just as a human learns a bilingual grammar along with word mappings, syntax based machine translation uses a bilingual synchronous grammar together with word mappings. Most syntax based systems use a synchronous context free grammar. There are different variations of syntax based machine translation: 1) string to parse tree, 2) parse tree to parse tree, and 3) parse tree to string. Most syntax based systems perform reordering, insertion, and deletion of words on the parse tree. The benefit of syntax based machine translation is that it supports long distance reordering: most statistical machine translation systems penalize long distance reordering, but in syntax based SMT reordering is done at the node level of the parse tree, so long distance reordering is supported.
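The node-level reordering idea can be sketched on a toy parse tree. The rule table below is invented for illustration: it swaps the children of VP nodes, turning an SVO sentence into SOV order in a single step, which is exactly the long distance move a word-level system would penalize.

```python
def reorder(tree, rules):
    """Apply node-level reordering rules to a parse tree.
    A tree is (label, children); a leaf is a plain word string."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [reorder(c, rules) for c in children]
    if label in rules:
        children = rules[label](children)
    return (label, children)

def leaves(tree):
    """Read the surface word order off the (reordered) tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in leaves(c)]

# Hypothetical rule: VP -> V NP becomes VP -> NP V (SVO to SOV),
# one tree operation regardless of how long the NP is.
rules = {"VP": lambda cs: cs[::-1]}

src = ("S", [("NP", ["Ram"]),
             ("VP", [("V", ["eats"]), ("NP", ["mangoes"])])])
print(" ".join(leaves(reorder(src, rules))))  # Ram mangoes eats
```

Because the swap happens at the VP node, the verb jumps over its entire object in one operation, however many words the object contains; a word-sequence model would have to pay a distortion penalty for each position moved.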