Saturday, March 27, 2010

Machine Translation: Divide and Rule

Divide and rule is not a new paradigm. The strategy is often associated with British rule in India during the Victorian era, although the idea itself is older. Computer science later adopted the same technique under the name divide and conquer: a problem is first divided into smaller tasks, each task is processed separately, and the results are combined. Natural language processing uses the same approach, breaking stories down into paragraphs, paragraphs into sentences, sentences into clauses, clauses into phrases, and phrases into words. Each sub-part acts as one unit, so one can divide a text into these units, process them, and combine the results. There are some restrictions, too. For example, a context-free grammar can generate arbitrarily long sentences, but this capability is rarely used in practice, since humans cannot understand very long sentences. When trying to understand a text, humans also break it down into smaller units.
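
To make the divide, process and combine idea concrete, here is a minimal Python sketch. The splitter functions and the `process` helper are hypothetical names chosen for illustration, and the splitting rules are deliberately naive (blank lines for paragraphs, end punctuation for sentences); a real system would use proper tokenizers and would also handle clauses and phrases.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_words(sentence):
    # Naive rule: a word is a run of letters, digits or underscores.
    return re.findall(r"\w+", sentence)

def process(text, splitters, leaf_fn, combine_fn):
    """Divide text with the first splitter, recurse on each unit, combine results."""
    if not splitters:
        return leaf_fn(text)          # smallest unit: process it directly
    first, *rest = splitters
    partial = [process(unit, rest, leaf_fn, combine_fn) for unit in first(text)]
    return combine_fn(partial)        # merge the results of the sub-units

story = "Divide the text. Process each unit.\n\nCombine the results."

# Example task: count the words by counting 1 per word and summing upwards.
total_words = process(
    story,
    [split_paragraphs, split_sentences, split_words],
    leaf_fn=lambda word: 1,
    combine_fn=sum,
)
print(total_words)
```

The same skeleton would work for other tasks by swapping `leaf_fn` and `combine_fn`, for example looking each word up in a dictionary and joining the results.
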
In machine translation, researchers likewise divide the text into different units, process each unit, and combine the translated results. I will talk about each of these approaches one by one.

  1. Statistical Machine Translation (IBM models):- Statistical Machine Translation treats a source-language sentence as a sequence of words. It uses the EM (Expectation Maximization) algorithm over a sentence-aligned parallel corpus to learn word alignments and the model parameters. These models penalize reordering, so they do not work well for language pairs from distant families.
  2. Phrase Based Machine Translation:- The phrase based Machine Translation system (Koehn et al., 2003; Och et al., 1999) works at the phrase level. It learns pairs of source-language and target-language phrases. It can also translate non-conventional phrases, that is, word sequences that need not be linguistic constituents.
  3. Hierarchical phrase based Machine Translation System:- The hierarchical phrase based Machine Translation system (Chiang, 2005) is the same as the system above, except that the phrases it learns are hierarchical: a phrase can contain other phrases as sub-parts.
  4. Syntax based Machine Translation system:- The syntax based Machine Translation system (Yamada and Knight, 2001) is basically a parse-tree-to-string translation system. The sentence is parsed using the CKY algorithm, and then reordering, insertion and deletion are done at the node level. Each node of the phrase structure parse tree can be viewed as a hierarchical phrase.
  5. Dependency Treelet Translation:- Dependency Treelet Translation (Quirk et al., 2005) is Microsoft's approach to Machine Translation. It is called a dependency treelet translation system because, in contrast to a standard phrase based MT system that learns phrase pairs, it learns treelet pairs. It uses a source dependency parser together with word-aligned source and target sentences. The source dependency structure is projected onto the target sentence, and treelet translation pairs between source and target are learned from the result. Maximum likelihood estimation is used when extracting the treelet translations. The advantage is that the system can also learn non-contiguous phrases.
  6. Chunk based Machine Translation:- Chunk based Machine Translation (Watanabe, 2003) performs machine translation over chunked sentences. Chunks and phrases are similar, except that chunks are not recursive the way phrases are. In other words, a chunk does not contain another chunk inside it.
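
As a rough illustration of the EM training mentioned in item 1, here is a minimal Python sketch in the style of IBM Model 1. It is a simplified sketch, not the full IBM model family: it assumes a tiny made-up sentence-aligned corpus, leaves out the NULL word and any reordering model, and estimates only word translation probabilities t(target word | source word). The function name `train_model1` and the toy corpus are my own assumptions.

```python
from collections import defaultdict

def train_model1(sentence_pairs, iterations=20):
    """Estimate t[(target_word, source_word)] with EM on sentence pairs."""
    target_vocab = {tw for _, tgt in sentence_pairs for tw in tgt}

    # Initialization: uniform probability for every co-occurring word pair.
    t = {}
    for src, tgt in sentence_pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of links per source word

        # E-step: spread each target word over the source words of its sentence.
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: renormalize the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]

    return t

# Toy sentence-aligned corpus (hypothetical data for illustration only).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

t = train_model1(corpus)
for (tw, sw), p in sorted(t.items(), key=lambda item: (item[0][1], -item[1])):
    print(f"t({tw} | {sw}) = {p:.3f}")
```

On this toy corpus the probability mass should move towards the intuitive pairs (for example "house" for "Haus"), even though no word-level alignment was given as input. The word links are learned from sentence pairs alone.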
