Thursday, March 4, 2010

From Stop-words to Grammatical words

Stop words are words that occur very frequently in documents and carry little meaning of their own. Function words are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence, or specify the attitude or mood of the speaker. This definition of function words is taken from Wikipedia. Stop words are similar to function words: linguists call them function words or grammatical words, while information retrieval people call them stop words. Linguists consider them important participants in a sentence, but people working in information retrieval and access usually discard them, since they carry little information or lexical meaning. They are also known as closed-class words, since new function words are rarely added to a language. Because Indian languages are morphologically rich, function words matter more in them than they do in English. In Indian languages, function words should be treated as grammatical words rather than as stop words.
Function words fall into the following categories.
1) Articles
2) Pronouns
3) Adpositions
4) Conjunctions
5) Auxiliary verbs
6) Pro-sentences
I would like to discuss a few myths and facts related to stop words.
1) Are stop words language specific:- Stop word lists are generally available for English and related European languages. So can we translate a stop word list into a language that does not have one? I think we can for language pairs that are closely related, but not for all languages. For example, Indian languages are morphologically rich, and in some of them stop words act like inflections and become part of other words. Some categories of stop words do not exist in some languages, and some languages may have categories of their own. For example, Indian languages have no articles; they use definite pronouns and numerals in their place. Still, most stop words are shared across languages, so a stop word list can largely be translated from one language to another.
2) Are stop words domain specific:- Some researchers confuse domain-specific keywords with stop words. Domain-specific keywords are words that occur often in domain-specific text but only negligibly in general text. Stop words are words that occur often in all types of documents in a language. So in my view stop words are domain independent. If you do use domain-dependent stop words, make sure you are not losing information: domain-specific keywords are important for the domain, and discarding them may lose some of that information.
3) Are stop words source-type dependent:- In my view, the source can be taken as either the writer of the corpus or the type of corpus. The type of corpus can be, for example, email or news. Sometimes stop words are source-type dependent, since different source types sometimes have different vocabularies. For example, chat and email vocabulary differs from the vocabulary of normal text.
4) Are stop words unambiguous:- This may be true for English, but in morphologically rich languages such as the Indian languages, stop words are generally ambiguous. Take postpositions as an example. Some postpositions have more than four or five senses, and on average a postposition has more than one sense.
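The distinction drawn in point 2, between words that are frequent everywhere and words that are frequent only in one domain, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the toy corpora, and the relative-frequency threshold are all hypothetical, and a real study would need much larger corpora and a carefully chosen cut-off.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def split_frequent_words(corpora, threshold=0.1):
    """Separate words frequent in every corpus from words frequent in only some.

    corpora: dict mapping a domain name to a list of tokens.
    threshold: assumed cut-off on relative frequency (not from the post).
    """
    freqs = {domain: relative_frequencies(tokens)
             for domain, tokens in corpora.items()}
    vocabulary = set().union(*(f.keys() for f in freqs.values()))

    stop_candidates = set()
    domain_keywords = {}
    for word in vocabulary:
        # Domains in which this word reaches the frequency threshold.
        frequent_in = [d for d, f in freqs.items()
                       if f.get(word, 0.0) >= threshold]
        if len(frequent_in) == len(freqs):
            stop_candidates.add(word)          # frequent in all domains
        elif frequent_in:
            domain_keywords[word] = sorted(frequent_in)  # frequent in some
    return stop_candidates, domain_keywords


# Toy corpora, invented for illustration only.
corpora = {
    "medical": "the patient has a fever and the doctor gave the patient a drug".split(),
    "sports": "the team won a match and the captain scored a goal in the match".split(),
}
stop_candidates, domain_keywords = split_frequent_words(corpora, threshold=0.1)
print(sorted(stop_candidates))
print(domain_keywords)
```

On this toy data, "the" and "a" come out as stop word candidates, while "patient" and "match" are flagged as frequent in a single domain, which is the kind of word I would keep rather than discard.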

2 comments:

  1. I find your blogging ideas and posts very interesting. Thanks for taking the time to explain these things.
