Each week, we select a recently published Open Access article to feature. This week’s article comes from Statistical Analysis and Data Mining and proposes an auxiliary POS tagger to accommodate the increasing usage of slang in Web 2.0.
The article’s abstract is given below, with the full article available to read here.
The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limit the performance of traditional natural language processing models. The state-of-the-art Part-of-Speech (POS) taggers are often unable to assign a meaningful POS tag to all the words in a Web 2.0 text. To solve this limitation, we are proposing an auxiliary POS tagger that assigns the POS tag to a given token based on the information deriving from a sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is its ability to overcome the need of tokens’ information since it only relies on the sequences of existing POS tags. This tagger is called auxiliary because it requires an initial POS tagging procedure that might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed by Twitter messages, and on a corpus of manually labeled Web 2.0 data.More Details