Authors

Rami Al-Rfou

Type

Text | Dissertation

Advisor

Akoglu, Leman | Skiena, Steven | Choi, Yejin | Bottou, Leon.

Date

2015-05-01

Keywords

Machine Learning, Multilingual, Natural Language Processing | Computer science

Department

Department of Computer Science.

Language

en_US

Source

This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.

Identifier

http://hdl.handle.net/11401/77810

Publisher

The Graduate School, Stony Brook University: Stony Brook, NY.

Format

application/pdf

Abstract

We built a Natural Language Processing (NLP) pipeline for each of Wikipedia's languages through semi-supervised learning. Each pipeline consists of a language-specific tokenizer, sentence segmenter, morphological analyzer, part-of-speech (POS) tagger, sentiment analyzer, and Named Entity Recognition (NER) annotator. We automatically learn features (embeddings) for each word in each language using continuous-space language models, which capture syntactic and semantic characteristics of the language. We use these embeddings as features to train part-of-speech taggers with the help of human-annotated datasets. To cover a larger number of languages, we combine these features with automatically extracted annotations from Wikipedia to build a semi-supervised NER system. With a strong prior (word embeddings) and simple statistical methods, we overcome the noise and bias introduced by the Wikipedia style guidelines. To demonstrate the quality of our work, we propose new evaluation metrics that accommodate the large scale of languages we target. Furthermore, all the pipelines are available for the community to use and study through the software package polyglot (available at http://polyglot-nlp.com). | 108 pages
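The core idea in the abstract, using word embeddings as features so that a tagger trained on a small annotated set generalizes to unseen words, can be illustrated with a toy sketch. This is not the dissertation's actual pipeline: the vectors, words, and tags below are made up, and real embeddings would come from a continuous-space language model trained on Wikipedia text.

```python
# Toy sketch (illustrative only, not the dissertation's system):
# tag an unseen word by the POS of its nearest labeled neighbor
# in a hypothetical embedding space.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-dimensional word embeddings.
embeddings = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.8, 0.2, 0.1],
    "run":  [0.1, 0.9, 0.2],
    "jump": [0.0, 0.8, 0.3],
}

# A tiny human-annotated set: word -> POS tag.
labeled = {"cat": "NOUN", "run": "VERB"}

def tag(word):
    """Assign the tag of the most similar labeled word."""
    vec = embeddings[word]
    best = max(labeled, key=lambda w: cosine(vec, embeddings[w]))
    return labeled[best]

print(tag("dog"))   # nearest labeled neighbor is "cat" -> NOUN
print(tag("jump"))  # nearest labeled neighbor is "run" -> VERB
```

Because embeddings place syntactically and semantically similar words close together, even this naive nearest-neighbor scheme tags the unlabeled words correctly; the dissertation's taggers use the embeddings as input features to trained statistical models rather than raw similarity lookups.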
