Type

Text

Type

Thesis

Advisor

Skiena, Steven | Akoglu, Leman | Choi, Yejin | Ramakrishnan, I.V.

Date

2014-12-01

Keywords

Computer science | big data, complex networks, machine learning, named entity recognition, natural language processing

Department

Department of Computer Science.

Language

en_US

Source

This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.

Identifier

http://hdl.handle.net/11401/77292

Publisher

The Graduate School, Stony Brook University: Stony Brook, NY.

Format

application/pdf

Abstract

With the massive amounts of unannotated text available from myriad sources, learning representations useful for natural language processing(NLP) tasks is an increasingly popular research area. Using deep learning techniques, we learn distributed representations for words (word embeddings) using Wikipedia as the source of text for 40 languages. These distributed representations represent each word as a point in feature space and capture useful semantic and syntactic properties of words amd have been shown to be useful in NLP Tasks like Part of Speech Tagging(POS) etc. We have built 2 classes of word embeddings namely Polyglot and Skipgram for these languages. We build a named entity recognition (NER) system that supports 40 languages using the word embeddings we have generated as features and seek to use freely available Wikipedia text as training data. This involves training language models to obtain the word embeddings, understanding the properties of the learnt word embeddings and culminates in learning models for named entity classification. We also present a novel technique for evaluating our performance on the myriad languages for which no gold data set for testing exists. Our results demonstrate that word embeddings exhibit nice community structure and can be used effectively for NER with no explicit hand crafted feature engineering and perform competitively with existing baselines when coupled with simple language agnostic techniques. | 54 pages

Recommended Citation

Kulkarni, Vivek V., "Multilingual Named Entity Recognition" (2014). Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions). 3113.
https://commons.library.stonybrook.edu/stony-brook-theses-and-dissertations-collection/3113

Download

COinS

Academic Commons

Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions)

Multilingual Named Entity Recognition

Type

Type

Advisor

Date

Keywords

Department

Language

Source

Identifier

Publisher

Format

Abstract

Recommended Citation

Browse

Search

Author Corner

Academic Commons

Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions)

Multilingual Named Entity Recognition

Authors

Type

Type

Advisor

Date

Keywords

Department

Language

Source

Identifier

Publisher

Format

Abstract

Recommended Citation

Share

Browse

Search

Author Corner