Type
Text
Type
Thesis
Advisor
Skiena, Steven | Akoglu, Leman | Choi, Yejin | Ramakrishnan, I.V.
Date
2014-12-01
Keywords
Computer science | big data, complex networks, machine learning, named entity recognition, natural language processing
Department
Department of Computer Science.
Language
en_US
Source
This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.
Identifier
http://hdl.handle.net/11401/77292
Publisher
The Graduate School, Stony Brook University: Stony Brook, NY.
Format
application/pdf
Abstract
With massive amounts of unannotated text available from myriad sources, learning representations useful for natural language processing (NLP) tasks is an increasingly popular research area. Using deep learning techniques, we learn distributed representations of words (word embeddings) for 40 languages, using Wikipedia as the source of text. These distributed representations represent each word as a point in feature space and capture useful semantic and syntactic properties of words, and they have been shown to be useful in NLP tasks such as part-of-speech (POS) tagging. We have built two classes of word embeddings for these languages, namely Polyglot and Skipgram. We build a named entity recognition (NER) system that supports 40 languages, using the word embeddings we have generated as features, and we seek to use freely available Wikipedia text as training data. This involves training language models to obtain the word embeddings, understanding the properties of the learned word embeddings, and finally learning models for named entity classification. We also present a novel technique for evaluating our performance on the many languages for which no gold-standard test set exists. Our results demonstrate that word embeddings exhibit community structure and can be used effectively for NER with no explicit hand-crafted feature engineering, and that they perform competitively with existing baselines when coupled with simple language-agnostic techniques. | 54 pages
Recommended Citation
Kulkarni, Vivek V., "Multilingual Named Entity Recognition" (2014). Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions). 3113.
https://commons.library.stonybrook.edu/stony-brook-theses-and-dissertations-collection/3113