Type

Text

Type

Thesis

Advisor

Skiena, Steven | Akoglu, Leman | Choi, Yejin | Ramakrishnan, I.V.

Date

2014-12-01

Keywords

Computer science | big data, complex networks, machine learning, named entity recognition, natural language processing

Department

Department of Computer Science.

Language

en_US

Source

This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.

Identifier

http://hdl.handle.net/11401/77292

Publisher

The Graduate School, Stony Brook University: Stony Brook, NY.

Format

application/pdf

Abstract

With massive amounts of unannotated text available from myriad sources, learning representations useful for natural language processing (NLP) tasks is an increasingly popular research area. Using deep learning techniques, we learn distributed representations of words (word embeddings) for 40 languages, with Wikipedia as the source of text. These distributed representations place each word as a point in feature space and capture useful semantic and syntactic properties of words. They have been shown to be useful in NLP tasks such as part-of-speech (POS) tagging. We have built two classes of word embeddings for these languages, namely Polyglot and Skipgram. We build a named entity recognition (NER) system that supports 40 languages, using the word embeddings we have generated as features and seeking to use freely available Wikipedia text as training data. This involves training language models to obtain the word embeddings, understanding the properties of the learnt word embeddings, and finally learning models for named entity classification. We also present a novel technique for evaluating our performance on the many languages for which no gold data set for testing exists. Our results demonstrate that word embeddings exhibit community structure, can be used effectively for NER with no explicit hand-crafted feature engineering, and perform competitively with existing baselines when coupled with simple language-agnostic techniques. | 54 pages
