Séminaire Alpage : Yoav Goldberg

Friday, 25 September 2015, 11:00 to 13:00
Organization: 
Marie Candito (Alpage)
Location: 

ODG – Room 127

Yoav Goldberg (Bar Ilan University)
Understanding Neural Word Embeddings

Neural word embeddings, such as word2vec (Mikolov et al., 2013), have become increasingly popular in both academic and industrial NLP. These methods attempt to capture the semantic meanings of words by processing huge unlabeled corpora with methods inspired by neural networks and the recent advent of Deep Learning. The result is a vector representation of every word in a low-dimensional continuous space. These word vectors exhibit interesting arithmetic properties (e.g. king - man + woman = queen) (Mikolov et al., 2013), and seemingly outperform traditional vector-space models of meaning inspired by Harris's Distributional Hypothesis (Baroni et al., 2014). Our work attempts to demystify word embeddings and understand what makes them so much better than traditional methods at capturing semantic properties.
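As a rough illustration of the analogy arithmetic mentioned above, the sketch below ranks candidate words by cosine similarity to the offset vector king - man + woman (the "3CosAdd" formulation of the analogy task). The 4-dimensional vectors are hand-picked toy values, not output of any trained model; real word2vec embeddings are learned from large corpora and have hundreds of dimensions.

```python
import numpy as np

# Toy, hand-picked vectors purely for illustration; not from any trained model.
vectors = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.3]),
    "woman": np.array([0.7, 0.1, 0.9, 0.3]),
    "queen": np.array([0.8, 0.9, 0.9, 0.2]),
    "apple": np.array([0.1, 0.2, 0.1, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "king - man + woman": rank every other word by cosine similarity to the
# offset vector and return the closest one.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(best)  # with these toy vectors: "queen"
```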

Our main result shows that state-of-the-art word embeddings are actually "more of the same". In particular, we show that skip-gram with negative sampling, the newest algorithm in word2vec, implicitly factorizes a word-context PMI matrix, which has been widely used and studied in the NLP community for the past 20 years. We also find that the root of word2vec's perceived superiority is a collection of design choices and hyperparameter settings, which can be ported to distributional methods, yielding similar gains. Among our qualitative results are a method for observing the salient contexts of a given word vector, and the answer to why king - man + woman = queen. We also show task-specific extensions to the word2vec model, achieving improved accuracy on specific tasks.
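To make the connection to traditional distributional methods concrete, here is a minimal sketch (not the talk's actual implementation) of the classic count-based pipeline: build a word-context positive PMI matrix from co-occurrence counts and factorize it with a truncated SVD to obtain dense word vectors. The corpus, window size, and dimensionality below are illustrative assumptions only.

```python
from collections import Counter
import numpy as np

# Tiny illustrative corpus; a real experiment uses billions of tokens.
corpus = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the man walks the dog".split(),
    "the woman walks the dog".split(),
]

# Count word-context co-occurrences within a symmetric window.
window = 2
pair_counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pair_counts[(w, sent[j])] += 1

words = sorted({w for w, _ in pair_counts})
idx = {w: k for k, w in enumerate(words)}

counts = np.zeros((len(words), len(words)))
for (w, c), n in pair_counts.items():
    counts[idx[w], idx[c]] = n

# Positive PMI: max(0, log P(w,c) / (P(w) P(c))).
total = counts.sum()
p_w = counts.sum(axis=1) / total
p_c = counts.sum(axis=0) / total
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / np.outer(p_w, p_c))
ppmi = np.maximum(pmi, 0.0)

# Truncated SVD of the PPMI matrix yields dense, low-dimensional word
# vectors, the count-based counterpart of neural embeddings.
U, S, _ = np.linalg.svd(ppmi)
dim = 4
word_vectors = U[:, :dim] * S[:dim]
print(word_vectors[idx["king"]])
```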