GloVe (Global Vectors for Word Representation) is a tool recently released by Stanford NLP Group researchers Jeffrey Pennington, Richard Socher, and Chris Manning for learning continuous-space vector representations of words.
(jump to: theory, implementation)
Introduction
These real-valued word vectors have proven to be useful for all sorts of natural language processing tasks, including parsing,[1] named entity recognition,[2] and (very recently!) machine translation.[3][4]
It’s been shown (and widely shared by this point) that these word vectors exhibit interesting semantic and syntactic regularities. For example, we find that claims like the following hold for the associated word vectors:
\[\begin{align*}\text{king} - \text{man} + \text{woman} &\approx \text{queen} \\ \text{brought} - \text{bring} + \text{seek} &\approx \text{sought}\end{align*}\]
There’s quite a bit of buzz around the tools which build these word vectors at the moment, so I figured it would be worthwhile to provide some down-to-earth coverage of GloVe, one of the newest methods.
The GloVe authors present some results which suggest that their
tool is competitive with Google’s popular word2vec
package. In
order to better understand how GloVe works and to make available a nice
learning resource, I decided to port the open-source (yay!) but somewhat
difficult-to-read (no!) GloVe source code from C to Python.
In this post I’ll give an intuitive explanation of how the GloVe method works[5] and then provide a quick overview of the implementation in Python. You
can find the complete Python code (just 187 SLOC, including command-line
argument processing, IO, etc.) in the glove.py
GitHub repo.
A quick disclaimer before we begin: I wrote this code for tutorial purposes. It is nowhere near production-ready in terms of efficiency. If you would like to parallelize and optimize it as an exercise, be my guest — just be sure to share the results!
Theory
The GloVe model learns word vectors by examining word co-occurrences within a text corpus. Before we train the actual model, we need to construct a co-occurrence matrix \(X\), where a cell \(X_{ij}\) is a “strength” which represents how often the word \(i\) appears in the context of the word \(j\). We run through our corpus just once to build the matrix \(X\), and from then on use this co-occurrence data in place of the actual corpus. We will construct our model based only on the values collected in \(X\).
Once we’ve prepared \(X\), our task is to decide vector values in continuous space for each word we observe in the corpus. We will produce vectors with a soft constraint that for each pair of words \(i\) and \(j\),[6]
\[\begin{equation}\vec{w}_i^T \vec{w}_j + b_i + b_j = \log X_{ij},\end{equation}\]
where \(b_i\) and \(b_j\) are scalar bias terms associated with words \(i\) and \(j\), respectively. Intuitively speaking, we want to build word vectors that retain some useful information about how every pair of words \(i\) and \(j\) co-occur.
We’ll do this by minimizing an objective function \(J\), which evaluates the sum of all squared errors based on the above equation, weighted with a function \(f\):
\[\begin{equation}J = \sum_{i=1}^V \sum_{j=1}^V \; f\left(X_{ij}\right) \left( \vec{w}_i^T \vec{w}_j + b_i + b_j - \log X_{ij} \right)^2 \end{equation}\]
We choose an \(f\) that helps prevent common word pairs (i.e., those with large \(X_{ij}\) values) from skewing our objective too much:
\[\begin{equation}f\left(X_{ij}\right) = \left\{ \begin{array}{cl}\left(\frac{X_{ij}}{x_{\text{max}}}\right)^\alpha & \text{if } X_{ij} < x_{\text{max}} \\ 1 & \text{otherwise.} \end{array}\right. \end{equation} \]
When we encounter extremely common word pairs (where \(X_{ij} \geq x_{\text{max}}\)), this function cuts off its normal output and simply returns \(1\). For all other word pairs, we return some weight in the range \((0, 1)\), where the distribution of weights in this range is decided by \(\alpha\).
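To make the weighting concrete, here is a minimal sketch of \(f\) in Python. The cutoff \(x_{\text{max}} = 100\) and exponent \(\alpha = 0.75\) are the values suggested in the GloVe paper, not necessarily the defaults used in glove.py:

```python
def weight(x_ij, x_max=100.0, alpha=0.75):
    """Weighting function f(X_ij) from the objective above."""
    return min((x_ij / x_max) ** alpha, 1.0)

# Rare pairs receive small weights; very frequent pairs are capped at 1.
print(weight(5.0))    # ~0.106
print(weight(500.0))  # 1.0
```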
Implementation
Now for the code! I’ll skip the boring parts which do things like model saving and argument parsing, and focus on the three meatiest functions in the code:

- build_cooccur, which accepts a corpus and yields a list of co-occurrence blobs (the \(X_{ij}\) values), calculated by moving a sliding n-gram window over each sentence in the corpus;
- train_glove, which prepares the parameters of the model and manages training at a high level; and
- run_iter, which runs a single parameter update step.
First, our build_cooccur
function accepts a vocabulary (mapping words to
integer word IDs), a corpus (a simple iterator over sentences), and some
optional parameters: a context window size and a minimum count (used to
drop rare word co-occurrence pairs). We’ll start by building a sparse
matrix for collecting co-occurrences \(X_{ij}\) and some simple helper
data.
For each line in the corpus, we’ll conjure up a sequence of word IDs by looking each token up in the vocabulary.
Now for each word ID \(i\) in the sentence, we’ll extract a window of context words to the left of the word.
For each word ID \(j\) in the context, we’ll add weight to the cell \(X_{ij}\). The increment for the word pair is inversely related to the distance between the two words in question. This means that word instances which appear right next to each other receive larger \(X_{ij}\) increments than word instances separated by many intervening words.
One last technical point: we build the matrix \(X\) symmetrically. This means that we treat word co-occurrences where the context word is to the left of the main word exactly the same as co-occurrences where the context word is to the right of the main word.
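Putting those pieces together, here is a condensed sketch of such a routine in Python. It mirrors the steps just described but is not a line-for-line copy of glove.py: unknown tokens are simply dropped, the minimum-count filtering and the final yielding of non-zero cells are omitted, and the default window size of 10 is an illustrative choice.

```python
import numpy as np
from scipy import sparse

def build_cooccur_sketch(vocab, corpus, window_size=10):
    """Condensed sketch of the co-occurrence pass described above.

    `vocab` maps word -> integer ID; `corpus` is an iterable of sentence strings.
    """
    vocab_size = len(vocab)
    # LIL format is convenient for incremental writes into a sparse matrix.
    X = sparse.lil_matrix((vocab_size, vocab_size), dtype=np.float64)

    for line in corpus:
        # Conjure up the sequence of word IDs (unknown tokens dropped here
        # for simplicity).
        token_ids = [vocab[word] for word in line.strip().split() if word in vocab]

        for center_pos, center_id in enumerate(token_ids):
            # Context window: up to `window_size` words to the left.
            context_ids = token_ids[max(0, center_pos - window_size):center_pos]

            for left_pos, context_id in enumerate(context_ids):
                # Closer pairs get larger increments: weight = 1 / distance.
                distance = len(context_ids) - left_pos
                increment = 1.0 / float(distance)

                # Build X symmetrically, so left- and right-context
                # co-occurrences are treated identically.
                X[center_id, context_id] += increment
                X[context_id, center_id] += increment

    return X

# Toy usage:
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
X = build_cooccur_sketch(toy_vocab, ["the cat sat on the mat"], window_size=3)
print(X.toarray())
```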
That’s about it — build_cooccur
finishes with
a bit more code to yield co-occurrence pairs from this sparse matrix, but
I won’t bother to show it here.
Next, train_glove
initializes the model parameters given the fully
constructed co-occurrence data. We expect the same vocab
object as
before as a first parameter. The second parameter, cooccurrences
,
is a co-occurrence iterator produced in build_cooccur
, which
yields co-occurrence tuples of the form (main_word_id,
context_word_id, x_ij)
, where x_ij
is an \(X_{ij}\)
co-occurrence value as introduced above.
We next prepare the primary model parameters: the word vector matrix \(W\) and
a collection of bias scalars. Note that our word matrix has twice as many rows as there are words in the vocabulary. We will see why later, when we describe
the run_iter
function.
We will be training using adaptive gradient descent (AdaGrad),[7] and so we’ll also need to initialize helper matrices for \(W\) and the bias vector which track gradient histories. Note that these are all initialized as blocks of ones. By starting with every gradient history equal to one, our first training step in AdaGrad will simply use the global learning rate for each example. (See footnote 7 to work this out from the AdaGrad definition.)
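In sketch form, the setup might look like the following. The vocabulary size, vector dimensionality, and the exact random initialization here are illustrative assumptions rather than values taken from glove.py:

```python
import numpy as np

vocab_size, d = 10000, 100   # illustrative sizes

# One "main" and one "context" vector per word, hence 2 * vocab_size rows;
# small values centered on zero are a reasonable starting point.
W = (np.random.rand(vocab_size * 2, d) - 0.5) / float(d + 1)
biases = (np.random.rand(vocab_size * 2) - 0.5) / float(d + 1)

# AdaGrad gradient histories, all ones so that the very first update
# reduces to plain SGD with the global learning rate.
gradient_squared = np.ones((vocab_size * 2, d), dtype=np.float64)
gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)
```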
Next, we begin training by iteratively calling the run_iter function. Before doing so, train_glove pre-fetches, for each co-occurrence pair, the relevant word vectors, biases, gradient histories, and \(X_{ij}\) value, bundled together into tuples (their exact contents are listed below). run_iter accepts this pre-fetched data and begins by shuffling it and establishing a global cost for the iteration.
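In sketch form (the names here are illustrative rather than copied from glove.py):

```python
from random import shuffle

data = []           # stands in for the pre-fetched list of co-occurrence tuples
global_cost = 0.0   # running weighted squared error for this iteration

# Visit the tuples in a fresh random order on every pass.
shuffle(data)
```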
Now for every co-occurrence data tuple, we compute the weighted cost as described in the above theory section. Each tuple has the following elements:

- v_main: the word vector for the main word in the co-occurrence
- v_context: the word vector for the context word in the co-occurrence
- b_main: bias scalar for main word
- b_context: bias scalar for context word
- gradsq_W_main: a vector storing the squared gradient history for the main word vector (for use in the AdaGrad update)
- gradsq_W_context: a vector gradient history for the context word vector
- gradsq_b_main: a scalar gradient history for the main word bias
- gradsq_b_context: a scalar gradient history for the context word bias
- cooccurrence: the \(X_{ij}\) value for the co-occurrence pair, described at length above
We retain an intermediate “inner” cost (not squared or weighted) for use in calculating the gradient in the next section.
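Here is a minimal sketch of that computation for a single tuple. The parameter names mirror the tuple layout above, but treating the biases as plain floats and reusing the paper’s suggested \(x_{\text{max}}\) and \(\alpha\) are simplifications of mine, not necessarily how glove.py handles them:

```python
import numpy as np

def pair_cost(v_main, v_context, b_main, b_context, cooccurrence,
              x_max=100.0, alpha=0.75):
    """Weighted squared error for one (i, j) pair, plus the unweighted residual."""
    weight = min((cooccurrence / x_max) ** alpha, 1.0)   # f(X_ij)
    cost_inner = (v_main.dot(v_context) + b_main + b_context
                  - np.log(cooccurrence))                # term inside the square
    return weight * cost_inner ** 2, cost_inner

# Toy usage: two random 50-dimensional vectors and an X_ij value of 12.
cost, cost_inner = pair_cost(np.random.rand(50), np.random.rand(50),
                             0.0, 0.0, cooccurrence=12.0)
```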
With the cost calculated, we now need to compute gradients. From our original cost function \(J\) we derive gradients with respect to the relevant parameters \(\vec w_i\), \(\vec w_j\), \(b_i\), and \(b_j\). (Since \(f(X_{ij})\) doesn’t depend on any of these parameters, the derivations are quite simple; a constant factor of 2 from differentiating the square is dropped below, as it can be folded into the learning rate.)
\[\begin{align*}J &= \sum_{i=1}^V \sum_{j=1}^V \; f\left(X_{ij}\right) \left( \vec{w}_i^T \vec{w}_j + b_i + b_j - \log X_{ij} \right)^2 \\ \nabla_{\vec{w}_i} J &= \sum_{j=1}^V f\left(X_{ij}\right) \left( \vec{w}_i^T \vec{w}_j + b_i + b_j - \log X_{ij}\right) \vec{w}_j \\ \frac{\partial J}{\partial b_i} &= \sum_{j=1}^V f\left(X_{ij}\right) \left(\vec w_i^T \vec w_j + b_i + b_j - \log X_{ij}\right) \end{align*} \]
Now let’s put that in code! We use the earlier-calculated intermediate
value cost_inner
, which stores the value being squared and weighted in
the full cost function.
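A minimal sketch of those per-pair gradients, following the formulas above (again, the names mirror the tuple layout but are not necessarily glove.py’s exact ones):

```python
def pair_gradients(v_main, v_context, weight, cost_inner):
    """Gradients of the weighted squared error for one (i, j) pair.

    `weight` is f(X_ij); `cost_inner` is the residual computed above.
    """
    grad_main = weight * cost_inner * v_context   # w.r.t. the main word vector
    grad_context = weight * cost_inner * v_main   # w.r.t. the context word vector
    grad_bias_main = weight * cost_inner          # w.r.t. the main word bias
    grad_bias_context = weight * cost_inner       # w.r.t. the context word bias
    return grad_main, grad_context, grad_bias_main, grad_bias_context
```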
Finally, we update the parameters with AdaGrad[7] and add the squared gradients to the gradient history variables.
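In sketch form, the update for a single parameter might look like this. The learning rate of 0.05 is the initial value suggested in the GloVe paper, not necessarily what glove.py uses, and param and gradsq are assumed to be NumPy arrays (a word vector and its gradient-history row, or one-element views of the bias arrays) so that they can be updated in place:

```python
import numpy as np

def adagrad_update(param, gradsq, grad, learning_rate=0.05):
    """One AdaGrad step: scale by the accumulated history, then update it."""
    param -= (learning_rate * grad) / np.sqrt(gradsq)  # adaptive step
    gradsq += grad ** 2                                # accumulate squared gradient

# Applied once per tuple to v_main, v_context, and the two biases, e.g.:
#   adagrad_update(v_main, gradsq_W_main, grad_main)
```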
After we’ve processed all data for the iteration, we return the global cost and relax for a while.
That’s it for code! If you’d like to see word vectors produced by this Python code in action, check out this IPython notebook.
If you found this all fascinating, I highly recommend digging into the official GloVe documentation, especially the original paper, which is due to be published at this year’s EMNLP conference. For a quality general overview of word representations and their uses, see Peter Turney and Patrick Pantel’s paper, “From frequency to meaning: Vector space models of semantics.”
Distributed word representations such as those which GloVe produces are really revolutionizing natural language processing. I’m excited to see what happens as more and more tools of this sort are disseminated outside of academia and put to real-world use.
If you’re making use of GloVe or similar tools in your own projects, let me know. Until next time, happy coding!
1. Richard Socher et al., “Parsing with Compositional Vector Grammars,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Sofia, Bulgaria: Association for Computational Linguistics, 2013), 455–65.
2. Joseph Turian, Lev Ratinov, and Yoshua Bengio, “Word Representations: A Simple and General Method for Semi-Supervised Learning,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2010), 384–94.
3. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv:1409.0473 [cs, stat], September 1, 2014. This is what I’m working on right now—if this sounds interesting to you, get in touch!
4. There is still quite a bit of debate, however, over the best way to construct these vectors. The popular tool word2vec, which has seen wide use and wide success in the past year, builds so-called neural word embeddings, whereas GloVe and others construct word vectors based on counts. I won’t get into the controversy in this post, but feel free to read up and pick a side. See e.g. Marco Baroni, Georgiana Dinu, and Germán Kruszewski, “Don’t Count, Predict! A Systematic Comparison of Context-Counting vs. Context-Predicting Semantic Vectors,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Baltimore, Maryland: Association for Computational Linguistics, 2014), 238–47; Omer Levy and Yoav Goldberg, “Linguistic Regularities in Sparse and Explicit Word Representations,” in Proceedings of the Eighteenth Conference on Computational Natural Language Learning (Ann Arbor, Michigan: Association for Computational Linguistics, 2014), 171–80.
5. I hope this post is a useful supplement to the original paper. If you have the time, read the original too — it has a lot of useful and well-stated insights about the task of word representations in general.
6. I am skipping over a lot of interesting / beautiful details here — please read the paper if you are interested in more than the implementation!
7. AdaGrad is a modified form of stochastic gradient descent which attempts to guide learning in the proper direction by weighting rarely occurring features more heavily than those that fire constantly. Briefly, for a gradient component \(g_{t,i}\) at training step \(t\), AdaGrad defines the gradient descent update to be \[x_{t+1, i} = x_{t, i} - \dfrac{\eta}{\sqrt{\sum_{t'=1}^{t-1} g_{t', i}^2}} g_{t, i}.\] For more thorough coverage, see this AdaGrad tutorial.