After a summer replete with feature engineering and corpus processing, the Stanford NLP Group has just released CoreNLP 3.4.1, which includes support for Spanish-language text. In this post I’ll show how to use these tools to build a dead-simple document summarizer.1
Our end goal will be to take a news article of significant length and reduce it to its two or three most important points. We’ll run through each sentence and assign it a score based on two factors:
**tf–idf weights.** The tf–idf metric measures how important a particular word is in the context of its containing document. We’ll calculate the sum of tf–idf scores for all nouns in each sentence, and consider the sentences with the greatest sums to be the most important.
The tf–idf metric is the product of two factors:
\[\text{tf–idf}_{t, d} = \text{tf}_{t, d} \cdot \text{idf}_t\]The first is a term frequency factor: some scaled version of the number of times the word appears in the given document. We’ll use a logarithmic form here:
\[\text{tf}_{t, d} = \log(1 + \text{count of $t$ in $d$})\]The second is an inverse document frequency (IDF) factor. This measures the informativeness of the word based on how often it appears in total across an entire corpus. The inverse document frequency factor is a logarithm as well:
\[\text{idf}_{t} = \log\left( \frac{\text{count of total documents}}{\text{count of documents containing $t$}} \right)\]Note that IDF values will be exactly 0 for common words like “the,” as they are likely to appear in every document in the corpus. Meaningful and less common words like “transmogrify” and “incinerate” will yield higher IDF values.
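To make the formulas concrete, here is a minimal sketch of both factors in plain Java, using ordinary maps in place of CoreNLP’s `Counter` class. All names here are illustrative, not taken from the actual `Summarizer` source:

```java
import java.util.Map;

class TfIdfSketch {
    /** tf_{t,d} = log(1 + count of t in d) */
    static double tf(Map<String, Integer> docCounts, String term) {
        return Math.log(1 + docCounts.getOrDefault(term, 0));
    }

    /** idf_t = log(totalDocs / docsContaining(t)); terms unseen in the corpus contribute 0 */
    static double idf(Map<String, Integer> docFrequencies, String term, int totalDocs) {
        int df = docFrequencies.getOrDefault(term, 0);
        if (df == 0) return 0.0;  // guard against division by zero for unseen terms
        return Math.log((double) totalDocs / df);
    }

    /** tf-idf_{t,d} = tf_{t,d} * idf_t */
    static double tfIdf(Map<String, Integer> docCounts,
                        Map<String, Integer> docFrequencies,
                        String term, int totalDocs) {
        return tf(docCounts, term) * idf(docFrequencies, term, totalDocs);
    }
}
```

Note that a word like “el” that appears in all 100 of 100 documents gets idf = log(100/100) = 0, exactly as described above.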
**Positional weight.** For news articles, another easy measure of a sentence’s importance is its position in the document: important sentences tend to appear before less crucial ones. We can model this by scaling the tf–idf score down as the sentence’s index in the document increases.
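One simple way to realize this is to divide the tf–idf sum by the sentence’s 1-based position; the exact weighting scheme below is my assumption for illustration, not necessarily the one used in the repo:

```java
class PositionalWeight {
    /**
     * Sketch of a positional weight: earlier sentences score higher.
     * sentenceIndex is 0-based, so the first sentence keeps its full score.
     */
    static double weightedScore(double tfIdfSum, int sentenceIndex) {
        return tfIdfSum / (sentenceIndex + 1);
    }
}
```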
With theory over, let’s get to the code. I’m going to walk through a Java class `Summarizer`, the full source code of which is available in a GitHub repo.
Our only dependency here is Stanford CoreNLP 3.4.1. We begin by instantiating the CoreNLP pipeline statically.
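Such a statically initialized Spanish pipeline might be configured along these lines. This is a hedged sketch: the annotator list is the minimum the summarizer needs (tokenization, sentence splitting, POS tagging), and the POS model path in particular is illustrative and should be checked against the Spanish models jar:

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

class PipelineHolder {
    // Built once per JVM; CoreNLP model loading is expensive.
    private static final StanfordCoreNLP pipeline;

    static {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        props.setProperty("tokenize.language", "es");
        // Model path is an assumption — verify against the Spanish models jar.
        props.setProperty("pos.model",
            "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
        pipeline = new StanfordCoreNLP(props);
    }
}
```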
As we discussed earlier, the summarizer depends upon document-frequency data, which must be precalculated from a corpus of Spanish text. In the constructor of the `Summarizer`, we receive a prebuilt `dfCounter` and determine the total number of documents in the training corpus.2
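In outline, the constructor could look like the following sketch, which stands in for CoreNLP’s `Counter` with a plain `Map` and assumes, purely for illustration, that the corpus-wide document total is stored under a reserved key:

```java
import java.util.Map;

class SummarizerSketch {
    // Reserved key for the corpus-wide document total; the key name is an
    // assumption for this sketch, not taken from the repo.
    static final String TOTAL_DOCS_KEY = "__TOTAL__";

    final Map<String, Integer> dfCounter;
    final int numDocuments;

    SummarizerSketch(Map<String, Integer> dfCounter) {
        this.dfCounter = dfCounter;
        this.numDocuments = dfCounter.getOrDefault(TOTAL_DOCS_KEY, 0);
    }
}
```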
Our main routine, `summarize`, accepts a document string and a number of sentences to return.
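The selection step can be sketched independently of CoreNLP: given sentences paired with scores, keep the k highest-scoring ones and re-emit them in their original document order. The names below are hypothetical, and sentence splitting and scoring are elided:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SummarizeSketch {
    /** Keep the k highest-scoring sentences, restored to document order. */
    static List<String> topK(List<String> sentences, List<Double> scores, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) idx.add(i);

        // Sort indices so the highest score comes first.
        idx.sort(Comparator.comparingDouble((Integer i) -> scores.get(i)).reversed());

        // Take the top k, then restore original (document) order.
        List<Integer> kept = new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
        kept.sort(null);

        List<String> out = new ArrayList<>();
        for (int i : kept) out.add(sentences.get(i));
        return out;
    }
}
```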
The method `rankSentences` sorts the provided sentence collection using a custom comparator, `SentenceComparator`, which contains the bulk of our actual logic for sentence importance. Here’s the framework:
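A possible shape for that comparator, with the scoring function injected so the sketch stays self-contained — the `Scorer` interface is my own placeholder, not part of the repo:

```java
import java.util.Comparator;

class SentenceComparatorSketch implements Comparator<String> {
    /** Placeholder for the score method described below. */
    interface Scorer { double score(String sentence); }

    private final Scorer scorer;

    SentenceComparatorSketch(Scorer scorer) { this.scorer = scorer; }

    @Override
    public int compare(String a, String b) {
        // Descending by score: "smaller" means more important,
        // so sorting puts the best sentences first.
        return Double.compare(scorer.score(b), scorer.score(a));
    }
}
```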
`score` and the following methods are the meat of the entire class. `score` accepts a sentence and returns a floating-point value indicating the sentence’s importance. It calls a method `tfIDFWeights`, which determines the total tf–idf score for all the nouns in the given sentence:
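The noun-filtered sum might look like this sketch, with POS tagging stubbed out: tokens arrive pre-paired with their tags, and Spanish noun tags are assumed to begin with “n” (as in the AnCora tagset CoreNLP’s Spanish models use). All parameter names are illustrative:

```java
import java.util.List;
import java.util.Map;

class TfIdfWeightsSketch {
    /** Sum tf-idf over the nouns of one sentence. */
    static double tfIdfWeights(List<String[]> taggedTokens,      // {word, posTag} pairs
                               Map<String, Integer> termCounts,  // counts of t in this document
                               Map<String, Integer> dfCounter,   // corpus document frequencies
                               int numDocuments) {
        double sum = 0.0;
        for (String[] tok : taggedTokens) {
            if (!tok[1].startsWith("n")) continue;  // keep nouns only
            int tfCount = termCounts.getOrDefault(tok[0], 0);
            int df = dfCounter.getOrDefault(tok[0], 0);
            if (df == 0) continue;                  // unseen in corpus: contributes nothing
            sum += Math.log(1 + tfCount) * Math.log((double) numDocuments / df);
        }
        return sum;
    }
}
```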
That’s it for the code. You can see the entire class in this public GitHub repo.
I’ll end with a quick, unscientific test of the code. I built document-frequency counts (using a helper `DocumentFrequencyCounter` class) from the Spanish Gigaword corpus, which contains about 1.5 billion words of Spanish. It took several days (running on a 16-core machine) to POS-tag each sentence and collect the nouns in a global counter.3
I next tested with a few recent Spanish news articles, requesting a two-sentence summary of each. Here’s the output summary of an article on the Laniakea supercluster:
Las galaxias no están distribuidas al azar en todo el universo, sino que se encuentran en grupos, al igual que nuestro propio Grupo Local, que contiene docenas de galaxias, y en cúmulos masivos, que poseen cientos de galaxias, todas interconectadas en una red de filamentos en la que se ensartan como perlas. Estos expertos han bautizado al supercúmulo con el nombre de ‘Laniakea’, que significa “cielo inmenso” en hawaiano, como informan en un artículo de la edición de este jueves de Nature. Una galaxia entre dos estructuras de este tipo puede quedar atrapada en un tira y afloja gravitacional en el que el equilibrio de las fuerzas gravitacionales que rodean las estructuras a gran escala determina el movimiento de la galaxia.
And another on Argentinian debt:
La inclusión de la capital de Francia como nueva jurisdicción para hacer efectivos los desembolsos a los acreedores ha sido una iniciativa del bloque ‘cristinista’ para ganar los votos de algunos legisladores opositores. Por ejemplo, los legisladores del Frente Renovador, también peronista pero no ‘cristinista’, según la prensa, acordarían con la inclusión de París, por considerar que allí los pagos estarían a salvo de los fondos especulativos o ‘buitre’. Con esta iniciativa el gobierno de la presidenta Cristina Fernández, viuda de Kirchner, pretende esquivar a la justicia de los Estados Unidos y a los fondos especulativos o ‘buitre’ que ganaron a Argentina un juicio y colocaron al país en ‘default’ parcial.
I hope this code serves as a useful example for using basic CoreNLP tools in Spanish. Feel free to follow up below in the comments or by email!
1. I won’t claim this will always give fantastic summarizations, but it’s definitely a quick and easy-to-grasp algorithm. ↩
2. If you are interested in how this helper data is constructed, see the `DocumentFrequencyCounter` class in the GitHub repo. ↩
3. This probably could have been optimized quite a bit, down to the level of hours, but when you’ve got the time… ↩