Document Summarization using Clustering and Submodular Functions


What do we do?


How did we generate the summary from a collection of documents?

The extractive text summarization process can be divided into two steps:

  1. a pre-processing step and
  2. a processing step.


The pre-processing step builds a structured representation of the original text. It usually includes steps such as sentence segmentation, tokenization, stop-word removal, and stemming.


In the processing step, coverage and diversity measures are computed for each sentence using its tf-idf scores. These measures are then combined with appropriate weights to produce a final score for each sentence, and the top-ranking sentences are selected for the final summary.
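As a concrete illustration, tf-idf scoring of sentences can be sketched in pure Python. The whitespace tokenization here is deliberately naive; a real pipeline would apply the pre-processing steps above first.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Compute a tf-idf vector (as a sparse dict) for each sentence,
    treating each sentence as a 'document' for the idf statistics."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences containing each term.
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Pairwise cosine similarities of these vectors are what the coverage and diversity measures below consume.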


How do we evaluate the generated Summary?

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a family of metrics that compares an automatically generated summary against one or more human-written reference summaries by counting overlapping units such as n-grams and word sequences.
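A minimal sketch of ROUGE-N recall, the recall-oriented variant most relevant here. Real evaluations use the official ROUGE toolkit, which adds stemming and stop-word options; this version just counts raw n-gram overlap.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams / total n-grams in the reference."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clipped overlap: each reference n-gram is matched at most as many
    # times as it appears in the candidate.
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / sum(ref.values()) if ref else 0.0
```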



We used the DUC-2004 dataset, which contains 50 TDT topics/events/timespans and a subset of the documents TDT annotators found for each one. The documents were taken from the AP and New York Times newswires, and subsets were formed with an average of 10 documents per subset. The dataset also contains very short summaries of each document (at most 75 bytes) and a short summary (at most 665 bytes) of each cluster.


Submodular functions

Submodular functions share many properties in common with convex functions, one of which is that they are closed under a number of common combination operations (summation, certain compositions, restrictions, and so on). These operations give us the tools necessary to design a powerful submodular objective for document summarization that extends beyond previous work.

We are given a set of objects V = {v1, ..., vn} and a function F : 2^V → R that returns a real value for any subset S ⊆ V. We are interested in finding the subset of bounded size |S| ≤ k that maximizes the function, i.e., argmax S⊆V, |S|≤k F(S). Submodular functions are those that satisfy the property of diminishing returns: for any A ⊆ B ⊆ V \ {v}, a submodular function F must satisfy F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B). That is, the incremental value of v decreases as the context in which v is considered grows from A to B. F is monotone non-decreasing if ∀A ⊆ B, F(A) ≤ F(B).
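The standard way to approximately solve this maximization is the greedy algorithm, which for monotone submodular F achieves a (1 − 1/e) approximation guarantee (Nemhauser et al.). A sketch, using a toy set-cover objective as the submodular function; the element sets are illustrative:

```python
def greedy_max(V, F, k):
    """Greedy maximization of a set function F under |S| <= k.
    For monotone submodular F this gives a (1 - 1/e) approximation."""
    S = set()
    for _ in range(k):
        # Pick the element with the largest marginal gain F(S + v) - F(S).
        best, best_gain = None, 0.0
        for v in V - S:
            gain = F(S | {v}) - F(S)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no remaining element adds value
            break
        S.add(best)
    return S

# Toy objective: set cover, a canonical monotone submodular function.
# Each object "covers" a set of concepts; F counts concepts covered.
concepts = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}

def F(S):
    covered = set().union(*(concepts[v] for v in S)) if S else set()
    return len(covered)
```

Running `greedy_max(set(concepts), F, 2)` selects the two objects that jointly cover all six concepts.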


Properties of a good Summary

Two properties of a good summary are

  1. Relevance and
  2. Diversity i.e., non- redundancy.

Objective functions for extractive summarization usually measure these two properties separately and then mix them together, trading off between encouraging relevance and penalizing redundancy. The redundancy penalty usually violates the monotonicity of the objective function. We therefore propose to positively reward diversity instead of negatively penalizing redundancy. In particular, we model the summary quality as

F(S) = L(S) + λR(S)

where L(S) measures the coverage, or fidelity, of the summary set S with respect to the document,
R(S) rewards diversity in S, and
λ is a trade-off coefficient.

Coverage Measure, L(S):

L(S) can be interpreted either as a set function that measures the similarity of summary set S to the document to be summarized, or as a function representing some form of coverage of V by S.
Most naturally, L(S) should be monotone, as coverage improves with a larger summary.
L(S) should also be submodular: consider adding a new sentence into two summary sets, one a subset of the other. Intuitively, the increment when adding a new sentence to the small summary set should be larger than the increment when adding it to the larger set, as the information carried by the new sentence might have already been covered by those sentences that are in the larger summary but not in the smaller
summary. This is exactly the property of diminishing returns.
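One common monotone submodular choice for L(S), in the spirit of Lin and Bilmes, saturates each sentence's coverage at a fraction α of its total similarity to the whole document. The similarity matrix here is an assumed, precomputed input (e.g., cosine similarity of tf-idf vectors):

```python
def coverage(S, sim, alpha=0.1):
    """L(S) = sum_i min( sum_{j in S} sim[i][j], alpha * sum_{j in V} sim[i][j] ).
    Each sentence i can be 'covered' only up to a fraction alpha of its
    total similarity to the document, which makes L monotone and submodular."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        covered = sum(sim[i][j] for j in S)
        cap = alpha * sum(sim[i][j] for j in range(n))
        total += min(covered, cap)
    return total
```

The min with the cap is what produces diminishing returns: once sentence i is sufficiently covered, additional similar sentences contribute nothing further for it.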

Diversity Measure, R(S):

Instead of penalizing redundancy by subtracting from the objective, we propose to reward diversity. Given a partition of the sentences into clusters P1, ..., PK, we use R(S) = Σk √(Σ j∈S∩Pk rj), where rj is a singleton reward for sentence j. R(S) rewards diversity in that there is usually more benefit to selecting a sentence from a cluster that does not yet have one of its elements chosen. As soon as an element is selected from a cluster, other elements from the same cluster start having diminishing gain, thanks to the square root function.
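A sketch of such a square-root diversity reward; the cluster partition and per-sentence reward values here are illustrative inputs (in practice the reward might be a sentence's average similarity to the document):

```python
import math

def diversity(S, clusters, reward):
    """R(S) = sum over clusters of sqrt( sum of rewards of selected sentences ).
    The square root gives diminishing gain once a cluster is represented,
    so picking from an untouched cluster is worth more."""
    total = 0.0
    for cluster in clusters:
        in_s = sum(reward[j] for j in cluster if j in S)
        total += math.sqrt(in_s)
    return total
```

With equal rewards, a summary that spans two clusters scores strictly higher than one that takes both sentences from the same cluster, which is exactly the behavior a redundancy penalty tries to enforce, but without breaking monotonicity.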


Clustering Approaches:

  1. K-means Clustering

We ran a grid search over the values of α and λ to find the combination that maximizes the submodular objective.
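For reference, a plain-Python sketch of K-means (Lloyd's algorithm) over sentence vectors; a real implementation would typically use a library such as scikit-learn:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on lists of equal-length vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[idx].append(p)
        # Update step: recompute each center as the mean of its group.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, groups
```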


  1. Summary ← φ
  2. allowedClusters ← allClusters
  3. while size(Summary) ≤ 665 bytes:
       pick the cluster in allowedClusters most similar to the corpus → chosenCluster
       chosenSentence ← highest-ranking sentence of chosenCluster based on the coverage and diversity measures
       Summary ← Summary ∪ {chosenSentence}
  2. Hierarchical Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with
only one sample.

Each observation (here, a sentence) starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy. In computing the clusters, we used the "Ward" criterion, which minimizes the sum of squared distances within all clusters.
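A minimal sketch of agglomerative clustering with the Ward criterion, merging at each step the pair of clusters whose merge least increases the within-cluster sum of squares. This is a naive O(n³) version; production code would use scipy or scikit-learn:

```python
def ward_clustering(points, n_clusters):
    """Bottom-up agglomerative clustering with the Ward criterion."""
    clusters = [[p] for p in points]

    def centroid(c):
        return [sum(col) / len(c) for col in zip(*c)]

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                mi, mj = centroid(ci), centroid(cj)
                # Ward's cost: the increase in total within-cluster
                # sum of squares caused by merging ci and cj.
                d2 = sum((a - b) ** 2 for a, b in zip(mi, mj))
                cost = len(ci) * len(cj) / (len(ci) + len(cj)) * d2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```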

  3. Chinese Whispers Graph-Clustering

The algorithm works in the following way on an undirected graph (unweighted in its basic form):

  1. All nodes are assigned to a random class. The number of initial classes equals the number of nodes.
  2. All of the network's nodes are then visited one by one in random order. Each node moves to the class with which it shares the most links (in the weighted case, the highest total edge weight). In the case of a tie, the class is chosen at random from the equally linked classes.
  3. Step two repeats until a predetermined number of iterations is reached or until the process converges. The classes that emerge at the end represent the clusters of the network.

A predetermined limit on the number of iterations is needed because the process may not converge. On the other hand, in a network with approximately 10,000 nodes the clusters do not change significantly after 40-50 iterations even without convergence. Here we connected two nodes with a weighted edge only if their similarity was at least 0.05, and we set the number of iterations to 10.
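The steps above can be sketched directly; edge weights here stand in for the sentence similarities described in the text:

```python
import random

def chinese_whispers(edges, n_iters=10, seed=0):
    """Chinese Whispers: each node repeatedly adopts the label carrying the
    highest total edge weight among its neighbours."""
    rng = random.Random(seed)
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    # Every node starts in its own class.
    labels = {node: node for node in graph}
    nodes = list(graph)
    for _ in range(n_iters):
        rng.shuffle(nodes)
        for node in nodes:
            # Sum neighbour weights per label, then adopt the strongest label.
            scores = {}
            for nb, w in graph[node].items():
                scores[labels[nb]] = scores.get(labels[nb], 0.0) + w
            if scores:
                labels[node] = max(scores, key=scores.get)
    return labels
```

Labels only propagate along edges, so disconnected components always end up in different clusters.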



The following scores were obtained by running the code with λ = 1 and α = 0:

Dataset   K-means               Chinese Whispers      Hierarchical
d30001t 0.3458823529411765 0.3458823529411765 0.3458823529411765
d30002t 0.2717149220489978 0.2717149220489978 0.2717149220489978
d30003t 0.3463414634146341 0.3463414634146341 0.3463414634146341
d30005t 0.21541950113378686 0.21541950113378686 0.21541950113378686
d30006t 0.3160270880361174 0.3160270880361174 0.3160270880361174
d30007t 0.3373493975903614 0.3373493975903614 0.3373493975903614
d30008t 0.30260047281323876 0.30260047281323876 0.30260047281323876
d30010t 0.31234866828087166 0.3341404358353511 0.3341404358353511
d30011t 0.26715686274509803 0.26715686274509803 0.19852941176470587
d30015t 0.31220657276995306 0.31220657276995306 0.31220657276995306
d30017t 0.28893905191873587 0.28893905191873587 0.28893905191873587
d30020t 0.3091334894613583 0.3091334894613583 0.3091334894613583
d30024t 0.2639225181598063 0.2639225181598063 0.2639225181598063
d30026t 0.3530864197530864 0.3530864197530864 0.3530864197530864
d30027t 0.22249388753056235 0.22249388753056235 0.22493887530562348
d30028t 0.2857142857142857 0.2857142857142857 0.2833333333333333
d30029t 0.2982062780269058 0.2982062780269058 0.2982062780269058
d30031t 0.31096196868008946 0.31096196868008946 0.31096196868008946
d30033t 0.3480278422273782 0.3480278422273782 0.3480278422273782
d30034t 0.2961165048543689 0.2961165048543689 0.24271844660194175
d30036t 0.28438228438228436 0.28438228438228436 0.28438228438228436
d30037t 0.35514018691588783 0.32242990654205606 0.32242990654205606
d30038t 0.35944700460829493 0.35944700460829493 0.35944700460829493
d30040t 0.3024390243902439 0.3024390243902439 0.32195121951219513
d30042t 0.2725274725274725 0.2725274725274725 0.2725274725274725
d30044t 0.32860520094562645 0.32860520094562645 0.32860520094562645
d30045t 0.29464285714285715 0.29464285714285715 0.29464285714285715
d30047t 0.3835920177383592 0.3835920177383592 0.3835920177383592
d30048t 0.2636363636363636 0.2636363636363636 0.27045454545454545
d30049t 0.38137472283813745 0.38137472283813745 0.38137472283813745
d30050t 0.2932692307692308 0.2932692307692308 0.2932692307692308
Average 0.3071840 0.306831 0.303747