The extractive text summarization process can be divided into two steps:
The pre-processing step builds a structured representation of the original text. It usually includes:
In the processing step, coverage and diversity measures are computed for each sentence from its tf-idf scores. These measures are then combined with appropriate weights to produce a final score for each sentence, and the top-ranking sentences are selected for the summary.
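As an illustrative sketch (not the exact scoring used here), this step can be implemented with plain tf-idf vectors; the function names and the weights `w_cov`/`w_div` are hypothetical:

```python
import math
from collections import Counter

def tfidf_scores(sentences):
    """Compute a tf-idf weight for every term in every sentence.
    Each sentence is treated as a 'document' for idf purposes."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t])
         for t, c in Counter(doc).items()}
        for doc in docs
    ]

def sentence_score(i, vectors, w_cov=0.7, w_div=0.3):
    """Combine a coverage term (total tf-idf mass of the sentence) with a
    diversity term (1 minus the max cosine similarity to any other sentence)."""
    def cos(u, v):
        num = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return num / (nu * nv) if nu and nv else 0.0
    coverage = sum(vectors[i].values())
    diversity = 1.0 - max(
        (cos(vectors[i], vectors[j]) for j in range(len(vectors)) if j != i),
        default=0.0)
    return w_cov * coverage + w_div * diversity
```

Ranking the sentences by `sentence_score` and keeping the top k then yields the extractive summary.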
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a family of metrics that evaluates an automatically produced summary by measuring its n-gram overlap with human-written reference summaries.
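A minimal sketch of ROUGE-N recall, assuming simple whitespace tokenization (actual evaluations use the official ROUGE toolkit):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap divided by the number of
    n-grams in the reference summary."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    return overlap / max(sum(ref.values()), 1)
```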
We used the DUC-2004 dataset, which contains 50 TDT topics/events/timespans and a subset of the documents TDT annotators found for each topic. The documents were taken from the AP and New York Times newswires and grouped into clusters with an average of 10 documents each. The dataset also contains very short summaries (at most 75 bytes) of each document and a short summary (at most 665 bytes) of each cluster.
Submodular functions share many properties with convex functions; one is that they are closed under a number of common combination operations (summation, certain compositions, restrictions, and so on). These operations give us the tools needed to design a powerful submodular objective for document summarization that extends beyond previous work.
We are given a set of objects V = {v1, ..., vn} and a function F : 2^V → R that returns a real value for any subset S ⊆ V. We are interested in finding the subset of bounded size |S| ≤ k that maximizes the function, i.e., argmax_{S ⊆ V} F(S). Submodular functions are those that satisfy the property of diminishing returns: for any A ⊆ B ⊆ V \ {v}, a submodular function F must satisfy F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B). That is, the incremental value of v decreases as the context in which v is considered grows from A to B. F is monotone non-decreasing if ∀A ⊆ B, F(A) ≤ F(B).
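A toy illustration of diminishing returns, using word coverage as the submodular function (the word sets below are made up for the example):

```python
def coverage(S, word_sets):
    """A simple monotone submodular function: the number of distinct
    words covered by the selected sentences."""
    covered = set()
    for i in S:
        covered |= word_sets[i]
    return len(covered)

# word sets of three toy "sentences"
word_sets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
A, B, v = {0}, {0, 1}, 2                 # A is a subset of B, v is in neither
gain_A = coverage(A | {v}, word_sets) - coverage(A, word_sets)
gain_B = coverage(B | {v}, word_sets) - coverage(B, word_sets)
# diminishing returns: the gain from adding v shrinks as the set grows
```

Here adding sentence 2 to A contributes two new words ("c", "d") but only one ("d") to the larger B, since "c" is already covered.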
Two properties of a good summary are coverage of the document's relevant content and diversity among the selected sentences.
Objective functions for extractive summarization usually measure these two qualities separately and then mix them, trading off rewards for relevance against penalties for redundancy. The redundancy penalty usually violates the monotonicity of the objective function. We therefore propose to positively reward diversity instead of negatively penalizing redundancy. In particular, we model summary quality as
F(S) = L(S) + λR(S)
where L(S) measures the coverage, or fidelity, of summary set S to the document,
R(S) rewards diversity in S, and
λ is a trade-off coefficient.
L(S) can be interpreted either as a set function that measures the similarity of summary set S to the document to be summarized, or as a function representing some form of coverage of V by S.
Most naturally, L(S) should be monotone, as coverage improves with a larger summary.
L(S) should also be submodular: consider adding a new sentence to each of two summary sets, one a subset of the other. Intuitively, the increment from adding the sentence to the smaller set should be at least as large as the increment from adding it to the larger set, since the information the new sentence carries may already be covered by the sentences that are in the larger summary but not in the smaller one. This is exactly the property of diminishing returns.
Instead of penalizing redundancy by subtracting from the objective, we propose to reward diversity. The function R(S) rewards diversity in that there is usually more benefit to selecting a sentence from a cluster none of whose elements has been chosen yet. As soon as an element is selected from a cluster, the other elements of that cluster start to have diminishing gain, thanks to the square root function.
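A minimal sketch of this objective and its greedy maximization, under assumed inputs: a sentence-similarity matrix `sim`, precomputed `clusters` of sentence indices, and an illustrative singleton reward (average similarity to the document); the α saturation of the coverage term is omitted for brevity:

```python
import math

def singleton_reward(sim, j):
    # reward of sentence j on its own: its average similarity to the document
    return sum(sim[j]) / len(sim)

def objective(S, sim, clusters, lam):
    """F(S) = L(S) + lam * R(S)."""
    n = len(sim)
    L = sum(sim[i][j] for i in range(n) for j in S)          # coverage
    R = sum(math.sqrt(sum(singleton_reward(sim, j)
                          for j in S if j in cluster))
            for cluster in clusters)                          # diversity
    return L + lam * R

def greedy_summary(sim, clusters, k, lam):
    """Greedily add the sentence giving the largest objective value."""
    S = set()
    while len(S) < k:
        S.add(max((j for j in range(len(sim)) if j not in S),
                  key=lambda j: objective(S | {j}, sim, clusters, lam)))
    return S
```

With a large enough λ, the square root makes a second pick from an already-represented cluster less attractive than a first pick from a fresh cluster, so the greedy summary spreads across clusters.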
We ran a grid search over the values of α and λ to find the combination that maximizes the submodular objective.
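Schematically, the grid search can be sketched as follows; `evaluate` is a hypothetical callback that builds a summary with the given parameters and returns its score:

```python
import itertools

def grid_search(evaluate, alphas, lams):
    """Try every (alpha, lam) pair and keep the one scoring best."""
    return max(itertools.product(alphas, lams),
               key=lambda pair: evaluate(*pair))
```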
Algorithm:
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.
Each observation (here, a sentence) starts in its own cluster, and clusters are successively merged. The linkage criterion determines the metric used for the merge strategy. In computing the clusters, we used the "Ward" criterion, which minimizes the sum of squared distances within all clusters.
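A minimal sketch of Ward-style agglomeration on one-dimensional values (a real run would cluster sentence vectors, e.g. with scipy's or scikit-learn's ward linkage):

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b
    (lists of 1-D points) merge: |a||b|/(|a|+|b|) * (mean_a - mean_b)^2."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    return na * nb / (na + nb) * (ma - mb) ** 2

def ward_clusters(points, k):
    """Each point starts in its own cluster; repeatedly merge the pair
    whose merge least increases the total within-cluster variance."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```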
The algorithm works in the following way on an undirected graph:
A predetermined threshold on the number of iterations is needed because the process may not converge. On the other hand, in a network with approximately 10,000 nodes the clusters do not change significantly after 40-50 iterations even without convergence. We connect two nodes with a weighted edge only if their similarity is at least 0.05, and we set the number of iterations to 10.
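A sketch of the label-propagation loop, assuming the graph is given as a dict of weighted edges; the iteration cap plays the role of the threshold described above:

```python
import random
from collections import defaultdict

def chinese_whispers(edges, n_nodes, iterations=10, seed=0):
    """edges: dict (u, v) -> weight for an undirected weighted graph.
    Each node starts with its own label; on every pass, each node adopts
    the label with the highest total edge weight among its neighbours."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    labels = list(range(n_nodes))
    for _ in range(iterations):          # capped: may never fully converge
        order = list(range(n_nodes))
        rng.shuffle(order)               # visit nodes in random order
        for u in order:
            if not adj[u]:
                continue
            weight = defaultdict(float)
            for v, w in adj[u]:
                weight[labels[v]] += w
            labels[u] = max(weight, key=weight.get)
    return labels
```

Nodes that end up with the same label form a cluster, so the number of clusters is discovered rather than fixed in advance.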
The following scores were obtained by running the code with λ = 1 and α = 0:
Dataset | K-means | Chinese Whispers | Hierarchical |
---|---|---|---|
d30001t | 0.3459 | 0.3459 | 0.3459 |
d30002t | 0.2717 | 0.2717 | 0.2717 |
d30003t | 0.3463 | 0.3463 | 0.3463 |
d30005t | 0.2154 | 0.2154 | 0.2154 |
d30006t | 0.3160 | 0.3160 | 0.3160 |
d30007t | 0.3373 | 0.3373 | 0.3373 |
d30008t | 0.3026 | 0.3026 | 0.3026 |
d30010t | 0.3123 | 0.3341 | 0.3341 |
d30011t | 0.2672 | 0.2672 | 0.1985 |
d30015t | 0.3122 | 0.3122 | 0.3122 |
d30017t | 0.2889 | 0.2889 | 0.2889 |
d30020t | 0.3091 | 0.3091 | 0.3091 |
d30024t | 0.2639 | 0.2639 | 0.2639 |
d30026t | 0.3531 | 0.3531 | 0.3531 |
d30027t | 0.2225 | 0.2225 | 0.2249 |
d30028t | 0.2857 | 0.2857 | 0.2833 |
d30029t | 0.2982 | 0.2982 | 0.2982 |
d30031t | 0.3110 | 0.3110 | 0.3110 |
d30033t | 0.3480 | 0.3480 | 0.3480 |
d30034t | 0.2961 | 0.2961 | 0.2427 |
d30036t | 0.2844 | 0.2844 | 0.2844 |
d30037t | 0.3551 | 0.3224 | 0.3224 |
d30038t | 0.3594 | 0.3594 | 0.3594 |
d30040t | 0.3024 | 0.3024 | 0.3220 |
d30042t | 0.2725 | 0.2725 | 0.2725 |
d30044t | 0.3286 | 0.3286 | 0.3286 |
d30045t | 0.2946 | 0.2946 | 0.2946 |
d30047t | 0.3836 | 0.3836 | 0.3836 |
d30048t | 0.2636 | 0.2636 | 0.2705 |
d30049t | 0.3814 | 0.3814 | 0.3814 |
d30050t | 0.2933 | 0.2933 | 0.2933 |
Average | 0.3072 | 0.3068 | 0.3037 |