Unsupervised Single Document Abstractive Summarization using Semantic Units
TL;DR: The paper argues that content frequency is an important signal for abstractive summarization and proposes a two-stage training framework in which the model learns the frequency of each semantic unit in the source text. The model is trained in an unsupervised manner and, during inference, identifies sentences with high-frequency semantic units to generate summaries. It outperforms other unsupervised methods on the CNN/Daily Mail summarization task and achieves competitive ROUGE scores with fewer parameters than pre-trained models. Because it can be trained under low-resource language settings, it is a potential solution for real-world applications where pre-trained models are not applicable.
Problems & Solutions
Lack of sufficient training pairs is a common issue in real-world applications of supervised summarization models.
The authors propose an unsupervised summarization method that exploits content frequency in the source text: the model automatically learns the frequency of semantic units and uses it to discriminate the salient parts of source documents for abstractive summarization.
Large pre-trained models for language generation or summarization may require less data for fine-tuning, but they are often trained on English corpora only and are therefore not suitable for low-resource languages.
The authors propose an unsupervised summarization method that does not rely on pre-trained models and can be applied to any language.
Creating high-quality training pairs for supervised summarization models can be costly.
The authors propose an unsupervised summarization method that does not require any human-written summaries during training or inference, making it suitable for real-world applications where human-written summaries are rarely accessible.
It is difficult to identify the most salient information in a source text for summarization.
The authors propose enumerating all text spans with a fixed-size sliding window to create "semantic units" (SUs), each carrying a brief semantic concept. They argue that a refined summary should at least contain the semantic units that occur frequently in the source article, since high-frequency semantic units tend to express the topic or carry key descriptions.
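To make the sliding-window construction concrete, here is a minimal sketch (not the authors' code): it enumerates fixed-size token spans and counts their raw frequency. The tokenization and the window size are illustrative assumptions.

```python
from collections import Counter

def extract_semantic_units(tokens, window_size=3):
    """Enumerate all fixed-size token spans ("semantic units", SUs)."""
    return [tuple(tokens[i:i + window_size])
            for i in range(len(tokens) - window_size + 1)]

# Raw SU frequency: how often each span occurs in the source document.
tokens = "the cat sat on the mat while the cat purred".split()
su_counts = Counter(extract_semantic_units(tokens, window_size=2))
print(su_counts.most_common(2))  # e.g. [(('the', 'cat'), 2), ...]
```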
It is difficult to retrieve frequency information from source documents only.
The authors propose a model that automatically learns semantic unit frequency and uses the learned frequency information to filter the sentences of the source text before generating a summary.
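One plausible reading of this filtering step is sketched below, reusing `extract_semantic_units` from the sketch above. Treating the learned frequencies as a given lookup table, the per-sentence averaging, and the `top_k` cutoff are all assumptions made for illustration; in the paper the frequencies come from the trained model.

```python
def sentence_score(sentence_tokens, su_frequency, window_size=2):
    """Average learned frequency of the semantic units a sentence contains."""
    units = extract_semantic_units(sentence_tokens, window_size)
    if not units:
        return 0.0
    return sum(su_frequency.get(u, 0.0) for u in units) / len(units)

def filter_sentences(sentences, su_frequency, top_k=3, window_size=2):
    """Keep the sentences richest in high-frequency SUs as generation input."""
    ranked = sorted(sentences,
                    key=lambda s: sentence_score(s.split(), su_frequency, window_size),
                    reverse=True)
    return ranked[:top_k]
```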
It is difficult to decide how much to focus on each semantic unit when generating text.
The authors propose using the attention mechanism to obtain semantic unit frequency in the inference stage: the attention weights recorded during generation are assigned to the semantic units, and these weights are treated as the semantic unit frequency, which tells the model how much to focus on each semantic unit when generating text.
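A rough sketch of how recorded attention weights could be aggregated into per-SU weights is given below. The tensor shape, the mean-over-steps aggregation, and the final normalization are assumptions, not the paper's exact procedure.

```python
import numpy as np

def su_frequency_from_attention(attn, su_token_spans):
    """Turn recorded attention into per-semantic-unit weights.

    attn: (num_decoding_steps, num_source_tokens) attention weights
          recorded while the model generates text.
    su_token_spans: list of (start, end) source-token index ranges, one per SU.
    """
    token_weight = attn.mean(axis=0)                  # average over decoding steps
    raw = np.array([token_weight[s:e].sum() for s, e in su_token_spans])
    total = raw.sum()
    return raw / total if total > 0 else raw          # normalized "SU frequency"
```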