Text Summarization Techniques
A brief overview of different extractive and abstractive approaches
Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks). The hierarchical chart below shows the different types of text summarization.
As the chart above shows, summarizers fall into two categories based on the type of output they produce.
- Extractive: This technique scores the phrases or sentences in a document and returns only the most informative blocks of text.
- Abstractive: This method generates new text that does not appear in that form in the source document. To date, no abstractive technique works reliably well on long documents. The best performers are usually neural networks: they generate a sentence from a single paragraph, or cut the length of a sentence roughly in half while preserving as much information as possible, and grammatical quality often suffers.
Extractive summarizer
To implement an extractive summarizer, we first need to import the necessary libraries. Several algorithms are packaged together in the Sumy and spaCy libraries for Python. I will describe some of them below.
- LexRank Summarizer: This is an unsupervised approach inspired by Google’s PageRank algorithm. It builds a graph over the sentences of a document, using a connectivity matrix based on the cosine similarity between sentences, and ranks the sentences by their centrality in that graph so the most representative ones are selected. Python also has a standalone lexrank library: after importing its LexRank and STOPWORDS components, we can build a summarizer over a background corpus and generate a summary (a sketch using this package follows the list below).
- TextRank Summarizer: This algorithm is also inspired by the PageRank concept. It is simpler than LexRank and adds a post-processing step that removes highly redundant (near-duplicate) sentences.
- Luhn Summarizer: Published in 1958 by IBM researcher Hans Peter Luhn, this algorithm first determines which words are significant to the meaning of the document: it finds the most frequent words and keeps the subset that is frequent but not so common as to be uninformative. It then scores sentences by looking at the window of unimportant words between words of high importance, and sentences occurring near the beginning of a document get a higher weight. To implement this technique, we can use the Sumy library in Python.
- LSA (Latent Semantic Analysis) Summarizer: This algorithm combines term frequency with singular value decomposition (SVD). In Python (using scikit-learn) it works as follows: convert the document into a vectorized bag of words with CountVectorizer; fit and transform truncated singular value decomposition on that bag of words with TruncatedSVD, which encodes the original data into topics; determine how strongly each sentence contributes to each topic; and extract the top sentences per topic as the final summary.
- KL (Kullback-Leibler) Summarizer: This greedy method adds sentences to the summary as long as the KL divergence decreases, i.e., it minimizes the divergence between the summary's vocabulary distribution and the input's vocabulary distribution. To implement it in Python, we can use the Sumy library. After loading the AbstractSummarizer base class from Sumy and the stop-word list from NLTK, the procedure is: normalize all the words and remove the stop words, compute the word frequencies, calculate the KL divergence for each candidate sentence, pick the sentence with the minimum KL divergence, and repeat to generate the summary. A combined Sumy sketch covering these summarizers appears right after this list.
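As a minimal illustration of the LexRank bullet above, here is a sketch using the standalone lexrank package. The tiny background corpus, the sentences to summarize, and the summary size of 2 are purely illustrative assumptions; the package exposes LexRank and STOPWORDS as shown in its documentation.

```python
# Sketch of the standalone lexrank package (pip install lexrank).
# The background corpus and sentences below are toy data for illustration only.
from lexrank import LexRank, STOPWORDS

# Background documents are used to estimate word statistics (IDF values).
documents = [
    ["The economy grew faster than expected this quarter.",
     "Analysts credited strong consumer spending."],
    ["The central bank left interest rates unchanged.",
     "Officials cited stable inflation figures."],
]

# Sentences of the document we actually want to summarize.
sentences = [
    "The economy grew faster than expected this quarter.",
    "Analysts credited strong consumer spending.",
    "Officials cited stable inflation figures.",
]

lxr = LexRank(documents, stopwords=STOPWORDS["en"])
summary = lxr.get_summary(sentences, summary_size=2, threshold=0.1)
print(summary)
```

The Sumy implementations of the algorithms above all share the same interface, which makes them easy to compare side by side. The sketch below is a minimal example under a few assumptions: Sumy is installed, the NLTK punkt tokenizer data (which Sumy's English tokenizer relies on) has been downloaded, and the sample text and two-sentence summary length are arbitrary choices.

```python
# Minimal Sumy sketch comparing the extractive summarizers described above.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.kl import KLSummarizer

TEXT = (
    "Text summarization is the process of distilling the most important "
    "information from a source to produce an abridged version. Extractive "
    "methods score and select existing sentences. Abstractive methods "
    "generate new sentences that do not appear in the source. Neural "
    "networks are the best performers for abstractive summarization."
)

# Parse the raw text into a document of tokenized sentences.
parser = PlaintextParser.from_string(TEXT, Tokenizer("english"))

summarizers = {
    "LexRank": LexRankSummarizer(),
    "TextRank": TextRankSummarizer(),
    "Luhn": LuhnSummarizer(),
    "LSA": LsaSummarizer(),
    "KL": KLSummarizer(),
}

for name, summarizer in summarizers.items():
    print(name)
    # Each summarizer returns the top-scoring sentences of the document.
    for sentence in summarizer(parser.document, 2):
        print(" ", sentence)
```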
In the following paragraphs, I will discuss the implementation of a summarizer in detail. After selecting the summarizer, we need to import the necessary libraries. Let's suppose we want to build the summarizer with spaCy and its en_core_web_sm model. We then follow this pseudocode.
- Text Cleaning
- Word Tokenization
- Word-Frequency Table
- Sentence Tokenization
- Summarization
To get started, we need the stop-word list and the punctuation list. The former can be imported from NLTK, or from spaCy via spacy.lang.en.stop_words.STOP_WORDS, and it gives us stop words such as again, first, above, across, all, almost, alone, and so on; the latter comes from Python's string module and gives us punctuation characters (note that the newline character \n is not in it and must be added). Next, we create the NLP model, and then text cleaning comes into play: this step removes the punctuation and stop words from our text. Computing the normalized frequency of each word is the next step, which we achieve by building a word-frequency table from the tokenized words of the whole text. Sentence tokenization then lets us score each sentence based on the words it contains. Finally, we build the summary; a sketch implementing these steps follows the numbered list below. The procedure is as follows.
1- Create an NLP model
2- Pass the text into the model
3- Tokenize the text. As expected, the tokens include punctuation and stop words, so we need to remove them
4- Add the newline character (\n) to the punctuation list (the original list does not include it)
5- Count the occurrences of each word in the whole text, excluding punctuation and stop words
6- To normalize the frequency of each word, divide every frequency value by the maximum number of occurrences
7- Tokenize the sentences
8- Calculate the score of each sentence by adding up the normalized frequencies of the words appearing in that sentence
9- Decide the summary size by taking a fixed percentage of the total number of sentences in the text
10- Find the sentences with the highest scores
11- To get the summary, join all those sentences together using the join function
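Putting the eleven steps together, here is a minimal sketch of the spaCy-based extractive summarizer. The function name, the 30 percent summary ratio, and the sample call are illustrative assumptions, and it presumes the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm).

```python
# Sketch of an extractive summarizer with spaCy, following the numbered steps above.
import spacy
from string import punctuation
from heapq import nlargest
from spacy.lang.en.stop_words import STOP_WORDS


def summarize(text, ratio=0.3):
    nlp = spacy.load("en_core_web_sm")   # step 1: create the NLP model
    doc = nlp(text)                      # steps 2 and 3: pass the text in and tokenize it

    punct = punctuation + "\n"           # step 4: add the newline character

    # step 5: count occurrences of every token that is not a stop word or punctuation
    word_freq = {}
    for token in doc:
        word = token.text.lower()
        if word not in STOP_WORDS and word not in punct:
            word_freq[word] = word_freq.get(word, 0) + 1

    # step 6: normalize frequencies by the most frequent word
    max_freq = max(word_freq.values())
    for word in word_freq:
        word_freq[word] /= max_freq

    # steps 7 and 8: score each sentence by summing the normalized frequencies of its words
    sent_scores = {}
    for sent in doc.sents:
        for token in sent:
            word = token.text.lower()
            if word in word_freq:
                sent_scores[sent] = sent_scores.get(sent, 0) + word_freq[word]

    # step 9: summary size as a fraction of the total number of sentences
    select_len = max(1, int(len(list(doc.sents)) * ratio))

    # steps 10 and 11: pick the highest-scoring sentences and join them
    best = nlargest(select_len, sent_scores, key=sent_scores.get)
    best = sorted(best, key=lambda s: s.start)   # keep the original sentence order
    return " ".join(sent.text for sent in best)


print(summarize("Put a long article here and the function returns its extractive summary. "
                "It keeps roughly a third of the sentences, favouring the highest-scoring ones."))
```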
Abstractive summarizer
Several algorithms have been proposed to implement the abstractive method. BERT (Bidirectional Encoder Representations from Transformers), recently proposed by researchers at Google AI Language, achieves strong results on a wide variety of NLP tasks, including question answering (SQuAD v1.1) and natural language inference (MNLI). Its key technical innovation is bidirectional training: single-direction language models read the text sequentially, either left to right or right to left, whereas BERT reads the whole sequence in both directions. In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models. The underlying Transformer architecture includes two separate mechanisms: an encoder, which reads the entire sequence of words at once, and a decoder, which produces a prediction for the task. This attribute of the encoder enables the model to learn the context of a word based on all of its surroundings. In other words, we pass in vectors as inputs and expect vectors as outputs. To create those vectors, we follow two ideas.
- Continuous bag of words: two words are similar if they both appear in the same context (the surrounding words)
- Skip-gram: two words are similar if they generate the same context
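Both ideas are what the word2vec family of models implements. Below is a toy sketch contrasting them with gensim (assuming gensim 4.x); the tiny corpus and the hyperparameters are illustrative assumptions, and real embeddings need far more training text.

```python
# Toy word2vec sketch: CBOW vs. skip-gram (assumes gensim 4.x).
from gensim.models import Word2Vec

# A tiny tokenized corpus purely for illustration.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "mouse"],
]

# sg=0 selects continuous bag of words: predict a word from its surrounding context.
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# sg=1 selects skip-gram: predict the surrounding context from a word.
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["cat"][:5])                # first few dimensions of the embedding for "cat"
print(skipgram.wv.most_similar("cat"))   # nearest neighbours in the skip-gram space
```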
The vectors of similar words end up very close together, while those of dissimilar words end up far apart. This mapping of words to vectors is called a word embedding, and it lets us apply numerical operations to all kinds of text. Combining word embeddings with RNNs/LSTMs lets us build neural networks over textual input datasets. The procedure is as follows.
- Map the sequence of words to a sequence of vectors (the word2vec approach)
- Set up an autoencoder structure to capture the meaning of the passage
- Train two separate RNNs or LSTMs, an encoder and a decoder; the encoder compresses the input sequence into a single matrix/vector
- Run the decoder to turn that matrix/vector into a transformed sequence of word vectors
- Convert the sequence of vectors (output of decoder) back into words using the word embedding
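To make the procedure concrete, here is a minimal sketch of the encoder-decoder structure it describes, written with Keras LSTMs. The vocabulary size, embedding dimension, and latent dimension are illustrative assumptions, and the data preparation (tokenizing the text and building the shifted decoder targets for teacher forcing) is omitted.

```python
# Sketch of an LSTM encoder-decoder for sequence-to-sequence summarization.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

VOCAB_SIZE = 10000   # size of the word-index vocabulary (assumption)
EMBED_DIM = 128      # dimensionality of the word embeddings (assumption)
LATENT_DIM = 256     # size of the encoder's internal state (assumption)

# Encoder: reads the input word sequence and compresses it into a single state vector.
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(VOCAB_SIZE, EMBED_DIM)(encoder_inputs)
_, state_h, state_c = LSTM(LATENT_DIM, return_state=True)(enc_emb)

# Decoder: generates the summary word by word, conditioned on the encoder state.
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_inputs)
dec_outputs, _, _ = LSTM(LATENT_DIM, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = Dense(VOCAB_SIZE, activation="softmax")(dec_outputs)

# Train on (source sequence, shifted summary sequence) pairs with teacher forcing.
model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

In practice, a pre-trained model such as BERT would be used to initialize or replace the encoder, as discussed above, rather than training the representations from scratch.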
Thank you for reading this article. It focused on briefly explaining the most common extractive and abstractive summarizers.
I am Sina Shariati, a data-savvy business analyst from San Francisco. Feel free to leave any ideas, comments, or concerns here or on my LinkedIn page.