How to do Text Summarization in Python?

Thejas Kiran
3 min read · Jul 1, 2021


Text summarization is the process of condensing a long text document into short, precise content. The shortened content must not miss any of the important information in the original document. This can become a tedious task for humans when the source content is very large, so many algorithms and techniques have been developed to generate summaries automatically.

Let us start with the basic outline of the program: each word is assigned a weight based on its frequency, the average weight of the words in each sentence is compared against a threshold value, and if the average is greater than the threshold, that sentence is included in the summary.
We first need to import all the required libraries before starting the actual code.
1. wikipedia - This library is used to obtain the long text document that needs to be summarized from a Wikipedia page.
2. nltk - This library is used for handling the textual data received from the Wikipedia page. 'nltk' stands for Natural Language Toolkit.
3. collections - The Counter class from the collections library is used to count the number of times each word appears in the document. Although this can be done with dictionaries and loops, the library makes the task easier.

import wikipedia
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter

We first extract the original text content from the Wikipedia page using the wikipedia library. We print the content just to make sure it was retrieved in full.

wikipage = wikipedia.page('Artificial_intelligence').content
print(wikipage)

The code snippet below breaks the whole content into sentences and then breaks the sentences into words. 'punkt' is an unsupervised sentence-boundary model that NLTK uses for tokenization.

nltk.download('punkt')
wiki_sent = sent_tokenize(wikipage)
print(wiki_sent)
wiki_words = []
for sent in wiki_sent:
    wiki_words.extend(word_tokenize(sent))
print(wiki_words)

Since we assign weights to words based on frequency, we need to remove stopwords such as 'a', 'the', and 'it': they appear very often but carry little information about the topic. We also filter out punctuation tokens.

import string

nltk.download('stopwords')
stop_words = stopwords.words('english')
wiki_words = [w for w in wiki_words
              if w.lower() not in stop_words and w not in string.punctuation]

We now need to find the number of times each word is repeated in the whole document.

wiki_words_count = Counter(wiki_words)
print(wiki_words_count)

Now we find the weight of each word by dividing its frequency by the maximum frequency among all words. Hence, the weights associated with the words range from 0 to 1.

max_frequency = max(wiki_words_count.values())
for word in wiki_words_count:
    wiki_words_count[word] = wiki_words_count[word] / max_frequency
print(wiki_words_count)

We now find the score of each sentence and store the results in a dictionary where the key is the sentence and the value is its score.

sentence_scores = {}
for sent in wiki_sent:
    scores = 0
    count = 0
    for word in word_tokenize(sent):
        if word in wiki_words_count:
            scores += wiki_words_count[word]
            count += 1
    if count > 0:
        sentence_scores[sent] = scores / count

After this, we just need to set an optimal threshold value and then select the sentences if their respective scores are more than the optimal value.
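As a minimal sketch of this final step, the snippet below uses the mean sentence score as the threshold. The mean is just one reasonable, illustrative choice (the article leaves the threshold open), and the small sentence_scores dictionary here is made-up sample data standing in for the real scores computed above.

```python
# Illustrative sample scores -- in the real program these come from
# the sentence_scores dictionary built in the previous step.
sentence_scores = {
    "AI is the study of intelligent agents.": 0.9,
    "The field was founded in 1956.": 0.4,
    "Machine learning is a subfield of AI.": 0.7,
}

# One simple threshold choice: the mean of all sentence scores.
threshold = sum(sentence_scores.values()) / len(sentence_scores)

# Keep sentences in their original order so the summary reads naturally
# (dicts preserve insertion order in Python 3.7+).
summary = ' '.join(sent for sent, score in sentence_scores.items()
                   if score > threshold)
print(summary)
```

Raising the threshold (e.g. mean plus a fraction of the standard deviation) makes the summary shorter and more selective; lowering it keeps more sentences.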

Hope this writing was helpful. You can find the whole text summarization code here. Keep learning :)
