Text summarization is the task of creating a short, accurate, and fluent summary of an article.
A popular and free dataset for use in text summarization experiments with deep learning methods is the CNN News story dataset.
In this tutorial, you will discover how to prepare the CNN News Dataset for text summarization.
After completing this tutorial, you will know:
- About the CNN News dataset and how to download the story data to your workstation.
- How to load the dataset and split each article into story text and highlights.
- How to clean the dataset ready for modeling and save the cleaned data to file for later use.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
Tutorial Overview
This tutorial is divided into 5 parts; they are:
- CNN News Story Dataset
- Inspect the Dataset
- Load Data
- Data Cleaning
- Save Clean Data
CNN News Story Dataset
The DeepMind Q&A Dataset is a large collection of news articles from CNN and the Daily Mail with associated questions.
The dataset was developed as a question and answering task for deep learning and was presented in the 2015 paper “Teaching Machines to Read and Comprehend.”
This dataset has been used in text summarization where sentences from the news articles are summarized. Notable examples are the papers:
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016.
- Get To The Point: Summarization with Pointer-Generator Networks, 2017.
Kyunghyun Cho, an academic at New York University, has made the dataset available for download.
In this tutorial, we will work with the CNN dataset, specifically the download of the ASCII text of the news stories available here:
- cnn_stories.tgz (151 Megabytes)
This dataset contains more than 93,000 news articles where each article is stored in a single “.story” file.
Download the dataset to your workstation. Once downloaded, you can unzip the archive on the command line as follows:
tar xvf cnn_stories.tgz
This will create a cnn/stories/ directory filled with .story files.
For example, we can change into the cnn/stories/ directory and count the number of story files on the command line as follows:
ls -ltr | wc -l
This shows a total of 92,580 lines of output; the first line is the total header printed by ls -l, leaving 92,579 story files.
92580
Inspect the Dataset
Using a text editor, review some of the stories and note down some ideas for preparing this data.
For example, below is one of the stories, with the body truncated for brevity.
(CNN) -- If you travel by plane and arriving on time makes a difference, try to book on Hawaiian Airlines. In 2012, passengers got where they needed to go without delay on the carrier more than nine times out of 10, according to a study released on Monday.

In fact, Hawaiian got even better from 2011, when it had a 92.8% on-time performance. Last year, it improved to 93.4%.

[...]

@highlight

Hawaiian Airlines again lands at No. 1 in on-time performance

@highlight

The Airline Quality Rankings Report looks at the 14 largest U.S. airlines

@highlight

ExpressJet and American Airlines had the worst on-time performance

@highlight

Virgin America had the best baggage handling; Southwest had lowest complaint rate
I note that the general structure of the dataset is to have the story text followed by a number of “highlight” points.
Reviewing articles on the CNN website, I can see that this pattern is still common.
The ASCII text does not include the article titles, but we can use these human-written “highlights” as multiple reference summaries for each news article.
I can also see that many articles start with source information, presumably the CNN office that produced the story; for example:
(CNN) --
Gaza City (CNN) --
Los Angeles (CNN) --
These can be removed completely.
Data cleaning is a challenging problem and must be tailored for the specific application of the system.
If we are generally interested in developing a news article summarization system, then we may clean the text in order to simplify the learning problem by reducing the size of the vocabulary.
Some data cleaning ideas for this data include:
- Normalize case to lowercase (e.g. “An Italian”).
- Remove punctuation (e.g. “on-time”).
We could also further reduce the vocabulary to speed up testing models, such as the following (a rough sketch of these ideas appears after the list):
- Remove numbers (e.g. “93.4%”).
- Remove low-frequency words like names (e.g. “Tom Watkins”).
- Truncate stories to the first 5 or 10 sentences.
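Below is a minimal sketch of what the last two ideas might look like once the stories have been cleaned into lists of lines; the min_count threshold, the 10-sentence cut-off, and the helper names are arbitrary choices for illustration, not part of the tutorial's pipeline.

# truncate a story to its first n lines (nominally sentences)
def truncate_story(lines, n_lines=10):
    return lines[:n_lines]

# keep only words that appear at least min_count times in the given counts
def filter_rare_words(lines, counts, min_count=5):
    return [' '.join(w for w in line.split() if counts[w] >= min_count) for line in lines]

# example usage, assuming 'stories' is the list of cleaned examples built later in this tutorial
# from collections import Counter
# counts = Counter(w for ex in stories for line in ex['story'] for w in line.split())
# for ex in stories:
#     ex['story'] = filter_rare_words(truncate_story(ex['story']), counts)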
Load Data
The first step is to load the data.
We can start by writing a function to load a single document given a filename. The data has some unicode characters, so we will load the dataset by forcing the encoding to be UTF-8.
The function below named load_doc() will load a single document as text given a filename.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
Next, we need to step over each filename in the stories directory and load them.
We can use the listdir() function to load all filenames in the directory, then load each one in turn. The function below named load_stories() implements this behavior and provides a starting point for preparing the loaded documents.
# load all stories in a directory
def load_stories(directory):
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
Each document can be separated into the news story text and the highlights or summary text.
The split point between the two is the first occurrence of the ‘@highlight‘ token. Once split, we can organize the highlights into a list.
The function below named split_story() implements this behavior and splits a given loaded document text into a story and list of highlights.
# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights
We can now update the load_stories() function to call the split_story() function for each loaded document and then store the results in a list.
# load all stories in a directory
def load_stories(directory):
    all_stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        all_stories.append({'story':story, 'highlights':highlights})
    return all_stories
Tying all of this together, the complete example of loading the entire dataset is listed below.
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# load all stories in a directory
def load_stories(directory):
    stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        stories.append({'story':story, 'highlights':highlights})
    return stories

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))
Running the example prints the number of loaded stories.
Loaded Stories 92579
We can now access the loaded story and highlight data, for example:
print(stories[4]['story'])
print(stories[4]['highlights'])
Data Cleaning
Now that we can load the story data, we can pre-process the text by cleaning it.
We can process the stories line by line and use the same cleaning operations on each highlight line.
For a given line, we will perform the following operations:
Remove the CNN office information.
# strip source cnn office if it exists
index = line.find('(CNN) -- ')
if index > -1:
    line = line[index+len('(CNN)'):]
Split the line into tokens on white space:
# tokenize on white space
line = line.split()
Normalize the case to lowercase.
# convert to lower case
line = [word.lower() for word in line]
Remove all punctuation characters from each token (Python 3 specific).
# prepare a translation table to remove punctuation
table = str.maketrans('', '', string.punctuation)
# remove punctuation from each token
line = [w.translate(table) for w in line]
Remove any words that have non-alphabetic characters.
# remove tokens with numbers in them
line = [word for word in line if word.isalpha()]
Putting this all together, below is a new function named clean_lines() that takes a list of lines of text and returns a list of clean lines of text.
# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned
We can call this function for a story by first splitting the story text into lines. The function can be called directly on the list of highlights.
example['story'] = clean_lines(example['story'].split('\n'))
example['highlights'] = clean_lines(example['highlights'])
The complete example of loading and cleaning the dataset is listed below.
from os import listdir
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# load all stories in a directory
def load_stories(directory):
    stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        stories.append({'story':story, 'highlights':highlights})
    return stories

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))

# clean stories
for example in stories:
    example['story'] = clean_lines(example['story'].split('\n'))
    example['highlights'] = clean_lines(example['highlights'])
Note that the story is now stored as a list of clean lines, nominally separated by sentences.
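As a quick check, we can print the first few cleaned lines and the highlights of one example; this assumes the cleaning loop above has already been run.

# inspect one cleaned example
print(stories[4]['story'][:3])
print(stories[4]['highlights'])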
Save Clean Data
Finally, now that the data has been cleaned, we can save it to file.
An easy way to save the cleaned data is to Pickle the list of stories and highlights.
For example:
# save to file
from pickle import dump
dump(stories, open('cnn_dataset.pkl', 'wb'))
This will create a new file named cnn_dataset.pkl with all of the cleaned data. This file will be about 374 Megabytes in size.
We can then load it later and use it with a text summarization model as follows:
# load from file
from pickle import load
stories = load(open('cnn_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- DeepMind Q&A Dataset
- Teaching Machines to Read and Comprehend, 2015.
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016.
- Get To The Point: Summarization with Pointer-Generator Networks, 2017.
Summary
In this tutorial, you discovered how to prepare the CNN News Dataset for text summarization.
Specifically, you learned:
- About the CNN News dataset and how to download the story data to your workstation.
- How to load the dataset and split each article into story text and highlights.
- How to clean the dataset ready for modeling and save the cleaned data to file for later use.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
While executing the “Load Data” code fragment on a Windows system, I get the following error: “‘utf-8’ codec can’t decode byte 0xc0 in position 12: invalid start byte”.
Interesting, perhaps try changing the encoding, try ‘ascii’?
Or perhaps try loading the file as binary, then converting the text to ASCII?
You can try something like this –
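For example, here is a minimal sketch that opens the file in binary mode and then decodes it, skipping any invalid byte sequences; errors='ignore' is just one option, errors='replace' would also work.

# load doc into memory, tolerating bad bytes
def load_doc(filename):
    # open the file in binary mode
    file = open(filename, 'rb')
    data = file.read()
    file.close()
    # decode to text, dropping any invalid byte sequences
    return data.decode('utf-8', errors='ignore')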
Hi Jason, do you have a tutorial that does the text summarization?
Not at this stage, but I have some ideas for models here:
https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/
And more here:
https://machinelearningmastery.com/encoder-decoder-deep-learning-models-text-summarization/
Hi, thanks for this tutorial. May I know whether any tutorial on building a text summarizer has been released by this time?
For the “Data Cleaning” part, when you look for the source (CNN) in the text you use two different strings (you search for “(CNN) -- ” but strip only the length of “(CNN)”).
They should both be just “(CNN)”, as that is what shows up after loading the stories. Otherwise CNN gets lumped together with the first word of the story.
Yes, that could be an improvement, try it and see. There is a lot of room for improvement for sure!
hello Jason,
Thanks for the post. But I have some questions:
1. Do you think such text summarisation techniques work for other European languages as well?
2. Should our vocabulary storage only include the words that appear in our dataset? What if we include words which have never appeared in our dataset? Can those words be picked up when doing abstractive text summarization even though they’ve never been trained? thanks
I don’t see why not.
It makes sense to only model words you expect to be encountered in the data.
Hello Jason Brownlee,
Can we use the cleaned data as input for the encoder block in a seq2seq model, with word-level tokenization?
Sure, you can design the model however you wish.
Hi, thank you so much for your blogs. They have really helped me a lot in understanding various concepts. I just need one more help: can you please explain the full implementation of text summarization from the paper “Get To The Point: Summarization with Pointer-Generator Networks, 2017”? If you have already done it, then please provide me the link.
Thank you.
Perhaps contact the authors directly and ask them about their paper?
okay thanks!
Hello sir,
So I have been trying to create a text summarizer using abstraction. Also, I am really a novice in this field. I have studied the above article and have been able to clean the data and save it in pickle format. I would like to know where to go from here next.
Thank you!
Yes, I have some suggestions:
https://machinelearningmastery.com/?s=text+summarization&post_type=post&submit=Search
Hey, have you built a Keras model for this dataset?
I have some suggestions here:
https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/
Thank you
I am working on extractive summarization. I am unable to find the gold (human-made) extractive summaries for the CNN/Daily Mail dataset. Can you suggest where to find them?
No sorry, perhaps try a google search?
I am also facing the same problem. Did you find the reference summaries for the articles in CNN/Daily Mail?
Love you
Thanks, I’m happy the tutorial helped!
@JustGlowing was using the NLTK text summarizer a couple of weeks back.
Hi, very useful information, and I just found these details here,
but I have a question.
After reading and splitting the data I see that the number of sentences is much lower than expected.
I mean the average number of sentences in an article here for me is 22 sentences,
and for the summary I have to choose between them.
Am I doing something wrong or is it normal?
The summary is separate from the article.
Perhaps I don’t understand your question?
I mean the length of the articles on average is 22 sentences; is that normal?
No idea, it is better to focus on the data in front of you.
Thank you for your post! I have a question. Are the highlights in one file multiple summaries for each story, or is each highlight a sentence of the summary?
I believe each story has multiple “summaries”.
Thank you for your response. I just feel confused. We aim to get a summary as the model’s output, so our target in the training data should be one summary. Should we divide the “highlights” before training the model?
Model performance would be reported using perplexity or bleu scores.
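For example, a minimal sketch of scoring one generated summary against the reference highlights with NLTK’s sentence-level BLEU; the candidate string here is made up purely for illustration, and example is assumed to be one cleaned story from the tutorial.

from nltk.translate.bleu_score import sentence_bleu

# each highlight is treated as one tokenized reference
references = [h.split() for h in example['highlights']]
# a made-up model output, tokenized on white space
candidate = 'hawaiian airlines lands at no in on time performance'.split()
print(sentence_bleu(references, candidate))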
How long did it take you to save the cleaned data as a pickle file? It is taking a lot of time for me.
It should take seconds.
Ensure you are running from the command line and not a notebook.
Hi Jason, Thank you for your post!
You mentioned that we could reduce the vocabulary to speed up testing models, such as:
Remove numbers (e.g. “93.4%”).
Remove low-frequency words like names (e.g. “Tom Watkins”).
However, I think numbers and names are important information for some types of document; for example, a number may be a profit, an accuracy, or a crime rate, and a name may be “Joe Biden”. So how do we decide whether these need to be removed or not? Thanks!
You’re welcome.
Good question, it is really a question of the goals of your project. Start with a good idea of what your model needs to do, then remove elements that are not critical to that goal. Or perhaps use a little trial and error.
Hi Jason,
If we want to use any of the Algorithms from ‘https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/’ don’t we need to add “BOS” and “EOS” at the beginning and at the end of each story/highlight?
It depends on the model, in some cases yes.
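If your model does need them, here is a minimal sketch of wrapping each highlight with start and end markers; the 'startseq' and 'endseq' token names are arbitrary, and the same idea applies to the story text.

# add start and end of sequence markers to each highlight
for example in stories:
    example['highlights'] = ['startseq ' + h + ' endseq' for h in example['highlights']]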
Hi. I wanted a little help. I want to calculate the ROUGE score between a story and its highlights. How do I do that?
Take a look at this python package: https://pypi.org/project/rouge-score/
Or you can also implement your own score function.
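For example, a minimal sketch using that rouge-score package; it assumes example is one cleaned story from this tutorial and joins the cleaned lines back into plain strings.

from rouge_score import rouge_scorer

# compare the story text against the first highlight
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
story_text = ' '.join(example['story'])
reference = example['highlights'][0]
print(scorer.score(reference, story_text))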
Hi, I am done with cleaning the data as shown above.
Now my task is to compare stories and highlights and score each story sentence as 0 or 1 based on whether the sentence in the story is present in the highlights as well.
Could you suggest how to proceed?
Hi Tanya…Please see my previous response.
https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
Hi, I am done with cleaning the data as shown above.
Now my task is to compare stories and highlights and score each story sentence as 0 or 1 based on whether the sentence in the story is present in the highlights as well.
Could you suggest how to proceed?
Hi sh…the following may help clarify next steps:
https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
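For the sentence-labeling task described above, here is one rough way to sketch it, assuming example is a cleaned story from this tutorial; the word-overlap rule and the 0.5 threshold are arbitrary stand-ins for whatever matching criterion you settle on.

# label each story sentence 1 if most of its words appear in the highlights, else 0
def label_sentences(story_lines, highlights, threshold=0.5):
    highlight_words = set(w for h in highlights for w in h.split())
    labels = list()
    for line in story_lines:
        words = line.split()
        overlap = sum(1 for w in words if w in highlight_words)
        labels.append(1 if words and overlap / len(words) >= threshold else 0)
    return labels

labels = label_sentences(example['story'], example['highlights'])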