Datasets for Natural Language Processing

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that you can download quickly and that do not take too long to fit models on. Further, it is also helpful to use standard datasets that are well understood and widely used so that you can compare your results to see if you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

  1. Text Classification
  2. Language Modeling
  3. Image Captioning
  4. Machine Translation
  5. Question Answering
  6. Speech Recognition
  7. Document Summarization

I have tried to provide a mixture of datasets that are popular in academic papers and that are modest in size.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.


1. Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.

Below are some good beginner text classification datasets.

  • Reuters Newswire Topic Classification (Reuters-21578). A collection of Reuters newswire documents labeled by topic.
  • IMDB Movie Review Sentiment Classification. The Stanford Large Movie Review Dataset: 50,000 movie reviews from IMDB, labeled by sentiment.
  • Movie Review Data (Cornell). Movie reviews labeled by sentiment polarity.
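If you want a feel for one of these datasets before building your own data pipeline, here is a minimal sketch that loads the IMDB reviews with the Keras helper API (the same imdb.load_data call discussed in the comments below). The vocabulary size and sequence length are arbitrary illustrative choices, and a recent TensorFlow/Keras install is assumed.

# Minimal sketch: load the IMDB sentiment dataset via the Keras helper API.
# num_words and maxlen are arbitrary illustrative choices, not recommendations.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep only the 10,000 most frequent words; rarer words map to an "unknown" ID.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

# Pad/truncate each review to a fixed length so it can feed an Embedding layer.
X_train = pad_sequences(X_train, maxlen=500)
X_test = pad_sequences(X_test, maxlen=500)

print(X_train.shape, y_train.shape)

Note that load_data returns reviews already encoded as word IDs; if you work from the raw text instead, you will need to construct the vocabulary and encoding yourself.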

2. Language Modeling

Language modeling involves developing a statistical model for predicting the next word in a sentence or the next letter in a word given whatever has come before. It is a precursor task for tasks like speech recognition and machine translation.

Below are some good beginner language modeling datasets.

  • Project Gutenberg. A large collection of free books that can be retrieved in plain text in a variety of languages.

There are more formal corpora that are well studied; for example:

  • Brown University Standard Corpus of Present-Day American English (the Brown Corpus). A large, carefully balanced sample of American English.
  • Google 1 Billion Word Corpus (the One Billion Word Benchmark). A large corpus for measuring progress in statistical language modeling.
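As a sketch of how you might turn one of these sources into modeling data, the snippet below downloads a single Project Gutenberg book over HTTP and reduces it to lowercase word tokens. The book URL (Alice's Adventures in Wonderland, book #11) and the crude header/footer trimming are illustrative assumptions; check the file you actually download.

# Sketch: fetch one Project Gutenberg book as plain text and tokenize it.
# The URL and the header/footer handling are illustrative assumptions.
import re
from urllib.request import urlopen

url = "https://www.gutenberg.org/files/11/11-0.txt"
raw = urlopen(url).read().decode("utf-8")

# Gutenberg files wrap the book text in "*** START ..." / "*** END ..." markers.
start = raw.find("*** START")
end = raw.find("*** END")
body = raw[start:end] if start != -1 and end != -1 else raw

# Lowercase word tokens are a common starting point for a word-level model.
tokens = re.findall(r"[a-z']+", body.lower())
print(len(tokens), tokens[:10])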


3. Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  • Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  • Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  • Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

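If you grab Flickr 8K, the descriptions ship as a plain-text file of image-id/caption pairs. The sketch below assumes the common Flickr8k.token.txt layout, with one image.jpg#n<TAB>caption entry per line; verify this against your download.

# Sketch: parse Flickr 8K captions into {image_id: [caption, ...]}.
# Assumes the common "Flickr8k.token.txt" format: image.jpg#n<TAB>caption
from collections import defaultdict

captions = defaultdict(list)
with open("Flickr8k.token.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        image_tag, caption = line.split("\t", 1)
        image_id = image_tag.split("#")[0]  # drop the "#0".."#4" caption index
        captions[image_id].append(caption)

print(len(captions), "images")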

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

  • Aligned Hansards of the 36th Parliament of Canada. Pairs of aligned English and French sentences from Canadian parliamentary proceedings.
  • European Parliament Proceedings Parallel Corpus 1996-2011 (Europarl). Pairs of aligned sentences across many European language pairs.

There are a ton of standard datasets used for the annual machine translation challenges; see the shared tasks of the Conference on Machine Translation (WMT) at statmt.org.
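Parallel corpora such as Europarl usually ship as two line-aligned plain-text files, one per language, where line i in one file translates line i in the other. This sketch assumes the French-English file names from the Europarl v7 release; adjust them to whatever you download.

# Sketch: load a line-aligned parallel corpus into (source, target) pairs.
# File names assume the Europarl v7 French-English release.
with open("europarl-v7.fr-en.en", encoding="utf-8") as f:
    english = f.read().strip().split("\n")
with open("europarl-v7.fr-en.fr", encoding="utf-8") as f:
    french = f.read().strip().split("\n")

# Line i in one file is the translation of line i in the other.
pairs = list(zip(english, french))
print(len(pairs), pairs[0])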

5. Question Answering

Question answering is the task where a sentence or sample of text is provided, about which questions are asked that must be answered.

Below are some good beginner question answering datasets.

  • Stanford Question Answering Dataset (SQuAD). Answering questions about Wikipedia articles.
  • DeepMind Question Answering Corpus. Answering questions about news articles from CNN and the Daily Mail.
  • Amazon Question/Answer Data. Answering questions about Amazon products.
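SQuAD is distributed as a single JSON file of articles, paragraphs, and question/answer spans. As a minimal sketch, the snippet below flattens SQuAD v1.1 into (context, question, answer) triples; it assumes the train-v1.1.json file name from the official release (v2.0 adds unanswerable questions and differs slightly).

# Sketch: flatten SQuAD v1.1 into (context, question, answer_text) triples.
import json

with open("train-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

triples = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                triples.append((context, qa["question"], answer["text"]))

print(len(triples), "question-answer triples")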

6. Speech Recognition

Speech recognition is the task of transforming audio of a spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

  • TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but widely used because of its role as a standard benchmark; paired English speech and transcriptions.
  • VoxForge. A project to build an open source database of transcribed speech for many languages.
  • LibriSpeech ASR Corpus. A large collection of English audiobook recordings taken from the LibriVox project, paired with transcriptions.
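Speech datasets pair audio files with transcripts, so the first practical step is usually just inspecting the audio. As a minimal sketch, the standard-library wave module can read a PCM WAV clip; note that LibriSpeech ships FLAC, so you would convert first or use a library such as librosa or soundfile. The file name here is a placeholder.

# Sketch: inspect a PCM WAV file with only the Python standard library.
# "sample.wav" is a placeholder; LibriSpeech ships FLAC, so convert first
# or read it with librosa/soundfile instead.
import wave

with wave.open("sample.wav", "rb") as w:
    rate = w.getframerate()          # samples per second
    frames = w.getnframes()          # total samples per channel
    duration = frames / float(rate)  # clip length in seconds
    print(rate, "Hz,", w.getnchannels(), "channel(s),", round(duration, 2), "s")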

Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.

7. Document Summarization

Document summarization is the task of creating a short, meaningful description of a larger document.

Below are some good beginner document summarization datasets.

  • Legal Case Reports Dataset. A collection of legal cases and their summaries.
  • TIPSTER Text Summarization Evaluation Conference Corpus. A collection of news documents and their summaries.
  • The AQUAINT Corpus of English News Text. Not free, but widely used; a large corpus of news articles.

For more, see the Document Understanding Conference (DUC) tasks.
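Whichever corpus you pick, a useful first point of comparison is the classic "lead" baseline, which simply takes the first few sentences of a document as its summary. Here is a minimal sketch with a naive regex sentence splitter; a real project would use a proper sentence tokenizer such as NLTK's sent_tokenize.

# Sketch: the classic "lead-n" summarization baseline.
# The regex sentence splitter is naive and only for illustration.
import re

def lead_summary(document, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:n_sentences])

doc = ("First sentence of the article. Second sentence with detail. "
       "Third sentence. Fourth sentence that the summary drops.")
print(lead_summary(doc))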

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

  • NLTK Corpora. The corpus readers and sample corpora bundled with the Natural Language Toolkit.
  • Stanford Statistical Natural Language Processing Corpora. An annotated list of corpora maintained at Stanford.
  • Wikipedia's List of datasets for machine-learning research, which includes many text datasets.

Do you know of any other good lists of natural language processing datasets?
Let me know in the comments below.

Summary

In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.

Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.


68 Responses to Datasets for Natural Language Processing

  1. Anthony Rousseau October 8, 2017 at 5:09 pm #

    For ASR, you also have the TEDLIUM corpus, based on TED talks.
    The second release from 2014 is here: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
    (of course, I’m lobbying for myself a bit, but hey!)

  2. Ahmed December 12, 2017 at 7:12 pm #

    What about informal text datasets for text normalization?

    • Jason Brownlee December 13, 2017 at 5:30 am #

      What do you mean, Ahmed?

      • Ahmed December 21, 2017 at 9:02 am #

        Informal text is basically the text used on social media like Twitter, or even in SMS messages. It has informal abbreviations, different spellings for the same words, as well as spelling mistakes.

  3. AB February 4, 2018 at 6:17 pm #

    Hi, all of these seem to be full articles or full documents that have a single label, all stored in folders of their respective labels. What I'm looking for is files where single lines of text each have a label, in the format:

    line1 label
    line2 label
    line3 label

    I've written a preprocessing algorithm that takes in files of exactly this format to feed into the toy logistic regression model I've been building. Do you know of any datasets in this specific file format? Or will I pretty much have to do it myself?

    Thanks.

    • Jason Brownlee February 5, 2018 at 7:44 am #

      You can write code to load any data you wish.

      • pierre March 25, 2019 at 10:49 pm #

        Hello Mr. Jason Brownlee. I am currently working on a sentiment analysis project on Facebook data. My big problem is that the data is not labeled, so I have not been able to apply a machine learning model.
        Do you have a technique for that? Thank you; your help will really be very valuable to me.

        • Jason Brownlee March 26, 2019 at 8:07 am #

          If the data is not labelled, then you can use it to prepare an unsupervised model that might be a useful starting point for a supervised model later.

          Or you can label some of the data?

  4. Riyadh March 20, 2018 at 11:26 pm #

    Please sir, what is the meaning of "What are the right query words to obtain a representative(!) raw dataset"?

    Sincerely

    • Jason Brownlee March 21, 2018 at 6:36 am #

      Sorry, I don’t understand your question, can you rephrase it?

  5. Manuel April 1, 2018 at 5:14 am #

    Hi,

    You are missing entailment datasets.

    Congratulations on the forum. I always find good material here.

  6. s.murugesh April 19, 2018 at 11:35 pm #

    Dear Mr. Jason,

    Thanks for the initiative & congrats.

    I'm currently working on gathering requirements from plain text. Can you please help me find a dataset that contains natural language descriptions of how a piece of software is developed? It may be any application, like banking, a library management system, a course management system, etc.

    Thank you in advance.
    Kind regards, s.murugesh

  7. turistinfo July 28, 2018 at 7:33 am #

    If you are going for best contents like me, only pay a quick
    visit this web site every day as it presents feature contents, thanks – Beth

  8. Lamin Dibba July 29, 2018 at 5:54 am #

    Jason, your posts are always very informative as well as self-explanatory.

  9. Roland Fernandez July 29, 2018 at 9:50 am #

    Great article. I would also add dialog systems as another key NLP task, and the bAbI QA dataset for the QA task: https://research.fb.com/downloads/babi/
    – Roland

  10. Parth Pandya September 4, 2018 at 4:07 pm #

    Is there any dataset available for categorization, e.g., whether a document is related to a meeting, a task update, a required response, etc.?

    • Jason Brownlee September 5, 2018 at 6:28 am #

      You could define this as a predictive modeling problem.

  11. HUI-YING LU September 18, 2018 at 1:17 am #

    I looked at 1) Text Classification and the 2nd link you provided, the IMDB movie review (Large Movie Review Dataset). It looks like there are quite a lot of postings on building RNN models for it, and almost everyone got a good result (accuracy above 85%). However, almost everyone uses the Keras API imdb.load_data to get back training/testing sets already formatted with word IDs (an index into the vocabulary that comes along with imdb.load_data). This means we don't have to do our own data processing, build the vocabulary, the embedding matrix, etc. I wonder whether there is any posting that shows the tricks to build the vocabulary and embedding matrix (including forming the X_train, y_train, X_test, y_test data ready to feed to the embedding layer of Keras). Creating the predictive model does not look very difficult if we have a good embedding matrix and well-formed training/testing data. I tried to form my own training/testing data and use pretrained GloVe, but got poor results. Without filling in those missing pieces, we can't generalize the techniques, since we rely on Keras to provide a good API that already gives us the foundation for a good prediction result.

  12. HUI-YING LU September 18, 2018 at 3:55 am #

    I have another question regarding 1) Text Classification, 2nd link: IMDB Movie Review Sentiment Classification, using the Keras API imdb.load_data: how do we look at the vocabulary built by the Keras imdb API? Since this IMDB review set is a large dataset, is it possible that the vocabulary it builds can be used for other sentiment classification problems, not just this movie review one? (That is, given a large corpus, could we use it to make a general purpose sentiment classifier?)

    • Jason Brownlee September 18, 2018 at 6:21 am #

      I’m not sure you can. I believe it is just a demonstration dataset.

  13. Debayan Chakraborty November 12, 2018 at 5:07 pm #

    Sir, how can I make my own dataset for language translation?

    • Jason Brownlee November 13, 2018 at 5:43 am #

      You must collect examples that you have permission to use or use an existing dataset.

  14. MD December 5, 2018 at 2:49 am #

    Hi Jason, do you have an idea of how I can get a French dataset for sentiment analysis, please?

  15. Ramendra Singla February 3, 2019 at 9:44 pm #

    Hi Jason, I need a dataset to classify English text based on vocabulary quality: good, very good, excellent. Any suggestions?

    • Jason Brownlee February 4, 2019 at 5:46 am #

      No, sorry. Perhaps search for a suitable dataset?

  16. Chris March 10, 2019 at 9:23 pm #

    Hey,

    I am searching for a dataset which contains documents and labels. For both I need text info and network info. For example, the BioASQ dataset has documents and labels, but text info and network info only for the labels; for the documents, only text info.

    Does somebody know of such a dataset?

    Best regards,
    Chris

  17. Marta Schilling March 31, 2019 at 6:00 am #

    Dear Mr. Jason,

    Would you have any dataset about phrasal verbs?

    Your post is just fantastic! Thank you very much for sharing your awesome knowledge.

    God bless,

    Marta

  18. Nart September 9, 2019 at 10:36 pm #

    Common Voice is a Mozilla project for CC0 speech-to-text datasets:
    https://voice.mozilla.org/en

  19. Nisaruddin February 5, 2020 at 7:43 pm #

    Dear Mr. Jason,

    Would you have any dataset about intent extraction and classification?

    Your post is just fantastic! Thank you very much for sharing your awesome knowledge.

    Nisaruddin

  20. Ana Rodríguez February 18, 2020 at 3:41 pm #

    Dear Mr. Jason,

    Would you have any dataset for automatic text summarization of texts in Spanish using deep learning methods? (A large volume of data.)

    Thank you very much for sharing…

    Ana

  21. Anirudh Kumar February 28, 2020 at 11:25 pm #

    Excellent, Jason. I am looking for a text dataset that I can use to train my model for PII (Personally Identifiable Information) classification. Do you know where I can get one?

  22. Moteel February 29, 2020 at 9:25 am #

    https://wiki.korpus.cz/doku.php/en:cnk:uvod
    Czech language corpora. Their tools are just impressive.

  23. Artur Poniedzialek March 16, 2020 at 2:57 pm #

    It is also worth mentioning the Mozilla Voice project that was run in 2018. You can join the project and help to build the biggest voice datasets for many languages. You can find more details here: https://bestin-it.com/help-to-build-common-voice-datasets-with-mozilla/. Try to find your language, and request a new one if you cannot find it. In the Datasets tab you can download packages with audio samples of the selected language.

  24. dodo April 1, 2020 at 12:53 am #

    Good job!

    I am searching for a text classification dataset that has categories like war, sport, economy, etc.

    Does somebody know of such a dataset?

    Thanks!

  25. mutaz July 10, 2020 at 4:53 pm #

    Hello,

    I am working on an SDN conflict project and I want to use a dataset, but I face a problem: I could not get an SDN dataset. The data I am looking for in SDN has to include the following features: priority, action, source IP, destination IP, MAC, and protocol of the flow. I am asking if you can help me with this matter.
    I want to use a machine learning algorithm to detect conflicts in SDN, so I have to have a dataset containing the following features: priority, protocol, action, source IP/address space, MAC address. All the datasets I found through websites and research do not mention the features I am looking for.

    In this case I have to create and form a dataset of conflicting flow entries or flow rules in SDN, like the generation in the "Brew: A Security Policy Analysis Framework for Distributed SDN-Based Cloud Environments" document (100,000 flows) from the Stanford topology. I attached the paper I used as a benchmark for my project.

    The generated flows must include the conflict policy types shown in the table (summary of conflict policy types).

    Best regards

  26. Rafael November 22, 2020 at 10:53 pm #

    Hi! I was looking for NLP datasets, and I found nearly 1000 datasets from Curated NLP Database at https://metatext.io/datasets

  27. Arushi December 16, 2020 at 1:27 pm #

    Hi,

    I was wondering about the differences between datasets for language modeling, masked language modeling, and machine translation. Under language modeling, you have mentioned that "It is a precursor task for tasks like speech recognition and machine translation."
    Does that mean you can pre-train a model on a language modeling objective and fine-tune it using a parallel corpus or something similar? Although I'm not sure how that would work; would it be trained on the target language?

    • Jason Brownlee December 16, 2020 at 1:44 pm #

      Yes, you can train a general language model and reuse and refine it in specific problem domains.

  28. George Benetti December 22, 2020 at 4:40 am #

    https://metatext.io/datasets NLP repository. 1000+ datasets… Their tools are just impressive.

  29. Abdullahi Abba Abdullahi August 8, 2021 at 10:08 pm #

    Hello, Jason Brownlee. Please, I want to work on a low-resource language like my language in Africa, called Hausa. The problem I am having is with datasets or corpora to build a model or an NLP application. Please, how do I go about this task or building a dataset? I will be glad if you can guide me. Thanks.

    • Jason Brownlee August 9, 2021 at 5:56 am #

      The first step is to define your problem, what you want to predict. Then collect your dataset or find someone who has a dataset you can use/license.

  30. maryam January 29, 2022 at 12:17 am #

    Hi,
    I need a raw dataset for my project. My project is about research paper clustering. Where can I find one?
