Datasets for Natural Language Processing

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that you can download quickly and that do not take too long to fit models to. Further, it is also helpful to use standard datasets that are well understood and widely used so that you can compare your results to see if you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

  1. Text Classification
  2. Language Modeling
  3. Image Captioning
  4. Machine Translation
  5. Question Answering
  6. Speech Recognition
  7. Document Summarization

I have tried to provide a mixture of datasets that are popular in academic papers and modest in size.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Let’s get started.

Datasets for Natural Language Processing
Photo by Grant, some rights reserved.

1. Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.

Below are some good beginner text classification datasets.
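
One good beginner dataset for this task is the IMDB movie review dataset (the Large Movie Review Dataset, also discussed in the comments below), which can be loaded directly via the Keras API. A minimal sketch, assuming Keras is installed:

    # load the IMDB movie review sentiment dataset via the Keras API
    from keras.datasets import imdb

    # keep only the 10,000 most frequent words; rarer words are dropped
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

    print('Training reviews: %d' % len(X_train))
    print('First review (word indexes): %s' % X_train[0][:10])
    print('First label (0=negative, 1=positive): %d' % y_train[0])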

2. Language Modeling

Language modeling involves developing a statistical model for predicting the next word in a sentence, or the next letter in a word, given whatever has come before. It is a precursor task for problems like speech recognition and machine translation.

Below are some good beginner language modeling datasets.

  • Project Gutenberg. A large collection of free books that can be retrieved in plain text for a variety of languages.
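
As a quick start, here is a minimal sketch of turning a book into word-level language modeling data. It assumes you have already downloaded a Project Gutenberg book in plain text as book.txt (the filename is hypothetical):

    # prepare (context, next word) pairs for word-level language modeling
    import re

    # load and lightly normalize the raw text of the book
    text = open('book.txt', 'r', encoding='utf-8').read().lower()
    words = re.findall(r"[a-z']+", text)

    # each example is a fixed-length context and the word that follows it
    context_length = 5
    pairs = [(words[i:i + context_length], words[i + context_length])
             for i in range(len(words) - context_length)]

    print('Total words: %d' % len(words))
    print('Example pair:', pairs[0])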

There are also more formal corpora that are well studied.

3. Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  • Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  • Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  • Flickr 30K. A collection of 30 thousand described images taken from flickr.com.
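
Each image in these datasets comes with several reference captions. A minimal sketch of grouping captions by image, assuming the Flickr8K annotations file Flickr8k.token.txt, where each line has the form <image name>#<caption number><TAB><caption> (check your copy of the data for the exact layout):

    # group the Flickr8K captions by image identifier
    from collections import defaultdict

    captions = defaultdict(list)
    for line in open('Flickr8k.token.txt', 'r', encoding='utf-8'):
        if '\t' not in line:
            continue  # skip blank or malformed lines
        image_id, caption = line.strip().split('\t', 1)
        image_id = image_id.split('#')[0]  # drop the '#0'..'#4' caption index
        captions[image_id].append(caption)

    print('Images with captions: %d' % len(captions))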

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.
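
Translation datasets are often distributed as parallel text. A minimal sketch of loading a corpus stored as tab-separated sentence pairs, one pair per line (the filename fra-eng.txt is an assumption; adapt it to your data):

    # load a parallel corpus of tab-separated sentence pairs
    pairs = []
    for line in open('fra-eng.txt', 'r', encoding='utf-8'):
        parts = line.strip().split('\t')
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))  # (source, target)

    print('Loaded %d sentence pairs' % len(pairs))
    print('Example:', pairs[0])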

There are also a ton of standard datasets used for the annual machine translation challenges.

5. Question Answering

Question answering is the task of answering questions posed about a provided sentence or sample of text.

Below are some good beginner question answering datasets.
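
Many question answering datasets are distributed as JSON. A minimal sketch of extracting (context, question, answer) triples from a SQuAD-style file; the filename train-v1.1.json and the schema are assumptions, so check the layout of your copy:

    # flatten a SQuAD-style JSON file into (context, question, answer) triples
    import json

    dataset = json.load(open('train-v1.1.json', 'r', encoding='utf-8'))

    triples = []
    for article in dataset['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                answer = qa['answers'][0]['text'] if qa['answers'] else ''
                triples.append((context, qa['question'], answer))

    print('Loaded %d (context, question, answer) triples' % len(triples))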

6. Speech Recognition

Speech recognition is the task of transforming audio of spoken language into human-readable text.

Below are some good beginner speech recognition datasets.
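
Speech datasets typically pair audio clips with text transcripts. A minimal sketch of walking such a pairing, assuming a wav/ folder of audio files and a transcripts.txt file of <file id><TAB><transcript> lines (both names are hypothetical):

    # pair audio clips with their text transcripts
    import os
    from scipy.io import wavfile

    transcripts = {}
    for line in open('transcripts.txt', 'r', encoding='utf-8'):
        file_id, text = line.strip().split('\t', 1)
        transcripts[file_id] = text

    # inspect the first few (audio, transcript) pairs
    for file_id, text in list(transcripts.items())[:3]:
        rate, audio = wavfile.read(os.path.join('wav', file_id + '.wav'))
        print('%s: %d samples at %d Hz -> "%s"' % (file_id, len(audio), rate, text))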

Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.

7. Document Summarization

Document summarization is the task of creating a short, meaningful description of a larger document.

Below are some good beginner document summarization datasets.
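
Summarization datasets pair a document with one or more summary sentences. A minimal sketch of splitting a CNN/DailyMail-style .story file into the article body and its highlight sentences; the @highlight marker convention is an assumption about your copy of the data, so adjust if your files differ:

    # split a story file into the article body and its highlight sentences
    def load_story(path):
        text = open(path, 'r', encoding='utf-8').read()
        parts = text.split('@highlight')
        article = parts[0].strip()
        highlights = [p.strip() for p in parts[1:] if p.strip()]
        return article, highlights

    article, highlights = load_story('example.story')
    print('Article length: %d characters' % len(article))
    print('Summary sentences: %s' % highlights)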

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

Do you know of any other good lists of natural language processing datasets?
Let me know in the comments below.

Summary

In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.

Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.



28 Responses to Datasets for Natural Language Processing

  1. Anthony Rousseau October 8, 2017 at 5:09 pm #

    For ASR, you also have the TEDLIUM corpus, based on TED talks.
    The second release from 2014 is here: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
    (of course, I’m lobbying for myself a bit, but hey!)

  2. Ahmed December 12, 2017 at 7:12 pm #

    What about informal text datasets for text normalization?

    • Jason Brownlee December 13, 2017 at 5:30 am #

      What do you mean Ahmed?

      • Ahmed December 21, 2017 at 9:02 am #

        Informal text is basically text used in social media like Twitter or even SMS. It has informal abbreviations, different spellings for the same words, as well as spelling mistakes.

  3. AB February 4, 2018 at 6:17 pm #

    Hi, all of these seem to be full articles or full documents that have a single label, all stored in folders of their respective labels. What I’m looking for is files where single lines of text each have a label, in the file format:

    line1 label
    line2 label
    line3 label

    I ask because I’ve written a preprocessing algorithm that takes in files of exactly this format to feed into the toy logreg model I’ve been building. Do you know of any datasets in this specific file format, or will I pretty much have to build it myself?

    Thanks.

  4. Riyadh March 20, 2018 at 11:26 pm #

    Please sir, what is meant by “What are the right query words to obtain a representative(!) raw dataset”?

    Sincerely

    • Jason Brownlee March 21, 2018 at 6:36 am #

      Sorry, I don’t understand your question, can you rephrase it?

  5. Manuel April 1, 2018 at 5:14 am #

    Hi,

    You are missing entailment datasets.

    Congratulations on the forum. I always find good material here.

  6. s.murugesh April 19, 2018 at 11:35 pm #

    Dear Mr. Jason,

    Thanks for the initiative & congrats.

    I’m currently working on gathering requirements from plain text. Can you please help me find a dataset that contains natural language descriptions of how a piece of software is developed? It may be any application, like banking, a library management system, a course management system, etc.

    thank you in advance
    kind regards s.murugesh

  7. Lamin Dibba July 29, 2018 at 5:54 am #

    Jason, your posts are always very informative as well as self-explanatory.

  8. Roland Fernandez July 29, 2018 at 9:50 am #

    Great article. I would also add Dialog Systems as another key NLP task, and the bAbI QA dataset for the QA task: https://research.fb.com/downloads/babi/
    – Roland

  9. Parth Pandya September 4, 2018 at 4:07 pm #

    Is there any dataset available for categorizing documents, e.g. as related to a meeting, a task update, requiring a response, etc.?

    • Jason Brownlee September 5, 2018 at 6:28 am #

      You could define this as a predictive modeling problem.

  10. HUI-YING LU September 18, 2018 at 1:17 am #

    I looked at 1) Text Classification and the 2nd link you provided, the IMDB movie review (Large Movie Review Dataset). It looks like there are quite a lot of postings for RNN model builds, and almost everyone got a good result (accuracy above 85%). However, almost everyone uses the Keras API imdb.load_data to get back training/testing sets already formatted with word IDs (an index into the vocabulary that comes along with imdb.load_data). This means we don’t have to do our own data processing, build the vocabulary and embedding matrix, etc.

    I wonder, is there any posting that shows the tricks to build the vocabulary and embedding matrix (including forming the X_train, y_train, X_test, y_test data ready to feed to the embedding layer of Keras)? Creating the predictive model looks not very difficult if we have a good embedding matrix and well-formed training/testing data. I tried to form my own training/testing data and use pretrained GloVe, but got a poor result. Without filling in those missing pieces, we can’t generalize the techniques, since we rely on Keras to provide a good API which already gives us the foundation for a good prediction result.

  11. HUI-YING LU September 18, 2018 at 3:55 am #

    I have another question regarding 1) Text Classification, 2nd link (IMDB Movie Review Sentiment Classification) and the Keras API imdb.load_data: how do we look at the vocabulary built by the Keras imdb API? If this IMDB review set is a large dataset, is it possible that the vocabulary built from it can be used for other sentiment classification problems, not just this movie review task? (That is, given a large corpus, can we use it to make a general-purpose sentiment classifier?)

    • Jason Brownlee September 18, 2018 at 6:21 am #

      I’m not sure you can. I believe it is just a demonstration dataset.
