Datasets for Natural Language Processing

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that you can download quickly and that do not take too long to fit models on. Further, it is also helpful to use standard datasets that are well understood and widely used so that you can compare your results to see if you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

  1. Text Classification
  2. Language Modeling
  3. Image Captioning
  4. Machine Translation
  5. Question Answering
  6. Speech Recognition
  7. Document Summarization

I have tried to provide a mixture of datasets that are popular in academic papers and that are modest in size.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.


1. Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.

Below are some good beginner text classification datasets.

  • Reuters Newswire Topic Classification (Reuters-21578). A collection of Reuters newswire documents labeled by topic.
  • IMDB Movie Review Sentiment Classification. The Stanford Large Movie Review Dataset: 50,000 movie reviews from IMDB, labeled by sentiment.
  • Movie Review Data (Cornell). Movie reviews labeled by sentiment polarity.
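If you want a feel for one of these datasets before building your own data pipeline, here is a minimal sketch that loads the IMDB reviews with the Keras helper API (the same imdb.load_data call discussed in the comments below). The vocabulary size and sequence length are arbitrary illustrative choices, and a recent TensorFlow/Keras install is assumed.

# Minimal sketch: load the IMDB sentiment dataset via the Keras helper API.
# num_words and maxlen are arbitrary illustrative choices, not recommendations.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep only the 10,000 most frequent words; rarer words map to an "unknown" ID.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

# Pad/truncate each review to a fixed length so it can feed an Embedding layer.
X_train = pad_sequences(X_train, maxlen=500)
X_test = pad_sequences(X_test, maxlen=500)

print(X_train.shape, y_train.shape)

Note that load_data returns reviews already encoded as word IDs; if you work from the raw text instead, you will need to construct the vocabulary and encoding yourself.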

2. Language Modeling

Language modeling involves developing a statistical model for predicting the next word in a sentence or the next letter in a word given whatever has come before. It is a precursor task for tasks like speech recognition and machine translation.

Below are some good beginner language modeling datasets.

  • Project Gutenberg. A large collection of free books that can be retrieved in plain text in a variety of languages.

There are more formal corpora that are well studied; for example:

  • Brown University Standard Corpus of Present-Day American English (the Brown Corpus). A large, carefully balanced sample of American English.
  • Google 1 Billion Word Corpus (the One Billion Word Benchmark). A large corpus for measuring progress in statistical language modeling.
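As a sketch of how you might turn one of these sources into modeling data, the snippet below downloads a single Project Gutenberg book over HTTP and reduces it to lowercase word tokens. The book URL (Alice's Adventures in Wonderland, book #11) and the crude header/footer trimming are illustrative assumptions; check the file you actually download.

# Sketch: fetch one Project Gutenberg book as plain text and tokenize it.
# The URL and the header/footer handling are illustrative assumptions.
import re
from urllib.request import urlopen

url = "https://www.gutenberg.org/files/11/11-0.txt"
raw = urlopen(url).read().decode("utf-8")

# Gutenberg files wrap the book text in "*** START ..." / "*** END ..." markers.
start = raw.find("*** START")
end = raw.find("*** END")
body = raw[start:end] if start != -1 and end != -1 else raw

# Lowercase word tokens are a common starting point for a word-level model.
tokens = re.findall(r"[a-z']+", body.lower())
print(len(tokens), tokens[:10])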


3. Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  • Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  • Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  • Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

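If you grab Flickr 8K, the descriptions ship as a plain-text file of image-id/caption pairs. The sketch below assumes the common Flickr8k.token.txt layout, with one image.jpg#n<TAB>caption entry per line; verify this against your download.

# Sketch: parse Flickr 8K captions into {image_id: [caption, ...]}.
# Assumes the common "Flickr8k.token.txt" format: image.jpg#n<TAB>caption
from collections import defaultdict

captions = defaultdict(list)
with open("Flickr8k.token.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        image_tag, caption = line.split("\t", 1)
        image_id = image_tag.split("#")[0]  # drop the "#0".."#4" caption index
        captions[image_id].append(caption)

print(len(captions), "images")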

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

  • Aligned Hansards of the 36th Parliament of Canada. Pairs of aligned English and French sentences from Canadian parliamentary proceedings.
  • European Parliament Proceedings Parallel Corpus 1996-2011 (Europarl). Pairs of aligned sentences across many European language pairs.

There are a ton of standard datasets used for the annual machine translation challenges; see the shared tasks of the Conference on Machine Translation (WMT) at statmt.org.
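Parallel corpora such as Europarl usually ship as two line-aligned plain-text files, one per language, where line i in one file translates line i in the other. This sketch assumes the French-English file names from the Europarl v7 release; adjust them to whatever you download.

# Sketch: load a line-aligned parallel corpus into (source, target) pairs.
# File names assume the Europarl v7 French-English release.
with open("europarl-v7.fr-en.en", encoding="utf-8") as f:
    english = f.read().strip().split("\n")
with open("europarl-v7.fr-en.fr", encoding="utf-8") as f:
    french = f.read().strip().split("\n")

# Line i in one file is the translation of line i in the other.
pairs = list(zip(english, french))
print(len(pairs), pairs[0])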

5. Question Answering

Question answering is the task where a sentence or sample of text is provided, about which questions are asked that must be answered.

Below are some good beginner question answering datasets.

  • Stanford Question Answering Dataset (SQuAD). Answering questions about Wikipedia articles.
  • DeepMind Question Answering Corpus. Answering questions about news articles from CNN and the Daily Mail.
  • Amazon Question/Answer Data. Answering questions about Amazon products.
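SQuAD is distributed as a single JSON file of articles, paragraphs, and question/answer spans. As a minimal sketch, the snippet below flattens SQuAD v1.1 into (context, question, answer) triples; it assumes the train-v1.1.json file name from the official release (v2.0 adds unanswerable questions and differs slightly).

# Sketch: flatten SQuAD v1.1 into (context, question, answer_text) triples.
import json

with open("train-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

triples = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                triples.append((context, qa["question"], answer["text"]))

print(len(triples), "question-answer triples")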

6. Speech Recognition

Speech recognition is the task of transforming audio of a spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

  • TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but widely used because of its role as a standard benchmark; paired English speech and transcriptions.
  • VoxForge. A project to build an open source database of transcribed speech for many languages.
  • LibriSpeech ASR Corpus. A large collection of English audiobook recordings taken from the LibriVox project, paired with transcriptions.
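Speech datasets pair audio files with transcripts, so the first practical step is usually just inspecting the audio. As a minimal sketch, the standard-library wave module can read a PCM WAV clip; note that LibriSpeech ships FLAC, so you would convert first or use a library such as librosa or soundfile. The file name here is a placeholder.

# Sketch: inspect a PCM WAV file with only the Python standard library.
# "sample.wav" is a placeholder; LibriSpeech ships FLAC, so convert first
# or read it with librosa/soundfile instead.
import wave

with wave.open("sample.wav", "rb") as w:
    rate = w.getframerate()          # samples per second
    frames = w.getnframes()          # total samples per channel
    duration = frames / float(rate)  # clip length in seconds
    print(rate, "Hz,", w.getnchannels(), "channel(s),", round(duration, 2), "s")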

Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.

7. Document Summarization

Document summarization is the task of creating a short, meaningful description of a larger document.

Below are some good beginner document summarization datasets.

  • Legal Case Reports Dataset. A collection of legal cases and their summaries.
  • TIPSTER Text Summarization Evaluation Conference Corpus. A collection of news documents and their summaries.
  • The AQUAINT Corpus of English News Text. Not free, but widely used; a large corpus of news articles.

For more, see the Document Understanding Conference (DUC) tasks.
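Whichever corpus you pick, a useful first point of comparison is the classic "lead" baseline, which simply takes the first few sentences of a document as its summary. Here is a minimal sketch with a naive regex sentence splitter; a real project would use a proper sentence tokenizer such as NLTK's sent_tokenize.

# Sketch: the classic "lead-n" summarization baseline.
# The regex sentence splitter is naive and only for illustration.
import re

def lead_summary(document, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:n_sentences])

doc = ("First sentence of the article. Second sentence with detail. "
       "Third sentence. Fourth sentence that the summary drops.")
print(lead_summary(doc))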

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

  • NLTK Corpora. The corpus readers and sample corpora bundled with the Natural Language Toolkit.
  • Stanford Statistical Natural Language Processing Corpora. An annotated list of corpora maintained at Stanford.
  • Wikipedia's List of datasets for machine-learning research, which includes many text datasets.

Do you know of any other good lists of natural language processing datasets?
Let me know in the comments below.

Summary

In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.

Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.


68 Responses to Datasets for Natural Language Processing

  1. Anthony Rousseau October 8, 2017 at 5:09 pm #

    For ASR, you also have the TEDLIUM corpus, based on TED talks.
    The second release from 2014 is here: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
    (of course, I’m lobbying for myself a bit, but hey!)

  2. Ahmed December 12, 2017 at 7:12 pm #

    What about informal text datasets for text normalization?

    • Jason Brownlee December 13, 2017 at 5:30 am #

      What do you mean, Ahmed?

      • Ahmed December 21, 2017 at 9:02 am #

        Informal text is basically the text used on social media like Twitter, or even in SMS messages. It has informal abbreviations, different spellings for the same words, as well as spelling mistakes.

  3. AB February 4, 2018 at 6:17 pm #

    Hi, all of these seem to be full articles or full documents that have a single label, all stored in folders of their respective labels. What I'm looking for is files where single lines of text each have a label, in the format:

    line1 label
    line2 label
    line3 label

    I've written a preprocessing algorithm that takes in files of exactly this format to feed into the toy logistic regression model I've been building. Do you know of any datasets in this specific file format? Or will I pretty much have to do it myself?

    Thanks.

    • Jason Brownlee February 5, 2018 at 7:44 am #

      You can write code to load any data you wish.

      • pierre March 25, 2019 at 10:49 pm #

        Hello Mr. Jason Brownlee. I am currently working on a sentiment analysis project on Facebook data. My big problem is that the data is not labeled, so I have not been able to apply a machine learning model.
        Do you have a technique for that? Thank you; your help will really be very valuable to me.

        • Jason Brownlee March 26, 2019 at 8:07 am #

          If the data is not labelled, then you can use it to prepare an unsupervised model that might be a useful starting point for a supervised model later.

          Or you can label some of the data?

  4. Riyadh March 20, 2018 at 11:26 pm #

    Please sir, what is the meaning of "What are the right query words to obtain a representative(!) raw dataset"?

    Sincerely

    • Jason Brownlee March 21, 2018 at 6:36 am #

      Sorry, I don’t understand your question, can you rephrase it?

  5. Manuel April 1, 2018 at 5:14 am #

    Hi,

    You are missing entailment datasets.

    Congratulations on the forum. I always find good material here.

  6. s.murugesh April 19, 2018 at 11:35 pm #

    Dear Mr. Jason,

    Thanks for the initiative & congrats.

    I'm currently working on gathering requirements from plain text. Can you please help me find a dataset that contains natural language descriptions of how a piece of software is developed? It may be any application, like banking, a library management system, a course management system, etc.

    Thank you in advance.
    Kind regards, s.murugesh

  7. turistinfo July 28, 2018 at 7:33 am #

    If you are going for best contents like me, only pay a quick
    visit this web site every day as it presents feature contents, thanks – Beth

  8. Lamin Dibba July 29, 2018 at 5:54 am #

    Jason, your posts are always very informative as well as self-explanatory.

  9. Roland Fernandez July 29, 2018 at 9:50 am #

    Great article. I would also add dialog systems as another key NLP task, and the bAbI QA dataset for the QA task: https://research.fb.com/downloads/babi/
    – Roland

  10. Parth Pandya September 4, 2018 at 4:07 pm #

    Is there any dataset available for categorization, e.g., whether a document is related to a meeting, a task update, a required response, etc.?

    • Jason Brownlee September 5, 2018 at 6:28 am #

      You could define this as a predictive modeling problem.

  11. HUI-YING LU September 18, 2018 at 1:17 am #

    I looked at 1) Text Classification and the 2nd link you provided, the IMDB movie review (Large Movie Review Dataset). It looks like there are quite a lot of postings on building RNN models for it, and almost everyone got a good result (accuracy above 85%). However, almost everyone uses the Keras API imdb.load_data to get back training/testing sets already formatted with word IDs (an index into the vocabulary that comes along with imdb.load_data). This means we don't have to do our own data processing, build the vocabulary, the embedding matrix, etc. I wonder whether there is any posting that shows the tricks to build the vocabulary and embedding matrix (including forming the X_train, y_train, X_test, y_test data ready to feed to the embedding layer of Keras). Creating the predictive model does not look very difficult if we have a good embedding matrix and well-formed training/testing data. I tried to form my own training/testing data and use pretrained GloVe, but got poor results. Without filling in those missing pieces, we can't generalize the techniques, since we rely on Keras to provide a good API that already gives us the foundation for a good prediction result.

  12. HUI-YING LU September 18, 2018 at 3:55 am #

    I have another question regarding 1) Text Classification, 2nd link: IMDB Movie Review Sentiment Classification, using the Keras API imdb.load_data: how do we look at the vocabulary built by the Keras imdb API? Since this IMDB review set is a large dataset, is it possible that the vocabulary it builds can be used for other sentiment classification problems, not just this movie review one? (That is, given a large corpus, could we use it to make a general purpose sentiment classifier?)

    • Jason Brownlee September 18, 2018 at 6:21 am #

      I’m not sure you can. I believe it is just a demonstration dataset.

  13. Debayan Chakraborty November 12, 2018 at 5:07 pm #

    Sir, how can I make my own dataset for language translation?

    • Jason Brownlee November 13, 2018 at 5:43 am #

      You must collect examples that you have permission to use or use an existing dataset.

  14. MD December 5, 2018 at 2:49 am #

    Hi Jason, do you have an idea of how I can get a French dataset for sentiment analysis, please?

  15. Ramendra Singla February 3, 2019 at 9:44 pm #

    Hi Jason, I need a dataset to classify English text based on vocabulary quality: good, very good, excellent. Any suggestions?

    • Jason Brownlee February 4, 2019 at 5:46 am #

      No, sorry. Perhaps search for a suitable dataset?

  16. Chris March 10, 2019 at 9:23 pm #

    Hey,

    I am searching for a dataset which contains documents and labels. For both I need text info and network info. For example, the BioASQ dataset has documents and labels, but text info and network info only for the labels; for the documents, only text info.

    Does somebody know of such a dataset?

    Best regards,
    Chris

  17. Marta Schilling March 31, 2019 at 6:00 am #

    Dear Mr. Jason,

    Would you have any dataset about phrasal verbs?

    Your post is just fantastic! Thank you very much for sharing your awesome knowledge.

    God bless,

    Marta

  18. Nart September 9, 2019 at 10:36 pm #

    Common Voice is a Mozilla project for CC0 speech-to-text datasets:
    https://voice.mozilla.org/en

  19. Nisaruddin February 5, 2020 at 7:43 pm #

    Dear Mr. Jason,

    Would you have any dataset about intent extraction and classification?

    Your post is just fantastic! Thank you very much for sharing your awesome knowledge.

    Nisaruddin

  20. Ana Rodríguez February 18, 2020 at 3:41 pm #

    Dear Mr. Jason,

    Would you have any dataset for automatic text summarization of texts in Spanish using deep learning methods? (A large volume of data.)

    Thank you very much for sharing…

    Ana

  21. Anirudh Kumar February 28, 2020 at 11:25 pm #

    Excellent, Jason. I am looking for a text dataset that I can use to train my model for PII (Personally Identifiable Information) classification. Do you know where I can get one?

  22. Moteel February 29, 2020 at 9:25 am #

    https://wiki.korpus.cz/doku.php/en:cnk:uvod
    Czech language corpora. Their tools are just impressive.

  23. Artur Poniedzialek March 16, 2020 at 2:57 pm #

    It is also worth mentioning the Mozilla Voice project that was run in 2018. You can join the project and help to build the biggest voice datasets for many languages. You can find more details here: https://bestin-it.com/help-to-build-common-voice-datasets-with-mozilla/. Try to find your language, and request a new one if you cannot find it. In the Datasets tab you can download packages with audio samples of the selected language.

  24. dodo April 1, 2020 at 12:53 am #

    Good job!

    I am searching for a text classification dataset that has categories like war, sport, economy, etc.

    Does somebody know of such a dataset?

    Thanks!

  25. mutaz July 10, 2020 at 4:53 pm #

    Hello,

    I am working on an SDN conflict project and I want to use a dataset, but I face a problem: I could not get an SDN dataset. The data I am looking for in SDN has to include the following features: priority, action, source IP, destination IP, MAC, and protocol of the flow. I am asking if you can help me with this matter.
    I want to use a machine learning algorithm to detect conflicts in SDN, so I have to have a dataset containing the following features: priority, protocol, action, source IP/address space, MAC address. All the datasets I found through websites and research do not mention the features I am looking for.

    In this case I have to create and form a dataset of conflicting flow entries or flow rules in SDN, like the generation in the "Brew: A Security Policy Analysis Framework for Distributed SDN-Based Cloud Environments" document (100,000 flows) from the Stanford topology. I attached the paper I used as a benchmark for my project.

    The generated flows must include the conflict policy types shown in the table (summary of conflict policy types).

    Best regards

  26. Rafael November 22, 2020 at 10:53 pm #

    Hi! I was looking for NLP datasets, and I found nearly 1000 datasets from Curated NLP Database at https://metatext.io/datasets

  27. Arushi December 16, 2020 at 1:27 pm #

    Hi,

    I was wondering about the differences between datasets for language modeling, masked language modeling, and machine translation. Under language modeling, you have mentioned that "It is a precursor task for tasks like speech recognition and machine translation."
    Does that mean you can pre-train a model on a language modeling objective and fine-tune it using a parallel corpus or something similar? Although I'm not sure how that would work; would it be trained on the target language?

    • Jason Brownlee December 16, 2020 at 1:44 pm #

      Yes, you can train a general language model and reuse and refine it in specific problem domains.

  28. George Benetti December 22, 2020 at 4:40 am #

    https://metatext.io/datasets NLP repository. 1000+ datasets… Their tools are just impressive.

  29. Abdullahi Abba Abdullahi August 8, 2021 at 10:08 pm #

    Hello, Jason Brownlee. Please, I want to work on a low-resource language like my language in Africa, called Hausa. The problem I am having is with datasets or corpora to build a model or an NLP application. Please, how do I go about this task or building a dataset? I will be glad if you can guide me. Thanks.

    • Jason Brownlee August 9, 2021 at 5:56 am #

      The first step is to define your problem, what you want to predict. Then collect your dataset or find someone who has a dataset you can use/license.

  30. maryam January 29, 2022 at 12:17 am #

    Hi,
    I need a raw dataset for my project. My project is about research paper clustering. Where can I find one?
