Last Updated on August 14, 2020
You need datasets to practice on when getting started with deep learning for natural language processing tasks.
It is better to use small datasets that you can download quickly and do not take too long to fit models. Further, it is also helpful to use standard datasets that are well understood and widely used so that you can compare your results to see if you are making progress.
In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.
Overview
This post is divided into 7 parts; they are:
- Text Classification
- Language Modeling
- Image Captioning
- Machine Translation
- Question Answering
- Speech Recognition
- Document Summarization
I have tried to provide a mixture of datasets that are popular in academic papers and modest in size.
Almost all datasets are freely available for download today.
If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.

Datasets for Natural Language Processing
Photo by Grant, some rights reserved.
1. Text Classification
Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.
Below are some good beginner text classification datasets.
- Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on Reuters in 1987 indexed by categories. Also see RCV1, RCV2 and TRC2.
- IMDB Movie Review Sentiment Classification (stanford). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.
- Newsgroup Movie Review Sentiment Classification (Cornell). A collection of movie reviews drawn from the rec.arts.movies.reviews newsgroup archive, labeled with positive or negative sentiment.
For more, see the post:
2. Language Modeling
Language modeling involves developing a statistical model for predicting the next word in a sentence or the next letter in a word, given whatever has come before. It is a precursor to tasks like speech recognition and machine translation.
Below are some good beginner language modeling datasets.
- Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.
There are more formal corpora that are well studied; for example:
- Brown University Standard Corpus of Present-Day American English. A large sample of English words.
- Google 1 Billion Word Corpus.
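To make the task concrete, here is a minimal, illustrative sketch of next-word prediction with a bigram model built from raw text. The tiny corpus and the `predict_next` helper are made up for illustration; a real model would be trained on one of the corpora above:

```python
from collections import Counter, defaultdict

# Tiny toy corpus standing in for a real dataset such as Project Gutenberg text.
corpus = "the cat sat on the mat . the cat ate the fish ."

# Count bigram frequencies: how often each word follows each other word.
counts = defaultdict(Counter)
tokens = corpus.split()
for prev, word in zip(tokens, tokens[1:]):
    counts[prev][word] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this toy corpus
```

Modern neural language models replace the count table with learned parameters, but the prediction target, the next token given the history, is the same.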
Need help with Deep Learning for Text Data?
Take my free 7-day email crash course now (with code).
Click to sign-up and also get a free PDF Ebook version of the course.
3. Image Captioning
Image captioning is the task of generating a textual description for a given image.
Below are some good beginner image captioning datasets.
- Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
- Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
- Flickr 30K. A collection of 30 thousand described images taken from flickr.com.
For more see the post:
4. Machine Translation
Machine translation is the task of translating text from one language to another.
Below are some good beginner machine translation datasets.
- Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and French.
- European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs across a suite of European languages.
There are a ton of standard datasets used for the annual machine translation challenges; see:
5. Question Answering
Question answering is the task of answering questions posed about a provided sentence or passage of text.
Below are some good beginner question answering datasets.
- Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia articles.
- DeepMind Question Answering Corpus. Question answering about news articles from CNN and the Daily Mail.
- Amazon question/answer data. Question answering about Amazon products.
For more, see the post:
6. Speech Recognition
Speech recognition is the task of transcribing the audio of spoken language into human-readable text.
Below are some good beginner speech recognition datasets.
- TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its wide use. Spoken American English and associated transcription.
- VoxForge. Project to build an open source database for speech recognition.
- LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.
Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.
7. Document Summarization
Document summarization is the task of creating a short meaningful description of a larger document.
Below are some good beginner document summarization datasets.
- Legal Case Reports Data Set. A collection of 4 thousand legal cases and their summaries.
- TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
- The AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles.
For more see:
- Document Understanding Conference (DUC) Tasks.
- Where can I find good data sets for text summarization?
Further Reading
This section provides additional lists of datasets if you are looking to go deeper.
- Text Datasets Used in Research on Wikipedia
- Datasets: What are the major text corpora used by computational linguists and natural language processing researchers?
- Stanford Statistical Natural Language Processing Corpora
- Alphabetical list of NLP Datasets
- NLTK Corpora
- Open Data for Deep Learning on DL4J
Do you know of any other good lists of natural language processing datasets?
Let me know in the comments below.
Summary
In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.
Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.
https://github.com/karthikncode/nlp-datasets
https://github.com/caesar0301/awesome-public-datasets#natural-language
Fantastic, thanks for sharing!
For ASR, you also have the TEDLIUM corpus, based on TED talks.
The second release from 2014 is here : http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
(of course, I’m lobbying for myself a bit, but hey!)
Great, thanks Anthony.
What about informal text datasets for text normalization?
What do you mean Ahmed?
Informal text is basically text used in social media like Twitter or even SMS. It has informal abbreviations, different spellings for the same words, as well as spelling mistakes.
Nice.
Hi, all of these seem to be full articles or full documents that have a single label, all stored in folders of their respective labels. What I’m looking for is files where single lines of text that each have a label, of the file format:
line1 label
line2 label
line3 label
…
Because I’ve written a preprocessing algorithm that takes in files specifically of this format to feed into the toy logreg model I’ve been building. Do you know of any datasets of this specific file format? Or will I pretty much have to do it myself?
Thanks.
You can write code to load any data you wish.
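For what it's worth, loading the line-plus-label format described above takes only a few lines. Here is a minimal sketch; the tab separator and the idea that the label is the last field are assumptions, so adjust to your actual format:

```python
def load_labeled_lines(path, sep="\t"):
    """Parse a file of `text<sep>label` lines into parallel lists."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            text, label = line.rsplit(sep, 1)  # label is the last field
            texts.append(text)
            labels.append(label)
    return texts, labels
```

Splitting from the right with `rsplit` keeps any separator characters that appear inside the text itself from breaking the parse.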
Hello Mr. Jason Brownlee, I am currently working on a sentiment analysis project using Facebook data. My big problem is that the data is not labeled, so I cannot apply a machine learning model.
Do you have a technique for that? Thank you, your help will really be very valuable to me.
If the data is not labelled, then you can use it prepare an unsupervised model that might be a useful starting point for a supervised model later.
Or you can label some of the data?
Please sir, what is meant by "What are the right query words to obtain a representative(!) raw dataset"?
Sincerely
Sorry, I don’t understand your question, can you rephrase it?
Hi,
You are missing entailment datasets.
Congratulations on the forum. I always find good material here.
Thanks for the suggestion.
Dear Mr. Jason,
Thanks for the initiative & congrats.
I'm currently working on gathering requirements from plain text. Can you please help me find a dataset that contains natural language descriptions of how a software system is developed, i.e. it may be any application like banking, a library management system, a course management system, etc.
thank you in advance
kind regards s.murugesh
I answer this question here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Jason your posts are always very informative as well as self-explanatory
Thank you.
Great article. I would also add Dialog Systems as another key NLP task, and the bAbI QA dataset for the QA TASK https://research.fb.com/downloads/babi/
– Roland
Thanks Roland.
Is there any dataset available for categorization, like whether a document is related to a meeting, a task update, a response required, etc.?
You could define this as a predictive modeling problem.
I looked at 1) Text Classification and the 2nd link you provided, the IMDB movie review (Large Movie Review Dataset). It looks like there are quite a lot of postings about RNN model builds, and almost everyone got a good result (accuracy above 85%). However, almost everyone uses the Keras API imdb.load_data to get back training/testing sets already formatted with word IDs (the index into the vocabulary comes along with imdb.load_data). This means we don't have to do our own data processing, build the vocabulary, the embedding matrix, etc.

I wonder if there is any posting that shows the tricks to build the vocabulary and embedding matrix (including forming the X_train, y_train, X_test, y_test data ready to feed to the embedding layer of Keras). Creating the predictive model does not look very difficult if we have a good embedding matrix and well-formed training/testing data. I tried to form my own training/testing data and used pretrained GloVe, but got poor results. Without filling in those missing pieces, we can't generalize the techniques, since we rely on Keras to provide a good API that already gives us the foundation for good prediction results.
Yes, I have a ton on this as well as a book on the subject, get started here:
https://machinelearningmastery.mystagingwebsite.com/start-here/#nlp
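As a rough sketch of the vocabulary-building step being discussed, here is the idea in plain Python rather than the Keras imdb.load_data API. The toy reviews are made up, and reserving ID 0 for padding is a common convention rather than a requirement:

```python
from collections import Counter

# Toy training texts standing in for real movie reviews.
train_texts = ["a great great film", "a terrible film", "great acting"]

# 1. Build a vocabulary: more frequent words get lower IDs; 0 is reserved for padding.
word_counts = Counter(w for text in train_texts for w in text.split())
vocab = {w: i + 1 for i, (w, _) in enumerate(word_counts.most_common())}

# 2. Encode each text as a sequence of word IDs (unknown words are dropped here).
def encode(text):
    return [vocab[w] for w in text.split() if w in vocab]

# 3. Pad the sequences to a fixed length so they form a rectangular array.
def pad(seq, length=5):
    return (seq + [0] * length)[:length]

X_train = [pad(encode(t)) for t in train_texts]
```

The resulting integer sequences are what an embedding layer consumes; an embedding matrix for pretrained vectors such as GloVe would then have one row per vocabulary ID.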
I have another question regarding 1) Text Classification, 2nd link: IMDB Movie Review Sentiment Classification, using the Keras API imdb.load_data. How do we look at the vocabulary built by the Keras imdb API? If this IMDB review set is a large dataset, is it possible that the vocabulary built from it can be used for other sentiment classification problems, not just this movie review one? (That is, given a large corpus, could we use it to make a general-purpose sentiment classifier?)
I’m not sure you can. I believe it is just a demonstration dataset.
Sir, how can I make my own dataset for language translation?
You must collect examples that you have permission to use or use an existing dataset.
Hi Jason, do you have an idea how I can get a French dataset for sentiment analysis, please?
Not off-hand, perhaps try a google search?
Hi Jason, I need a dataset to classify English text based on vocabulary quality: good, very good, excellent. Any suggestions?
No, sorry. Perhaps search for a suitable dataset?
Hey,
I am searching for a dataset that contains documents and labels. For both I need text info and network info. For example, the BioASQ dataset has documents and labels, but it has text info and network info only for the labels; for the documents there is only text info.
Does somebody know of such a dataset?
Best regards,
Chris
Perhaps start here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Dear Mr Janson,
Would you have any dataset about phrasal verbs?
Your post is just fantastic! Thank you very much for sharing your awesome knowledge.
God bless,
Marta
Not off hand, sorry.
Common Voice is a Mozilla project for CC0 speech-to-text datasets.
https://voice.mozilla.org/en
Thanks for sharing.
Dear Mr. Janson,
Would you have any dataset about intent extraction and classification?
Your post is just fantastic! Thank you very much for sharing your awesome knowledge.
Nisaruddin
Great suggestion, thanks!
Dear Mr. Janson,
Would you have any dataset for automatic text summarization of texts in Spanish using deep learning methods? (A large volume of data.)
Thank you very much for sharing…
Ana
Perhaps search here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Excellent, Jason. I am looking for a text dataset I can use to train my model for PII (Personally Identifiable Information) classification. Do you know where I can get one?
This might help:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/where-can-i-get-a-dataset-on-___
I need code to pick sentences from a text dataset for a project on social media news generation.
Sounds great, good luck!
https://wiki.korpus.cz/doku.php/en:cnk:uvod
Czech language corpora. Their tools are just impressive.
Thanks for sharing.
It is also worth mentioning the Mozilla Voice project that was run in 2018. You can join the project and help build the biggest voice datasets for many languages. You can find more details here: https://bestin-it.com/help-to-build-common-voice-datasets-with-mozilla/. Try to find your language, and request a new one if you cannot find it. In the Datasets tab you can download packages with audio samples of the selected language.
Thanks for sharing.
good job!
I am searching for a text classification dataset that has categories such as war, sport, economy, etc.
Does somebody know of such a dataset?
thanks!
This might help:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Hello
I am working on an SDN conflict project and I want to use a dataset, but I face a problem: I could not get an SDN dataset. The data I am looking for in SDN has to include the following features (priority, action, source IP, destination IP, MAC, and protocol of the flow). I would like to know if you can help me with this matter.
I want to use a machine learning algorithm to detect conflicts in SDN, so I have to have a dataset containing the following features (priority, protocol, action, source IP/space address, MAC address). All the datasets I found through websites and research do not mention the features I am looking for.
· In this case I have to create a dataset of conflicting flow entries or flow rules in SDN, like the generation in the "Brew: A Security Policy Analysis Framework for Distributed SDN-Based Cloud Environments" document (100,000 flows) from the Stanford topology. I attached the paper I used as a benchmark for my project.
The generated flows must include the conflict policies shown in the table:
· Summary of conflict policy types:
best regards
Perhaps this will help you to locate an appropriate dataset:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Hi! I was looking for NLP datasets, and I found nearly 1000 datasets from Curated NLP Database at https://metatext.io/datasets
Thanks for sharing.
Hi,
I was wondering about the differences between datasets for language modeling, masked language modeling, and machine translation. Under language modeling, you have mentioned that "It is a pre-cursor task in tasks like speech recognition and machine translation".
Does that mean you can pre-train a model on a language modeling objective and fine-tune it using a parallel corpus or something similar? Although I'm not sure how that would work; would it be trained on the target language?
Yes, you can train a general language model and reuse and refine it in specific problem domains.
https://metatext.io/datasets NLP repository. 1000+ datasets… Their tools are just impressive.
Thanks for sharing.
Hello Jason Brownlee, please, I want to work on a low-resource language: my language in Africa, called Hausa. The problem I am having is with datasets or corpora to build a model or NLP application. Please, how do I go about this task or build a dataset? I will be glad if you can guide me. Thanks.
The first step is to define your problem, what you want to predict. Then collect your dataset or find someone who has a dataset you can use/license.
Hi,
I need a raw dataset for my project. My project is about research paper clustering. Where can I find one?
Hi Maryam…The following may be of interest to you:
https://pub.towardsai.net/best-datasets-for-machine-learning-data-science-computer-vision-nlp-ai-c9541058cf4f