Datasets for Natural Language Processing

Last Updated on August 7, 2019

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that you can download quickly and that do not take long to fit a model on. It also helps to use standard datasets that are well understood and widely used, so that you can compare your results against published ones and see whether you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

  1. Text Classification
  2. Language Modeling
  3. Image Captioning
  4. Machine Translation
  5. Question Answering
  6. Speech Recognition
  7. Document Summarization

I have tried to provide a mixture of datasets that are both popular in academic papers and modest in size.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Let’s get started.

1. Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.
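To make the idea of labeling documents concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python. The four training sentences and the "spam"/"ham" labels are made up for illustration and do not come from any of the datasets in this post; the function names are also just illustrative choices.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def train_naive_bayes(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability under the model."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data, only for illustration.
train = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(train)
print(predict("free prize money", model))
print(predict("team meeting on monday", model))
```

A simple baseline like this is a useful point of comparison before fitting a deep learning model on a real dataset.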

Below are some good beginner text classification datasets.

For more, see the post:

2. Language Modeling

Language modeling involves developing a statistical model that predicts the next word in a sentence, or the next letter in a word, given whatever has come before. It is a precursor to tasks such as speech recognition and machine translation.
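As a rough illustration of predicting the next word from what came before, the sketch below builds a bigram model, which conditions only on the single preceding word. This is a much simpler stand-in for a neural language model, and the example sentence is invented.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def next_word_probabilities(model, prev):
    """Turn the raw counts for `prev` into a probability distribution."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}

# Invented toy text, only for illustration.
text = "the cat sat on the mat and the cat slept"
model = train_bigram_model(text.split())
probs = next_word_probabilities(model, "the")
print(probs)
```

In this toy text, "the" is followed by "cat" twice and "mat" once, so the model assigns "cat" a probability of two thirds. A word that never appears as a context returns an empty distribution.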

Below are some good beginner language modeling datasets.

  • Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.
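Plain text books usually need some cleaning before they can be used for modeling. The sketch below assumes that the book body sits between lines starting with `*** START OF` and `*** END OF`; that marker format is an assumption, so check it against the file you actually download. The sample text is invented and stands in for a downloaded file.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Keep only the lines between the start and end markers, if they are found."""
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        # Assumed marker prefixes; verify them against your own file.
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end]).strip()

# Invented sample standing in for a downloaded plain text book.
sample = """Title: An Example Book
*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE BOOK ***

It was a quiet morning.
The sun rose slowly.

*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE BOOK ***
License text follows here.
"""
body = strip_gutenberg_boilerplate(sample)
tokens = body.lower().split()
print(body)
print(len(tokens))
```

If neither marker is found, the function returns the whole text unchanged apart from trimming surrounding whitespace.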

There are more formal corpora that are well studied; for example:

3. Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  • Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  • Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  • Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

For more see the post:

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

There are a ton of standard datasets used for the annual machine translation challenges; see:

5. Question Answering

Question answering is a task where a sentence or passage of text is provided, questions are asked about it, and the model must answer them.

Below are some good beginner question answering datasets.

For more, see the post:

6. Speech Recognition

Speech recognition is the task of transforming audio of spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.

7. Document Summarization

Document summarization is the task of creating a short meaningful description of a larger document.
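One simple baseline, sketched below, is extractive summarization: pick the sentences whose words are most frequent in the document. This selects existing sentences rather than writing new text, so it is only one narrow interpretation of the task, and the scoring rule and example document are illustrative assumptions.

```python
import re
from collections import Counter

def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

# Invented example document, only for illustration.
document = (
    "Datasets help you practice. "
    "Small datasets help you practice quickly. "
    "The weather was pleasant yesterday."
)
print(summarize(document, num_sentences=1))
```

A baseline like this gives you something to compare against when you move on to learned summarization models.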

Below are some good beginner document summarization datasets.

For more see:

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

Do you know of any other good lists of natural language processing datasets?
Let me know in the comments below.

Summary

In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.

Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.

About Jason Brownlee

Jason Brownlee, PhD is a machine learning specialist who teaches developers how to get results with modern machine learning methods via hands-on tutorials.
