# Data Preparation for Variable Length Input Sequences

Last Updated on August 14, 2019

Deep learning libraries assume a vectorized representation of your data.

In the case of variable length sequence prediction problems, this requires that your data be transformed such that each sequence has the same length.

This vectorization allows code to efficiently perform the matrix operations in batch for your chosen deep learning algorithms.

In this tutorial, you will discover techniques that you can use to prepare your variable length sequence data for sequence prediction problems in Python with Keras.

After completing this tutorial, you will know:

• How to pad variable length sequences with dummy values.
• How to pad variable length sequences to a new longer desired length.
• How to truncate variable length sequences to a shorter desired length.

Let’s get started.

Data Preparation for Variable-Length Input Sequences for Sequence Prediction
Photo by Adam Bautz, some rights reserved.

## Overview

This tutorial is divided into 3 parts; they are:

1. Contrived Sequence Problem
2. Sequence Padding
3. Sequence Truncation

### Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras (v2.0.4+) installed with either the TensorFlow (v1.1.0+) or Theano (v0.9+) backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

## Contrived Sequence Problem

We can contrive a simple sequence problem for the purposes of this tutorial.

The problem is defined as sequences of integers. There are three sequences, with lengths between 1 and 4 timesteps.

These can be defined as a list of lists in Python as follows (with spacing for readability):

sequences = [
	[1, 2, 3, 4],
	[1, 2, 3],
	[1]
]

We will use these sequences as the basis for exploring sequence padding in this tutorial.
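
Before turning to Keras, a short pure-Python sketch (not the Keras API) shows why padding matters: the three ragged sequences only form a rectangular batch, suitable for vectorized matrix operations, once each row is filled out to the longest length:

```python
# Three sequences of unequal length cannot be stacked into one
# rectangular batch as-is; padding makes every row the same length.
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

maxlen = max(len(s) for s in sequences)  # longest sequence: 4 timesteps
padded = [[0] * (maxlen - len(s)) + s for s in sequences]  # pre-pad with zeros
print(padded)  # [[1, 2, 3, 4], [0, 1, 2, 3], [0, 0, 0, 1]]
```

This is the behavior that the Keras pad_sequences() function provides out of the box.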

## Sequence Padding

The pad_sequences() function in the Keras deep learning library can be used to pad variable length sequences.

The default padding value is 0.0, which is suitable for most applications, although this can be changed by specifying a preferred value via the “value” argument.
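
As an illustration, here is a pure-Python sketch (not the Keras API itself) of what padding with a custom fill value of -1 (a hypothetical choice) would produce when passed through the “value” argument:

```python
# Sketch of pre-padding with a custom fill value of -1 instead of 0.
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

maxlen = max(len(s) for s in sequences)
padded = [[-1] * (maxlen - len(s)) + s for s in sequences]
print(padded)  # [[1, 2, 3, 4], [-1, 1, 2, 3], [-1, -1, -1, 1]]
```

A non-zero fill value can be useful when 0 is a meaningful value in your data.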

Whether values are added to the beginning or the end of each sequence, called pre- or post-sequence padding, is controlled by the “padding” argument, which defaults to ‘pre’.

### Pre-Sequence Padding

The example below demonstrates pre-padding the 3 input sequences with 0 values.

from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
	[1, 2, 3, 4],
	[1, 2, 3],
	[1]
]
# pad sequence
padded = pad_sequences(sequences)
print(padded)

Running the example prints the 3 sequences, pre-padded with zero values.

[[1 2 3 4]
 [0 1 2 3]
 [0 0 0 1]]

### Post-Sequence Padding

Padding can also be applied to the end of the sequences, which may be more appropriate for some problem domains.

Post-sequence padding can be specified by setting the “padding” argument to “post”.

from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
	[1, 2, 3, 4],
	[1, 2, 3],
	[1]
]
# pad sequence
padded = pad_sequences(sequences, padding='post')
print(padded)

Running the example prints the same sequences with zero values appended.

[[1 2 3 4]
[1 2 3 0]
[1 0 0 0]]

The pad_sequences() function can also be used to pad sequences to a preferred length that may be longer than any observed sequences.

This can be done by setting the “maxlen” argument to the desired length. Padding is then applied to all sequences to achieve the desired length, as follows.

from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
	[1, 2, 3, 4],
	[1, 2, 3],
	[1]
]
# pad sequence
padded = pad_sequences(sequences, maxlen=5)
print(padded)

Running the example pads each sequence to the desired length of 5 timesteps, even though the maximum length of an observed sequence is only 4 timesteps.

[[0 1 2 3 4]
[0 0 1 2 3]
[0 0 0 0 1]]

## Sequence Truncation

The length of sequences can also be trimmed to a desired length.

The desired length for sequences can be specified as a number of timesteps with the “maxlen” argument.

There are two ways that sequences can be truncated: by removing timesteps from the beginning or the end of sequences.
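
A pure-Python sketch (not the Keras API) contrasts the two modes for a desired length of 2 timesteps; note that a sequence already shorter than the limit is left untouched, for padding to handle afterwards:

```python
# Sketch of the two truncation modes for maxlen=2.
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]
maxlen = 2

pre_truncated = [s[-maxlen:] for s in sequences]   # keep the last maxlen timesteps
post_truncated = [s[:maxlen] for s in sequences]   # keep the first maxlen timesteps
print(pre_truncated)   # [[3, 4], [2, 3], [1]]
print(post_truncated)  # [[1, 2], [1, 2], [1]]
```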

### Pre-Sequence Truncation

The default truncation method is to remove timesteps from the beginning of sequences. This is called pre-sequence truncation.

The example below truncates sequences to a desired length of 2.

from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
	[1, 2, 3, 4],
	[1, 2, 3],
	[1]
]
# truncate sequence
truncated = pad_sequences(sequences, maxlen=2)
print(truncated)

Running the example removes the first two timesteps from the first sequence, the first timestep from the second sequence, and pads the final sequence.

[[3 4]
 [2 3]
 [0 1]]

### Post-Sequence Truncation

Sequences can also be trimmed by removing timesteps from the end of the sequences.

This approach may be more desirable for some problem domains.

Post-sequence truncation can be configured by changing the “truncating” argument from the default ‘pre’ to ‘post’, as follows:

from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
	[1, 2, 3, 4],
	[1, 2, 3],
	[1]
]
# truncate sequence
truncated = pad_sequences(sequences, maxlen=2, truncating='post')
print(truncated)

Running the example removes the last two timesteps from the first sequence, the last timestep from the second sequence, and again pads the final sequence.

[[1 2]
 [1 2]
 [0 1]]

## Summary

In this tutorial, you discovered how to prepare variable length sequence data for use with sequence prediction problems in Python.

Specifically, you learned:

• How to pad variable length sequences with dummy values.
• How to pad out variable length sequences to a new desired length.
• How to truncate variable length sequences to a new desired length.

Do you have any questions about preparing variable length sequences?
