How to Difference a Time Series Dataset with Python

Differencing is a popular and widely used data transform for time series.

In this tutorial, you will discover how to apply the difference operation to your time series data with Python.

After completing this tutorial, you will know:

  • About the differencing operation, including the configuration of the lag difference and the difference order.
  • How to develop a manual implementation of the differencing operation.
  • How to use the built-in Pandas differencing function.

Let’s get started.

  • Updated Apr/2019: Updated the link to the dataset.

Why Difference Time Series Data?

Differencing is a method of transforming a time series dataset.

It can be used to remove the series' dependence on time, so-called temporal dependence. This includes structures like trends and seasonality.

Differencing can help stabilize the mean of the time series by removing changes in the level of a time series, and so eliminating (or reducing) trend and seasonality.

— Page 215, Forecasting: principles and practice

Differencing is performed by subtracting the previous observation from the current observation.

In this way, a series of differences can be calculated.
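As a minimal sketch of the idea, the snippet below differences a short list of made-up observations (the values are illustrative only and are not taken from any dataset). Each difference is the current observation minus the previous one, so the result has one fewer value than the input.

```python
# Toy observations (made-up values, for illustration only).
observations = [10, 12, 15, 19, 24]

# difference(t) = observation(t) - observation(t-1)
# The loop starts at t=1 because the first observation has no predecessor.
differences = [
    observations[t] - observations[t - 1]
    for t in range(1, len(observations))
]

print(differences)  # [2, 3, 4, 5]
```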

Lag Difference

Taking the difference between consecutive observations is called a lag-1 difference.

The lag difference can be adjusted to suit the specific temporal structure.

For time series with a seasonal component, the lag may be expected to be the period (width) of the seasonality.
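To make the seasonal case concrete, here is a small sketch on a synthetic series. The period of 4 and the repeating pattern are assumptions chosen for illustration. Subtracting the observation one full season earlier cancels the repeating pattern and leaves only the drift accumulated over one period.

```python
period = 4              # assumed seasonal period for this toy series
pattern = [5, 8, 3, 6]  # made-up repeating seasonal shape

# Three full cycles of the pattern plus an upward drift of 1 per time step.
series = [pattern[t % period] + t for t in range(3 * period)]

# Lag-4 difference: subtract the observation one full season earlier.
seasonal_diff = [
    series[t] - series[t - period]
    for t in range(period, len(series))
]

print(series)
print(seasonal_diff)  # the seasonal pattern cancels, leaving a constant 4
```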

Difference Order

Temporal structure may still exist after performing a differencing operation, such as in the case of a nonlinear trend.

As such, the process of differencing can be repeated more than once until all temporal dependence has been removed.

The number of times that differencing is performed is called the difference order.
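The sketch below illustrates this on a synthetic quadratic series, a simple example of a nonlinear trend. The helper name lag1() is hypothetical and used only here. After one pass the differences still increase steadily; after a second pass they are constant.

```python
def lag1(values):
    """Return the lag-1 differences of a list of numbers."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

# Synthetic series with a quadratic (nonlinear) trend: 0, 1, 4, 9, ...
series = [t ** 2 for t in range(8)]

first_order = lag1(series)         # still trending upward
second_order = lag1(first_order)   # trend removed

print(first_order)   # [1, 3, 5, 7, 9, 11, 13]
print(second_order)  # [2, 2, 2, 2, 2, 2]
```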

Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a three-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

The example below loads and creates a plot of the loaded dataset.
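The original listing is not reproduced here, so what follows is a minimal sketch rather than the exact code. It assumes a CSV with "Month" and "Sales" columns in which months are written like "1-01" (year number, then month), and it prefixes "190" to turn them into parseable dates. The rows in SAMPLE_CSV are placeholders, not the real data, and load_shampoo() is a hypothetical helper name. To use the real file, pass its path (for example 'shampoo-sales.csv', an assumed filename) instead of the StringIO object.

```python
from io import StringIO

import pandas as pd

# Placeholder rows in the assumed "Month","Sales" layout; NOT the real data.
SAMPLE_CSV = '''"Month","Sales"
"1-01",100.0
"1-02",120.5
"1-03",90.3
"1-04",150.2
'''


def load_shampoo(source):
    """Load the dataset into a Series of sales indexed by month."""
    frame = pd.read_csv(source, header=0)
    # "1-01" becomes "1901-01", which can be parsed with the %Y-%m format.
    frame["Month"] = pd.to_datetime("190" + frame["Month"], format="%Y-%m")
    return frame.set_index("Month")["Sales"]


series = load_shampoo(StringIO(SAMPLE_CSV))
print(series.head())

# To draw the line plot (requires matplotlib), uncomment:
# from matplotlib import pyplot
# series.plot()
# pyplot.show()
```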

Running the example creates a plot that shows a clear linear trend in the data.

Shampoo Sales Dataset Plot

Manual Differencing

We can difference the dataset manually.

This involves developing a new function that creates a differenced dataset. The function would loop through a provided series and calculate the differenced values at the specified interval or lag.

The function below named difference() implements this procedure.
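The listing itself is missing from this copy of the post, so here is a minimal sketch consistent with the description. It returns a plain Python list; the original may have returned a different container.

```python
def difference(dataset, interval=1):
    """Return the lag-`interval` differences of a sequence as a list."""
    diff = []
    # Start at `interval` so that dataset[i - interval] always exists.
    for i in range(interval, len(dataset)):
        diff.append(dataset[i] - dataset[i - interval])
    return diff


# Made-up values, for illustration only.
print(difference([3, 7, 12, 18]))     # lag 1: [4, 5, 6]
print(difference([3, 7, 12, 18], 2))  # lag 2: [9, 11]
```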

We can see that the function is careful to begin the differenced dataset after the specified interval to ensure differenced values can, in fact, be calculated. A default interval or lag value of 1 is defined. This is a sensible default.

A further improvement would be to let the caller specify the order, that is, the number of times to apply the differencing operation.
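One possible way to add this is sketched below. The order parameter is an assumption about how such an extension could look, not part of the original function. It simply reapplies the same lag difference to the output of the previous pass.

```python
def difference(dataset, interval=1, order=1):
    """Apply a lag-`interval` difference `order` times and return a list."""
    values = list(dataset)
    for _ in range(order):
        # Each pass shortens the series by `interval` observations.
        values = [
            values[i] - values[i - interval]
            for i in range(interval, len(values))
        ]
    return values


# Made-up values, for illustration only.
data = [2, 3, 6, 11, 18]
print(difference(data, order=1))  # [1, 3, 5, 7]
print(difference(data, order=2))  # [2, 2, 2]
```

Each pass removes interval observations from the start, so a high order on a short series can leave very little data.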

The example below applies the manual difference() function to the Shampoo Sales dataset.
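As the original listing is not available here, the following is a sketch of how the function could be applied. A short Series of placeholder values on a monthly index stands in for the loaded dataset (the numbers are not the real sales figures), and this version of difference() wraps its result in a Pandas Series so that it can be plotted directly.

```python
import pandas as pd


def difference(dataset, interval=1):
    """Return the lag-`interval` differences as a Pandas Series."""
    diff = []
    for i in range(interval, len(dataset)):
        diff.append(dataset[i] - dataset[i - interval])
    return pd.Series(diff)


# Stand-in for the loaded dataset: placeholder values on a monthly index.
series = pd.Series(
    [100.0, 120.5, 90.3, 150.2, 140.0, 175.8],
    index=pd.date_range("1901-01-01", periods=6, freq="MS"),
)

# Pass the raw values so that indexing inside difference() is positional.
diff = difference(series.values)
print(diff)

# To plot the result (requires matplotlib), uncomment:
# from matplotlib import pyplot
# diff.plot()
# pyplot.show()
```

Note that the result is indexed 0, 1, 2, ... rather than by date, because only the raw values were passed in.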

Running the example creates the differenced dataset and plots the result.

Manually Differenced Shampoo Sales Dataset

Automatic Differencing

The Pandas library provides a function to automatically calculate the difference of a dataset.

This diff() function is provided on both the Series and DataFrame objects.

Like the manually defined difference function in the previous section, it takes an argument to specify the interval or lag, in this case called periods.

The example below demonstrates how to use the built-in difference function on the Pandas Series object.
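A minimal sketch of this usage follows. As before, a short Series of placeholder values on a monthly index stands in for the loaded dataset; the numbers are illustrative only.

```python
import pandas as pd

# Stand-in for the loaded dataset: placeholder values on a monthly index.
series = pd.Series(
    [100.0, 120.5, 90.3, 150.2, 140.0, 175.8],
    index=pd.date_range("1901-01-01", periods=6, freq="MS"),
)

# Lag-1 difference; periods=1 is the default.
diff = series.diff()

# A longer lag is requested through the periods argument.
lag3 = series.diff(periods=3)

print(diff)

# To plot the result (requires matplotlib), uncomment:
# from matplotlib import pyplot
# diff.plot()
# pyplot.show()
```

The output keeps the same length and index as the input, so the first periods entries are NaN because they have no earlier observation to subtract. Calling diff.dropna() removes them if a later step cannot handle missing values.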

As in the previous section, running the example plots the differenced dataset.

A benefit of using the Pandas function, in addition to requiring less code, is that it maintains the date-time information for the differenced series.

Automatically Differenced Shampoo Sales Dataset

Summary

In this tutorial, you discovered how to apply the difference operation to time series data with Python.

Specifically, you learned:

  • About the difference operation, including the configuration of lag and order.
  • How to implement the difference transform manually.
  • How to use the built-in Pandas implementation of the difference transform.

Do you have any questions about differencing, or about this post?
Ask your questions in the comments below.

34 Responses to How to Difference a Time Series Dataset with Python

  1. Srinath Jayachandran February 26, 2017 at 3:58 pm #

    Hi there, here is a recent work on time series that gives a time series a symbolic representation.
    https://arxiv.org/ftp/arxiv/papers/1611/1611.01698.pdf

  2. NL May 3, 2017 at 1:54 am #

    Have a question. What if the difference is negative?

    • Jason Brownlee May 3, 2017 at 7:39 am #

      Some differences will be positive, some negative.

      • Manuel Jose Alvarez December 4, 2017 at 12:06 pm #

        Hi, what would be the most Pythonic way to set negative differences to zero? Let's say that I have some bookings for t+1 and a forecast.

        • Jason Brownlee December 4, 2017 at 4:58 pm #

          My approach is to make it work first, then make it readable.

  3. Hans June 16, 2017 at 1:06 pm #

    Are difference functions only useful to remove structures like trends and seasonality,
    or can they also be used to build features from trends in data sets?

    What other techniques are available to use trends and seasonality in a constructive way in time series predictions?

  4. Chris June 22, 2017 at 2:04 pm #

    Thanks for these posts, Dr. Brownlee! I like the picture of the beach

  5. proxy list September 27, 2017 at 7:10 pm #

    Hi there, I log on to your new stuff named “How to Difference a Time Series Dataset with Python – Machine Learning Mastery” regularly. Your humoristic style is awesome, keep up the good work! And you can look at our website about proxy lists.

  6. Kathy November 13, 2017 at 2:04 pm #

    Thank you for the valuable insights. Could you please explain how it would be possible to take the second or third difference?

    • Jason Brownlee November 14, 2017 at 10:07 am #

      You apply the difference operation to the already differenced series.

  7. Qian December 3, 2017 at 1:16 pm #

    for “value = int(dataset[i])-int(dataset[i-interval])”
    why it shows “TypeError: only length-1 arrays can be converted to Python scalars”
    thanks in advance!

    • Jason Brownlee December 4, 2017 at 7:45 am #

      Perhaps ensure that you have copied all of the code from the example?

  8. Mohammed March 24, 2018 at 12:07 pm #

    Hi Jason, thanks for posting this, but I’m curious what to do about the NAs after using the diff() function? I’m guessing that data should just be removed? Or should they just be imputed?

  9. HMEDA May 12, 2018 at 3:26 pm #

    I TRIED TO RUN YOUR CODE, BUT I RECEIVED THIS MESSAGE
    (data_string, format))
    ValueError: time data ‘190Sales of shampoo over a three year period’ does not match format ‘%Y-%m’

    THANK YOU IN ADVANCE

  10. Adamson June 8, 2018 at 1:22 am #

    Do you perform differencing on just the output data or do you difference the features if they are time dependent as well?

  11. Abhay Sharma June 14, 2018 at 1:57 pm #

    How does one invert the differencing after the residual forecast has been made to get back to a forecast including the trend and seasonality that was differenced out?

  12. Elo November 10, 2018 at 1:12 am #

    Copy Paste ?
    https://www.m-asim.com/2018/10/12/how-to-difference-a-time-series-dataset-with-python/

    Thanks for this awesome content by the way !

    • Jason Brownlee November 10, 2018 at 6:07 am #

      That’s a shame. I’ll ask him to take it down. Google will also penalize him ferociously.

  13. ENI November 27, 2018 at 12:37 am #

    Doing this, I will have no value for the first observation. I mean, Y(t) - Y(t-1) will be my first value and I will have one fewer observation?

  14. mohammad February 12, 2019 at 5:19 am #

    How to undifference?

  15. Tayyab May 23, 2019 at 1:26 am #

    Hi Jason! As always a great tutorial.

    I need to know, how to get the forecast values of unseen data if the data were differenced by first_order.

    Detail:

    I am doing univariate ARIMA forecasting for oil prices 3 times a day. The data was uneven so interpolated with forward-fill with an hourly rate. I did forecasting using first-order-differencing. To compare test_data and predictions, I reversed the predictions and test-data (integration).

    Now the question is what I do when I don’t have test data but I have forecast unseen data. How would I integrate the predictions back to normal then the different predictions?

  16. puneeth June 6, 2019 at 9:22 pm #

    Can you please tell me how to extract the forecasted values in a graph? I got the predicted values, but I am not able to extract the forecasted values in Python using an ARIMA model.

    predictions_ARIMA_diff=pd.Series(results_ARIMA.fittedvalues, copy=True)
    print(predictions_ARIMA_diff.head())

    predictions_ARIMA_diff_cumsum=predictions_ARIMA_diff.cumsum()
    print(predictions_ARIMA_diff_cumsum.head())

    predictions_ARIMA_log=pd.Series(ts_log[0],index=ts_log.index)
    predictions_ARIMA_log=predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
    predictions_ARIMA_log.head()

    # Next, take the exponent of the series from above (anti-log), which will be the predicted value: the time series forecast model.
    ##Now plot the predicted values with the original.

    #Find the RMSE
    predictions_ARIMA=np.exp(predictions_ARIMA_log)

    plt.plot(ts)
    plt.plot(predictions_ARIMA)
    plt.title('RMSE: %.4f' % np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))

    #Future Prediction
    #Predict for 5 year. We have 144 data points + 60 for next 5 yrs. i.e. predict for 204 data points
    results_ARIMA.plot_predict(1,204)

    • Jason Brownlee June 7, 2019 at 7:58 am #

      You can plot a forecast using matplotlib, e.g. the plot() function.
