How to Difference a Time Series Dataset with Python

By Jason Brownlee on August 14, 2020 in Time Series

Differencing is a popular and widely used data transform for time series.

In this tutorial, you will discover how to apply the difference operation to your time series data with Python.

After completing this tutorial, you will know:

About the differencing operation, including the configuration of the lag difference and the difference order.
How to develop a manual implementation of the differencing operation.
How to use the built-in Pandas differencing function.

Let’s get started.

Updated Apr/2019: Updated the link to dataset.

Why Difference Time Series Data?

Differencing is a method of transforming a time series dataset.

It can be used to remove the series' dependence on time, so-called temporal dependence. This includes structures like trends and seasonality.

Differencing can help stabilize the mean of the time series by removing changes in the level of a time series, and so eliminating (or reducing) trend and seasonality.

— Page 215, Forecasting: principles and practice

Differencing is performed by subtracting the previous observation from the current observation.

difference(t) = observation(t) - observation(t-1)


In this way, a series of differences can be calculated.
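
As a small illustration of the formula above, the sketch below differences a short list of values. The numbers are made up for the example and do not come from any dataset.

```python
# hypothetical observations with a steady upward trend
observations = [10, 13, 16, 19, 22]

# difference(t) = observation(t) - observation(t-1), starting at t=1
differences = [observations[t] - observations[t - 1] for t in range(1, len(observations))]

print(differences)  # [3, 3, 3, 3]
```

Note that the differenced series has one fewer value than the original, and that the steady increase has become a constant.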

Lag Difference

Taking the difference between consecutive observations is called a lag-1 difference.

The lag difference can be adjusted to suit the specific temporal structure.

For time series with a seasonal component, the lag may be expected to be the period (width) of the seasonality.
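
To make the idea of a seasonal lag concrete, here is a minimal sketch using a contrived series that repeats every 4 observations. The pattern values and the period of 4 are assumptions chosen only for illustration.

```python
# hypothetical series: a pattern that repeats with a period of 4 observations
period = 4
pattern = [5, 9, 2, 7]
series = [pattern[i % period] for i in range(12)]

# lag-1 difference: the repeating structure is still present
lag1 = [series[i] - series[i - 1] for i in range(1, len(series))]

# lag-4 difference: each value is compared with the same point in the previous cycle
seasonal = [series[i] - series[i - period] for i in range(period, len(series))]

print(lag1)
print(seasonal)  # all zeros, because the pattern repeats exactly
```

Real seasonal data will not repeat exactly, so the seasonal difference would typically leave small residual values rather than zeros.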

Difference Order

Temporal structure may still exist after performing a differencing operation, such as in the case of a nonlinear trend.

As such, the process of differencing can be repeated more than once until all temporal dependence has been removed.

The number of times that differencing is performed is called the difference order.
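
The sketch below shows why a second pass can be needed. It uses a made-up series with a quadratic trend (t squared) and a small helper, lag1_difference(), which is a hypothetical name introduced here for the example.

```python
# hypothetical series with a quadratic (nonlinear) trend
series = [t ** 2 for t in range(8)]

def lag1_difference(values):
    """Return the lag-1 difference of a list of numbers."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

first_order = lag1_difference(series)         # still increasing
second_order = lag1_difference(first_order)   # constant

print(first_order)   # [1, 3, 5, 7, 9, 11, 13]
print(second_order)  # [2, 2, 2, 2, 2, 2]
```

After one pass the values still trend upward; after the second pass (a difference order of 2) the trend is gone.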

Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

Download the dataset.

The example below loads and creates a plot of the loaded dataset.

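from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

# dates in the file look like "1-01" (year-month), so prefix "190" to get a four-digit year
def parser(x):
	return datetime.strptime('190'+x, '%Y-%m')

# load the file, parse the date index, then squeeze the single column to a Series
df = read_csv('shampoo-sales.csv', header=0, index_col=0)
df.index = [parser(x) for x in df.index]
series = df.squeeze('columns')
series.plot()
pyplot.show()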

Running the example creates the plot that shows a clear linear trend in the data.

Shampoo Sales Dataset Plot

Manual Differencing

We can difference the dataset manually.

This involves developing a new function that creates a differenced dataset. The function would loop through a provided series and calculate the differenced values at the specified interval or lag.

The function below named difference() implements this procedure.

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return Series(diff)

We can see that the function is careful to begin the differenced dataset after the specified interval to ensure differenced values can, in fact, be calculated. A default interval or lag value of 1 is defined. This is a sensible default.

One further improvement would be to also be able to specify the order or number of times to perform the differencing operation.
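
One possible way to add that improvement is sketched below. The order argument is an assumption introduced for this example; it is not part of the function used in the rest of the tutorial, and the test data is made up.

```python
from pandas import Series

# create a differenced series, repeating the operation `order` times
# (`order` is a hypothetical extra argument, not part of the tutorial's function)
def difference(dataset, interval=1, order=1):
    values = list(dataset)
    for _ in range(order):
        values = [values[i] - values[i - interval] for i in range(interval, len(values))]
    return Series(values)

# hypothetical data with a quadratic trend
data = [t ** 2 for t in range(6)]
print(difference(data).tolist())           # [1, 3, 5, 7, 9]
print(difference(data, order=2).tolist())  # [2, 2, 2, 2]
```

Each pass shortens the series by interval values, so a high order on a short series leaves very few observations.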

The example below applies the manual difference() function to the Shampoo Sales dataset.

from datetime import datetime
from pandas import read_csv
from pandas import Series
from matplotlib import pyplot

# dates in the file look like "1-01" (year-month), so prefix "190" to get a four-digit year
def parser(x):
	return datetime.strptime('190'+x, '%Y-%m')

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return Series(diff)

# load the file, parse the date index, then squeeze the single column to a Series
df = read_csv('shampoo-sales.csv', header=0, index_col=0)
df.index = [parser(x) for x in df.index]
series = df.squeeze('columns')
X = series.values
diff = difference(X)
pyplot.plot(diff)
pyplot.show()

Running the example creates the differenced dataset and plots the result.

Manually Differenced Shampoo Sales Dataset

Automatic Differencing

The Pandas library provides a function to automatically calculate the difference of a dataset.

This diff() function is provided on both the Series and DataFrame objects.

Like the manually defined difference function in the previous section, it takes an argument to specify the interval or lag, in this case called the periods.

The example below demonstrates how to use the built-in difference function on the Pandas Series object.

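from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

# dates in the file look like "1-01" (year-month), so prefix "190" to get a four-digit year
def parser(x):
	return datetime.strptime('190'+x, '%Y-%m')

# load the file, parse the date index, then squeeze the single column to a Series
df = read_csv('shampoo-sales.csv', header=0, index_col=0)
df.index = [parser(x) for x in df.index]
series = df.squeeze('columns')
diff = series.diff()
pyplot.plot(diff)
pyplot.show()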

As in the previous section, running the example plots the differenced dataset.

A benefit of using the Pandas function, in addition to requiring less code, is that it maintains the date-time information for the differenced series.

Automatic Differenced Shampoo Sales Dataset
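
The short sketch below illustrates the periods argument and the preserved index on a small made-up monthly series (the dates and values are assumptions for the example). It also shows that diff() leaves NaN in the leading positions where no earlier observation is available.

```python
from pandas import Series, date_range

# hypothetical monthly series; values and dates are made up
index = date_range('2001-01-01', periods=6, freq='MS')
series = Series([10.0, 12.0, 15.0, 19.0, 24.0, 30.0], index=index)

lag1 = series.diff()           # periods=1 by default
lag2 = series.diff(periods=2)  # difference against the value two steps back

# the first `periods` entries are NaN; drop them before further use
print(lag1.dropna())
print(lag2.dropna())
```

Both results keep the original date index, so each differenced value stays aligned with the month it belongs to.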

Summary

In this tutorial, you discovered how to apply the difference operation to time series data with Python.

Specifically, you learned:

About the difference operation, including the configuration of lag and order.
How to implement the difference transform manually.
How to use the built-in Pandas implementation of the difference transform.

Do you have any questions about differencing, or about this post?
Ask your questions in the comments below.

70 Responses to How to Difference a Time Series Dataset with Python

Srinath Jayachandran February 26, 2017 at 3:58 pm #

Hi there, here is a recent work on time series that gives a time series a symbolic representation.
https://arxiv.org/ftp/arxiv/papers/1611/1611.01698.pdf

- Jason Brownlee February 27, 2017 at 5:50 am #
  
  Thanks for sharing.
  
NL May 3, 2017 at 1:54 am #

Have a question. What if the difference is negative?

- Jason Brownlee May 3, 2017 at 7:39 am #
  
  Some differences will be positive, some negative.
  
  - Manuel Jose Alvarez December 4, 2017 at 12:06 pm #
    
    Hi, what would be the most Pythonic way to set negative differences to zero? Let's say that I have some bookings for t+1 and a forecast.
    
    - Jason Brownlee December 4, 2017 at 4:58 pm #
      
      My approach is to make it work first, then make it readable.
      
Hans June 16, 2017 at 1:06 pm #

Are difference functions only useful to remove structures like trends and seasonality,
or can they also be used to build features from trends in data sets?

What other techniques are available to use trends and seasonality in a constructive way in time series predictions?

- Jason Brownlee June 17, 2017 at 7:22 am #
  
  You can use the transformed variables and extracted structures as features, but check that they lift the skill of the model.
  
  See this post on feature engineering in time series forecasting:
  https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/
  
Chris June 22, 2017 at 2:04 pm #

Thanks for these posts, Dr. Brownlee! I like the picture of the beach

- Jason Brownlee June 23, 2017 at 6:39 am #
  
  Thanks Chris.
  
proxy list September 27, 2017 at 7:10 pm #

Hi there,I log on to your new stuff named “How to Difference a Time Series Dataset with Python – Machine Learning Mastery” regularly.Your humoristic style is awesome, keep up the good work! And you can look our website about proxy list.

- Jason Brownlee September 28, 2017 at 5:24 am #
  
  Thanks.
  
Kathy November 13, 2017 at 2:04 pm #

Thank you for valuable insights. Could you please explain how would it be possible to take the third or second difference ?

- Jason Brownlee November 14, 2017 at 10:07 am #
  
  You apply the difference operation to the already differenced series.
  
Qian December 3, 2017 at 1:16 pm #

for “value = int(dataset[i])-int(dataset[i-interval])”
why it shows “TypeError: only length-1 arrays can be converted to Python scalars”
thanks in advance！

- Jason Brownlee December 4, 2017 at 7:45 am #
  
  Perhaps ensure that you have copied all of the code from the example?
  
Mohammed March 24, 2018 at 12:07 pm #

Hi Jason, thanks for posting this, but I’m curious what to do about the NAs after using the diff() function? I’m guessing that data should just be removed? Or should they just be imputed?

- Jason Brownlee March 25, 2018 at 6:25 am #
  
  Removed.
  
HMEDA May 12, 2018 at 3:26 pm #

I TRIED TO RUN YOUR CODE, BUT I RECEIVED THIS MESSAGE
(data_string, format))
ValueError: time data '190Sales of shampoo over a three year period' does not match format '%Y-%m'

THANK YOU IN ADVANCE

- Jason Brownlee May 13, 2018 at 6:34 am #
  
  It looks like you might not have deleted the file footer or downloaded the data in a different format.
  
  Here is a direct link to the data file ready to use:
  https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv
  
Adamson June 8, 2018 at 1:22 am #

Do you perform differencing on just the output data or do you difference the features if they are time dependent as well?

- Jason Brownlee June 8, 2018 at 6:16 am #
  
  Both inputs and outputs.
  
Abhay Sharma June 14, 2018 at 1:57 pm #

How does one invert the differencing after the residual forecast has been made to get back to a forecast including the trend and seasonality that was differenced out?

- Jason Brownlee June 14, 2018 at 4:08 pm #
  
  Good question, I show how in this post:
  https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/
  
Elo November 10, 2018 at 1:12 am #

Copy Paste ?
https://www.m-asim.com/2018/10/12/how-to-difference-a-time-series-dataset-with-python/

Thanks for this awesome content by the way !

- Jason Brownlee November 10, 2018 at 6:07 am #
  
  That’s a shame. I’ll ask him to take it down. Google will also penalize him ferociously.
  
ENI November 27, 2018 at 12:37 am #

Doing this, I will have no value for the first observation, I mean Yt-Yt-1 will be my first value and I will have an observation less?

- Jason Brownlee November 27, 2018 at 6:35 am #
  
  Yes.
  
mohammad February 12, 2019 at 5:19 am #

How to undifference?

- Jason Brownlee February 12, 2019 at 8:08 am #
  
  Add the values back.
  
  - sohail May 20, 2021 at 6:41 pm #
    
    how to do that
    
    - Jason Brownlee May 21, 2021 at 5:58 am #
      
      This tutorial has an example of differencing and inverse differencing:
      https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
      
Tayyab May 23, 2019 at 1:26 am #

Hi Jason! As always a great tutorial.

I need to know, how to get the forecast values of unseen data if the data were differenced by first_order.

Detail:

I am doing univariate ARIMA forecasting for oil prices 3 times a day. The data was uneven so interpolated with forward-fill with an hourly rate. I did forecasting using first-order-differencing. To compare test_data and predictions, I reversed the predictions and test-data (integration).

Now the question is what I do when I don’t have test data but I have forecast unseen data. How would I integrate the predictions back to normal then the different predictions?

- Jason Brownlee May 23, 2019 at 6:05 am #
  
  The ARIMA model will perform the differencing and inverse-differencing for you via the d parameter.
  
  Otherwise, you can do it manually, here’s code to do it:
  https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
  
puneeth June 6, 2019 at 9:22 pm #

Can you please tell me how to extract the forecasted values in the graph? I got the predicted values, but I am not able to extract the forecasted values in Python using the ARIMA model.

predictions_ARIMA_diff=pd.Series(results_ARIMA.fittedvalues, copy=True)
print(predictions_ARIMA_diff.head())

predictions_ARIMA_diff_cumsum=predictions_ARIMA_diff.cumsum()
print(predictions_ARIMA_diff_cumsum.head())

predictions_ARIMA_log=pd.Series(ts_log[0],index=ts_log.index)
predictions_ARIMA_log=predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA_log.head()

# Next, take the exponent of the series from above (anti-log), which will be the predicted value, i.e. the time series forecast model.
# Now plot the predicted values with the original.

# Find the RMSE
predictions_ARIMA=np.exp(predictions_ARIMA_log)

plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f' % np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))

#Future Prediction
#Predict for 5 year. We have 144 data points + 60 for next 5 yrs. i.e. predict for 204 data points
results_ARIMA.plot_predict(1,204)

- Jason Brownlee June 7, 2019 at 7:58 am #
  
  You can plot a forecast using matplotlib, e.g. the plot() function.
  
Stephanie August 21, 2019 at 11:10 pm #

Hi Jason,

Can you perform differencing while also adding a lag of a variable (dependent or independent) in the equation?

Thanks

- Jason Brownlee August 22, 2019 at 6:27 am #
  
  Sure.
  
John W August 26, 2019 at 9:12 pm #

Great python tutorial on time series.

- Jason Brownlee August 27, 2019 at 6:42 am #
  
  Thanks! I’m glad it helped.
  
Gizo December 26, 2019 at 7:19 pm #

unable to run

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return Series(diff)

error on
value = dataset[i] - dataset[i - interval]
TypeError: unsupported operand type(s) for -: 'str' and 'str'

- Jason Brownlee December 27, 2019 at 6:32 am #
  
  Sorry to hear that, this might help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
- Gilles February 2, 2020 at 6:46 pm #
  
  Just read the error. Clearly dataset is an array of strings while it should be floats/ints.
  
Emma February 28, 2020 at 1:20 am #

Hi Jason, thanks for your very informative tutorials. I’m a PhD student using a time series of ocean data to create a multiple linear regression model (statsmodels GLSAR, as there is autocorrelation of residuals). I’m using the model to then predict past (rather than future) values, but these are for single data points rather than a continuous time series.

However, the dependent variable I am using is not stationary (shows seasonality), and the independent variables show a mix of trend, seasonality and stationarity.

I have a couple of questions:
1) If I want to remove stationarity, I assume I use a mix of differencing and removing trends where applicable, and then create my model. How do I then apply this model to my predictions? Is it the same, or do I need to add the trends/differences back before using it to predict, somehow?

2) I’m using an algorithm to find the combination of independent variables that give the highest R-squared value for my regression. Machine learning is growing in use in my speciality, and I would like to try it. Do you think this sounds suitable? I have 13 years of twice-daily data for training.

I hope this is clear, happy to answer any questions.

- Jason Brownlee February 28, 2020 at 6:16 am #
  
  Yes, differencing to remove trend, seasonal differencing to remove seasonality. Just like you propagate the differencing down the training set, you can also propagate it down the test set. Then invert the differencing on the predictions to get the original scale.
  
  I recommend testing a suite of methods and use controlled experiments to discover what works best.
  
soukaina February 28, 2020 at 8:49 pm #

Hi Jason, thank you for this great tutorial,
I would like to know how to difference a time series date attribute to get a series of durations ?

- Jason Brownlee February 29, 2020 at 7:12 am #
  
  Sorry, I don’t understand your question. Can you please elaborate?
  
  - soukaina March 1, 2020 at 5:03 am #
    
    I have a data in which I have two indexes, the id and date, each id has a series of dates, and I want to extract the difference between dates for each id.
    
    - Jason Brownlee March 1, 2020 at 5:25 am #
      
      Great. Sounds like you will need to develop some custom code.
      
      - soukaina March 1, 2020 at 7:11 am #
        
        what do you propose?
      - Jason Brownlee March 2, 2020 at 6:07 am #
        
        Developing custom code to meet the requirements of your project. Engineering, not machine learning.
        
        I don’t have the capacity to do engineering for you sorry. If it is challenging you can try posting your question to stackoverflow or hire an engineer?
Al-Baraa October 30, 2020 at 7:29 pm #

Hey Jason, love your tutorials! I wanted to ask you how we could plot the trend line if we difference. I’ve been looking everywhere online and I can’t find how?

- Jason Brownlee October 31, 2020 at 6:46 am #
  
  Thanks!
  
  If you plot the raw data (before differencing) you should be able to see the trend if present.
  
HH November 2, 2020 at 11:02 pm #

If I use the built-in differencing (diff = series.diff()),
how could I invert it?
Also is it possible to use built in differencing in multivariate data? how?

- Jason Brownlee November 3, 2020 at 6:54 am #
  
  You can see examples of differencing and inverse differencing here:
  https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
  
Shaheen November 30, 2020 at 11:42 am #

What is the discrepancy between what is referred to as ‘log difference’ and ‘first difference’ when differencing a time series? I am looking to use ACF/PACF with stationary/transformed data to estimate my ARIMA parameters but I keep running into these two ‘differences’ and I can’t tell if they’re used interchangeably or not. Also if there is a discrepancy how can we use log difference in our code using pandas?

PS: Jason, your website has helped me during my academic career and now in my early-career as an intern who is hoping to get a full time job soon – I just wanted to thank you for all the work you’ve done.

- Jason Brownlee December 1, 2020 at 6:15 am #
  
  I’ve not heard the terms, sorry.
  
  You can log the data, you can difference the data, and you can do both with different order.
  
Tarun January 9, 2021 at 9:54 pm #

Hi Jason,

Thanks for your tutorials. I have a question. After differencing, I am getting a stationary series and that I want to confirm it by doing an ADF Test. This is my code –

from pandas import read_csv
from statsmodels.tsa.stattools import adfuller
series = read_csv('D:/Management Books/BSE Index Daily Closing.csv', header=0, index_col=0, squeeze=True)
X = series.values
diff = series.diff()
X = diff(X)
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

When I run it, I get an error like

TypeError Traceback (most recent call last)
in
4 X = series.values
5 diff = series.diff()
----> 6 X = diff(X)
7 result = adfuller(X)
8 print('ADF Statistic: %f' % result[0])

TypeError: 'Series' object is not callable

Please lemme know how to rectify the error.

- Jason Brownlee January 10, 2021 at 5:40 am #
  
  You’re welcome.
  
  Perhaps try extracting the numpy array from the series after differencing?
  
  - Tarun January 11, 2021 at 6:00 pm #
    
    Hi Jason,
    
    Thanks a lot for your help. It worked. Bingo !!!!!!
    
    - Jason Brownlee January 12, 2021 at 7:48 am #
      
      Well done, I’m happy to hear that!
      
Aditya March 3, 2021 at 4:11 am #

Hi Jason

How do we convert the values back to original scale to be able to compare prediction with actual?

- Jason Brownlee March 3, 2021 at 5:38 am #
  
  You can call inverse_transform(), see this tutorial for more information:
  https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
  
Vanitha September 9, 2021 at 6:09 pm #

Hi Jason,

I like your tutorials very much and its very clear always. I am working on timeseries data(year wise).just started with univariate time series(values in million), there is trend but no seasonality and applied ARIMA model. Few questions in that.

1. I have applied my model without doing any feature engineering or scaling since it's univariate, and I got expected and predicted values nicely. Is this correct, or do I need to apply the diff() function to make the data stationary before applying the model?

2. Then ARIMA(p,d,q) adjusting this parameter i got finalized value like(1,2,0) based on MAPE metric. Is this Correct?
Here itself i can able to identify best value of p (Partial auto correlation), d(differencing value)
and q(Auto correlation).
3. Then what is the purpose of the Dickey-Fuller test and rolling mean?

4. If the third parameter q=0 means then it can be call it as ARIMA model?.
(if i give any value to q means it asked me to make data stationary. so i make it zero).

5. Can i use df.diff().diff()? . if so can i call it as 2 nd order difference?. Is this meaningful?
(then only the p-value become less than 0.5)

kindly answer my question . Awaiting for your valuable reply.

- Adrian Tam September 11, 2021 at 6:09 am #
  
  (1, 2) the I in ARIMA will do the differencing, hence it should make the data stationary by figuring out the right parameter (p,d,q). If you disagree with the fitted result, you can always override it.
  (3) The Dickey-Fuller test checks whether the series is stationary. Rolling mean is just another name for moving average.
  (4) q is the MA parameter, d=0 means stationary
  (5) Yes, that is the case of d=2
  
  - Vanitha September 11, 2021 at 2:46 pm #
    
    Thank you so much for your reply.
    
JK July 20, 2022 at 6:38 am #

Hi Jason,

Thank you for this amazing tutorial! I have one question after reading this: If the data after differencing by subtracting the previous observation from the current observation is seasonal, does that mean the original data is seasonal as well?

- James Carmichael July 20, 2022 at 9:03 am #
  
  Hi JK…The following may help add clarity:
  
  https://machinelearningmastery.com/time-series-seasonality-with-python/
  
  https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/
  
SNA November 17, 2023 at 2:57 am #

I have a question. If the data that we have is yearly (from 2009 to 2020), it is obvious that we have no seasonality, but according to the ADF test, it is non-stationary. Would differencing make the data stationary, or won't it make any difference?

- James Carmichael November 17, 2023 at 11:03 am #
  
  Hi SNA…We recommend that you try differencing and also not differencing and compare the results.
  
