You may have observations at the wrong frequency.

Maybe they are too granular or not granular enough. The Pandas library in Python provides the capability to change the frequency of your time series data.

In this tutorial, you will discover how to use Pandas in Python to both increase and decrease the sampling frequency of time series data.

After completing this tutorial, you will know:

- About time series resampling, the two types of resampling, and the 2 main reasons why you need to use them.
- How to use Pandas to upsample time series data to a higher frequency and interpolate the new observations.
- How to use Pandas to downsample time series data to a lower frequency and summarize the higher frequency observations.

Let’s get started.

**Update Dec/2016**: Fixed definitions of upsample and downsample.

## Resampling

Resampling involves changing the frequency of your time series observations.

Two types of resampling are:

- **Upsampling**: Where you increase the frequency of the samples, such as from minutes to seconds.
- **Downsampling**: Where you decrease the frequency of the samples, such as from days to months.

In both cases, data must be invented.

In the case of upsampling, care may be needed in determining how the fine-grained observations are calculated using interpolation. In the case of downsampling, care may be needed in selecting the summary statistics used to calculate the new aggregated values.

There are perhaps two main reasons why you may be interested in resampling your time series data:

- **Problem Framing**: Resampling may be required if your data is not available at the same frequency at which you want to make predictions.
- **Feature Engineering**: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

There is a lot of overlap between these two cases.

For example, you may have daily data and want to predict a monthly problem. You could use the daily data directly or you could downsample it to monthly data and develop your model.

A feature engineering perspective may use observations and summaries of observations from both time scales and more in developing a model.
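As a rough illustration of mixing time scales, the sketch below builds a feature table that holds each daily value alongside the mean of the previous month. The data is synthetic and the names `daily`, `features`, and `prev_month_mean` are made up for this example; the "MS" (month start) alias is my choice here, not something the tutorial prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 90 days of synthetic values (0, 1, 2, ...)
index = pd.date_range("2020-01-01", periods=90, freq="D")
daily = pd.Series(np.arange(90, dtype=float), index=index, name="value")

# Downsample to month-start bins and summarize each month with its mean
monthly_mean = daily.resample("MS").mean()

# Feature table on the daily scale: the raw value plus the previous month's mean
features = pd.DataFrame({"value": daily})
features["prev_month_mean"] = (
    monthly_mean.shift(1)  # shift so each month only sees the completed prior month
    .reindex(features.index, method="ffill")  # repeat the monthly value on each day
)

print(features.loc["2020-01-30":"2020-02-02"])
```

The first month has no earlier month to summarize, so its `prev_month_mean` values are NaN and would need to be dropped or filled before modeling.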

Let’s make resampling more concrete by looking at a real dataset and some examples.


## Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3 year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

Below is a sample of the first 5 rows of data, including the header row.

```
"Month","Sales"
"1-01",266.0
"1-02",145.9
"1-03",183.1
"1-04",119.3
"1-05",180.3
```

Below is a plot of the entire dataset taken from Data Market.

The dataset shows an increasing trend and possibly some seasonal components.

Download and learn more about the dataset here.

## Load the Shampoo Sales Dataset

Download the dataset and place it in the current working directory with the filename “*shampoo-sales.csv*“.

The timestamps in the dataset do not have an absolute year, but do have a month. We can write a custom date parsing function to load this dataset and pick an arbitrary year, such as 1900, to baseline the years from.

Below is a snippet of code that loads the Shampoo Sales dataset with *read_csv()* and applies the custom date parsing function to the index.

```python
from datetime import datetime
from pandas import read_csv, to_datetime
from matplotlib import pyplot

def parser(x):
    # Dates look like "1-01" (year offset, month); baseline the years from 1900
    return datetime.strptime('190' + x, '%Y-%m')

# Read the single data column as a Series, then parse the index with parser()
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = to_datetime(series.index.map(parser))
print(series.head())
series.plot()
pyplot.show()
```

Running this example loads the dataset and prints the first 5 rows. This shows the correct handling of the dates, baselined from 1900.

```
Month
1901-01-01    266.0
1901-02-01    145.9
1901-03-01    183.1
1901-04-01    119.3
1901-05-01    180.3
Name: Sales of shampoo over a three year period, dtype: float64
```

We also get a plot of the dataset, showing the rising trend in sales from month to month.

## Upsample Shampoo Sales

The observations in the Shampoo Sales are monthly.

Imagine we wanted daily sales information. We would have to upsample the frequency from monthly to daily and use an interpolation scheme to fill in the new daily frequency.

The Pandas library provides a function called *resample()* on the *Series* and *DataFrame* objects. This can be used to group records when downsampling and to make space for new observations when upsampling.

We can use this function to transform our monthly dataset into a daily dataset by calling *resample()* and specifying the calendar day frequency, “D”, as the new frequency.

Pandas is clever and you could just as easily specify the frequency as “1D” or even something domain specific, such as “5D.” See the further reading section at the end of the tutorial for the list of aliases that you can use.

```python
from datetime import datetime
from pandas import read_csv, to_datetime

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = to_datetime(series.index.map(parser))
# resample() only groups the data; asfreq() creates the new daily rows, filled with NaN
upsampled = series.resample('D').asfreq()
print(upsampled.head(32))
```

Running this example prints the first 32 rows of the upsampled dataset, showing each day of January and the first day of February.

```
Month
1901-01-01    266.0
1901-01-02      NaN
1901-01-03      NaN
1901-01-04      NaN
1901-01-05      NaN
1901-01-06      NaN
1901-01-07      NaN
1901-01-08      NaN
1901-01-09      NaN
1901-01-10      NaN
1901-01-11      NaN
1901-01-12      NaN
1901-01-13      NaN
1901-01-14      NaN
1901-01-15      NaN
1901-01-16      NaN
1901-01-17      NaN
1901-01-18      NaN
1901-01-19      NaN
1901-01-20      NaN
1901-01-21      NaN
1901-01-22      NaN
1901-01-23      NaN
1901-01-24      NaN
1901-01-25      NaN
1901-01-26      NaN
1901-01-27      NaN
1901-01-28      NaN
1901-01-29      NaN
1901-01-30      NaN
1901-01-31      NaN
1901-02-01    145.9
```

We can see that the *resample()* function has created the rows by putting NaN values in the new values. We can see we still have the sales volume on the first of January and February from the original data.
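To illustrate the frequency aliases mentioned earlier, here is a small sketch that does not need the CSV file. It rebuilds the first two monthly observations by hand (the variable names are made up for this example) and compares a “D” grid with a “5D” grid.

```python
import pandas as pd

# The first two observations from the dataset, typed in by hand
monthly = pd.Series(
    [266.0, 145.9],
    index=pd.to_datetime(["1901-01-01", "1901-02-01"]),
)

# One row per calendar day between the two observations
daily = monthly.resample("D").asfreq()

# One row every 5 days, counted from the first observation
every_5_days = monthly.resample("5D").asfreq()

print(len(daily))
print(every_5_days)
```

Note that with “5D” the February observation does not fall on the new grid, so *asfreq()* does not carry it over. That is worth keeping in mind when picking a multiple that does not divide evenly into the original spacing.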

Next, we can interpolate the missing values at this new frequency.

The *Series* Pandas object provides an *interpolate()* function to interpolate missing values, and there is a nice selection of simple and more complex interpolation functions. You may have domain knowledge to help choose how values are to be interpolated.

A good starting point is to use a linear interpolation. This draws a straight line between available data, in this case on the first of the month, and fills in values at the chosen frequency from this line.

```python
from datetime import datetime
from pandas import read_csv, to_datetime

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = to_datetime(series.index.map(parser))
upsampled = series.resample('D')
# Fill the new daily rows along a straight line between monthly observations
interpolated = upsampled.interpolate(method='linear')
print(interpolated.head(32))
```

Running this example, we can see interpolated values.

```
Month
1901-01-01    266.000000
1901-01-02    262.125806
1901-01-03    258.251613
1901-01-04    254.377419
1901-01-05    250.503226
1901-01-06    246.629032
1901-01-07    242.754839
1901-01-08    238.880645
1901-01-09    235.006452
1901-01-10    231.132258
1901-01-11    227.258065
1901-01-12    223.383871
1901-01-13    219.509677
1901-01-14    215.635484
1901-01-15    211.761290
1901-01-16    207.887097
1901-01-17    204.012903
1901-01-18    200.138710
1901-01-19    196.264516
1901-01-20    192.390323
1901-01-21    188.516129
1901-01-22    184.641935
1901-01-23    180.767742
1901-01-24    176.893548
1901-01-25    173.019355
1901-01-26    169.145161
1901-01-27    165.270968
1901-01-28    161.396774
1901-01-29    157.522581
1901-01-30    153.648387
1901-01-31    149.774194
1901-02-01    145.900000
```

Looking at a line plot, we see no difference from plotting the original data as the plot already interpolated the values between points to draw the line.

Another common interpolation method is to use a polynomial or a spline to connect the values.

This creates more curves and can look more natural on many datasets. Using a spline interpolation requires you to specify the order (the degree of the polynomial pieces); in this case, an order of 2 is just fine.

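```python
from datetime import datetime
from pandas import read_csv, to_datetime
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = to_datetime(series.index.map(parser))
upsampled = series.resample('D')
# Spline interpolation relies on SciPy being installed
interpolated = upsampled.interpolate(method='spline', order=2)
print(interpolated.head(32))
interpolated.plot()
pyplot.show()
```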

Running the example, we can first review the raw interpolated values.

```
Month
1901-01-01    266.000000
1901-01-02    258.630160
1901-01-03    251.560886
1901-01-04    244.720748
1901-01-05    238.109746
1901-01-06    231.727880
1901-01-07    225.575149
1901-01-08    219.651553
1901-01-09    213.957094
1901-01-10    208.491770
1901-01-11    203.255582
1901-01-12    198.248529
1901-01-13    193.470612
1901-01-14    188.921831
1901-01-15    184.602185
1901-01-16    180.511676
1901-01-17    176.650301
1901-01-18    173.018063
1901-01-19    169.614960
1901-01-20    166.440993
1901-01-21    163.496161
1901-01-22    160.780465
1901-01-23    158.293905
1901-01-24    156.036481
1901-01-25    154.008192
1901-01-26    152.209039
1901-01-27    150.639021
1901-01-28    149.298139
1901-01-29    148.186393
1901-01-30    147.303783
1901-01-31    146.650308
1901-02-01    145.900000
```

Reviewing the line plot, we can see more natural curves on the interpolated values.

Generally, interpolation is a useful tool when you have missing observations.
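As a small sketch of that use case, the example below fills gaps in a made-up daily series rather than in the shampoo data. It also shows the `limit` argument of *interpolate()*, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with missing observations
index = pd.date_range("2020-01-01", periods=6, freq="D")
series = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0], index=index)

# Fill every gap along a straight line between known neighbors
filled = series.interpolate(method="linear")

# Fill at most one consecutive missing value per gap, leaving longer gaps partly open
partly_filled = series.interpolate(method="linear", limit=1)

print(filled)
print(partly_filled)
```

Leaving long gaps unfilled can be the safer choice when you do not trust a straight line across many missing observations.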

Next, we will consider resampling in the other direction and decreasing the frequency of observations.

## Downsample Shampoo Sales

The sales data is monthly, but perhaps we would prefer the data to be quarterly.

The year can be divided into 4 business quarters, 3 months apiece.

Instead of creating new rows between existing observations, the *resample()* function in Pandas will group all observations by the new frequency.

We could use an alias like “3M” to create groups of 3 months, but this might have trouble if our observations did not start in January, April, July, or October. Pandas does have a quarter-aware alias of “Q” that we can use for this purpose. (Pandas 2.2 and later prefer the spellings “3ME” and “QE” for these month-end and quarter-end aliases.)

We must now decide how to create a new quarterly value from each group of 3 records. A good starting point is to calculate the average monthly sales numbers for the quarter. For this, we can use the *mean()* function.

Putting this all together, we get the following code example.

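```python
from datetime import datetime
from pandas import read_csv, to_datetime
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = to_datetime(series.index.map(parser))
# 'Q' is the quarter-end alias; pandas 2.2 and later prefer the spelling 'QE'
resample = series.resample('Q')
quarterly_mean_sales = resample.mean()
print(quarterly_mean_sales.head())
quarterly_mean_sales.plot()
pyplot.show()
```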

Running the example prints the first 5 rows of the quarterly data.

```
Month
1901-03-31    198.333333
1901-06-30    156.033333
1901-09-30    216.366667
1901-12-31    215.100000
1902-03-31    184.633333
Freq: Q-DEC, Name: Sales, dtype: float64
```

We also plot the quarterly data, showing Q1-Q4 across the 3 years of original observations.
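The mean is only one possible summary. As a sketch, the example below computes several statistics per quarter in one call using *agg()* on the first six monthly values, typed in by hand so the snippet runs without the CSV file. It uses the quarter-start alias “QS” rather than “Q”, which is my substitution to sidestep the alias renaming across Pandas versions; the groups are the same, only the labels move to the start of each quarter.

```python
import pandas as pd

# First six monthly observations of the dataset, entered manually
index = pd.date_range("1901-01-01", periods=6, freq="MS")
monthly = pd.Series(
    [266.0, 145.9, 183.1, 119.3, 180.3, 168.5], index=index, name="Sales"
)

# One row per quarter, one column per summary statistic
summary = monthly.resample("QS").agg(["mean", "sum", "min", "max"])
print(summary)
```

Which statistic to keep depends on the question: a total suits volume planning, while a minimum or maximum may matter more for capacity limits.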

Perhaps we want to go further and turn the monthly data into yearly data, and perhaps later use that to model the following year.

We can downsample the data using the alias “A” for year-end frequency (spelled “YE” in Pandas 2.2 and later) and this time use *sum()* to calculate the total sales each year.

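```python
from datetime import datetime
from pandas import read_csv, to_datetime
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = to_datetime(series.index.map(parser))
# 'A' is the year-end alias; pandas 2.2 and later prefer the spelling 'YE'
resample = series.resample('A')
yearly_total_sales = resample.sum()
print(yearly_total_sales.head())
yearly_total_sales.plot()
pyplot.show()
```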

Running the example shows the 3 records for the 3 years of observations.

We also get a plot, correctly showing the year along the x-axis and the total number of sales per year along the y-axis.

## Further Reading

This section provides links and further reading for the Pandas functions used in this tutorial.

- pandas.Series.resample API documentation for more on how to configure the resample() function.
- Pandas Time Series Resampling Examples for more general code examples.
- Pandas Offset Aliases used when resampling for all the built-in methods for changing the granularity of the data.
- pandas.Series.interpolate API documentation for more on how to configure the interpolate() function.

## Summary

In this tutorial, you discovered how to resample your time series data using Pandas in Python.

Specifically, you learned:

- About time series resampling, the difference between downsampling and upsampling, and the reasons for changing observation frequencies.
- How to upsample time series data using Pandas and how to use different interpolation schemes.
- How to downsample time series data using Pandas and how to summarize grouped data.

Do you have any questions about resampling or interpolating time series data or about this tutorial?

Ask your questions in the comments and I will do my best to answer them.

Kinda feel like you inverted upsampling and downsampling.

https://en.wikipedia.org/wiki/Upsampling

https://en.wikipedia.org/wiki/Decimation_(signal_processing)

Thanks Alex, fixed.

Hi,

in the upsample section, why did you write

upampled = series.resample(‘D’).mean()

(by the way, I assume it is _upsampled_, not upampled). I don’t understand why you need to put the mean if you are inserting NaNs. Wouldn’t it be sufficient just to write series.resample(‘D’)?

Hi David,

You are right, I’ve fixed up the examples.

Hello,

I think it is necessary to add “asfreq()”, i.e.:

upsampled = series.resample(‘D’).asfreq()

because in new versions of pandas resample is just a grouping operation and then you have to apply an aggregate function.

Jason, I have what’s hopefully a quick question that was prompted by the interpolation example you’ve given above.

I’ve been tasked with a monthly forecasting analysis. My original data is daily. If I aggregate it to month-level, this gives me only 24 usable observations so many models may struggle with that. It feels like I should be able to make more use of my richer, daily dataset for my problem.

I have heard somewhere (but can’t remember where or whether I imagined it!) that a workaround is to create “fake” monthly data by creating rolling sums say from 26th Dec to 26th January. So for December I would have 31 “fake months”, one starting on each day of December and ending on the corresponding day number in January. Is this a valid workaround for artificially increasing sample size in short time series for training models? I can see straight off the bat that autocorrelation is a massive issue but is it worth exploring or have I just dreamt that up.

Are there any other workarounds for working with short time series?

Thanks!

Hi Carmen,

Perhaps the 24 obs provide sufficient information for making accurate forecasts.

I would advise you to develop and evaluate a suite of different models and focus on those representations that produce effective results.

Your idea of fake months seems useful only if it can expose more or different information to the learning algorithms not available by other means/representations.

I’d love to hear how you go with your forecast problem.

I have a timeseries data where I am using resample technique to downsample my data from 15 minute to 1 hour. The data is quite large ( values every 15 minutes for 1 year) so there are more than 30k rows in my original csv file.

I am using:

df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

df = df.set_index('dt').resample('1H')['KWH'].first().reset_index()

but after resampling I only get first day and last day correctly, all the intermediate values are filled with NAN. Can you help point what I might be doing wrong.

Once resampled, you need to interpolate the missing data.

This is how my data looks before resampling :

24 01/01/16 06:00:04 4749.28 15.1 23.5 369.6 2016-01-01 06:00:04

25 01/01/16 06:15:04 4749.28 14.7 23.5 369.6 2016-01-01 06:15:04

26 01/01/16 06:30:04 4749.28 14.9 23.5 369.6 2016-01-01 06:30:04

27 01/01/16 06:45:04 4749.47 14.9 23.5 373.1 2016-01-01 06:45:04

28 01/01/16 07:00:04 4749.47 15.1 23.5 373.1 2016-01-01 07:00:04

29 01/01/16 07:15:04 4749.47 15.2 23.5 373.1 2016-01-01 07:15:04

… … … … … … …

2946 31/01/16 16:30:04 4927.18 15.5 24.4 373.1 2016-01-31 16:30:04

2947 31/01/16 16:45:04 4927.24 15.0 24.4 377.6 2016-01-31 16:45:04

2948 31/01/16 17:00:04 4927.30 15.2 24.4 370.5 2016-01-31 17:00:04

and this is how it looks after resampling:

df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

df = df.set_index('dt').resample('1H')['KWH','OCT','RAT','CO2'].first().reset_index()

17 2016-01-01 17:00:00 4751.62 15.0 23.8 370.9

18 2016-01-01 18:00:00 4751.82 15.1 23.6 369.2

19 2016-01-01 19:00:00 4752.01 15.3 23.6 375.4

20 2016-01-01 20:00:00 4752.21 14.8 23.6 370.1

21 2016-01-01 21:00:00 4752.61 15.0 23.8 369.2

22 2016-01-01 22:00:00 4752.80 15.2 23.7 369.6

23 2016-01-01 23:00:00 4753.00 15.7 23.5 372.3

24 2016-01-02 00:00:00 NaN NaN NaN NaN

25 2016-01-02 01:00:00 NaN NaN NaN NaN

26 2016-01-02 02:00:00 NaN NaN NaN NaN

27 2016-01-02 03:00:00 NaN NaN NaN NaN

28 2016-01-02 04:00:00 NaN NaN NaN NaN

29 2016-01-02 05:00:00 NaN NaN NaN NaN

… … … … …

8034 2016-11-30 18:00:00 NaN NaN NaN NaN

8035 2016-11-30 19:00:00 NaN NaN NaN NaN

8036 2016-11-30 20:00:00 NaN NaN NaN NaN

8037 2016-11-30 21:00:00 NaN NaN NaN NaN

8038 2016-11-30 22:00:00 NaN NaN NaN NaN

8039 2016-11-30 23:00:00 NaN NaN NaN NaN

8040 2016-12-01 00:00:00 4811.96 14.8 24.8 364.3

8041 2016-12-01 01:00:00 4812.19 15.1 24.8 376.7

8042 2016-12-01 02:00:00 4812.42 15.1 24.7 373.1

8043 2016-12-01 03:00:00 4812.66 15.2 24.7 372.7

8044 2016-12-01 04:00:00 4812.89 14.9 24.7 370.9
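One possible explanation for the output above, offered as a guess from the sample rather than a diagnosis: the raw rows end on 31/01/16, yet the resampled index runs to 2016-12-01. That pattern would appear if the Date column is day-first (dd/mm/yy) and it was parsed month-first, so that 02/01/16 became February 1st. The sketch below reproduces the effect on a few made-up rows (the KWH values for the second day are invented) and shows an explicit `format` as one way around it.

```python
import pandas as pd

# Hypothetical rows in the same layout as the comment, with day-first dates
df = pd.DataFrame({
    "Date": ["01/01/16", "01/01/16", "02/01/16", "02/01/16"],
    "Time": ["06:00:04", "06:15:04", "06:00:04", "06:15:04"],
    "KWH": [4749.28, 4749.28, 4755.10, 4755.30],  # second-day values are made up
})
stamp = df["Date"] + " " + df["Time"]

# Without a format, an ambiguous date such as 02/01/16 is read month-first
ambiguous = pd.to_datetime(stamp)

# Stating the format removes the ambiguity: day, then month, then 2-digit year
explicit = pd.to_datetime(stamp, format="%d/%m/%y %H:%M:%S")

print(ambiguous.iloc[2], "vs", explicit.iloc[2])

# Hourly downsample on the correctly parsed index, keeping the first reading per hour
hourly = df.assign(dt=explicit).set_index("dt")["KWH"].resample("60min").first()
print(hourly.dropna())
```

Hours with no readings still come back as NaN after the downsample, which is where interpolation or a forward fill would then apply.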

Do you really think it makes sense to take monthly sales in January of 266 bottles of shampoo, then resample that to daily intervals and say you had sales of 266 bottles on the 1st Jan, 262.125806 bottles on the 2nd Jan ?

No, it is just an example of how to use the API.

The domain/domain experts may indicate suitable resampling and interpolation schemes.

Instead of interpolating when resampling monthly sales to the daily interval, is there a function that would instead fill the daily values with the daily average sales for the month? This would be useful for data that represent aggregated values, where the sum of the dataset should remain constant regardless of the frequency… For example, if I need to upsample rainfall data, then the total rainfall needs to remain the same. Are there built-in functions that can do this?

Sure, you can do this. You will have to write some code though.
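A minimal sketch of one way to write that code, assuming the monthly values are totals indexed by the first day of each month. The rainfall numbers and variable names are invented for illustration. Each monthly total is repeated on every day of its month and then divided by the number of days in that month, so the daily values add back up to the original totals.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly rainfall totals, indexed by month start
monthly_total = pd.Series(
    [62.0, 28.0, 93.0],
    index=pd.date_range("2021-01-01", periods=3, freq="MS"),
)

# Daily index that covers every day of every month in the data
last_day = monthly_total.index[-1] + pd.offsets.MonthEnd(0)
days = pd.date_range(monthly_total.index[0], last_day, freq="D")

# Repeat each monthly total across its days, then divide by the month length
repeated = monthly_total.reindex(days, method="ffill")
daily = repeated / np.asarray(days.days_in_month)

print(daily.head())
print(daily.sum(), monthly_total.sum())
```

Summing `daily` back to months recovers the original totals, which is the property the question asks for.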

We just had an intern do this with rainfall data. It’s not too hard! Thanks Jason for the helpful guide, this was just what I was searching for!

Glad to hear it!

Hello Jason,

Thanks for a nice post. In my time series data, I have two feature columns i.e. Latitude and Longitude and index is datetime.

Since these GPS coordinates are captured at infrequent time intervals, I want to resample my data in the fixed time interval bin, for example: one GPS coordinate in every 5sec time interval.

Is there a way to do it?

Sounds like you could use a linear interpolation for time and something like linear for the spatial coordinates.
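A rough sketch of that idea, assuming a DataFrame with `Latitude` and `Longitude` columns and a datetime index as described in the question (the coordinates below are made up). Fixes are binned into 5-second intervals, averaged within each bin, and the empty bins are filled by interpolating linearly in time.

```python
import pandas as pd

# Hypothetical GPS fixes captured at uneven intervals
gps = pd.DataFrame(
    {"Latitude": [10.0, 10.3, 10.4], "Longitude": [20.0, 20.6, 20.8]},
    index=pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 00:00:15",
        "2020-01-01 00:00:20",
    ]),
)

# Bin into 5-second intervals, average any fixes sharing a bin,
# then fill the empty bins by interpolating along the time axis
grid = (
    gps.resample(pd.Timedelta(seconds=5))
    .mean()
    .interpolate(method="time")
)
print(grid)
```

Averaging within a bin labels the result at the start of the bin, which is an approximation when fixes do not land exactly on the grid. Straight-line interpolation of latitude and longitude is also only reasonable over short distances.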

Hello Jason,

I have a question regarding down sampling data from daily to weekly or monthly data,

If my data is multivariate time series for example it has a categorical variables and numeric variables, how can I do the down sampling for each column automatically, is there a simple way of doing this?

Thanks in advance

You may need to do each column one at a time.
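One way to handle the columns together, sketched on invented data: *agg()* on the resampled object accepts a dictionary that maps each column to its own summary. Here the numeric columns get a mean and a sum, and the categorical column gets its most frequent value through a small helper (`most_common` is a name made up for this example).

```python
import pandas as pd

# Hypothetical daily data starting on a Monday: two numeric columns, one categorical
index = pd.date_range("2020-01-06", periods=14, freq="D")
df = pd.DataFrame(
    {
        "temperature": [float(i) for i in range(14)],
        "rain_mm": [1.0] * 14,
        "status": ["ok"] * 10 + ["fault"] * 4,
    },
    index=index,
)

def most_common(values):
    # Most frequent category in the group; assumes the group is not empty
    return values.mode().iloc[0]

# Weekly bins (weeks ending on Sunday), with a different summary per column
weekly = df.resample("W").agg(
    {"temperature": "mean", "rain_mm": "sum", "status": most_common}
)
print(weekly)
```

Other reasonable choices for a categorical column are "first" or "last", depending on whether the state at the start or end of the period matters more.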

Hi Jason,

Can we use (if so, how) resampling to balance 2 unequal classes in the data? Example, in predicting stock price direction, the majority class will be “1” (price going up) and minority class will be “-1” (price going down). Problem is that the classifier may predict most or all labels as “1” and still have a high accuracy, thereby showing a bias towards the majority class.

Is there a way to fix this?

Yes, this post suggests some algorithms for balancing classes:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

I don’t have material on balancing classes for sequence classification though.

hi im using the code below is this correct my data is a signal stored in a single row

from scipy import signal

resample_signal=scipy.signal.resample(x,256)

plt.plot(resample_signal)

I don’t know. If the plot looks good to you, then yes.

Thanks! That was really helpful, but my problem is a bit different. I have data recorded at random time intervals and I need to interpolate values at 5-min timesteps, as shown below:

Input:

——-

2018-01-01 00:04 | 10.00

2018-01-01 00:09 | 12.00

2018-01-01 00:12 | 10.00

2018-01-01 00:14 | 15.00

2018-01-01 00:18 | 20.00

The needed output:

————————

2018-01-01 00:00 | 08.40

2018-01-01 00:05 | 10.40

2018-01-01 00:10 | 11.90

2018-01-01 00:15 | 16.10

2018-01-01 00:20 | 21.50

Hope that is clear enough!

Really appreciate your help!

You might need to read up on the resample/interpolate API in order to customize the tool for this specific case.
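As a starting point, here is a sketch using the five input rows from the question. It adds the 5-minute grid points to the index, interpolates with `method="time"` so the uneven gaps between timestamps are respected, and then keeps only the grid points. The values will not necessarily match the table in the question, whose scheme is not stated, and grid points outside the observed range (00:00 and 00:20) are left out because interpolation does not extrapolate.

```python
import pandas as pd

# The irregular observations from the question
observed = pd.Series(
    [10.0, 12.0, 10.0, 15.0, 20.0],
    index=pd.to_datetime([
        "2018-01-01 00:04",
        "2018-01-01 00:09",
        "2018-01-01 00:12",
        "2018-01-01 00:14",
        "2018-01-01 00:18",
    ]),
)

# Regular 5-minute grid that lies inside the observed time range
grid = pd.date_range("2018-01-01 00:05", "2018-01-01 00:15", freq="5min")

# Add the grid points as NaN rows, interpolate by elapsed time, keep the grid only
combined = observed.reindex(observed.index.union(grid))
on_grid = combined.interpolate(method="time").reindex(grid)
print(on_grid)
```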

Hi! I’m trying to get a percentage comparison of CPI between two years. In this particular case, I have data with columns:

‘Date’ (one date per week of year, for three years)

‘CPI’

and others that for this are not important.

The thing is I have to divide each CPI by its year-ago-value. For example, if I have the CPI of week 5 year 2010, I have to divide it by CPI of week 5 year 2009.

I’ve already managed to get the week of the year and year of each observation, but I can’t figure out how to get the observation needed, as they are both observations from the same data frame. Any help will be really appreciated.

Sorry, I’m not intimately familiar with your dataset. I don’t know how I can help exactly.

So sorry. I thought I attached a part. This is a header of the data (not sure if it will do for “intimately familiarization” but hope it does clarify):

Date CPI

05-02-2010 211.0963582

12-02-2010 211.2421698

19-02-2010 211.2891429

26-02-2010 211.3196429

05-03-2010 211.3501429

12-03-2010 211.3806429

19-03-2010 211.215635

26-03-2010 211.0180424

02-04-2010 210.8204499

09-04-2010 210.6228574

16-04-2010 210.4887

23-04-2010 210.4391228

30-04-2010 210.3895456
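A possible approach, sketched on synthetic numbers rather than the data above: if there is exactly one observation per week with no gaps, the value from a year earlier sits 52 rows back, so *shift(52)* lines each CPI up with its year-ago counterpart. The no-gaps assumption is mine; with missing weeks, matching on year and week number would be safer.

```python
import pandas as pd

# Hypothetical weekly CPI: two years of Fridays with a steadily rising value
dates = pd.date_range("2009-02-06", periods=104, freq="7D")
cpi = pd.Series([200.0 + 0.5 * i for i in range(104)], index=dates, name="CPI")

# Value from 52 weekly observations earlier, aligned to the current row
year_ago = cpi.shift(52)

# Ratio of each CPI to its year-ago value (NaN for the first year)
ratio = cpi / year_ago
print(ratio.dropna().head())
```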

Hello Jason,

Thank you for the helpful guide. I am currently working to interpolate daily stock returns from weekly returns. I know I have to keep the total cumulative return constant but I am still confused about the procedure. Could you give me some hints on how to write my function?

Thanks

What problem are you having exactly? Do the examples not help?

can i solve this problem with LSTMs? and how to do that?

The LSTM can interpolate. You can train the model as a generator and use it to generate the next point given the prior input sequence.

Hi ,

How to take care of categorical variables while re-sampling.

Good question, persist them forward.
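A small sketch of persisting categories forward when upsampling, on invented weekly data: the numeric column is interpolated, while the categorical column is forward-filled so each new daily row inherits the most recent known category.

```python
import pandas as pd

# Hypothetical weekly data with one numeric and one categorical column
weekly = pd.DataFrame(
    {"price": [1.0, 2.0], "regime": ["low", "high"]},
    index=pd.to_datetime(["2020-01-06", "2020-01-13"]),
)

# Upsample to daily rows; the new rows start out as NaN in every column
daily = weekly.resample("D").asfreq()

# Numeric values: straight line between the known points
daily["price"] = daily["price"].interpolate(method="linear")

# Categorical values: carry the last known category forward
daily["regime"] = daily["regime"].ffill()
print(daily)
```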