How to Predict Room Occupancy Based on Environmental Factors

Small computers, such as Arduino devices, can be used within buildings to record environmental variables from which simple and useful properties can be predicted.

One example is predicting whether a room or rooms are occupied based on environmental measures such as temperature, humidity, and related measures.

This is a type of common time series classification problem called room occupancy classification.

In this tutorial, you will discover a standard multivariate time series classification problem for predicting room occupancy using the measurements of environmental variables.

After completing this tutorial, you will know:

  • The Occupancy Detection standard time series classification problem in machine learning.
  • How to load and visualize multivariate time series classification data.
  • How to develop simple naive and logistic regression models that achieve nearly perfect skill on the problem.

Kick-start your project with my new book Deep Learning for Time Series Forecasting, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Oct/2018: Updated description of the source of the dataset (I really messed that up), thanks Luis Candanedo.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Occupancy Detection Problem Description
  2. Data Visualization
  3. Concatenated Dataset
  4. Simple Predictive Models

Occupancy Detection Problem Description

A standard time series classification data set is the “Occupancy Detection” problem available on the UCI Machine Learning repository.

It is a binary classification problem which requires that an observation of environmental factors such as temperature and humidity be used to classify whether a room is occupied or unoccupied.

The dataset is described in the 2016 paper “Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models” by Luis M. Candanedo and Véronique Feldheim.

The dataset was collected by monitoring an office with a suite of environmental sensors and using a camera to determine if the room was occupied.

An office room […] was monitored for the following variables: temperature, humidity, light and CO2 levels. A microcontroller was employed to acquire the data. A ZigBee radio was connected to it and was used to transmit the information to a recording station. A digital camera was used to determine if the room was occupied or not. The camera time stamped pictures every minute and these were studied manually to label the data.

— Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models, 2016.

Data is provided with date-time information and six environmental measures taken each minute over multiple days, specifically:

  • Temperature in Celsius.
  • Relative humidity as a percentage.
  • Light measured in lux.
  • Carbon dioxide measured in parts per million.
  • Humidity ratio, derived from temperature and relative humidity measured in kilograms of water vapor per kilogram of air.
  • Occupancy as either 1 for occupied or 0 for not occupied.

This dataset has been used in many simple modeling machine learning papers. For example, see the paper “Visible Light Based Occupancy Inference Using Ensemble Learning,” 2018 for further references.

Need help with Deep Learning for Time Series?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Data Visualization

The data is available in CSV format in three files, claimed to be a split of data for training, validation and testing.

The three files are as follows:

  • datatest.txt (test): From 2015-02-02 14:19:00 to 2015-02-04 10:43:00
  • datatraining.txt (train): From 2015-02-04 17:51:00 to 2015-02-10 09:33:00
  • datatest2.txt (val): From 2015-02-11 14:48:00 to 2015-02-18 09:19:00

What is obvious at first is that the split in the data is not contiguous in time and that there are gaps.

The test dataset is before the train and validation datasets in time. Perhaps this was an error in the naming convention of the files. We can also see that the data extends from Feb 2 to Feb 18, which spans 17 calendar days, not 20.

Download the files from here and place them in your current working directory:

Each file contains a header line, but includes a column for the row number that does not include an entry in the header line.

In order to load the data files correctly, update the header line of each file to read as follows:

From:

To:

Below is a sample of the first five lines of datatraining.txt file with the modification.

We can then load the data files using the Pandas read_csv() function, as follows:

Once loaded, we can create a plot for each of the six series, clearly showing the separation of the three datasets in time.

The complete example is listed below.

Running the example creates a plot with a different color for each dataset:

  • datatest.txt (test): Blue
  • datatraining.txt (train): Orange
  • datatest2.txt (val): Green

We can see the small gap between the test and train sets and the larger gap between the train and validation sets.

We can also see corresponding structures (peaks) in the series for each variable with the room occupancy.

Line Plot Showing Time Series Plots for all variables and each dataset

Line Plot Showing Time Series Plots for all variables and each dataset

Concatenated Dataset

We can simplify the dataset by preserving the temporal consistency of the data and concatenating all three sets into a single dataset, dropping the “no” column.

This will allow ad hoc testing of simple direct framings of the problem (in the next section) that can be tested on a temporally consistent way with ad hoc train/test set sizes.

Note: This simplification does not account for the temporal gaps in the data and algorithms that rely on a sequence of observations at prior time steps may require a different organization of the data.

The example below loads the data, concatenates it into a temporally consistent dataset, and saves the results to a new file named “combined.csv“.

Running the example saves the concatenated dataset to the new file ‘combined.csv‘.

Simple Predictive Models

The simplest formulation of the problem is to predict occupancy based on the environmental conditions at the current time.

I refer to this as a direct model as it does not make use of the observations of the environmental measures at prior time steps. Technically, this is not sequence classification, it is just a straight classification problem where the observations are temporally ordered.

This seems to be the standard formulation of the problem from my skim of the literature, and disappointingly, the papers seem to use the train/validation/test data as labeled on the UCI website.

We will use the combined dataset described in the previous section and evaluate model skill by holding back the last 30% of the data as a test set. For example:

Next, we can evaluate some models of the dataset, starting with a naive prediction model.

Naive Model

A simple model for this formulation of the problem is to predict the most prominent class outcome.

This is called the Zero Rule, or the naive prediction algorithm. We will evaluate predicting all 0 (unoccupied) and all 1 (occupied) for each example in the test set and evaluate the approach using the accuracy metric.

Below is a function that will perform this naive prediction given a test set and a chosen outcome variable

The complete example is listed below.

Running the example prints the naive prediction and the related score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the baseline score is about 82% accuracy by predicting all 0, e.g. all no occupancy.

For any model to be considered skilful on the problem, it must achieve a skill of 82% or better.

Logistic Regression

A skim of the literature shows a range of sophisticated neural network models applied on this problem.

To start with, let’s try a simple logistic regression classification algorithm.

The complete example is listed below.

Running the example fits a logistic regression model on the training dataset and predicts the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The skill of the model is about 99% accurate, showing skill over the naive method.

Normally, I would recommend centering and normalizing the data prior to modeling, but some trial and error demonstrated that a model on the unscaled data was more skilful.

This is an impressive result at first glance.

Although the test-setup is different to that presented in the research literature, the reported skill of a very simple model outperforms more sophisticated neural network models.

Feature Selection and Logistic Regression

A closer look at the time series plot shows a clear relationship between the times when the rooms are occupied and peaks in the environmental measures.

This makes sense and explains why this problem is in fact so easy to model.

We can further simplify the model by testing a simple logistic regression model on each environment measure in isolation. The idea is that we don’t need all of the data to predict occupancy; that perhaps just one of the measures is sufficient.

This is the simplest type of feature selection where a model is created and evaluated with each feature in isolation. More advanced methods may consider each subgroup of features.

The complete example testing a logistic model with each of the five input features in isolation is listed below.

Running the example prints the feature index, name, and the skill of a logistic model trained on that feature and evaluated on the test set.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that only the “Light” variable is required in order to achieve 99% accuracy on this dataset.

It is very likely that the lab rooms in which the environmental variables were recorded had a light sensor that turned internal lights on when the room was occupied.

Alternately, perhaps the light is recorded during the daylight hours (e.g. sunshine through windows), and the rooms are occupied on each day, or perhaps each week day.

At the very least, the results of this tutorial ask some hard questions about any research papers that use this dataset, as clearly it is not a challenging prediction problem.

Extensions

This data may still be interesting for further investigation.

Some ideas include:

  • Perhaps the problem would be more challenging if the light column was removed.
  • Perhaps the problem can be framed as a true multivariate time series classification where lag observations are used in the model.
  • Perhaps the clear peaks in the environmental variables can be exploited in the prediction.

I tried each of these models briefly without exciting results.

If you explore any of these extensions or find some examples online, let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered a standard multivariate time series classification problem for predicting room occupancy using the measurements of environmental variables.

Specifically, you learned:

  • The Occupancy Detection standard time series classification problem in machine learning.
  • How to load and visualize multivariate time series classification data.
  • How to develop simple naive and logistic regression models that achieve nearly perfect skill on the problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Develop Deep Learning models for Time Series Today!

Deep Learning for Time Series Forecasting

Develop Your Own Forecasting models in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Deep Learning for Time Series Forecasting

It provides self-study tutorials on topics like:
CNNs, LSTMs, Multivariate Forecasting, Multi-Step Forecasting and much more...

Finally Bring Deep Learning to your Time Series Forecasting Projects

Skip the Academics. Just Results.

See What's Inside

9 Responses to How to Predict Room Occupancy Based on Environmental Factors

  1. Avatar
    Miha August 30, 2018 at 5:11 pm #

    Hi! Awesome blog!. Thanks! One thing, why don’t you use plt.tight_layout() at the end of the script in order to have non-overlapping graphs?

  2. Avatar
    Star October 25, 2018 at 12:48 am #

    i have this code:

    to calculate logistic regression and show its result but when i run it in pycharm i see this error:

    what should i do?

  3. Avatar
    Sushmitha John August 3, 2020 at 4:52 pm #

    Can you help me find 5 hypothesis for the data set?

    • Avatar
      Jason Brownlee August 4, 2020 at 6:35 am #

      Sorry, I don’t understand your question. Perhaps you can elaborate.

  4. Avatar
    Navneet January 28, 2022 at 3:35 am #

    Hi Jason, Isn’t predicting occupancy based on AQI technically incorrect? Because AQI is a dependent variable and Occupancy independent. Shouldn’t we be predicting AQI based on occupancy?

Leave a Reply