We live in a world drowning in data. Internet tracking, stock market movement, genome sequencing technologies and their ilk all produce enormous amounts of data.
Most of this data is someone else’s responsibility, generated by someone else, stored in someone else’s database, which is maintained and made available by… you guessed it… someone else.
But. Whenever we carry out a machine learning project we are working with a small subset of all the data which is out there.
Whether you generate your own data, or use publicly available data, your results must be reproducible. And the reproducibility of an analysis depends crucially on data management.
What is Data Management?
Data management is the process of storing, handling and securing raw data and any associated metadata.
This process includes:
- Identifying appropriate data for your analysis
- Downloading the data
- Reformatting as necessary
- Cleaning the data
- Storing data in an appropriate repository
- Backing up the data
- Annotating with metadata
- Maintaining the data
- Making the data available to those with whom you want to share it
- Protecting the data from malicious or accidental access
The first four points in the list above have been addressed in other posts to this blog. In this post we look at procedures for dealing with a working dataset.
Well organised, well documented, preserved and shared data are invaluable to advance scientific inquiry and to increase opportunities for learning and innovation.
Why Not Leave It To Someone Else?
If you are generating your own data, the need for data management is clear. However, even if you are using someone else’s data you still need to have well-thought-out data management policies and procedures in place.
Most online databases are growing continually, and often exponentially. If you have generated a result with data available today, in a month your dataset will represent only part of a much larger one, and in two years it will probably only be a fraction of the current database. In order to produce reproducible results you must download, clean and secure your own dataset.
The Data Management Process By Numbers
1. Storing data in an appropriate repository
There are several different types of data repository with different uses. These include, from simplest to most complex:
Database: “An organized body of related information” (definition)
The key term here is organized. A simple relational database stores records, and the most basic information about the relationships between records, in a well-defined, structured manner. Structure is key: a database does not necessarily need to have any knowledge built into it about what the stored data are or what they mean. Many types of data can even be stored as Binary Large Objects (BLOBs).
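As a rough illustration of "structure without meaning", here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and helper functions (samples, measurements, build_demo_db and so on) are invented for this example. The database enforces the relationship between the two tables, but treats each payload as an opaque BLOB.

```python
import sqlite3


def build_demo_db(path=":memory:"):
    """Create a tiny relational database with two related tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the declared relationship
    conn.execute(
        "CREATE TABLE samples ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE measurements ("
        " id INTEGER PRIMARY KEY,"
        " sample_id INTEGER NOT NULL REFERENCES samples(id),"
        " payload BLOB NOT NULL)"  # opaque bytes: the database does not interpret them
    )
    return conn


def add_measurement(conn, sample_name, payload):
    """Store raw bytes against a named sample, creating the sample if needed."""
    conn.execute("INSERT OR IGNORE INTO samples (name) VALUES (?)", (sample_name,))
    (sample_id,) = conn.execute(
        "SELECT id FROM samples WHERE name = ?", (sample_name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO measurements (sample_id, payload) VALUES (?, ?)",
        (sample_id, payload),
    )
    conn.commit()
    return sample_id


def payloads_for(conn, sample_name):
    """Return every stored payload for one sample, in insertion order."""
    rows = conn.execute(
        "SELECT m.payload FROM measurements AS m"
        " JOIN samples AS s ON s.id = m.sample_id"
        " WHERE s.name = ? ORDER BY m.id",
        (sample_name,),
    ).fetchall()
    return [bytes(row[0]) for row in rows]
```

Calling build_demo_db() with no argument gives a throwaway in-memory database; pass a file path to keep the data on disk.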
Data warehouse: “An integrated repository of data from multiple, possibly heterogeneous data sources, presented with consistent and coherent semantics” (definition)
With a data warehouse, semantics are added to the data structures. Semantic algorithms are an attempt to add meaning to the data, often in the form of an ontology. An ontology is based on a controlled vocabulary, with clearly defined relationships between vocabulary terms (for example, “alcohol dehydrogenase” isA “protein”).
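To make the isA idea concrete, here is a toy sketch in Python. The two-entry vocabulary is purely illustrative: the first relationship comes from the example above, and the second ("protein" isA "macromolecule") is added only to show that isA links can be followed transitively. Real ontologies are far richer than this.

```python
# A toy controlled vocabulary: each term maps to its direct isA parents.
# Only the first entry comes from the text; the second is an illustrative assumption.
IS_A = {
    "alcohol dehydrogenase": {"protein"},
    "protein": {"macromolecule"},
}


def is_a(term, ancestor, relations=IS_A):
    """Return True if `term` is `ancestor` or reaches it by following isA links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue  # guard against cycles in a badly formed vocabulary
        seen.add(current)
        stack.extend(relations.get(current, ()))
    return False
```

With this vocabulary, is_a("alcohol dehydrogenase", "macromolecule") is true even though that relationship was never stated directly, which is the kind of inference that semantics make possible.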
Data archive: “Storage of document versions kept for historical or reference purposes” (definition)
Data archives are generally kept on reliable media, and do not necessarily have to be quick to access, since the data is being kept for historical purposes. Many organizations require that data is kept for a specific period of time, even after it has been analyzed and the results published.
Data integration: “The process of combining data residing at different sources and providing the user with a unified view of these data” (definition)
The first three types of repository generally deal with a single type of data: employment records, or protein-protein interaction data, say. Data integration is not so much a repository as a set of algorithms for combining different data types into a single dataset, in order to permit more useful analyses. For example, combining demographic data (age, sex, BMI, etc.) with blood test results and economic data can give you much greater insight into your health than blood tests alone.
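The simplest form of integration is a join on a shared identifier. The sketch below is one possible approach in plain Python, assuming each source is a dictionary keyed by a record ID. The function name integrate and the field names in the usage note are hypothetical.

```python
def integrate(*sources):
    """Merge several {record_id: {field: value}} sources into one unified view.

    Only records present in every source are kept (an inner join). A field
    name that appears in two sources for the same record is treated as an
    error rather than silently overwritten.
    """
    if not sources:
        return {}
    shared_ids = set(sources[0])
    for source in sources[1:]:
        shared_ids &= set(source)

    merged = {}
    for record_id in sorted(shared_ids):
        row = {}
        for source in sources:
            overlap = row.keys() & source[record_id].keys()
            if overlap:
                raise ValueError(
                    f"field name clash for {record_id!r}: {sorted(overlap)}"
                )
            row.update(source[record_id])
        merged[record_id] = row
    return merged
```

For example, integrate({"p1": {"age": 40}}, {"p1": {"glucose": 5.1}}) combines the demographic and blood test fields for p1 into a single row. Real integration problems are harder, because identifiers and units rarely line up this neatly.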
2. Backing up the data
Everyone knows that all hard drives should be backed up regularly, although a frightening proportion of people don’t do so. So do it! Data backup should include:
- Daily incremental backups to a different hard drive, or even a USB stick. There are dozens of backup solutions, both free and proprietary. Wikipedia has a reasonably comprehensive list.
- File synchronization. Synchronization software does not attempt to back up everything from one drive to another, but keeps track of which copy of each file was updated most recently, and propagates that version. If a file has been updated in two places since the last synchronization, most applications will ask the user which one to choose. See the Wikipedia list.
For what it’s worth, I use Unison, and have been very happy with its performance.
As well as daily incremental backups, full backups should be performed periodically. One copy of the backup should be stored onsite, for quick access in case of a disaster, and at least one offsite, in case of a real disaster. If you have a hard disk failure and need to access your files, they should be in a drawer. If your office burns down, the files should be at your home, or your Mum’s home.
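To show what "incremental" means in practice, here is a minimal sketch that copies only the files that are new or whose contents differ from the existing backup copy. It uses only the Python standard library and is no substitute for a proper backup tool. The function name incremental_backup is invented for this example, and the backup directory is assumed to live outside the source directory.

```python
import filecmp
import shutil
from pathlib import Path


def incremental_backup(source_dir, backup_dir):
    """Copy new or changed files from source_dir into backup_dir.

    Returns the relative paths of the files that were copied.
    Assumes backup_dir is not located inside source_dir.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    copied = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source_dir)
        dst = backup_dir / relative
        # shallow=False compares file contents, not just size and timestamp.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue  # already backed up and unchanged
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(relative.as_posix())
    return copied
```

Comparing contents is slower than comparing timestamps, but it is harder to fool. Note that this sketch never deletes anything from the backup and keeps no history of older versions, both of which a real backup solution will handle for you.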
3. Annotating with metadata
When we first download data, its source and meaning are blindingly obvious to us. We know why we wanted it, what we did with it, and what it means. After a couple of months, though, this understanding might not be so clear.
Metadata are data about data. Metadata can include information such as who generated the data, when they were generated, when they were downloaded, which analyses they were used for, what experimental conditions were used, what papers they were used in, and whether there are any known problems with the data.
There are a number of community-based organizations which aim to specify the minimum information required to reproduce data, particularly complex data such as that generated by modern molecular biology experiments.
Metadata annotation seems like a real slog, but if your results are worth reporting, they are worth reproducing, and metadata are essential for long-term understanding of the raw data.
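One low-effort way to start is to keep a small "sidecar" file next to each data file. The sketch below writes one as JSON. The field names are my own choice, loosely following the kinds of information listed above, and do not follow any community standard; if a minimum information standard exists for your kind of data, use that instead.

```python
import json
from datetime import date
from pathlib import Path


def write_metadata(data_path, *, source, generated_by, downloaded_on,
                   known_problems=(), used_in=()):
    """Write a JSON sidecar file describing data_path and return its path.

    The field names here are illustrative, not a published standard.
    """
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    record = {
        "file": data_path.name,
        "source": source,
        "generated_by": generated_by,
        "downloaded_on": downloaded_on.isoformat(),
        "known_problems": list(known_problems),
        "used_in": list(used_in),
    }
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

A call such as write_metadata("interactions.tsv", source="...", generated_by="...", downloaded_on=date.today()) produces interactions.tsv.meta.json alongside the data, so the annotation travels with the file when it is copied or backed up.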
4. Maintaining the Data
Once you have your data selected, cleaned, properly stored, annotated and backed up you might be excused for thinking that the hard part is over. Of course, it’s not that simple; it never is. Data must be maintained. Maintenance includes:
- Adding new data (and appropriate metadata)
- Updating existing data (and associated metadata)
- Dealing with errors as they become apparent (and updating metadata)
If you have a one-off dataset, the first point might be moot, but the last two are important. Stored data should reflect your current best understanding of the problem, and any updates, changes, or discards should be recorded in the metadata.
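Recording changes does not need elaborate tooling. As a sketch, the functions below append one line per change to a plain-text log in JSON Lines format. The action names and the log layout are assumptions made for this example rather than any established convention.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical set of actions, mirroring the maintenance tasks listed above.
ALLOWED_ACTIONS = {"add", "update", "correct", "discard"}


def log_change(log_path, action, description, when=None):
    """Append one change record to a JSON Lines log and return the record."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    when = when or datetime.now(timezone.utc)
    entry = {"when": when.isoformat(), "action": action, "description": description}
    # Append mode means earlier entries are never rewritten.
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_changes(log_path):
    """Return all recorded changes, oldest first."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because the log is append-only, it doubles as a history of how your understanding of the dataset has changed over time.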
Another important issue in data maintenance is that of media. Storage media become obsolete with truly frightening speed.
Today, most datasets are stored on hard disks and backed up to DVDs or USB sticks. These media will inevitably become obsolete, in fewer years than now appears possible, and the machine learning practitioner must be very much aware of this trend. Particularly in the case of archival data sets, which may not be accessed very often, regular checks to make sure that the data are still readable and in a modern format are essential.
For over a decade, magnetic tapes from the 1976 Viking Mars landing were unprocessed. When later analyzed, the data was unreadable as it was in an unknown format and the original programmers had either died or left NASA.
— From the Wikipedia article on the digital dark age.
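The regular readability checks mentioned above can be partly automated. One common approach, sketched here with the Python standard library, is to record a checksum for every archived file and re-check the files periodically. The function names and the JSON manifest layout are invented for this example. Note that a checksum tells you the bytes are intact, not that anyone can still interpret the format, so it addresses only half of the Viking problem.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir, manifest_path):
    """Record a checksum for every file under archive_dir.

    Keep manifest_path outside archive_dir so the manifest does not list itself.
    """
    archive_dir = Path(archive_dir)
    manifest = {
        path.relative_to(archive_dir).as_posix(): sha256_of(path)
        for path in sorted(archive_dir.rglob("*"))
        if path.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


def check_archive(archive_dir, manifest_path):
    """Return {relative_path: problem} for files that are missing, unreadable or changed."""
    archive_dir = Path(archive_dir)
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    problems = {}
    for name, expected in manifest.items():
        path = archive_dir / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        try:
            actual = sha256_of(path)
        except OSError:
            problems[name] = "unreadable"
            continue
        if actual != expected:
            problems[name] = "changed"
    return problems
```

An empty result from check_archive means every file listed in the manifest is still present and byte-for-byte identical to what was archived.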
5. Making the data available to those, and only to those, with whom you want to share it
Many publication forums insist that the data used to generate the reported results are made available to interested readers. Even outside of publication, you may want to share data with friends and colleagues, to permit deeper analysis.
The catch, of course, is that data made freely available may be abused. Read-only data may be downloaded and used by competitors or people who will misrepresent it to further a cause which may not be yours. Data with read / write access may be modified either by malicious individuals, or accidentally by you or your colleagues.
Data security and access control is a huge area of research, and much of the material available is pretty technical. Basically, most database management systems have built-in access control, at varying levels of granularity, usually in the familiar form of accounts and passwords. Data made available via websites may not have these built-in protections, and appropriate safeguards must be implemented on a case-by-case basis.
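Proper access control is a job for your database management system or web framework, but guarding against accidental modification is easy to illustrate. The sketch below opens a SQLite database in read-only mode using Python's sqlite3 module and SQLite's mode=ro URI option. The helper name open_read_only is invented here. This is not authentication: anyone with write access to the file itself can still change it.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite database so that writes through this connection fail."""
    # mode=ro is a SQLite URI option; uri=True tells sqlite3 to parse it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Queries work as usual on the returned connection, while any INSERT, UPDATE or DELETE raises sqlite3.OperationalError. Handing colleagues a helper like this for analysis work reduces the chance of the accidental changes described above.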
Every data manager walks the line between security and sharing.
Data Management Tutorials and Recipes
An absolute beginner’s introduction to backup (this one is a bit old, so the technology has moved on somewhat, but the basic principles are sound):
A (Mostly) Gentle Introduction to Computer Security, by Todd Austin from the University of Michigan:
- The UK Data Archive has a checklist for data management, available here.
- An excellent article on the Digital Dark Age, when storage media become unreadable, is available on Wikipedia.
- How quickly have computer storage media aged? Check out this article.
- For a compendium of Minimum Information standards for Biology and Biomedicine see The Open Biological and Biomedical Ontologies.
- The Open Web Application Security Project has a nice overview of access control for Web applications, called the access control cheat sheet.