What if I have more Columns than Rows in my dataset?
Machine learning datasets are often structured or tabular data comprised of rows and columns.
The columns that are fed as input to a model are called predictors or “p” and the rows are samples “n”. Most machine learning algorithms assume that there are many more samples than there are predictors, denoted as p << n.
Sometimes, this is not the case, and there are many more predictors than samples in the dataset, referred to as “big-p, little-n” and denoted as p >> n. These problems often require specialized data preparation and modeling algorithms to address them correctly.
In this tutorial, you will discover the challenge of big-p, little-n (p >> n) machine learning problems.
After completing this tutorial, you will know:
- Most machine learning problems have many more samples than predictors and most machine learning algorithms make this assumption during the training process.
- Some modeling problems have many more predictors than samples, referred to as p >> n.
- Algorithms to explore when modeling machine learning datasets with more predictors than samples.
Kick-start your project with my new book Master Machine Learning Algorithms, including step-by-step tutorials and the Excel Spreadsheet files for all examples.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Predictors (p) and Samples (n)
- Machine Learning Assumes p << n
- How to Handle p >> n
Predictors (p) and Samples (n)
Consider a predictive modeling problem, such as classification or regression.
The dataset is structured data or tabular data, like what you might see in an Excel spreadsheet.
There are columns and rows. Most of the columns would be used as inputs to a model and one column would represent the output or variable to be predicted.
The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable—in this case, sales—is often called the response or dependent variable, and is typically denoted using the symbol Y.
— Page 15, An Introduction to Statistical Learning with Applications in R, 2017.
Each column represents a variable or one aspect of a sample. The columns that represent the inputs to the model are called predictors.
Each row represents one sample with values across each of the columns or features.
- Predictors: Input columns of a dataset, also called input variables or features.
- Samples: Rows of a dataset, also called an observation, example, or instance.
It is common to describe a training dataset in machine learning in terms of the predictors and samples.
The number of predictors in a dataset is described using the term “p” and the number of samples in a dataset is described using the term “n” or sometimes “N”.
- p: The number of predictors in a dataset.
- n: The number of samples in a dataset.
To make this concrete, let’s take a look at the iris flowers classification problem.
Below is a sample of the first five rows of this dataset.
```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
...
```
This dataset has five columns and 150 rows.
The first four columns are inputs and the fifth column is the output, meaning that there are four predictors.
We would describe the iris flowers dataset as:
- p=4, n=150.
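If you are working in Python, a quick sanity check confirms p and n from the shape of the data. This is a minimal sketch using the copy of the iris dataset bundled with scikit-learn (an assumption for convenience; any loaded copy of the dataset would do):

```python
# Confirm p (predictors) and n (samples) for the iris dataset.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n, p = X.shape  # rows are samples, columns are predictors
print(f"p={p}, n={n}")  # p=4, n=150
```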
Machine Learning Assumes p << n
It is almost always the case that the number of predictors (p) will be smaller than the number of samples (n).
Often much smaller.
We can summarize this expectation as p << n, where “<<” is a mathematical inequality that means “much less than.”
- p << n: Typically we have many fewer predictors than samples.
To demonstrate this, let’s look at a few more standard machine learning datasets:
- Pima Indians Diabetes: p=8, n=768
- Glass Identification: p=9, n=214
- Boston Housing: p=13, n=506
Most machine learning algorithms operate based on the assumption that there are many more samples than predictors.
One way to think about predictors and samples is to take a geometrical perspective.
Consider a hypercube where the number of predictors (p) defines the number of dimensions of the hypercube. The volume of this hypercube is the scope of possible samples that could be drawn from the domain. The samples (n) are the actual points drawn from the domain that you must use to model your predictive modeling problem.
This is a rationale for the axiom “get as much data as possible” in applied machine learning. It is a desire to gather a sufficiently representative sample of the p-dimensional problem domain.
As the number of dimensions (p) increases, the volume of the domain increases exponentially. This, in turn, requires more samples (n) from the domain to provide effective coverage of the domain for a learning algorithm. We don’t need full coverage of the domain, just what is likely to be observable.
This challenge of effectively sampling high-dimensional spaces is generally referred to as the curse of dimensionality.
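To make the exponential growth concrete, consider how much of each dimension a sub-region must span to cover a fixed fraction of a unit hypercube. The short sketch below (plain Python; the 10 percent target is an arbitrary illustrative choice) shows the required edge length approaching 1 as p grows, meaning that "local" neighborhoods stop being local:

```python
# Edge length e of a sub-cube covering 10% of a unit hypercube satisfies e**p = 0.10,
# so e = 0.10 ** (1 / p): the higher the dimension, the wider the "neighborhood".
for p in [1, 2, 10, 100, 1000]:
    edge = 0.10 ** (1.0 / p)
    print("p=%4d: edge length to cover 10%% of the volume = %.3f" % (p, edge))
```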
Machine learning algorithms overcome the curse of dimensionality by making assumptions about the data and structure of the mapping function from inputs to outputs. They add a bias.
The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and that those complications are harder to discern. The only way to beat the curse is to incorporate knowledge about the data that is correct.
— Page 15, Pattern Classification, 2000.
Machine learning algorithms that use distance measures and other local models (in feature space) often degrade in performance as the number of predictors increases.
When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large.
— Page 168, An Introduction to Statistical Learning with Applications in R, 2017.
It is not always the case that the number of predictors is less than the number of samples.
How to Handle p >> n
Some predictive modeling problems have more predictors than samples by definition.
Often many more predictors than samples.
This is often described as “big-p, little-n,” “large-p, small-n,” or more compactly as “p >> n”, where the “>>” is a mathematical inequality operator that means “much greater than.”
… prediction problems in which the number of features p is much larger than the number of observations N, often written p >> N.
— Page 649, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
Consider this from a geometrical perspective.
Now, instead of having a domain with tens of dimensions (or fewer), the domain has many thousands of dimensions and only a few tens of samples from this space. We cannot expect to have anything like a representative sample of the domain.
Many examples of p >> n problems come from the field of medicine, where there is a small patient population and a large number of descriptive characteristics.
At the same time, applications have emerged in which the number of experimental units is comparatively small but the underlying dimension is massive; illustrative examples might include image analysis, microarray analysis, document classification, astronomy and atmospheric science.
— Statistical challenges of high-dimensional data, 2009.
A common example of a p >> n problem is gene expression arrays, where there may be thousands of genes (predictors) and only tens of samples.
Gene expression arrays typically have 50 to 100 samples and 5,000 to 20,000 variables (genes).
— Expression Arrays and the p >> n Problem, 2003.
Given that most machine learning algorithms assume many more samples than predictors, this introduces a challenge when modeling.
Specifically, the assumptions made by standard machine learning models may cause the models to behave unexpectedly, provide misleading results, or fail completely.
… models cannot be used “out of the box”, since the standard fitting algorithms all require p<n; in fact the usual rule of thumb is that there be five or ten times as many samples as variables.
— Expression Arrays and the p >> n Problem, 2003.
A major risk when using machine learning models on p >> n problems is overfitting the training dataset.
Given the lack of samples, most models are unable to generalize and instead learn the statistical noise in the training data. This makes the model perform well on the training dataset but perform poorly on new examples from the problem domain.
This is also a hard problem to diagnose, as the lack of samples does not allow for a test or validation dataset by which model overfitting can be evaluated. As such, it is common to use leave-one-out style cross-validation (LOOCV) when evaluating models on p >> n problems.
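As a sketch of what that evaluation might look like in scikit-learn, the snippet below scores a simple model with LOOCV on a synthetic p >> n classification dataset (the dataset sizes and the choice of a ridge classifier are illustrative assumptions, not recommendations):

```python
# Evaluate a model with leave-one-out cross-validation (LOOCV) on a synthetic
# p >> n classification dataset: 10,000 predictors, 50 samples.
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, n_features=10000, n_informative=20, random_state=1)
model = RidgeClassifier()
scores = cross_val_score(model, X, y, scoring="accuracy", cv=LeaveOneOut(), n_jobs=-1)
print("Mean LOOCV accuracy: %.3f" % scores.mean())
```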
There are many ways to approach a p >> n type classification or regression problem.
Some examples include:
Ignore p and n
One approach is to ignore the p and n relationship and evaluate standard machine learning models.
This might be considered the baseline method by which any other more specialized interventions may be compared.
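A minimal sketch of this baseline: fit a few standard scikit-learn models on a synthetic p >> n dataset and compare their LOOCV accuracy (the particular models and dataset sizes below are illustrative choices):

```python
# Baseline: evaluate standard models directly on a synthetic p >> n dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, n_features=5000, n_informative=20, random_state=1)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "cart": DecisionTreeClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=LeaveOneOut(), n_jobs=-1)
    print("%s: %.3f" % (name, scores.mean()))
```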
Feature Selection
Feature selection involves selecting a subset of predictors to use as input to predictive models.
Common techniques include filter methods that select features based on their statistical relationship to the target variable (e.g. correlation), and wrapper methods that select features based on their contribution to a model when predicting the target variable (e.g. RFE).
A suite of feature selection methods could be evaluated and compared, perhaps applied in an aggressive manner to dramatically reduce the number of input features to those determined to be most critical.
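A rough sketch of both flavors with scikit-learn on a synthetic p >> n dataset: a filter method (ANOVA F-test via SelectKBest) and a wrapper method (RFE around logistic regression). Keeping 50 features is an arbitrary illustrative choice:

```python
# Filter-based and wrapper-based feature selection on a synthetic p >> n dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, n_features=5000, n_informative=20, random_state=1)

# Filter method: keep the 50 features most related to the target (ANOVA F-test).
X_filter = SelectKBest(score_func=f_classif, k=50).fit_transform(X, y)
print("Filter method:", X_filter.shape)

# Wrapper method: recursive feature elimination around a simple model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=50, step=500)
X_wrapper = rfe.fit_transform(X, y)
print("Wrapper method:", X_wrapper.shape)
```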
Feature selection is an important scientific requirement for a classifier when p is large.
— Page 658, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
For more on feature selection, see the dedicated tutorial on this site.
Projection Methods
Projection methods create a lower-dimensional representation of samples that preserves the relationships observed in the data.
They are often used for visualization, although their dimensionality-reduction nature may also make them useful as a data transform to reduce the number of predictors.
This might include techniques from linear algebra, such as SVD and PCA.
When p > N, the computations can be carried out in an N-dimensional space, rather than p, via the singular value decomposition …
— Page 659, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
It might also include manifold learning algorithms often used for visualization such as t-SNE.
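A short sketch of such a transform with scikit-learn: project a synthetic p >> n dataset onto a handful of components with PCA and truncated SVD. Note that at most n components can be retained; the choice of 10 below is illustrative:

```python
# Reduce a p >> n dataset to a handful of components with PCA and truncated SVD.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD

X, y = make_classification(n_samples=50, n_features=5000, n_informative=20, random_state=1)

X_pca = PCA(n_components=10).fit_transform(X)
X_svd = TruncatedSVD(n_components=10).fit_transform(X)
print("PCA:", X_pca.shape)  # (50, 10)
print("SVD:", X_svd.shape)  # (50, 10)
```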
Regularized Algorithms
Standard machine learning algorithms may be adapted to use regularization during the training process.
Regularization penalizes models based on the number of features used or the magnitude of the feature weights, encouraging the model to perform well while minimizing the number of predictors it relies on.
This can act as a type of automatic feature selection during training and may involve augmenting existing models (e.g. regularized linear regression and regularized logistic regression) or the use of specialized methods such as LARS and LASSO.
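As a minimal sketch of the idea, the snippet below fits a Lasso (L1-penalized linear regression) on a synthetic p >> n regression dataset; the L1 penalty drives most coefficients to exactly zero, which acts as automatic feature selection (the alpha value is an untuned illustrative default):

```python
# Lasso (L1-regularized) regression on a synthetic p >> n regression dataset.
from numpy import count_nonzero
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=50, n_features=5000, n_informative=20, noise=0.1, random_state=1)

model = Lasso(alpha=1.0)
model.fit(X, y)
print("Non-zero coefficients: %d of %d" % (count_nonzero(model.coef_), X.shape[1]))
```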
There is no single best method; it is recommended to use controlled experiments to test a suite of different methods.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Expression Arrays and the p >> n Problem, 2003.
- Statistical challenges of high-dimensional data, 2009.
Books
- Pattern Classification, 2000.
- Chapter 18, High-Dimensional Problems: p >> N, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
- An Introduction to Statistical Learning with Applications in R, 2017.
Summary
In this tutorial, you discovered the challenge of big-p, little-n (p >> n) machine learning problems.
Specifically, you learned:
- Machine learning datasets can be described in terms of the number of predictors (p) and the number of samples (n).
- Most machine learning problems have many more samples than predictors and most machine learning algorithms make this assumption during the training process.
- Some modeling problems have many more predictors than samples, such as problems from medicine, referred to as p >> n, and may require the use of specialized algorithms.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Interesting blog and great information on Machine learning.
Thanks!
Thank you. How about data augmentation? I often use it when dealing with p>>n. In many of my experiments it seemed to work well.
Yes, it may offer a path toward oversampling the dataset and giving more purchase on the decision boundary/mapping function.
Practical examples for p >> n?
Gene expression datasets.
Thank you for your great article. For structured data formed by columns and rows, we can determine p and n. For unstructured data such as images, e.g. the MNIST dataset, is p equal to 28x28=784? If the images are in color, is p = 28x28x3=2352? Am I right? Thanks
Yes, structured data. The concept breaks down for analog data (images/text/audio/…).
Had a “little P, littler N” problem where the standard physical model for the underlying decay curve had more parameters than observations, so standard curve fitting approaches were out. We used Chebyshev polynomials: we used the data we had to fit an appropriate family with only 2 parameters, which over our operating regime was an adequate model.
Very cool, thanks for sharing!
Sure! Great article and love what you’re publishing – great resources!
Thanks.
Thanks for the article! What is the best approach to handle insufficient sample size when forecasting for multiple sequences simultaneously with an RNN?
RNNs typically require a lot of data.
Perhaps you can explore data augmentation to artificially expand your dataset size.
Thanks for this article! Currently, I am working on an ML model which would predict plastic production for the next 5 years based on the data collected from the past 10 years. Since I have fewer rows and many columns, how should I approach this particular problem?
Hi Vinu,
If you are representing your input data as a time series, I would recommend the following resources:
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
https://machinelearningmastery.com/how-to-develop-convolutional-neural-network-models-for-time-series-forecasting/
https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/
Thanks for the great article.
I am often interested in so-called “big-p, small-n (p >> n)” datasets with many features and small sample sizes, that is, gene expression data.
In general, machine learning seems to assume n >> p; my data is the opposite.
I would like to do feature extraction using unsupervised learning.
Would it be useful to use network analysis (PageRank) or PCAUFE (PCA-based unsupervised feature extraction) to find important features, or is there a better, more specific solution?
Please let me know if you have any important pointers.
Best regards.
Hi YM…You may find the following of interest:
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
So you mean to use ANOVA for feature selection.
Thank you very much.
Hello Brownlee,
Your post is really helpful for me because I am handling high-dimensional data.
I have data with n = 347 and p = 15,000 (this is optical spectrometer data of materials), and it is a regression problem with two outputs. I tried different machine learning methods (linear regression, HUR regression, random forest, gradient boosting for regression, automatic relevance determination, …) and found that multiple linear regression is the best (using nested CV, not overfitting, …).
I like your idea above of “ignoring p and n”. My questions are:
1) Does multiple linear regression stand a chance on high-dimensional data (many authors say linear regression is unsuitable)?
2) Can I trust my result that linear regression is the best choice for my high-dimensional data?
Thank you in advance.
Best wishes
Cuong
Hi Cuong…The following resources may be of interest related to this topic:
https://medium.com/swlh/all-you-need-to-know-about-handling-high-dimensional-data-7197b701244d
https://stats.stackexchange.com/questions/471726/why-does-machine-learning-work-for-high-dimensional-datan-ll-p
Thank you for your suggestions.
I read two posts, but the questions are still open.
It is said that models using regularization techniques are suitable for high-dimensional data. But in my case, (normal) linear regression is better than the models using regularization techniques.
I intend to publish this result, but it contradicts basic knowledge, which leaves me confused!
Do you have any advice?