Last Updated on August 15, 2020
What if you could use a predictive model to find new combinations of attributes that do not exist in the data but could be valuable?
In Chapter 10 of Applied Predictive Modeling, Kuhn and Johnson provide a case study that does just this. It’s a fascinating and creative example of how to use a predictive model.
In this post we will discover this less obvious use of a predictive model and the types of experimental design to which it belongs.
Compressive Strength of Concrete Mixtures
The problem modeled in the case study is the compressive strength of different concrete mixtures. Each record in the data is described by the amounts of ingredients of a concrete mixture, such as:
- Fly ash
- Blast furnace slag
- Coarse aggregate
- Fine aggregate
The property of interest from the resulting mixture is the compressive strength of the concrete. Strong concrete with less or cheaper ingredients is desirable.
Refer to Chapter 10 of Applied Predictive Modeling for deeper insight into the problem.
A range of machine learning methods, from simple to complex, are spot-checked on this regression problem, such as:
- Linear Regression
- Radial basis function Support Vector Machines (SVM)
- Neural Networks
- Regression Trees (CART and conditional inference trees)
- Bagged and Boosted decision trees
Model accuracy was considered in terms of the RMSE and the R^2 of the predictions. Some of the better performing methods were Neural Networks, Boosted Decision Trees, Cubist and Random Forest.
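Spot-checking a suite of regression models and comparing them by RMSE and R^2 can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic stand-in data, not the actual concrete dataset or the R/caret code from the book:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the concrete data: 8 ingredient/age columns and a
# non-linear strength response with some noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(300, 8))
y = 30 * X[:, 0] + 15 * X[:, 1] ** 2 - 10 * X[:, 0] * X[:, 2] + rng.normal(0, 1.0, 300)

for name, model in [("Linear Regression", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(random_state=42))]:
    # Evaluate each candidate model with 5-fold cross validation on both metrics.
    scores = cross_validate(model, X, y, cv=5,
                            scoring=("neg_root_mean_squared_error", "r2"))
    rmse = -scores["test_neg_root_mean_squared_error"].mean()
    r2 = scores["test_r2"].mean()
    print(f"{name}: RMSE={rmse:.2f}, R^2={r2:.2f}")
```

On the real data, the same loop would simply include the other candidates (SVM, neural network, boosted trees, and so on).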
Optimizing Compressive Strength
This is the clever part of the case study.
After accurate models were created and selected (Neural Network and Cubist models), the models were used to locate new mixture quantities that resulted in improved concrete compressive strength.
This involved using a direct search method (also called pattern search), the Nelder-Mead algorithm, to search the parameter space for a combination of mixture quantities that, when passed to the predictive model, predicted a compressive strength greater than any in the dataset.
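The idea of searching a fitted model with Nelder-Mead can be sketched with scipy. This is a toy illustration with synthetic data and a gradient boosting surrogate standing in for the book's Neural Network and Cubist models; the 7 "ingredient" columns are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.ensemble import GradientBoostingRegressor

# Fit a surrogate model on synthetic "mixture" data (7 hypothetical ingredients).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 7))
y = -np.sum((X - 0.7) ** 2, axis=1)  # strength peaks when each amount is near 0.7
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Nelder-Mead minimizes, so negate the predicted strength to maximize it.
def negative_strength(mixture):
    return -model.predict(mixture.reshape(1, -1))[0]

start = X[np.argmax(y)]  # start the search from the best mixture in the data
result = minimize(negative_strength, start, method="Nelder-Mead")
print("candidate mixture:", np.round(result.x, 3))
print("predicted strength:", -result.fun)
```

In practice the search would be repeated from many starting points, and any constraints on the ingredient amounts (non-negativity, proportions summing to a total) would need to be handled as well.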
A number of new mixtures were discovered and plotted in a projected domain relative to the provided data. These new mixtures represent the basis for actual commercial experiments that could be performed in order to find an improved concrete mixture.
Response Surface Methodology
The approach is related to a specific type of experimental design called Response Surface Methodology (RSM).
RSM is used when you want to develop, improve or optimize a process for a new or existing product. It's commonly used in industrial settings. It is used for problems where the relationship between the inputs and the output is not well understood and needs to be estimated.
Designed experiments are performed in order to collect examples of the inputs and the response variable or variables. The input variables may be quantities or timings in a process, and the output or response variable is something desirable from the result, like strength or quality.
The statistical model is constructed to approximate the relationship between the independent variables and the dependent variable, and finally an optimization process explores new combinations of inputs to maximize the output variable.
A critical step prior to performing the designed experiments is to reduce the number of variables to only those factors known to influence the response variable. This is a form of feature selection with which we are very familiar in machine learning.
Simple models are used to model the functional relationship, such as first- or second-order polynomials. The method is called response surface because of the continuous nature of the response for many problems and how it can be plotted as a surface over two input dimensions.
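Fitting a second-order polynomial response surface is straightforward with scikit-learn. A minimal sketch on synthetic data with two input factors (the coefficients and noise level are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two input factors and a response with curvature plus a little noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = (5 + 2 * X[:, 0] - 3 * X[:, 1] - 4 * X[:, 0] ** 2
     + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 50))

# A second-order response surface: terms 1, x1, x2, x1^2, x1*x2, x2^2.
surface = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
surface.fit(X, y)
print("R^2 on training data:", round(surface.score(X, y), 3))
```

Because the fitted surface is a quadratic, its stationary point (a candidate optimum) can also be found analytically from the coefficients, which is part of the appeal of low-order polynomial models in classical RSM.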
Surrogate modeling is when the model constructed in RSM is used in place of a physical experiment or an expensive simulation of the problem. For example, in aviation, rather than building and testing physical aircraft wings, you can design wings in software, test them in simulators, model the results of those simulations, and use the model to estimate promising new designs to test.
The models may be more elaborate to capture the complex non-linear relationships between the inputs and response variable. For example, Support Vector Machines and Neural Networks may be used. Additionally, more powerful direct search methods that use stochastic processes may be used, such as simulated annealing or evolutionary algorithms.
The overall process may be something like:
- Reduce the number of variables involved
- Design experiments and execute them sequentially to collect source data to model
- Construct a surrogate model from the experimental data
- Apply a search method to the variables using the surrogate model
- Sequentially perform experiments based on the optimized predictions of the surrogate model
- Iterate Steps 3 to 5 until a stopping condition is met
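The iterative loop in Steps 3 to 5 can be sketched as follows. A toy example where a hidden function stands in for the physical experiment, and a Gaussian process is an arbitrary choice of surrogate (the book's case study used Neural Networks and Cubist):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor

# A hidden "true" process standing in for a physical experiment.
def run_experiment(x):
    return float(-np.sum((x - 0.6) ** 2))

rng = np.random.default_rng(7)
X = rng.uniform(0.0, 1.0, size=(10, 3))       # step 2: initial designed experiments
y = np.array([run_experiment(x) for x in X])

for _ in range(5):                            # steps 3-5, iterated
    surrogate = GaussianProcessRegressor().fit(X, y)            # step 3
    best = X[np.argmax(y)]
    res = minimize(lambda x: -surrogate.predict(x.reshape(1, -1))[0],
                   best, method="Nelder-Mead")                  # step 4
    new_y = run_experiment(res.x)                               # step 5: run it
    X, y = np.vstack([X, res.x]), np.append(y, new_y)           # grow the dataset

print("best response found:", round(y.max(), 4))
```

Each iteration refits the surrogate on all experiments so far, searches it for a promising candidate, and "runs" that candidate as the next experiment; a real stopping condition might be a budget of experiments or a plateau in the best response.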
In this post you discovered a clever way to use a predictive model.
In the case study, you learned of an example of using machine learning algorithms to model the results of concrete mixture experiments, then search the parameter space for mixtures with optimal compressive strength that may be taken as the basis for further experiments.
You learned that this type of experimental design is called Response Surface Methodology and is used in industrial problem domains for processes like the concrete mixture example. You also learned that the predictive model in this case study is called a surrogate model.
This is a powerful method that you could use in other domains where performing experiments or simulations carries a large computational or practical cost.
Below are some books you may want to look at to learn more about this approach to experimental design and optimization.
Very interesting write-up.
Thank you for keeping the rest of us informed.
Hi Jason, interestingly I was working on a similar problem and bumped into your post. To approximate a response surface I am trying to use an XGBoost model. I have 3 features, so I chose a 3^3 factorial design, which gives 27 points at which I evaluated the model (the performance function is implicit, so I do 27 finite element model runs). So I now have 27 rows of min, medium and max combinations of the 3 random variables. Conventionally people use a linear or polynomial regression model and I wanted to try XGBoost.
X is a 27×3 and y is a 27×1
BUT somehow the linear model was always outperforming (after 100k Monte Carlo simulations) the other models (RF, DTree, GBoost, XGBoost), even though the response surface is definitely non-linear.
I was thinking that all the other models should be as good as or better than the linear model.
Are there any MEASURES to be taken when I use tree-based models for RSM approximation when a factorial design approach is considered?
Good question. I’m not sure off hand, sorry.
No problem. I was not sure if I communicated the question properly. Normally engineering problems have 4-6 random parameters, so a 2-level factorial design to construct a response surface will have 2^4 = 16 to 2^6 = 64 samples, which means NOT a lot of training instances.
So question is :
Will algorithms like RF, DTree, GBoost, XGBoost perform well on smaller data sets?
It really depends. Perhaps try, and if results are poor, perhaps try adding regularization.