How To Improve Deep Learning Performance

20 Tips, Tricks and Techniques That You Can Use To
Fight Overfitting and Get Better Generalization

How can you get better performance from your deep learning model?

It is one of the most common questions I get asked.

It might be asked as:

How can I improve accuracy?

…or it may be reversed as:

What can I do if my neural network performs poorly?

I often reply with “I don’t know exactly, but I have lots of ideas.”

Then I proceed to list out all of the ideas I can think of that might give a lift in performance.

Rather than write out that list again, I’ve decided to put all of my ideas into this post.

The ideas won’t just help you with deep learning, but really any machine learning algorithm.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

It’s a big post, you might want to bookmark it.

How To Improve Deep Learning Performance. Photo by Pedro Ribeiro Simões, some rights reserved.

Ideas to Improve Algorithm Performance

This list of ideas is not complete but it is a great start.

My goal is to give you lots of ideas of things to try, and hopefully one or two ideas that you have not thought of.

You often only need one good idea to get a lift.

If you get results from one of the ideas, let me know in the comments.
I’d love to hear about it!

If you have one more idea, or an extension of one of the ideas listed, let me know; I and all readers would benefit! It might just be the one idea that helps someone else get their breakthrough.

I have divided the list into 4 sub-topics:

  1. Improve Performance With Data.
  2. Improve Performance With Algorithms.
  3. Improve Performance With Algorithm Tuning.
  4. Improve Performance With Ensembles.

The gains often get smaller the further down the list. For example, a new framing of your problem or more data is often going to give you more payoff than tuning the parameters of your best performing algorithm. Not always, but in general.

I have included lots of links to tutorials from the blog, questions from related sites as well as questions on the classic Neural Net FAQ.

Some of the ideas are specific to artificial neural networks, but many are quite general. General enough that you could use them to spark ideas on improving your performance with other techniques.

Let’s dive in.

1. Improve Performance With Data

You can get big wins with changes to your training data and problem definition. Perhaps even the biggest wins.

Here’s a short list of what we’ll cover:

  1. Get More Data.
  2. Invent More Data.
  3. Rescale Your Data.
  4. Transform Your Data.
  5. Feature Selection.

1) Get More Data

Can you get more training data?

The quality of your models is generally constrained by the quality of your training data. You want the best data you can get for your problem.

You also want lots of it.

Deep learning and other modern nonlinear machine learning techniques get better with more data. Deep learning especially. It is one of the main points that make deep learning so exciting.

Take a look at the following cartoon:

Why Deep Learning? Slide by Andrew Ng, all rights reserved.

More data does not always help, but it can. If I am given the choice, I will get more data for the optionality it provides.

2) Invent More Data

Deep learning algorithms often perform better with more data.

We mentioned this in the last section.

If you can’t reasonably get more data, you can invent more data.

  • If your data are vectors of numbers, create randomly modified versions of existing vectors.
  • If your data are images, create randomly modified versions of existing images.
  • If your data are text, you get the idea…

Often this is called data augmentation or data generation.

You can use a generative model. You can also use simple tricks.

For example, with photograph image data, you can get big gains by randomly shifting and rotating existing images. This improves the generalization of the model to such transforms, if they are to be expected in new data.

This is also related to adding noise, what we used to call adding jitter. It can act like a regularization method to curb overfitting the training dataset.
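For instance, here is a minimal sketch of random shift and rotation augmentation with the Keras ImageDataGenerator (assuming the tensorflow.keras API and an in-memory array of images X with labels y; the specific ranges are illustrative starting points, not recommendations):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# randomly perturb existing images to invent new training examples
datagen = ImageDataGenerator(
    rotation_range=15,       # rotate by up to 15 degrees
    width_shift_range=0.1,   # shift horizontally by up to 10% of width
    height_shift_range=0.1,  # shift vertically by up to 10% of height
    horizontal_flip=True)    # randomly flip images left-right

# hypothetical: X is (samples, height, width, channels), y is the labels
# model.fit(datagen.flow(X, y, batch_size=32), epochs=100)
```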

3) Rescale Your Data

This is a quick win.

A traditional rule of thumb when working with neural networks is:

Rescale your data to the bounds of your activation functions.

If you are using sigmoid activation functions, rescale your data to values between 0 and 1. If you’re using the hyperbolic tangent (tanh), rescale to values between -1 and 1.

This applies to inputs (x) and outputs (y). For example, if you have a sigmoid on the output layer to predict binary values, normalize your y values to be binary. If you are using softmax, you can still get benefit from normalizing your y values.

This is still a good rule of thumb, but I would go further.

I would suggest that you create a few different versions of your training dataset as follows:

  • Normalized to 0 to 1.
  • Rescaled to -1 to 1.
  • Standardized.

Then evaluate the performance of your model on each. Pick one, then double down.

If you change your activation functions, repeat this little experiment.

Big values accumulating in your network are not good. In addition, there are other methods for keeping numbers small in your network, such as normalizing activations and weights, but we’ll look at these techniques later.
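As a sketch, the three versions can be created with scikit-learn (assuming a numeric training matrix X; in practice, fit each scaler on training data only and reuse it on validation and test data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.rand(100, 5) * 100.0  # hypothetical raw training data

X_01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # normalized to 0 to 1
X_11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # rescaled to -1 to 1
X_std = StandardScaler().fit_transform(X)                    # standardized: zero mean, unit variance
```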

4) Transform Your Data

This is related to the rescaling suggested above, but requires more work.

You must really get to know your data. Visualize it. Look for outliers.

Guesstimate the univariate distribution of each column.

  • Does a column look like a skewed Gaussian? Consider adjusting the skew with a Box-Cox transform.
  • Does a column look like an exponential distribution? Consider a log transform.
  • Does a column look like it has some features, but they are being clobbered by something obvious? Try squaring or square-rooting.
  • Can you make a feature discrete or binned in some way to better emphasize some structure?

Lean on your intuition. Try things.

  • Can you pre-process data with a projection method like PCA?
  • Can you aggregate multiple attributes into a single value?
  • Can you expose some interesting aspect of the problem with a new boolean flag?
  • Can you explore temporal or other structure in some other way?

Neural nets perform feature learning. They can do this stuff.

But they will also learn a problem much faster if you can better expose the structure of the problem to the network for learning.

Spot-check lots of different transforms of your data or of specific attributes and see what works and what doesn’t.
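As a sketch of the distribution-based transforms above (assuming a strictly positive column x; the Box-Cox transform requires positive values):

```python
import numpy as np
from scipy.stats import boxcox

x = np.random.exponential(scale=2.0, size=1000)  # hypothetical skewed, positive column

x_log = np.log(x)      # log transform for exponential-looking data
x_bc, lam = boxcox(x)  # Box-Cox transform; lambda is estimated from the data
x_sqrt = np.sqrt(x)    # square root to soften a heavy right tail
```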

5) Feature Selection

Neural nets are generally robust to unrelated data.

They’ll use a near-zero weight and sideline the contribution of non-predictive attributes.

Still, that’s data, weights, and training cycles spent on inputs that are not needed to make good predictions.

Can you remove some attributes from your data?

There are lots of feature selection methods and feature importance methods that can give you ideas of features to keep and features to boot.

Try some. Try them all. The idea is to get ideas.

Again, if you have time, I would suggest evaluating a few different selected “Views” of your problem with the same network and see how they perform.

  • Maybe you can do as well or better with fewer features. Yay, faster!
  • Maybe all the feature selection methods boot the same specific subset of features. Yay, consensus on useless features.
  • Maybe a selected subset gives you some ideas on further feature engineering you can perform. Yay, more ideas.
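One way to generate such candidate views is with scikit-learn’s feature selection utilities; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# view 1: keep the 10 best features by a univariate statistical test
X_kbest = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# view 2: keep the 10 best features by recursive feature elimination
rfe = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
```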

6) Reframe Your Problem

Step back from your problem.

Are the observations that you’ve collected the only way to frame your problem?

Maybe there are other ways. Maybe other framings of the problem are able to better expose the structure of your problem to learning.

I really like this exercise because it forces you to open your mind. It’s hard. Especially if you’re invested (ego!!!, time, money) in the current approach.

Even if you just list off 3-to-5 alternate framings and discount them, at least you are building your confidence in the chosen approach.

  • Maybe you can incorporate temporal elements in a window or in a method that permits timesteps.
  • Maybe your classification problem can become a regression problem, or the reverse.
  • Maybe your binary output can become a softmax output?
  • Maybe you can model a sub-problem instead.

It is a good idea to think through the problem and its possible framings before you pick up the tool, because you’re less invested in solutions.

Nevertheless, if you’re stuck, this one simple exercise can deliver a spring of ideas.

Also, you don’t have to throw away any of your prior work. See the ensembles section later on.

2. Improve Performance With Algorithms

Machine learning is about algorithms.

All the theory and math describes different approaches to learn a decision process from data (if we constrain ourselves to predictive modeling).

You’ve chosen deep learning for your problem. Is it really the best technique you could have chosen?

In this section, we’ll touch on just a few ideas around algorithm selection before next diving into the specifics of getting the most from your chosen deep learning method.

Here’s the short list:

  1. Spot-Check Algorithms.
  2. Steal From Literature.
  3. Resampling Methods.

Let’s get into it.

1) Spot-Check Algorithms

Brace yourself.

You cannot know which algorithm will perform best on your problem beforehand.

If you knew, you probably would not need machine learning.

What evidence have you collected that your chosen method was a good choice?

Let’s flip this conundrum.

No single algorithm can perform better than any other, when performance is averaged across all possible problems. All algorithms are equal. This is a summary of the finding from the no free lunch theorem.

Maybe your chosen algorithm is not the best for your problem.

Now, we are not trying to solve all possible problems, but the new hotness in algorithm land may not be the best choice on your specific dataset.

My advice is to collect evidence. Entertain the idea that there are other good algorithms and give them a fair shot on your problem.

Spot-check a suite of top methods and see which fare well and which do not.

  • Evaluate some linear methods like logistic regression and linear discriminant analysis.
  • Evaluate some tree methods like CART, Random Forest and Gradient Boosting.
  • Evaluate some instance methods like SVM and kNN.
  • Evaluate some other neural network methods like LVQ, MLP, CNN, LSTM, hybrids, etc.

Double down on the top performers and improve their chance with some further tuning or data preparation.

Rank the results against your chosen deep learning method. How do they compare?

Maybe you can drop the deep learning model and use something a lot simpler, a lot faster to train, even something that is easy to understand.
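A minimal spot-checking sketch with scikit-learn (synthetic data stands in for your problem; the goal is a fair, like-for-like comparison, not tuned results):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'lda': LinearDiscriminantAnalysis(),
    'random_forest': RandomForestClassifier(n_estimators=100),
    'svm': SVC(),
    'knn': KNeighborsClassifier(),
}

# evaluate each model the same way and compare mean accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))
```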

2) Steal From Literature

A great shortcut to picking a good method is to steal ideas from the literature.

Who else has worked on a problem like yours, and what methods did they use?

Check papers, books, blog posts, Q&A sites, tutorials, everything Google throws at you.

Write down all the ideas and work your way through them.

This is not about replicating research, it is about new ideas that you have not thought of that may give you a lift in performance.

Published research is highly optimized.

There are a lot of smart people writing lots of interesting things. Mine this great library for the nuggets you need.

3) Resampling Methods

You must know how good your models are.

Is your estimate of the performance of your models reliable?

Deep learning methods are slow to train.

This often means we cannot use gold standard methods to estimate the performance of the model such as k-fold cross validation.

  • Maybe you are using a simple train/test split; this is very common. If so, you need to ensure that the split is representative of the problem. Univariate stats and visualization are a good start.
  • Maybe you can exploit hardware to improve the estimates. For example, if you have a cluster or an Amazon Web Services account, you can train n models in parallel, then take the mean and standard deviation of the results to get a more robust estimate (as sketched below).
  • Maybe you can use a validation hold out set to get an idea of the performance of the model as it trains (useful for early stopping, see later).
  • Maybe you can hold back a completely blind validation set that you use only after you have performed model selection.

Going the other way, maybe you can make the dataset smaller and use stronger resampling methods.

  • Maybe there is a strong correlation between the performance of a model trained on a sample of the training dataset and one trained on the whole dataset. Perhaps you can perform model selection and tuning using the smaller dataset, then scale the final technique up to the full dataset at the end.
  • Maybe you can constrain the dataset anyway: take a sample and use that for all model development.

You must have complete confidence in the performance estimates of your models.
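Here is a minimal sketch of the repeated-evaluation idea on a small synthetic problem (Keras and scikit-learn; the model and data are stand-ins for your own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scores = []
for i in range(10):  # train n models; each starts from different random weights
    model = Sequential([Dense(32, activation='relu', input_shape=(20,)),
                        Dense(1, activation='sigmoid')])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    scores.append(acc)

# the mean and spread are a more robust estimate than any single run
print('accuracy: %.3f (+/- %.3f)' % (np.mean(scores), np.std(scores)))
```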

3. Improve Performance With Algorithm Tuning

This is where the meat is.

You can often unearth one or two well-performing algorithms quickly from spot-checking. Getting the most from those algorithms can take days, weeks, or months.

Here are some ideas on tuning your neural network algorithms in order to get more out of them.

  1. Diagnostics.
  2. Weight Initialization.
  3. Learning Rate.
  4. Activation Functions.
  5. Network Topology.
  6. Batches and Epochs.
  7. Regularization.
  8. Optimization and Loss.
  9. Early Stopping.

You may need to train a given “configuration” of your network many times (3-10 or more) to get a good estimate of the performance of the configuration. This probably applies to all the aspects that you can tune in this section.

1) Diagnostics

You will get better performance if you know why performance is no longer improving.

Is your model overfitting or underfitting?

Always keep this question in mind. Always.

It will be doing one or the other, just by varying degrees.

A quick way to get insight into the learning behavior of your model is to evaluate it on the training and a validation dataset each epoch, and plot the results.

Plot of Model Accuracy on Train and Validation Datasets

  • If performance on the training set is much better than on the validation set, you are probably overfitting, and you can use techniques like regularization.
  • If training and validation are both low, you are probably underfitting and you can probably increase the capacity of your network and train more or longer.
  • If there is an inflection point when training goes above the validation, you might be able to use early stopping.

Create these plots often and study them for insight into the different techniques you can use to improve performance.

These plots might be the most valuable diagnostics you can create.
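In Keras, the history returned by fit() makes these plots a few lines (a sketch, assuming model is a compiled Keras model with an accuracy metric; note the history keys are 'acc'/'val_acc' in older Keras versions):

```python
import matplotlib.pyplot as plt

# hypothetical: model is a compiled Keras model, X and y the training data
history = model.fit(X, y, validation_split=0.3, epochs=100, verbose=0)

plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```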

Another useful diagnostic is to study the observations that the network gets right and wrong.

On some problems, this can give you ideas of things to try.

  • Perhaps you need more, or augmented, examples of the difficult-to-model cases.
  • Perhaps you can remove large samples of the training dataset that are easy to model.
  • Perhaps you can use specialized models that focus on different clear regions of the input space.

2) Weight Initialization

The rule of thumb used to be:

Initialize using small random numbers.

In practice, that is still probably good enough. But is it the best for your network?

There are also heuristics for different activation functions, but I don’t remember seeing much difference in practice.

Keep your network fixed and try each initialization scheme.

Remember, the weights are the actual parameters of your model that you are trying to find. There are many sets of weights that give good performance, but you want better performance.

  • Try all the different initialization methods offered and see if one is better with all else held constant.
  • Try pre-learning with an unsupervised method like an autoencoder.
  • Try taking an existing model and retraining a new input and output layer for your problem (transfer learning).

Remember, changing the weight initialization method is closely tied with the activation function and even the optimization function.
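A sketch of trying each initialization scheme with all else held constant (the strings are standard Keras initializer identifiers; fit and evaluate each variant the same way, as in the diagnostics above):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

for init in ['glorot_uniform', 'glorot_normal', 'he_uniform', 'he_normal', 'lecun_uniform']:
    model = Sequential([
        Dense(32, activation='relu', kernel_initializer=init, input_shape=(20,)),
        Dense(1, activation='sigmoid', kernel_initializer=init),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit and evaluate each variant on the same data, repeating runs to reduce noise
```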

3) Learning Rate

There is often payoff in tuning the learning rate.

Here are some ideas of things to explore:

  • Experiment with very large and very small learning rates.
  • Grid search common learning rate values from the literature and see how far you can push the network.
  • Try a learning rate that decreases over epochs.
  • Try a learning rate that drops every fixed number of epochs by a percentage.
  • Try adding a momentum term then grid search learning rate and momentum together.

Larger networks need more training, and the reverse. If you add more neurons or more layers, increase your learning rate.

Learning rate is coupled with the number of training epochs, batch size and optimization method.
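For example, a sketch of grid searching learning rate and momentum together with SGD (assuming the tensorflow.keras API, where the argument is learning_rate; older Keras versions call it lr; the value grids are illustrative):

```python
from tensorflow.keras.optimizers import SGD

# grid search learning rate and momentum together, all else held constant
for lr in [0.1, 0.01, 0.001, 0.0001]:
    for momentum in [0.0, 0.5, 0.9]:
        opt = SGD(learning_rate=lr, momentum=momentum)
        # hypothetical: build a fresh model for each pair, then
        # model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
        # fit and evaluate, recording the score for each combination
```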

4) Activation Functions

You probably should be using rectifier activation functions.

They just work better.

Before that, it was sigmoid and tanh, with a softmax, linear, or sigmoid on the output layer. I don’t recommend trying more than that unless you know what you’re doing.

Try all three though and rescale your data to meet the bounds of the functions.

Obviously, you want to choose the right transfer function for the form of your output, but consider exploring different representations.

For example, switch your sigmoid for binary classification to linear for a regression problem, then post-process your outputs. This may also require changing the loss function to something more appropriate. See the section on Data Transforms for more ideas along these lines.

5) Network Topology

Changes to your network structure will pay off.

How many layers and how many neurons do you need?

No one knows. No one. Don’t ask.

You must discover a good configuration for your problem. Experiment.

  • Try one hidden layer with a lot of neurons (wide).
  • Try a deep network with few neurons per layer (deep).
  • Try combinations of the above.
  • Try architectures from recent papers on problems similar to yours.
  • Try topology patterns (fan out then in) and rules of thumb from books and papers (see links below).

It’s hard. Larger networks have a greater representational capability, and maybe you need it.

More layers offer more opportunity for hierarchical re-composition of abstract features learned from the data. Maybe you need that.

Larger networks need more training, both in epochs and in learning rate. Adjust accordingly.
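A sketch of the wide-versus-deep comparison from the list above (Keras syntax; the layer sizes are arbitrary starting points):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# wide: one hidden layer with many neurons
wide = Sequential([Dense(256, activation='relu', input_shape=(20,)),
                   Dense(1, activation='sigmoid')])

# deep: several hidden layers with fewer neurons each
deep = Sequential([Dense(32, activation='relu', input_shape=(20,)),
                   Dense(32, activation='relu'),
                   Dense(32, activation='relu'),
                   Dense(1, activation='sigmoid')])
# compile, fit, and evaluate both the same way, repeating runs to reduce noise
```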

6) Batches and Epochs

The batch size defines how many samples are used to estimate the error gradient and how often the weights are updated. An epoch is one pass of the entire training dataset through the network, batch by batch.

Have you experimented with different batch sizes and number of epochs?

Above, we have commented on the relationship between learning rate, network size and epochs.

Small batch sizes and a large number of training epochs are common in modern deep learning implementations.

This may or may not hold with your problem. Gather evidence and see.

  • Try a batch size equal to the training data size, memory permitting (batch learning).
  • Try a batch size of one (online learning).
  • Try a grid search of different mini-batch sizes (8, 16, 32, …).
  • Try training for a few epochs and for a heck of a lot of epochs.

Consider a near-infinite number of epochs and set up checkpointing to capture the best-performing model seen so far; see more on this further down.

Some network architectures are more sensitive than others to batch size. I see Multilayer Perceptrons as often robust to batch size, whereas LSTMs and CNNs seem quite sensitive, but that is just anecdotal.
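A sketch of gathering that evidence (assuming training data X, y and a hypothetical build_model() helper that returns a fresh compiled Keras model):

```python
# compare batch learning, online learning, and mini-batches of different sizes
for batch_size in [len(X), 1, 8, 16, 32, 64]:
    model = build_model()  # hypothetical: returns a fresh compiled Keras model
    model.fit(X, y, epochs=100, batch_size=batch_size, verbose=0)
    # evaluate on a validation set and record the score for comparison
```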

7) Regularization

Regularization is a great approach to curb overfitting the training data.

The hot new regularization technique is dropout. Have you tried it?

Dropout randomly skips neurons during training, forcing others in the layer to pick up the slack. Simple and effective. Start with dropout.

  • Grid search different dropout percentages.
  • Experiment with dropout in the input, hidden and output layers.

There are extensions on the dropout idea that you can also play with like drop connect.

Also consider other, more traditional neural network regularization techniques, such as:

  • Weight decay to penalize large weights.
  • Activation constraint, to penalize large activations.

Experiment with the different aspects that can be penalized and with the different types of penalties that can be applied (L1, L2, both).
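A sketch combining dropout with an L2 weight penalty in Keras (the rate and penalty are common starting values, not recommendations):

```python
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential([
    Dense(64, activation='relu', input_shape=(20,),
          kernel_regularizer=l2(0.01)),  # weight decay to penalize large weights
    Dropout(0.5),                        # randomly skip 50% of neurons during training
    Dense(1, activation='sigmoid'),
])
```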

8) Optimization and Loss

It used to be stochastic gradient descent, but now there are a ton of optimizers.

Have you experimented with different optimization procedures?

Stochastic Gradient Descent is the default. Get the most out of it first, with different learning rates, momentum and learning rate schedules.

Many of the more advanced optimization methods offer more parameters, more complexity and faster convergence. This is good and bad, depending on your problem.

To get the most out of a given method, you really need to dive into the meaning of each parameter, then grid search different values for your problem. Hard. Time consuming. It might pay off.

I have found that newer/popular methods, for example Adam and RMSprop, can converge a lot faster and give a quick idea of the capability of a given network topology.

You can also explore other optimization algorithms such as the more traditional (Levenberg-Marquardt) and the less so (genetic algorithms). Other methods can offer good starting places for SGD and friends to refine.

The loss function to be optimized might be tightly related to the problem you are trying to solve.

Nevertheless, you often have some leeway (MSE and MAE for regression, etc.) and you might get a small bump by swapping out the loss function on your problem. This too may be related to the scale of your input data and activation functions that are being used.
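A sketch of swapping optimizers and losses while holding the model constant (Keras string identifiers; a regression-style example with the MSE/MAE leeway mentioned above, using a hypothetical build_model() helper):

```python
# compare optimizers, then compare loss functions, all else held constant
for optimizer in ['sgd', 'rmsprop', 'adam']:
    for loss in ['mean_squared_error', 'mean_absolute_error']:
        model = build_model()  # hypothetical: returns a fresh, uncompiled Keras model
        model.compile(loss=loss, optimizer=optimizer)
        # fit and evaluate, recording the score for each combination
```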

9) Early Stopping

You can stop learning once performance starts to degrade.

This can save a lot of time, and may even allow you to use more elaborate resampling methods to evaluate the performance of your model.

Early stopping is a type of regularization used to curb overfitting of the training data; it requires that you monitor the performance of the model on the training and a held-out validation dataset each epoch.

Once performance on the validation dataset starts to degrade, training can stop.

You can also set up checkpoints to save the model if this condition is met (measuring loss or accuracy), and allow the model to keep learning.

Checkpointing allows you to do early stopping without the stopping, giving you a few models to choose from at the end of a run.
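In Keras this is two callbacks (a sketch; the patience and the monitored metric are things to tune):

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # stop when validation loss has not improved for 10 epochs
    EarlyStopping(monitor='val_loss', patience=10),
    # checkpointing alone: keep training, but save the best model seen so far
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
]
# hypothetical: model is compiled, with a validation split to monitor
# model.fit(X, y, validation_split=0.3, epochs=1000, callbacks=callbacks)
```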

4. Improve Performance With Ensembles

You can combine the predictions from multiple models.

After algorithm tuning, this is the next big area for improvement.

In fact, you can often get good performance from combining the predictions from multiple “good enough” models rather than from multiple highly tuned (and fragile) models.

We’ll take a look at three general areas of ensembles you may want to consider:

  1. Combine Models.
  2. Combine Views.
  3. Stacking.

1) Combine Models

Don’t select a model, combine them.

If you have multiple different deep learning models, each that performs well on the problem, combine their predictions by taking the mean.

The more different the models, the better. For example, you could use very different network topologies or different techniques.

The ensemble prediction will be more robust if each model is skillful but in different ways.

Alternately, you can experiment with the converse position.

Each time you train the network, you initialize it with different weights and it converges to a different set of final weights. Repeat this process many times to create many networks, then combine the predictions of these networks.

Their predictions will be highly correlated, but it might give you a small bump on those patterns that are harder to predict.
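A sketch of the mean-of-predictions ensemble (assuming models is a list of fitted Keras models predicting probabilities for the same X_test):

```python
import numpy as np

# hypothetical: models is a list of fitted models, X_test the test inputs
predictions = [model.predict(X_test) for model in models]
ensemble = np.mean(predictions, axis=0)  # element-wise mean across the models
```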

2) Combine Views

As above, but train each network on a different view or framing of your problem.

Again, the objective is to have models that are skillful, but in different ways (e.g. uncorrelated predictions).

You can lean on the very different scaling and transform techniques listed above in the Data section for ideas.

The more different the transforms and framing of the problem used to train the different models, the more likely your results will improve.

Using a simple mean of predictions would be a good start.

3) Stacking

You can also learn how to best combine the predictions from multiple models.

This is called stacked generalization or stacking for short.

Often you can get better results than a mean of the predictions by using simple linear methods, like regularized regression, that learn how to weight the predictions from different models.

Baseline results using the mean of the predictions from the submodels, then lift performance with learned weightings of the models.
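A sketch of that idea with a regularized linear meta-model (scikit-learn; assuming each submodel’s predictions were made on held-out data, so the meta-model does not learn from leaked training predictions; pred1..pred3 and y_holdout are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical: pred1..pred3 are submodel probabilities on held-out data
meta_X = np.column_stack([pred1, pred2, pred3])  # shape (samples, n_models)

# learn how to weight the submodels; C controls the regularization strength
meta_model = LogisticRegression(C=1.0)
meta_model.fit(meta_X, y_holdout)
final_predictions = meta_model.predict(meta_X)
```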

Conclusions

You made it.

Additional Resources

There are a lot of good resources, but few tie all the ideas together.

I’ll list some resources and related posts that you may find interesting if you want to dive deeper.

Know a good resource? Let me know, leave a comment.

Handle The Overwhelm

This is a big post and we’ve covered a lot of ground.

You do not need to do everything. You just need one good idea to get a lift in performance.

Here’s how to handle the overwhelm:

  1. Pick one group:
    1. Data.
    2. Algorithms.
    3. Tuning.
    4. Ensembles.
  2. Pick one method from the group.
  3. Pick one thing to try of the chosen method.
  4. Compare the results, and keep the change if there was an improvement.
  5. Repeat.

Share Your Results

Did you find this post useful?

Did you get that one idea or method that made a difference?

Let me know, leave a comment!
I would love to hear about it.

183 Responses to How To Improve Deep Learning Performance

  1. Xu Lu September 22, 2016 at 4:31 am

    Thank you for your advice!

  2. Nitin September 22, 2016 at 5:30 am

    It is really a comprehensive explanation, going to try it.

  3. Dr. Sudhir Pathak September 22, 2016 at 12:13 pm

    Thank you for sharing – this is very useful information.

  4. Arifa September 22, 2016 at 3:14 pm

    What are mathematical challenges for Deep learning in Big data?

    • Jason Brownlee September 22, 2016 at 5:32 pm

      Great question.

      Generally, deep learning is empirical. We don’t have good theory for why large networks work or even what the heck is going on inside.

      How many layers? What learning rate? These are problems that can only be solved empirically, not analytically. For now.

      A strong math theory could push back the empirical side/voodoo and improve understanding.

  5. Daisuke September 25, 2016 at 10:43 pm

    Thank you for sharing great post, I really appreciate.
    I think it is very helpful, so I’d like to share the idea with my Japanese followers.
    So I’m making translated summary of this post.
    Could I ask for permission to post my summary in http://qiita.com/daisukelab ?
    (Japanese tech blog media)

    And could I ask the detail for 2-3),
    “Maybe you can constrain the dataset anyway, take a sample and use that for all model development.”

    Do you mean:
    “If we use smaller subset of dataset, we could use the subset for completing model development to the end”?

    Thank you!

    • Jason Brownlee September 26, 2016 at 6:59 am

      Please do not repost the material Daisuke.

By that comment I meant that working with a sample of your data, rather than all of the data, has benefits, like increasing the speed of turning around models.

      • Daisuke September 26, 2016 at 11:16 am

        Hi Jason,
        OK I will not repost, though it is for spreading your idea with translation and lead people visit here.
        And thank you for clarification!

  6. Chintan zaveri September 30, 2016 at 7:42 am

    Amazing post.. Thanx for sharing information about deep learning and enhancing current models

    • Jason Brownlee September 30, 2016 at 7:52 am

      Thanks Chintan, I’m glad you found it useful.

  7. Fernando C. November 25, 2016 at 6:07 am

    This is my current favorite website!

    I’m currently working on implementing some nlp for regressions and was wondering if I could improve my results. I’ll try ensembles, as I have many models already trained.

    Thanks!

    • Jason Brownlee November 25, 2016 at 9:34 am

      Best of luck Fernando, I’d love to hear how you go.

  8. Lee Moon November 25, 2016 at 9:12 pm

    This is really good information what I found. Actually, I am working in Semantic Segmentation using Deep learning. It can view as an extension task of recognition task. As I read, I felt that all segmentation techniques have come from recognition (We can think that the recognition as encoding phase provides probability map, the segmentation task maps the probability maps to the image by using decode phase). Hence my opinion, I think that if any state-of-the-art recognition network architecture applies for segmentation task which can achieve more accuracy than segmentation using older recognition network architecture. Do you think so? What is good direction to improve segmentation accuracy? Thank you in advance

  9. Alp ALTIN December 9, 2016 at 6:20 pm

    I am a newbie in deep learning and experimenting with existing examples, using the digits interface. Thank you for all the effort to simplify the topic, a technical documentation still well understandable for newcomers.
    My interest is on detecting (and counting) particles via deep learning. After many many experiments with various samples, I realised particles too close to each (almost touching) other are counted as one, while there is a clear seperation, to my eye.

    I am looking for an approach in how to handle this.

    I tried some thresholding techniques on individual images in an image processing software and best results obtained with color thresholding. Although I have no idea whether thresholding is a part of particle detection in object detection processes by deep learning, but wondering if is it possible to integrate an equivalent process into object detection models ?(currently using the detectnet model)
    Regards,

  10. Naoki January 21, 2017 at 5:15 pm

    thank you. I’ll try some techniques of this post.

    • Jason Brownlee January 22, 2017 at 5:08 am

      You’re welcome Naoki, I’d love to hear about your results.

  11. emma February 5, 2017 at 11:40 pm

    very comprehensive blog! Great work!

    • Jason Brownlee February 6, 2017 at 9:43 am

      Thanks emma, I hope it helps with your project.

  12. Max S. February 17, 2017 at 9:26 am

    Hello,

    I would like to know if there is an implementation in Keras of “drop connect”.

    Thanks you for your time

    • Jason Brownlee February 17, 2017 at 10:07 am

      I have not seen one Max, but I expect there will be something out there!

  13. KERKENI L. March 10, 2017 at 3:28 am

    Hi Jason,

    Thank you for sharing great post, I really appreciate.

    you mentioned in section 2: Invent More Data

    -If you can’t reasonably get more data, you can invent more data.

    -If your data are vectors of numbers, create randomly modified versions of existing vectors.

    —-> Have you an example how to create randomly modified versions of existing vectors.?

    Thank you!

    • Jason Brownlee March 10, 2017 at 9:29 am

      I don’t but you could experiment with different perturbation methods to see what works best.

      Some good ideas to try include:

      – randomly replace a subset of values with randomly selected values in the data population
      – add a small random value (select distribution to meet the data distribution for a column)

      It is tricky, because you need the new data to be “reasonable” for the assigned class.

      Consider a skim of the literature for more sophisticated methods.

  14. Lucas April 7, 2017 at 8:46 am

    First of all thank you for the thorough explanation and rich material, it’s been helping me quite a lot.

    One thing that still troubles me is applying Levenberg-Marquardt in Python, more specifically in Keras. I’m dealing with non-ideal input variables to infer target and would like to go through a range of optimizers to test the network performance.
    Since one of the best available in Matlab is Levenberg-Marquardt, it would very good (and provide comparison value between languages) if I could accurately apply it in keras to train my network.

    I am currently using Keras with Theano backend. Any help at all would be appreciated.

    Best,
    Lucas

  15. Marcin May 17, 2017 at 3:27 am

    Hi Jason,

    Thank you for the great work and posts.

    I have got a question- after improving deep learning performance in my project i achieve accuracy of 75% in a binary classification problem. Using other methods gives me in the best shot 77%. My question is when do i know that my model is the best possbile? Is there any measure that can explain to which extend my data has explanatory power?

    • Jason Brownlee May 17, 2017 at 8:39 am

      Well done!

      Great question. We can never know for sure, but only when we run out of time or ideas.

      Generally, no, I’m not aware of methods to estimate explanatory power, I’m not even clear what that might mean.

      • Marcin May 17, 2017 at 7:14 pm

        What I meant by ‘explanatory power’, was the ability of data to distinguish that one record belongs to class 1, second to class 2, third again to class 1 and so on.
        It is some kind of limitation of the dataset, that it can achieve max. accuracy for example at 80% and my question was is it a way to measure, that level.
        Thank you for your answer, now i am ready to accept my model 🙂

        • Jason Brownlee May 18, 2017 at 8:35 am

          You can configure the model to output probabilities instead of classes, this may give the result you require.

  16. Cyrus June 22, 2017 at 4:30 pm

    Finally! Come across all the techniques to improve your deep learning model in a nutshell!
    Thanks a lot Jason!

  17. Parva June 27, 2017 at 11:00 pm

    Hi, I’m working on my final year project which is detecting nsfw content in images and further for videos (if possible). My problem is that i cannot find any dataset for working so if you could please help me out with this problem by giving me some suggestions will really help me in this project.I need data of about (150-200 gb) to make my algorithm more precise. I have reached out to yahoo open nsfw team but there is no response from them.

    • Jason Brownlee June 28, 2017 at 6:26 am

      Sorry, I don’t know where to get such a dataset.

  18. Glendon July 24, 2017 at 12:35 pm

    Hey Jason, Do you know of any empirical evidence for the “Why Deep Learning?” slide by Andrew Ng. Specifically I am working on a text classification problem, I am finding BoW + (Linear SVM’s or Logistic Regression) giving me the best performance (which is what I find in the literature at least pre 2015). I’m fairly new to Deep learning, I have been testing it out on my problem having seen stuff like https://arxiv.org/abs/1408.5882 perform well in the Rotten Tomatoes Kaggle text classification problem. Although there was some other fancy tricks that I think gave the CNN extra juice. Do you have any recommendations or any benchmarking studies in this area that demonstrate what Andrew Ng is claiming?

  19. Nunu July 26, 2017 at 7:58 pm

    Dear Jason,
    Thanks for this article 🙂 I have a question : how to calculate the total error of a network ?!
    thanks in advance.
    Nunu

    • Jason Brownlee July 27, 2017 at 7:59 am

      Evaluate it on test data and calculate an error score, such as RMSE for regression or Accuracy on classification.

  20. Nunu July 27, 2017 at 10:49 pm

    ok I will try it. Thanks a lot

  21. Danny July 28, 2017 at 1:31 pm

    (“Hyperbolic Tangent (tanh), rescale to values between -1 and 1” )
    WHOOOOPS!!, that might be important,now i know, lol!

    Finalllllllllly! someone who has explain this wonderfully with structure, and not just said its a black box!
    great info Doc

    I have a question! do you have any pointers for unbalanced data?

    is it better to sacrifice other data to balance every class out?

  22. Danny July 28, 2017 at 1:47 pm

    Examples of what I mean by unbalance data

    === self drive car ===
    left = [0,1,0] =5k samples
    right = [0,0,1]= 5k samples
    forward = [1,0,0]=100k samples

    ===stocks====
    Buy = [0,1,0] =5k samples
    Sell = [0,0,1]= 5k samples
    Hold = [1,0,0]=100k samples

    etc……..

    • J September 6, 2017 at 10:03 am

      Undersampling, Oversampling, or Smote

    • shayma May 4, 2018 at 5:17 am

      thanks for allll your articles in this website ,it is favorite for me <3 <3

  23. diao July 28, 2017 at 6:01 pm

    great article

  24. K.D. August 7, 2017 at 8:41 pm

    Hi Jason,

    A long try-and-test process led me to the same conclusions. So thank you very much! This post will serve for a lot of new comers to the keras/ deep learning area.

    Question:
In a classic case, you normalize your data, you train the model and then you “de-normalize” (inverse using the scaler). Now, imagine that the model you are training is fed with its own output and the predicted output is out of the scaler range. What would you do to improve the model’s performance?
    Thank you in advance for your feedback!

    Kind Regards.

    • Jason Brownlee August 8, 2017 at 7:49 am

      Thanks, I’m glad it helped.

If you have data outside of the scaler’s range, you can force it in bounds or update the scaling.

      I hope that helps.

  25. Sam August 16, 2017 at 4:03 pm

    Hi Jason,

    Thanks for sharing this valuable post

    One thing I am still wondering about, I am interested to apply deep learning in data stream classification (real time prediction), but my concern is the execution time that the deep learning needs. Any idea how to speed it up or how to handle it for real time prediction.

    Regards

    • Jason Brownlee August 16, 2017 at 5:01 pm

      For modestly sized data, the feed-forward part of the neural network (to make predictions) is very fast.

      As for training the network in real-time, I would suggest that it is perhaps a bad fit for the problem.

      • Sam August 22, 2017 at 5:26 pm

        Many thanks Jason

  26. My Nguyen August 23, 2017 at 6:16 am

    Been learning a lot from your posts. Thanks for posting. Just curious, what’s up with the random pictures?:)

    • Jason Brownlee August 23, 2017 at 6:58 am

      Thanks!

      I figure the pictures would lighten the mood, be something interesting to look at as we get deep into technical topics. I often choose pictures based on where I am going or want to go for a holiday, e.g. the beach, the forest, etc. Sometimes they are a pun (e.g. a pic of a gas pipeline for a pipeline method). Sometimes they are just random.

      You’re the first person in 4 years to ask about them 🙂

  27. Devakar Kumar Verma August 24, 2017 at 2:28 pm

    Under “Rescale Your Data”, you have pointed about scaling and activation function. So if we scale the data between [-1,1], then we have to implicitly mention about activation function (i.e tanh function) in LSTM using Keras. Am i correct in assumption, or Keras will pass tanh activation function default in LSTM.

    • Jason Brownlee August 24, 2017 at 4:27 pm

      For LSTMs at the first hidden layer, you will want to scale your data to the range 0-1.

  28. faruk ahmad September 30, 2017 at 3:01 am

    Thanks for this amazing writing. It is very useful.

  29. Robin October 24, 2017 at 11:27 pm

    A bit outdated but still very useful. Thanks for this great article!

  30. Nguyen Anh Nguyen October 30, 2017 at 10:16 pm

    Would you publish technique for “DNN/CNN incremental learning” please.

    Thank you.

  31. Amina November 16, 2017 at 6:35 am

    Thanks Jason, I really love this blog. Pls I have a little questions. In what ways can you improve existing machine learning with deep learning?.

    Thanks

  32. Alan Liu November 23, 2017 at 2:33 pm

    Hi, Jason,

    I got a question about the training epoch. During the training process, does the weights we get from previous epoch have any impact on later epoch? Or in every epoch, the weights will be initialized?

    Thanks
    Alan

    • Jason Brownlee November 24, 2017 at 9:31 am

      The weights are initialized once at the beginning of the process and updated at the end of each batch. One epoch may be comprised of one or more batches (weight updates).

  33. jagesh maharjan November 30, 2017 at 10:09 pm

    A quick question, (I will simplify my explanation), I have a total of 11 classes. where 10 classes have a 50 data points and one class has only 1 datapoint. I train a model. Can i retrain a same model with former 10 classes with no datapoints and later one class with 50 datapoints. Does that work.
    Reason i am trying to do this is because, i will have datapoints for later class after some time later.
    Does my proposed plan work ?

    • Jason Brownlee December 1, 2017 at 7:33 am

      You might want to use data augmentation to create a larger training dataset, it does not sound like enough data.

      • Jagesh Maharjan December 3, 2017 at 7:32 am

        Actually, I have enough data, the above example is just for the illustration only.

        Let me put it this way (this might be more specific [Incremental Learning]): Initially, I trained a model with 10 classes/labels. Later, How do I re-trained the same model, with new classes, keeping old class intact. So that, I can use the re-trained model to predict former and later classes.

        • Jason Brownlee December 4, 2017 at 7:42 am

          Perhaps create a new dataset with examples of old and new classes and update the model weights on the new dataset?

          • Jagesh Maharjan December 4, 2017 at 2:10 pm

            Thanks, @Jason Brownlee, Indeed that way I could retain all of my previous classes along with new classes.

          • Jason Brownlee December 4, 2017 at 4:58 pm

            Let me know how you go.

  34. Pavs December 24, 2017 at 12:06 am

    Great post with broader details. Thanks.

  35. Fredrik January 13, 2018 at 11:51 am

    Perfect 🙂

    Thanks.

    //Fredrik

  36. Tony January 18, 2018 at 1:18 am

    Dear sir
    I am creating an NN for predicting as the House pricing by Keras example of yours: https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/.

    Number of experiment data (training data + testing data) is X1, small group in the boundaries. In your example, X1 = 506 data

    Number of data for predicting data is X2, covering almost the boundaries.

    It means that X1 are much smaller than X2.

    And to achieve a high accuracy of prediction, we should enlarge the X1 as much as we can.

    So, I would like to ask that how many percentage of X1 we should collect compared with X2?
    X1 = 10%(X2), 20%…

    Thank you very much

    • Jason Brownlee January 18, 2018 at 10:11 am

      I don’t follow. Do you mean X2 are observations on which you need to make predictions?

  37. Liu We February 22, 2018 at 7:13 pm

    Dear Jason,
I am trying to predict about 40 related time series with Seq2seq networks. I have read about autoencoders to automatically engineer features without having to do it manually. Could you please explain how you use the autoencoder outputs in order to make predictions? Do you concatenate them with the original time series before feeding the prediction network?
    Thank you very much
    Regards

  38. Niladri Ray Choudhury March 8, 2018 at 4:08 am

    Can you suggest some data augmentation methods for time-series data (or 1D data) that do not employ windowing techniques. It seems that for time-series data the most popular data augmentation technique are the window based techniques, which does not sit well with the problem I have at hand. I have some audio sensor data and I want to predict the exact location of the sound source. Windowing may have some negative impact on the problem as the time difference of arrival (TDOA) is one the most important feature for such type of tasks and it might get corrupted by windowing.
    What are other method I can use to increase my data.
    Information about data:
    5 sensors are placed on 4 wall and ceiling in a room. We are trying to find the exact location of a sound source from the sensor data.
    At any given point of time two different sound are active at different locations within the room.

  39. Yousaf May 7, 2018 at 7:35 pm

    Dear Jason

    I really appreciate your post and that is helpful for us. Actually, I am working in Deep learning last 6 months and most of the idea that you mention here comes to my mind during learning Deep learning and I applied all these ideas that come to my mind on my problem most of the tricks work perfectly. And Today I saw your post and I was surprised all these tricks that I was applied to my problem included in this post. Before viewing this post I was always thinking maybe I am in wrong way. But now I am happy to get a reference. Thank you very much for sharing this valuable post.

  40. Rohan Gawande June 28, 2018 at 8:08 pm

    Hi Jason, I am just a beginner to using neural networks. I wanted to know that if my input to neural networks is 5 but i have almost a number of 185 distinct outputs in my dataset , But my output can be a different value than those 185 values , so what method could I use?? rather than using one hot encoding and how can I increase performance of my model??

    • Jason Brownlee June 29, 2018 at 5:56 am

      Neural networks require a fixed number of inputs.

      If the number of inputs vary, you can use padding to ensure the input vector is always the same size.

  41. Kingsley Udeh July 25, 2018 at 9:55 pm

    Hi Dr. Jason,

    Thanks for the comprehensive posts. As you may have known, I have become an addicted reader of your blog resources.

    I don’t think you have been able to address the following questions vividly:

How do I save combined predictions (models) from an ensemble for use in production?

What’s the difference between the Walk-forward Validation method and the combined predictions from ensembles technique? Little bit confused here.

    Looking forward to your answers soon.

    • Jason Brownlee July 26, 2018 at 7:43 am

      Putting an ensemble into production is no different to putting a model into production.

      Walk forward validation and ensembles are orthogonal ideas, they are not directly related.

  42. Kingsley Udeh July 26, 2018 at 3:07 pm

    Thanks Jason.

    In this post: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

    You talked about a model may be updated each time step a new data is received -> Walk forward Validation. This being the k-fold cross validation for time series.

Don’t I have to combine all the models created by the Walk-forward Validation to one single model using either a Bagging or Stacking approach? Otherwise, how do I create a final model from Walk-forward Validation?

    • Jason Brownlee July 27, 2018 at 5:44 am

      Walk-forward validation is ONLY used to evaluate the performance of an approach.

      Once you have evaluated it, you can train a final model on all available data and use it to make predictions.

      • Kingsley Udeh July 27, 2018 at 9:48 am

        I got it, Jason.

        I used ModelCheckpoint to select the best model among models evaluated with Walk-forward Validation. Hence, I will build the final model by fitting the entire dataset.

        Thanks for your patience and response.

        • Jason Brownlee July 27, 2018 at 11:04 am

          I don’t recommend that approach. If it works for you, glad to hear it.

  43. Chrisa August 19, 2018 at 12:10 am

    Hello,

    In my case I have a very good accuracy percentage (91.6%) but my score is really low (30%). Actually, I don’t really understand the difference. Can yoy please help?

  44. Brian McDermott September 4, 2018 at 9:41 pm

    Hi Jason,

    Thanks for sharing such a useful article. I have a naive question, tough.

    I’m trying to solve a classification problem using LSTM network and I’m experiencing about 99.90% accuracy (the other metrics shows more or less same percentage) on the test set.

    Nevertheless, the training and validation accuracies are also similar. I understand that this is suspiciously higher.

    I observed the learning graph and found that both training and validation errors were homogeneous.

    I’m really confused since both the accuracies for the training, validation, and test are higher. Am I under/overfitting? Or doing something wrong?

    One more thing is that the label is not included in the training set.

    Any clue would be very helpful and appreciated.

    Thanks.

    • Jason Brownlee September 5, 2018 at 6:39 am

      You are training on unlabelled data? I don’t understand?

      • Brian McDermott September 5, 2018 at 7:19 am

        Hi Jason,

        Please ignore the following sentence: “One more thing is that the label is not included in the training set”. I couldn’t edit in the comment.

        Actually, my dataset is labeled.

        • Jason Brownlee September 5, 2018 at 2:40 pm

          With such high accuracies, it sounds like your problem is easily solved. Perhaps try a simpler method.

  45. Brian McDermott September 5, 2018 at 10:44 pm

    Did you mean using linear or tree-based method would be a better idea? Well, I’ll try.

    But I have a few other concerns too. Although I’m experiencing about 98~99% accuracies on both training, validation and test sets, the ‘score’ (i.e. score, acc = model.evaluate(….)) value is very low about 40%.

    This signifies that perhaps my LSTM model is overfitting (according to your comment on Chrisa’s question). But in that case, I was supposed to experience lower accuracy for the test set too, but I didn’t.

    I’m really confused on whether my model is underfitting or overfitting!

    I’m using a very high-dimensional gene expression data having 20,309 features and 14 classes.

    Do you think achieving 99% accuracy is possible for such a high-dimensional dataset?

    • Jason Brownlee September 6, 2018 at 5:37 am

      Yes sounds like overfitting, but what are you evaluating on exactly? A new test set?

      • Brian McDermott September 6, 2018 at 9:54 am

        Yes, on a new test set (totally unseen) and again have about 99% accuracy.

  46. Anam Habib September 6, 2018 at 12:55 am

    Dear Jason,
    I have a question that my single deep neural network model gives above 90% accuracy for one data set and the same model gives an accuracy between 70-80% for the other data set. I want to know why this variation in accuracy happens in spite of the fact that i am using the same deep neural network model for both datasets(contain textual content).

    • Jason Brownlee September 6, 2018 at 5:40 am

      You may be over fit. Perhaps try some regularization methods to reduce error on the other dataset.

  47. Hasby Fahrudin October 5, 2018 at 11:43 pm

    What an article! i am still new in the neural network thingy and this help me a lot. We still need “trial and error” element. But this article at least give me an idea where to start on improving my model. Thank you Jason!

  48. Marcus Harvey October 16, 2018 at 3:44 am

    This is the most helpful Machine Learning article I’ve seen. Thanks very much.

  49. HAONAN LIU November 9, 2018 at 2:38 pm

    Thanks Jason! I learned quite a lot from your blogs!

  50. ismetb December 7, 2018 at 8:15 pm

    Hi Jason, thank you for these wonderful ideas. But, don’t you think AI is reduced to;

    1) Find some data
2) Apply built-in algorithms
    3) Tune (mostly this!)

    I see the people often have no idea about what’s going on (including me sometimes). So, what do you think?

    • Jason Brownlee December 8, 2018 at 7:05 am

      Not AI, instead a small subfield that is the most useful part of AI right now called “predictive modeling”.

      Results over understanding is accepted almost everywhere else, why not here?

  51. AMM March 10, 2019 at 11:28 pm

Thank you very much for this great post, it is really useful.

  52. DeepNet March 11, 2019 at 2:52 am

    Hi Jason, thanks a lot for sharing the other best post,
    I have a question please answer the question that I lined the StackOverflow,
    If possible, reply to this question here, thanks,

    https://stackoverflow.com/questions/55075256/how-to-deal-with-noisy-images-in-deep-learning-based-object-detection-task

  53. hojae son March 20, 2019 at 5:47 pm

    Thank you

  54. Lego March 28, 2019 at 5:05 pm

    siva has no teeth

  55. VISHWSAIJ March 28, 2019 at 5:26 pm

    Hi Jason,

    Thank you for sharing great post, I really appreciate.

    you mentioned in section 2: Invent More Data

    -If you can’t reasonably get more data, you can invent more data.

    -If your data are vectors of numbers, create randomly modified versions of existing vectors.

    —-> Have you an example how to create randomly modified versions of existing vectors.?

    Thank you!

    • Jason Brownlee March 29, 2019 at 8:21 am

      Not directly.

      A simple approach would be to add gaussian noise.

  56. DHARMA March 28, 2019 at 5:29 pm

    HIIII JASON

  57. KS Aswin March 28, 2019 at 5:29 pm

    I am a newbie in deep learning, experimenting with existing examples using the DIGITS interface. Thank you for all the effort to simplify the topic; it is technical documentation that is still well understandable for newcomers.
    My interest is in detecting (and counting) particles via deep learning. After many, many experiments with various samples, I realised that particles too close to each other (almost touching) are counted as one, even though there is a clear separation to my eye.

    I am looking for an approach to handle this.

    I tried some thresholding techniques on individual images in image processing software, and the best results were obtained with color thresholding. I have no idea whether thresholding is part of particle detection in deep-learning-based object detection, but I wonder whether it is possible to integrate an equivalent process into object detection models? (I am currently using the DetectNet model.)
    Regards,

    • Avatar
      Jason Brownlee March 29, 2019 at 8:24 am #

      Perhaps you can try a suite of different preparations for each input image and either model them with parallel models in an ensemble or with a multi-input model.

      Also, since these are images, consider a multi-headed CNN with a different kernel size on each head.

      Perhaps also try leveraging pre-trained models.
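
      Here is a minimal sketch of the multi-headed idea with the Keras functional API; the image size, filter counts, and kernel sizes are placeholder assumptions:

```python
# A minimal sketch of a multi-headed CNN: one input image, three
# convolutional heads with different kernel sizes, merged before the
# output. All shapes and filter counts here are placeholder assumptions.
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     Flatten, Dense, concatenate)
from tensorflow.keras.models import Model

inputs = Input(shape=(64, 64, 1))      # placeholder grayscale image size
heads = []
for k in (3, 5, 7):                    # a different kernel size per head
    x = Conv2D(16, (k, k), activation='relu', padding='same')(inputs)
    x = MaxPooling2D((2, 2))(x)
    heads.append(Flatten()(x))

merged = concatenate(heads)            # let the model combine all scales
x = Dense(32, activation='relu')(merged)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')
```

      Each head responds to structure at a different scale, which may help separate particles that are almost touching.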

  58. Avatar
    KS Aswin March 28, 2019 at 5:31 pm #

    I like your smile Jason

  59. Avatar
    Ilona April 16, 2019 at 6:24 am #

    Hi Jason,
    Thank you very much for sharing your knowledge and experience with all of us.
    It made my life as an ML newcomer much easier and answered a lot of open questions. Furthermore, your style of writing is nice to read; it makes me curious to know more 🙂

  60. Avatar
    Karlfans April 17, 2019 at 1:27 pm #

    Thank you very much!

  61. Avatar
    vaishali May 8, 2019 at 8:08 pm #

    Dear Sir,
    Thank you for your post. It is really helpful for my PhD research, which I started a few months back. Can you suggest which machine learning / deep learning algorithm will be best for text classification?
    Thank you.

  62. Avatar
    Andi May 25, 2019 at 7:30 pm #

    Incredibly, overwhelmingly good article! 🙂

  63. Avatar
    Ahmed May 31, 2019 at 1:08 pm #

    Nice, big thanks 🙂

  64. Avatar
    Dora July 30, 2019 at 1:35 am #

    Hi Jason, thank you so much for this post. I’ve been overwhelmed by tuning parameters for weeks, and your post gives me a clear direction on how to do that. I just found that the two links under 3.5 Network Topology (how many hidden layers and units should I use) don’t work. Could you update those links? Thanks!

  65. Avatar
    julia August 7, 2019 at 12:55 pm #

    Thank you so much Jason. By using weight initialization, accuracy increased from 0.05 to 0.9497. Your tutorial is the best in machine learning. I’m going to publish a paper with these excellent results. Thank you so much!

  66. Avatar
    Rudina August 30, 2019 at 10:32 pm #

    You are great. All my questions are answered by you.
    Keep it up, Jason, and thanks a lot 🙂

  67. Avatar
    hassan ahmed September 26, 2019 at 4:33 pm #

    I have a question.
    Suppose I have a dataset of thousands of images. It took several hours to train the DL model. After training, I realized that I should go with some other configuration of hyperparameters (selected by trial and error). That way, I will again have to wait several hours to train the model with the new hyperparameters, and the same situation keeps repeating.
    Can you suggest a tutorial or relevant topics so that, after performing my training once, I get the best model, instead of trying the training with various configurations?
    Thanks for your cooperation.

    • Avatar
      Jason Brownlee September 27, 2019 at 7:47 am #

      No such method is known.

      You can short-cut the process with transfer learning – adapting a pre-trained model from another domain, or one of your own pre-trained models.
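
      For example, here is a minimal sketch of that shortcut in Keras, reusing VGG16 pre-trained on ImageNet; the head sizes and the binary output are assumptions for illustration:

```python
# A minimal sketch of the transfer learning shortcut: load a backbone
# pre-trained on ImageNet, freeze it, and train only a small new head,
# so the expensive feature learning is not repeated for each experiment.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False                 # keep the pre-trained weights fixed

x = Flatten()(base.output)
x = Dense(128, activation='relu')(x)   # small task-specific head
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```

      Because only the small new head is trained, each hyperparameter experiment becomes far cheaper than re-training the whole network from scratch.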

  68. Avatar
    hemanth kumar October 1, 2019 at 4:19 pm #

    Sir, how can I get clean segmentation with a CNN?
    Which parameters do I have to change to get clean segmentation: the filter size, or the padding size in max pooling?

    Sir, I am getting wrong segmentation with 3 classes (using softmax as the activation function).

  69. Avatar
    awan April 22, 2020 at 4:55 pm #

    Hi sir. Again, a great article. I have a question regarding all this.
    It is difficult for a beginner to implement all of this, and a self-implemented network might not be accurate. So can you please suggest a good open source implementation of all this in R or Python, in which we can plug in our own new activation function or just change the aforementioned (in your post) parameters and check the results? Does such an implementation exist, covering all these things, where we are supposed to change just the parameters?
    Thanks for your kind response, sir.

  70. Avatar
    mike June 16, 2020 at 3:05 pm #

    Hi Jason.

    Sometimes some old data, if we include it in the training data, becomes “toxic” to our model.
    But the problem is that we don’t know which part of the old data causes this: it can be the oldest data, it can be in the middle, and it can be only 10% or 15% bad data.

    Maybe we can find it manually by looking, but how do we automatically detect this “toxic” data and remove it?

    Any suggestions?

    Thanks
    Mike

    • Avatar
      Jason Brownlee June 17, 2020 at 6:16 am #

      Perhaps fit the model with each subset of data removed and compare the performance from each experiment.
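
      Here is a minimal sketch of that experiment, assuming the data is in time order and split into chunks, a Keras-style model returned by a user-supplied build_model() function, and a fixed validation set (all of these are assumptions to adapt):

```python
# A minimal sketch of the suggestion: refit with one chunk of the
# (time-ordered) data held out at a time and compare validation scores.
# build_model(), the chunking, and the epoch count are assumptions.
import numpy as np

def ablation_scores(X, y, X_val, y_val, build_model, n_chunks=10):
    scores = {}
    idx = np.arange(len(X))
    for i, chunk in enumerate(np.array_split(idx, n_chunks)):
        keep = np.setdiff1d(idx, chunk)   # drop one chunk of the data
        model = build_model()             # fresh model for each run
        model.fit(X[keep], y[keep], epochs=10, verbose=0)
        scores[i] = model.evaluate(X_val, y_val, verbose=0)
    return scores  # a chunk whose removal improves the score is suspect
```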

  71. Avatar
    Nicolas Nemec June 19, 2020 at 1:20 am #

    Hi Jason, thanks a lot for this post! I am using it for my computer science school project and it really helps. I have one question, though, about section 2: I don’t quite understand why resampling methods are in the algorithms section and not in section 1. Is this just because resampling methods are section 1 material applied?

  72. Avatar
    Nicolas Nemec June 23, 2020 at 6:55 pm #

    Thank you so much for this article! It really helped and it wasn’t the only one that did. I have been using your website for a while now to help with my school project.

    • Avatar
      Nicolas Nemec June 23, 2020 at 6:56 pm #

      Forgot I already commented here 🙂

      • Avatar
        Jason Brownlee June 24, 2020 at 6:29 am #

        No problem. We’re one big community of practitioners.

    • Avatar
      Jason Brownlee June 24, 2020 at 6:28 am #

      You’re very welcome!

  73. Avatar
    nkm July 2, 2020 at 12:44 am #

    Hi Jason,

    thanks for these great supports.

    I am trying binary classification using VGG transfer learning. I would like to share a few observations for your comments:

    1. My training accuracy is not increasing beyond 87%. How can I increase training accuracy to beyond 99%?

    2. For a large number of epochs, validation accuracy remains higher than training accuracy. When both converge and validation accuracy goes down to training accuracy, the training loop exits based on the early stopping criterion.

    3. Test accuracy comes out higher than training and validation accuracy.

    4. Do we need to use SGD or Adam, with a very low learning rate, while re-training VGG?

    I would be grateful for any suggestions for improvement.
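
    On point 4, one commonly used setup is SGD with a very low learning rate plus early stopping, so the pre-trained weights change only gently. A minimal sketch follows, with all shapes and hyperparameter values as assumptions to tune:

```python
# A minimal sketch of re-training a VGG16-based model gently: SGD with
# a very low learning rate, stopping early on validation loss. All
# shapes, head sizes, and hyperparameter values are assumptions to tune.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import EarlyStopping

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False                    # freeze the backbone at first

x = Dense(64, activation='relu')(Flatten()(base.output))
model = Model(inputs=base.input, outputs=Dense(1, activation='sigmoid')(x))

# A low learning rate keeps updates small so the pre-trained features
# are not destroyed; early stopping guards against overfitting.
model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
              loss='binary_crossentropy', metrics=['accuracy'])
stop = EarlyStopping(monitor='val_loss', patience=10,
                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[stop])  # supply your own data arrays
```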

  74. Avatar
    nkm July 2, 2020 at 4:35 pm #

    Thank you so much for your valuable support.

  75. Avatar
    sukhpal February 27, 2021 at 4:06 pm #

    Sir, kindly provide information about ensembling of CNNs with fine-tuning and freezing.

    • Avatar
      Jason Brownlee February 27, 2021 at 6:06 pm #

      You can find examples of all of this on the blog, use the search box at the top of the page.

  76. Avatar
    Vaishnavi June 23, 2021 at 2:08 pm #

    Jason, I am not able to open any of the related links provided in this blog. Can you please check and confirm?

    • Avatar
      Jason Brownlee June 24, 2021 at 5:58 am #

      Sorry to hear that, perhaps you can try a different browser or different internet connection.

  77. Avatar
    Kevin Casey April 11, 2022 at 5:47 am #

    Still a great article all these years later! Thanks! Bookmarking this for forever.

    There’s a tiny typo, by the way… “Spot-check a suite of top methods and see which fair well and which do not” should actually be “…which fare well…”.

    • Avatar
      James Carmichael April 14, 2022 at 3:42 am #

      Thank you for the feedback Kevin!

  78. Avatar
    Zahoor Ullah December 21, 2023 at 4:33 pm #

    Thanks for your efforts; I appreciate your work. It has helped me in understanding the basics of deep learning.

    • Avatar
      James Carmichael December 22, 2023 at 10:34 am #

      You are very welcome Zahoor! We appreciate your support!
