So, You are Working on a Machine Learning Problem…

So, you’re working on a machine learning problem.

I want to really nail down where you’re at right now.

Let me make some guesses…

So, You are Working on a Machine Learning Problem…
Photo by David Mulder, some rights reserved.

1) You Have a Problem

So you have a problem that you need to solve.

Maybe it’s your problem, an idea you have, a question, or something you want to address.

Or maybe it is a problem that was provided to you by someone else, such as a supervisor or boss.

This problem involves some historical data you have or can access. It also involves some predictions required from new or related data in the future.

Let’s dig deeper.

2) More on Your Problem

Let’s look at your problem in more detail.

You have historical data.

You have observations about something, like customers, voltages, prices, etc. collected over time.

You also have some outcome related to each observation, maybe a label like “good” or “bad” or maybe a quantity like 50.1.

The problem you want to solve is, given new observations in the future, what is the most likely related outcome?

So far so good?

3) The Solution to Your Problem

You need a program. A piece of software.

You need a thing that will take observational data as input and give you the most likely outcome as output.

The outcomes provided by the program need to be right, or really close to right. The program needs to be skillful at providing good outcomes for observations.

With such a piece of software, you could run it many times, once for each observation you have.

You could integrate it into some other software, like an app or webpage, and make use of it.

Am I right?

4) Solve with Machine Learning

You want to solve this problem with machine learning or artificial intelligence, or something.

Someone told you to use machine learning or you just think it is the right tool for this job.

But, it’s confusing.

  • How do you use machine learning on problems like this?
  • Where do you start?
  • What math do you need to know before solving this problem?

Does this describe you?

Or maybe you’ve started working on your problem, but you’re stuck.

  • What data transforms should you use?
  • What algorithm should you use?
  • What algorithm configurations should you use?

Is this a better fit for where you’re at?

I Am Here to Help

I am thinking about writing a step-by-step playbook that will walk you through the process of defining your problem, preparing your data, selecting algorithms, and ultimately developing a final model that you can use to make predictions for your problem.

But to make this playbook as useful as possible, I need to know where you are having trouble in this process.

Please, describe where you’re stuck in the comments below.

Share your story. Or even just a small piece.

I promise to read every single one, and even offer advice where possible.


If you are struggling, I strongly recommend following this process when working through a predictive modeling problem:

151 Responses to So, You are Working on a Machine Learning Problem…

  1. Avatar
    ML rookie April 4, 2018 at 5:44 am #

    Thank you for your blog. So many great posts here, and I have not gone through all your previous posts yet. Let me describe where I am right now:

    1- I have an idea: To create a DL model that generates code.

    2- More: Actually my model aims to generate some templates. Those templates need some additional data from the user before they can be rendered into complete code.

    3- I have been reading about it, and I guess I need to use RNN (LSTM) models in order to generate code (or templates). My problems:
    A- If I want my DL model to generate templates for a linear regression program, for example, my training data should be linear regression programs, right? How can I also input the performance of these programs so that it is considered as training data as well?
    B- Most linear regression programs have a lot in common. So for example, how can I teach my DL model to be proficient in generating linear regression programs, without necessarily going through predicting the next character or word?

    Just to summarize, I want to create a DL model that generates ML programs based on some user input 😀 What are the logical steps I can take to do so?

    • Avatar
      Jason Brownlee April 4, 2018 at 6:23 am #

      Interesting problem.

      I have some examples of LSTMs learning to compute that might help:

      This is a general process to work through for a new predictive model:

      I’d recommend spending a lot of time on defining the problem (first step) and on building a large dataset of input/output pairs, the data used to train the model.

    • Avatar
      Bart April 5, 2018 at 12:26 am #

      Hate to spoil the daydreaming, but it is not possible to have DL write code for you:

      “…, you could not train a deep-learning model to read a product description and generate the appropriate codebase. That’s just one example among many. In general, anything that requires reasoning—like programming or applying the scientific method—long-term planning, and algorithmic data manipulation is out of reach for deep-learning models, no matter how much data you throw at them. Even learning a sorting algorithm with a deep neural network is tremendously difficult.
      This is because a deep-learning model is just a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data manifold X into another manifold Y, assuming the existence of a learnable continuous transform from X to Y. A deep-learning model can be interpreted as a kind of program; but, inversely, most programs can’t be expressed as deep-learning models—for most tasks, either there exists no corresponding deep-neural network that solves the task or, even if one exists, it may not be learnable: the corresponding geometric transform may be far too complex, or there may not be appropriate data available to learn it.”

      – françois chollet – in a book “Deep learning with Keras”
      (keras creator)

      • Avatar
        Bart April 5, 2018 at 12:30 am #

        Correction: the name of the book I quoted from is “Deep Learning with Python” by F. Chollet

      • Avatar
        Jason Brownlee April 5, 2018 at 6:05 am #

        I’m hesitant to count anything out. I have seen some of the learning to compute and code generating LSTM papers and it’s impressive stuff.

        Agreed, there is a long way to go.

        François does have a pessimistic outlook in his writing and tweeting. It was not long ago that object identification and automatic image captioning were a pipe dream and now it is practically a hello world for beginners. Neural nets could “never learn a language model”, until LSTMs started working at scale.

        For a kid who used to learn XOR with backprop back in the day (me), the progress is incredible.

        • Avatar
          Bart April 5, 2018 at 7:48 am #

          Hello Jason, I strongly advise against over-hyping DL. In fact, phrases such as “deep learning” and “neural networks” are misleading for most of the public.

          “… a deep-learning model is just a chain of simple, continuous geometric
          transformations mapping one vector space into another. All it can do is map one data
          manifold X into another manifold Y, assuming the existence of a learnable continuous
          transform from X to Y….”

          Hence, if we start raising expectations that our chain of simple, continuous geometric transformations will do magic and start writing program code on its own, we are likely to end up like the pet.coms of the internet bubble mania of 2001.
          Machine learning is equipped with powerful tools, DL included, but DL is no magic box.

      • Avatar
        Kleyn Guerreiro April 6, 2018 at 5:26 am #

        I do agree with you. But I guess one of the harder (and inexact) tasks that DL can do, since it can be mapped to data to feed DL, is:

        Function point analysis, a method to estimate the cost of an app or system.

        As it is based on past projects, you can load previous projects’ features (word2vec vectors from the code, screenshots, type of language, number of entities, their attributes and types and so on) in order to quantify numeric outcomes like the number of function points for future projects or for improvements to existing ones.

  2. Avatar
    Hissashi April 4, 2018 at 6:07 am #

    I have tons of data at my disposal and I want to find insights for the business but I am not sure how to find the questions in the first place.

    • Avatar
      Jason Brownlee April 4, 2018 at 6:24 am #

      Perhaps talk to the business and ask what types of insights would really interest them, what areas, what structure, etc.

      Ideally, you want information that is actionable. E.g. where you can then devise an intervention.

  3. Avatar
    Emeka Farrier April 4, 2018 at 7:32 am #

    I have over 1000 hours of audio and transcribed text for the said audio. I’m also embarking on studying TensorFlow.

    My issue at present is how I should prepare this data to train a model (and which model should I use?). I don’t want to wait till I get a handle on TensorFlow and ML concepts before preparing my data appropriately for training, because I’m aware that data preparation can be 80% of the work in AI.

    • Avatar
      Matt April 4, 2018 at 1:04 pm #

      What are you trying to predict from the transcripts?

    • Avatar
      Jason Brownlee April 5, 2018 at 5:44 am #

      That does sound like a great project.

      I would recommend reading some papers on deep learning audio transcription (speech to text) type projects to see what type of representations are common.

      The material I have on preparing text data might help for one half of the project:

      • Avatar
        Emeka Farrier April 6, 2018 at 10:04 am #

        Thanks a million. I would like to get in touch with you to keep you posted on my progress.

  4. Avatar
    Dan April 4, 2018 at 11:11 am #

    Thanks Jason – you are the first educator that I have known in ML to define the problem facing users in order to develop a solution!! My main problem is that I have variables with high variance (sometimes maybe outliers, but they can’t be excluded for convenience). I struggle with finding the best model to extract feature importance.

  5. Avatar
    Puja April 4, 2018 at 1:24 pm #

    I have an x-API dataset that is related to educational data mining and I want to use association rule mining and clustering on this dataset using the WEKA tool. Previously I worked on this dataset using classification to predict students’ academic performance. So this time, what new things can I do with it using the above-mentioned techniques?
    Could you help me with this problem? I also want to know how training, testing and validation can be done in the WEKA tool.
    Thank you.

  6. Avatar
    Samarth Barthwal April 4, 2018 at 7:50 pm #

    How do I choose a paper to solve a problem? And if using transfer learning, which existing architecture should I begin with?

    • Avatar
      Jason Brownlee April 5, 2018 at 5:56 am #

      Start with a strong description of your problem:

      Then find papers on and related to the definition of your problem. There will be 1000 ways to solve a given problem, find methods that look skilful, simple and that you understand. Too much complexity in learning models is often a smell (e.g. code smell).

      For transfer learning, again use a strong definition of your problem and project goals to guide you. Maybe using the most skilful model matters most, maybe it doesn’t. Or maybe you must test a suite of methods to see what works best for your specific problem.

  7. Avatar
    Muhammad Younas April 4, 2018 at 8:22 pm #

    I have a task to predict student retention by modelling student behavior from observable states (such as interactions in log data that record access to lectures, discussions, problems and so on) using a “hidden Markov model”. I have data and also some research papers related to my problem, but I have no idea how to implement an HMM for this type of task. Could you please point me to any link where this type of objective is implemented with an HMM, or anything that would give me an idea of how and where to start? Thanks.

    • Avatar
      Jason Brownlee April 5, 2018 at 5:58 am #

      Perhaps google search and find some existing open source code that you can play around with to see if the approach will be viable. It will save days/weeks.

      Also, I have worked on similar problems and found very simple stats and quantile models to be very effective. Basically, users that use the software more stay longer. Sometimes obvious works.

  8. Avatar
    Ruhin shaikh April 4, 2018 at 11:16 pm #

    Hello sir,
    The problem statement I have chosen for my project is “NETWORK ANOMALY DETECTION OF DDOS ATTACKS”.
    We plan to use “DEEP LEARNING TECHNIQUES” to solve the problem.
    We plan to use a two-stage classifier in our model. In the first stage we classify using STACKED AUTOENCODERS and in the second stage using an RNN (LSTM).
    The first-stage classifier will mark anomalous records, which will be fed to the second-stage classifier. In the first stage most of the known attacks will be classified properly, whereas in the second stage the novel attacks will be classified. This is done to reduce false positives.
    The dataset we will be using is the NSL-KDD dataset.
    The dataset contains 42 features.
    We will be using Python as the platform.
    My questions are:
    1) Is this model feasible?
    2) Can you please help me with sample code to detect DDoS attacks using a STACKED AUTOENCODER and LSTM?
    I’m finding it difficult to implement this model. I would be really grateful if you could help me.

    • Avatar
      Jason Brownlee April 5, 2018 at 6:02 am #

      Sounds fine. Opinions on feasibility don’t matter though. Only the skill of the model matters.

      Sorry, I cannot write code for you. I hope to cover autoencoders in the future. Until then, perhaps you can find some open source code to use as a starting point?

  9. Avatar
    Komal April 5, 2018 at 1:27 am #

    Hi Jason,

    I am a newbie in ML. I am in the process of preparing an approach to an ML problem. I came up with the following:

    Filling missing values -> Scatter Plot -> Transformations(Log/Pow etc..) -> Normalization -> Train Model->Evaluate Model with the metrics.

    I have couple of questions. Your inputs are highly appreciated.

    My first question is about the process of choosing a transformation/normalization for any given data set. I looked for this on the internet, and most blogs suggest it’s specific to the data set.

    I would like to understand whether there is a way to at least narrow it down to a few transformation/normalization algorithms for a given data set.

    The other important understanding I lack is the statistical importance of any metric and how to choose the right metric over another for a given data set.


    • Avatar
      Jason Brownlee April 5, 2018 at 6:12 am #

      Most methods that use a linear/non-linear sum of the inputs benefit from scaling, think neural nets and logistic regression. Also methods that use distance calculations, think SVM and KNN.

      If unsure, try it and see, use model skill/hard results to guide you.

      Metric – for evaluating model skill? In that case think about what matters in a given model. How do you/stakeholders know it is good or not. Pick a metric or metrics that make answering this question crystal clear for everyone involved.
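
To illustrate the point about scaling and distance-based methods, here is a minimal sketch in plain Python. The two features (income in dollars and age in years) and their values are made up for illustration. It shows how an unscaled feature with a large range dominates a Euclidean distance, and how z-score standardization puts both features on a comparable footing.

```python
import math
import statistics

def standardize(column):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: three customers described by income and age.
income = [30000.0, 32000.0, 90000.0]
age = [25.0, 60.0, 27.0]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

# Distance from customer 0 to customers 1 and 2, before and after scaling.
raw_d01 = euclidean(raw[0], raw[1])
raw_d02 = euclidean(raw[0], raw[2])
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(raw_d01, raw_d02)        # income differences swamp the age differences
print(scaled_d01, scaled_d02)  # both features now contribute
```

On the raw data the 35-year age gap between customers 0 and 1 barely registers next to income. After standardization the two distances are of similar size, which is usually what a KNN or SVM needs to see.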

  10. Avatar
    Mahesh Nirmal April 5, 2018 at 4:02 pm #

    The article gave incredible insights on Machine learning and its importance in the present day. Loved the content. Thanks.

  11. Avatar
    Sangeeta Industries April 5, 2018 at 6:05 pm #

    Hey Jason!
    Thanks for sharing a marvelous article your content is amazing.

  12. Avatar
    Alberto April 5, 2018 at 8:06 pm #

    Hi Jason!

    I will try to describe our problem as simply as I can:

    We need to classify some economic registers. We have about 20 different categories. There are attributes whose type is easily interpretable: price is a continuous variable, product type is a discrete variable, iva type is an ordinal variable, …

    Our first attempt consisted of finding the best binary classifier for each category. I’m not sure if it is the best approach, but with it we can check what kind of algorithms work better.

    Our main problem is how to manage some attributes such as the NIF (the NIF number is the tax code allowing you to have fiscal presence in Spain). We believe that our dataset will grow and then that “discrete variable” will have a huge variety of values… And we think that this variable can be decisive in classifying a new register… How should we treat this variable?

    The problem we see is that we need to encode this variable’s values, because some machine learning algorithms only work with numbers. Using the label and count encoder strategy we are generating a lot of columns (one per NIF code) and this can underweight the rest of the columns…

    What do you think? Is there a machine learning algorithm that works better with this kind of variable?

    Thanks a lot for your work, your blog is very useful for us!

    • Avatar
      Jason Brownlee April 6, 2018 at 6:30 am #

      Thanks for sharing, good problem!

      I think you’re describing a situation where you have a categorical input (a factor) with a large number of categories (levels). E.g. a categorical variable with high and perhaps growing cardinality.

      If so, some ideas off the cuff to handle this might be:

      – Confirm that the variable adds value to the model, prove with experiments with/without it.
      – Scope all possible values and one hot encode them.
      – Integer encode labels, perhaps set an upper limit and normalize the integer (1/n)
      – Group labels into higher order categories then integer or one hot encode.
      – Analyse labels and perhaps pull out flags/binary indicators of properties of interest as new variables.

      Get creative and brainstorm domain specific feature engineering like the last suggestion.

      Does that help or did I misunderstand the question?
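
One way to combine the "set an upper limit" and "one hot encode" ideas above can be sketched as follows. This is only an illustration in plain Python: the identifier values, the `max_categories` cap and the `"OTHER"` bucket name are all assumptions, not something taken from the dataset discussed here. The most frequent categories each get their own column, and every rare or previously unseen value falls into a single shared column, so the width of the encoding stays fixed as the dataset grows.

```python
from collections import Counter

def fit_vocabulary(values, max_categories):
    """Keep the most frequent categories; everything else maps to 'OTHER'."""
    counts = Counter(values)
    kept = [value for value, _ in counts.most_common(max_categories)]
    return sorted(kept) + ["OTHER"]

def one_hot(value, vocabulary):
    """Encode one value as a 0/1 vector; rare or unseen values use the 'OTHER' slot."""
    target = value if value in vocabulary[:-1] else "OTHER"
    return [1 if slot == target else 0 for slot in vocabulary]

# Hypothetical identifier column from the training data.
train_ids = ["A1", "A1", "A1", "B2", "B2", "C3", "D4"]
vocabulary = fit_vocabulary(train_ids, max_categories=2)

print(vocabulary)                  # ['A1', 'B2', 'OTHER']
print(one_hot("B2", vocabulary))   # [0, 1, 0]
print(one_hot("Z9", vocabulary))   # [0, 0, 1], an identifier never seen in training
```

Whether the cap helps or hurts is an empirical question, so it is worth comparing model skill with and without the variable, as suggested in the first idea above.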

  13. Avatar
    Srihari Katti April 6, 2018 at 12:10 am #

    I live in the metropolitan city of Bengaluru and want to regulate the water supply to different parts of the city using machine learning, i.e. classify different places by whether usage is heavy or light and distribute water accordingly from a central water supply. How do I formulate this?

  14. Avatar
    John R April 6, 2018 at 5:47 am #

    Hi Jason,

    Thanks so much for your blog! It’s been very helpful with getting my feet wet in applying machine learning.

    I have data on many different parameters from health sensors (heart rate, skin temperature, breathing rate, air temperature, humidity, etc.) and want to try and predict the next reading of one of them (heart rate) based on the current readings of the others.

    Any thoughts on how this could be accomplished?

    I have this data for many different people, and eventually want to model how each individual responds to changes in their measurements. For example, fit people’s heart rates may not change as much with humidity.
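
One common way to frame a "predict the next reading" problem is to turn the time-ordered readings into input/output pairs, where the input is everything measured at one time step and the output is the target sensor at the following step. The sketch below assumes readings arrive at regular intervals and uses made-up values; the column order (heart rate, skin temperature, humidity) is hypothetical.

```python
def make_supervised(rows, target_index):
    """Pair each time step's readings with the target value at the next step."""
    X, y = [], []
    for current, following in zip(rows, rows[1:]):
        X.append(list(current))
        y.append(following[target_index])
    return X, y

# Hypothetical readings for one person, in time order:
# (heart_rate, skin_temperature, humidity)
readings = [
    (70, 33.1, 40),
    (72, 33.4, 42),
    (75, 33.9, 45),
    (74, 33.8, 44),
]

X, y = make_supervised(readings, target_index=0)
print(X)  # inputs: readings at time t
print(y)  # outputs: heart rate at time t + 1
```

Once the data is in this shape, any regression method can be tried on it. Building one such dataset per person would be one way to start on the per-individual modelling mentioned above.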


  15. Avatar
    Charles Brauer April 6, 2018 at 6:12 am #

    Hello Dr. Brownlee,

    I am doing research into patterns that occur in financial data. In particular, trade data from the major exchanges.

    To see what I am working on, please visit:

    The main problem I am trying to solve is modeling a dataset that is out-of-balance. Packages like SKLearn and H2O address this problem with an API argument like class_weight = ‘balanced’. This helps, but I feel that this is not enough.

    It looks like Google is ignoring the unbalanced dataset problem. That’s understandable. Their business model is based on totally balanced datasets that contain text, images and audio data.

    Any comment or suggestions will be greatly appreciated.

    Charles Brauer
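
For readers unfamiliar with what a `class_weight = 'balanced'` style option does, here is a small sketch of the heuristic that scikit-learn documents for it: each class gets the weight `n_samples / (n_classes * count_of_that_class)`. The label counts below are invented for illustration, and other packages may compute their weights differently.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical imbalanced dataset: 95 negative examples, 5 positive examples.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

print(weights)  # the rare class gets a weight of 10.0, the common class about 0.53
```

With these weights, each class contributes the same total amount to a weighted loss, which is why the option helps but does not add any new information about the rare class.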

  16. Avatar
    khaldoon April 6, 2018 at 2:38 pm #

    Hello Jason, I am working on text classification research. For that, as you know, we first need to extract features. I am confused about whether to select machine learning or deep learning for my research. How do I select one, and why? Thanks.

    • Avatar
      Jason Brownlee April 6, 2018 at 3:53 pm #

      Perhaps take a quick survey of the literature for similar problems and see what is common.

      Perhaps start with a technique that you are familiar with, then expand from there.

      Perhaps find a tutorial on a similar problem and adapt it for your needs.

      There will not be a single best approach, try a suite of methods to see what works and incrementally improve your code and understanding until you are happy with the skill of the system. It may take a while.

  17. Avatar
    sai sowmya grandhi April 6, 2018 at 4:37 pm #

    I lack consistency in keeping up my modeling work with ANNs. The results are not good enough and I lose interest. I don’t have the skills to code on my own, so I take code from here and there and customize it for my work. If I have to change some parameter in it to improve results, I get seriously overwhelmed by the amount of material I have to search through to finally get what I need. I am short of time and getting frustrated about why I took up this challenge (project) in the first place. I use RStudio for modeling and have large amounts of data to deal with (daily climate data for 40-50 years). Jason, please suggest what I can do to speed up my learning process.

  18. Avatar
    Maria April 6, 2018 at 6:25 pm #

    Hi Jason,
    thank you for all your posts about the different problems in ML and DL. They are always very detailed and therefore very helpful.

    I have to solve a multi-label classification problem with blog posts. For me, as a student in Digital Humanities, it is very difficult to understand all the different parameters and statistics. I am using doc2vec to get the vector representation of the text as input to the Keras model. I tried to find out which model is best for my problem and came to the conclusion that an LSTM should fit.

    But I still have many questions:
    – How can I get something like accuracy for the multi-label multi-class prediction? How can I evaluate the multi-label model?
    – How many and which kind of layers would be good? Several LSTM cells in a row?
    – Does it make sense to use an autoencoder between doc2vec and the LSTM to improve my accuracy?
    – How big is the impact of the doc2vec parameters on the LSTM output?
    – How can I find the best combination of all these different parameters?

    I think it is even harder for a newbie to solve a multi-label classification problem with text than a multi-class classification problem with images, because there are a lot more useful papers, tutorials and examples about the latter.
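
On the question of evaluating a multi-label model, three measures that are commonly used are Hamming loss (the fraction of individual label decisions that are wrong), subset accuracy (the fraction of examples where every label is right) and micro-averaged F1. The sketch below computes them in plain Python on made-up predictions, assuming each example is represented as a list of 0/1 indicators, one per label.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that were predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose complete label set was predicted exactly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for tr, pr in zip(y_true, y_pred):
        for t, p in zip(tr, pr):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical ground truth and predictions for 3 posts and 3 possible labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

print(hamming_loss(y_true, y_pred))     # 2 wrong slots out of 9
print(subset_accuracy(y_true, y_pred))  # 1 exact match out of 3
print(micro_f1(y_true, y_pred))         # 0.8
```

Subset accuracy is the strictest of the three, so it is often reported alongside one of the others rather than on its own.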


    • Avatar
      Jason Brownlee April 7, 2018 at 6:18 am #

      Great comment!

      Yes, multi-label classification is under served. For example, I have nothing on it and I should. I hope to change that in the future.

      Every project will have lots of questions, lots of unknowns. You must get comfortable with the idea that there are no great truths out there, that all “results” and “findings” are provisional until you learn or discover something to overturn them. That is the scientific mindset required in applied machine learning.

      I would recommend writing out each question as you have done. Tackle each in turn. Survey literature on google scholar, look at code and related projects, ask experts, get provisional answers to each question and move on. Circle back as needed. Iterate, but always continue to progress the project forward.

      Many questions we have, like how many layers or how many neurons or which algorithm is best for your problem have no answer. No expert in the world can tell you. They might have ideas of things to try, but the answer is unknown and must be discovered through experimentation on your specific dataset.

      I do see this a lot and I think the remedy is a change in mindset:

      – From: “I am working on a closed problem with a single solution that I just don’t know yet”
      – To: “I am working on an open problem with no best answer and many good enough answers and must use experiments and empirical results to discover what works”.

      Does that help?

      I write about this here:

      And here:

  19. Avatar
    Patrick April 6, 2018 at 7:35 pm #

    Hi Jason,

    Thanks for your blog! I have data of all previous shipping vessel positions in the world. I am looking at a specific market, and I am trying to forecast the freight price based on features extracted (variables created) from the data (AIS shipping data).

    To this point I have mainly focused on data exploration and feature extraction: counting the number of vessels in specific areas over time, distances to loading port, etc. So I have lots of variables that can be used. What do you recommend for finding the “right” variables, and the “right number” of them, to use in my model? I have read that random forests might do that.

    I’ve done a lot of reading, and think that a neural network (LSTM or MLP) might be the way to go, using diagnostics and grid search to find the number of epochs, number of neurons, etc. I’ve also read that other types of neural networks might be used. This kind of problem might also be solved by a support vector regressor (SVR), since in the multivariate case, when the number of variables is high and the data become complex, traditional fitting techniques such as ordinary least squares (OLS) lose efficiency. Lastly, I’ve been told that multivariate adaptive regression splines (MARS) can give some results in forecasting.

    So to summarise, I want to do a multivariate forecast for the price data based on several variables found from global vessel positions, using LSTM, MLP, SVR, MARS or other ANN algorithms.

    What do you recommend? What are your thoughts?

    One last question, do you have any good resources on stream learning? To my understanding, it might be inefficient to re-train the model every time a new observation is made.

    Thank you!

  20. Avatar
    iulia April 6, 2018 at 7:47 pm #

    Currently I’m working on multiclass classification problem with RF.
    My biggest challenge for this particular problem is heavily imbalanced classes : I have one class that contains only one sample. I cannot ignore this class, I cannot collect more samples for this class and I don’t know how to generate more samples for this class.

    If somebody has faced the same problem, please help 🙂

  21. Avatar
    Carlos Augusto April 7, 2018 at 5:35 am #

    Hi Jason, congratulations on your posts and your availability to help people with ML problems.
    We are working on an ML project that uses health features to predict infant mortality in a specific region.
    We are using regression modeling on this project, and the difficulty we are dealing with now is that none of the regression models we
    tested (linear, general linear model and SVM) produced good statistical results, such as low p-values or normally distributed residuals.
    The features in the dataset have outliers that cannot be removed, and none of the predictor features has a linear relationship with the target feature.
    The predictors show low values in the correlation matrix, which is good. The features in the dataset were selected by people experienced in the domain.
    We also tested some standardization of the data, such as normalization, z-score and log, but it did not solve the problem.
    Do you have any guess as to where we are failing? Could you suggest some ideas for us, please?
    Thanks, and sorry about any errors in my English; I’m from Brasil and have not mastered the English language.

    • Avatar
      Jason Brownlee April 7, 2018 at 6:40 am #

      It could be a million things, for example:

      – Perhaps you need more data.
      – Perhaps you need to relax your ideas about removing outliers (inputs won’t be gaussian invalidating most linear methods)
      – Perhaps you need to try a suite of nonlinear methods
      – Perhaps you need to use an alternate error metric to evaluate model skill.
      – …

      No one can give you the silver bullet answer, you are going to have to work to figure it out. Gather evidence to support your findings. Be prepared to change your mind on everything in order to get the most out of the problem.

      I list a ton of ideas to think about here:

  22. Avatar
    Gautam Karmakar April 8, 2018 at 4:12 pm #

    I have a problem that I need help or guidance with. The problem is finding similar questions given a new question. I have around 1500 unlabeled questions to start with. What is the best approach?

    • Avatar
      Jason Brownlee April 9, 2018 at 6:06 am #

      Good question.

      There are many excellent classical text similarity metrics. Perhaps experiment with them? I would recommend start by surveying the field to see what your options are.
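
As one example of a classical text similarity metric, the sketch below ranks stored questions against a new one using Jaccard similarity, the size of the shared word set divided by the size of the combined word set. The questions and the simple lowercase tokenizer are assumptions made for illustration, and this is only one of many metrics worth comparing.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all distinct words across both texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def most_similar(query, questions, top_k=3):
    """Return the stored questions ordered from most to least similar."""
    ranked = sorted(questions, key=lambda q: jaccard(query, q), reverse=True)
    return ranked[:top_k]

# Hypothetical stored questions and a new incoming question.
questions = [
    "How do I reset my password?",
    "What is the refund policy?",
    "How can I change my password?",
]
query = "I forgot my password, how do I reset it?"

for question in most_similar(query, questions, top_k=2):
    print(round(jaccard(query, question), 2), question)
```

Because no labels are needed, a metric like this can be tried directly on the unlabeled questions and judged by inspecting the top matches.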

  23. Avatar
    khaldoon April 10, 2018 at 2:50 am #

    How can I use information gain in text classification? What will the output look like? For example, if we have X_train.shape=(10,50), what is the output shape for information gain? And can we use it with classifiers like NB or SVM?
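
As a rough sketch of the idea: information gain is usually computed once per feature, as the entropy of the class labels minus the weighted entropy of the labels after splitting on that feature. Under that reading, a matrix of shape (10, 50) would yield 50 scores, one per column, which can then be used to choose which columns to pass on to any classifier. The example below assumes binary word-presence features and uses a tiny made-up matrix; continuous features would need to be binned first.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_column, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_column):
        subset = [y for x, y in zip(feature_column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def information_gain_per_feature(X, y):
    """Return one information gain score for each column of X."""
    n_features = len(X[0])
    return [information_gain([row[j] for row in X], y) for j in range(n_features)]

# Hypothetical data: 4 documents, 3 binary word-presence features, 2 classes.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
y = [1, 1, 0, 0]

scores = information_gain_per_feature(X, y)
print(scores)  # one score per feature; the first column separates the classes perfectly
```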

  24. Avatar
    Jordan April 10, 2018 at 3:03 am #

    I’m just starting to learn machine learning. So I don’t have much idea about this field. I don’t even know if the problem I’m going to take up is even a Machine learning problem, asking it here anyway.

    There is an exam that is conducted once every year, and the rank list (result) is publicly available. Suppose I have the rank lists of the last 5 years (that is, 5 rank lists, each with the ranks of the aspirants and the scores they gained). Is it possible for me to predict the rank of a new user based on the score of a mock test he attended?

    Is this even a machine learning problem?
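
Before reaching for a learned model, a simple baseline for this kind of question is a direct lookup: count how many past candidates scored higher than the new score. The sketch below does that for a single year's list, with made-up scores, and it rests on a strong assumption that the mock test is scored on the same scale and is about as hard as the real exam.

```python
from bisect import bisect_right

def estimated_rank(past_scores, new_score):
    """Rank the new score would have had: one plus the number of higher past scores."""
    ordered = sorted(past_scores)
    higher = len(ordered) - bisect_right(ordered, new_score)
    return higher + 1

# Hypothetical scores from one past rank list.
past_scores = [88, 92, 75, 60, 95]

print(estimated_rank(past_scores, 90))  # 3, since two past scores are higher
```

Repeating the lookup for each of the five years gives a range of plausible ranks, which may already answer the question without any training.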

  25. Avatar
    Anthony Weche April 13, 2018 at 9:34 am #

    Hello Jason, what is the best algorithm to use for preventive maintenance? Take, for example, a case where you want to do predictive maintenance early enough, before the machine breaks down. Or you want to see that the machine has worked sufficiently before changing parts or oil.
    Please advise, thank you.

    • Avatar
      Jason Brownlee April 13, 2018 at 3:30 pm #

      Great question, perhaps look into survival analysis methods?
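
As a taste of what survival analysis looks like, here is a minimal sketch of the Kaplan-Meier estimator, a standard starting point in that field. It estimates the probability that a machine is still running after a given time, while correctly handling machines that had not yet failed when observation stopped (censored records). The durations below are invented for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one failure.

    durations: time until failure, or until observation stopped
    observed:  True if the machine failed at that time, False if censored
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        removed = 0
        # Group all records that share the same time.
        while i < len(events) and events[i][0] == t:
            failures += 1 if events[i][1] else 0
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical running times (in months) for five machines.
durations = [5, 8, 8, 12, 15]
observed = [True, True, False, True, False]

for time, probability in kaplan_meier(durations, observed):
    print(time, round(probability, 2))
```

A maintenance rule could then be phrased in terms of the curve, for example servicing before the estimated survival probability drops below a chosen threshold.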

  26. Avatar
    Novoszáth András May 12, 2018 at 12:07 am #

    My actual issue: how do I recognize as quickly as possible what type of model I should start to use and tinker with? After that, the main problem becomes understanding the model, because of the notation in which it is described.

    Nonetheless, these can be learned, and I am happy to do so in the long term. A big constraint on that, however, is the lack of accessible, worked-out and solid examples providing practice opportunities. There are some really good ones out there, but it always takes lots of time and trial and error to find them.

  27. Avatar
    Riad June 2, 2018 at 3:41 am #

    hello jason
    I want to do a sentiment analysis on a dataset of movie reviews but I don’t know where to start, so please can you help me?

  28. Avatar
    Amandeep June 14, 2018 at 5:38 pm #

    Excellent article,
    Thanks for sharing this good information on machine learning…. 🙂

  29. Avatar
    Sevval June 29, 2018 at 5:02 pm #

    Thank you for your help and this helpful website. I sent an email to you.

  30. Avatar
    German July 5, 2018 at 1:55 am #

    Hi Jason,
    We are building a logger program to track user interface with a core business system. The data set will have the system name, object type, description, order step and a process tag.

    What we would like to predict for example is,
    – In a new logger instance, which is the process tag?
    – Which is the next step?


  31. Avatar
    Mohamed Bennouf July 22, 2018 at 10:14 am #


    I am amazed I found your blog, where I can actually ask the question I have had for so long but did not know where to begin or who to ask!

    Anyway, I am a service engineer and I am trying to figure out the feasibility of a troubleshooting system to assist me (and other engineers) in solving tool failures (of a mass spectrometer instrument). Over the last 12 years we have accumulated 20K+ failure cases in the form of a problem/solution database (filled in by our engineers in the field). I would like to use machine learning (maybe deep learning) to help our engineers find the most likely case they can apply. For now they use the regular search function in the database software we are using. Of course, this means that they have to use the correct words, since the search is keyword-based. It would be great if the AI system could find the solution to a problem stated in a sentence rather than in keywords. Of course, if the system actually learned the domain, it would be even better. In the past, something like case-based reasoning would have been a solution.

    My issue is how to approach this learning problem given that it is a text-based domain (rather than just some numbers). I can output the cases (problem type, problem and solution) into an Excel file if needed. The goal is to state the new problem and then have the AI show all the cases (solution part) that could apply to the new problem. That way the engineers do not have to reinvent the wheel. Of course, the AI may or may not find the correct answer, but it could at least give the engineers solutions that they can check.

    I hope I am making sense; if not, please do not hesitate to ask me questions. To help, I have pasted below what a typical case looks like from our 20K+ database.

    Thank you again for allowing us to pick your brain 🙂




    The Hall probe is stuck at 7000. Resetting the chassis and the real time did not work.

    Found a dead power transistor along with a bunch of dead 0.1 power resistors. One transistor support was also damaged. Replaced those parts and the magnet worked correctly. Working with Cs2 (Hall probe around 8000) is very hard on the magnet, and we can expect more failures if that mode is used for too long.



    The projection turbo pump failed again. The power went up and the vacuum degraded in that area.

    Was going to replace the pump again but noticed that the backing valve of the pump was closed (?) while the synoptic showed the valve open. Closed/opened the valve from the synoptic, and the vacuum got better and the pump power went down to normal. I also noticed that the compressed air was set to the minimum value. We increased it and so far it is working fine. Customer will return the pump to Madison (never used).


    The egun current is lower than normal (6-7 uA instead of 30-70 uA). I also found that the emission current is very low (0.5 mA at 3000 bits filament).

    Replaced the egun filament. At first I adjusted the filament height by unscrewing the part 3/5 turn (3 notches back). But then I saw that the manual now says to unscrew 1/4 to 1/3 turn, so I opened it and did so. Still, I could not get a normal emission current (got 0.5 mA with the filament set to 3000 bits instead of 2 mA). Finally I found out that the resistors R1 and R2 had bad solder connections. Now we can get 2 mA with around 3000 bits.

  32. Avatar
    Mohamed Bennouf July 25, 2018 at 4:11 pm #

    Thank you so much Jason! I will look into the link. So you think it will be more like a search problem rather than a machine learning issue per se? I was hoping that system X would learn about the domain (from the 20K cases) and then come up with a solution to a novel problem description.

    In any event, thank you for an amazing blog and for taking the time to help us learn how to solve real problems with AI.


    • Avatar
      Jason Brownlee July 26, 2018 at 7:36 am #

      Try many different framings of the problem. You know more about it than me. Try it as supervised learning, try it as search, and go with what works best.
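
As a starting point for the search framing, here is a minimal sketch of ranking past cases against a free-text problem description using TF-IDF weights and cosine similarity. It is an illustration only, not a recommended final design: the function names (`build_index`, `search`) are made up for this example, and the three case strings are abbreviated from the problem descriptions pasted in the comment above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector (dict of term -> weight)."""
    counts = Counter(tokens)
    # Terms never seen in the indexed cases are ignored.
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(documents):
    """Return (list of TF-IDF vectors, idf weights) for a list of strings."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency, so common words get low but non-zero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, vectors, idf, top_k=3):
    """Return up to top_k (document, score) pairs, best match first."""
    query_vector = to_vector(tokenize(query), idf)
    scored = [(cosine(query_vector, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(documents[i], score) for score, i in scored[:top_k] if score > 0.0]


# Problem descriptions abbreviated from the example cases in the comment above.
cases = [
    "The Hall probe is stuck at 7000. Resetting the chassis did not work.",
    "The projection turbo pump failed again. The vacuum degraded in that area.",
    "The egun current is lower than normal and the emission current is very low.",
]
vectors, idf = build_index(cases)
results = search("emission current too low on the egun", cases, vectors, idf)
for text, score in results:
    print(round(score, 3), text)
```

A sentence-style query is matched on shared words rather than exact keywords, which may already improve on a plain keyword search. A supervised framing could then be compared against this baseline.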

  33. Avatar
    Rashmiranjan Sharma September 4, 2018 at 7:16 pm #

    I want to predict the min, max and modal price of agricultural commodities for the next 30 days.
    Can you help me solve this problem? I already have a dataset, but I don’t know how to select the best algorithm.
    Thank you in advance.

  34. Avatar
    Johann September 27, 2018 at 10:39 am #


    First, thanks for your awesome website: so many great articles inside!

    My problem: I have technical issues (expressed in common English / natural words like “my application is not working and the error message is ‘cannot access'”) and I would like to make a bot to automatically answer them with the most probable reason (“did you try to open the firewall? Here is the link to the how-to”). Where should I start? How should I train? Any information / guidance will be super appreciated.

    Thanks a lot,

    • Avatar
      Jason Brownlee September 27, 2018 at 2:46 pm #


      Perhaps there is a natural relationship between problems and things to try. One approach would be to structure it like a recommender system: people that ask questions like this find it helpful to get suggestions like that.

      If you don’t have such data, you may have to collect it using an existing dumber system with added randomness.

      If you do have such data, you can use it to seed the system.

      This won’t be a new problem. I’d encourage you to browse the literature to get an idea of how others have approached it. Perhaps try a few different framings and see what makes sense based on the time and resources you have.

  35. Avatar
    jagjeet kaur October 25, 2018 at 4:36 am #

    Hey Jason,

    I am new to machine learning and I want to start by solving a problem that will boost my confidence in this field. Please help me out by suggesting some beginner-level problems!

  36. Avatar
    kavya October 28, 2018 at 11:53 pm #

    Hi Jason
    Great article! I really liked it,
    Thanks for sharing such a good informative article insights… 🙂

  37. Avatar
    Björn Ludin October 31, 2018 at 8:54 pm #

    My problem is this:
    I got loads of data.
    I got the odds on a betexchange for horse races during the race – for about 2 years.
    That is – for every race during these two years I got data like
    r1 -rn are the runners (horses)

    t1 r1 r2 r3 …
    0001 5.25 2.04 3.25
    0002 5.10 2.50 2.75

    3254 55 1.01 520

    here – r2 won

    The sample frequency is about 5 times per second.

    I also know if the runner won/placed/lost

    My goal is to be able to say that runner X will win/place after ca 50-75% of the expected race time with, say, 80% accuracy.

    My problem is that I don’t know how to model this situation. I’ve seen tournament strategies, i.e. who out of two runners will win, but here there is more data, both in time and in participants.

    What model should I pay attention to?

  38. Avatar
    Zafar November 16, 2018 at 3:41 pm #

    Well, I am trying to develop my PhD problem statement. What I actually want to do:

    The existing state-of-the-art classifiers show low accuracy on imbalanced multi-label datasets.
    There is a need to design/develop a novel/intelligent classifier to improve the accuracy on imbalanced multi-label datasets.

    How can you help me mature the problem statement, advising whether to improve an existing classifier or start designing a new classifier?

    • Avatar
      Jason Brownlee November 17, 2018 at 5:42 am #

      I recommend talking to your research supervisor.

  39. Avatar
    Suleman Zafar Paracha November 19, 2018 at 5:33 pm #

    I just started to learn ANNs. I want to start by solving a problem. Let’s suppose I have an image with many symbols in it. The image is not clear, as the signs are old and rough, but it gives a clear idea of what the actual signs are. Now I want to extract those signs from the image, find them in the database, and create a new image with the matching signs from the database.
    The new HD image will contain the signs at the same locations.
    I don’t know where to start and what tools to use for this. Thank you.

    • Avatar
      Jason Brownlee November 20, 2018 at 6:33 am #

      Start by building up a dataset of images and their associated clean symbols. Thousands of examples.

  40. Avatar
    Farzan Zaheer November 24, 2018 at 4:56 am #


    I have historic data of money spent on advertisements by various industries to a tv channel. Now I want to know if there is a budget shift in future, to be more specific if and when there is a budget shift from one industry to another.

    I don’t know where to start. Is it a classification problem, a regression problem, or a combination of both? I have to predict the shift in spending from one industry to another.

    I would be glad if you could point me in the right direction. Thank you

  41. Avatar
    Vince Creed December 4, 2018 at 7:32 pm #

    Hello Jason! I am currently working on a system that detects whether a frame in a video contains a fire or not, using machine learning. I have no idea yet how to start. I bet it is a classification problem. Can you suggest the best algorithm to use for this problem? Your advice would be highly appreciated. Thank you!

    • Avatar
      Jason Brownlee December 5, 2018 at 6:13 am #

      Sounds like time series classification, a CNN-LSTM might be a good starting point.

  42. Avatar
    Tyrion December 7, 2018 at 5:44 pm #

    Hello Jason! I recently came across the problem of classifying, finding patterns in, and analyzing news articles about HIV over the past 10 years. What would be the best approach and which would be the best algorithm to use?

  43. Avatar
    John December 12, 2018 at 11:56 am #

    Hi Jason,

    I’m currently working on a web scraping project for a finance professor at my university. It involves me manually navigating a company’s website and recording whether or not it contains instances/phrases that indicate certain principles of business; I would then assign the principle a value of 1, 0, or -1 based on whether it was included positively, not included at all, or included negatively, respectively (for example, if a company expressed that they worked individually, the principle collaboration would be assigned a value of -1). I have access to a large database (hundreds, maybe thousands) of previous analyses of websites through this project, and I was wondering if I would be able to run through the examples provided for each principle on each site and develop a program that would be able to recognize a principle in a given text and assign it a numerical value. Thanks.

    • Avatar
      Jason Brownlee December 12, 2018 at 2:16 pm #


      The presence of a phrase is binary and no ML is needed.

      If you want to handle synonyms/etc. you could use a predictive model, but you’re going to need thousands of existing examples of webpages and their scoring.

      It might be possible to fit a CNN/LSTM to output a scoring. Try it and see.
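
To make the "no ML needed" case concrete, here is a minimal sketch of scoring a page by phrase presence alone. The phrase lists and the names `PRINCIPLE_PHRASES`, `score_principle` and `score_page` are hypothetical, chosen only for illustration; real phrase lists would have to come from the project's existing analyses.

```python
import re

# Hypothetical phrase lists for one principle; these are placeholders, not project data.
PRINCIPLE_PHRASES = {
    "collaboration": {
        "positive": ["work together", "partner with"],
        "negative": ["work individually", "work alone"],
    },
}


def score_principle(text, phrases):
    """Return -1 if a negative phrase is present, 1 if a positive one is, else 0."""
    # Lowercase and collapse whitespace so matching is not sensitive to layout.
    normalized = re.sub(r"\s+", " ", text.lower())
    if any(phrase in normalized for phrase in phrases["negative"]):
        return -1
    if any(phrase in normalized for phrase in phrases["positive"]):
        return 1
    return 0


def score_page(text):
    """Score every configured principle for one page of text."""
    return {name: score_principle(text, phrases) for name, phrases in PRINCIPLE_PHRASES.items()}


print(score_page("Our teams work together with clients."))
print(score_page("Each consultant prefers to work   individually."))
print(score_page("We sell shoes."))
```

Checking negative phrases first is an arbitrary choice made for this sketch. Pages that express a principle in words not on the lists score 0, which is the gap a trained model would be meant to close.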

  44. Avatar
    Billy December 14, 2018 at 11:32 pm #

    I am new to ML and I am going to use ML for traffic accident analysis, such as making predictions and discovering hidden patterns to understand the major causes of accidents. What method would you suggest I go through, and what other ML problems can be solved using a traffic accident dataset?

    Thanks in advance.

  45. Avatar
    Tommy December 18, 2018 at 9:52 pm #

    Hello Jason,

    First of all, thank you for your posts, they are really helpful and accessible!

    Here are some of the problems that I’m trying to solve:

    1. Classifying identity cards and other documents issued by the government
    2. Extracting information from those identity cards
    3. Extracting information from tables found on some of the documents

    To solve problem 1, I’m using LeNet. Currently I’m classifying 5 types of documents (+1 class where it doesn’t detect any).

    To solve problem 2, I’m planning to use YoloV2 and preparing a custom dataset, annotating the fields manually.

    To solve problem 3, I’m not using a conv net but rather a pure image processing approach, creating a mask from the convex hull and extracting each cell individually.

    I’d very much like to hear your thoughts about this. Thanks in advance!

    • Avatar
      Jason Brownlee December 19, 2018 at 6:34 am #

      I like your approach Tommy, it is close to the approach that I would start with.

      Preparing enough training data is going to be the hard part. Model performance will hinge on how much data you can gather.

      Let me know how you go.

  46. Avatar
    Sarah January 25, 2019 at 10:22 am #

    Hi Jason,
    Thank you so much for taking the time to read this (if you do end up reading it).
    I’m a front end engineer at women’s designer shoe startup and we want to try to accurately predict a new customer’s shoe size based on a series of questions. We are currently gathering data on a customer’s foot attributes, the size they wear in other brands, and the size they wear in our brand. Is this something that can be predicted with machine learning and if so, where do I start? How much data do I need? Thank you so much!
    — Sarah

    • Avatar
      Jason Brownlee January 25, 2019 at 12:03 pm #

      Hi Sarah, sounds like a wonderful problem.

      I’d recommend two starting points.

      1. define the problem clearly and spend a long time brainstorming data that could be predictive of shoe size. Talking to experts may be helpful. Also, this framework might help:

      2. start gathering data, you will need lots of data. Maybe use friends and family as a start and just look at the data in Excel. Human brains are very good at finding short-cut patterns and these might save you a ton of time.

      I hope that helps. I’d love to hear how you go!

  47. Avatar
    Sarah January 25, 2019 at 12:27 pm #

    Thank you so much for responding. Will start with your article. Thank you!

  48. Avatar
    Henry Munene Pheneas March 7, 2019 at 1:21 am #

    My problem is this: I want to develop a multivariate regression model using gradient descent, then use the model to make predictions, and thereafter plot the predictions against the times the inputs were recorded.

    I would also like to save the trained model in a way that lets me use it for predictions in the future.

  49. Avatar
    Assia March 20, 2019 at 5:33 am #

    Hi Jason, I am a computer science student working on my final-year project (I will compare 3 methods of dynamic selection: KNORAE, KNORAU and META-DES) and I need your help, please:
    – What is the best dataset format to use for this problem, or for machine learning in general: .csv or .txt?
    – Which ensemble learning method should I choose to generate the models: Bagging, Boosting, or RandomSubspace?
    – Which are the best classifiers to use to compare these three methods?
    Thank you in advance.

    • Avatar
      Jason Brownlee March 20, 2019 at 8:39 am #

      What are those methods, I’ve never heard of them before?

      Perhaps test a few datasets and see what works?
      Perhaps try a few different ensemble methods and see what works?
      Perhaps try a few different classifiers and see what works?

      The key here, is that there are no pre-defined answers, you must discover the answers via prototypes and experiments.

      • Avatar
        Assia March 20, 2019 at 10:04 am #

        Thank you very much Jason for your attention to my message, and thanks again for your help. Maybe I didn’t make my questions specific enough, but your blog and your articles are so helpful.

  50. Avatar
    Roman October 30, 2019 at 4:58 am #

    Hi Jason! Great to see that there are people like you in the data science world who are ready to help and share.

    I’m trying to get into data science and ML topics and realized that the best way to do it is to solve a problem I would like to solve.

    Now I’m trying to deal with this seemingly simple problem: I want to predict the next release date of an open-source project. I have the dates of all previous releases (it’s not a long history, just about 30 releases so far) and the count of parallel versions they have released and are going to release. I also know the deadlines (releasing once per quarter) that they try to meet. Also, I have found that they usually release new versions in the first part of the week (based on previous release statistics).

    So far, I have investigated the input data a bit, and I need to understand where to go with this and how to approach the problem.

    Can you please suggest a way forward?

    Thank you very much,

    • Avatar
      Jason Brownlee October 30, 2019 at 6:07 am #

      Sounds cool!

      Follow this process:

      One way to frame the problem might be as a time series classification task. That is, is a release expected in this interval. Or what is the probability of a release in this interval.

      I’m eager to hear how you go. Shout if you need help.
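
The interval framing above can be sketched as follows: turn the list of release dates into one row per week, with a 0/1 label for "was there a release in this week?" and a simple lag feature. This is only an illustration under assumptions: the weekly interval, the function names and the three release dates are invented for the example and are not taken from any real project.

```python
from datetime import date, timedelta


def weekly_labels(release_dates, start, end):
    """Return (week_start, label) rows; label is 1 if a release fell in that week."""
    rows = []
    week_start = start
    while week_start < end:
        week_end = week_start + timedelta(days=7)
        label = int(any(week_start <= released < week_end for released in release_dates))
        rows.append((week_start, label))
        week_start = week_end
    return rows


def weeks_since_last_release(release_dates, week_start):
    """Whole weeks between the most recent earlier release and week_start, or None."""
    previous = [released for released in release_dates if released < week_start]
    if not previous:
        return None
    return (week_start - max(previous)).days // 7


# Made-up release dates, roughly one per quarter.
releases = [date(2019, 1, 8), date(2019, 4, 9), date(2019, 7, 2)]

rows = weekly_labels(releases, date(2019, 1, 7), date(2019, 2, 4))
for week_start, label in rows:
    print(week_start, label, weeks_since_last_release(releases, week_start))
```

Each row could then be extended with features such as the week's position in the quarter, and a classifier that outputs probabilities could be fit on the rows. With only about 30 releases, a simple baseline (for example, the historical release frequency per week of the quarter) is worth comparing first.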

  51. Avatar
    Srinivas December 12, 2019 at 10:00 pm #

    Hi Jason,
    Firstly, I would like to thank you for your amazing works.
    Recently I got stuck on a problem where I would like to predict some groundwater resources using some climatic parameters as my predictors, so for this regression analysis I am using the Random Forest algorithm.
    I increased the number of trees, tuned the hyper-parameters and tried to fit the best model, but unfortunately none of this worked. There was no correlation between any of the variables.
    Can you help me? How can I overcome this problem?

  52. Avatar
    Elias January 9, 2020 at 9:07 am #

    Hello jason, Impressive thread here. I have enjoyed reading your blog, great work!

    I am pondering about a problem with data that looks like this:
    1: 23,aaa bbb ccc
    1: 20,bbb
    2, 20, ddddd qq zz
    3,10,tt jj

    document 2:
    1: 13, ccc
    1: 30 , aaa
    2: 10,zz ccc oo
    2: 10 qq
    3, 7 jj
    3, 3 jj

    As you might be able to tell, the sum of the numbers in the rows of document 1 with the prefix “1:”
    matches the sum for the rows with the same prefix in document 2 (23+20=13+30).
    In the same way, 20=10+10 for “2:” and 10=7+3 for “3:”.

    As a help, there are also some words on each row that can help connect the rows.

    The training data above can be generated, so there should be no lack of training data.
    The trick is to get a trained model that is able to do the matching from document 1 to document 2 without having the prefix (the part before the “:” that connects them); the prefix is just there to train the model.
    Can this be solved with machine learning? What would be your approach?

    • Avatar
      Jason Brownlee January 9, 2020 at 1:44 pm #


      Hmmm, the constraints make me think of an optimization type problem with constraint satisfaction. Not really a learning problem.

      Perhaps dig into operations research?
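
To illustrate the constraint-satisfaction view, here is a small brute-force sketch. It assumes, purely for the example, that one group of rows from document 1 is already known and that the goal is to find the rows of document 2 whose amounts add up to the same total, using word overlap to break ties. The function names are made up, and the enumeration is exponential, so this only suits small documents; a proper solver would be the next step.

```python
from itertools import combinations


def subsets_with_sum(amounts, target):
    """Yield tuples of indices into `amounts` whose values add up to `target`."""
    indices = range(len(amounts))
    for size in range(1, len(amounts) + 1):
        for combo in combinations(indices, size):
            if sum(amounts[i] for i in combo) == target:
                yield combo


def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(group, candidates):
    """Return (indices, score) of the candidate rows that best match the group."""
    target = sum(amount for amount, _ in group)
    group_words = {word for _, words in group for word in words.split()}
    amounts = [amount for amount, _ in candidates]
    best = None
    for combo in subsets_with_sum(amounts, target):
        combo_words = {word for i in combo for word in candidates[i][1].split()}
        score = jaccard(group_words, combo_words)
        # Strict comparison keeps the earliest (smallest) subset when scores tie.
        if best is None or score > best[1]:
            best = (combo, score)
    return best


# Rows from the comment above, with the prefixes removed.
doc1_group = [(23, "aaa bbb ccc"), (20, "bbb")]  # the rows that carried prefix "1:"
doc2_rows = [(13, "ccc"), (30, "aaa"), (10, "zz ccc oo"), (10, "qq"), (7, "jj"), (3, "jj")]

match = best_match(doc1_group, doc2_rows)
print(match)
```

On this data several subsets of document 2 add up to 43, so the sum constraint alone is ambiguous; the word overlap is what singles out the first two rows.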

      • Avatar
        Elias January 21, 2020 at 7:46 pm #

        Thanks, that’s one way to go. But I was not clear that there is a strong hint in the row occurrence and in the surrounding rows, and there are also repeatable patterns in the data that maybe could be learned as well. Also, the word matching might not always work/match.

        • Avatar
          Jason Brownlee January 22, 2020 at 6:20 am #

          Perhaps explore multiple framings of the problem and discover what works?

  53. Avatar
    Parimal Gaurkhede March 27, 2020 at 2:25 pm #

    Can you help me get accurate predictions with my dataset? I am trying, but I am not getting accurate results. Can you help me out?

  54. Avatar
    Deejay November 6, 2020 at 1:17 am #

    Hi Jason, I am a newbie to ML. I was about to start some serious hands-on work after going through the theory about the algorithms and other topics. I have started this process by selecting a few use cases from Kaggle, but when I start one and explore the data, I can’t seem to understand how to proceed with it. There are so many algorithms, and so much else that can be done for feature engineering and the like, that I am not able to pinpoint my next step or which algorithm I should continue with. Please help me with this. Thanks.

  55. Avatar
    Anna January 5, 2021 at 3:13 am #

    My model gives me 100% accuracy and I don’t know what to do!!!
    I know that this is totally wrong.

    I work on a region, and I have data about urban and non-urban points for 4 years. I want to predict the urban and non-urban points for the next year. I also use some data about distances and the altitude of every point.

  56. Avatar
    Zeba Shaikh April 13, 2021 at 5:12 am #

    Hello, this blog is what I needed. I really need help with an NLP problem. I have a folder with multiple text documents (train_docs) and a folder with their tag files (train_tags), written manually by a human annotator. I have also been given a test_docs folder with multiple text documents, and I have to automatically generate tags for these test docs.
    I’m having trouble combining these multiple text files into one CSV or Excel file so that I can use it for an ML model, and I don’t know which model I should use for this problem.
    Thanks in advance.

    • Avatar
      Jason Brownlee April 13, 2021 at 6:11 am #


      Perhaps try loading the data into memory / a data structure so that you can manipulate it and use it with modeling.
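
One way to load such folders into memory is sketched below. It assumes that each document and its tag file share the same file name and use a `.txt` extension, which may not match the actual data; the function names are made up for the example, and the folder names are the ones mentioned in the question.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Map each file's name (without extension) to its text, for all .txt files."""
    paths = sorted(Path(folder).glob("*.txt"))
    return {path.stem: path.read_text(encoding="utf-8").strip() for path in paths}


def pair_docs_with_tags(docs_dir, tags_dir):
    """Build one row per document, with an empty tag string if no tag file exists."""
    docs = load_folder(docs_dir)
    tags = load_folder(tags_dir)
    return [{"name": name, "text": text, "tags": tags.get(name, "")} for name, text in docs.items()]


def write_csv(rows, path):
    """Write the rows to a CSV file with name, text and tags columns."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "text", "tags"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Folder names taken from the question; adjust to the real locations.
    rows = pair_docs_with_tags("train_docs", "train_tags")
    write_csv(rows, "train.csv")
    print(len(rows), "documents written")
```

The list of dictionaries can be used directly for modeling, so writing the CSV is optional. The `csv` module quotes fields that contain commas or line breaks, so multi-line documents should survive the round trip.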

  57. Avatar
    jagriti May 25, 2021 at 9:59 pm #

    I am doing research on queuing theory problems. I want to use machine learning algorithms
    to predict waiting time or reduce waiting. Please suggest an application where wait time is predicted using ML.

  58. Avatar
    hani February 5, 2022 at 11:09 pm #

    Thank you, sir, for the tutorials on ML.
    How can I predict gender from the height, weight and waist circumference of an individual?
    What models can I use?
    Thank you.

  59. Avatar
    S.Nouhaila May 18, 2022 at 6:53 pm #

    Hello, I’m writing not knowing if I’ll still get an answer or not, but sure, why not try, right?

    I’m currently working on a project and I’m struggling with setting up the best scenario to tackle my problem.

    My problem is that I want to be able to predict the cost of rehabilitating a house (target = total amount).

    I have 20 features (like building surface and so on) to help figure out the price (numerical features).

    My dataset consists of 37 rows, each an example of a rehabilitation project on one of 37 different houses (I have the info on the total amount as well).

    So I chose to use supervised machine learning regression models.

    First I generated a correlation heatmap to see how my features correlate with my target variable (10 features have a high correlation coefficient, > 0.77).

    My struggle is: which algorithm can I use to predict my target variable?

    All I could think of is to use linear regression on the highly correlated features, but which model: SVM, Lasso Regression, Ridge Regression, Linear Regression (OLS), or Stochastic Gradient Descent?

    Or should I use decision tree models: Random Forest, Decision Tree, or Gradient Boosting?

    Would using Principal Component Analysis / Principal Component Regression / Partial Least Squares Regression help get better performance?

    To validate my model, I’m thinking of leave-one-out CV, as I have a small dataset.

    I’m getting lost in choosing the model that best fits my problem.
    I’m thinking you could help me choose several and then compare them on R-squared (r2) score?

    Getting an answer from you would be so helpful to me.
    Also, one last question: do you think I need feature selection for my problem?

    • Avatar
      James Carmichael May 19, 2022 at 6:28 am #

      Hi S.Nouhaila…Please narrow your query to a specific question so that we may better assist you.

  60. Avatar
    S.Nouhaila May 23, 2022 at 5:56 pm #

    My question is: can I use Random Forest for a small dataset (20 features, 1 label, and 37 rows of data), knowing that the variance inflation factor (VIF) of my features is inf and that half of my features are highly correlated with my label?

  61. Avatar
    Charan May 23, 2022 at 6:56 pm #

    Hi Jason,

    I am new to ML and my first project is to rank the ‘Modes of Payment’ (or Channels) used in the banking sector based on the usage distribution.
    Channels may include mobile transactions, ATM transactions, web/online banking, cheques, etc.

    My questions are:
    1) How and where do I need to start to find an appropriate algorithm?
    2) After finding one how do I make it adaptable to additional factors*?

    For example, I need to find the usage distribution of channels on weekdays, weekends, and national holidays, etc.

    So how can I predict or find the pattern of usage distribution, let’s say for the next week or month?

    *Additional factors could be, for example, finding the most used ATM location in case of ATM transactions.
