Last Updated on January 9, 2019

So, you’re working on a machine learning problem.

I want to really nail down where you’re at right now.

Let me make some guesses…

## 1) You Have a Problem

So you have a problem that you need to solve.

Maybe it’s your problem, an idea you have, a question, or something you want to address.

Or maybe it is a problem that was provided to you by someone else, such as a supervisor or boss.

This problem involves some historical data you have or can access. It also involves some predictions required from new or related data in the future.

Let’s dig deeper.

## 2) More on Your Problem

Let’s look at your problem in more detail.

You have historical data.

You have observations about something, like customers, voltages, prices, etc. collected over time.

You also have some outcome related to each observation, maybe a label like “*good*” or “*bad*” or maybe a quantity like *50.1*.

The problem you want to solve is, given new observations in the future, what is the most likely related outcome?

So far so good?

## 3) The Solution to Your Problem

You need a program. A piece of software.

You need a thing that will take observational data as input and give you the most likely outcome as output.

The outcomes provided by the program need to be right, or really close to right. The program needs to be skillful at providing good outcomes for observations.

With such a piece of software, you could run it multiple times for each observation you have.

You could integrate it into some other software, like an app or webpage, and make use of it.

Am I right?

## 4) Solve with Machine Learning

You want to solve this problem with machine learning or artificial intelligence, or something.

Someone told you to use machine learning or you just think it is the right tool for this job.

But, it’s confusing.

- How do you use machine learning on problems like this?
- Where do you start?
- What math do you need to know before solving this problem?

Does this describe you?

Or maybe you’ve started working on your problem, but you’re stuck.

- What data transforms should you use?
- What algorithm should you use?
- What algorithm configurations should you use?

Is this a better fit for where you’re at?

## I Am Here to Help

I am thinking about writing a step-by-step playbook that will walk you through the process of defining your problem, preparing your data, selecting algorithms, and ultimately developing a final model that you can use to make predictions for your problem.

But to make this playbook as useful as possible, I need to know where you are having trouble in this process.

Please, describe where you’re stuck in the comments below.

Share your story. Or even just a small piece.

I promise to read every single one, and even offer advice where possible.

### Update:

If you are struggling, I strongly recommend following this process when working through a predictive modeling problem:

Thank you for your blog. So many great posts here, and I have not gone through all your previous posts. Let me describe where I am right now:

1- I have an idea: To create a DL model that generates code.

2- More: Actually my model aims to generate some templates. Those templates need some additional data from the user before they can be rendered into complete code.

3- I have been reading about it, and I guess I need to use RNN (LSTM) models in order to generate code (or templates). My problems:

A- If I want my DL model to generate templates for a linear regression program, for example, my training data should be linear regression programs, right? How can I also input the performance of these programs so that it is considered as training data as well?

B- Most linear regression programs have a lot in common. So for example, how can I teach my DL model to be proficient in generating linear regression programs, without necessarily going through predicting the next character or word?

Just to summarize, I want to create a DL model that generates ML programs based on some user input 😀 What are the logical steps I can take to do so?

Interesting problem.

I have some examples of LSTMs learning to compute that might help:

https://machinelearningmastery.com/learn-add-numbers-seq2seq-recurrent-neural-networks/

This is a general process to work through for a new predictive model:

https://machinelearningmastery.com/start-here/#process

I’d recommend spending a lot of time on defining the problem (first step) and on building a large dataset of input/output examples, which is the data used to train the model.

Hate to spoil the daydreaming, but it is not possible to have DL write code for you:

“…, you could not train a deep-learning model to read a product description and generate the appropriate codebase. That’s just one example among many. In general, anything that requires reasoning—like programming or applying the scientific method—long-term planning, and algorithmic data manipulation is out of reach for deep-learning models, no matter how much data you throw at them. Even learning a sorting algorithm with a deep neural network is tremendously difficult.

This is because a deep-learning model is just a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data manifold X into another manifold Y, assuming the existence of a learnable continuous transform from X to Y. A deep-learning model can be interpreted as a kind of program; but, inversely, most programs can’t be expressed as deep-learning models—for most tasks, either there exists no corresponding deep-neural network that solves the task or, even if one exists, it may not be learnable: the corresponding geometric transform may be far too complex, or there may not be appropriate data available to learn it.”

– François Chollet – in the book “Deep learning with Keras”

(Keras creator)

Correction: the name of the book I quoted from is “Deep Learning with Python” by F. Chollet

https://www.manning.com/books/deep-learning-with-python

I’m hesitant to count anything out. I have seen some of the learning to compute and code generating LSTM papers and it’s impressive stuff.

Agreed, there is a long way to go.

François does have a pessimistic outlook in his writing and tweeting. It was not long ago that object identification and automatic image captioning were a pipe dream and now it is practically a hello world for beginners. Neural nets could “never learn a language model”, until LSTMs started working at scale.

For a kid who used to learn XOR with backprop back in the day (me), the progress is incredible.

Hello Jason, I strongly advise against over-hyping DL. In fact, phrases such as “deep learning” and “neural networks” are misleading for most of the public.

“… a deep-learning model is just a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data manifold X into another manifold Y, assuming the existence of a learnable continuous transform from X to Y….”

Hence, if we start raising expectations that our chain of simple, continuous geometric transformations will do magic and start writing program code on its own, we are likely to end up like the pet.coms of the 2001 internet bubble mania.

Machine learning is equipped with powerful tools, DL included, but DL is no magic box.

Thanks for the feedback.

I do agree with you. But I guess one of the worse (and inexact) tasks DL can do as it can be mapped with data to feed DL is:

Function point analysis, a method to raise the cost of an app or system.

As it is based on past projects, you can load previous projects’ features (word2vector from the code, screenshots, type of language, number of entities, their attributes and types and so on) in order to quantify numeric outcomes like the number of function points for future projects or for improvements to existing ones.

I have tons of data at my disposal and I want to find insights for the business but I am not sure how to find the questions in the first place.

Perhaps talk to the business and ask what types of insights would really interest them, what areas, what structure, etc.

Ideally, you want information that is actionable. E.g. where you can then devise an intervention.

I have over 1000 hours of audio and transcribed text for the said audio. I’m also embarking on studying TensorFlow.

My issue at present is how I should prepare this data to train a model (and which model should I use?). I don’t want to wait until I get a handle on TensorFlow and ML concepts before preparing my data appropriately for training, because I’m aware that data preparation can be 80% of the work in AI.

What are you trying to predict from the transcripts?

I would guess it would be to automatically transcribe speech to text:

https://en.wikipedia.org/wiki/Speech_recognition

I’m trying to transcribe speech to text automatically

That does sound like a great project.

I would recommend reading some papers on deep learning audio transcription (speech to text) type projects to see what type of representations are common.

The material I have on preparing text data might help for one half of the project:

https://machinelearningmastery.com/start-here/#nlp

Thanks a million. I would like to get in touch with you to keep you posted on my progress.

Please do.

Thanks Jason – you are the first educator that I have known in ML to define the problem facing users in order to develop a solution!! My main problem is that I have variables with high variance (sometimes maybe outliers, but they can’t be excluded for convenience). I struggle with finding the best model to extract feature importance.

Do you really need feature importance or just the outcome of knowing what features are important?

Perhaps focus instead on feature selection and explore wrapper methods to use model skill to choose the subset of features that results in the best models?

More here:

https://machinelearningmastery.com/an-introduction-to-feature-selection/
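
As a rough illustration of a wrapper method, here is a minimal sketch using recursive feature elimination with cross-validation (`RFECV`) from scikit-learn. The synthetic dataset, the logistic regression model and the accuracy metric are all assumptions chosen for the example, not a recommendation for your data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption): 10 features, only 3 informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# Wrapper method: repeatedly fit the model, drop the weakest feature,
# and keep the subset size with the best cross-validated accuracy.
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=5, scoring="accuracy")
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("Number of features kept:", selector.n_features_)
print("Selected feature indices:", selected)
```

The subset that comes out depends on the model and metric used inside the wrapper, so it is worth repeating the experiment with the model you actually plan to use.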

I have an x-API dataset that is related to educational data mining and I want to use association rule mining and clustering on this dataset using the WEKA tool. Previously I have worked on this dataset using classification to predict students’ academic performance. So this time, what new things can I do with it using the above-mentioned techniques?

Could you help me with this problem? I also want to know how training, testing and validation can be done in the WEKA tool.

Thank you.

Interesting problem.

I don’t have material on clustering, but I do have a ton on how to best use the Explorer and Experimenter interfaces for supervised learning. A good place to get started is right here:

https://machinelearningmastery.com/start-here/#weka

How do I choose a paper to solve a problem? And if using transfer learning, which existing architecture should I begin with?

Start with a strong description of your problem:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Then find papers on and related to the definition of your problem. There will be 1000 ways to solve a given problem, find methods that look skilful, simple and that you understand. Too much complexity in learning models is often a smell (e.g. code smell).

https://en.wikipedia.org/wiki/Code_smell

For transfer learning, again use a strong definition of your problem and project goals to guide you. Maybe using the most skilful model matters most, maybe it doesn’t. Or maybe you must test a suite of methods to see what works best for your specific problem.

I have a task to predict student retention by modelling student behavior from observable states (such as interactions in log data that cover accessing lectures, discussions, problems and so on) using a “hidden Markov model”. I have data and also some research papers related to my problem, but I have no idea how to implement an HMM for this type of task. Please can you refer me to any link related to this type of objective implemented with an HMM, or anything through which I can get an idea of how and where to start? Thanks

Perhaps google search and find some existing open source code that you can play around with to see if the approach will be viable. It will save days/weeks.

Also, I have worked on similar problems and found very simple stats and quantile models to be very effective. Basically, users that use the software more stay longer. Sometimes obvious works.

I also need templates or samples for a similar problem.

have any specific links? or info?

Sure:

https://machinelearningmastery.com/start-here/#process

many thanks

I’ll check it out

any more recommendations?

ohh.

I mean examples of using HMM.

have something?

Sorry, I don’t have examples of using an HMM.

Hello sir,

Sir, the problem statement I have chosen for my project is “NETWORK ANOMALY DETECTION OF DDOS ATTACKS”.

We plan to use deep learning techniques to solve the problem.

We plan to use a two-stage classifier in our model. In the first stage we classify using STACKED AUTOENCODERS and in the second stage using an RNN (LSTM).

The first-stage classifier will mark anomalous samples, which will be fed to the second-stage classifier. In the first stage most of the known attacks will be classified properly, whereas in the second stage the novel attacks will be classified. This is done to reduce false positives.

The dataset which we will be using is NSL KDD dataset.

The dataset contains 42 features.

And we will be using python as platform.

Sir, my questions are:

1) Is this model feasible?

2) Can you please help me with sample code that can detect DDoS attacks using a STACKED AUTOENCODER and an LSTM?

Sir, I’m finding it difficult to implement this model. I would be really grateful if you could help me.

Sounds fine. Opinions on feasibility don’t matter though. Only the skill of the model matters.

Sorry, I cannot write code for you. I hope to cover autoencoders in the future. Until then, perhaps you can find some open source code to use as a starting point?

Hi Jason,

I am a newbie in ML. I am in the process of preparing an approach to an ML problem. I came up with the following:

Filling missing values -> Scatter Plot -> Transformations(Log/Pow etc..) -> Normalization -> Train Model->Evaluate Model with the metrics.

I have a couple of questions. Your input is highly appreciated.

The first is about the process of choosing a transformation/normalization for any given data set. I looked for this on the internet, and most blogs suggest it is specific to the data set.

I would like to understand if there is a way to at least narrow it down to a few transformation/normalization algorithms for a given data set.

The other important understanding I lack is the statistical importance of any metric and how to choose the right metric over another for a given data set.

Thanks

Most methods that use a linear/non-linear sum of the inputs benefit from scaling, think neural nets and logistic regression. Also methods that use distance calculations, think SVM and KNN.

If unsure, try it and see, use model skill/hard results to guide you.
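
To make “try it and see” concrete, here is a minimal sketch that compares a distance-based model (KNN) with and without standardization. The synthetic data, the inflated feature scale and the choice of accuracy as the metric are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption), with one feature on a much larger scale.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
X[:, 0] *= 1000.0

# Same model, evaluated on raw inputs and on standardized inputs.
raw_model = KNeighborsClassifier()
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_acc = cross_val_score(raw_model, X, y, cv=5, scoring="accuracy").mean()
scaled_acc = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy").mean()

print(f"KNN without scaling: {raw_acc:.3f}")
print(f"KNN with scaling:    {scaled_acc:.3f}")
```

Putting the scaler inside the pipeline means it is fit only on the training folds, so the comparison is not flattered by information from the held-out data.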

Metric – for evaluating model skill? In that case think about what matters in a given model. How do you/stakeholders know it is good or not. Pick a metric or metrics that make answering this question crystal clear for everyone involved.

The article gave incredible insights on Machine learning and its importance in the present day. Loved the content. Thanks.

Thanks.

Hey Jason!

Thanks for sharing a marvelous article your content is amazing.

You’re welcome.

Hi Jason!

I will try to describe our problem as easy as I can:

We need to classify some economic registers. We have about 20 different categories. There are attributes whose type is easily interpretable: price is a continuous variable, product type is a discrete variable, iva type is an ordinal variable, …

Our first attempt consisted of finding the best binary classifier for each category. I’m not sure if it is the best approach, but with it we can check which kinds of algorithms work better.

Our main problem is how to manage some attributes such as the NIF (the NIF number is the tax code allowing you to have a fiscal presence in Spain). We believe that our dataset will grow and then that “discrete variable” will have a huge variety of values… And we think that this variable can be decisive in classifying a new register… How should we treat this variable?

The problem we see is that we need to encode this variable’s values because some machine learning algorithms only work with numbers. Using the label and count encoder strategy we are generating a lot of columns (one per NIF code) and this can cause the rest of the columns to be underweighted…

What do you think? Is there a machine learning algorithm that works better with this kind of variable?

Thanks a lot for your job, your blog is very useful for us!

Thanks for sharing, good problem!

I think you’re describing a situation where you have a categorical input (a factor) with a large number of categories (levels). E.g. a categorical variable with high and perhaps growing cardinality.

If so, some ideas off the cuff to handle this might be:

– Confirm that the variable adds value to the model, prove with experiments with/without it.

– Scope all possible values and one hot encode them.

– Integer encode labels, perhaps set an upper limit and normalize the integer (1/n)

– Group labels into higher order categories then integer or one hot encode.

– Analyse labels and perhaps pull out flags/binary indicators of properties of interest as new variables.

Get creative and brainstorm domain specific feature engineering like the last suggestion.
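
One of the ideas above, grouping labels before one hot encoding, could look something like the sketch below. The NIF values are made up, and the `group_rare` and `one_hot` helpers and the `min_count` threshold are hypothetical names chosen for this example.

```python
from collections import Counter


def group_rare(labels, min_count=2, other="OTHER"):
    """Replace labels seen fewer than min_count times with a shared bucket."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


def one_hot(labels):
    """One hot encode labels; returns the category order and the encoded rows."""
    categories = sorted(set(labels))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for lab in labels:
        row = [0] * len(categories)
        row[index[lab]] = 1
        rows.append(row)
    return categories, rows


# Made-up NIF codes (an assumption for illustration).
nifs = ["A111", "B222", "A111", "C333", "B222", "D444", "A111"]

grouped = group_rare(nifs, min_count=2)
categories, encoded = one_hot(grouped)

print(categories)
print(encoded)
```

In a real project the counts would be computed on the training data only, and any code not seen in training would fall into the same shared bucket, which keeps the number of columns fixed as the dataset grows.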

Does that help or did I misunderstand the question?

I stay in the metropolitan city of Bengaluru and want to regulate the water supply to different parts of the city using machine learning, i.e. classify different places by whether usage is heavy or light and distribute water accordingly from a central water supply. How do I frame this?

Great problem!

Try this framework for describing your problem, and framing it as a supervised learning problem:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Hi Jason,

Thanks so much for your blog! It’s been very helpful with getting my feet wet in applying machine learning.

I have data on many different parameters from health sensors (heart rate, skin temperature, breathing rate, air temperature, humidity, etc.) and want to try and predict the next reading of one of them (heart rate) based on the current readings of the others.

Any thoughts on how this could be accomplished?

I have this data for many different people, and eventually want to model how each individual responds to changes in their measurements. For example, fit people’s heart rates may not change as much with humidity.

Cheers!

Multiple inputs, one output, a type of multivariate time series forecasting.

I am going to write a book about this next, but I do have a tutorial on this here:

https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

LSTM might not be the best tech for this problem, but you can see how to prepare the data and approach the problem.

This post will help you reorganize your data so you can try any algorithms, such as those from sklearn:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
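
As a minimal sketch of that reorganization, the function below pairs the readings from the previous time step with the next value of the target column. The `to_supervised` name, the column order and the sensor values are assumptions made up for this example.

```python
def to_supervised(rows, target_index, n_lags=1):
    """Pair the readings from the previous n_lags steps with the next target value."""
    X, y = [], []
    for t in range(n_lags, len(rows)):
        # Flatten the previous n_lags rows into one input vector.
        window = [value for row in rows[t - n_lags:t] for value in row]
        X.append(window)
        y.append(rows[t][target_index])
    return X, y


# Made-up readings; columns are heart rate, skin temperature, humidity.
readings = [
    [60, 36.5, 40],
    [62, 36.6, 42],
    [65, 36.8, 45],
    [63, 36.7, 43],
]

X, y = to_supervised(readings, target_index=0, n_lags=1)
for inputs, target in zip(X, y):
    print(inputs, "->", target)
```

Each row of `X` can then be fed to any standard regression algorithm, with `y` as the value to predict.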

For general help on how time series work and how to make data stationary, etc., start here:

https://machinelearningmastery.com/start-here/#timeseries

Re the framing of the problem, you need to get creative and test all ideas.

– maybe model each user.

– maybe group users by some similarity and model the groups

– maybe develop a model that works for all users.

– maybe some combination of the above

There won’t be one best way; find what works best for your data, time and resources.

Does that give you some ideas?

Thanks so much! Your resources really help me organize a mental framework for everything!

Thanks, I’m glad to hear that John.

Hello Dr. Brownlee,

I am doing research into patterns that occur in financial data. In particular, trade data from the major exchanges.

To see what I am working on, please visit: http://blog.cypresspoint.com/.

The main problem I am trying to solve is modeling a dataset that is out-of-balance. Packages like SKLearn and H2O address this problem with an API argument like class_weight = ‘balanced’. This helps, but I feel that this is not enough.

It looks like Google is ignoring the unbalanced dataset problem. That’s understandable. Their business model is based on totally balanced datasets that contain text, images and audio data.

Any comment or suggestions will be greatly appreciated.

Charles Brauer

Try everything you can think of and double down on what works.

I have a list of ideas here that might help:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

In the literature, it may be called “class imbalance”, for example:

https://scholar.google.com/scholar?q=class+imbalance
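
For reference, here is a small sketch of what the `class_weight='balanced'` option mentioned in the question works out to in scikit-learn: each class is weighted by `n_samples / (n_classes * count)`. The 90/10 label split is an assumption made up for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels (an assumption): 90 of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn uses for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same values computed by hand: n_samples / (n_classes * class_count).
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The weights only rescale the penalty for errors on each class, so it is still worth testing them against resampling and other tactics on your own data.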

Does that help?

Thank you for the excellent reference. Very thorough. It will take me a year to work through all the algorithms you describe. Although, I must say, the resampling tactic seems like cheating. I am currently spending 80% of my time collecting, cleaning, and building data sets. And only 20% building and analyzing models. I feel like Sisyphus.

Yes. The modeling is the easy part. Having something good to model (data) is where most of the work is.

Hello Jason, I am working on text classification research. For that we first need to extract features, as you know. I am confused about whether to select machine learning or deep learning for my research. How do I select one, and why? Thanks

Perhaps take a quick survey of the literature for similar problems and see what is common.

Perhaps start with a technique that you are familiar with, then expand from there.

Perhaps find a tutorial on a similar problem and adapt it for your needs.

There will not be a single best approach, try a suite of methods to see what works and incrementally improve your code and understanding until you are happy with the skill of the system. It may take a while.

I lack consistency in keeping up the work of modeling using ANNs. The results are not good enough and I lose interest. I don’t have the skills to code on my own, but take code from here and there and customize it for my work. If I have to change some parameter in it to improve results, I get seriously overwhelmed by the amount of material I have to search through to finally get what I need. I am short of time and getting frustrated about why I took up this challenge (project) in the first place. I use RStudio for modeling and have large amounts of data to deal with (daily climate data of 40-50 years). Please suggest, Jason, what I can do to speed up my learning process.

Use a checklist. It helps me and I’ve been at this for some time.

Here’s a good start:

https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

Here’s one for lifting skill of a deep learning model:

https://machinelearningmastery.com/improve-deep-learning-performance/

Does that help or give you some ideas?

Hi Jason,

thank you for all your posts about the different problems in ML and DL. They are always very detailed and therefore very helpful.

I have to solve a multi-label classification problem with blogposts. For me as a student in Digital Humanities it is very difficult to understand all the different parameters and statistics. I am using doc2vec to get the vector representation of the text as input to the keras model. I tried to find out which model is the best for my problem and came to the solution, that a LSTM should fit.

But I still have many questions:

– How can I get something like accuracy for the multi-label multi-class prediction? How can I evaluate the multi-label model?

– How many and which kind of layers would be good? Several LSTM cells in a row?

– Does it make sense to use an autoencoder between doc2vec and the LSTM to improve my accuracy?

– How big is the impact of the doc2vec parameters on the LSTM output?

– How can I find the best combination of all these different parameters?

I think it is even harder for a newbie to solve a multi-label classification problem with text instead of a multi-class classification problem with images because there are a lot more useful papers, tutorials and examples about that.

Thanks.

Great comment!

Yes, multi-label classification is under served. For example, I have nothing on it and I should. I hope to change that in the future.

Every project will have lots of questions, lots of unknowns. You must get comfortable with the idea that there are no great truths out there, that all “results” and “findings” are provisional until you learn or discover something to overturn them. That is the scientific mindset required in applied machine learning.

I would recommend writing out each question as you have done. Tackle each in turn. Survey literature on google scholar, look at code and related projects, ask experts, get provisional answers to each question and move on. Circle back as needed. Iterate, but always continue to progress the project forward.

Many questions we have, like how many layers or how many neurons or which algorithm is best for your problem have no answer. No expert in the world can tell you. They might have ideas of things to try, but the answer is unknown and must be discovered through experimentation on your specific dataset.

I do see this a lot and I think the remedy is a change in mindset:

– From: “I am working on a closed problem with a single solution that I just don’t know yet”

– To: “I am working on an open problem with no best answer and many good enough answers and must use experiments and empirical results to discover what works”.

Does that help?

I write about this here:

https://machinelearningmastery.com/applied-machine-learning-is-hard/

And here:

https://machinelearningmastery.com/applied-machine-learning-as-a-search-problem/

Hi Jason,

Thanks for your blog! I have data of all previous shipping vessel positions in the world. I am looking at a specific market, and I am trying to forecast the freight price based on features extracted (variables created) from the data (AIS shipping data).

To this point I have mainly focused on data exploration and feature extraction. Looking at counting number of vessels in specific areas over time, distances to loading port etc. So I have lots of variables that can be used. What do you recommend to do in finding the “right” and “right number” variables to use in my model? I have read that random forests might do that.

I’ve done a lot of reading, and think that RNN (LSTM or MLP) might be the way to go and use diagnostics and grid search to find epochs, nr of neurons etc. I’ve also read that other types of neural networks might be used? And that this kind of problem also might be solved by a Support vector regressor, or use the SVR as a fitting technique as in the multivariate case, when the number of variables is high and data become complex, traditional fitting techniques, such as ordinary least squares (OLS), lose efficiency. Lastly, I’ve been told that multivariate adaptive regression splines (MARS) can give some results in forecasting.

So to summarise, I want to do a multivariate forecast for the price data based on several variables found from global vessel positions, using LSTM, MLP, SVR, MARS or other ANN algorithms.

What do you recommend/what are your thoughts?

One last question, do you have any good resources on stream learning? To my understanding, it might be inefficient to re-train the model every time a new observation is made.

Thank you!

Hi Patrick.

Spend a lot of time on feature selection. Choose features based on their impact on model skill. Try as many combinations as you have time to test. Spend as long feature engineering as you can afford as it usually offers more impact on the project than the choice and tuning of models. I have a ton on the topic, but you can start here:

https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

https://machinelearningmastery.com/an-introduction-to-feature-selection/

RNNs are for sequence prediction, be sure your problem is a sequence prediction problem first:

https://machinelearningmastery.com/sequence-prediction/

If it is a sequence prediction problem, still use an MLP as a baseline for comparison. LSTMs are hard to dial in, and you might get good enough results with the (easier to train/understand) MLPs. In fact, always try a suite of methods and double down on what works, rather than start with an idea of the “right model” for the problem. I call this spot checking:

https://machinelearningmastery.com/why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems/
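
A spot check can be as small as the sketch below, which scores a handful of regression algorithms with the same test harness. The synthetic data, the four models and the mean absolute error metric are assumptions for illustration; `TimeSeriesSplit` is used so that each test fold comes after its training data, as it would with weekly observations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption); swap in your own features and target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=3)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=1),
}

# Every model is evaluated on the same ordered train/test splits.
cv = TimeSeriesSplit(n_splits=5)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()

# Report from lowest to highest error.
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: MAE={mae:.2f}")
```

The ranking on your own data is what matters; the models near the top are the ones to double down on with tuning.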

If your data has a quality of time series, then get versed in time series methods for data prep. e.g. making the series stationary. This will help regardless of the type of model. I have a ton on intro time series material here:

https://machinelearningmastery.com/start-here/#timeseries
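
As one small example of that kind of data prep, first-order differencing removes a steady trend by modeling the change between observations rather than the level. The price values and the helper names below are made up for this sketch.

```python
def difference(series, interval=1):
    """Return the change between each value and the one `interval` steps earlier."""
    return [series[i] - series[i - interval] for i in range(interval, len(series))]


def invert_difference(first_value, diffs):
    """Rebuild the original series from its first value and first-order differences."""
    values = [first_value]
    for d in diffs:
        values.append(values[-1] + d)
    return values


# Made-up weekly prices with an upward trend (an assumption).
prices = [100, 104, 107, 111, 114]

diffs = difference(prices)
restored = invert_difference(prices[0], diffs)

print("differenced:", diffs)
print("restored:   ", restored)
```

A model trained on the differenced values predicts changes, so its forecasts need to be inverted back to the original scale before you measure error in price units.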

Does that give you some ideas?

Thank you for the detailed answer!

As these are classic weekly time series of price and feature observations, these are sequences. I will try out LSTM and MLP, thanks for the baseline input.

1. Do you recommend doing a random forest analysis to estimate the importance of the variables/features?

2. Do you recommend looking at feed-forward neural networks also? With respect to this, do you have anything on hardcoding of temporal dependence for classical neural networks?

Thank you for this blog, helps a lot!

Try everything and use the results as evidence on what works best for your specific data.

Currently I’m working on multiclass classification problem with RF.

My biggest challenge for this particular problem is heavily imbalanced classes : I have one class that contains only one sample. I cannot ignore this class, I cannot collect more samples for this class and I don’t know how to generate more samples for this class.

If smbd was facing the same problem please help 🙂

Imbalance is a common problem. I sometimes toy with the idea of writing a whole mini-book on the topic.

Try a suite of approaches and see what helps best for your problem. I list common approaches for working with class imbalance here:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
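
One of the simplest tactics on that list is random oversampling: duplicating examples from the smaller classes until the classes are the same size. The sketch below uses made-up data and a hypothetical `random_oversample` helper. With a single example in a class, the copies add no new information, so treat it as one thing to test rather than a fix.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority examples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Made-up training data (an assumption): class "c" has only one example.
X = [[i, i * 2] for i in range(9)]
y = ["a"] * 5 + ["b"] * 3 + ["c"]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversample only the training split, after the test data has been set aside; otherwise copies of the same row end up on both sides of the split and inflate the score.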

Does that give you some ideas?

Hi Jason, congratulations on your posts and on your availability to help people with ML problems.

We are working on an ML project that uses health features to predict infant mortality in a specific region.

We are using regression modeling on this project, and the difficulty we are dealing with now is that none of the regression models we tested (linear, general linear model and SVM) provided good statistical measures, such as a low p-value, and the residuals are not normal.

The features in the dataset have outliers that cannot be removed, and none of the predictor features has a linear relationship with the target feature.

The predictors show low values in the correlation matrix, which is good. The features in the dataset were selected by people experienced in the domain.

We also tested some standardization of the data, such as normalization, z-score and log, but it did not solve the problem.

Do you have any guess as to where we are failing? Could you suggest some ideas for us, please?

Thanks, and sorry about any errors in my English; I’m from Brasil and have not mastered the English language.

It could be a million things, for example:

– Perhaps you need more data.

– Perhaps you need to relax your ideas about removing outliers (inputs won’t be gaussian invalidating most linear methods)

– Perhaps you need to try a suite of nonlinear methods

– Perhaps you need to use an alternate error metric to evaluate model skill.

– …

No one can give you the silver bullet answer, you are going to have to work to figure it out. Gather evidence to support your findings. Be prepared to change your mind on everything in order to get the most out of the problem.

I list a ton of ideas to think about here:

https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

I have a problem that I need help or guidance with. The problem is finding similar questions given a new question. I have around 1500 unlabeled questions to start with. What is the best approach?

Good question.

There are many excellent classical text similarity metrics. Perhaps experiment with them? I would recommend starting by surveying the field to see what your options are.
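As an illustration of one such classical metric, the sketch below ranks stored questions by TF-IDF cosine similarity to a new question. The example questions and the `most_similar` helper are hypothetical stand-ins, and TF-IDF is only one of many options worth trying.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the ~1500 unlabeled questions.
questions = [
    "How do I reset my password?",
    "What is the best way to learn Python?",
    "How can I change my account password?",
    "Where can I download the user manual?",
]

# Fit the vocabulary and IDF weights on the stored questions.
vectorizer = TfidfVectorizer(stop_words="english")
question_vectors = vectorizer.fit_transform(questions)


def most_similar(new_question, top_k=2):
    """Return the top_k stored questions ranked by cosine similarity."""
    new_vector = vectorizer.transform([new_question])
    scores = cosine_similarity(new_vector, question_vectors).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(questions[i], float(scores[i])) for i in ranked]


results = most_similar("I forgot my password, how do I reset it?")
for text, score in results:
    print(f"{score:.2f}  {text}")
```

No labels are needed for this framing, which suits a collection of unlabeled questions.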

How can I use information gain in text classification? What will the output look like? For example, if we have X_train.shape=(10,50), what is the output shape for information gain? And can we use it with classifiers like NB or SVM?

Information gain is a measure of “information”. It is not a predictive modeling algorithm.

See more here:

https://en.wikipedia.org/wiki/Information_gain_ratio
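To make the shapes concrete, here is a small sketch that assumes information gain is computed as the mutual information between each feature and the class label, using scikit-learn's `mutual_info_classif`. The term counts and labels are random, made-up data. For an input of shape (10, 50), the result is one score per feature, so an array of shape (50,). The scores can then drive feature selection ahead of a classifier such as naive Bayes.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

# Made-up data: 10 documents, 50 term-count features, binary labels.
rng = np.random.default_rng(1)
X_train = rng.integers(0, 3, size=(10, 50))
y_train = np.array([0, 1] * 5)

# One information-gain (mutual information) score per feature.
scores = mutual_info_classif(
    X_train, y_train, discrete_features=True, random_state=0
)
print(scores.shape)

# Keep the 10 highest-scoring features, then fit a classifier on them.
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=score_func, k=10)
X_selected = selector.fit_transform(X_train, y_train)

classifier = MultinomialNB().fit(X_selected, y_train)
print(X_selected.shape)
```

The same selected features could be passed to an SVM instead; the selection step does not depend on the classifier.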

Hello,

I’m just starting to learn machine learning. So I don’t have much idea about this field. I don’t even know if the problem I’m going to take up is even a Machine learning problem, asking it here anyway.

There is an exam that is conducted once every year, and the rank list (result) is publicly available. Suppose I have the rank lists of the last 5 years (that is, 5 rank lists, each with the rank of the aspirants and the score they gained). Is it possible for me to predict the rank of a new user based on the score of a mock test he attended?

Is this even a machine learning problem?

Yes, it sounds like a rating prediction problem. Perhaps have a read about rating systems, like those used in chess that can be used to predict the outcomes of chess games from ratings alone:

https://en.wikipedia.org/wiki/Elo_rating_system

Hello Jason, what is the best algorithm to use for preventive maintenance? For example, you want to do predictive maintenance early enough, before the machine breaks down. Or you want to see that the machine has worked sufficiently before changing parts or oil.

Please advise, thank you.

Great question, perhaps look into survival analysis methods?

My actual issue: how do I recognize as quickly as possible what type of model I should start to use and tinker with? After that, the main problem becomes understanding the model, because of the notation with which it is described.

Nonetheless, these can be learned, and I am happy to do so in the long term. A big constraint for that, however, is the lack of accessible, worked-out and solid examples providing practice opportunities. There are some really good ones out there, but it always takes lots of time and trial and error to find them.

I recommend being systematic and trying a suite of methods and let the results guide you as to what methods are best for a given dataset.

I write more about this here:

https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use
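A rough sketch of what "trying a suite of methods" can look like in practice is given below. The dataset is synthetic (from `make_classification`) and the four algorithms are an arbitrary illustrative choice, not a recommended shortlist; the point is only the pattern of evaluating every candidate with the same test harness.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your own dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=7)

# A small, arbitrary suite mixing linear and nonlinear methods.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=7),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=7),
}

# Evaluate every model with the same 5-fold cross-validation.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()

# Report from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The methods that come out on top are the ones worth the effort of understanding in depth and tuning further.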

Hello Jason

I want to do sentiment analysis on a dataset of movie reviews, but I don’t know where to start, so could you please help me?

Start here:

https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

Excellent article,

Thanks for sharing this good information on machine learning…. 🙂

Thanks, I’m glad it helps.

Thank you for your help and this helpful website. I sent an email to you.

Thanks

Thanks.

Hi Jason,

We are building a logger program to track user interface with a core business system. The data set will have the system name, object type, description, order step and a process tag.

What we would like to predict for example is,

– In a new logger instance, which is the process tag?

– Which is the next step?

Thanks,

German.

This framework will help you describe your problem:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

This process will help you work through your problem:

https://machinelearningmastery.com/start-here/#process

THANK YOU JASON!

I am amazed I found your blog, where I can actually ask the question I have had for so long but did not know where to begin or who to ask!

Anyway, I am a service engineer and I am trying to figure out the feasibility of a troubleshooting system to assist me (and other engineers) in solving tool failures (of a mass spectrometer instrument). Over the last 12 years we have accumulated 20K+ failure cases in the form of a problem/solution database (filled in by our engineers in the field). I would like to use machine learning (maybe deep learning) to help our engineers find the most likely case they can apply. For now they use the regular search function in the database software we are using. Of course, this means that they have to use the correct words, since the search works on keywords. It would be great if the AI system could find the solution to a problem stated in a sentence rather than in keywords. Of course, if the system actually learns the domain, it is even better. In the past, something like case-based reasoning would have been a solution.

My issue is how to approach this learning problem given that it is a text-based domain (rather than just some numbers). I can output the cases (problem type, problem and solution) into an Excel file if needed. The goal is to state the new problem and then have the AI show all the cases (solution part) that could apply to the new problem. That way the engineers do not have to reinvent the wheel. Of course, the AI may or may not find the correct answer, but it could at least point the engineers to solutions that they can check.

I hope I am making sense; if not, please do not hesitate to ask me questions. To help, I pasted below what a typical case looks like from our 20K+ database.

Thank you again for allowing us to pick your brain 🙂

Mo

CASE#1

TYPE: MAGNET ISSUES

PROBLEM:

The Hall probe is stuck at 7000. Reseting the chassis and the real time did not work.

SOLUTION:

Found a dead power transistor along with bunch of dead 0.1 power resistors. One transistor support was also damaged. Replaced those parts and the magnet worked correctly. Working with Cs2 (hall probe around 8000) is very hard on the magnet and that we can expect more failure if using that mode too long.

CASE#2

TYPE: VACUUM ISSUES

PROBLEM:

The projection turbo pump failed again. The power went up and the vacuum degraded in that area.

SOLUTION:

Was going to replace the pump again but noticed that backing valve of the pump was closed (?) but the synoptic show the valve open. Closed/open the valve from the synoptic and the vacuum got better and pump power went down to normal. I also noticed that the compressed air was set to the minimum value. We increased it and so far it is work fine. Customer will return the pump back to Madison (never used)

CASE#3

TYPE: EGUN ISSUES

PROBLEM:

The egun current is lower than normal (6-7 uA instead of 30-70 uA) I also found that emission current is very low (0.5 mA at 3000 bits filament)

SOLUTION:

Replaced the egun filament. At first I adjusted the filament height by unscrewing the part 3/5 turn (3 notches back) But then I saw that the manual now said to unscrew 1/4 to 1/3 turns so I open and I did. Still I could not get any normal emission current (got 0.5 mA at filament set to 3000 bits instead of 2 mA) Finally I found out that the resistors R1 and R2 had bad solders connection. Now we can get 2 mA with around 3000 bits.

Generally, this process will help you define and work through a new predictive modeling problem:

https://machinelearningmastery.com/start-here/#process

Your question is very specific and sounds like perhaps the fine tuning of a search system rather than a question on how to develop a predictive model.

Thank you so much Jason! I will look into the link. So you think it will be more like a search problem rather than a machine learning issue per se? I was hoping that system X would learn about the domain (from the 20K cases) and then come up with a solution to a novel problem description.

In any event, thank you for an amazing blog and for taking the time to help us learn how to solve real problems with AI.

Mo

Try many different framings of the problem. You know more about it than I do. Try it as supervised learning, try it as search, and go with what works best.

I want to predict the min, max and modal price of agricultural commodities for the next 30 days.

Can you help me solve this problem? Yes, I already have a dataset. I don’t know how to select the best algorithm.

Thank you in advance

Try this process:

https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/

Hello,

First, thanks for your awesome website: so many great articles inside!

My problem: I have technical issues (expressed in common English / natural words like “my application is not working and the error message is ‘cannot access blabla.com'”) and I would like to make a bot to automatically answer them with the most probable reason (“did you try to open the firewall? Here is the link to the how to”). Where should I start? How should I train? Any information / guidance will be super appreciated.

Thanks a lot,

Johann

Thanks.

Perhaps there is a natural relationship between problems and things to try. One approach would be to structure it like a recommender system: people who ask questions like this find it helpful to get suggestions like that.

If you don’t have such data, you may have to collect it using an existing dumber system with added randomness.

If you do have such data, you can use it to seed the system.

This won’t be a new problem. I’d encourage you to browse the literature on scholar.google.com to get an idea of how others have approached it. Perhaps try a few different framings and see what makes sense based on the time and resources you have.

hey Jason,

I am new to machine learning and I want to start by solving a problem that will boost my confidence in this field. Please help me out by suggesting some beginner-level problems!

Great, start right here:

https://machinelearningmastery.com/how-to-run-your-first-classifier-in-weka/

Hi Jason

Great article! I really liked it,

Thanks for sharing such a good informative article insights… 🙂

I’m happy it helped.

Hi!

My problem is this:

I got loads of data.

I got the odds on a betting exchange for horse races during the race, for about 2 years.

That is, for every race during these two years I got data like

r1 - rn are the runners (horses)

t1 r1 r2 r3 …

0001 5.25 2.04 3.25

0002 5.10 2.50 2.75

…

3254 55 1.01 520

here – r2 won

The sample frequency is about 5 times per second.

I also know if the runner won/placed/lost

My goal is to be able to say that runner X will win/place after ca. 50-75% of the expected race time with, say, 80% accuracy.

My problem is that I don’t know how to model this situation. I’ve seen tournament strategies, i.e. who out of two runners will win, but here there’s more data, both in time and in participants.

What model should I pay attention to?

I’m not familiar with horse racing.

Perhaps this framework will help you frame the problem:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Also, maybe look into rating systems as a signal of “skill” for each participant.

Well, I am trying to develop my PhD problem statement. What I actually want to do is this:

The existing state-of-the-art classifiers show low accuracy on imbalanced multi-label datasets.

There is a need to design/develop a novel/intelligent classifier to improve the accuracy on imbalanced multi-label datasets.

How can you help me mature the problem statement, advising whether to improve an existing classifier or start designing a new one?

I recommend talking to your research supervisor.

Hi,

I just started to learn ANNs and want to start by solving a problem. Let’s suppose I have an image with many symbols in it. The image is not clear, as the signs are old and rough, but it gives a clear idea of what each actual sign is. Now I want to extract those signs from the image, find them in a database, and create a new image from the matching signs in the database.

The new HD image will contain the signs in the same locations.

I don’t know where to start or what tools to use for this. Thank you.

Start by building up a dataset of images and their associated clean symbols. Thousands of examples.

Hi,

I have historic data on money spent on advertisements by various industries with a TV channel. Now I want to know if there will be a budget shift in the future; to be more specific, if and when there will be a budget shift from one industry to another.

I don’t know where to start. Is it a classification problem, a regression problem, or a combination of both, given that I have to predict the shift in spending from one industry to another?

I would be glad if you could point me in the right direction. Thank you.

Hmm, perhaps collect the data that you have available and try to work through this framework to nail down the problem type:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Hello Jason! I am currently working on a system that detects whether a frame in a video contains a fire or not, using machine learning. I have no idea yet how to start. I bet it is a classification problem. Can you suggest the best algorithm to use for this problem? Your advice would be highly appreciated. Thank you!

Sounds like time series classification, a CNN-LSTM might be a good starting point.

Hello Jason! I recently came across the problem to classify, find patterns and analyze the news articles about HIV over the past 10 years. What would be the best approach and which would be the best algorithm to use?

Great question, I recommend following this process:

https://machinelearningmastery.com/start-here/#process

Perhaps this will help:

https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use

Hi Jason,

I’m currently working on a web scraping project for a finance professor at my university. It involves me manually navigating a company’s website and recording whether or not it contains instances/phrases that indicate certain principles of business; I would then assign the principle a value of 1, 0, or -1 based on whether it was included positively, not included at all, or included negatively, respectively (for example, if a company expressed that they worked individually, the principle collaboration would be assigned a value of -1). I have access to a large database (hundreds, maybe thousands) of previous analyses of websites through this project, and I was wondering if I would be able to run through the examples provided for each principle on each site and develop a program that would be able to recognize a principle in a given text and assign it a numerical value. Thanks.

Hmmm.

The presence of a phrase is binary and no ML is needed.

If you want to handle synonyms/etc. you could use a predictive model, but you’re going to need thousands of existing examples of webpages and their scoring.

It might be possible to fit a CNN/LSTM to output a scoring. Try it and see.

I am new to ML and I am going to use ML for traffic accident analysis, such as making predictions and discovering hidden patterns to understand the major causes of accidents. What method would you guide me toward, and what other ML problems can be solved using a traffic accident dataset?

Thanks in advance.

Perhaps use CNNs on the image data.

Start with a strong definition of your problem:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Hello Jason,

First of all, thank you for your posts, they are really helpful and accessible!

Here are some of the problems that I’m trying to solve:

1. Classifying identity cards and other documents issued by the government

2. Extracting information from those identity cards

3. Extracting information from table found on some of the documents

To solve problem 1, I’m using LeNet. Currently I’m classifying 5 types of documents (+1 class where it doesn’t detect any).

To solve problem 2, I’m planning to use YoloV2 and preparing a custom dataset, annotating the fields manually.

To solve problem 3, I’m not using conv net but rather using pure image processing approach, by creating a mask from the convex hull, and extracting each cell individually.

I’d very much like to hear your thoughts about this. Thanks in advance!

I like your approach Tommy, it is close to the approach that I would start with.

Preparing enough training data is going to be the hard part. Model performance will hinge on how much data you can gather.

Let me know how you go.

Hi Jason,

Thank you so much for taking the time to read this (if you do end up reading it).

I’m a front end engineer at women’s designer shoe startup and we want to try to accurately predict a new customer’s shoe size based on a series of questions. We are currently gathering data on a customer’s foot attributes, the size they wear in other brands, and the size they wear in our brand. Is this something that can be predicted with machine learning and if so, where do I start? How much data do I need? Thank you so much!

— Sarah

Hi Sarah, sounds like a wonderful problem.

I’d recommend two starting points.

1. Define the problem clearly and spend a long time brainstorming data that could be predictive of shoe size. Talking to experts may be helpful. Also, this framework might help:

https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

2. Start gathering data; you will need lots of data. Maybe use friends and family as a start and just look at the data in Excel. Human brains are very good at finding short-cut patterns and these might save you a ton of time.

I hope that helps. I’d love to hear how you go!

Thank you so much for responding. Will start with your article. Thank you!

You’re welcome.

My problem is this: I want to develop a multivariate regression model using gradient descent, then use the model to make predictions, and thereafter plot the predictions against the times the inputs were recorded.

I would also like to save the trained model in a way that lets me use it for predictions in the future.

I have many tutorials that you can follow to achieve this, perhaps start here:

https://machinelearningmastery.com/start-here/#deep_learning_time_series
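For a sense of the overall workflow, here is a minimal sketch that fits a linear regression with stochastic gradient descent, saves the fitted model, and reloads it for prediction. The data is synthetic, the file name `sgd_model.pkl` is a made-up placeholder, and scikit-learn's `SGDRegressor` with `pickle` is just one reasonable combination, not the only way to do it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 input variables and a noisy linear target.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Gradient descent is sensitive to feature scale, so standardize first.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=3),
)
model.fit(X, y)

# Save the whole pipeline (scaler + regressor) to a placeholder file path.
model_path = os.path.join(tempfile.gettempdir(), "sgd_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later: load the model and predict on new inputs.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)
predictions = loaded.predict(X)
```

If each input row has a recorded time, the predictions can then be plotted against those times, for example with matplotlib's `plt.plot(times, predictions)`, where `times` would be your own array of timestamps.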

Hi Jason, I am a computer science student working on my final-year project (I will compare 3 methods of dynamic selection: KNORAE, KNORAU and META-DES) and I need your help, please:

– What is the best dataset format to use for this problem, or for machine learning in general: .csv or .txt?

– Which ensemble learning method should I choose to generate the models: Bagging, Boosting, or RandomSubspace?

– Which are the best classifiers to use to compare these three methods?

Thank you in advance.

What are those methods, I’ve never heard of them before?

Perhaps test a few datasets and see what works?

Perhaps try a few different ensemble methods and see what works?

Perhaps try a few different classifiers and see what works?

The key here, is that there are no pre-defined answers, you must discover the answers via prototypes and experiments.

Thank you very much, Jason, for your attention to my message, and thanks again for your help. Maybe I didn’t make my questions specific enough, but your blog and your articles are so helpful.

Hi Jason! Great to see that there are people like you in DataScience world that are ready to help and share 🙌

I’m trying to get into Data Science and ML topics and realized that the best way to do it is to solve a problem I would like to solve.

Now I’m trying to deal with this seemingly simple problem: I want to predict the next release date of an open-source project. I have the dates of all previous releases (it’s not a long history, only about 30 releases so far) and the count of parallel versions they have released and are going to release. I also know the deadlines (releasing once per quarter) that they try to meet. Also, I have found that they usually release new versions in the first part of the week (based on previous release statistics).

So far, I have investigated the input data a bit, and I need to work out where to go with this and how to approach the problem.

Can you please suggest a way forward?

Thank you very much,

Roman

Sounds cool!

Follow this process:

https://machinelearningmastery.com/start-here/#process

One way to frame the problem might be as a time series classification task. That is, is a release expected in this interval. Or what is the probability of a release in this interval.
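A small sketch of that framing follows, using weekly intervals. The release dates are invented for illustration and `weekly_labels` is a hypothetical helper; each row pairs a week with the number of weeks since the previous release and a 0/1 label for whether a release happened in that week.

```python
from datetime import date, timedelta

# Invented release dates, for illustration only.
release_dates = [
    date(2018, 1, 9),
    date(2018, 4, 10),
    date(2018, 7, 3),
    date(2018, 10, 2),
]


def weekly_labels(releases, start, end):
    """Build (week_start, weeks_since_last_release, label) rows.

    Weeks start on Monday; label is 1 if a release fell in that week.
    weeks_since_last_release is None until the first release is seen.
    """
    release_weeks = {d - timedelta(days=d.weekday()) for d in releases}
    week = start - timedelta(days=start.weekday())
    weeks_since = None
    rows = []
    while week <= end:
        label = 1 if week in release_weeks else 0
        rows.append((week, weeks_since, label))
        if label:
            weeks_since = 1
        elif weeks_since is not None:
            weeks_since += 1
        week += timedelta(days=7)
    return rows


rows = weekly_labels(release_dates, date(2018, 1, 1), date(2018, 12, 31))
for week_start, weeks_since, label in rows[:4]:
    print(week_start, weeks_since, label)
```

Rows like these can be fed to any classifier that outputs probabilities, with further inputs such as the quarter deadline added as extra columns.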

I’m eager to hear how you go. Shout if you need help.

Hi Jason,

Firstly, I would like to thank you for your amazing works.

Recently I got stuck on a problem where I would like to predict some groundwater resources using some climatic parameters as my predictors. For this regression analysis I am using the Random Forest algorithm.

I increased the number of trees, tuned the hyperparameters and tried to fit the best model, but unfortunately none of it worked. There was no correlation between any of the variables.

Can you help me? How can I overcome this problem?

Perhaps try some of the suggestions here:

https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

Hello jason, Impressive thread here. I have enjoyed reading your blog, great work!

I am pondering about a problem with data that looks like this:

document1:

1: 23,aaa bbb ccc

1: 20,bbb

2, 20, ddddd qq zz

3,10,tt jj

document 2:

1: 13, ccc

1: 30 , aaa

2: 10,zz ccc oo

2: 10 qq

3, 7 jj

3, 3 jj

As you might be able to tell, the sum of the numbers in the rows of document 1 with the prefix “1:” matches the sum for the rows with the same prefix in document 2 (23+20=13+30).

In the same way, 20=10+10 for “2:” and 10=7+3 for “3:”.

To help, there are also some words on each row that can help connect the rows.

The training data above can be generated, so there should be no lack of training data.

The trick is to get a trained model that is able to do the matching from document 1 to document 2 without having the prefix (the part before the “:” that connects them); the prefix is just there to train the model.

Can this be solved with machine learning ? What would be your approach ?

Thanks!

Hmmm, the constraints make me think of an optimization type problem with constraint satisfaction. Not really a learning problem.

Perhaps dig into operations research?

Thanks, that’s one way to go. But I was not clear that there is a strong hint in the row occurrence and the surrounding rows, and there are also repeatable patterns in the data that could maybe be learned as well. Also, the word matching might not always work.

Perhaps explore multiple framings of the problem and discover what works?

Can you help with getting accurate predictions on my dataset? I am trying but not getting accurate results. Can you help me out?

Yes, follow these steps:

https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

Hi Jason, I am a newbie to ML. I was about to start some serious hands-on work after going through the theory of the algorithms and other material. I have started by selecting a few use cases from Kaggle, but when I start one and explore the data, I can’t seem to understand how to proceed with it. There are so many algorithms, and so much that can be done for feature engineering and the like, that I am not able to pinpoint my next step, such as which algorithm I should continue with. Please help me with this. Thanks.

I recommend this process:

https://machinelearningmastery.com/start-here/#process

Thanks I’ll look into it

You’re welcome.

My model gives me 100% accuracy and I don’t know what to do!!!

I know that this is totally wrong.

I work on a region and I have data about urban and non-urban points for 4 years. I want to predict the urban and non-urban points for the next year. I also use some data about distances and the altitude of every point.

Perhaps this will help:

https://machinelearningmastery.com/faq/single-faq/what-does-it-mean-if-i-have-0-error-or-100-accuracy

Thanks a lot Jason!!! Do you think your example model for the Pima Indians dataset is suitable for my case? I have, of course, changed the number of neurons in the layers.

Perhaps try adapting it and see?

OK

Hello, this blog is what I needed. I really need help with an NLP problem. I have a folder with multiple text documents as train_docs and a train_tags folder with the corresponding tag files, written manually by a human annotator. I have also been given a test_docs folder with multiple text documents, and I have to automatically generate tags for these test docs.

I’m having trouble with how to combine these multiple text files into one CSV or Excel file so that I can use it for an ML model, and with which model I should use for this problem.

Thanks in advance.

Thanks.

Perhaps try loading the data into memory / a data structure so that you can manipulate it and use it with modeling.
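As a rough sketch of that loading step, the code below reads every text file in a documents folder into a pandas DataFrame and pairs it with a tag file. It assumes each tag file has the same file name as its document, which may not match your layout; `load_docs` is a hypothetical helper, and the tiny files created in a temporary directory exist only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_docs(docs_dir, tags_dir):
    """Load each .txt document and its same-named tag file into one table."""
    records = []
    for doc_path in sorted(Path(docs_dir).glob("*.txt")):
        tag_path = Path(tags_dir) / doc_path.name  # assumed naming convention
        tags = tag_path.read_text(encoding="utf-8").strip() if tag_path.exists() else None
        records.append({
            "filename": doc_path.name,
            "text": doc_path.read_text(encoding="utf-8"),
            "tags": tags,
        })
    return pd.DataFrame(records)


# Demonstration with two tiny made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "train_docs"
    tags = Path(tmp) / "train_tags"
    docs.mkdir()
    tags.mkdir()
    (docs / "a.txt").write_text("first document", encoding="utf-8")
    (docs / "b.txt").write_text("second document", encoding="utf-8")
    (tags / "a.txt").write_text("tag1, tag2", encoding="utf-8")
    (tags / "b.txt").write_text("tag3", encoding="utf-8")
    df = load_docs(docs, tags)

print(df)
```

If a single file is still wanted, `df.to_csv("train.csv", index=False)` writes the table out, although keeping the data in memory is usually enough for modeling.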

I am doing research on queuing theory problems. I want to use machine learning algorithms to predict waiting time or reduce waiting time. Please suggest an application where wait time is predicted using ML.

Perhaps first collect some data then this process may help:

https://machinelearningmastery.com/start-here/#process

Thank you sir for the tutorials on ML.

How can I predict gender from the height, weight and waist circumference of an individual?

What models can I use?

Thank you

Hi Hani…The following should help add clarity:

https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/

Hello, I’m writing not knowing if I’ll still get an answer or not, but sure, why not try, right?

I’m currently working on a project and I’m struggling with setting up the best scenario to tackle my problem.

My problem is that I want to be able to predict the cost of rehabilitating a house (target = total amount).

I have 20 numerical features (like building surface and so on) to help figure out the price.

My dataset consists of 37 rows, each an example of a rehabilitation project on one of 37 different houses (I have the total amount for each as well).

So I chose to use supervised machine learning regression models.

First I generated a correlation heatmap to see how my features correlate with my target variable (10 features have a high correlation coefficient, > 0.77).

My struggle is deciding which algorithm I can use to predict my target variable.

All I could think of is to use linear regression on the highly correlated features, but which model: SVM, Lasso Regression, Ridge Regression, Linear Regression (OLS), Stochastic Gradient Descent?

Or should I use decision tree models: Random Forest? Decision Tree? Gradient Boosting?

Would using Principal Component Analysis / Principal Component Regression / Partial Least Squares Regression help get better performance?

To validate my model, I’m thinking of leave-one-out CV, as I have a small dataset.

I’m getting lost in choosing the model that can best fit my problem.

I’m thinking you can help me choose several and then compare them on the r2 score?

Getting an answer from you would be so helpful to me.

Also, one last question: do you think I need feature selection for my problem?

Hi S.Nouhaila…Please narrow your query to a specific question so that we may better assist you.

My question is: can I use Random Forest for a small dataset (20 features, 1 label, and 37 rows)? This is knowing that the variance inflation factor (VIF) of my features is inf and that half of my features are highly correlated with my label.
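One way to explore this question empirically is to compare a few candidate models under leave-one-out cross-validation, as sketched below. The data is synthetic with the same shape as described (37 rows, 20 features), and the three models and their settings are arbitrary illustrative choices. Because each leave-one-out fold holds out a single row, R² cannot be computed per fold; the sketch instead collects the held-out predictions for all rows and scores them once.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 37 rows, 20 features, target driven by 5 of them.
rng = np.random.default_rng(42)
X = rng.normal(size=(37, 20))
y = X[:, :5] @ np.array([3.0, 2.0, -1.5, 1.0, 0.5]) + rng.normal(scale=0.5, size=37)

# Regularized linear models cope better with correlated inputs than plain OLS.
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Collect one held-out prediction per row, then compute a single R^2.
scores = {}
for name, model in models.items():
    preds = cross_val_predict(model, X, y, cv=LeaveOneOut())
    scores[name] = r2_score(y, preds)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Running the same comparison on the real 37 rows is the quickest way to see whether a random forest is competitive at this sample size.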

Hi Jason,

I am new to ML and my first project is to rank the ‘Modes of Payment’ (or Channels) used in the banking sector based on the usage distribution.

Channels may include mobile transactions, ATM transactions, web/online banking, cheques, etc.

My questions are:

1) How and where do I need to start to find an appropriate algorithm?

2) After finding one how do I make it adaptable to additional factors*?

For example, I need to find the usage distribution of channels on weekdays, weekends, and national holidays, etc.

So how can I predict or find the pattern of usage distribution, let’s say for the next week or month?

*Additional factors could be, for example, finding the most used ATM location in case of ATM transactions.

Hi Charan…You may want to investigate deep learning methods for time series forecasting:

https://machinelearningmastery.com/how-to-get-started-with-deep-learning-for-time-series-forecasting-7-day-mini-course/