Data Science From Scratch: Book Review

By Jason Brownlee on August 16, 2020 in Machine Learning Resources

Programmers learn by implementing techniques from scratch.

It is a type of learning that is perhaps slower than other types of learning, but fuller in that all of the micro decisions involved become intimate. The implementation is owned from head to tail.

In this post we take a close look at Joel Grus's popular book “Data Science from Scratch: First Principles with Python”.

I recently finished reading the paperback version and I think it might be one of my favorite beginner machine learning books for the year. Grab a copy!


Overview of the Book

Let’s take a bird's-eye look at this book.

The Author: Joel Grus

The author of this book is Joel Grus, a software engineer at Google.

In previous roles he’s been a Data Scientist and Analyst at startups and engineer at Google. He got his PhD from Caltech. A very fine background.

Learn more about Joel on his LinkedIn profile and blog and Twitter.

The Target Audience: Beginners

The target audience for the book is intermediate programmers interested in getting started in data science and machine learning.

Python is not a prerequisite to read this book (there is a Python crash course in Chapter 2), but it would speed things up if you were already a Python programmer.

The book does not assume any mathematical background in machine learning (there is a crash course in Chapters 4-7), but again, some background in stats, probability and algebra would speed things along.

The Book Approach: Write Code (in Python)

This is an introductory book to data science and machine learning.

The majority of the text focuses on the implementation of machine learning algorithms. There is a brief introduction to Python and the coverage of some basic math, data visualization and data gathering subjects.

It will take you from being a beginner programmer to being able to implement machine learning algorithms to address various data science problems.

Code From Scratch

The approach taken in the book is to describe the concepts and then to implement them in Python from scratch. This means without the use of machine learning and data handling libraries (e.g. scikit-learn).

The stated goal by the author of implementing algorithms from scratch is:

…building tools and implementing algorithms by hand in order to better understand them.

Good code examples must be readable first and efficient second. The examples in the book are written to be understood, as a teaching aid, not as production-level code. Take note that the code you write from scratch will be instructional only, not something you would deploy.

I put a lot of thought into creating implementations and examples that are clear, well commented, and readable. In most cases, the tools we build will be illuminating but impractical.
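To give a flavor of what “from scratch” means in practice, here is a minimal sketch of simple linear regression using only built-in Python. It is not code from the book; the function names `mean` and `least_squares_fit` and the toy data are assumptions chosen for illustration.

```python
from typing import List, Tuple


def mean(xs: List[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def least_squares_fit(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = alpha + beta * x by ordinary least squares.

    Illustrative helper (not from the book): beta is the ratio of the
    summed co-deviations of x and y to the summed squared deviations of x.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    co_deviation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    squared_deviation = sum((x - x_bar) ** 2 for x in xs)
    beta = co_deviation / squared_deviation
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Toy data generated from y = 2x + 1, so the fit should recover those values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
alpha, beta = least_squares_fit(xs, ys)
print(f"intercept={alpha:.2f}, slope={beta:.2f}")
```

Nothing here is fast or robust (there is no handling of empty input or constant `xs`, for example), which is the point: every step of the calculation is visible.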

Book Contents

The book is 311 pages long and contains 25 chapters. It’s a classic O’Reilly book and is the perfect form factor to have open in front of you while you bash away at the keyboard implementing the code examples.

In this section we take a look at the table of contents:

a data scientist is someone who extracts insights from messy data

Chapter 1: Introduction (What is data science?)
Chapter 2: A Crash Course in Python (syntax, data structures, control flow, and other features)
Chapter 3: Visualizing Data (bar, line and scatter plots with matplotlib)
Chapter 4: Linear Algebra (vectors and matrices)
Chapter 5: Statistics (central tendency and correlations)
Chapter 6: Probability (Bayes’ Theorem, Random Variables, Normality)
Chapter 7: Hypothesis and Inference (confidence intervals, P values, Bayesian inference)
Chapter 8: Gradient Descent (gradients, steps, stochastic variation)
Chapter 9: Getting Data (scraping HTML, JSON APIs)
Chapter 10: Working with Data (basic viz, data transforms)

machine learning… [refers] to creating and using models that are learned from data […] this might be called predictive modeling or data mining

Chapter 11: Machine Learning (fitting, bias-variance, feature selection)
Chapter 12: k-Nearest Neighbors (also curse of dimensionality)
Chapter 13: Naive Bayes
Chapter 14: Simple Linear Regression (also gradient descent)
Chapter 15: Multiple Regression (also bootstrap, regularization)
Chapter 16: Logistic Regression (also SVM)
Chapter 17: Decision Trees (also random forest)
Chapter 18: Neural Networks (perceptron and back-prop)
Chapter 19: Clustering (k-Means)

Natural language processing (NLP) refers to computational techniques involving language.

Chapter 20: Natural Language Processing (n-gram, grammars, Gibbs sampling)
Chapter 21: Network Analysis (Centrality and PageRank)
Chapter 22: Recommender Systems (user- and item-based)
Chapter 23: Databases and SQL (basic usage)
Chapter 24: MapReduce (various worked examples)
Chapter 25: Go Forth and Do Data Science (libs you should use)

Implementing things “from scratch” is great for understanding how they work. But it’s generally not great for performance …, ease of use, rapid prototyping, or error handling. In practice, you’ll want to use well-designed libraries that solidly implement the fundamentals.

Opinions of the Book

I generally liked the table of contents, except I would make some changes.

I would drop some of the later chapters like NLP, Network Analysis, and so on (Chapters 20-24) and rename the book “Machine Learning Algorithms from Scratch”. It would be a less sexy but more honest and accurate title.

Data Science is about formulating the questions then gathering the data and building the models to answer them. We don’t really need a data science from scratch book unless it was a bunch of business case studies plus the modeling. From scratch in data science really means the algorithms part.

I’m not upset; in fact, I had a great time reading this book. But I could imagine someone who expected systematic processes for formulating and working through business-data problems, in addition to the modeling, feeling a little bit misled.

I did not implement all of the algorithms from scratch. I read the whole book, studied all of the examples, but I only implemented a few for fun.

I found the code easy to read, commented just enough. I think going vanilla Python (over NumPy) was a good move. It lowered the bar just enough so that all you need is some basic Python syntax and away you go.
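As a rough illustration of the vanilla-Python style, vectors can be plain lists and the usual operations become short functions. This is a sketch under my own assumptions, not the book's code; the names `Vector`, `add`, `dot`, and `distance` are chosen here for the example.

```python
import math
from typing import List

# A vector is just a list of floats; no array library is required.
Vector = List[float]


def add(v: Vector, w: Vector) -> Vector:
    """Element-wise sum of two vectors of equal length."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def dot(v: Vector, w: Vector) -> float:
    """Sum of the element-wise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def distance(v: Vector, w: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))


total = add([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
product = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
gap = distance([0.0, 0.0], [3.0, 4.0])
print(total, product, gap)
```

With NumPy the same operations would be `v + w`, `np.dot(v, w)`, and `np.linalg.norm(v - w)` on arrays, which is shorter and much faster but hides the loops a beginner is trying to learn.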

Resources

I’ve gathered up some additional resources related to the book if you’re interested in diving deeper.

Data Science from Scratch: First Principles with Python on Amazon
GitHub Repo providing all code from the book (with bug fixes and sample data)

Final Thoughts

I like the book. I had fun, I think primarily because I have always liked working through programming books and because I’ve written a book just like this myself (i.e. Clever Algorithms).

If you’ve been around the block and you’re hard core into scikit-learn or R right now and not interested in the distraction, this book is probably not for you. But remember, the learning never ends and it can be fun to go over the beginner stuff again and tighten up the screws.

If you know some Python (or you’re a solid dev and want to get into Python) and you want to get intimate with machine learning algorithms by implementing them, then this book is for you.

Grab a copy!

Did you read Data Science From Scratch? What did you think? Leave a comment.

33 Responses to Data Science From Scratch: Book Review

Sergey September 2, 2015 at 5:27 am #

I have actually already bought the book, but still haven’t read it, or even started! So much distracting stuff! 🙂 But I feel like your review gives me additional motivation to finally do it!

- Jason Brownlee September 3, 2015 at 11:16 am #
  
  Get started, Sergey, and let me know what you think of it.
  
Aman Tandon September 3, 2015 at 10:50 am #

I already have a copy of this book, and so far I have only finished the first chapter.

Jason, since you also recommend this book, I feel great about picking it to start my journey into data science.

- Jason Brownlee September 3, 2015 at 11:15 am #
  
  Nice one, I’d love to hear what you think about the book, Aman.
  
Jamal May 19, 2016 at 9:40 am #

I’m reading through this book now. Like you, I’m reading it through and looking at the examples but not really implementing them myself. Occasionally I play with the provided code.

It’s a well-written book and I think it is very good for getting a high-level view of data science and ML. Although it implements things from scratch, it doesn’t go super deep into the mathematical concepts behind them. But it lets you know what you can do.

As well as the code implementations I would have liked more mathematical notation to go with them.

A nice review thanks.

Ayan November 21, 2016 at 8:29 pm #

Hi,
I’ve ordered this book but I wonder if it’s the right book for me. Do you have any books on machine learning that you would recommend to someone who has an engineering degree? I’ve got a good mathematical background and can do a little programming, though my stats and general computer science background is limited.
Thanks!

- Jason Brownlee November 22, 2016 at 7:04 am #
  
  Hi Ayan,
  
  I find that developers and engineers do like to implement the algorithms from scratch in order to really understand them.
  
  We also like to jump in there and start modeling. If this sounds like you, check out this book:
  https://machinelearningmastery.com/machine-learning-with-python/
  
Proquotient January 3, 2017 at 5:32 pm #

Thank you for sharing this great, detailed review of the book. Many people want to learn data science, and one of the best ways is to select a great book to get started. This review will help people decide if it is the correct choice for them!

- Jason Brownlee January 4, 2017 at 8:51 am #
  
  Thanks Proquotient.
  
rahul November 8, 2017 at 5:51 pm #

Hi,
May I ask how many days you took to complete this book? It would give me a rough idea for planning my schedule.

- Jason Brownlee November 9, 2017 at 9:53 am #
  
  I read it over a few train rides in a week.
  
  Please read the book at your own pace. No need to go faster/slower.
  
Sanjay Dasgupta December 8, 2017 at 2:56 pm #

Hi Jason,

Have you seen a notebooks-only (jupyter) version of the book code?

If not, do you think creating a notebook version of the repository would be a worthwhile endeavor?

Thanks for any ideas you may have.

- Jason Brownlee December 9, 2017 at 5:35 am #
  
  No, I have not. It might be a fun exercise for you, but I would not recommend posting it online to avoid copyright infringement.
  
Siddharth September 9, 2018 at 4:56 am #

Data Science is about formulating the questions then gathering the data and building the models to answer them. We don’t really need a data science from scratch book unless it was a bunch of business case studies plus the modeling –
Is there any book to learn that?

- Jason Brownlee September 9, 2018 at 6:00 am #
  
  Perhaps you’re right.
  
  I have seen no such book.
  
Harshali October 1, 2018 at 9:53 pm #

Hello Jason,

Data Science from Scratch must be a great book. I also want to learn Python for Data Science. This blog ‘https://data-flair.training/blogs/best-python-book/’ on Python books suggested “Python for Data Analysis” and “Python machine learning”. Are there any others you would recommend?

Thanks in advance.

Cheers!

- Jason Brownlee October 2, 2018 at 6:25 am #
  
  Thanks.
  
  I recommend that you start working through projects and only get a book if you need to go deeper into the topic.
  
msdhoni January 8, 2019 at 6:24 pm #

Great content, useful for all Data Science training candidates who want to kick-start their careers in the Data Science field.

- Jason Brownlee January 9, 2019 at 8:41 am #
  
  Thanks.
  
TeeJ February 8, 2019 at 2:44 pm #

Best book I have read in a while! Not sure what to read next. Any recommendations?

- Jason Brownlee February 9, 2019 at 5:53 am #
  
  Thanks.
  
  Perhaps some of these tutorials will interest you:
  https://machinelearningmastery.com/start-here/#code_algorithms
  
Dukool Sharma April 22, 2019 at 4:07 pm #

Hey Jason,
I have started to learn Data Science all by myself. I have some blogs too. I was looking for a book to start from but wasn’t able to find one. Your review of the book has helped me to choose it as a starting point. Thanks for the help! I'm looking for more suggestions. Any leads?

- Jason Brownlee April 23, 2019 at 7:52 am #
  
  Thanks, happy to hear that.
  
  It really depends on what you’re interested in learning about.
  
  Perhaps start here:
  https://machinelearningmastery.com/faq/single-faq/what-other-machine-learning-books-do-you-recommend
  
Andres May 21, 2019 at 9:26 am #

The book was a little too loose mathematically for me. I don’t mean that in an “I prefer the theory” sense; I mean that many of the algorithms (especially Chapter 8 and onward) would be vastly clearer if the domain, range, etc. of the various functions were stated. It would also help to describe gradient descent accurately, rather than through Python code with (x,y) as the data variable. I understand the temptation to write general code, but in doing so, it would help to clarify what each variable means, what function is being minimized, and so on.

These problems start small and grow. I have no problem with the mathematics of the algorithms, and I have no problem with Python. But at times, translating the mathematics to Python in text is a nightmare, and makes certain sections unreadable because the math is black-boxed but still included in the code to retain the “from scratch” principle.

- Jason Brownlee May 21, 2019 at 2:42 pm #
  
  Great note, thanks Andres.
  
Heena January 14, 2020 at 5:05 pm #

Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data.
I took a Data Science course from TGC India. They offer a variety of tutorials covering everything from the processes of Data Science to how to get started with Data Science.

- Jason Brownlee January 15, 2020 at 8:18 am #
  
  Thanks for sharing.
  
venkatesh February 24, 2020 at 5:27 pm #

Thanks for sharing such an informative post on Data Science, I was looking for this info for a really long time.

- Jason Brownlee February 25, 2020 at 7:42 am #
  
  You’re welcome.
  
Danny August 3, 2020 at 10:55 pm #

I’ve just bought this book after reading your review. I’m happy about the learning path and how the table of contents is structured. I literally cannot wait for it to arrive 🙂

- Jason Brownlee August 4, 2020 at 6:40 am #
  
  Nice work!
  
Anil306 October 24, 2024 at 8:11 pm #

This content is valuable for all aspiring Data Science candidates looking to launch their careers in the field of Data Science training.
Thank you.

- James Carmichael October 25, 2024 at 8:23 am #
  
  You are very welcome Anil! Let us know if you have any questions.
  