Last Updated on August 16, 2020
When I’m asked about resources for big data, I typically recommend people watch Peter Norvig’s Big Data tech talk to Facebook Engineering from 2009.
It’s fantastic because he’s a great communicator and clearly presents the deceptively simple thesis of big data in this video.
In this blog post I summarize this video for you into cliff notes you can review.
Essentially, all models are wrong, but some are useful.
Quote by George Box.
Norvig starts out by summarizing that theories (models) are created by smart people who have insight. The process is slow and not reproducible, and the models have flaws in them. If the models are going to be wrong anyway, can we come up with a faster and simpler process to create them?
Big Data Case Studies
Three case studies are presented that demonstrate that simple models can be created from a large corpus of data. The three case studies are difficult problems from the field of natural language processing (NLP):
Word segmentation: The problem of separating unspaced characters into words so that the sentences have meaning. For example, written Chinese does not put spaces between words. The solution uses a simple probabilistic model of what constitutes a word, and the Python program fits on one page.
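To make the segmentation idea concrete, here is a minimal sketch of a unigram segmenter. It is not the program from the talk. The word counts in `COUNTS`, the unknown-word penalty in `log_prob` and the `MAX_WORD_LEN` limit are all assumptions made up for illustration; a real version would estimate counts from a large corpus.

```python
import math
from functools import lru_cache

# Toy unigram counts, invented for illustration only.
# A real system would estimate these from a large corpus.
COUNTS = {"choose": 50, "spain": 20, "chooses": 5, "pain": 30, "a": 500, "in": 400}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 10  # assumed upper bound on candidate word length


def log_prob(word):
    """Log-probability of a single word under the unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty: unknown words get less likely as they get longer.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (log-score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)  # memoized by lru_cache
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates)


score, words = segment("choosespain")
print(words)
```

With these toy counts, "choose spain" scores higher than "chooses pain" simply because the product of the word probabilities is larger. All of the knowledge lives in the counts, not in the code.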
Spelling correction: The problem of determining whether a word is a typo and what the correction should be. Again, a simple probabilistic model is used: it models what constitutes a word and how likely a word is to be a typo of a candidate correction, based on edit distance. It is a harder problem than segmentation.
Norvig compares his one-page Python program to an open source project that has sophisticated models. He comments on the maintainability of the hand-crafted models and the difficulty of adapting them to new languages. He contrasts this with the big data solution, which only requires a corpus to create the statistical model.
In addition to maintainability and adaptability, Norvig comments that the simpler statistical model can capture the detail that is hand-crafted into complex, smarter models because this detail is in the data. It is not necessary to split out and maintain smaller complex models.
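The spelling correction idea can be sketched in a few lines too. The version below is a simplified take in the spirit of Norvig’s spelling corrector tutorial, not a copy of it: it only looks at candidates one edit away and ignores any model of how likely each kind of typo is. The counts in `WORD_COUNTS` are made up for illustration.

```python
import string
from collections import Counter

# Toy word counts, invented for illustration only.
WORD_COUNTS = Counter({"spelling": 40, "spewing": 3, "the": 1000, "correct": 25})


def edits1(word):
    """All strings that are one edit away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that appear in the corpus counts."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself if known, else the most frequent known word one edit away."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correct("speling"))
```

Both "spelling" and "spewing" are one edit away from "speling", and the corrector picks "spelling" only because it is more frequent in the counts. Adapting this to another language means swapping the corpus and the alphabet, which is the maintainability point made above.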
Machine translation: The problem of translating one language into another. This is a more complex problem than segmentation and spelling correction. It requires a corpus of translated text, for example newspapers that have both an English and a Chinese edition. The problem is addressed as an alignment problem between the two languages. Many fancy models were tried but failed to add benefit over the simple statistical model.
Big Data Principles
Big data promotes a different mode of thinking about machine learning algorithms and datasets. The data is the model.
More Data versus Better Algorithms
An example problem from Microsoft Research on sentence disambiguation shows that the worst algorithm, given a dramatically larger dataset, beats the best algorithm trained on the smaller one. The lesson is to max out the data for a model and find the plateau before moving on to the next model.
Parametric versus Non-parametric
When you are data poor, there is not much you can do unless you have a good theory. You essentially throw the data away and rely on your model. If you are data rich, you have something you can work from. Keep all the data, because the situation could change, which would change your model.
Norvig finishes the talk with comments on supervised and unsupervised learning and the opportunity for semi-supervised methods that strike a balance and reap the benefits of both.
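As a rough illustration of the distinction (my own toy example, not one from the talk), the sketch below fits a straight line, which compresses the data into two parameters, and compares it with a nearest-neighbour lookup, which keeps every observation. The values in `DATA` and all function names are invented for illustration.

```python
# Toy (x, y) observations, invented for illustration only.
DATA = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def fit_line(data):
    """Parametric: summarize the data as (slope, intercept) by least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_line(params, x):
    """Predict from the two fitted parameters; the data is no longer needed."""
    slope, intercept = params
    return slope * x + intercept


def predict_nearest(data, x):
    """Non-parametric: keep all the data and answer from the closest observation."""
    return min(data, key=lambda point: abs(point[0] - x))[1]


params = fit_line(DATA)
print(predict_line(params, 2.9), predict_nearest(DATA, 2.9))
```

Once the line is fitted the data can be discarded, so the prediction is only as good as the assumed theory (here, linearity). The nearest-neighbour version has no theory to speak of and improves only by being given more data.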
This is a great video and is well worth the one hour to watch. Highly recommended if you are looking for insight into the big data movement.
You can get a good treatment of the same material by reading Norvig’s chapter contribution to the book Beautiful Data: The Stories Behind Elegant Data Solutions. You can download this chapter for free on Norvig’s webpage Natural Language Corpus Data.
Below is a list of resources if you are interested in learning or reading more about Norvig’s take on big data.
- Peter Norvig on big data at Facebook Engineering (Video): The subject of this blog post.
- How to Write a Spelling Corrector: Norvig’s tutorial on writing a spelling corrector in Python. I believe this is the example given in the talk.
- Google Web Trillion Word Corpus: The dataset commented on during the talk and the basis for the case studies. Also see Google’s announcement and the related Google Books Ngram Viewer.
- Natural Language Corpus Data: Norvig’s Python examples of using the Google ngram data from his contribution to the book Beautiful Data: The Stories Behind Elegant Data Solutions.
- Scaling to very very large corpora for natural language disambiguation (Banko and Brill 2001): I believe this is the Microsoft Research paper referenced in the talk as an argument for more data over more complex models.
Have you watched this video? Leave a comment and let me know what you thought.
I think that next to Deep Learning the “next big thing” will come up from more sophisticated unsupervised learning methods, given that most of the data we have and generate every day is unlabeled. It is very likely that a revolution in how we understand and extract value from unlabeled data will affect supervised learning methods, making semi-supervised learning the most popular field of study and work for machine learning practitioners.
What do you think about it? I’d love to know!
I think there will be a lot more reuse of pre-trained models so we stop spending so many cycles re-solving the same problems.