Are you a Java programmer looking to get started with or practice machine learning?
Writing programs that make use of machine learning is the best way to learn machine learning. You can write the algorithms yourself from scratch, but you can make a lot more progress if you leverage an existing open source library.
In this post you will discover the major platforms and open source machine learning libraries you can use in Java.
This section describes Java-based environments or workbenches that can be used for machine learning. They are called environments because they provide graphical user interfaces for performing machine learning tasks, as well as Java APIs for developing your own applications.
Waikato Environment for Knowledge Analysis (Weka) is a machine learning platform developed by the University of Waikato, New Zealand. It is written in Java and provides a graphical user interface, command line interface and Java API. It is perhaps the most popular Java machine learning library and a great place to start or practice machine learning.
The Konstanz Information Miner (KNIME) is an analytics and reporting platform developed by the University of Konstanz, Germany. It was developed with a focus on pharmaceutical research, but has expanded into general business intelligence. It provides a graphical user interface (based on Eclipse) and a Java API.
RapidMiner used to be called Yet Another Learning Environment (YALE) and was developed at the Technical University of Dortmund, Germany. It provides a GUI and a Java API for developing your own applications, and covers data handling, visualization and modeling with machine learning algorithms.
The Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a data mining workbench developed in Java by the Ludwig Maximilian University of Munich, Germany. It focuses on working with data in relational databases for tasks such as outlier detection and classification (distance function based methods). It provides a mini GUI, command line interface and Java API.
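To make "distance function based" outlier detection concrete, here is a minimal sketch in plain Java. It does not use ELKI or its API; the class and method names are made up for illustration, and it assumes a small in-memory data set and Euclidean distance. Each point is scored by the distance to its k-th nearest neighbor, so a point far from everything else gets a high score.

```java
import java.util.Arrays;

/** Illustrative sketch only (not ELKI): score each point by the distance to its k-th nearest neighbor. */
class KnnOutlierSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** For each point, returns the distance to its k-th nearest other point (brute force, O(n^2)). */
    static double[] kthNeighborDistances(double[][] points, int k) {
        int n = points.length;
        if (k < 1 || k >= n) {
            throw new IllegalArgumentException("k must be between 1 and n - 1");
        }
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double[] dists = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    dists[idx++] = distance(points[i], points[j]);
                }
            }
            Arrays.sort(dists);
            scores[i] = dists[k - 1]; // larger score = more isolated point
        }
        return scores;
    }

    /** Index of the largest value. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Four points close together and one far away (made-up data).
        double[][] points = {
            {1.0, 1.0}, {1.2, 0.9}, {0.8, 1.1}, {1.1, 1.2}, {8.0, 8.0}
        };
        double[] scores = kthNeighborDistances(points, 2);
        System.out.println("Scores: " + Arrays.toString(scores));
        System.out.println("Most outlying point index: " + argMax(scores));
    }
}
```

The brute-force loop compares every pair of points, which is fine for a toy data set but is exactly the cost that index structures are meant to reduce on larger ones.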
Practically every project listed on this page is or has a library with a Java API, but the projects listed in this section provide only a Java API. They are machine learning libraries in the narrow sense.
The Java Machine Learning Library (Java-ML) provides a collection of machine learning algorithms implemented in Java. It provides a standard interface for each algorithm and references to the relevant scientific literature for further reading, but no user interface. It includes methods for data manipulation, clustering, feature selection and classification. Note that at the time of writing, the last release was in 2012.
The Java Statistical Analysis Tool (JSAT) provides pure Java implementations of standard machine learning algorithms for modest-sized problems. The author comments that he developed the library partly as a self-education exercise and partly to get things done. Nevertheless, the list of algorithms is impressive. It includes classification, regression, ensemble, clustering and feature selection methods.
This section lists Java projects intended for use with Big Data, such as on clusters of machines.
Apache Mahout provides implementations of machine learning algorithms for use on the Apache Hadoop platform (distributed map-reduce). The project focuses on clustering and classification algorithms, and a popular application driving its development is collaborative filtering for recommender systems. Reference implementations of algorithms that run on a single node are also included.
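If collaborative filtering is new to you, the idea fits in a few lines. The sketch below is plain single-machine Java, not Mahout and not its API; the class name, the tiny ratings matrix and the convention that 0 means "not rated" are all assumptions made for illustration. It predicts a missing rating as a similarity-weighted average of the ratings given by other users.

```java
/** Illustrative sketch only (not Mahout): user-based collaborative filtering with cosine similarity. */
class CollaborativeFilteringSketch {

    /** Cosine similarity of two rating vectors; unrated items (0) contribute nothing to the dot product. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a user with no ratings is similar to nobody
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Predicts a user's rating for an item from the ratings of other users who rated that item. */
    static double predict(double[][] ratings, int user, int item) {
        double weightedSum = 0.0;
        double similaritySum = 0.0;
        for (int other = 0; other < ratings.length; other++) {
            if (other == user || ratings[other][item] == 0.0) {
                continue; // skip the user themselves and users who have not rated the item
            }
            double similarity = cosine(ratings[user], ratings[other]);
            weightedSum += similarity * ratings[other][item];
            similaritySum += similarity;
        }
        return similaritySum == 0.0 ? 0.0 : weightedSum / similaritySum;
    }

    public static void main(String[] args) {
        // Rows are users, columns are items, 0 means "not rated" (made-up data).
        double[][] ratings = {
            {5, 4, 0, 1}, // user 0 has not rated item 2
            {5, 5, 4, 1}, // tastes close to user 0, rated item 2 with 4
            {1, 1, 2, 5}, // tastes far from user 0, rated item 2 with 2
        };
        System.out.printf("Predicted rating of item 2 for user 0: %.2f%n", predict(ratings, 0, 2));
    }
}
```

Because user 1 is more similar to user 0 than user 2 is, the prediction lands closer to user 1's rating. A distributed implementation has to compute the same kind of similarities across far more users and items than fit on one machine.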
Apache Spark's Machine Learning Library (MLlib) provides implementations of machine learning algorithms for use on the Apache Spark platform (HDFS, but not map-reduce). Although Spark itself is written in Scala, the library and the platform provide Java, Scala and Python APIs. At the time of writing, the library is new and the list of algorithms is short, but growing quickly.
Massive Online Analysis (MOA) is an open source platform designed for data stream mining by the University of Waikato, New Zealand. Like Weka (developed at the same place), it provides a GUI, command line interface and Java API. It provides a long list of algorithms with a focus on classification, as well as support for outlier detection and addressing concept drift. MOA uses the Advanced Data mining And Machine learning System (ADAMS), also developed at the same place, for managing workflows.
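What sets stream mining apart is that each instance is seen once and the algorithm keeps only a small, fixed amount of state. As a rough illustration of that style (plain Java, not MOA and not its API, with a hypothetical class name), the sketch below maintains a running mean and variance using Welford's one-pass update, without ever storing the stream.

```java
/** Illustrative sketch only (not MOA): one-pass running mean and variance using Welford's update. */
class RunningStatsSketch {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the current mean

    /** Consumes one value from the stream in constant time and memory. */
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance of the values seen so far (0 until at least two values have arrived). */
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        RunningStatsSketch stats = new RunningStatsSketch();
        double[] stream = {2, 4, 4, 4, 5, 5, 7, 9}; // stands in for an unbounded stream
        for (double x : stream) {
            stats.update(x);
            System.out.printf("n=%d mean=%.3f variance=%.3f%n",
                    stats.count(), stats.mean(), stats.variance());
        }
    }
}
```

The statistics are available after every update, which is the property a streaming learner needs: it can answer at any point without waiting for the data to end.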
Scalable Advanced Massive Online Analysis (SAMOA) is a distributed streaming machine learning framework developed by Yahoo!. It is designed to run on Apache Storm and Apache S4. The system can leverage the algorithms provided by the MOA project for tasks like classification.
Natural Language Processing
This section is dedicated to Java libraries and projects for addressing problems from the subfield of machine learning called Natural Language Processing (NLP).
NLP is not my area, so I’ll just point to the key libraries.
- OpenNLP: Apache OpenNLP is a toolkit for processing natural language text. It provides methods for NLP tasks such as tokenization, segmentation, and entity extraction.
- LingPipe: LingPipe is a toolkit for computational linguistics and includes methods for topic classification, entity extraction, clustering, and sentiment analysis.
- GATE: The General Architecture for Text Engineering (GATE) is an open source library for text processing. It provides an array of sub-projects targeted at different use cases.
- MALLET: Machine Learning for Language Toolkit (MALLET) is a Java toolkit for statistical natural language processing, document classification, clustering, topic modeling and information extraction.
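Tokenization, the first task mentioned above, is simply splitting text into word-like units. As a rough idea of what that involves, here is a small sketch that uses only the JDK's java.text.BreakIterator rather than any of the libraries listed; the class name is made up, and dropping punctuation and whitespace is an assumption of this sketch, not a rule that the toolkits above necessarily follow.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only (JDK classes, no NLP library): split text into word tokens. */
class TokenizerSketch {

    /** Returns the word tokens of the text, skipping whitespace and punctuation segments. */
    static List<String> tokenize(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Keep only segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Weka is written in Java, and so is MOA."));
    }
}
```

Real text is messier than this sentence (abbreviations, hyphenation, other languages), which is a good reason to reach for a dedicated toolkit once you move past experiments.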
This section lists those libraries for the subfield of machine learning called Computer Vision (CV).
Again, CV is not my area, so I’ll just point to the key libraries.
- BoofCV: BoofCV is an open source library for computer vision and robotics applications. Its capabilities include image processing, image features, geometric vision, calibration, recognition and image data I/O.
Neural Nets are hot again with the development of deep learning methods and faster hardware. This section lists key Java libraries for working with neural networks and deep learning.
- Encog: Encog is a machine learning library that provides algorithms such as SVM, classical neural networks, genetic programming, Bayesian networks, HMM and genetic algorithms.
- Deeplearning4j: Deeplearning4j is claimed to be a commercial-grade deep learning library written in Java. It is described as being compatible with Hadoop and provides algorithms including restricted Boltzmann machines, deep belief networks and stacked denoising autoencoders.
In this round-up post we have touched on the big-name options to consider when selecting a machine learning library or platform for work in Java.
These are the players and the popular projects, but by no means is this list complete. For example, take a look at this page on MLOSS.org that lists (at the time of writing) 71 Java-based open source machine learning projects. That’s a lot and I’m sure there are more on GitHub and SourceForge.
The key is to think hard about your own project and its requirements. Figure out what you need from a library or platform and then pick and learn a project that best fits your needs.