Originally posted on Data Science Central

This article introduces Mahout, a library for scalable machine learning, and studies potential applications through two Mahout projects. It was written by Linda Terlouw. Linda is a computer scientist who works on Data Science (Data Analysis, Data Visualization, Process Mining).

Apache Mahout is a library for scalable machine learning. Originally a subproject of Apache Lucene (a high-performance text search engine library), Mahout has progressed to be a top-level Apache project.

Although Mahout has only been around for a few years, it has established itself as a frontrunner among machine learning technologies. Current adopters include Foursquare, which uses Mahout with Apache Hadoop and Apache Hive to power its recommendation engine; Twitter, which creates user interest models using Mahout; and Yahoo!, which uses Mahout in its anti-spam analytic platform. Other commercial and academic uses of Mahout are catalogued at https://mahout.apache.org/general/powered-by-mahout.html.

This Refcard will present the basics of Mahout by studying two possible applications:

Training and testing a Random Forest for handwriting recognition using Amazon Web Services EMR, and
Running a recommendation engine on a standalone Spark cluster.

The article is organized into the following sections:

Machine Learning
Algorithms Supported in Apache Mahout
Installing Apache Mahout
Example of Multi-Class Classification Using Amazon Elastic MapReduce
Getting and Preparing the Data
Classifying From Command Line Using Amazon Elastic MapReduce
Interpreting the Test Results
Using Apache Mahout With Apache Spark for Recommendations
Running Mahout from Java or Scala

To read the full Refcard, see the source listed at the end of this post.

Source: Distributed Machine Learning with Apache Mahout