Machine Learning with Spark

Course ID: D8008

Duration: 3 Days

Location: Flex - San Francisco or Live Online


Machine Learning with Apache Spark is designed to teach popular machine learning algorithms from scratch. Students will study the foundations, applicability, and limitations of each machine learning concept, as well as how to implement the concept. The course is split equally between lectures and practical work, giving students a hands-on learning experience.

What sets this course apart is that students will have access to real-life datasets, from global companies such as Netflix and Uber, to work on.

+ Who Should Attend

Machine Learning with Apache Spark is a beginner-friendly course aimed at data scientists and software engineers. We assume no previous knowledge of machine learning, but a working knowledge of Spark is required, and knowledge of Python is beneficial.

+ Course Outline

Module 1: Machine Learning (ML) Overview

  • Machine Learning landscape
  • Machine Learning applications
  • Understanding ML algorithms & models

Module 2: ML in Python and Spark

  • Spark ML Overview
  • Introduction to Jupyter notebooks
  • Lab: Working with Jupyter + Python + Spark
  • Lab: Spark ML utilities

Module 3: Machine Learning Concepts

  • Statistics Primer
  • Covariance, Correlation, Covariance Matrix
  • Errors, Residuals
  • Overfitting / Underfitting
  • Cross-validation, bootstrapping
  • Confusion Matrix
  • ROC curve, Area Under Curve (AUC)
  • Lab: Basic stats

Module 4: Feature Engineering (FE)

  • Preparing data for ML
  • Extracting features, enhancing data
  • Data cleanup
  • Visualizing Data
  • Lab: data cleanup
  • Lab: visualizing data

Module 5: Linear Regression

  • Simple Linear Regression
  • Multiple Linear Regression
  • Running LR
  • Evaluating LR model performance
  • Lab
  • Use case: House price estimates

Module 6: Logistic Regression

  • Understanding Logistic Regression
  • Calculating Logistic Regression
  • Evaluating model performance
  • Lab
  • Use case: credit card application, college admissions

Module 7: Classification: SVM (Support Vector Machines)

  • SVM concepts and theory
  • SVM with kernel
  • Lab
  • Use case: Customer churn data

Module 8: Classification: Decision Trees & Random Forests

  • Theory behind trees
  • Classification and Regression Trees (CART)
  • Random Forest concepts
  • Labs
  • Use case: predicting loan defaults, estimating election contributions

Module 9: Classification: Naive Bayes

  • Theory
  • Lab
  • Use case: spam filtering

Module 10: Clustering (K-Means)

  • Theory behind K-Means
  • Running K-Means algorithm
  • Estimating the performance
  • Lab
  • Use case: grouping cars data, grouping shopping data

Module 11: Principal Component Analysis (PCA)

  • Understanding PCA concepts
  • PCA applications
  • Running a PCA algorithm
  • Evaluating results
  • Lab
  • Use case: analyzing retail shopping data

Module 12: Recommendations (Collaborative filtering)

  • Recommender systems overview
  • Collaborative Filtering concepts
  • Lab
  • Use case: movie recommendations, music recommendations

Module 13: Performance

  • Best practices for scaling and optimizing Apache Spark
  • Memory caching
  • Testing and validation

Module 14: Final workshop (time permitting)

Students will analyze a couple of datasets and run ML algorithms on them as a group exercise. Each group will present its findings to the class.

+ Prerequisites

A working knowledge of Spark.

+ Certifications