Spark for Data Analysts

Duration: 3 Days
Course Price: $2,195

Overview

Spark is a fast-growing and widely used Big Data processing engine. Spark MLlib, its machine learning library, is the de facto standard for machine learning on Big Data.

This course is intended for data scientists and software engineers, and it balances theory with practice. For each machine learning concept, we first discuss its foundations, applicability, and limitations. We then explain its implementation and use, and work through specific use cases. The course is about 50% lecture and 50% lab work.

Audience

  • Data Scientists
  • Software Engineers

 

Objectives

  • Attain a thorough understanding of popular machine learning algorithms, their applicability, and their limitations
  • Practice applying these methods in the Spark machine learning environment
  • Gain clarity on the real-world use of machine learning, with each method illustrated by practical use cases

Prerequisites

  • Familiarity with programming in at least one language
  • Ability to navigate the Linux command line
  • Basic knowledge of command-line Linux editors (vi / nano)

Outline

Section 1: Introductions and overviews

  • Machine learning: goals, results, supervised/unsupervised
  • Spark as a tool for Big Data
  • Scala as the language of Spark (together with Python, Java and R)
    If students do not have the Spark/Scala prerequisites, this section includes a thorough introduction to both

Section 2: SVM (Support Vector Machines)

  • Theory
  • Lab
  • Use case: anomaly detection

Section 3: Logistic Regression

  • Theory
  • Lab
  • Use case: healthcare prediction

Section 4: Linear Regression

  • Theory
  • Lab
  • Use case: financial modelling

Section 5: Naive Bayes

  • Theory
  • Lab
  • Use case: spam filtering

Section 6: Decision Trees

  • Theory
  • Lab
  • Use case: vessel shipment planning

Section 7: Clustering (K-Means)

  • Theory
  • Lab
  • Use case: topic grouping


Section 8: LDA (Latent Dirichlet Allocation)

  • Theory
  • Lab
  • Use case: unsupervised topic discovery

Section 9: Principal Component Analysis (PCA)

  • Theory
  • Lab
  • Use case: stock analysis


Section 10: Recommendation (Collaborative filtering)

  • Theory
  • Lab
  • Use case: dating


Section 11: Graphs – graph operations

  • Theory
  • Lab
  • Use case: finding followers


Section 12: Graphs – optimizations with Pregel

  • Theory
  • Lab
  • Use case: shortest routes, PageRank

 
