Apache Spark

2,300.00

Course ID: D8007

Duration: 3 Days

Location: Flex - San Francisco or Live Online. Private courses can also be scheduled.

Overview 

This 3-day Spark v2 for Developers course introduces Apache Spark to software developers and data analysts. Students learn how to use Spark for data analysis and how to write their own Spark applications, all in a cloud environment. In particular, students will learn about:

  • Spark Shell

  • Spark internals

  • Spark data structures: RDDs, DataFrames, Datasets

  • Spark APIs

  • Spark SQL

  • Spark and Hadoop

  • Spark MLlib

  • Spark GraphX

  • Spark Streaming

  • Tuning Spark applications 

Who Should Attend

This is a moderately technical course. Students should have some familiarity with the Linux development environment and with one of these programming languages: Java, Python, or Scala. Typical students include developers and data analysts.

Course Outline

Module 1: Scala primer

  • A quick introduction to Scala
  • Labs: Getting to know Scala

Module 2: Spark Basics

  • Big Data, Hadoop, Spark
  • What’s new in Spark v2
  • Spark concepts and architecture
  • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
  • Labs: Installing and running Spark

Module 3: Spark Shell

  • Spark shell
  • Spark web UIs
  • Analyzing a dataset – part 1
  • Labs: Spark shell exploration

Module 4: RDDs (Condensed coverage)

  • RDD concepts
  • Partitions
  • RDD operations / transformations
  • More detailed coverage if required: RDD types, key-value pair RDDs, MapReduce on RDDs
  • Labs: Unstructured data analytics using RDDs

Module 5: Spark DataFrames & Datasets

  • Learning about DataFrames / Datasets
  • Programming with the DataFrame / Dataset API
  • Loading structured data using DataFrames
  • Caching and persistence
  • Labs: DataFrames, Datasets, caching

Module 6: Spark API programming (Scala / Python)

  • Introduction to the Spark API
  • Submitting the first program to Spark
  • Debugging / logging
  • Configuration properties
  • Labs: Programming with the Spark API, submitting jobs

Module 7: Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: Querying structured data using SQL; evaluating data formats

Module 8: Spark and Hadoop

  • Hadoop primer: HDFS / YARN
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive

Module 9: Machine Learning (ML / MLlib)

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (newer Spark 2 version)
  • Algorithms: clustering, classification, recommendations
  • Labs: Writing ML applications

Module 10: GraphX

  • GraphX library overview
  • GraphX APIs
  • Labs: Processing graph data using Spark

Module 11: Spark Streaming

  • Streaming overview
  • Evaluating Streaming platforms
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Labs: Writing Spark Streaming applications

Module 12: Spark Performance and Tuning

  • Broadcast variables
  • Accumulators
  • Memory management & caching

Prerequisites

  • Familiarity with Java, Scala, or Python (labs are in Scala and Python; a quick Scala introduction is provided)
  • Basic understanding of the Linux development environment (command-line navigation, running commands)

What to Bring: 

Certifications

N/A