Spark for Developers

Duration: 3 Days
Course Price: $2,495

Overview

This course introduces Apache Spark. Students learn how to use Spark for data analysis and how to write Spark applications.

 

Completely updated for Spark 2.x! Spark 2 changed substantially from version 1, and this course covers the latest Spark 2 features.

Audience

Developers / Data Analysts

 

Objectives

· Spark Shell

· Spark internals

· Spark data structures : RDDs, DataFrames, Datasets

· Spark APIs

· Spark SQL

· Spark and Hadoop

· Spark MLlib

· Spark GraphX

· Spark Streaming

· Tuning Spark applications

Pre-requisites

· Familiarity with Java, Scala, or Python (labs are in Scala and Python; a quick Scala introduction is provided)

· Basic understanding of the Linux development environment (command-line navigation, running commands)

Outline

1. Scala primer

· A quick introduction to Scala

· Labs : Getting to know Scala

2. Spark Basics

· Big Data, Hadoop, Spark

· What’s new in Spark v2

· Spark concepts and architecture

· Spark ecosystem (Core, Spark SQL, MLlib, Streaming)

· Labs : Installing and running Spark

 

3. Spark Shell

· Spark shell

· Spark web UIs

· Analyzing dataset – part 1

· Labs : Spark shell exploration

4. RDDs (Condensed coverage)

· RDD concepts

· Partitions

· RDD Operations / transformations

· More detailed coverage if required : RDD types, Key-Value pair RDDs, MapReduce on RDD

· Labs : Unstructured data analytics using RDDs

 

5. Spark DataFrames & Datasets

· Learning about DataFrames / Datasets

· Programming in the DataFrame / Dataset API

· Loading structured data using DataFrames

· Caching and persistence

· Labs : DataFrames, Datasets, Caching

 

6. Spark API programming (Scala / Python)

· Introduction to Spark API

· Submitting the first program to Spark

· Debugging / logging

· Configuration properties

· Labs : Programming in Spark API, Submitting jobs

 

7. Spark SQL

· Spark SQL concepts and overview

· Defining tables and importing datasets

· Querying data using SQL

· Handling various storage formats : JSON / Parquet / ORC

· Labs : Querying structured data using SQL; evaluating data formats

 

8. Spark and Hadoop

· Hadoop Primer : HDFS / YARN

· Hadoop + Spark architecture

· Running Spark on Hadoop YARN

· Processing HDFS files using Spark

· Spark & Hive

 

9. Machine Learning (ML / MLlib)

· Machine Learning primer

· Machine Learning in Spark : MLlib / ML

· Spark ML overview (newer Spark 2 version)

· Algorithms : Clustering, Classification, Recommendations

· Labs : Writing ML applications

 

10. GraphX

· GraphX library overview

· GraphX APIs

· Labs : Processing graph data using Spark

 

11. Spark Streaming

· Streaming overview

· Evaluating Streaming platforms

· Streaming operations

· Sliding window operations

· Structured Streaming

· Labs : Writing Spark Streaming applications

 

12. Spark Performance and Tuning

· Broadcast variables

· Accumulators

· Memory management & caching
