Developer Training for Spark & Hadoop

Course ID: D8017

Duration: 4 Days

Location: Flex - San Francisco or Live Online


This course teaches experienced developers how to build high-performance parallel applications with Apache Spark. On completing the course, students should be able to apply these skills to real-world problems and build Spark applications at work.

Specifically, students will learn to:

  • Distribute, store, and process data in a Hadoop cluster

  • Write, configure, and deploy Spark applications on a cluster

  • Use the Spark shell for interactive data analysis

  • Process and query structured data using Spark SQL

  • Use Spark Streaming to process a live data stream

+ Who Should Attend

Students are not expected to have any existing knowledge of Hadoop or Spark, but programming proficiency is required. Students should be comfortable programming in Scala or Python and have basic familiarity with Linux and SQL.

This course is designed for developers and engineers who have programming experience, but prior knowledge of Hadoop and/or Spark is not required.

+ Course Outline

Introduction to Apache Hadoop and the Hadoop Ecosystem 

  • Introduction to Apache Hadoop and the Hadoop Ecosystem
  • Apache Hadoop Overview
  • Data Ingestion and Storage
  • Data Processing 
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Hands-On Exercises

Apache Hadoop File Storage

  • Apache Hadoop Cluster Components
  • HDFS Architecture
  • Using HDFS 

Distributed Processing on an Apache Hadoop Cluster

  • YARN Architecture
  • Working With YARN

Apache Spark Basics

  • What is Apache Spark?
  • Starting the Spark Shell
  • Using the Spark Shell 
  • Getting Started with Datasets and DataFrames
  • DataFrame Operations

Working with DataFrames and Schemas

  • Creating DataFrames from Data Sources
  • Saving DataFrames to Data Sources
  • DataFrame Schemas 
  • Eager and Lazy Execution

Analyzing Data with DataFrame Queries

  • Querying DataFrames Using Column Expressions
  • Grouping and Aggregation Queries
  • Joining DataFrames 

RDD Overview

  • RDD Overview 
  • RDD Data Sources
  • Creating and Saving RDDs 
  • RDD Operations

Transforming Data with RDDs

  • Writing and Passing Transformation Functions 
  • Transformation Execution
  • Converting Between RDDs and DataFrames 

Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs 
  • Map-Reduce
  • Other Pair RDD Operations 

Querying Tables and Views with Apache Spark SQL

  • Querying Tables in Spark Using SQL 
  • Querying Files and Views
  • The Catalog API 
  • Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark

Working with Datasets in Scala

  • Datasets and DataFrames 
  • Creating Datasets
  • Loading and Saving Datasets 
  • Dataset Operations

Writing, Configuring, and Running Apache Spark Applications

  • Writing a Spark Application 
  • Building and Running an Application
  • Application Deployment Mode 
  • The Spark Application Web UI
  • Configuring Application Properties 

Distributed Processing

  • Review: Apache Spark on a Cluster 
  • RDD Partitions
  • Example: Partitioning in Queries 
  • Stages and Tasks
  • Job Execution Planning 
  • Example: Catalyst Execution Plan
  • Example: RDD Execution Plan 

Distributed Data Persistence

  • DataFrame and Dataset Persistence 
  • Persistence Storage Levels
  • Viewing Persisted RDDs 

Common Patterns in Apache Spark Data Processing

  • Common Apache Spark Use Cases 
  • Iterative Algorithms in Apache Spark
  • Machine Learning 
  • Example: k-means

Apache Spark Streaming: Introduction to DStreams

  • Apache Spark Streaming Overview 
  • Example: Streaming Request Count
  • DStreams 
  • Developing Streaming Applications

Apache Spark Streaming: Processing Multiple Batches

  • Multi-Batch Operations 
  • Time Slicing
  • State Operations 
  • Sliding Window Operations
  • Preview: Structured Streaming 

Apache Spark Streaming: Data Sources

  • Streaming Data Source Overview 
  • Apache Flume and Apache Kafka Data Sources
  • Example: Using a Kafka Direct Data Source 


+ Prerequisites

Students need programming experience in Scala or Python, plus basic familiarity with Linux and SQL. Prior knowledge of Hadoop or Spark is not required.

+ Certifications