Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop

Price: 3,195.00

Course ID: D8018

Duration: 4 Days

Location: Flex - San Francisco or Live Online

Overview 

This course teaches the essentials of Apache Pig, Apache Hive, and Apache Impala. Its goal is to show students how to apply traditional data analytics and business intelligence skills to Big Data, including the use of specialist tools.

Topics are covered through a combination of instructor-led training and practical exercises. Specific learning outcomes include:

  • Using data acquisition, storage, and analysis features of Pig/Hive/Impala

  • The fundamentals of Apache Hadoop, data ETL (extract, transform, load), and data ingestion

  • How Pig, Hive, and Impala improve productivity for typical analysis tasks

  • How to perform real-time, complex queries on data sets

Who Should Attend

While knowledge of Apache Hadoop is not required, students should have basic familiarity with Linux, as well as knowledge of SQL. It would also be beneficial to have knowledge of at least one scripting language, such as Python, Ruby, or Perl. Typical students include:

  • Data analysts
  • Business intelligence specialists
  • Developers
  • System architects
  • Database administrators

Course Outline

Introduction

Apache Hadoop Fundamentals

  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Database Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenarios

Introduction to Apache Pig

  • What is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig

Basic Data Analysis with Apache Pig

  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly Used Functions

Processing Complex Data with Apache Pig

  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data

Multi-Dataset Operations with Apache Pig

  • Techniques for Combining Datasets
  • Joining Datasets in Pig
  • Set Operations
  • Splitting Datasets

Apache Pig Troubleshooting and Optimization

  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Pig Jobs

Introduction to Apache Hive and Impala

  • What is Hive?
  • What is Impala?
  • Why Use Hive and Impala?
  • Schema and Data Storage
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases

Querying with Apache Hive and Impala

  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Using Hue to Execute Queries
  • Using Beeline (Hive’s Shell)
  • Using the Impala Shell

Apache Hive and Impala Data Management

  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables
  • Loading Data into Partitioned Tables
  • When to Use Partitioning
  • Choosing a File Format
  • Using Avro and Parquet File Formats

Relational Data Analysis with Apache Hive and Impala

  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing

Complex Data with Apache Hive and Impala

  • Complex Data with Hive
  • Complex Data with Impala

Analyzing Text with Apache Hive and Impala

  • Using Regular Expressions with Hive and Impala
  • Processing Text Data with SerDes in Hive
  • Sentiment Analysis and n-grams in Hive

Apache Hive Optimization

  • Understanding Query Performance
  • Bucketing
  • Indexing Data
  • Hive on Spark

Apache Impala Optimization

  • How Impala Executes Queries
  • Improving Impala Performance

Extending Apache Hive and Impala

  • Custom SerDes and File Formats in Hive
  • Data Transformation with Custom Scripts in Hive
  • User-Defined Functions
  • Parameterized Queries

Choosing the Best Tool for the Job

  • Comparing Pig, Hive, Impala, and Relational Databases
  • Which to Choose?

Conclusion

 

Prerequisites

Knowledge of SQL is assumed, as is basic Linux command-line familiarity. Knowledge of at least one scripting language (e.g., Bash scripting, Perl, Python, Ruby) would be helpful but is not essential. Prior knowledge of Apache Hadoop is not required.

Certifications

N/A