Course Overview
This Big Data course is designed to give newcomers a quick start in the Big Data world and to provide insightful knowledge to candidates interested in moving beyond Hadoop. The content suits candidates preparing for certification or looking to build their profile in the Big Data domain. It covers the major Hadoop ecosystem components, including Hive, Pig, Flume, Sqoop, and Zookeeper, and addresses both real-time and batch processing frameworks (Spark and MapReduce, respectively). It also gives an overview of popular Big Data platforms from Hortonworks and Cloudera.

Course Objectives

At ProTechSkills we cover a major section of Hadoop and its ecosystem components. By the end of the course you will have learned:

  1. Virtualization and Cluster creation
  2. Hadoop HDFS, and MapReduce
  3. Next-Generation Resource Manager: YARN
  4. Writing SQL-like queries using Hive
  5. Writing Pig Latin scripts to query data
  6. Using Sqoop to import data from an RDBMS into HDFS
  7. Using Flume to ingest real-time data
  8. Configuring Zookeeper to achieve HDFS High Availability
  9. Working with Hortonworks and Cloudera
  10. Real-time data processing using Spark

Who should go for this course?

Based on your current profile, you can opt for any of the following roles in the Big Data Hadoop domain:

Role                      Best Suited For
Big-Data Administrator    Unix/Linux Administrator, System Administrator
Big-Data Developer        Java / Python / .NET Developer
Data Analyst              Data/Business Analyst, Data Warehousing

Course Duration

32-34 hours (2-Month Program, Weekends Only)

Course Structure

Introduction to Hadoop and its Architecture
Limitations of traditional large scale systems
Compare Hadoop with traditional systems
Understanding Hadoop Architecture
Hadoop Daemons – NameNode, DataNode, JobTracker, TaskTracker
Setting up Hadoop Single-Node and Multi-Node Clusters using Oracle VirtualBox
Linux VM installation on Windows / Mac / Linux hosts for the Hadoop cluster
Preparing nodes for Hadoop and VM settings (Java, passwordless SSH, network settings, etc.)
Basic Linux commands
Hadoop Deployment – Single Node
Hadoop configuration files and running Hadoop services
Important Web URLs and Logs for Hadoop
Run HDFS and Linux commands
Hadoop Deployment – Clustered Mode
Understanding Hadoop Distributed File System
Design Goals
Blocks, FS Image and Edit Logs
Rack-Awareness in Hadoop
Replica Placement and Selection Policies
Hadoop File System Shell Commands
Safe Mode in HDFS
Hadoop DFSAdmin Commands
File Read / Write Anatomy in HDFS
Hadoop NameNode and DataNodes Directory Structure
Name Quota and Space Quota in HDFS
HDFS Trash Concept
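The replica placement policy covered above follows a well-known default in HDFS: with a replication factor of 3, the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The sketch below simulates this in plain Python purely as a teaching aid; it is not Hadoop's actual implementation, and the node and rack names are made up.

```python
# Conceptual sketch of HDFS default replica placement (replication factor 3):
# 1st replica on the writer's node, 2nd on a node in a different rack,
# 3rd on a different node in the same rack as the 2nd.
def place_replicas(writer_node, topology):
    """topology maps rack name -> list of node names."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer_node
    # Second replica: a node on a rack other than the writer's rack.
    remote_rack = next(r for r in topology if r != rack_of[first])
    second = topology[remote_rack][0]
    # Third replica: a different node on the same rack as the second.
    third = next(n for n in topology[remote_rack] if n != second)
    return [first, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))  # ['n1', 'n3', 'n4']
```

This placement balances write cost (only one cross-rack transfer on the write pipeline) against fault tolerance (data survives the loss of an entire rack).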
Understanding Hadoop DFS 2.x Concepts
HDFS High Availability
Configuring HDFS HA with two NameNodes
Automatic and Manual Fail-over techniques in HA
MapReduce Programming Framework – PART 1
MapReduce Architecture
Understand the concept of Mappers, Reducers
Anatomy of MapReduce Program and its phases
MapReduce Components – Mapper Class, Reducer Class
Splits, Blocks and Record Readers
Understand the concept and need of Combiner and Partitioner
Running and Monitoring MapReduce Jobs
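The mapper, shuffle/sort, and reducer phases listed above can be illustrated with a classic word count. This is a pure-Python caricature of the flow, not the Hadoop Java API; all function names here are illustrative.

```python
# Conceptual word count showing the MapReduce phases (pure Python,
# not the Hadoop API; names are illustrative only).
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate the counts for one word.
    return (key, sum(values))

lines = ["big data big results", "data pipelines"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'results': 1, 'pipelines': 1}
```

A combiner, in this picture, is simply a reducer run on each mapper's local output before the shuffle, cutting the data moved across the network.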
MapReduce Programming Framework – PART 2
MapReduce Internals
Understanding Input and Output Formats in Hadoop
MapReduce API
Hadoop Data Types
Writing your own MapReduce job
YARN Concepts – MRv2
Hadoop 1.x Limitations
Design Goals for YARN
YARN Architecture
Components – Resource Manager / Node Manager / Application Master
Classic MapReduce (MRv1) vs. YARN
Application Execution Flow
Life-Cycle Management
Schedulers and Queues
Running and Monitoring YARN applications
Job History Server and Web Application Proxy
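The application execution flow above boils down to Application Masters requesting containers and the Resource Manager allocating them from Node Manager capacity. The toy model below caricatures that negotiation in a few lines of Python; it bears no relation to YARN's real protocol or APIs, and all names are invented.

```python
# Toy model of YARN-style container allocation (conceptual only;
# not YARN's actual protocol).
def allocate(node_capacity, requests):
    """node_capacity: dict of node -> free container slots.
    requests: list of (app_id, containers_needed).
    Returns app_id -> list of nodes assigned (one entry per container)."""
    assignments = {}
    for app_id, needed in requests:
        assigned = []
        for node in node_capacity:
            while node_capacity[node] > 0 and len(assigned) < needed:
                node_capacity[node] -= 1
                assigned.append(node)
        assignments[app_id] = assigned
    return assignments

caps = {"nm1": 2, "nm2": 1}
print(allocate(caps, [("app-1", 2), ("app-2", 2)]))
# {'app-1': ['nm1', 'nm1'], 'app-2': ['nm2']}
```

Note that app-2 receives only one of its two requested containers: like a real cluster under load, requests can be partially satisfied until capacity frees up.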
Apache Hive
What is Hive?
Hive Architecture & Components
Hive Installation
Hive Metastore
Hive Data Model and Data Units
Hive DDL – Create/Show/Drop Database
Hive DDL – Create/Show/Drop Tables
Hive DML – Load Files into Tables
Hive DML – Inserting Data into Tables
Hive SQL – Select, Filter, Join, Group By
Multi-Table Inserts and Joins
Introduction to SerDe, UDF and UDAF
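HiveQL's Select, Filter, Join, and Group By constructs are close to standard SQL. Purely to illustrate the query shapes covered above, the sketch below runs an equivalent query through Python's built-in sqlite3 module; this is plain SQL, not HiveQL, and the table and data are made up.

```python
import sqlite3

# Illustrative only: standard SQL in sqlite3, mirroring the HiveQL
# Select / Filter / Group By pattern; table and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Delhi", 100), ("Mumbai", 250), ("Delhi", 50)])

# Select + Filter + Group By, as covered in the Hive SQL topics:
rows = conn.execute(
    "SELECT city, SUM(amount) FROM orders "
    "WHERE amount >= 50 GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Delhi', 150), ('Mumbai', 250)]
```

The key difference in Hive is execution: the same query compiles to MapReduce (or another engine's) jobs over HDFS data rather than running against a local database file.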
Apache Pig
Pig Installation
Pig Data Types
Pig Architecture
Pig Latin
Pig Relational Operators
Pig Functions
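Pig's relational operators (FILTER, GROUP, FOREACH ... GENERATE) have rough analogues in ordinary Python collection operations. The sketch below is a conceptual mapping only, not Pig Latin or a Pig runtime, and the data is invented.

```python
# Rough Python analogues of Pig relational operators (conceptual only).
from collections import defaultdict

records = [("alice", 30), ("bob", 17), ("carol", 45), ("dave", 45)]

# Pig: adults = FILTER records BY age >= 18;
adults = [r for r in records if r[1] >= 18]

# Pig: by_age = GROUP adults BY age;
by_age = defaultdict(list)
for name, age in adults:
    by_age[age].append(name)

# Pig: counts = FOREACH by_age GENERATE group, COUNT(adults);
counts = {age: len(names) for age, names in by_age.items()}
print(counts)  # {30: 1, 45: 2}
```

In Pig these operators build a dataflow that, like Hive queries, compiles down to jobs over HDFS rather than executing in local memory.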
Apache Zookeeper
What is Zookeeper?
Installation – Standalone/Clustered mode
Zookeeper Command Line, ZNode and Watches
HDFS HA automatic failover using Zookeeper
Apache Sqoop
Sqoop Architecture and Installation
Import/Export Data using Sqoop
Apache Flume
Flume Architecture and Installation
Flume Use Cases