Overview

“Big data” analysis is a hot and highly valuable skill, and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You’ll learn those same techniques, using your own Windows system right at home. It’s easier than you might think.

Extremely Hands-On…

Incredibly Practical…

Unbelievably Real!

This course uses the familiar Python programming language.

Upon completing this course you will be able to:

Explain the concepts behind Spark’s Resilient Distributed Datasets (RDDs)
Develop and run Spark jobs quickly using Python
Translate complex analysis problems into iterative or multi-stage Spark scripts
Describe other Spark technologies, like Spark SQL, Spark Streaming, and GraphX

Target Audience

Students who have prior knowledge of basic Python and want to pursue a career as a big data scientist or Apache Spark developer.

Note: The following unit and exercise durations are estimates, and might not reflect every class experience. If the course is customized or abbreviated, the duration of unchanged units will probably increase.

Course Agenda

Unit 1. Getting started with Python

Why Python?
What is Python?
Who is using Python?
Where is Python used?
Setting up the environment.

Unit 2. Core Python

Basics of Python
Basic Data Types and Objects
Conditionals in Python
Looping and breaks
Class definitions in Python
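
As a rough illustration of the kind of code this unit covers, the sketch below combines a basic data type, a conditional, a loop with a break, and a class definition. The class name, function name, and sample data are invented for this example and are not taken from the course material.

```python
# Illustrative sketch only: MedalTally, first_gold_index and the sample
# data are hypothetical names made up for this example.


class MedalTally:
    """Tracks medal counts for one country."""

    def __init__(self, country):
        self.country = country
        # A dict is one of Python's basic built-in data types.
        self.counts = {"gold": 0, "silver": 0, "bronze": 0}

    def add(self, medal):
        # Conditional: reject anything that is not a known medal type.
        if medal not in self.counts:
            raise ValueError(f"unknown medal: {medal}")
        self.counts[medal] += 1

    def total(self):
        return sum(self.counts.values())


def first_gold_index(medals):
    """Return the position of the first gold medal, or -1 if there is none."""
    position = -1
    for i, medal in enumerate(medals):  # loop over a list
        if medal == "gold":
            position = i
            break  # stop at the first match
    return position


tally = MedalTally("Exampleland")
results = ["bronze", "silver", "gold", "gold"]
for medal in results:
    tally.add(medal)
```

Calling `tally.total()` afterwards returns the number of medals recorded, and `first_gold_index(results)` returns the list position of the first gold medal.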

Unit 3. Introducing Python Modules

NumPy

Working with NumPy.
Fast analysis and data handling with pandas.
Exercise 1: Average gold, silver, and bronze medal problem

Pandas

Working with pandas data structures.
Working with pandas visualization.
Reading and Writing files with pandas

Matplotlib and Seaborn visualization

Working with Matplotlib: creating figures and adding multiple axes.
Working with Seaborn: add-on regression, distribution, and matrix plots.
Activity: 3D plotting

Milestone Project

Titanic Survival data preprocessing

Unit 4. Introduction to Apache Spark

Why Apache Spark?
Spark Features.
Spark Ecosystem.
Environment setup.

Unit 5. Spark Basics and Simple Examples

The Resilient Distributed Dataset (RDD).
Pros and cons of RDDs.
Working with Spark DataFrames.

Unit 6. Spark SQL

Introduction to Spark SQL.
Executing SQL commands and SQL-style functions on a DataFrame.
Using Spark DataFrames instead of RDDs.

Unit 7. Spark MLlib

Introducing MLlib.
Using machine learning techniques in Spark.
Making movie recommendations with the MovieLens dataset.

Unit 8. Spark Streaming

Introduction to Spark Streaming.
Streaming Twitter data with Spark Streaming.
Finding top Twitter hashtags in real time using Spark.
Ending notes: GraphX

Projects included

Movie recommendation using the MovieLens dataset
Twitter top hashtags using Spark Streaming in real time

Disclaimer

All assignments and discussion links will be provided after the lecture on the current topic.

Apache Spark with Python

PROTECHSKILLS