“Big data” analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, EBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You’ll learn those same techniques, using your own Windows system right at home. It’s easier than you might think.
This course uses the familiar Python programming language.
Upon completing this course you will know:
- Learn the concepts of Spark’s Resilient Distributed Datastores
- Develop and run Spark jobs quickly using Python
- Translate complex analysis problems into iterative or multi-stage Spark scripts
- Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX
Students having prior knowledge of basic python and interested to choose their career as Big Data Scientist and Apache Spark developers.
Note: The following unit and exercise durations are estimates, and might not reflect every class experience. If the course is customized or abbreviated, the duration of unchanged units will probably increase.
Unit 1. Getting started with Python
- Why Python?
- What is Python?
- Who are using Python?
- Where we are using Python?
- Setting up Environment.
Unit 2. Core Python
- Basics of Python
- Basic Data Types and Objects
- Conditioning in python
- Looping and breaks
- Class definition on python
Unit 3. Introducing Python Modules
- Working with Numpy.
- Fast analysis and data handling with Pandas.
- Exercise 1 : Average gold,silver and bronze medal problem
- Working with pandas data structures.
- Working with pandas visualization.
- Reading and Writing files with pandas
Matplotlib and Seaborn visualization
- Working with matplotlib : creating figures and adding multiple axes.
- Working with seaborn :add-on regression, distribution and matrix plots.
- Activity : 3D – Plotting
- Titanic Survival data preprocessing
Unit 4: Introduction to Apache Spark
- Why Apache Spark?
- Spark Features.
- Spark Ecosystem.
- Environment setup.
Unit 5: Spark Basics and Simple Examples
- The Resilient Distributed Dataset (RDD).
- Pros and cons. Of RDDs.
- Working with spark Data Frames.
Unit 6: SparkSQL
- Introduction to SparkSQL.
- Executing SQL commands and SQL-style functions on a DataFrame.
- Using Spark DataFrames instead of RDDs.
Unit 7: Spark MLlib
- Introducing MLlib.
- Using machine learning techniques in spark.
- Making movie reommendations with movie lens Dataset.
Unit 8: Spark streaming
- Introduction to Spark streaming.
- Streaming Twitter data with Spark streaming.
- Twitter top hashtags using Spark in real-time.
- Ending notes : GraphX
- Movie Recommendation using Movielens Dataset
- Twitter Top hashtags using spark streaming in realtime
All the assignments and discussion links will be provided after the lecture of current topic.Apache Spark with Python