Big Data Analysis with Scala and Spark

Rating: 4.7 (1,466 ratings, 326 reviews)

Course 4 of 5 in the Functional Programming in Scala Specialization

Manipulating big data distributed over a cluster using functional concepts is common throughout industry, and is arguably one of the first widespread industrial uses of functional ideas. This is evidenced by the popularity of MapReduce and Hadoop, and more recently Apache Spark, a fast, in-memory distributed collections framework written in Scala. In this course, we'll see how the data parallel paradigm can be extended to the distributed case, using Spark throughout. We'll cover Spark's programming model in detail, being careful to understand how and when it differs from familiar programming models, like shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we'll learn when important issues related to distribution, like latency and network communication, should be considered and how they can be addressed effectively for improved performance.

Learning Outcomes. By the end of this course you will be able to:

  • read data from persistent storage and load it into Apache Spark,
  • manipulate data with Spark and Scala,
  • express algorithms for data analysis in a functional style,
  • recognize how to avoid shuffles and recomputation in Spark.

Recommended background: You should have at least one year of programming experience. Proficiency with Java or C# is ideal, but experience with other languages such as C/C++, Python, JavaScript or Ruby is also sufficient. You should have some familiarity using the command line. This course is intended to be taken after Parallel Programming: https://www.coursera.org/learn/parprog1.
100% online course

Start right away and learn at your own pace.

Approx. 14 hours to complete

Suggested: 5 hours/week

English

Subtitles: English

Skills you will gain

Big Data, SQL, Functional Programming, Mathematical Optimization

Syllabus - What you will learn from this course

Section 1
12 hours to complete

Getting Started + Spark Basics

Get up and running with Scala on your computer. Complete an example assignment to familiarize yourself with our unique way of submitting assignments. This week, we'll bridge the gap between data parallelism in the shared memory scenario (covered in the prerequisite Parallel Programming course) and the distributed scenario. We'll look at important concerns that arise in distributed systems, like latency and failure. We'll go on to cover the basics of Spark, a functionally oriented framework for big data processing in Scala. We'll end the first week by putting what we learned about Spark into practice right away, analyzing a real-world data set....
7 videos (Total 105 min), 5 readings, 3 quizzes
7 videos:
Data-Parallel to Distributed Data-Parallel (10m)
Latency (24m)
RDDs, Spark's Distributed Collection (9m)
RDDs: Transformation and Actions (16m)
Evaluation in Spark: Unlike Scala Collections! (20m)
Cluster Topology Matters! (8m)
5 readings:
Tools setup (10m)
Eclipse tutorial (10m)
IntelliJ IDEA Tutorial (10m)
Sbt tutorial (10m)
Submitting solutions (10m)

Section 2
7 hours to complete

Reduction Operations & Distributed Key-Value Pairs

This week, we'll look at a special kind of RDD called pair RDDs. With this specialized kind of RDD in hand, we'll cover essential operations on large data sets, such as reductions and joins....
4 videos (Total 59 min), 2 quizzes
4 videos:
Pair RDDs (6m)
Transformations and Actions on Pair RDDs (20m)
Joins (17m)

Section 3
1 hour to complete

Partitioning and Shuffling

This week we'll look at some of the performance implications of using operations like joins. Is it possible to get the same result without having to pay for the overhead of moving data over the network? We'll answer this question by delving into how we can partition our data to achieve better data locality, in turn optimizing some of our Spark jobs....
4 videos (Total 57 min)
4 videos:
Partitioning (14m)
Optimizing with Partitioners (11m)
Wide vs Narrow Dependencies (16m)

Section 4
8 hours to complete

Structured data: SQL, DataFrames, and Datasets

With our newfound understanding of the cost of data movement in a Spark job, and some experience optimizing jobs for data locality last week, this week we'll focus on how we can more easily achieve similar optimizations. Can structured data help us? We'll look at Spark SQL and its powerful optimizer which uses structure to apply impressive optimizations. We'll move on to cover DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL....
5 videos (Total 133 min), 2 quizzes
5 videos:
Spark SQL (17m)
DataFrames (1) (26m)
DataFrames (2) (30m)
Datasets (43m)
10% started a new career after completing these courses

83% got a tangible career benefit from this course

12% got a pay increase or promotion

Top Reviews

By CC, Jun 8th 2017

The sessions were clearly explained and focused. Some of the exercises contained slightly confusing hints and information, but I'm sure those mistakes will be ironed out in future iterations. Thanks!

By CR, Apr 10th 2017

Great introduction to Spark. Fun assignments. Since it was the first ever session, there were quite a few kinks with the assignments. But the discussion forums rescued me any time I was stuck.

Instructor

Dr. Heather Miller

Research Scientist

About École Polytechnique Fédérale de Lausanne

Frequently Asked Questions

  • Once you enroll for a Certificate, you’ll have access to all videos, quizzes, and programming assignments (if applicable). Peer review assignments can only be submitted and reviewed once your session has begun. If you choose to explore the course without purchasing, you may not be able to access certain assignments.

  • If you pay for this course, you will have access to all of the features and content you need to earn a Course Certificate. If you complete the course successfully, your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. Note that the Course Certificate does not represent official academic credit from the partner institution offering the course.

  • Yes! Coursera provides financial aid to learners who would like to complete a course but cannot afford the course fee. To apply for aid, select "Learn more and apply" in the Financial Aid section below the "Enroll" button. You'll be prompted to complete a simple application; no other paperwork is required.

More questions? Visit the Learner Help Center