Apache Spark Performance Tuning with Scala

Learn how to optimize Apache Spark with Scala for peak performance with our comprehensive course. Master Spark internals and configurations to enhance speed and memory efficiency for your cluster.

Goal

They say Spark is fast. How do I make the best out of it?

I’ve written a lot of Spark jobs over the past few years. Some of my old data pipelines are probably still running as you’re reading this. However, my journey with Spark came with a lot of pain. You’ve probably seen this too.

  • You run 3 big jobs with the same DataFrame, so you try to cache it - but then you look in the UI and it’s nowhere to be found.
  • You’re finally given the cluster you’ve been asking for… and then you’re like “OK, now how many executors do I pick?”.
  • You have a simple job with 1GB of data that takes 5 minutes for 1149 tasks… and 3 hours on the last task.
  • You have a big dataset and you know you’re supposed to partition it right, but you can’t pick a number between 2 and 50000 because you can find good reasons for both!
  • You search for “caching”, “serialization”, “partitioning”, “tuning” and you only find obscure blog posts and narrow StackOverflow questions. Unless you have some massive experience or you’re a Spark committer, you’re probably using 10% of Spark capabilities.

In the Spark Optimization course you learned how to write performant code. It’s time to kick into high gear and tune Spark to be the best it can be. You are looking at the only course on the web which leverages Spark features and capabilities for the best performance. With the techniques you learn here you will save time, money, energy and massive headaches.

Let’s rock.

In this course, we cut the weeds at the root. We dive deep into Spark and understand what tools you have at your disposal - and you might just be surprised at how much leverage you have. You will learn 20+ techniques for boosting Spark performance. Each of them individually can give at least a 2x perf boost for your jobs (some of them even 10x), and I show it on camera.

Skills You'll Learn

What’s in it for you:

  • You’ll understand Spark internals to explain how Spark is already pretty darn fast
  • You’ll be able to predict in advance if a job will take a long time
  • You’ll diagnose hanging jobs, stages and tasks
  • You’ll spot and fix data skews
  • You’ll make the right performance tradeoffs between speed, memory usage and fault-tolerance
  • You’ll be able to configure your cluster with the optimal resources
  • You’ll save hours of computation time in this course alone (let alone in prod!)
  • You’ll control the parallelism of your jobs with the right partitioning

And some extra perks:

  • You’ll have access to the entire code I write on camera (~1400 LOC)
  • You’ll be invited to our private Slack room where I’ll share the latest updates, discounts and talks
  • (Soon) You’ll have access to the takeaway slides
  • (Soon) You’ll be able to download the videos for offline viewing

Skills you’ll get:

  • Deep understanding of Spark internals so you can predict job performance
    • stage & task decomposition
    • reading query plans before jobs run
    • reading DAGs while jobs are running
    • performance differences between the different Spark APIs
    • packaging and deploying a Spark app
    • configuring Spark in 3 different ways
    • understanding the state of the art in Spark internals
    • leveraging Catalyst and Tungsten for massive perf
  • Understanding Spark Memory, Caching and Checkpointing
    • Tuning Spark executor memory zones
    • caching for speedy data reuse
    • making the right tradeoffs between speed, memory usage and fault tolerance
    • using checkpoints when jobs are failing or you can’t afford a recomputation
  • Partitioning
    • leveraging repartitions
    • using coalesce to avoid shuffles
    • picking the right number of partitions at a shuffle to match cluster capability
    • using custom partitioners for custom jobs
  • Cluster tuning, fixing problems
    • allocating the right resources in a cluster
    • fixing data skews and straggling tasks with salting
    • fixing serialization problems
    • using the right serializers for free perf improvements
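
To make the salting idea from the list above concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not code from the course. The names `SaltingSketch`, `partitionOf`, `partitionSizes` and `salt` are hypothetical, the partitioner is only a stand-in for hash partitioning (non-negative hash modulo the partition count), and the salt is assigned round-robin so the result is reproducible, whereas real jobs typically use a random salt.

```scala
object SaltingSketch {
  // Stand-in for a hash partitioner: non-negative hash modulo partition count.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Number of records that land in each partition for the given keys.
  def partitionSizes[K](keys: Seq[K], numPartitions: Int): Map[Int, Int] =
    keys
      .groupBy(k => partitionOf(k, numPartitions))
      .map { case (partition, ks) => partition -> ks.size }

  // Append a salt in [0, saltBuckets) to every key, so one hot key becomes
  // several distinct keys. Round-robin by index here (an assumption made for
  // reproducibility, not a requirement of the technique).
  def salt(keys: Seq[String], saltBuckets: Int): Seq[String] =
    keys.zipWithIndex.map { case (k, i) => s"$k#${i % saltBuckets}" }

  def main(args: Array[String]): Unit = {
    // One heavily skewed key plus a handful of ordinary keys.
    val keys = Seq.fill(10000)("hot") ++ (1 to 100).map(i => s"key$i")
    val numPartitions = 8

    val before = partitionSizes(keys, numPartitions)
    val after  = partitionSizes(salt(keys, 8), numPartitions)

    println(s"largest partition before salting: ${before.values.max}")
    println(s"largest partition after salting:  ${after.values.max}")
  }
}
```

Without the salt, every "hot" record hashes to the same partition, which is the one straggling task. With the salt, the hot key is spread over several partitions. In a real aggregation the partial results per salted key then have to be combined in a second step with the salt stripped off.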

This course is for Scala and Spark programmers who need to improve the run time and memory footprint of their jobs. If you’ve never done Scala or Spark, this course is not for you. I generally recommend that you take the Spark Optimization course first, but it’s not a requirement.

Meet Rock the JVM

Daniel Ciocîrlan

I'm a software engineer and the founder of Rock the JVM. I started the Rock the JVM project out of love for Scala and the technologies it powers - they are all amazing tools and I want to share as much of my experience with them as I can.

As of February 2024, I've taught Java, Scala, Kotlin and related tech (e.g. Cats, ZIO, Spark) to 100000+ students at various levels and I've held live training sessions for some of the best companies in the industry, including Adobe and Apple. I've also taught university students who now work at Google and Facebook (among others), I've held Hour of Code for 7-year-olds and I've taught more than 35000 kids to code.

I have a Master's Degree in Computer Science and I wrote my Bachelor's and Master's theses on Quantum Computation. Before starting to learn programming, I won medals at international Physics competitions.

What's Included

Take this course now!

Apache Spark Performance Tuning with Scala - Lifetime License

Just the course with a one-time payment

  • 8 hours of 4K content
  • 1400 lines of code written
  • All PDF slides
  • Access to the private Rock the JVM community
  • Free updates
  • Lifetime access

All-Access Membership

Billed monthly

All of the Rock the JVM courses

  • 320 hours of 4K content
  • 60660 lines of code written
  • All Scala courses
  • All Kotlin courses
  • All ZIO courses
  • All Typelevel courses
  • All Apache Flink courses
  • All Apache Spark courses
  • All Akka/Pekko courses

The Apache Spark Bundle with Scala

Become an Apache Spark and big data expert from scratch with our all-inclusive course bundle: master everything you need using Scala in one complete package, at a discount.

100% Money Back Guarantee

If you're not happy with this course, I want you to have your money back. If that happens, contact me with a copy of your welcome email and I will refund the course.

Fewer than 0.05% of students have asked for a refund on the entire site, and every payment was returned in less than 72 hours.

FAQ