
PySparkInspector – Spark Job & Stage Execution Analyzer

A Scala SparkListener that plugs into PySpark and AWS Glue jobs to generate clean, human-readable summaries of jobs, stages, and I/O, without modifying your ETL code. It turns the Spark scheduler events you rarely see into a concise execution report in your logs.

Developer tooling · 2025
PySpark · Scala · SparkListener · AWS Glue
View project on GitHub
Situation
Debugging Spark jobs usually means clicking through the Spark UI or guessing from scattered logs. On AWS Glue or in CI, you often don't even have easy UI access — just a long CloudWatch log stream.
Task
Build a lightweight observer that attaches to any PySpark or Glue job, listens to scheduler events, and prints a compact "execution report" at the end: which jobs ran, how long they took, and what they actually read/wrote.
Action
Implemented a Scala SparkListener that aggregates job, stage, and task metrics (records, bytes, durations), classifies operations (read/write/action), and prints structured summaries to stdout. A small Python helper lets you pull raw task metrics into a DataFrame for deeper analysis.
Result
You get an instant textual picture of your Spark pipeline at the end of every run — even in Glue or headless environments. It's much easier to see "what actually happened" without instrumenting every step or relying on the Spark UI being available.

High-level architecture

PySparkInspector is a Scala JAR that hooks into Spark's scheduler layer. You enable it via spark.extraListeners, and it quietly listens to Spark events while your job runs. At the end, it prints a structured summary of jobs and stages to your logs, so you can reason about performance and data movement.
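For a plain PySpark session, enabling it is a single configuration change. A minimal sketch, assuming the JAR sits on local disk and the listener class is named com.pysparkinspector.InspectorListener (both are placeholders; use the path and class name the project documents):

    from pyspark.sql import SparkSession

    # Register the listener before the session starts: Spark instantiates
    # spark.extraListeners classes at startup, so it cannot be added later.
    # The JAR path and class name are illustrative placeholders.
    spark = (
        SparkSession.builder
        .appName("customer-etl")
        .config("spark.jars", "/path/to/pyspark-inspector.jar")
        .config("spark.extraListeners", "com.pysparkinspector.InspectorListener")
        .getOrCreate()
    )

    # Run the job as usual; the listener observes scheduler events in the background.
    spark.read.parquet("/data/customers").count()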

Step 1 – PySpark job runs as usual

Your existing PySpark or AWS Glue job executes with no code changes — you just attach the PySparkInspector listener JAR.

Existing PySpark / Glue job
  • Your code stays the same; you only add the listener JAR via spark.jars or Glue's --extra-jars (a Glue wiring sketch follows this list).
  • Works for local development, EMR, and AWS Glue (tested on Glue).
  • Supports both small experiments and production ETL pipelines.
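On AWS Glue, the same attachment happens through job parameters rather than code. A hedged sketch using boto3: the job name, role, bucket paths, and listener class below are placeholders, while --extra-jars and --conf are standard Glue special parameters:

    import boto3

    glue = boto3.client("glue")

    # Point an existing Glue job at the listener JAR. Every name and ARN
    # below is an illustrative placeholder.
    glue.update_job(
        JobName="customer-etl",
        JobUpdate={
            "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
            "Command": {
                "Name": "glueetl",
                "ScriptLocation": "s3://my-bucket/scripts/customer_etl.py",
            },
            "DefaultArguments": {
                "--extra-jars": "s3://my-bucket/jars/pyspark-inspector.jar",
                "--conf": "spark.extraListeners=com.pysparkinspector.InspectorListener",
            },
        },
    )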
SparkListener & metrics store
  • Listens to onJobStart, onStageCompleted, and onTaskEnd and aggregates metrics in memory (a rough model follows this list).
  • Tracks records read/written, bytes, durations, and task counts.
  • Stores optional task-level details in a JVM-side MetricsStore accessible from PySpark.
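The real aggregation lives in the Scala JAR, but the bookkeeping is easy to model. A rough, illustrative Python analogue (not the project's actual code): stages are mapped to their job when a job starts, and each finished task's counters are rolled up into its job's totals.

    from collections import defaultdict

    class InspectorModel:
        """Illustrative Python model of the listener's bookkeeping."""

        def __init__(self):
            self.stage_to_job = {}  # stage id -> job id, filled on job start
            self.jobs = defaultdict(lambda: {
                "tasks": 0, "records_read": 0, "records_written": 0,
                "bytes_read": 0, "bytes_written": 0,
            })

        def on_job_start(self, job_id, stage_ids):
            # Remember which stages belong to which job so task metrics
            # can later be rolled up per job.
            for stage_id in stage_ids:
                self.stage_to_job[stage_id] = job_id

        def on_task_end(self, stage_id, task_metrics):
            # Each finished task adds its I/O counters to its job's totals.
            totals = self.jobs[self.stage_to_job[stage_id]]
            totals["tasks"] += 1
            for key in ("records_read", "records_written",
                        "bytes_read", "bytes_written"):
                totals[key] += task_metrics.get(key, 0)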
Summaries in logs (and optional DataFrames)
  • On onApplicationEnd, prints job and stage summaries to stdout / logs.
  • A small Python helper can convert the summaries into DataFrames (sketched below) if you want to persist or visualize them.
  • Designed to be copy-pastable into incident tickets and performance reviews.
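How the PySpark-side helper might look: a sketch that reaches the JVM-side MetricsStore through PySpark's Py4J gateway and loads task metrics into a DataFrame. The package path, the taskMetricsAsJson() accessor, and the column names are assumptions for illustration, not the project's documented API:

    import json

    def task_metrics_df(spark):
        # spark._jvm is PySpark's Py4J gateway into the driver JVM; the class
        # path and the taskMetricsAsJson() accessor below are hypothetical.
        store = spark._jvm.com.pysparkinspector.MetricsStore
        rows = [json.loads(s) for s in store.taskMetricsAsJson()]
        return spark.createDataFrame(rows)

    # Example: roll task-level I/O up to stages (column names assumed).
    metrics = task_metrics_df(spark)
    metrics.groupBy("stageId").sum("bytesRead", "recordsRead").show()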

Demo: what the summaries look like in your logs

In a real run, PySparkInspector writes plain-text summaries to stdout/CloudWatch. Below is a mock of that output: the job summary you'd typically see at the end of a PySpark or Glue ETL job. A real run also prints a stage summary and a list of the slowest stages.

Spark execution summary
High-level view of each Spark job: operation, duration, records, and bytes.
====== Spark Job Summary ======
Job  JobName       Operation        Duration(s)  Records Read  Records Written  Bytes Read  Bytes Written  Comment
------------------------------------------------------------------------------------------------------------------
0    customer-etl  Write Parquet    2.134        -             4                -           18.2 KB        -
1    customer-etl  Read Parquet     0.842        -             -                18.2 KB     -              -
2    customer-etl  DataFrame Show   0.512        4             -                -           -              -
3    customer-etl  DataFrame Count  0.337        4             -                -           -              -
This is the kind of snippet you'd copy into a Slack thread when someone asks, "what did this Spark job actually do?" (Derived from SparkListener events.)

Tech stack

PySpark · Scala · SparkListener API · AWS Glue (tested) · Data Engineering

PySparkInspector is intentionally small: a single listener and a few helpers. But it dramatically improves how you debug and explain Spark jobs in environments without the UI. It's the kind of tool you can drop into any PySpark or Glue project to make execution behavior visible in plain text.

Explore PySparkInspector on GitHub