PySparkInspector – Spark Job & Stage Execution Analyzer
A Scala SparkListener that plugs into PySpark and AWS Glue jobs to generate clean, human-readable summaries of jobs, stages, and I/O, without modifying your ETL code. It turns the Spark scheduler events you rarely see into a concise execution report in your logs.
High-level architecture
PySparkInspector is a Scala JAR that hooks into Spark's scheduler layer. You enable it via spark.extraListeners, and it quietly listens to Spark events while your job runs. At the end, it prints a structured summary of jobs and stages to your logs, so you can reason about performance and data movement.
Your existing PySpark or AWS Glue job executes with no code changes — you just attach the PySparkInspector listener JAR.
- Your code stays the same: you only add the listener JAR via spark.jars or Glue's --extra-jars.
- Works for local development, EMR, and AWS Glue (tested on Glue).
- Supports both small experiments and production ETL pipelines.
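As a rough illustration of the wiring described above, the sketch below builds the two Spark properties involved (spark.jars and spark.extraListeners) and applies them either to a spark-submit command line or to a SparkSession. The JAR path and the listener class name are placeholders invented for this example, not the project's real values, and the helper function names are hypothetical.

```python
"""Sketch: attaching a listener JAR to a PySpark job without touching ETL code."""

# Placeholder values (assumptions): substitute the real JAR location and class name.
LISTENER_JAR = "/path/to/pysparkinspector.jar"
LISTENER_CLASS = "com.example.PySparkInspectorListener"


def listener_conf(jar_path: str, listener_class: str) -> dict:
    """Return the two Spark properties that attach a listener JAR."""
    return {
        "spark.jars": jar_path,                  # ships the JAR with the application
        "spark.extraListeners": listener_class,  # registers the listener at startup
    }


def spark_submit_args(script: str, conf: dict) -> list:
    """Build a spark-submit command line that applies the given properties."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [script]


def build_session(conf: dict, app_name: str = "etl-job"):
    """Create a SparkSession with the listener attached (requires pyspark)."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    conf = listener_conf(LISTENER_JAR, LISTENER_CLASS)
    print(" ".join(spark_submit_args("etl_job.py", conf)))
```

On AWS Glue the JAR would instead be supplied through --extra-jars, as noted in the list above; how the listener property is passed there depends on your Glue setup and is not shown here.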
- Listens to onJobStart, onStageCompleted, onTaskEnd and aggregates metrics in memory.
- Tracks records read/written, bytes, durations, and task counts.
- Stores optional task-level details in a JVM-side MetricsStore accessible from PySpark.
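To make the in-memory aggregation idea concrete, here is a small Python model of what a listener of this kind might do on each task-end event. It is only an illustration: the real listener is written in Scala, and the TaskEnd, StageTotals, and StageAggregator names and their fields are assumptions made for this sketch, not the project's actual classes.

```python
"""Illustrative model of per-stage metric aggregation (not the Scala implementation)."""
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskEnd:
    """Hypothetical, simplified stand-in for the data carried by a task-end event."""
    stage_id: int
    duration_ms: int
    records_read: int = 0
    records_written: int = 0
    bytes_read: int = 0
    bytes_written: int = 0


@dataclass
class StageTotals:
    """Running totals kept in memory for one stage."""
    tasks: int = 0
    task_time_ms: int = 0  # summed task time, not stage wall-clock time
    records_read: int = 0
    records_written: int = 0
    bytes_read: int = 0
    bytes_written: int = 0


class StageAggregator:
    """Accumulates task-level metrics into per-stage totals."""

    def __init__(self) -> None:
        # A missing stage id gets a zeroed StageTotals on first access.
        self.stages = defaultdict(StageTotals)

    def on_task_end(self, event: TaskEnd) -> None:
        totals = self.stages[event.stage_id]
        totals.tasks += 1
        totals.task_time_ms += event.duration_ms
        totals.records_read += event.records_read
        totals.records_written += event.records_written
        totals.bytes_read += event.bytes_read
        totals.bytes_written += event.bytes_written


if __name__ == "__main__":
    agg = StageAggregator()
    agg.on_task_end(TaskEnd(stage_id=0, duration_ms=120, records_read=1000, bytes_read=64_000))
    agg.on_task_end(TaskEnd(stage_id=0, duration_ms=80, records_read=500, bytes_read=32_000))
    agg.on_task_end(TaskEnd(stage_id=1, duration_ms=200, records_written=1500, bytes_written=90_000))
    for stage_id, totals in sorted(agg.stages.items()):
        print(stage_id, totals)
```

Keeping only running totals per stage keeps memory use proportional to the number of stages rather than tasks, which is presumably why task-level detail is described as optional.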
- On onApplicationEnd, prints job and stage summaries to stdout / logs.
- A small Python helper can convert the summaries into DataFrames if you want to persist or visualize them.
- Designed to be copy-pastable into incident tickets and performance reviews.
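The conversion to DataFrames mentioned above could look roughly like the following. This is a sketch under stated assumptions: the column names, the dict-based summary shape, and the function names are invented for illustration and may not match the project's actual helper. Only spark.createDataFrame is a real PySpark call.

```python
"""Sketch: turning stage summaries into rows suitable for a DataFrame."""

# Hypothetical column layout; the real helper's schema may differ.
STAGE_COLUMNS = ["stage_id", "tasks", "duration_ms", "records_read", "records_written"]


def summaries_to_rows(summaries):
    """Convert summary dicts into tuples ordered by STAGE_COLUMNS (missing keys become None)."""
    return [tuple(s.get(col) for col in STAGE_COLUMNS) for s in summaries]


def slowest(summaries, n=3):
    """Return the n summaries with the largest duration_ms, slowest first."""
    return sorted(summaries, key=lambda s: s.get("duration_ms", 0), reverse=True)[:n]


def to_spark_dataframe(spark, summaries):
    """Build a Spark DataFrame; `spark` is an existing SparkSession (requires pyspark)."""
    return spark.createDataFrame(summaries_to_rows(summaries), schema=STAGE_COLUMNS)


if __name__ == "__main__":
    example = [
        {"stage_id": 0, "tasks": 8, "duration_ms": 900, "records_read": 1000, "records_written": 0},
        {"stage_id": 1, "tasks": 4, "duration_ms": 1500, "records_read": 0, "records_written": 1000},
    ]
    for row in summaries_to_rows(slowest(example, n=1)):
        print(row)
```

Once the rows are in a DataFrame, they can be written out or charted with whatever your pipeline already uses for persistence and visualization.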
Demo: what the summaries look like in your logs
In a real run, PySparkInspector writes plain-text summaries to stdout/CloudWatch. Below is a visual mock of that output, cycling between job summary, stage summary, and slowest stages to show what you'd typically see after a PySpark or Glue ETL job.
Tech stack
PySparkInspector is intentionally small: a single listener and a few helpers. Even so, it makes Spark jobs easier to debug and explain in environments without the Spark UI, and you can drop it into any PySpark or Glue project to make execution behavior visible in plain text.
Explore PySparkInspector on GitHub