
Airflow Smart Retries using LLMs

A lightweight, explainable retry controller for Airflow tasks that uses a local LLM (Ollama) to read task logs and decide whether to retry or fail fast — reducing wasted compute and noisy pipelines.

Open source · 2025 · Airflow · Reliability · Cost optimisation · View repository on GitHub
Situation
In many Airflow setups, tasks blindly retry on failure. Transient issues might resolve themselves, but permanent errors (bad config, missing tables, code bugs) simply burn through all retries, delay SLAs, and waste compute.
Task
Build a small, explainable controller that can read task failures, understand why they happened, and decide when to retry vs fail fast — without shipping logs to external cloud LLMs.
Action
Implemented a smart retry layer that hooks into Airflow, streams recent task logs to a local Ollama LLM, classifies the failure, and returns a decision: retry, fail fast, or mark as “needs manual attention”. The controller is deployed as Python utilities and a thin integration in DAG code.
Result
Airflow pipelines stop wasting retries on obvious permanent errors and focus on real transient issues. This reduces compute spend, shortens incident time-to-detection, and makes failure behaviour much more predictable for downstream teams.

High-level architecture

Smart retries are implemented as a thin orchestration around the existing Airflow scheduler: task failures trigger a log-inspection-and-decision flow, powered by a local LLM.

Step 1 – Task fails in Airflow

On failure, the smart retry controller inspects the error, logs, and task context.

Airflow task & log collection
  • Hooks into task failure callbacks or a wrapper around the operator.
  • Collects last N lines of the task log and context (dag_id, task_id, try number).
  • Normalises messages (stack traces, error types).
Local LLM decision engine (Ollama)
  • Runs against a local Ollama endpoint — no external API.
  • Prompt contains failure category examples and what “retryable” vs “fatal” looks like.
  • Returns a structured JSON decision: action + explanation.
Retry controller in Airflow
  • If decision is retry, requeues the task with backoff.
  • If decision is fail_fast, marks task as failed and surfaces the LLM explanation.
  • Optionally tags the DAG run or pushes a notification (Slack/email) with the summary.
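The log-collection step above can be sketched as a small helper. Function and field names here are illustrative, not the repo's actual API: the idea is simply to keep the tail of the log, make a crude guess at the error type, and bundle it with task context for the prompt.

```python
def collect_evidence(log_text: str, dag_id: str, task_id: str,
                     try_number: int, max_lines: int = 40) -> dict:
    """Bundle the last N log lines plus task context into one dict."""
    lines = [ln.rstrip() for ln in log_text.splitlines() if ln.strip()]
    tail = lines[-max_lines:]
    # Crude normalisation: treat the last exception-looking line as the error type.
    error_type = next(
        (ln.split(":", 1)[0].strip() for ln in reversed(tail)
         if "Error" in ln or "Exception" in ln),
        "unknown",
    )
    return {
        "dag_id": dag_id,
        "task_id": task_id,
        "try_number": try_number,
        "error_type": error_type,
        "log_tail": tail,
    }
```

Keeping the evidence small and structured matters twice over: it bounds the prompt size sent to the local model, and it gives the decision log a stable shape for later dashboards.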

How it works

The system is intentionally simple: it's a decision helper wrapped around Airflow's existing retry semantics, driven by structured prompts instead of brittle regex.

  1. Wrap the operator. Tasks use a thin wrapper (or on_failure_callback) that calls into the smart retry controller when a failure occurs.
  2. Gather evidence. The controller pulls recent log lines, error type, and key context (dag_id, task_id, execution_date, try number).
  3. Call the local LLM. A prompt is constructed with examples of transient vs permanent failures and the current error. Ollama returns a classification + recommended action.
  4. Apply the decision. The controller decides whether to increment retries, apply backoff, or short-circuit and mark the task as failed with the LLM's explanation embedded in logs/XCom.
  5. Observe & tune. Because outputs are structured JSON, teams can log decisions, build dashboards, and tune prompts or rules over time.
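Steps 3–4 can be sketched as a thin client around Ollama's standard `/api/generate` endpoint. Helper names and the model name are assumptions; `"stream": False` makes Ollama return a single JSON object whose `response` field holds the model's text. The HTTP call is injectable so the engine can be exercised without a running Ollama server.

```python
import json
import urllib.request


def _http_post(url: str, payload: dict) -> dict:
    """POST JSON and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def classify_failure(prompt: str, model: str = "llama3.1",
                     url: str = "http://localhost:11434/api/generate",
                     post=_http_post) -> dict:
    """Ask the local LLM for a decision and parse its JSON answer."""
    body = post(url, {"model": model, "prompt": prompt, "stream": False})
    # The prompt instructs the model to answer with a JSON decision object.
    return json.loads(body["response"])
```

In the real controller, the prompt would carry the evidence described in step 2 plus the transient-vs-fatal examples; the injectable `post` keeps the decision path unit-testable.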

Demo: what a smart retry decision looks like

A typical decision combines raw error context with an LLM explanation and a clear action. This is representative of the JSON the controller uses under the hood.

Task failure snapshot
dag_id: customer_metrics_daily
task_id: fetch_source_data
try_number: 1
Log excerpt
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://s3.eu-west-1.amazonaws.com/..."
LLM decision (local Ollama)
action: retry
category: transient-network-error
explanation: The error indicates a temporary connectivity issue with S3 rather than invalid credentials or missing buckets. A retry with backoff is likely to succeed.
recommended_backoff_seconds: 120
Example of a fail_fast decision:
action: "fail_fast"
explanation: "The error shows a SQL syntax problem in the DAG's SELECT statement. This will not be fixed by retrying."
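Before acting on a decision like the ones above, the controller can validate it defensively and turn the recommendation into a concrete delay. This is an illustrative policy, not the repo's exact code: unusable or unknown decisions fall back to a plain retry (Airflow's default behaviour), and the suggested backoff doubles per attempt with a little jitter.

```python
import json
import random

VALID_ACTIONS = {"retry", "fail_fast", "needs_manual_attention"}


def parse_decision(raw: str) -> dict:
    """Never let a malformed LLM reply crash the controller."""
    try:
        decision = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        decision = {}
    if not isinstance(decision, dict) or decision.get("action") not in VALID_ACTIONS:
        # Fall back to Airflow's normal retry semantics.
        return {"action": "retry",
                "explanation": "unusable LLM output; defaulting to retry"}
    return decision


def next_backoff(decision: dict, try_number: int,
                 cap_seconds: int = 1800, jitter: float = 0.1) -> int:
    """Honour the LLM's suggested delay, doubled per attempt, with jitter."""
    base = decision.get("recommended_backoff_seconds", 60)
    delay = min(base * 2 ** (try_number - 1), cap_seconds)
    spread = int(delay * jitter)
    return delay + random.randint(-spread, spread)
```

Defaulting to retry on bad output is a deliberate safety choice: the worst case degrades to what Airflow would have done anyway, so the LLM can only improve behaviour, never make it worse.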

Tech stack

Apache Airflow · Python · Local LLM via Ollama · Task failure callbacks · Structured JSON decisions · Optional: dbt & Spark tasks

Smart retries make your scheduler feel less like a dumb loop and more like a collaborator that understands why things are failing. The implementation is intentionally small so teams can inspect, extend, and trust the behaviour.

See code and usage in the repo