Airflow Smart Retries using LLMs
A lightweight, explainable retry controller for Airflow tasks that uses a local LLM (Ollama) to read task logs and decide whether to retry or fail fast — reducing wasted compute and noisy pipelines.
High-level architecture
Smart retries are implemented as a thin orchestration around the existing Airflow scheduler: task failures trigger a log-inspection-and-decision flow, powered by a local LLM.
On failure, the smart retry controller inspects the error, logs, and task context.
- Hooks into task failure callbacks or a wrapper around the operator.
- Collects last N lines of the task log and context (dag_id, task_id, try number).
- Normalises messages (stack traces, error types).
- Runs against a local Ollama endpoint — no external API.
- Prompt contains failure category examples and what “retryable” vs “fatal” looks like.
- Returns a structured JSON decision: action + explanation.
- If decision is retry, requeues the task with backoff.
- If decision is fail_fast, marks task as failed and surfaces the LLM explanation.
- Optionally tags the DAG run or pushes a notification (Slack/email) with the summary.
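The evidence-collection and normalisation steps above could look roughly like the sketch below. It is a minimal illustration, not the project's code: the names Evidence, tail, normalise and gather_evidence are hypothetical, and the log-prefix pattern assumes Airflow-style lines of the form "[timestamp] {file.py:line} LEVEL - message".

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed log prefix: "[timestamp] {file.py:line} " at the start of a line.
_PREFIX = re.compile(r"^\[[^\]]*\]\s*(\{[^}]*\}\s*)?")
# A (possibly dotted) name ending in Error or Exception at the start of a line.
_EXC = re.compile(r"^([A-Za-z_][\w.]*(?:Error|Exception))\b")


@dataclass
class Evidence:
    dag_id: str
    task_id: str
    try_number: int
    error_type: str
    log_tail: List[str]


def tail(log_text: str, n: int = 50) -> List[str]:
    """Return the last n non-blank lines of the log."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    return lines[-n:]


def normalise(line: str) -> str:
    """Strip the timestamp/source prefix so the model sees only the message."""
    return _PREFIX.sub("", line).strip()


def gather_evidence(log_text: str, dag_id: str, task_id: str,
                    try_number: int, n: int = 50) -> Evidence:
    lines = [normalise(ln) for ln in tail(log_text, n)]
    error_type = "unknown"
    # Walk backwards: the last exception line is usually the one that matters.
    for ln in reversed(lines):
        match = _EXC.match(ln)
        if match:
            error_type = match.group(1)
            break
    return Evidence(dag_id, task_id, try_number, error_type, lines)
```

Keeping this step free of any LLM call means it can be unit-tested against saved log files before a prompt is ever written.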
How it works
The system is intentionally simple: it's a decision helper wrapped around Airflow's existing retry semantics, driven by structured prompts instead of brittle regex.
- Wrap the operator. Tasks use a thin wrapper (or on_failure_callback) that calls into the smart retry controller when a failure occurs.
- Gather evidence. The controller pulls recent log lines, error type, and key context (dag_id, task_id, execution_date, try number).
- Call the local LLM. A prompt is constructed with examples of transient vs permanent failures and the current error. Ollama returns a classification + recommended action.
- Apply the decision. The controller decides whether to increment retries, apply backoff, or short-circuit and mark the task as failed with the LLM's explanation embedded in logs/XCom.
- Observe & tune. Because outputs are structured JSON, teams can log decisions, build dashboards, and tune prompts or rules over time.
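The steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the repo's implementation: the prompt wording, the model name, the helper names (ask_ollama, decide, run_with_smart_retry) and the choice to fall back to a normal retry when the model gives no usable answer are all hypothetical. The request shape follows Ollama's /api/generate endpoint on its default local port; FailFast is a stand-in exception so the sketch runs without Airflow installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL = "llama3"  # assumption: any model already pulled into Ollama


class FailFast(Exception):
    """Stand-in for Airflow's fail-without-retry exception in this sketch."""


PROMPT = (
    "You classify Airflow task failures.\n"
    "Transient example: connection timeout to an API -> retry\n"
    "Permanent example: SQL syntax error in a query -> fail_fast\n"
    'Reply with JSON only: {{"action": "retry" | "fail_fast", "explanation": "..."}}\n\n'
    "dag_id={dag_id} task_id={task_id} try_number={try_number}\n"
    "Error: {error}\n"
)


def ask_ollama(prompt: str) -> str:
    """Send one non-streaming request and return the model's text reply."""
    body = json.dumps({"model": MODEL, "prompt": prompt,
                       "stream": False, "format": "json"}).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]


def decide(error, dag_id, task_id, try_number, llm=ask_ollama) -> dict:
    """Ask the model for a decision; fall back to 'retry' if it is unusable."""
    prompt = PROMPT.format(dag_id=dag_id, task_id=task_id,
                           try_number=try_number, error=error)
    try:
        decision = json.loads(llm(prompt))
        if decision["action"] not in ("retry", "fail_fast"):
            raise ValueError(f"unexpected action {decision['action']!r}")
    except Exception as exc:
        # Assumed policy: never block Airflow's normal retries on an LLM problem.
        return {"action": "retry", "explanation": f"no usable LLM decision ({exc!r})"}
    return {"action": decision["action"],
            "explanation": str(decision.get("explanation", ""))}


def run_with_smart_retry(task_fn, dag_id, task_id, try_number, llm=ask_ollama):
    """Run the task; on failure, either fail fast or re-raise for a normal retry."""
    try:
        return task_fn()
    except Exception as err:
        decision = decide(f"{type(err).__name__}: {err}",
                          dag_id, task_id, try_number, llm)
        if decision["action"] == "fail_fast":
            raise FailFast(decision["explanation"]) from err
        raise  # leave retry count and backoff to the scheduler
```

In a real DAG the stand-in FailFast would presumably be airflow.exceptions.AirflowFailException, which marks the task as failed without consuming the remaining retries. Passing the LLM call in as a parameter keeps the decision logic testable with a fake model.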
Demo: what a smart retry decision looks like
A typical decision combines raw error context with an LLM explanation and a clear action. This is representative of the JSON the controller uses under the hood.
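Example error from a task log:
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://s3.eu-west-1.amazonaws.com/..."
Example decision, here for a permanent failure (a SQL syntax error rather than the connection error above):
action: "fail_fast"
explanation: "The error shows a SQL syntax problem in the DAG's SELECT statement. This will not be fixed by retrying."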
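One way to hold such a decision in code is a small validated record that serialises to a single JSON line, which is what makes decisions easy to log and chart later. The sketch below is hypothetical: only action and explanation come from the description above, while the DecisionRecord name and the extra context fields (dag_id, task_id, try_number, error) are assumptions about what would be useful to store.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_ACTIONS = ("retry", "fail_fast")


@dataclass(frozen=True)
class DecisionRecord:
    dag_id: str
    task_id: str
    try_number: int
    error: str
    action: str
    explanation: str

    def __post_init__(self):
        # Reject anything outside the two actions the controller understands.
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")

    def to_json(self) -> str:
        """One JSON object per decision, with stable key order for log diffing."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only; the dag/task names and error text are made up.
record = DecisionRecord(
    dag_id="example_dag",
    task_id="run_query",
    try_number=1,
    error="syntax error in SELECT statement",
    action="fail_fast",
    explanation="The error shows a SQL syntax problem in the DAG's SELECT "
                "statement. This will not be fixed by retrying.",
)
print(record.to_json())
```

Validating the action at construction time means a malformed model reply surfaces as an error at the boundary instead of as an unexpected branch later in the controller.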
Tech stack
Smart retries make your scheduler feel less like a dumb loop and more like a collaborator that understands why things are failing. The implementation is intentionally small so teams can inspect, extend, and trust the behaviour.
See code and usage in the repo