
Documentation Assistant for Data Engineers

An Airflow-integrated documentation builder that uses LLMs to generate and update markdown docs for DAGs and tasks — so data engineers spend less time writing docs and more time shipping pipelines.

Open source · 2025 · Airflow · LLMs · Developer experience

View repository on GitHub
Situation
Most data teams want good documentation for their DAGs and tasks, but the reality is usually a scattered mix of wiki pages, half-filled doc_md fields, and tribal knowledge.
Task
Build a documentation assistant that can live inside Airflow, read DAG definitions and metadata, and produce markdown docs that are good enough to ship with minimal edits.
Action
Implemented an Airflow plugin that walks DAGs and tasks, extracts key details (purpose, dependencies, owners, schedules, retries), and calls an LLM to generate structured markdown sections. Docs are written to files in a repo-friendly layout.
Result
Documentation time dropped by ~50%. New DAGs ship with docs by default, and onboarding engineers can browse markdown instead of reverse-engineering DAG code from scratch.

High-level architecture

The assistant is built as a thin layer around Airflow's DagBag: it inspects DAGs, sends structured prompts to an LLM, and writes markdown documentation that can live alongside your code.
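To make that concrete, the per-DAG context the assistant gathers can be pictured as a small record. This is a minimal sketch; the class and field names are illustrative, not the plugin's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class DagDocContext:
    """Everything the assistant knows about one DAG before prompting the LLM.

    Illustrative sketch: names here are assumptions, not the plugin's API.
    """
    dag_id: str
    owners: list[str]
    schedule: str | None          # cron expression, or None for manually triggered DAGs
    tags: list[str] = field(default_factory=list)
    tasks: list[dict] = field(default_factory=list)  # task_id, operator, upstream, docstring
```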

Step 1 – Airflow plugin & DAG inspection

The plugin inspects DAGs, tasks, and metadata from the DagBag.
  • Loads DAGs from Airflow's DagBag.
  • Extracts dag_id, owners, schedule, retries, and tags.
  • Walks tasks to capture operators, dependencies, and any inline comments or docstrings (a minimal inspection sketch follows this list).
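A minimal sketch of that inspection step against Airflow 2.x's DagBag API (the helper name is made up, and real plugin code would also handle import errors and missing DAGs):

```python
from airflow.models import DagBag

def collect_dag_context(dag_id: str) -> dict:
    """Gather the metadata the doc generator needs for one DAG (sketch only)."""
    dag = DagBag(read_dags_from_db=True).get_dag(dag_id)
    return {
        "dag_id": dag.dag_id,
        "owners": sorted({t.owner for t in dag.tasks}),
        "schedule": str(dag.schedule_interval),  # Airflow 2.x attribute
        "tags": list(dag.tags or []),
        "tasks": [
            {
                "task_id": t.task_id,
                "operator": type(t).__name__,
                "upstream": sorted(t.upstream_task_ids),
                "retries": t.retries,
                "doc": t.doc_md or t.doc or "",
            }
            for t in dag.tasks
        ],
    }
```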
Step 2 – LLM documentation engine
  • Sends a structured prompt to an LLM (Ollama or OpenAI) describing the DAG and its tasks (see the prompt sketch after this list).
  • Asks for markdown with sections like “Overview”, “Data flow”, “Schedule & SLAs”, and “Key tasks”.
  • Optionally includes sample runs or schema hints from upstream sources.
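The prompt step might look like this sketch against a local Ollama instance. The endpoint and response field follow Ollama's /api/generate API; the template wording and model name are illustrative choices, not the plugin's actual configuration:

```python
import json
import requests

PROMPT_TEMPLATE = """You are a technical writer for a data platform team.
Write markdown documentation for the Airflow DAG described below.
Use exactly these sections: Overview, Data flow, Schedule & SLAs, Key tasks.
Be concise and factual; do not invent details absent from the input.

DAG description (JSON):
{dag_json}
"""

def generate_docs(dag_context: dict, model: str = "llama3") -> str:
    # Ollama's local generate endpoint; stream=False returns a single JSON object.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": PROMPT_TEMPLATE.format(dag_json=json.dumps(dag_context, indent=2)),
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```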
Step 3 – Markdown output & distribution
  • Writes docs to markdown files organised by dag_id (see the layout sketch after this list).
  • Files can live in a docs repo, a wiki sync, or alongside the DAG code.
  • The Airflow plugin panel links directly to the generated docs.
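Writing the output is deliberately simple: one markdown file per DAG in a predictable path, so docs can be diffed and reviewed like code. A sketch assuming a docs/dags/<dag_id>/README.md layout (the layout itself is a choice, not a requirement of the plugin):

```python
from pathlib import Path

def write_dag_doc(dag_id: str, markdown: str, docs_root: str = "docs/dags") -> Path:
    """Write generated markdown to <docs_root>/<dag_id>/README.md (assumed layout)."""
    out_dir = Path(docs_root) / dag_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "README.md"
    out_file.write_text(markdown, encoding="utf-8")
    return out_file
```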

How it works

Instead of asking engineers to open a blank page, the assistant starts from code and metadata and drafts docs they can quickly accept or tweak.

  1. Run inside Airflow. The plugin exposes an endpoint or UI button that says “Generate docs for this DAG” (a plugin sketch follows this list).
  2. Collect context. It gathers DAG properties, task graph, owners, schedules, retries, and any inline hints (docstrings, comments).
  3. Build a structured prompt. The assistant sends a compact JSON-like description of the DAG to the LLM with instructions to generate crisp, non-fluffy documentation.
  4. Write markdown. The response is parsed and written as markdown with stable headings so it can be diffed and version-controlled.
  5. Link it back into Airflow. The plugin shows a link to the markdown for each DAG, so the docs stay one click away from where engineers debug pipelines.
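Gluing the steps together, an Airflow 2.x plugin can register the “Generate docs” endpoint as a Flask blueprint on the webserver. This is a sketch reusing the hypothetical helpers from the earlier snippets: the module path doc_assistant.core is made up, and a real deployment would sit behind the webserver's authentication:

```python
from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint, jsonify

# Hypothetical module collecting the earlier sketches.
from doc_assistant.core import collect_dag_context, generate_docs, write_dag_doc

bp = Blueprint("doc_assistant", __name__, url_prefix="/doc-assistant")

@bp.route("/generate/<dag_id>", methods=["POST"])
def generate(dag_id: str):
    # Inspect -> prompt -> write, then tell the UI where the doc landed.
    context = collect_dag_context(dag_id)
    markdown = generate_docs(context)
    path = write_dag_doc(dag_id, markdown)
    return jsonify({"dag_id": dag_id, "doc_path": str(path)})

class DocAssistantPlugin(AirflowPlugin):
    name = "doc_assistant"
    flask_blueprints = [bp]
```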

Example: generated DAG documentation (excerpt)

Rather than showing raw markdown, here is how a generated document might look when rendered in a docs site or markdown viewer.

DAG: customer_metrics_daily

Daily computation of core customer metrics for analytics and CRM stakeholders.

Overview

This DAG computes daily customer metrics used by the analytics and CRM teams. It pulls raw events from the data lake, aggregates them into a fact table, and publishes a curated mart for downstream dashboards.

Data flow

  1. Extract raw events from s3://raw/events/customer/*.
  2. Load them into the staging.customer_events table.
  3. Aggregate metrics into mart.customer_daily_metrics.
  4. Trigger downstream dashboards via a lightweight notification task.

Schedule & SLAs

  • Schedule: 0 3 * * * (03:00 UTC every day)
  • Expected duration: ~20 minutes
  • SLA: Ready before 04:00 UTC for business stakeholders.

Ownership

  • Data owner: marketing-analytics@company.com
  • Tech owner: data-platform@company.com
  • Slack channel: #data-customer-metrics

Key tasks

  • extract_events — loads raw event files from S3 into staging.
  • transform_metrics — dbt / Spark transformation building the mart table.
  • quality_checks — runs freshness and row-count checks.
  • notify_downstream — posts to Slack if the DAG finishes successfully.

Reliability notes

Retries are configured for extraction and transformation tasks with exponential backoff. Data quality checks will fail the DAG if thresholds are not met, preventing stale or incomplete metrics from reaching downstream dashboards.
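For context, the retry behaviour described above maps to standard BaseOperator settings. A sketch with illustrative values (the callable and the numbers are made up, not taken from the real DAG):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_raw_events():
    """Hypothetical extraction callable, stubbed for illustration."""
    ...

with DAG(
    dag_id="customer_metrics_daily",
    start_date=datetime(2025, 1, 1),
    schedule="0 3 * * *",  # the documented 03:00 UTC daily run
    catchup=False,
):
    extract_events = PythonOperator(
        task_id="extract_events",
        python_callable=load_raw_events,
        retries=3,
        retry_delay=timedelta(minutes=2),
        retry_exponential_backoff=True,  # 2 min, 4 min, 8 min, ...
        max_retry_delay=timedelta(minutes=30),
    )
```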

Tech stack

Apache Airflow (plugin) · Python · LLMs (Ollama / OpenAI) · Markdown-based docs · Git / repo-native docs

The assistant turns documentation into a byproduct of pipeline development instead of an afterthought. Because it's built as an Airflow plugin, it stays close to the actual DAGs and evolves with them.

Explore the documentation assistant on GitHub