Documentation Assistant for Data Engineers
An Airflow-integrated documentation builder that uses LLMs to generate and update markdown docs for DAGs and tasks — so data engineers spend less time writing docs and more time shipping pipelines.
No more empty doc_md fields or undocumented tribal knowledge.
High-level architecture
The assistant is built as a thin layer around Airflow's DagBag: it inspects DAGs, sends structured prompts to an LLM, and writes markdown documentation that can live alongside your code.
The plugin inspects DAGs, tasks, and metadata from the DagBag (a sketch of this step follows the list).
- Loads DAGs from Airflow's DagBag.
- Extracts `dag_id`, owners, schedule, retries, and tags.
- Walks tasks to capture operators, dependencies, and any inline comments or docstrings.
- Sends a structured prompt to an LLM (Ollama or OpenAI) describing the DAG and its tasks.
- Asks for markdown with sections like “Overview”, “Data flow”, “Schedule & SLAs”, and “Key tasks”.
- Optionally includes sample runs or schema hints from upstream sources.
- Writes docs to markdown files organised by `dag_id`.
- Files can live in a docs repo, a wiki sync, or alongside the DAG code.
- An Airflow plugin panel links directly to the generated docs.
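As a rough illustration of the context-collection step, here is a minimal sketch built on Airflow's DagBag. It assumes Airflow 2.x; the `collect_dag_context` helper and the shape of the returned dict are illustrative, not the plugin's actual API.

```python
# Minimal sketch: gather the DAG metadata that later feeds the LLM prompt.
# Assumes Airflow 2.x; collect_dag_context is an illustrative name.
from airflow.models import DagBag

def collect_dag_context(dag_id: str, dag_folder: str = "dags/") -> dict:
    bag = DagBag(dag_folder=dag_folder, include_examples=False)
    dag = bag.get_dag(dag_id)
    return {
        "dag_id": dag.dag_id,
        "owners": dag.owner,                     # comma-separated task owners
        "schedule": str(dag.schedule_interval),
        "tags": list(dag.tags or []),
        "doc": dag.doc_md,                       # existing doc_md, if any
        "tasks": [
            {
                "task_id": t.task_id,
                "operator": type(t).__name__,
                "retries": t.retries,
                "upstream": sorted(t.upstream_task_ids),
                "doc": t.doc_md or t.doc,        # inline docstrings / hints
            }
            for t in dag.tasks
        ],
    }
```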
How it works
Instead of asking engineers to open a blank page, the assistant starts from code and metadata and drafts docs they can quickly accept or tweak.
- Run inside Airflow. The plugin exposes an endpoint or UI button that says “Generate docs for this DAG”.
- Collect context. It gathers DAG properties, task graph, owners, schedules, retries, and any inline hints (docstrings, comments).
- Build a structured prompt. The assistant sends a compact JSON-like description of the DAG to the LLM with instructions to generate crisp, non-fluffy documentation.
- Write markdown. The response is parsed and written as markdown with stable headings so it can be diffed and version-controlled; both of these steps are sketched after this list.
- Link it back into Airflow. The plugin shows a link to the markdown for each DAG, so the docs stay one click away from where engineers debug pipelines (a plugin sketch follows as well).
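Continuing the sketch, prompt-building and markdown-writing might look like this. It assumes the context dict from the previous example and the OpenAI Python SDK (v1+); the model name, prompt wording, and `generate_docs` helper are all illustrative assumptions, and an Ollama client could be swapped in.

```python
# Hedged sketch of the prompt-and-write steps; names and model are illustrative.
import json
from pathlib import Path
from openai import OpenAI

PROMPT_TEMPLATE = """You are documenting an Airflow DAG for data engineers.
Write crisp markdown with exactly these sections:
## Overview
## Data flow
## Schedule & SLAs
## Key tasks
Avoid filler. Base everything on this JSON description of the DAG:

{context}
"""

def generate_docs(context: dict, out_dir: str = "docs/dags") -> Path:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(context=json.dumps(context, indent=2)),
        }],
    )
    markdown = response.choices[0].message.content
    # One stable filename per dag_id keeps diffs reviewable in version control.
    path = Path(out_dir) / f"{context['dag_id']}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```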
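For the last step, one lightweight way to keep docs a click away in the Airflow UI is an appbuilder menu item registered by the plugin. Again a sketch, not the plugin's real wiring; the href is a hypothetical docs location.

```python
# Sketch: expose a "Generated DAG docs" link in the Airflow UI menu.
from airflow.plugins_manager import AirflowPlugin

docs_menu_link = {
    "name": "Generated DAG docs",
    "href": "https://wiki.company.com/dag-docs/",  # hypothetical location
    "category": "Docs",
}

class DocAssistantPlugin(AirflowPlugin):
    name = "doc_assistant"
    appbuilder_menu_items = [docs_menu_link]
```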
Example: generated DAG documentation (excerpt)
Rather than dumping raw markdown, here is how a generated document might look when rendered in a docs site or markdown viewer.
DAG: customer_metrics_daily
Daily computation of core customer metrics for analytics and CRM stakeholders.
Overview
This DAG computes daily customer metrics used by the analytics and CRM teams. It pulls raw events from the data lake, aggregates them into a fact table, and publishes a curated mart for downstream dashboards.
Data flow
- Extract raw events from `s3://raw/events/customer/*`.
- Load them into the `staging.customer_events` table.
- Aggregate metrics into `mart.customer_daily_metrics`.
- Trigger downstream dashboards via a lightweight notification task.
Schedule & SLAs
- Schedule: 0 3 * * * (03:00 UTC every day)
- Expected duration: ~20 minutes
- SLA: Ready before 04:00 UTC for business stakeholders.
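Stepping outside the generated excerpt for a moment: the documented schedule and SLA would correspond to DAG code roughly like the following. Start date, catchup, and SLA placement are illustrative assumptions, not taken from a real DAG.

```python
# Sketch of how the documented schedule/SLA maps to DAG configuration.
from datetime import datetime, timedelta
from airflow import DAG

with DAG(
    dag_id="customer_metrics_daily",
    schedule_interval="0 3 * * *",             # 03:00 UTC daily, as documented
    start_date=datetime(2024, 1, 1),           # illustrative start date
    catchup=False,
    default_args={"sla": timedelta(hours=1)},  # due by 04:00 UTC
) as dag:
    ...  # tasks defined here
```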
Ownership
- Data owner: marketing-analytics@company.com
- Tech owner: data-platform@company.com
- Slack channel: #data-customer-metrics
Key tasks
- extract_events — loads raw event files from S3 into staging.
- transform_metrics — dbt / Spark transformation building the mart table.
- quality_checks — runs freshness and row-count checks.
- notify_downstream — posts to Slack if the DAG finishes successfully.
Reliability notes
Retries are configured for extraction and transformation tasks with exponential backoff. Data quality checks will fail the DAG if thresholds are not met, preventing stale or incomplete metrics from reaching downstream dashboards.
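The retry behaviour described there typically maps to operator arguments like these; a minimal sketch with illustrative values, where the callable is a stand-in for the real extract logic.

```python
# Sketch: retries with exponential backoff on an extraction task.
from datetime import timedelta
from airflow.operators.python import PythonOperator

extract_events = PythonOperator(
    task_id="extract_events",
    python_callable=lambda: None,        # stand-in for the real extract step
    retries=3,
    retry_delay=timedelta(minutes=2),    # base delay between attempts
    retry_exponential_backoff=True,      # delay grows with each retry
    max_retry_delay=timedelta(minutes=20),
)
# Task graph implied by the excerpt:
# extract_events >> transform_metrics >> quality_checks >> notify_downstream
```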
Tech stack
- Apache Airflow (plugin + DagBag introspection)
- LLM backend: Ollama or OpenAI
- Markdown output, diffable and version-controlled
The assistant turns documentation into a byproduct of pipeline development instead of an afterthought. Because it's built as an Airflow plugin, it stays close to the actual DAGs and evolves with them.
Explore the documentation assistant on GitHub