
LLM-Powered Automatic Documentation for dbt

dbt Power Tools is a CLI that auto-generates dbt model and column documentation using LLMs (Ollama or OpenAI). It parses your dbt project, understands model SQL and data, and writes clear documentation back into schema.yml.

Open source · 2025 · dbt · Analytics engineering
View README on GitHub
Situation
Analytics teams had hundreds of dbt models but barely any useful schema.yml descriptions. Documentation was always “later”, and never caught up with reality.
Task
Design a workflow that could generate high-quality, business-friendly documentation automatically — without breaking existing dbt projects or forcing a new UI.
Action
Built a Python CLI that reads dbt's manifest.json, inspects model SQL and (optionally) warehouse data, prompts an LLM with rich context, and writes structured descriptions back into schema.yml using Jinja templates.
Result
Teams get complete dbt documentation in minutes: model purpose, grain, logic, and column-level summaries with stats. Documentation becomes a repeatable step in CI instead of a one-off chore.

High-level architecture

The tool plugs into an existing dbt project and uses dbt's own artifacts as the contract: manifest.json for structure and schema.yml for documentation as code. A minimal sketch of reading that contract follows the stage list.

Step 1 – dbt project inputs
  • manifest.json (models, refs, sources)
  • Existing schema.yml (if present)
  • Optional warehouse connection via profiles.yml
Step 2 – LLM documentation engine
  • Parses model SQL and dependency graph
  • Samples data (row counts, missing %, example values) when enabled
  • Prompts the LLM (Ollama / OpenAI) with structured context per model and column
Step 3 – Generated documentation
  • Updates schema.yml in place
  • Model descriptions: purpose, grain, business meaning
  • Column docs: semantics + derived stats
  • Ready for the dbt docs site / PR review
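
For concreteness, here is a minimal sketch of stage 1 in Python. It is illustrative rather than the tool's actual code: discover_models is a hypothetical helper, and the field names follow recent dbt Core manifests (raw_code replaced raw_sql in dbt 1.3).

import json
from pathlib import Path

def discover_models(project_dir: str) -> list[dict]:
    # Load the manifest dbt compiles into target/ on every run or parse.
    manifest = json.loads(Path(project_dir, "target", "manifest.json").read_text())
    models = []
    for unique_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        models.append({
            "unique_id": unique_id,                    # e.g. "model.my_project.fct_orders"
            "name": node["name"],
            "sql": node.get("raw_code", ""),           # "raw_sql" on dbt < 1.3
            "depends_on": node["depends_on"]["nodes"], # upstream models and sources
            "tags": node.get("tags", []),
            "existing_description": node.get("description", ""),
        })
    return models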

How it works

The CLI is designed to be safe, repeatable, and friendly to CI. Documentation is generated from the project artifacts – not from ad-hoc inspection.

  1. Discover models. The CLI parses manifest.json to enumerate models, sources, and their dependencies (as sketched above).
  2. Collect context. For each model, it gathers SQL, upstream dependencies, tags, and existing docs. If configured, it queries the warehouse for simple profile stats (see the profiling sketch after this list).
  3. Prompt the LLM. It builds a structured prompt that explains the model's intent, joins, and transformations in plain language, and asks for concise documentation (see the prompt sketch below).
  4. Write docs as code. The resulting descriptions are written into schema.yml using a Jinja-based template so teams can tweak the style without changing Python.
  5. Run in CI. The CLI can be wired into a dbt or GitHub Actions workflow so new or changed models are documented automatically in pull requests.
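
Step 2's profile stats can come from a single cheap aggregate per column. The sketch below uses Pandas and SQLAlchemy from the stack; profile_column is a hypothetical helper, and the f-string SQL assumes identifiers taken straight from the manifest, not user input.

import pandas as pd
from sqlalchemy import create_engine

# engine = create_engine(dsn)  # DSN resolved from the active profiles.yml target

def profile_column(engine, relation: str, column: str, sample_size: int = 5) -> dict:
    # One cheap aggregate pass: row count and null count for the column.
    stats = pd.read_sql(
        f"""
        select
            count(*) as row_count,
            sum(case when {column} is null then 1 else 0 end) as null_count
        from {relation}
        """,
        engine,
    ).iloc[0]
    # A handful of distinct example values to ground the LLM prompt.
    examples = pd.read_sql(
        f"select distinct {column} from {relation} "
        f"where {column} is not null limit {sample_size}",
        engine,
    )[column].tolist()
    row_count = int(stats["row_count"])
    null_count = int(stats["null_count"] or 0)
    return {
        "row_count": row_count,
        "missing_pct": round(100 * null_count / max(row_count, 1), 2),
        "examples": examples,
    }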
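Step 3, roughly: flatten the collected context into a prompt and ask for a short description. The prompt wording, the document_model helper, and the model name are illustrative; the actual CLI also supports Ollama as a backend.

from openai import OpenAI

client = OpenAI()  # Ollama works too; OpenAI shown for brevity

def document_model(model: dict, column_profiles: dict[str, dict]) -> str:
    # Flatten per-column stats into prompt-friendly lines.
    profile_lines = "\n".join(
        f"- {name}: {p['missing_pct']}% missing, examples: {p['examples']}"
        for name, p in column_profiles.items()
    )
    prompt = (
        "You are documenting a dbt model for business users.\n"
        f"Model: {model['name']}\n"
        f"Upstream dependencies: {', '.join(model['depends_on'])}\n"
        f"SQL:\n{model['sql']}\n"
        f"Column profiles:\n{profile_lines}\n"
        "Describe the model's purpose, grain, and business meaning "
        "in two to three sentences."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content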

Example generated documentation

An example of how a model and its columns look after running the tool. The actual text is controlled via templates so teams can match their documentation tone; an illustrative template sketch follows the example.

models:
  - name: fct_orders
    description: >
      Fact table representing one row per customer order, joined to core
      customer and product dimensions. Used for revenue, margin and volume
      reporting at the daily grain.
    columns:
      - name: order_id
        description: >
          Surrogate key for the order. Unique per order across all channels.
      - name: customer_id
        description: >
          Links to the dim_customers table to enrich with customer attributes.
      - name: order_date
        description: >
          Business date of the order, used as the primary reporting date.
      - name: gross_revenue
        description: >
          Pre-discount revenue in the transaction currency.
      - name: net_revenue
        description: >
          Revenue after discounts and returns. Used in margin calculations.
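
Under the hood, a Jinja2 template along these lines could produce YAML like the example above. The template text and the sample documented_models value are illustrative, not the tool's shipped template.

from pathlib import Path
from jinja2 import Template

# Illustrative template; trim_blocks/lstrip_blocks keep the rendered YAML tidy.
SCHEMA_TEMPLATE = Template(
    """\
models:
{% for model in models %}
  - name: {{ model.name }}
    description: >
      {{ model.description }}
    columns:
{% for col in model.columns %}
      - name: {{ col.name }}
        description: >
          {{ col.description }}
{% endfor %}
{% endfor %}
""",
    trim_blocks=True,
    lstrip_blocks=True,
)

# Hypothetical output of the LLM step above.
documented_models = [
    {
        "name": "fct_orders",
        "description": "One row per customer order, joined to core dimensions.",
        "columns": [
            {"name": "order_id", "description": "Surrogate key for the order."},
        ],
    },
]

Path("models/schema.yml").write_text(SCHEMA_TEMPLATE.render(models=documented_models))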

Tech stack

dbt Core · Python · Typer CLI · Ollama · OpenAI · Jinja2 templates · Pandas · SQLAlchemy · Warehouse profiling (optional)

The goal of dbt Power Tools is to make documentation the easiest part of analytics engineering, not the guiltiest secret in the repo.

Read usage & installation on GitHub