dbt (data build tool) is a command-line tool and framework for writing data transformations. It lets you write transformations as SQL SELECT statements, organize them as a project, run them in your data warehouse, and test them. dbt brings software engineering best practices to data transformation work: version control, testing, documentation, and dependency management.
Before dbt, data transformations were scattered and hard to manage. Some lived in database stored procedures, some in Python scripts, some in proprietary ETL tools. This made pipelines fragile and difficult to understand. dbt centralized transformation in a framework that enforces standards. A dbt project is a Git repository with SQL files and YAML configuration. This enables version control, code review, and reproducibility. Teams can collaborate on transformations the way they collaborate on application code.
dbt is often described as the T in ELT (Extract, Load, Transform). Data is extracted from sources and loaded into a data warehouse (the E and L parts). dbt handles the T: transformation in the warehouse. This is a shift from traditional ETL where transformation happened before loading. ELT is more practical with modern cloud data warehouses that are fast and cheap enough to store raw data and transform it in-place.
The impact of dbt has been significant. It has become the standard tool for data transformation in many organizations. Before dbt, data engineering was often ad-hoc. dbt provides structure, encouraging best practices like testing and documentation from the start. This has improved data reliability and team productivity across the industry.
A dbt model is a SQL SELECT statement saved as a file. When you run dbt, it executes these SELECT statements in your warehouse, creating or updating tables or views. Each model is a single file in the models/ directory. The filename becomes the table or view name: customers.sql creates a table or view named customers.
Models reference other models using the ref() function. If you write SELECT * FROM {{ref('raw_customers')}}, dbt knows that this model depends on the raw_customers model. It automatically creates a dependency and runs raw_customers before this model. This dependency tracking eliminates manual orchestration. You do not need to remember the order or write scripts to enforce it. dbt figures it out.
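As a sketch, a model file that builds on another model might look like the following (the file name and columns here are hypothetical):

```sql
-- models/customers.sql (hypothetical example)
-- dbt replaces ref() with the fully qualified name of the upstream relation
-- and records the dependency, so raw_customers is built before this model.
select
    id as customer_id,
    first_name,
    last_name,
    created_at
from {{ ref('raw_customers') }}
where deleted = false
```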
Models can be materialized as tables or views. Tables are persisted in the warehouse: dbt rebuilds them on each run, and downstream queries read the stored results. Views are recomputed on every query. Tables are faster for downstream queries but slower to build; views are fast to build but slower to query. You configure which materialization to use per model. Early models might be views (fast to build), later models might be tables (fast to query). The choice depends on how often they are queried and how expensive the computation is.
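Per-model materialization is set with a config() block at the top of the model file; project-wide defaults can also be set in dbt_project.yml. A minimal sketch, with a hypothetical model name:

```sql
-- models/fct_orders.sql (hypothetical example)
-- The config() block overrides the project default for this one model.
{{ config(materialized='table') }}

select * from {{ ref('stg_orders') }}
```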
dbt handles incremental models: models that only process new or changed data rather than recomputing everything. For large tables, incremental processing is much faster. A model might say 'only process rows where created_at is newer than what has already been loaded'. This can cut processing time dramatically, from hours to minutes. Incremental processing adds complexity, so start with simpler materializations and add incremental logic only when it matters.
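A minimal sketch of an incremental model, assuming a hypothetical stg_events model with a created_at timestamp:

```sql
-- models/fct_events.sql (hypothetical incremental model)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    created_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what is already in the table.
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```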
dbt includes built-in generic tests for common data quality checks. The unique test verifies that a column has no duplicate values. The not_null test verifies that a column has no missing values. The relationships test verifies referential integrity: a foreign key column only contains values that exist in the parent table. The accepted_values test verifies that a column only contains values from a specified list.
You define tests in YAML files alongside models. A test definition might look like: a column 'customer_id' is not null and unique, and a column 'status' only contains values from a specific list. dbt runs tests after transformations. If tests fail, dbt reports failures. You configure whether test failures block deployments. In development, you might ignore failures. In production, failures should block downstream processing to prevent bad data reaching consumers.
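A sketch of what such a YAML definition might look like, using a hypothetical stg_customers model and an invented list of statuses:

```yaml
# models/staging/stg_customers.yml (hypothetical example)
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'trial']
```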
Custom tests let you write SQL to check data conditions. A test might query a table and return rows where a business rule is violated. If the query returns any rows, the test fails. Custom tests are powerful for business logic: you can test that amounts are positive, dates make sense, or totals add up correctly. Writing tests forces you to think about what good data looks like and document business rules.
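A custom (singular) test is simply a SQL file in the tests/ directory that selects the rows violating the rule; for example, a hypothetical check that order amounts are positive:

```sql
-- tests/assert_order_amounts_are_positive.sql (hypothetical example)
-- The test fails if this query returns any rows.
select
    order_id,
    amount
from {{ ref('fct_orders') }}
where amount <= 0
```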
Without data quality tests, bad data silently flows downstream, causing incorrect analysis and wrong decisions. With tests, you catch issues immediately. This is often the highest-impact change teams make when adopting dbt: instead of discovering data issues when someone complains, you catch them automatically. Over time, test coverage should grow, catching more types of issues. A mature dbt project has comprehensive tests ensuring data reliability.
dbt generates documentation automatically from your code. You describe models and columns in YAML files. For each model, you can write a description explaining what it is and how to use it. For each column, you document its meaning and business context. When you run dbt docs generate, it creates a static website showing all this information.
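Descriptions live in the same YAML files used for tests. A sketch with a hypothetical model and invented wording:

```yaml
# models/marts/fct_orders.yml (hypothetical example)
version: 2

models:
  - name: fct_orders
    description: "One row per completed order, built from the orders staging model."
    columns:
      - name: order_id
        description: "Primary key; unique identifier assigned at checkout."
      - name: amount
        description: "Order total in USD, including tax and shipping."
```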
The documentation website is crucial for data discovery. New team members can browse it to understand what data exists. Analysts can find tables relevant to their questions. The documentation answers: 'What does this field mean?', 'Where does this data come from?', 'Is this table current?', 'Who should I ask about this data?'. Without documentation, people guess or ask repeatedly. With dbt's auto-generated documentation, answers are readily available.
dbt also generates lineage automatically. Lineage is a visualization showing how data flows through transformations. It shows raw tables feeding into staging models, staging models feeding into intermediate models, intermediate models feeding into marts. If you change a raw table, lineage shows all downstream models that might be affected. This enables safe changes: before modifying something, you understand the impact.
Documentation and lineage are often overlooked but provide immense value. Teams that invest in them move faster because everyone understands the data. Code review is easier when lineage shows the transformation logic. Debugging is faster when documentation explains what each model does. These are non-sexy improvements that nonetheless have high impact on team productivity.
dbt Core is the open-source command-line tool. You install it, write dbt projects, and run dbt commands manually or from your own orchestration tools. dbt Core is free and gives you complete control. You can use it with any Git repository, any orchestration tool, and any workflow that suits your team. The downside is that you handle everything: scheduling, monitoring, notifications, and CI/CD integration.
dbt Cloud is a paid web platform built by the dbt creators. It adds a web interface for writing dbt code, job scheduling (run transformations on a schedule), CI/CD integration (run tests on pull requests), environment management (development, staging, production), and notifications. dbt Cloud's web IDE is simpler than local development and integrated with Git. Scheduling is simpler: define a job and schedule it without writing orchestration code. CI/CD integration means pull requests automatically run tests, catching issues before merging.
Small teams often use dbt Core: one or two people writing transformations, pushing to Git, running dbt with Airflow or cron jobs. It is simple and free. Larger teams often move to dbt Cloud: multiple people writing code, needing coordination, wanting alerts and monitoring. dbt Cloud also handles updates and maintenance, freeing your team to focus on transformations.
The choice is pragmatic. If you are comfortable with the command line and already have job scheduling, dbt Core is sufficient. If you want a managed solution and can budget for it, dbt Cloud simplifies operations. Most teams can start with dbt Core and migrate to dbt Cloud later if needed. dbt projects are portable: you can move between Core and Cloud without rewriting.
A dbt project is a directory with configuration and model files. The essential file is dbt_project.yml, which configures the project: name, version, which profile to use, and model defaults. profiles.yml, kept outside the project (typically in ~/.dbt/), contains database connection information: which warehouse, credentials, and target schema. Keeping connection details in a separate file outside the repository prevents secrets from ending up in version control.
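A minimal sketch of a dbt_project.yml, with a hypothetical project name and layer defaults (the credentials themselves stay in profiles.yml, outside the repository):

```yaml
# dbt_project.yml (hypothetical minimal example)
name: analytics
version: "1.0.0"
profile: analytics          # points at an entry in profiles.yml

models:
  analytics:
    staging:
      +materialized: view   # default for the staging layer
    marts:
      +materialized: table  # default for the marts layer
```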
Models are organized in subdirectories. A common structure is: models/staging for models that clean raw data, models/intermediate for models that combine multiple sources, models/marts for final tables for consumption. This layering makes data flow clear. Raw data enters staging where it is cleaned, flows through intermediate where it is combined, and exits as marts ready for consumption.
Models are named using conventions. Staging models start with stg_: stg_customers.sql. Fact tables start with fct_: fct_orders.sql. Dimension tables start with dim_: dim_products.sql. These conventions make the purpose of each model clear. Tests go in tests/ or are defined inline in model YAML. Macros (reusable SQL functions) go in macros/.
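Putting the layering and naming conventions together, a small project might be laid out like this (the file names are hypothetical):

```
models/
  staging/
    stg_customers.sql
    stg_orders.sql
  intermediate/
    int_customer_orders.sql
  marts/
    fct_orders.sql
    dim_products.sql
tests/
  assert_order_amounts_are_positive.sql
macros/
dbt_project.yml
```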
As projects grow, organization matters. A project with hundreds of models needs clear structure. A project with ten models can be simpler. Start with basic structure (models/ and tests/ directories), then add sophistication as you grow. A well-organized project is easy to navigate and modify. A disorganized project becomes hard to work with. Taking time early to establish structure pays dividends later.
dbt is one piece of a data platform. It assumes data is already in your warehouse. Something must extract data from sources and load it there. This might be Fivetran, Airbyte, custom Python scripts, or managed data integration tools. dbt does not handle ingestion; it handles transformation once data is loaded.
After dbt transforms data, downstream tools consume it. Analysts query tables with SQL editors or BI tools like Tableau. Applications read data via APIs. Data scientists use tables for training models. dbt produces the clean, reliable tables that everyone depends on. This makes dbt central to the data pipeline. If dbt works well, everyone downstream has good data. If dbt fails, everything downstream fails.
dbt integrates with orchestration tools. Airflow, Prefect, Dagster, and their managed counterparts (Prefect Cloud, Astronomer's hosted Airflow) can run dbt jobs. This coordination allows complex workflows: perhaps extracting data with Fivetran, transforming with dbt, then syncing results to downstream systems. dbt Cloud provides its own orchestration but still integrates with external tools for complex workflows.
The dbt ecosystem includes complementary tools. dbt packages are reusable dbt code libraries: common transformations and macros that many organizations need. Some provide general-purpose macros and extra generic tests (dbt_utils), others provide ready-made models for specific tools (Shopify and Salesforce packages). Using packages reduces the code you need to write. The broader ecosystem also includes tools that build on dbt's outputs: monitoring tools that alert when tests fail, lineage tools that visualize data flow, and governance tools that enforce standards.
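Packages are declared in a packages.yml file and installed with dbt deps. A sketch pulling in dbt_utils (the version shown is only an example):

```yaml
# packages.yml (hypothetical example)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```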
Getting modeling conventions right is harder than it sounds. Teams often struggle with deciding what belongs in staging, intermediate, and marts layers. Models grow without clear organization. After six months, no one remembers why a model exists or what it does. The solution is taking time upfront to establish clear conventions and sticking to them. Documenting conventions in your team's dbt guide prevents confusion.
Test coverage is often insufficient. Teams write transformations but skip testing, thinking they will add tests later. Later never comes. Starting without tests and adding them retrospectively is hard because you are testing production code. The solution is building testing discipline from day one. Even simple tests (unique and not-null on key columns) catch many issues. Tests are not overhead; they are insurance.
Documentation falls behind quickly. You write documentation when building a model, but when you modify it weeks later, you forget to update docs. Documentation becomes outdated and unreliable, then people stop reading it. The solution is making documentation easy: small descriptions in YAML right alongside code, not separate documents. Auto-generated documentation from code is always current.
Performance issues emerge as projects grow. A model that ran in seconds when the table was small now takes minutes as data accumulates. The solution is monitoring performance and optimizing proactively. Sometimes you need incremental models. Sometimes warehouse configuration needs tuning. Understanding your warehouse's cost and performance characteristics helps prevent surprises.
Over-engineering is a trap. Teams build complex multi-layer architectures with dozens of intermediate models when a simpler approach would work. This makes projects hard to understand and maintain. Start simple, add complexity only when it provides clear benefits. A project with five clear models beats one with fifteen confusing models.
dbt stands for data build tool. It is a command-line tool and framework for writing, testing, and deploying data transformations. dbt solves the problem of transformation complexity in data pipelines. Before dbt, data transformations were scattered: some in stored procedures in the database, some in Python scripts, some in ETL tools. This was hard to maintain, impossible to test, and difficult to version control.
dbt brings software engineering discipline to data transformation. You write transformations as SQL SELECT statements and organize them as a project with version control. dbt handles running them in the right order, testing outputs, and documenting what you have built. The phrase 'the T in ELT' describes dbt's role: it handles transformation in ELT (Extract, Load, Transform) pipelines. Data is extracted and loaded into the warehouse, then dbt transforms it.
dbt has had tremendous impact on how organizations approach data transformation. Before dbt, data pipelines were often fragile and undocumented. After adopting dbt, teams move faster, ship higher-quality data, and collaborate more effectively. This is why dbt has grown so rapidly in the industry.
A dbt model is a SQL SELECT statement saved as a file. That is it. A file named customers.sql containing SELECT * FROM raw_customers WHERE deleted = false is a model. When you run dbt, it converts the SELECT statement into a CREATE TABLE or CREATE VIEW statement and executes it in your warehouse.
Models are organized in a project directory structure with a dbt_project.yml configuration file. Models reference other models: a model might SELECT FROM {{ref('customers')}} to reference the customers model. The ref() function creates dependencies automatically. If customers depends on orders, and orders depends on raw_orders, dbt figures out the right order to run them.
dbt generates lineage showing how data flows through models. This makes understanding the transformation pipeline easy. Models can be tables or views. Tables are persistent, views are computed on query. Views are faster to create but slower to query. The choice depends on downstream usage. Start with views, convert to tables only if performance matters.
dbt tests are data quality checks. A test verifies that data meets expectations. dbt includes generic tests: unique (a column has no duplicates), not_null (a column has no nulls), relationships (a foreign key references an existing value), and accepted_values (a column only contains certain values). You define tests in YAML files.
A simple test might be: a column 'id' is unique and not null. dbt runs tests after transformations and fails if tests fail. This prevents bad data from reaching downstream consumers. Tests are essential for reliability. Without tests, bad data silently flows downstream and causes wrong decisions. With tests, you catch issues early.
dbt also supports custom tests: you write SQL that returns rows that violate a condition. If the query returns any rows, the test fails. Tests are documentation: they clarify what good data looks like. A test saying 'amount must be positive' documents a business rule. Tests encourage quality thinking: as you build transformations, you think about what could go wrong and write tests for it.
Stored procedures are code written in the database's language (PL/SQL, T-SQL, etc.) and stored in the database. They are useful but have drawbacks. Stored procedures are hard to version control: code that lives inside the database does not fit naturally into Git history. They are hard to test: testing stored procedures requires setting up database state, which is complex. And they are tied to one database: moving pipelines to a different database requires rewriting procedures.
dbt solves these problems. dbt code is in files, version controlled with Git. Testing is built-in. dbt supports multiple warehouses through adapters: Snowflake, BigQuery, Redshift, Postgres, and others. Much of the same dbt code works across them (with minor SQL dialect differences). Development is easier: write SQL, run dbt, see results. dbt is also collaborative: code review, CI/CD pipelines, and team workflows work naturally.
Stored procedures are still useful for complex procedural logic, but for standard transformations dbt is better, and most teams have moved their transformation work from stored procedures to dbt. They are likely to keep receding from analytics stacks as dbt-style workflows become the standard.
dbt Core is the open-source command-line tool. You install it, write dbt projects, and run dbt commands. dbt Core is free. dbt Cloud is a paid web interface and platform built on dbt Core. dbt Cloud adds features: a web IDE for writing dbt, job scheduling (run dbt on a schedule), CI/CD integration (run tests on pull requests), notifications (Slack alerts), environment management (dev, staging, production), and lineage visualization in the web UI.
If you are comfortable with the command line and already have job scheduling (Airflow, Prefect), dbt Core alone is sufficient. If you want a simpler, more integrated experience, dbt Cloud is convenient. Most small teams use dbt Core. Most enterprises eventually move to dbt Cloud for the operational features. The choice depends on team preference and operational setup. Both are valid. dbt Core projects can be moved to dbt Cloud and vice versa, so you are not locked in.
A reasonable approach is starting with dbt Core (free) and migrating to dbt Cloud once your team and project mature. You learn dbt first, then decide if Cloud's features are worth the cost. Many teams find the cost justified by reduced operational burden.
dbt handles dependencies automatically. Models reference other models using ref(). dbt builds a dependency graph: if model A depends on model B, B must run before A. dbt figures this out and runs models in the right order. This is simpler than writing orchestration logic manually. When you run dbt run, it parses all models, builds a dependency graph, and executes models in parallel where possible.
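In practice you rarely specify order by hand: you invoke dbt and, when needed, use node selection to build a subset of the graph. A few illustrative commands (the model names are hypothetical):

```
dbt run                           # build every model in dependency order
dbt run --select customers+       # build customers and everything downstream of it
dbt run --select +fct_orders      # build fct_orders and everything it depends on
dbt test --select stg_customers   # run only the tests defined on stg_customers
```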
Within the dependency graph, models with no dependencies run first, then the models that depend on them; running independent branches in parallel keeps total run time down. dbt also supports sources: the source() function references raw data that dbt does not create. If a model reads from a raw table, you declare that table as a source. This documents the dependency and enables freshness checking (alerting if raw data has not been updated recently).
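A sketch of a source declaration with freshness checking, using hypothetical source and table names; a model would then read from it with {{ source('app_db', 'raw_orders') }}, and dbt source freshness runs the checks:

```yaml
# models/staging/sources.yml (hypothetical example)
version: 2

sources:
  - name: app_db
    schema: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_orders
```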
For complex orchestration (running dbt, then an API call, then another dbt job), you use external orchestration: Airflow, Prefect, dbt Cloud Jobs, etc. These tools coordinate multiple steps. dbt handles the data transformation part; orchestration tools handle overall workflows. This division of responsibility is clean: dbt focuses on SQL transformation, orchestration tools focus on coordination.
dbt generates documentation automatically. You write YAML files describing your models, columns, and tests. dbt compiles this into a website showing what data you have, where it comes from, what it means, and what tests ensure quality. This documentation is crucial: it answers questions like 'what is this field', 'where did this number come from', and 'is this data fresh'.
With dbt, documentation is generated from code, not written separately, so it stays current; documentation maintained as separate documents inevitably falls out of date. dbt lineage shows how data flows through transformations: a visualization of which raw tables feed into which models and which downstream models use them. Lineage answers 'if this source changes, what models are affected?', which is critical for understanding impact before making changes.
Both documentation and lineage reduce confusion and enable safe changes. Teams that invest in documentation with dbt move faster because everyone understands the data. This is one of dbt's most underrated features. People focus on transformations, but documentation and lineage are where the value is.
A dbt project is a directory with configuration and model files. The standard structure has models/ subdirectories for different layers: staging, marts, intermediate. Staging models clean raw data, renaming columns and handling nulls. Intermediate models combine multiple sources. Marts are final tables for consumption. This layering makes data pipelines clear: raw data flows through stages, becoming progressively cleaner and more useful.
Models are named descriptively: stg_customers.sql is a staging model for customers, fct_orders.sql is a fact table, dim_products.sql is a dimension. These names follow conventions that make understanding projects easier. Macros (reusable SQL functions) go in macros/. Tests go in tests/ or inline in YAML. Configurations go in dbt_project.yml and profiles.yml.
Most projects are much simpler than this, with just a models/ directory. As you grow, structure becomes important. A well-organized project is easy to navigate and modify. A disorganized project becomes chaos. Start simple, add structure as you grow. Document your conventions so team members follow the same patterns. This prevents wild inconsistency.