Data modeling is the work of deciding how data is structured, named, and related before anyone queries it. It is the difference between a warehouse where an analyst can answer a question in five minutes and one where they spend two days figuring out which of four "revenue" tables is the right one. The model is the shape of your data and the agreement about what that shape means. Get it right and the warehouse stays usable as it grows; get it wrong and every new question gets harder than the last.
The discipline has a long history in transactional databases, where normalization theory dominated for decades. The goal there was to eliminate redundancy so that every fact lived in exactly one place, which kept updates consistent and storage tight. That logic was right for systems that write constantly and read narrowly. It transferred badly to analytics, where you read broadly, write in batches, and care more about query simplicity than storage efficiency. A lot of bad warehouse design comes from applying transactional modeling instincts to an analytical problem.
Analytics developed its own approaches in response. Dimensional modeling, formalized by Ralph Kimball, organizes data into facts (the events and measurements) and dimensions (the descriptive context), arranged so that business questions map cleanly onto joins. Data Vault offered a different answer aimed at auditability and change tracking in large enterprises. And the cheap storage and columnar engines of the cloud era made wide, denormalized tables practical, giving rise to the one big table pattern that earlier theory would have called wasteful.
By 2026 the modeling conversation happens largely inside the transformation layer, with dbt as the common tool. Teams model in SQL, version it in git, test it, and document it, so the model is code rather than a diagram nobody updated. The old debate between normalized and dimensional has softened into a practical question: which approach makes the queries people actually run simple, fast, and trustworthy, given how cheap storage and compute now are.
This page walks through the real modeling approaches, where each one fits, and the failure modes that show up when teams pick a model for the wrong reasons. The tools and engines keep changing. The core job, structuring data so that the questions people ask map cleanly onto it, has not.
Dimensional modeling is the workhorse of analytics. You separate the things that happen, called facts, from the context that describes them, called dimensions. An order is a fact; the customer, product, store, and date are dimensions. Queries then become "sum this measure from the fact table, grouped by these dimension attributes," which is exactly how business questions are phrased. The star schema, one fact table surrounded by dimension tables, is the canonical shape, and it is popular because it is both intuitive for analysts and efficient for query engines.
Normalized modeling, the third normal form taught in database courses, splits data into many narrow tables so each fact is stored once. This is correct and valuable in the systems that capture the data, where you are constantly inserting and updating and cannot afford inconsistency. In the warehouse it tends to produce queries with a dozen joins that are slow to write and slow to run. Some enterprises keep a normalized layer as a clean staging area and then build dimensional models on top of it, which gets the consistency benefit without forcing analysts to live in the normalized world.
Data Vault is a specialized approach for large, regulated enterprises that need to track every change and load from many source systems. It splits data into hubs, links, and satellites, which makes it highly auditable and resilient to source changes, at the cost of being verbose and hard for analysts to query directly. Teams that use it almost always build a dimensional presentation layer on top, treating the Vault as an integration and history layer rather than a query layer.
One big table is the pattern cheap storage made viable. Instead of joining facts to dimensions at query time, you pre-join everything into a single wide table with hundreds of columns. Queries become trivially simple, with no joins, which both analysts and BI tools love. The cost is duplication and the difficulty of keeping a wide denormalized table consistent when the underlying data changes. It works best for well-understood, stable subject areas where the simplicity payoff is large and the change rate is low.
Reach for dimensional modeling as the default for most analytics. It is the right balance of query simplicity, performance, and flexibility for the broad range of questions a business asks. Analysts understand it, BI tools are built around it, and it accommodates new questions without a redesign. Unless you have a specific reason to do otherwise, facts and dimensions are where you start, and most teams never need anything more exotic.
Keep a normalized staging layer when you load from messy or numerous sources and need a clean, deduplicated foundation before building presentation models. The normalized layer is not what analysts query; it is the trustworthy base that dimensional models are built from. This two-layer pattern, clean normalized staging feeding dimensional marts, is common and sensible in larger shops, and tools like dbt make it natural to express as a chain of transformations.
Use Data Vault only when the requirements actually call for it: many source systems, strict auditability, heavy regulatory pressure to track every change, and an enterprise large enough to absorb the complexity. For most teams it is overkill, and adopting it because it sounds rigorous, rather than because the requirements demand it, buys a lot of pain for benefits you will not use. The verbosity that makes it auditable also makes it slow to work with.
Choose one big table for stable, high-traffic subject areas where the queries are well understood and the join simplicity pays off repeatedly. A wide table powering a heavily used dashboard, rebuilt on a schedule, can be faster and simpler than the equivalent star schema. The trade you are accepting is flexibility: when a new question needs a column the table does not have, you rebuild it. For stable reporting that is a fine trade; for exploratory analysis it is not.
The biggest practical shift in modeling over the last several years is that it moved into version-controlled code. With dbt, each model is a SQL file in a git repository. The transformation that builds a table is the definition of that table, so the model and its implementation cannot drift apart the way a schema diagram and a database always did. This sounds mundane and is actually the thing that made disciplined modeling achievable for normal teams rather than just enterprises with dedicated architects.
Testing is what code-based modeling unlocks. You can assert that a primary key is unique and not null, that a foreign key actually matches the dimension it points to, that a status column only contains expected values. These tests run every time the models build, so a source change that would have silently corrupted a join gets caught before anyone queries bad data. Modeling without tests means you find out the model is broken when a number looks wrong in a board meeting.
Documentation and lineage come along for the ride. Because the models are code with declared dependencies, the tooling can generate a lineage graph showing how each table is built from its sources, and surface the description of each column. An analyst trying to understand which revenue table to trust can trace it back to source rather than asking around. This turns the model from tribal knowledge into something a new hire can read.
The discipline this enables matters more than the specific tool. Models reviewed in pull requests, tested on every build, documented inline, and deployed through CI behave like software, which means they can grow without rotting. The alternative, a pile of hand-written SQL views that nobody fully understands and nobody dares change, is how warehouses become unmaintainable. The tool may change; treating the model as code is the durable part.
A technically perfect schema is useless if people disagree about what the columns mean. The hardest part of modeling is rarely the table shapes; it is getting the organization to agree on definitions. What counts as an active user. Whether revenue is booked or recognized. When a customer is considered churned. These are business decisions encoded in the model, and when they are left implicit, every team computes them slightly differently and the numbers never reconcile.
Naming is where this agreement becomes visible. Consistent, clear names, a documented convention for what a column called customer\_id contains versus account\_id, the difference between created\_at and updated\_at applied the same way everywhere, all of this is what lets someone use a table they did not build. Sloppy or inconsistent naming forces everyone to reverse-engineer meaning from the data, which is slow and error-prone. Good naming is unglamorous and pays off every single day.
The semantic layer is the newer attempt to centralize meaning above the physical model. Instead of each BI tool and each analyst encoding "monthly recurring revenue" in their own query, the definition lives in one place that everything queries through. Tools like dbt's semantic layer, Cube, and the metrics features in BI platforms aim at this. The promise is that a metric is defined once and computed the same way everywhere, which is the only real cure for the everlasting problem of two dashboards showing different numbers for the same thing.
Whether or not you adopt a formal semantic layer, the underlying need is the same: one agreed definition per concept, enforced somewhere rather than re-derived everywhere. The model carries the structure; the definitions carry the meaning; and a model with clean structure but contested meaning will still produce arguments about whose number is right. Modeling that ignores the human agreement problem solves the easy half and leaves the hard half open.
The most common failure is modeling for storage efficiency in a world where storage is cheap and compute is the constraint. Engineers trained on transactional databases normalize everything out of habit, producing analytical schemas that require ten joins for a simple question. The queries are slow to write, slow to run, and intimidating to analysts. The fix is to denormalize deliberately for the read patterns that matter, accepting duplication that costs almost nothing in modern storage.
The opposite failure is denormalizing everything into giant tables with no discipline, so the warehouse becomes a pile of overlapping wide tables that nobody can reconcile. Each new dashboard gets its own bespoke table, definitions drift between them, and within a year there are six tables that all claim to hold orders and disagree about the totals. One big table is a fine pattern applied deliberately; it is a disaster applied as a default by people avoiding the work of modeling.
Modeling around a specific report rather than the underlying business process locks you into that report. The moment someone asks a slightly different question, the model cannot answer it and you start over. Good dimensional models capture the business process, the order, the shipment, the payment, at the grain it actually happens, so that many questions can be asked of the same model. Modeling to the question instead of the process is a trap that feels efficient and ages badly.
Finally, treating modeling as a one-time design rather than a living system guarantees decay. The business changes, new sources arrive, definitions shift, and a model that is never revisited slowly stops matching reality. Numbers quietly go wrong, trust erodes, and people start exporting to spreadsheets because they no longer believe the warehouse. The model needs the same ongoing ownership as any other production system, or it rots like any other unmaintained code.
A fact is a thing that happened or was measured: an order, a payment, a page view, with numeric measures like amount or quantity. A dimension is the descriptive context you slice those facts by: the customer, product, store, date. In a query you aggregate facts and group by dimension attributes. The split exists because it maps directly onto how business questions are asked, which makes both the model and the queries intuitive.
Usually not the way you would normalize a transactional database. Heavy normalization optimizes for consistent writes and tight storage, which are not the constraints in analytics. Most warehouses are better served by dimensional models with deliberate denormalization for query simplicity. A normalized staging layer feeding dimensional marts is a reasonable middle path, but analysts should generally query the dimensional layer, not a fully normalized one.
Not when used deliberately. For stable, well-understood subject areas with heavy query traffic, a wide pre-joined table removes joins and makes queries simple and fast, which both analysts and BI tools benefit from. It becomes a problem when used as a default to avoid modeling, producing many overlapping tables with drifting definitions. The pattern is fine; the lack of discipline around it is what causes trouble.
It turns models into version-controlled SQL code that is tested, documented, and deployed like software. The transformation that builds a table is the table's definition, so model and implementation cannot drift. Tests catch broken keys and bad joins before anyone queries them, and lineage shows how each table is built. This is what made disciplined, maintainable modeling practical for ordinary teams rather than only large enterprises with dedicated architects.
Model facts at the lowest grain the business process actually happens, for example one row per order line rather than one row per order summary. Fine grain lets you aggregate up to answer many questions; coarse grain throws away detail you can never recover. The grain is one of the first decisions to make and one of the most consequential, because changing it later usually means rebuilding the model.
When you have many source systems to integrate, strong auditability and change-tracking requirements, regulatory pressure, and an organization large enough to absorb the added complexity. In those conditions its resilience to source change and full history are genuinely valuable. For most teams it is overkill, and the verbosity that makes it auditable also makes it slow to build and query, so it should be a deliberate requirement-driven choice, not a default.
A semantic layer centralizes metric definitions above the physical tables, so a metric like monthly recurring revenue is defined once and computed the same way everywhere instead of being re-encoded in each dashboard. You need one when inconsistent metric definitions are causing different tools to report different numbers for the same concept. Even without a formal semantic layer tool, you need the underlying discipline of one agreed definition per metric, enforced somewhere.
Treat it as a production system with an owner, version control, and tests, and revisit it as the business changes rather than freezing the original design. Run tests on every build so source changes that break the model are caught early. Keep definitions and naming documented so knowledge does not live only in people's heads. The model decays the same way any unmaintained code decays, so the cure is the same: ongoing ownership and engineering discipline.
A two-layer approach is common and usually worth it for anything beyond a small setup. A staging layer cleans, standardizes, and deduplicates raw source data into a trustworthy base, and a presentation layer builds the dimensional models analysts actually query on top of it. This separates the messy work of integrating sources from the work of shaping data for consumption, which keeps both cleaner. For a very small warehouse with few sources, a single layer can be fine, but the separation pays off as sources and consumers grow.