Data Modeling — A Practical Guide

What Data Modeling Is

Data modeling is the process of defining how data is structured, related, and stored — before or while building the systems that use it. A data model describes what entities exist (customers, orders, products), what attributes each entity has, how entities relate to each other, and what rules govern the data. Done well, a data model makes queries fast, prevents duplicate or inconsistent data, and gives engineers and analysts a shared vocabulary for talking about the domain. Done poorly — or skipped — it creates a system where a simple question like “how much did we sell last month?” requires reconciling three different definitions of “sale” from three different tables.

Data modeling is not exclusively a data warehouse concern. Application databases (PostgreSQL, MySQL, MongoDB) need modeling too. But the term is most commonly used in analytics and data engineering contexts, where the goal is not just storing data but making it queryable for analysis and reporting.

Three Abstraction Levels

Data models are typically built at three levels of abstraction, each serving a different audience.

Conceptual Model

The conceptual model captures the high-level entities and their relationships without regard to technology or implementation. It answers: what are the key things in this domain and how do they relate? A conceptual model for an e-commerce system might show Customer, Order, Product, and Supplier as entities, with relationships like “a Customer places many Orders” and “an Order contains many Products.” This model is for business stakeholders and architects, not engineers. It fits on a whiteboard.

Logical Model

The logical model adds detail to the conceptual model: it defines all attributes, specifies data types in a technology-neutral way (string, integer, boolean, date), identifies primary keys, and makes relationships explicit (one-to-many, many-to-many). A logical model is complete enough to review with a domain expert and check for accuracy, but it doesn’t reference specific database technology. Normalization decisions — whether to split an address into separate fields or keep it as one string — happen here.

Physical Model

The physical model is the implementation: actual SQL DDL, column types specific to the target database (VARCHAR(255), BIGINT, TIMESTAMP WITH TIME ZONE), indexes, partitioning, clustering, and constraints. It’s the model that gets deployed. Changes at this level have real performance and storage implications. A physical model review should include a query engineer, not just the data modeler.

Key Modeling Approaches

Entity-Relationship Modeling (OLTP)

ER modeling is standard for operational databases — the ones powering applications: e-commerce platforms, CRMs, ERP systems, mobile apps. The goal is to eliminate data redundancy (normalization) so that updates to a piece of data only happen in one place. A customer’s email address appears once in the customers table; every other table that needs it joins to customers. Third Normal Form (3NF) is the typical target. The resulting schema has many tables with precise relationships, optimised for writes and transactional consistency. It is not optimised for analytical queries that need to join 12 tables.

Dimensional Modeling / Kimball Star Schema

Kimball’s dimensional modeling is designed for analytical queries — the opposite of OLTP. It deliberately denormalizes data into fact tables (measurements: sales amounts, event counts, durations) and dimension tables (context: who, what, when, where). A star schema has one central fact table surrounded by dimension tables, connected by foreign keys. Querying is fast and intuitive because most questions require joining only one or two dimensions to the fact table. The tradeoff is data duplication in dimensions and more complex ETL to build the model from source systems. Star schemas are the standard approach for data warehouses built on Snowflake, BigQuery, and Redshift.

Data Vault

Data Vault is a modeling methodology designed for enterprise data warehouses that ingest from many source systems over long periods. It separates structure into three object types: Hubs (unique business keys, e.g. customer ID), Links (relationships between hubs), and Satellites (descriptive attributes and their history). Data Vault excels at historical tracking, auditability, and parallel loading from multiple sources — you can add a new source system without restructuring existing tables. The cost is query complexity: answering a simple business question often requires joining multiple hubs, links, and satellites. Data Vault is most appropriate for regulated industries (finance, insurance) with complex, multi-source environments. It’s overkill for most analytics use cases.

dbt and the Modern Data Stack

dbt (data build tool) changed data modeling practice more than any tool since SQL. Before dbt, data transformations in the warehouse were managed with a mix of stored procedures, Python scripts, and manual SQL files, with no consistent testing, documentation, or version control. dbt made SQL transformations first-class software artifacts:

Version control. All transformation SQL lives in Git.
Dependency management. Models reference each other with {{ ref('model_name') }}, and dbt builds the correct execution order automatically.
Testing. Column-level tests (not null, unique, accepted values, referential integrity) run after each build. Failures are visible in CI.
Documentation. Column descriptions, model descriptions, and data lineage graphs are generated from the same YAML files that define tests.
Modularity. The staging → intermediate → mart layer pattern (popularised by the dbt community) gives teams a standard way to structure transformations: staging models clean and rename source data; intermediate models join and aggregate; marts are the final business-facing models.

dbt has become the default transformation tool for teams working on Snowflake, BigQuery, Redshift, and Databricks. A data engineer who doesn’t know dbt in 2025 is missing the core tool of the discipline.

When to Use Which Model

Scenario	Recommended Approach
Application database (OLTP, transactional)	ER modeling, 3NF
Analytics / BI reporting, single source	Star schema (Kimball)
Analytics / BI reporting, multi-source, regulated	Data Vault
ML feature store, exploratory analytics	Flexible staging layer on a data lake
Small team, fast iteration needed	dbt + star schema, defer complexity

The most common mistake is applying Data Vault to a use case that doesn’t justify its complexity. If you have one or two source systems and a team of three, a well-built star schema in dbt will outperform a Data Vault implementation in every dimension that matters: query speed, developer velocity, and maintainability.

Testing and Documentation

A data model without tests is a bet that nothing will ever change. It will change. Source systems change their schemas, business definitions evolve, and edge cases appear. Standard testing practice with dbt:

Not null tests on all columns that should never be null (foreign keys, date columns, amount fields).
Unique tests on primary keys and business keys.
Accepted values tests on status columns and categorical fields.
Referential integrity tests on foreign keys (does every order have a customer that exists in the customer dimension?).
Custom business logic tests for domain-specific rules (an order’s total should equal the sum of its line items).

Documentation should be co-located with the model, not in a separate wiki that goes stale. dbt’s schema.yml files support column-level descriptions, model descriptions, and owner metadata. Tools like dbt Docs (built in) and Metaphor, Atlan, or DataHub (external catalogs) build data dictionaries from these files.

Common Modeling Mistakes

Skipping the logical model. Teams jump straight to physical implementation and discover three months later that they disagree on what a “customer” is. The logical model surfaces these disagreements early.

Over-normalizing the analytics layer. What works for OLTP is painful for analytics. A 3NF schema in a data warehouse means 10-table joins for simple business questions. Dimensional modeling exists for a reason.

Fact tables without a grain definition. A fact table’s grain is the most specific thing one row represents (one order line, one daily summary per customer, one event). Without an explicit grain decision documented, different engineers build the table at different granularities and the results don’t reconcile.

Dimension tables with no history. If a customer changes their segment, does the sales history reflect the old segment or the new one? Slowly changing dimensions (SCD Type 2) track historical attribute values with start and end dates. Skipping this makes historical analysis unreliable.

No one owns the model. Data models drift when no one is accountable for their accuracy. Assign ownership — a named person or team — and include data model review in the definition of done for any feature that changes source data.