Should You Build a Data Lake, a Data Warehouse, or a Data Lakehouse?
You’re planning a data platform from scratch. Your CTO friend says you need a data lake. Your data consultant swears by data warehouses. Your vendor just launched a lakehouse product. Your board wants to know what you’ll build and when.
Everyone has opinions. Almost nobody has the right constraints to match them to your situation.
The answer isn’t actually about the architecture. It’s about what you’re going to do with the data and who has to maintain it.
What These Terms Actually Mean
Let’s start with definitions, because half the confusion comes from the fact that people use these words differently.
A data lake is fundamentally a storage system. It’s usually cloud object storage (S3, GCS, Azure Blob Storage) where you dump data in whatever format it arrives: JSON, Parquet, CSV, unstructured logs, compressed files. The appeal is simplicity and cost: storage is cheap, ingestion is simple, and you’re not forcing anyone to transform data upfront. The tradeoff is that querying a data lake is slow. You’re reading raw files from object storage, which isn’t optimized for fast queries.
A data warehouse is the opposite. It’s optimized for querying. Data goes through an ETL process where it’s cleaned, transformed, and loaded into structured tables. Your queries run fast because the warehouse is designed for analytical access patterns. Snowflake, Redshift, BigQuery. These are data warehouses. The tradeoff is that transformation is manual and expensive. You need engineers to write ETL pipelines. You’re paying for compute during that process. And you’re limited by the structure of the schema your engineers designed.
A data lakehouse is the new compromise. It’s a layer of metadata and indexing on top of a data lake that makes it queryable like a warehouse. Apache Iceberg, Delta Lake, Apache Hudi. These are lakehouse formats. The idea is: dump data cheaply into storage, add metadata that makes it queryable, skip the expensive transformation step. The reality is more complex, as we’ll discuss.
That’s the technical definition. But you don’t actually care about the technology. You care about whether it fits your situation.
The Real Decision: Who Owns the Transformation?
Here’s what matters: In a data lake, transformation happens downstream. In a data warehouse, transformation happens upfront. In a lakehouse, you’re hoping the transformation problem goes away but it doesn’t.
Let’s be concrete. Your mobile app generates 50 million events a day. Each event is a JSON blob with 80+ fields. Some fields are useful. Some are noise. Some are user IDs in three different formats. Some are null 40% of the time.
If you build a data lake: You dump all 50 million events as JSON into S3 every day. Storage costs you $200/month. Then when someone wants to analyze retention, they query the raw events. The query has to parse JSON, handle the multiple ID formats, deal with nulls, and filter to only the fields they care about. The query takes 45 minutes and costs $50 in compute. They run it twice. Every data scientist and engineer has to solve the same JSON parsing problem independently.
If you build a data warehouse: You spend a week writing an ETL pipeline that parses the events, deduplicates them, standardizes the IDs, creates clean tables. You run it daily. Compute costs $500–1000/month to run the pipeline. But once it’s done, queries are instant and cheap. The same retention analysis runs in 2 seconds and costs 50 cents.
If you build a lakehouse: You dump the JSON into a table format (Iceberg or Delta), add metadata and indexes, and hope it behaves like a warehouse. In theory, you get warehouse-speed queries without the ETL cost. In practice, you still need to handle the schema evolution, the data quality issues, and the JSON parsing. The tooling is newer and less mature. You spend weeks fighting with it.
So the question is actually: Who’s going to own the transformation work, and when should they do it?
When Each Model Makes Sense
Choose a data lake if:
- You have low query volume and irregular access patterns. You’re not running thousands of queries a month. You’re doing occasional investigations.
- Your data structure is uncertain or evolving rapidly. You don’t yet know what questions you’ll ask, so you can’t pre-define the schema.
- Your users are technically sophisticated and can handle raw data. Your data scientists are fine parsing JSON and running complex queries.
- You have extreme cost sensitivity and can tolerate slow queries. You’re bootstrapping and need the cheapest possible infrastructure.
- You don’t have a dedicated data engineering team. You can’t maintain ETL pipelines.
This is maybe 10–15% of organizations. Usually early-stage companies or organizations with simple data needs.
Choose a data warehouse if:
- You have consistent analytical queries and known use cases. You run the same analyses repeatedly.
- Your data structure is stable. You know what your schema looks like and it won’t change dramatically.
- Your users expect fast queries and don’t want to learn complex querying. Your executives need dashboards that load in 2 seconds.
- You have a data engineering team (even a small one) to maintain pipelines. You can afford 1–2 engineers working on data infrastructure.
- Your query volume is high. You’re running hundreds or thousands of queries daily.
This is 70–80% of organizations. Most mid-market and larger companies.
Choose a lakehouse if:
- You want the cost of a data lake but the performance of a warehouse AND
- You have the engineering capacity to adopt newer tooling AND
- Your data is primarily tabular or semi-structured AND
- You’re willing to invest 3–6 months learning and troubleshooting a new paradigm
Honestly, this is 5–10% of organizations right now. The lakehouse promise is real, but the tooling maturity hasn’t caught up to the marketing.
The Hidden Cost Most People Miss
Whatever architecture you choose, the real cost isn’t storage. It’s not even compute. It’s data quality and pipeline maintenance.
In a data lake, you pay this cost downstream. Every team that touches the data has to understand it, clean it, fix it. You pay in analyst time and in bad decisions made on bad data.
In a data warehouse, you pay this cost upfront. Your ETL pipeline needs to handle edge cases, schema changes, and data quality issues. You have engineers maintaining this.
In a lakehouse, you’re hoping you don’t have to pay it at all. But you still do.
We worked with a financial services company that chose a data lake because they wanted to avoid the “complexity” of warehouse ETL. Eighteen months later, they had 45 TB of data in S3 and no one could trust any of it. Data scientists were writing custom parsers. Analysts were finding contradictory values in the same dataset. They eventually rebuilt as a warehouse. The same timeline and cost they’d “saved” upfront, plus two years of delay.
A Practical Approach
Here’s what we recommend:
Start with a data warehouse unless you have a specific, strong reason not to. You’re not giving up speed. Snowflake costs $2–5 per TB queried. BigQuery’s model-based pricing averages to $5–10 per TB. For most organizations, the cost is negligible compared to the ROI of fast, reliable analytics.
Size your engineering team accordingly. One engineer per 100 TB of structured data is a reasonable ratio. That engineer maintains pipelines, handles schema changes, and keeps things reliable.
Plan for 20% of data engineering time to go to data quality and maintenance. Not upfront. Ongoing. Build this into headcount and budget.
Use a lakehouse if your specific problem is: Semi-structured data (logs, events) that you need both for real-time streaming and historical batch analytics. Don’t use it because it sounds modern. Use it because the technology solves a specific pain point.
Start with managed services, not open source. Snowflake, BigQuery, and Databricks (as a managed platform) have teams that handle operational complexity. Open-source Iceberg or Delta requires you to manage that. Unless you love operational work, let someone else do it.
Where AI Agents Help
An AI agent can significantly reduce the operational burden:
- Pipeline monitoring: Detect data quality issues in real-time before they cause problems downstream.
- Schema evolution: Understand when your data structure changes and suggest pipeline updates.
- Cost optimization: Find underutilized tables, identify expensive queries, recommend storage tuning.
- Documentation: Automatically generate and maintain metadata on datasets, pipeline lineage, and usage patterns.
We’ve seen agents reduce data engineering operational overhead by 30–40%, which often means you can support 2–3x more data with the same team.
The Decision Framework
Ask yourself three questions:
Who’s going to query this data? Analysts who want instant answers → warehouse. Data scientists who can handle complexity → lake. Internal dashboards → warehouse. Ad hoc investigations → lake.
How much will pipelines cost to maintain? If you have ETL complexity that requires 2+ engineers full-time, consider a lakehouse. If it’s 1 engineer part-time, warehouse is cheaper.
What’s the cost of stale or bad data? If bad analytics cost you millions, invest in warehouse quality. If you can tolerate imperfection, lake is fine.
Most organizations should pick a data warehouse. Most of the ones that pick a data lake wish they hadn’t. And most of the ones considering a lakehouse should start with a warehouse and revisit in two years.
Pick the right tool for your constraints, not for the talk-track.