Tech · 16 min read

You Probably Don't Have a Big Data Problem

Most companies drowning in data are not drowning because of volume. They are drowning because of bad pipelines, no governance, and dashboards nobody reads. A reality check on big data, modern analytics, and what actually moves the needle.

Let me tell you about a company I once consulted for. They had a “big data problem.” At least, that is what the VP of Engineering told me during our first call. They had hired a team of five data engineers, spun up a Hadoop cluster on AWS EMR, licensed Confluent Kafka, and were in the process of evaluating Spark Streaming versus Flink for their “real-time analytics pipeline.”

Their total data volume? 50 gigabytes. Not per day. Total. Fifty gigabytes, sitting in a PostgreSQL database that was running at about 4% CPU utilization.

I wish this were a rare story. It is not. It is practically an industry archetype.


The Big Data Delusion

The term “big data” has done more damage to engineering budgets than any single technology trend in the last two decades. Somewhere around 2012, the entire industry collectively decided that every company was sitting on a goldmine of data and needed distributed computing to unlock it. Vendors were happy to oblige. Consultancies were thrilled. Resume-driven development did the rest.

Here is the uncomfortable truth: most companies do not have a big data problem. They have a bad data problem. Or a “no one is actually looking at the data” problem. Or, my personal favorite, a “we built the pipeline but never defined what question we are trying to answer” problem.

What “Big Data” Actually Means

Let us put some numbers on it. Big data, in the original sense, referred to datasets that exceeded the capacity of a single machine to store or process in a reasonable time. We are talking terabytes to petabytes, millions of events per second, or data so varied in structure that traditional databases genuinely could not handle it.

In 2026, a decent laptop has 32 GB of RAM and a 1 TB SSD. A single PostgreSQL instance on a reasonably sized cloud VM can handle tens of terabytes with proper indexing. DuckDB can tear through analytical queries on datasets measured in hundreds of gigabytes without breaking a sweat, running entirely in-process with zero infrastructure.

If your “big data” fits on a USB drive, you do not have a big data problem. You have a skills problem, or a tooling problem, or an architecture-astronaut problem. But not a big data problem.

The Three V’s and the One V That Actually Matters

You have probably heard the canonical definition: Volume, Velocity, Variety. The “three V’s” of big data. Some people add Veracity and Value, turning it into a marketing pentagram.

In practice, the V that kills most companies is none of these. It is Vanity. The vanity of believing your 200 MB CSV export from Salesforce requires a distributed lakehouse architecture. The vanity of thinking that because Netflix uses Spark, your B2B SaaS with 3,000 customers should too.

Netflix processes trillions of events per day across hundreds of microservices serving more than 300 million subscribers. You are not Netflix. I am not Netflix. And that is perfectly fine, because not being Netflix means we get to use simpler, cheaper, faster tools.


The Modern Data Stack: A Love-Hate Relationship

To be fair, the modern data stack is genuinely impressive. The ecosystem has matured enormously, and the tooling available today would have seemed like science fiction a decade ago.

Here is what a typical “modern data stack” looks like:

  • Ingestion: Fivetran, Airbyte, or Stitch pull data from your SaaS tools, databases, and APIs into a central warehouse.
  • Warehousing: Snowflake, BigQuery, Databricks, or Redshift store your data in a columnar format optimized for analytics.
  • Transformation: dbt (data build tool) lets you write SQL-based transformations that are version-controlled, tested, and documented.
  • Orchestration: Airflow, Dagster, or Prefect schedule and monitor your data pipelines.
  • Visualization: Looker, Metabase, Tableau, or Power BI turn your transformed data into charts and dashboards.

This stack is powerful. It is also, for the median company, absurdly expensive and unnecessarily complex.

The Cost Problem Nobody Talks About

I have watched companies burn through six-figure annual Snowflake bills processing data that could live in a single PostgreSQL instance. The warehouse model — where you pay per compute second and per byte stored — is brilliant for Snowflake’s revenue and terrible for your CFO’s blood pressure.

A mid-size company running Fivetran ($1,000-5,000/month), Snowflake ($2,000-20,000/month), dbt Cloud ($100-500/month), and Looker ($3,000-5,000/month) is spending somewhere between roughly $73,000 and $366,000 a year on their data stack. That is before you count the salaries of the data engineers maintaining it.

For some companies, this spend is justified. For most, it is not even close. They are paying enterprise prices for what amounts to a glorified reporting layer on top of their CRM and product database.

When the Modern Stack Makes Sense

This is not an anti-modern-data-stack rant. The modern stack genuinely shines when:

  • You have dozens of data sources that need to be joined and reconciled
  • Your data volume is actually large (multiple terabytes, growing rapidly)
  • You have a dedicated data team that can maintain the infrastructure
  • Your business decisions are genuinely data-driven (not just data-decorated)

If those criteria describe your company, invest in the stack. If they do not, keep reading.


The SQL Renaissance

Here is the plot twist that nobody in the data engineering hype cycle wants to talk about: the most valuable data skill in 2026 is the same one it was in 1986. It is SQL.

SQL refuses to die. Every few years, someone declares it obsolete, and every few years, SQL proves them wrong. MapReduce was going to replace SQL. Then Spark was going to replace SQL. Then NoSQL was going to replace SQL. Now we have Spark SQL, and every NoSQL database has bolted on a SQL query layer, because it turns out that a declarative language for querying structured data is actually a really good idea.

The New SQL Movement

What has changed is where SQL runs and what it can do. The “New SQL” movement is less about replacing SQL and more about making it absurdly powerful in contexts where it previously struggled.

DuckDB is the poster child of this movement. It is an in-process analytical database — think SQLite, but for analytics instead of transactions. It can query Parquet files, CSV files, and even remote datasets over HTTP, all with standard SQL. No server. No cluster. No infrastructure.

Here is a real example. A company I worked with had a Spark job that ran nightly to aggregate user activity data. The job processed about 15 GB of Parquet files, ran on a cluster of four m5.xlarge instances, took 25 minutes, and cost roughly $3 per run.

Here is the DuckDB equivalent:

-- This replaces a 200-line PySpark job and a 4-node EMR cluster.
-- Runs locally in about 45 seconds on a laptop.
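-- (Assumes DuckDB's httpfs extension for s3:// access, which recent versions
-- load automatically, plus S3 credentials configured in the environment.)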

CREATE TABLE activity_summary AS
SELECT
    user_id,
    DATE_TRUNC('week', event_timestamp) AS week,
    COUNT(*) AS total_events,
    COUNT(DISTINCT session_id) AS unique_sessions,
    COUNT(*) FILTER (WHERE event_type = 'purchase') AS purchases,
    SUM(event_value) FILTER (WHERE event_type = 'purchase') AS revenue,
    AVG(session_duration_seconds) AS avg_session_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_response_time
FROM read_parquet('s3://data-lake/activity/year=2026/month=03/*.parquet',
                   hive_partitioning = true)
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY user_id, DATE_TRUNC('week', event_timestamp)
ORDER BY week DESC, revenue DESC;

Forty-five seconds. On a laptop. No cluster, no orchestration, no YARN resource negotiations, no dependency conflicts between PySpark and your system Python. Just a query.

ClickHouse occupies a different niche — it is a columnar database built for real-time analytics on massive datasets. If you actually have big data (billions of rows, sub-second query requirements), ClickHouse will often outperform Spark at a fraction of the cost. It is what companies like Cloudflare and Uber use for their analytics workloads.

Why SQL Wins

SQL wins for the same reason English is the lingua franca of international business: not because it is the best language, but because everyone already speaks it. Your data analysts know SQL. Your backend engineers know SQL. Your product managers can learn SQL in a weekend. Try saying that about PySpark.

The organizational leverage of using SQL as your primary data transformation language is enormous. When your transformation logic is written in SQL and managed through dbt, anyone in the company with basic SQL skills can read, understand, and even contribute to your data models. That is not true of a Python-heavy Spark pipeline or a custom Scala data processing framework.


Data Quality: The Silent Killer

I want you to do something right now. Go look at the most important dashboard in your company. The one your CEO checks every Monday morning, the one that drives quarterly planning, the one that everyone trusts.

Now ask yourself: when was the last time someone validated that the numbers on it are correct?

If you felt a chill just then, congratulations. You have just identified the actual problem with your data stack, and it has nothing to do with volume or velocity.

Your Dashboards Are Lying to You

Data quality issues are endemic, insidious, and almost always invisible until they cause a disaster. Here are the usual suspects:

Missing values that silently distort aggregations. Your revenue dashboard shows $2.3M for March. The actual number is $2.7M, but 15% of transactions have a NULL in the currency conversion field, so their converted amounts are silently ignored by the SUM() aggregation. Nobody notices because $2.3M is a plausible number.
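
A cheap guard is to count what the aggregation cannot see. A minimal sketch in plain SQL, with hypothetical table and column names:

-- Hypothetical check: how many rows contribute nothing to revenue because
-- the converted amount is NULL? Table and column names are illustrative.
SELECT
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE amount_usd IS NULL) AS rows_missing_revenue,
    ROUND(100.0 * COUNT(*) FILTER (WHERE amount_usd IS NULL) / COUNT(*), 1) AS pct_missing
FROM transactions
WHERE transaction_date >= DATE '2026-03-01'
  AND transaction_date <  DATE '2026-04-01';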

Schema drift. Your upstream application team renamed a field from user_email to email_address in their API. Your ingestion pipeline dutifully created a new column. Your transformation layer references the old column name. No errors are thrown — the old column just contains NULLs for all new records. Your “active users” metric quietly trends downward, and the data team spends two weeks investigating a “user engagement problem” that does not exist.
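
The fix is a check that runs with every pipeline execution and fails loudly. A dbt-style test is just a query that returns rows when something is wrong; this sketch uses hypothetical table and column names:

-- Hypothetical test: fail (return a row) if more than 1% of the records
-- loaded in the last day arrived without an email. Names are illustrative.
SELECT
    COUNT(*) FILTER (WHERE user_email IS NULL) AS null_emails,
    COUNT(*) AS total_rows
FROM raw_users
WHERE loaded_at >= CURRENT_DATE - INTERVAL '1 day'
HAVING COUNT(*) FILTER (WHERE user_email IS NULL) > 0.01 * COUNT(*);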

Timezone bugs. This one is a classic. Your application stores timestamps in UTC. Your BI tool renders in the user’s local timezone. Your dbt model converts to US Eastern. Your Salesforce data is in US Pacific. Your “revenue by day” report has transactions bleeding across date boundaries, and the numbers never quite match what Finance sees in their own systems.
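
The only durable fix is to store UTC and convert explicitly, once, into a single agreed-upon reporting timezone. A sketch in PostgreSQL syntax, with illustrative names:

-- Timestamps are stored in UTC; convert once, at query time, into the one
-- timezone the business reports in. Table and column names are illustrative.
SELECT
    (event_timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York')::date AS report_date,
    SUM(event_value) AS revenue
FROM events
WHERE event_type = 'purchase'
GROUP BY 1
ORDER BY 1;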

Duplicate records. Your Salesforce-to-warehouse sync occasionally retries on timeout, creating duplicate records. Your pipeline does not deduplicate. Every downstream metric is inflated by 0.3%, which is just enough to be invisible but just enough to make your quarter-over-quarter comparisons meaningless.
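
The mechanical fix is one window function; the hard part is remembering that you need it. A sketch, assuming a synced table with a stable primary key and a sync timestamp (names are illustrative):

-- Keep only the most recently synced copy of each record.
-- Table and column names are illustrative.
CREATE OR REPLACE VIEW opportunities_deduped AS
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY opportunity_id
            ORDER BY synced_at DESC
        ) AS rn
    FROM raw_opportunities
) ranked
WHERE rn = 1;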

The Data Swamp

You have heard of a “data lake.” A data swamp is what happens when you dump everything into a data lake without governance, cataloging, or quality controls. It is a data lake where nobody knows what data exists, nobody trusts any of it, and the only person who understood the schema left the company eight months ago.

Data swamps are depressingly common. The typical trajectory looks like this:

  1. Company decides to be “data-driven”
  2. Company buys a data lake / warehouse
  3. Every team dumps their data in with no standardization
  4. Data engineering team builds pipelines to “make sense of it”
  5. Nobody maintains the pipelines when source systems change
  6. Trust in the data erodes
  7. Teams go back to exporting CSVs from their own tools
  8. The data lake becomes a write-only archive that costs $4,000/month

What Actually Works: Contracts and Observability

The antidote to data quality chaos is boring, but effective: data contracts and data observability.

Data contracts are explicit agreements between data producers (the teams or systems that generate data) and data consumers (the teams or pipelines that use it). A data contract specifies: these are the fields, these are the types, these are the SLAs for freshness and completeness, and if any of this changes, we notify you before we deploy.
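
Part of a contract can live in the database itself as constraints, so a violation fails loudly at write time instead of silently downstream. A sketch with illustrative field names and checks:

-- A fragment of a data contract expressed as schema constraints.
-- Field names, types, and allowed values are illustrative.
CREATE TABLE events_contracted (
    event_id        UUID        NOT NULL PRIMARY KEY,
    user_id         BIGINT      NOT NULL,
    event_type      TEXT        NOT NULL
        CHECK (event_type IN ('page_view', 'signup', 'purchase')),
    event_value     NUMERIC     CHECK (event_value >= 0),
    event_timestamp TIMESTAMPTZ NOT NULL
);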

Data observability is monitoring for your data. Tools like Monte Carlo, Elementary, and Great Expectations let you set expectations on your data (this column should never be NULL, this metric should not change by more than 20% day-over-day) and alert you when those expectations are violated.
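
The “should not change by more than 20% day-over-day” expectation above is a few lines of SQL once the daily metric exists; observability tools mostly wrap checks like this in scheduling, lineage, and alerting. A sketch with illustrative names:

-- Hypothetical volatility check: returns a row (i.e. the check fails) when
-- yesterday's order count moved more than 20% versus the day before.
WITH daily AS (
    SELECT order_date, COUNT(*) AS orders
    FROM orders
    GROUP BY order_date
)
SELECT latest.order_date, latest.orders, previous.orders AS previous_orders
FROM daily AS latest
JOIN daily AS previous
  ON previous.order_date = latest.order_date - 1
WHERE latest.order_date = CURRENT_DATE - 1
  AND ABS(latest.orders - previous.orders) > 0.2 * previous.orders;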

Neither of these is glamorous. Neither will get a standing ovation at a conference. But they are the difference between a data stack that people trust and a very expensive storage bill.


The Dashboard Graveyard

Open your company’s BI tool right now. Scroll through the list of dashboards. Count how many were created, viewed a handful of times, and never touched again.

If your company is anything like the dozens I have seen, the answer is “most of them.” Welcome to the dashboard graveyard, where good intentions go to die in a pile of stale bar charts and broken filters.

How Dashboards Die

The lifecycle of the typical corporate dashboard goes like this:

  1. A stakeholder asks for “a dashboard that shows X”
  2. A data analyst spends two weeks building it
  3. The stakeholder looks at it once and says “this is great, but can you also add Y?”
  4. The analyst adds Y, and Z, and a dozen filters that nobody will ever use
  5. The stakeholder shows it in one meeting
  6. Nobody ever opens it again

The dashboard is not the problem. The process is the problem.

Start With Questions, Not Data

The most effective analytics teams I have worked with do not start by building dashboards. They start by asking: “What decision are you trying to make, and what information do you need to make it?”

This sounds obvious. It is not. The default mode in most organizations is “we have data, let us visualize it and see what insights emerge.” This is the data equivalent of opening the fridge and hoping dinner assembles itself.

Good analytics is decision support, not data art. Every chart should answer a specific question that leads to a specific action. If you cannot articulate what someone would do differently based on a metric moving up or down, that metric does not belong on a dashboard. It belongs in an ad-hoc query that someone runs when they need it.

The Three-Dashboard Rule

Here is a rule of thumb that has served me well: any team should have at most three dashboards.

  1. The operational dashboard: Real-time or near-real-time metrics that someone is actively monitoring. Think system health, current pipeline status, or today’s sales numbers. This is the dashboard that lives on a TV screen in the office (or, in its modern form, a Slack alert).
  2. The strategic dashboard: Weekly or monthly metrics that inform planning. Revenue trends, user growth, retention cohorts. This is the one your leadership team reviews in their weekly meeting.
  3. The investigative workspace: Not a dashboard at all — it is a sandbox where analysts can explore data ad-hoc. Jupyter notebooks, SQL IDEs, or a self-serve BI tool with a clean semantic layer.

Everything else is a report that should be generated on demand, not a dashboard that sits there consuming warehouse compute credits to refresh every hour for an audience of zero.


Right-Sizing Your Data Stack

Alright, enough ranting. Let us talk about what to actually do.

If you are a company with fewer than 500 employees, fewer than 10 million rows in your largest table, and a data team of one to three people, here is what I recommend:

The 80% Stack

PostgreSQL as your analytical database. Yes, really. Modern PostgreSQL with proper indexing, partitioning, and materialized views can handle analytical workloads that would shock you. It is free, battle-tested, and your backend engineers already know how to operate it.
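
To make that concrete, here is the shape of a pre-aggregated materialized view serving a dashboard straight out of PostgreSQL; the table and columns are illustrative:

-- Pre-aggregate once, query instantly. Refresh on whatever cadence the
-- dashboard actually needs; hourly is usually plenty. Names are illustrative.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
    order_date,
    COUNT(*)         AS orders,
    SUM(order_total) AS revenue
FROM orders
GROUP BY order_date;

-- A unique index lets REFRESH ... CONCURRENTLY run without blocking readers.
CREATE UNIQUE INDEX ON daily_revenue (order_date);

-- From a cron job or your orchestrator:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;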

dbt Core (the open-source version) for transformations. Write your business logic in SQL, version-control it in Git, test it with built-in assertions, and document it with auto-generated docs. dbt is the single best tool to emerge from the modern data stack era, and it works beautifully on top of PostgreSQL.
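
A dbt model is just a SELECT statement in a version-controlled file; dbt handles materialization, dependency order, and testing around it. A minimal sketch with hypothetical model names (the {{ ref() }} call is how dbt wires models together):

-- models/marts/weekly_active_users.sql (hypothetical model)
-- dbt compiles {{ ref('stg_events') }} to the actual schema and table name.
SELECT
    DATE_TRUNC('week', event_timestamp) AS week,
    COUNT(DISTINCT user_id)             AS weekly_active_users
FROM {{ ref('stg_events') }}
GROUP BY 1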

Metabase (or Apache Superset) for visualization. Metabase is open-source, connects directly to PostgreSQL, and is simple enough that non-technical stakeholders can build their own charts. It is not as powerful as Looker, but it is also not $60,000 a year.

That is the stack. PostgreSQL + dbt + Metabase. Total infrastructure cost: the price of a single cloud VM (or free if you are using a managed PostgreSQL service you already pay for). Total licensing cost: zero.

The Decision Tree

Here is how to think about when to graduate from the 80% stack:

“Our PostgreSQL queries are getting slow.” Before you reach for Snowflake, try: adding indexes, creating materialized views, partitioning large tables, or running VACUUM ANALYZE. If you have genuinely outgrown PostgreSQL for analytics (which usually means 500 GB+ of actively queried data), consider DuckDB for batch analytics or ClickHouse for real-time queries before jumping to a cloud warehouse.
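
Before reaching for anything new, the tuning meant here looks roughly like this in PostgreSQL (names illustrative): declarative partitioning plus an index that matches how the data is actually queried.

-- Partition a large events table by month so queries over recent data only
-- scan the partitions they need. Table and column names are illustrative.
CREATE TABLE events_partitioned (
    event_id        BIGINT      NOT NULL,
    user_id         BIGINT      NOT NULL,
    event_type      TEXT        NOT NULL,
    event_timestamp TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (event_timestamp);

CREATE TABLE events_2026_03 PARTITION OF events_partitioned
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Match the index to the dominant access pattern (per-user, recent-first).
CREATE INDEX ON events_partitioned (user_id, event_timestamp);

-- Refresh planner statistics after large loads.
VACUUM ANALYZE events_partitioned;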

“We have too many data sources to manage manually.” This is where Fivetran or Airbyte earns its keep. If you are writing custom ingestion scripts for more than five data sources, a managed ingestion tool will save you more in engineering time than it costs in licensing.

“We need real-time analytics.” First, question whether you actually do. “Real-time” in most business contexts means “within the hour,” not “sub-second.” If hourly is fine, a cron job and a materialized view will do. If you genuinely need sub-second latency on streaming data, now you are in ClickHouse or Kafka territory, and that spend is justified.

“Our data team is growing and we need better governance.” When you have five or more people writing dbt models and building dashboards, invest in a data catalog (like Atlan or DataHub), a proper orchestrator (Dagster is excellent), and data observability tooling. The cost of misaligned definitions and broken pipelines grows nonlinearly with team size.

The Uncomfortable Question

Before you add any tool to your data stack, ask this question: “If we removed this tool tomorrow, would anyone outside the data team notice?”

If the answer is no, you do not need that tool. You need better alignment between your data team and the rest of the business. No amount of infrastructure can fix a lack of clarity about what questions matter.


The Bottom Line

The data industry has a complexity addiction. We love building sophisticated systems because sophisticated systems are intellectually stimulating and look great on a resume. But the goal of a data stack is not to be architecturally interesting. The goal is to help people make better decisions.

For most companies, that means: collect the data you actually need, clean it properly, store it somewhere reliable, transform it into something meaningful, and present it in a way that drives action. You can do all of that with tools that have been around for decades.

The most impressive data engineering work I have ever seen was not a planet-scale streaming pipeline. It was a single, meticulously maintained PostgreSQL database with clean schemas, well-documented dbt models, and three dashboards that the entire company actually used every single day.

That is not a big data solution. That is a right data solution. And for most of us, that is exactly what we need.