The modern data stack we run for you
by QuackyData, Data Engineering
"Modern data stack" has become a marketing phrase, but underneath it is a real and useful idea: instead of one monolithic platform, you assemble a handful of focused, open tools that each do one job well and hand off cleanly to the next. The hard part isn't picking tools — it's making them fit together, run reliably, and stay affordable.
This is the stack we deploy and operate for clients. Here's each layer, what it does, and why it's there.
Ingestion: Airbyte
Everything starts with getting data out of the tools your business already uses — Postgres, Stripe, Salesforce, your product database, a dozen SaaS APIs — and into one place.
Airbyte handles that. It's an open-source ingestion platform with hundreds of pre-built connectors, so we rarely write bespoke extraction code. When a new source appears, there's usually already a connector for it, and when there isn't, the framework makes one straightforward to add.
The reason we use Airbyte specifically: connectors are commodities, and you shouldn't pay per-row to move your own data. Running Airbyte in your environment means ingestion cost scales with your infrastructure, not with a vendor's pricing meter.
Storage: a DuckLake lakehouse on object storage
Ingested data has to land somewhere. We land it in a DuckLake lakehouse sitting on object storage — your S3, GCS, or equivalent bucket.
A lakehouse combines two things that used to be separate: the cheap, open storage of a data lake and the reliable, queryable tables of a warehouse. DuckLake gives us proper table semantics — schemas, snapshots, and consistent reads — directly on top of files in your bucket.
Two consequences matter here:
- The data lives in your cloud account, in open formats: DuckLake stores table data as Parquet files. You're never extracting your own data from a vendor's proprietary store, because it was always sitting in your storage.
- Storage is decoupled from compute. You pay object-storage prices to keep data, and you only spin up compute when you actually query. There's no always-on warehouse cluster quietly billing you overnight.
Modeling compute: DuckDB and dbt
Raw ingested data is rarely usable as-is. It needs to be cleaned, joined, deduplicated, and shaped into tables that answer business questions. That's modeling, and it's where two tools work together.
DuckDB is the engine. It's a fast, in-process analytical database that reads directly from the lakehouse and runs transformations on a single machine, with no cluster to provision. For the data volumes most startups and mid-market companies actually have, that is not a compromise: the same SQL runs on a laptop as it does in production.
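To make "deduplicated" concrete, here is a minimal sketch of the kind of SQL a staging step might run: keep only the most recent row per id. The table and column names (`raw_customers`, `id`, `email`, `updated_at`) are made up for illustration. The sketch uses Python's built-in `sqlite3` purely so it runs without installing anything; it is a stand-in, not our production engine, though DuckDB accepts the same `ROW_NUMBER()` window pattern.

```python
import sqlite3

# Hypothetical staging query: keep the latest row per customer id.
DEDUP_SQL = """
SELECT id, email, updated_at
FROM (
    SELECT
        id,
        email,
        updated_at,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM raw_customers
)
WHERE rn = 1
ORDER BY id
"""


def dedupe_customers(rows):
    """Load raw rows into an in-memory table and return the deduplicated rows."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE raw_customers (id INTEGER, email TEXT, updated_at TEXT)"
        )
        con.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", rows)
        return con.execute(DEDUP_SQL).fetchall()
    finally:
        con.close()
```

Dates are stored as ISO-8601 strings here, so ordering by `updated_at` as text also orders them chronologically.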
dbt is how we organize the transformations. Instead of brittle, hand-run SQL scripts, dbt turns modeling into version-controlled, dependency-aware code: staging models that clean each source, then marts that combine them into the tables analysts and dashboards use. It also runs tests and tracks lineage, so a change in one model has a traceable effect downstream.
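"Dependency-aware" is easier to see in code. The sketch below is plain Python, not dbt's API: it uses the standard library's `graphlib` to work out a valid run order for a few hypothetical models (the names are invented for illustration) and to list what sits downstream of a given model. dbt derives the same kind of graph from the references between models.

```python
from graphlib import TopologicalSorter

# Hypothetical models; each name maps to the set of models it depends on.
MODEL_DEPS = {
    "stg_stripe_charges": set(),
    "stg_salesforce_accounts": set(),
    "mart_revenue_by_account": {"stg_stripe_charges", "stg_salesforce_accounts"},
}


def run_order(deps):
    """Return model names ordered so every model comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(model, deps):
    """Return every model that depends on `model`, directly or transitively."""
    affected = set()
    frontier = [model]
    while frontier:
        current = frontier.pop()
        for name, parents in deps.items():
            if current in parents and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected
```

Calling `downstream_of("stg_stripe_charges", MODEL_DEPS)` answers the lineage question: which tables could change if this staging model changes.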
The combination gives you warehouse-grade modeling without a per-seat, always-on warehouse bill. The compute is DuckDB; the discipline is dbt.
Orchestration: Dagster
A stack is only useful if it runs on its own, in the right order, every day. Dagster is the orchestrator that ties the layers together.
Dagster knows that ingestion must finish before modeling starts, that staging models come before marts, and that data-quality checks come last. It schedules the whole graph, retries what fails, and surfaces a clear view of what ran, what's stale, and what broke.
This is the layer that turns a collection of tools into a pipeline you don't have to think about. When a sync fails at 3 a.m., Dagster catches it and we act on it — you don't discover it in a board meeting.
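The behavior described above (run steps in order, retry what fails, report what broke) can be sketched in a few lines of plain Python. This is not Dagster's API, and the function names and step names are assumptions made for illustration; a real orchestrator adds scheduling, logging, and alerting on top of the same idea.

```python
import time


def run_with_retries(step, max_retries=2, delay_seconds=0.0):
    """Call `step` until it succeeds or the retries run out.

    Returns a (succeeded, attempts) tuple.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            step()
            return True, attempts
        except Exception:
            if attempts > max_retries:
                return False, attempts
            time.sleep(delay_seconds)


def run_pipeline(steps, max_retries=2):
    """Run (name, callable) steps in order.

    A step that still fails after its retries is marked "failed", and every
    later step is marked "skipped" rather than run on incomplete data.
    """
    status = {}
    blocked = False
    for name, step in steps:
        if blocked:
            status[name] = "skipped"
            continue
        succeeded, _ = run_with_retries(step, max_retries=max_retries)
        status[name] = "ok" if succeeded else "failed"
        blocked = not succeeded
    return status
```

The returned status dictionary is the small-scale version of the "what ran, what broke" view: a transient ingestion error is absorbed by a retry, while a persistent one stops modeling from running on partial data.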
Data quality and observability across the top
Running on a schedule isn't enough; the data also has to be trustworthy. Sitting across the stack are data-quality and observability checks: tests on the models (uniqueness, non-null keys, accepted values), freshness checks that flag stale sources, and monitoring that tells us when something drifts before it reaches a dashboard.
This is what separates "we have a pipeline" from "we trust the numbers."
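For a sense of what those checks amount to, here is a minimal plain-Python sketch of the four mentioned above: uniqueness, non-null keys, accepted values, and freshness. The function names and the row format (a list of dictionaries) are assumptions for illustration; in practice these checks are declared in the modeling and orchestration layers rather than hand-written.

```python
from datetime import datetime, timedelta, timezone


def check_unique(rows, key):
    """True if no two rows share the same value for `key`."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def check_not_null(rows, key):
    """True if every row has a non-None value for `key`."""
    return all(row.get(key) is not None for row in rows)


def check_accepted_values(rows, key, accepted):
    """True if every row's value for `key` is one of `accepted`."""
    return all(row[key] in accepted for row in rows)


def check_fresh(last_loaded_at, max_age, now=None):
    """True if the source was loaded within `max_age` of `now` (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age
```

A freshness check such as `check_fresh(last_sync, timedelta(hours=24))` is what flags a stale source before anyone reads a dashboard built on it.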
Why this combination
Step back and the through-line is clear. Every layer is open source, the data sits in your own cloud in open formats, and compute is decoupled from storage so you're not paying for an idle warehouse or per-seat licenses. You get ingestion, modeling, orchestration, and monitoring comparable to an expensive managed platform, without tying your data to a single vendor.
The catch is that assembling, tuning, and operating these five tools well is real engineering work — which is the part we take off your plate. You own the stack and the data; we run it.