Observability Grew Up. Your Practices Didn't.
Stop calling it monitoring. Schema, lineage, governance, cost: the four words that change how you run telemetry.
Twenty years ago, observability fit in one command.
ping -c1 server. If it came back, you were up. If it didn’t, you grabbed coffee and started reading logs by hand. That was the whole job. One box, one human, one pager that went off at 3am.
That world is gone. Today you don’t have a server. You have petabytes of what people politely call “data exhaust,” and a fleet of AI models drinking straight from it. The pager still goes off at 3am. But what’s on the other end is a whole data platform now, one you probably never agreed to run.
Your observability pipeline is your company’s most honest data warehouse. Nobody curates it to look good in a board deck. It just tells you what’s actually happening. And most teams are still treating it like a pile of logs.
Give me a few minutes. I’ll show you how observability quietly became a data problem, why that’s good news, and the four words that change how you work. You don’t need to be a data engineer to follow along. That’s sort of the whole point.
How we got here, fast
I’ll skip the full history. Let’s use five frames to set the scene, because the shape of the story is what matters.
Firefighting. One server, one human, grep and hope. Monitoring meant “is it up?”
Productization. Someone realized this pain was worth money. APM vendors showed up and the checkbooks came out. Thoma Bravo took Compuware private for about $2.4B in 2014 and spun its APM business out as Dynatrace. Cisco grabbed AppDynamics for $3.7B in 2017, days before its IPO. Observability became an industry.
Cloud native. Kubernetes hit 1.0 in July 2015. Prometheus 1.0 followed in July 2016. Suddenly you didn’t have one server, you had thousands of ephemeral ones, and the old “is it up?” question stopped making sense.
Scale. The numbers got absurd. Uber’s M3 ended up holding 6.6 billion time series. eBay was moving 1.2 petabytes of logs a day, and a few years later that had climbed to 15 petabytes and 10 billion active time series a day. Read those again. That’s not monitoring. That’s a data problem wearing a monitoring costume.
Standardization. OpenTelemetry showed up to stop everyone from reinventing the wheel: one set of conventions, one wire format (OTLP), vendor-neutral. And the industry voted with its keyboards. OTel is now the second most active project in the entire CNCF, behind only Kubernetes, with more than 1,200 developers committing code every month. It graduated in May 2026 as the de facto observability standard. When 12,000 contributors from 2,800 companies converge on one project, what they’re really standardizing is a data model.
All those frames piled up, each one making observability more complex and more critical.
Big data didn’t die. It won.
Remember when “big data” was the future? Ten years ago it was the title everyone wanted. Big data engineer. There were conferences, certifications, a whole priesthood built around moving data at volume. Then the hype moved on, the way hype does, and we all started talking about something else.
But big data is still everywhere. It won so completely that it stopped being worth a name. And few systems on earth generate more of it than observability. Remember those numbers from earlier? Petabytes of logs a day, billions of active series. That’s telemetry. Your telemetry.
If you operate a telemetry pipeline or a telemetry backend, you are doing big data engineering in its purest form. Ingestion, storage, query, retention, all under a latency budget, all while the thing is on fire at 3am. You just never got the fancy title for it.
So let’s claim it. At every one of those leaps, observability didn’t invent anything. It borrowed.
Batch ingestion. Rollups and downsampling. Kafka streams in front of storage. Columnar compression in the TSDB. Sampling. Dimensional modeling, where your labels are a star schema you just never called one. Data lakes on object storage. And now, with OpenTelemetry, schemas and contracts.
Every one of those is data engineering. We’ve been running a data platform for a decade and calling it “monitoring” so we didn’t have to take it that seriously.
So let’s stop pretending it’s something else.
If you’re already doing data engineering, do it on purpose. Treat telemetry like what it actually is, an engineered data pipeline, and four words stop being someone else’s problem and become yours: schema, lineage, governance, cost.
Those four words are the whole thesis of this newsletter.
The four words, one by one
Each one asks something specific of you, with a question to check yourself against.
Schema. Treat your semantic conventions as a versioned data contract. When a team renames an attribute, that’s a breaking change to every dashboard, alert, and query downstream, exactly like dropping a column in a database. So ask yourself: if someone renamed http.status_code tomorrow, would you find out from a contract test, or from a 3am page?
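To make that concrete, here is a minimal sketch of a contract check in Python. Everything in it is an assumption for illustration: the two schema dictionaries, the attribute types, and the breaking_changes function are made up, and this is not how OpenTelemetry or Weaver implements schema diffs. The idea is only that a rename shows up as a removed attribute before it reaches a dashboard.

```python
from __future__ import annotations


def breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """Return breaking changes between two schema versions.

    Each schema maps attribute name -> type. Removing an attribute or
    changing its type is breaking; adding a new attribute is not.
    """
    problems = []
    for name, old_type in sorted(old_schema.items()):
        if name not in new_schema:
            problems.append(f"removed: {name}")
        elif new_schema[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new_schema[name]})")
    return problems


# Hypothetical schema versions. In v2 someone renamed http.status_code.
v1 = {"http.status_code": "int", "http.request.method": "string"}
v2 = {"http.response.status_code": "int", "http.request.method": "string"}

changes = breaking_changes(v1, v2)
for change in changes:
    print(change)
```

Run in CI against the attributes your dashboards and alerts depend on, a non-empty result fails the build, which is a much cheaper way to find out than the page.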
Lineage. Know where your telemetry comes from and what transforms it on the way in. A metric that passes through three Collector processors, a sampler, and a relabeling rule has a lineage, and right now most teams can’t draw it. You’d never tolerate that in a pipeline feeding a financial report. Pick any number on your most important dashboard. Can you name every hop between the running code and the pixel?
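One way to start drawing that lineage is to write the hops down as data. The sketch below is illustrative only: the Hop class, the stage names, and the transforms are hypothetical, not taken from any real pipeline or from the Collector's configuration model. It records each hop and flags the ones that drop or rewrite data, since those are the ones that change what the number on the dashboard means.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    stage: str      # where in the pipeline this happens
    transform: str  # what it does to the signal
    lossy: bool     # True when the hop drops, samples, or rewrites data


def lossy_stages(lineage: list[Hop]) -> list[str]:
    """Names of the hops that change or discard data on the way in."""
    return [hop.stage for hop in lineage if hop.lossy]


def describe(lineage: list[Hop]) -> str:
    """Render the lineage as a numbered list, one hop per line."""
    lines = []
    for i, hop in enumerate(lineage, start=1):
        suffix = " (lossy)" if hop.lossy else ""
        lines.append(f"{i}. {hop.stage}: {hop.transform}{suffix}")
    return "\n".join(lines)


# Hypothetical lineage for one request-latency metric.
LINEAGE = [
    Hop("app SDK", "records request duration", False),
    Hop("collector: batch", "groups data points before export", False),
    Hop("collector: filter", "drops health-check endpoints", True),
    Hop("collector: transform", "renames attributes", True),
    Hop("sampler", "keeps a fraction of traffic", True),
    Hop("relabel rule", "drops the pod label", True),
    Hop("dashboard panel", "plots the stored series", False),
]

print(describe(LINEAGE))
```

Even a hand-maintained list like this answers the question above for one metric, and it makes the lossy hops impossible to ignore.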
Governance. Retention, cardinality, and PII are policy decisions you make up front, before they turn into an end-of-month firefight. The label that blows up your cardinality (the number of unique series you end up storing) is the same shape as the PII field nobody flagged. Both are governance gaps. Decide on purpose what you keep, for how long, and what you never collect in the first place. When a service wants to add a high-cardinality label, who owns that call, and do they make it before it ships or after the bill arrives?
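Here is a rough sketch of what "before it ships" could look like: a check that runs over the label sets a service emits and applies two policies at once. The budget of 100 values per label, the deny-list, and the function names are all assumptions I picked for the example, not a standard or a feature of any tool.

```python
from __future__ import annotations

from collections import defaultdict


def distinct_values(label_sets: list[dict[str, str]]) -> dict[str, int]:
    """Count how many distinct values each label key takes."""
    values = defaultdict(set)
    for labels in label_sets:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


def policy_violations(
    label_sets: list[dict[str, str]],
    max_values_per_label: int,
    denied_labels: set[str],
) -> list[str]:
    """Flag labels that are denied outright or exceed the cardinality budget."""
    violations = []
    for key, count in sorted(distinct_values(label_sets).items()):
        if key in denied_labels:
            violations.append(f"{key}: denied by policy")
        elif count > max_values_per_label:
            violations.append(
                f"{key}: {count} distinct values exceeds budget of {max_values_per_label}"
            )
    return violations


# Hypothetical label sets: a per-user label, plus one series leaking an email.
label_sets = [
    {"service": "checkout", "method": method, "user_id": str(i)}
    for i in range(500)
    for method in ("GET", "POST")
]
label_sets.append({"service": "checkout", "method": "GET", "email": "a@example.com"})

violations = policy_violations(label_sets, max_values_per_label=100, denied_labels={"email"})
for violation in violations:
    print(violation)
```

The cardinality problem and the PII problem fall out of the same loop, which is the point: both are policy checks on labels, and both can run before the label ever reaches storage.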
Cost. Make cost-per-insight a design metric you watch on purpose, before it becomes a surprise on the invoice. How big does that surprise get? Datadog once disclosed a single customer that ran up around $65 million in one year. One company’s telemetry bill, roughly the price of a mid-size acquisition. That’s the extreme, but the shape is universal. Every signal you collect is a bet that it’ll be worth more than it costs to store and query, and most teams never check whether the bet paid off. They keep everything and flinch at the total. One number tells the story: what fraction of the metrics you pay to store has been queried even once in the last 90 days?
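That last number is cheap to compute once you have two inputs: the list of metrics you store and the last time each one was queried. The sketch below assumes you can get both from your backend's query logs or usage stats; the metric names, timestamps, and the queried_fraction function are invented for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional


def queried_fraction(
    last_queried: dict[str, Optional[datetime]],
    now: datetime,
    window: timedelta = timedelta(days=90),
) -> float:
    """Fraction of stored metrics queried at least once inside the window.

    A value of None means the metric has never been queried.
    """
    if not last_queried:
        return 0.0
    cutoff = now - window
    used = sum(1 for ts in last_queried.values() if ts is not None and ts >= cutoff)
    return used / len(last_queried)


# Hypothetical usage data: metric name -> last time anyone queried it.
NOW = datetime(2025, 6, 1, tzinfo=timezone.utc)
LAST_QUERIED = {
    "http_requests_total": datetime(2025, 5, 20, tzinfo=timezone.utc),
    "queue_depth": datetime(2025, 4, 15, tzinfo=timezone.utc),
    "legacy_job_duration_seconds": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "debug_cache_evictions_total": None,
}

fraction = queried_fraction(LAST_QUERIED, NOW)
print(f"{fraction:.0%} of stored metrics were queried in the last 90 days")
```

In this made-up data set half of what is stored went unread for a full quarter. Whatever your real number is, tracking it over time turns the invoice from a surprise into a trend.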
None of this is exotic. Data teams have done it for years. We’re just late.
Why schema comes first
Of those four, schema is the one I’d start with, and it’s where I spend most of my time.
The uncomfortable part is that most teams have no schema at all. Telemetry just flows. Metrics, logs, and spans pour out of SDKs and exporters, and nobody ever wrote down what any of it means. We accepted that for a decade. We’d never accept it from a database table.
OpenTelemetry’s semantic conventions are where the industry is converging on this: a shared vocabulary for what an attribute is called and what it means, so http.request.method means the same thing in your service and in mine. That’s a data contract. And they aren’t the first attempt. Elastic Common Schema got there years earlier and did it well, but it lived inside one vendor’s world. The tell came in 2023, when Elastic donated ECS to OpenTelemetry so the two could merge into a single open schema. That’s the whole story in one move: the field is consolidating on one vendor-neutral contract, governed in the open.
And there’s finally tooling that treats it like one. Weaver, inside the OpenTelemetry project, takes a schema-first approach. You define your telemetry schema, generate type-safe code and docs straight from it, and get automatic diffs when it changes, the same upgrade-and-downgrade discipline you’d give a database migration. It’s growing into a telemetry catalog, a place where your signals and their meaning actually live.
The catch is the one every data team already knows: writing schemas by hand is miserable, so nobody does it. It’s the OpenAPI story all over again, a wild west of undocumented surfaces everyone agrees should be documented and no one wants to document. So with Arthur Sens I went the other direction. Instead of authoring a schema, infer it from the telemetry you’re already emitting. Point it at a pipeline, read the runtime data, pull out the metric names, types, and label values, and you have a schema nobody had to write. We prototyped it as a standalone tool first, then archived that and built the capability straight into Weaver, where it belongs. Discover the schema instead of authoring it, then feed it back into your instrumentation.
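To show the shape of the idea, here is a toy version of inference in Python. It is not Weaver's implementation or API. The sample format, the infer_schema function, and the metric names are all assumptions for this sketch. It reads observed samples and collects, per metric, the type and the label keys and values it saw.

```python
from __future__ import annotations


def infer_schema(samples: list[dict]) -> dict[str, dict]:
    """Build a schema from observed samples.

    Each sample is a dict with "name", "type", and "labels" keys.
    If one metric name is seen with more than one type, the schema
    records the conflict instead of silently picking one.
    """
    observed: dict[str, dict] = {}
    for sample in samples:
        entry = observed.setdefault(sample["name"], {"types": set(), "labels": {}})
        entry["types"].add(sample["type"])
        for key, value in sample["labels"].items():
            entry["labels"].setdefault(key, set()).add(value)

    schema = {}
    for name, entry in sorted(observed.items()):
        types = sorted(entry["types"])
        schema[name] = {
            "type": types[0] if len(types) == 1 else "conflict: " + ", ".join(types),
            "labels": {key: sorted(vals) for key, vals in sorted(entry["labels"].items())},
        }
    return schema


# Hypothetical samples, as if read off a running pipeline.
samples = [
    {"name": "http_requests_total", "type": "counter", "labels": {"method": "GET", "code": "200"}},
    {"name": "http_requests_total", "type": "counter", "labels": {"method": "POST", "code": "500"}},
    {"name": "queue_depth", "type": "gauge", "labels": {"queue": "emails"}},
]

schema = infer_schema(samples)
for name, spec in schema.items():
    print(name, spec["type"], list(spec["labels"]))
```

A real tool has to deal with units, descriptions, sampling windows, and far more volume, but the core move is the same: the schema is already in the data, and you can read it out instead of writing it down.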
This isn’t a fringe bet either. Prometheus is moving the same way: ingesting OTLP natively since 3.0, and working through how OpenTelemetry resource attributes and type-and-unit metadata should live in its data model. The whole ecosystem is quietly admitting the same thing: telemetry needs a schema, and that schema has to be engineered with intent.
That’s a lot for one newsletter to chew on. Good. We’ve got time.
Why me, and why now
Quick, because I’d rather earn your trust over time than spend a paragraph on a résumé.
I maintain Prometheus Operator and Perses, and I contribute across the OpenTelemetry, Thanos, and Prometheus ecosystems. So I see this problem from both chairs: the person operating telemetry at scale, and the person building the projects everyone else runs on. The gap between what the tools can do and what teams actually do with them is wider than anyone likes to admit. That gap is what I want to write about.
One thing I’ll be explicit about: this newsletter is vendor-agnostic, and it stays that way. The leaps that got us here came from open source and from teams operating at scale, not from picking tool X over tool Y. In a community where “competition” so often takes a backseat to collaboration, I’d rather help you reason about your own pipeline than sell you someone’s product. That’s the line, and I’m not crossing it.
What to expect
Deep, opinionated, no hello-world tutorials you could’ve gotten from the docs. Four pillars that put the four words from above to work: pipeline, cost, governance, and OSS running in real production.
If you’ve ever stared at an observability bill, or a dashboard nobody trusts, or a cardinality explosion at 3am and thought there has to be a more deliberate way to do this, there is. That’s what we’re going to build, one edition at a time.
My only ask: subscribe. The premise is simple, and I think it earns a spot in your inbox. Telemetry is a data discipline, and we’re going to treat it like one.
See you in the next one.
References
Cisco completes its acquisition of AppDynamics ($3.7B, 2017)
eBay at 15 PB of logs and 10B active series/day, KubeCon EU 2025
CNCF project velocity: OpenTelemetry is second only to Kubernetes
Elastic contributes Elastic Common Schema (ECS) to OpenTelemetry (2023)
Schema inference talk, PromCon EU 2025 (Nicolas Takashi & Arthur Sens)
How should Prometheus handle OpenTelemetry resource attributes?


