Observability with OpenTelemetry: a practical guide for getting out of vendor lock-in

If you have more than 50 services in production, you have probably felt the pain:

Datadog costs go up with every developer you hire
New Relic changes pricing and your bill doubles in a quarter
The team wants to migrate to Grafana Cloud but “everything is instrumented with the Datadog SDK”
Each vendor pushes its own agent, with its own syntax and its own scope

The observability conversation has two parts: collecting (where the data comes from) and consuming (where you look at it). The problem is that vendors sell the two as one tightly coupled bundle, and later you cannot pull them apart.

OpenTelemetry solves the first part. Rigorously.

1. What OpenTelemetry is (and is not)

It is: an open standard, maintained under the CNCF, for instrumenting applications (traces, metrics, logs) in a vendor-neutral way.

That is: you instrument once using the OTel SDK and APIs. Data comes out in a standard format (OTLP). You can plug into any compatible backend (Datadog, New Relic, Honeycomb, Grafana, Jaeger, AWS X-Ray, Azure Monitor, GCP Cloud Trace) without touching application code; switching is a configuration change.

It is not: an observability backend. OTel does not store data, does not render dashboards, does not alert. For that you still need a tool.

Think of it like this: OTel is to observability what USB is to peripherals. Connects anything to anything.

2. Why this matters now

Three things changed in the last 18 months:

Auto-instrumentation matured. In Java, Python, Node, Go and .NET you run an agent and roughly 80% of common libraries (HTTP clients, DBs, message queues, web frameworks) are instrumented with no code change.
OTLP became the default. Datadog, New Relic, Grafana and Elastic accept OTLP directly. No conversion layer is needed anymore.
Observability cost exploded. Traditional APMs charge per host, per custom metric and for retention. On accounts above R$ 2M/year, lock-in becomes a real business risk.

If you are starting with observability today, or your contract is up for renewal, OTel should be the default starting point.

3. The minimum architecture

┌─────────────┐     OTLP     ┌──────────────┐
│ App (SDK)   │─────────────▶│   Collector  │
└─────────────┘              └──────┬───────┘
                                    │
                      ┌─────────────┼─────────────┐
                      │             │             │
                      ▼             ▼             ▼
                ┌─────────┐   ┌──────────┐  ┌──────────┐
                │ Datadog │   │ Grafana  │  │ S3 / GCS │
                │ (prod)  │   │(staging) │  │ (raw)    │
                └─────────┘   └──────────┘  └──────────┘

Pieces:

SDK — instruments the application. Can be auto (agent) or manual (API calls in code).
Collector: receives OTLP, applies filtering, sampling and redaction, and forwards the result to one or more destinations.
Backend(s) — any compatible vendor.

The magic is in the Collector: you can send the same data to 2 different destinations (production + long-term analytics in S3, for example) without the cost of instrumenting twice.
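To make the fan-out idea concrete, here is a rough sketch in plain Python. It is not the Collector's implementation or its configuration format: `Pipeline`, `drop_health_checks`, the `/healthz` route and the in-memory "backends" are hypothetical names, used only to show that one received batch is processed once and then delivered to several destinations.

```python
from dataclasses import dataclass, field
from typing import Callable

Span = dict  # simplified stand-in for a span: a plain dict of fields
Processor = Callable[[list[Span]], list[Span]]
Exporter = Callable[[list[Span]], None]


@dataclass
class Pipeline:
    """Toy model of receive -> process -> fan-out (hypothetical, not the Collector API)."""
    processors: list[Processor] = field(default_factory=list)
    exporters: dict[str, Exporter] = field(default_factory=dict)

    def receive(self, batch: list[Span]) -> None:
        # Processors run once, in order, on the incoming batch.
        for process in self.processors:
            batch = process(batch)
        # Every exporter gets its own copy of the same processed batch.
        for export in self.exporters.values():
            export([dict(span) for span in batch])


def drop_health_checks(batch: list[Span]) -> list[Span]:
    """Example filter: discard spans for a hypothetical /healthz route."""
    return [s for s in batch if s.get("http.route") != "/healthz"]


# In-memory "backends" standing in for a vendor and a raw archive.
vendor_store: list[Span] = []
archive_store: list[Span] = []

pipeline = Pipeline(
    processors=[drop_health_checks],
    exporters={"vendor": vendor_store.extend, "archive": archive_store.extend},
)
pipeline.receive([
    {"name": "GET /orders", "http.route": "/orders"},
    {"name": "GET /healthz", "http.route": "/healthz"},
])
```

The application emitted its spans once; adding or removing a destination only touches the `exporters` mapping, which is the property the Collector gives you through configuration.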

4. Adoption in 4 phases

A realistic guide based on 6 client OTel adoptions we ran in the last 12 months.

Phase 1 — Pilot on 1 service (2 weeks)

Pick:

1 important backend service (not the most critical one)
1 language your team knows well
1 vendor already contracted (Datadog or Grafana Cloud)

Instrument via the auto-instrumentation agent. Zero code changes in the first week.

Goal: validate that traces arrive at the vendor, that latencies match what you already see, and that the load on the service is tolerable (typically 1-3% overhead).
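One simple way to put a number on that overhead is to compare latency samples taken before and after enabling the agent. The sketch below assumes median request latency as the proxy (CPU or memory would work the same way); the function name, the sample values and the 3% budget check are illustrative, not measurements from a real service.

```python
from statistics import median


def overhead_percent(baseline_ms: list[float], instrumented_ms: list[float]) -> float:
    """Relative change in median latency, in percent, after enabling the agent."""
    base = median(baseline_ms)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (median(instrumented_ms) - base) / base * 100.0


# Made-up latency samples (milliseconds) for the same endpoint.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
instrumented = [102.0, 104.0, 100.0, 103.0, 101.0]

overhead = overhead_percent(baseline, instrumented)  # medians are 100 and 102
within_budget = overhead <= 3.0  # upper end of the 1-3% range quoted above
```

In a real pilot you would feed this with samples from the same traffic window on both versions, since noise between windows can easily exceed the effect you are measuring.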

Phase 2 — Roll-out by language (1-2 months)

Standardize via:

Dockerfile template with the agent
Standard environment variables (OTEL_SERVICE_NAME, OTEL_RESOURCE_ATTRIBUTES, OTEL_EXPORTER_OTLP_ENDPOINT)
Sampling policy declared in config, not in code
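As a sketch of what "standard environment variables" can look like in practice, the helper below builds the three variables for one service. `OTEL_RESOURCE_ATTRIBUTES` takes comma-separated `key=value` pairs; the helper name, the service name, the endpoint URL and the attribute values are assumptions made for illustration.

```python
def otel_env(service_name: str, endpoint: str, resource_attrs: dict[str, str]) -> dict[str, str]:
    """Build the standard OTel environment variables for one service."""
    for key, value in resource_attrs.items():
        # Simplification: reject separators instead of percent-encoding them.
        if any(ch in key or ch in value for ch in ",="):
            raise ValueError(f"unsupported character in resource attribute {key!r}")
    attrs = ",".join(f"{k}={v}" for k, v in sorted(resource_attrs.items()))
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": attrs,
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
    }


# Hypothetical service and Collector address.
env = otel_env(
    service_name="checkout",
    endpoint="http://otel-collector:4317",
    resource_attrs={"service.version": "1.4.2", "deployment.environment": "staging"},
)
```

Generating these values from one place (a template, a Helm chart, a small script like this) is what keeps attribute names consistent across services.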

Do one language at a time. Do not try to adopt in 4 stacks at once: the lessons from one stack get lost before you can apply them to the next.

Phase 3 — Collector deploy as central component (1 month)

So far apps send OTLP straight to the vendor. Now bring up a Collector (Kubernetes deployment with 3 replicas, or ECS service).

Immediate gains:

PII redaction in the pipeline (CPF, email, tokens) before data leaves your network
Smart sampling (100% of errors, 10% of successes) — cuts cost by 50-80%
Failover: if the vendor goes down, the Collector queues and retries locally (in memory by default, so size the queue or configure persistent storage)
Fan-out: same trace goes to vendor A (hot prod) and vendor B (cold analytics)
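The redaction step in that list can be pictured as a pass over span attributes before export. This is a minimal sketch in plain Python, not the Collector's own processor: the patterns (formatted CPF, email, Bearer token), the placeholder strings and the function names are all assumptions, and real data will need patterns tuned to what your services actually emit.

```python
import re

# Illustrative patterns only; they cover the formatted cases, not every variant.
_PATTERNS = [
    (re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"), "[CPF]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE), "Bearer [TOKEN]"),
]


def redact_value(value):
    """Mask PII inside a string value; non-string values pass through unchanged."""
    if not isinstance(value, str):
        return value
    for pattern, replacement in _PATTERNS:
        value = pattern.sub(replacement, value)
    return value


def redact_attributes(attrs: dict) -> dict:
    """Return a copy of span attributes with string values redacted."""
    return {key: redact_value(value) for key, value in attrs.items()}


clean = redact_attributes({
    "customer": "customer 123.456.789-09 <ana@example.com>",
    "http.request.header.authorization": "Bearer abc.def-123",
    "http.response.status_code": 200,
})
```

Running this centrally rather than in each service is the point of Phase 3: one set of rules, applied before anything leaves your network.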

Phase 4 — Correlated logs and metrics (2-3 months)

Now expand OTel to logs (via Logback/SLF4J appenders or Python logging handlers) and metrics (MeterProvider).

The big gain here is correlation: trace_id in every log, exemplars in metric histograms pointing to concrete traces. When an alert fires, you jump from chart → trace → log in one click.
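The mechanics of "trace_id in every log" can be sketched with the standard library alone. In a real setup the OTel logging integration reads the ID from the active span; here a hypothetical `current_trace_id` context variable stands in for that, and the logger name, format string and trace ID value are made up.

```python
import contextvars
import io
import logging

# Stand-in for "the trace ID of the active span" (assumption for this sketch).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # captured in memory so the result is easy to inspect
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Inside a request: the trace ID is set, so the log line carries it.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment authorized")
current_trace_id.reset(token)

# Outside any request: the default placeholder is used.
logger.info("idle")

log_lines = stream.getvalue().splitlines()
```

Once every line carries the ID, the backend can join logs to traces on that field, which is what makes the one-click jump possible.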

5. Common pitfalls

Pitfall 1 — Instrumenting manually too much

I started the post saying “auto-instrumentation covers 80%”. That 80% is enough for the first 6 months. Do not let the team waste weeks creating manual spans for everything.

Rule: only instrument manually what has business value (domain operations, authorization decisions, expensive side effects). Auto-instrumentation covers the rest.

Pitfall 2 — Wrong sampling

Sampling is the most powerful tool and the easiest to misuse.

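Head-based sampling (decide at entry): simple, but can miss rare errors, because the decision is made before the outcome is known
Tail-based sampling (decide at the Collector after the trace has ended): keeps 100% of errors plus a percentage of successes, and is probably what you want. With several Collector replicas, all spans of a trace must reach the same instance for the decision to be correct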

Rule: error + slow first, then sample the rest.
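That rule can be written down as a small decision function evaluated once the whole trace is known. This is a sketch of the policy, not the Collector's tail sampling processor: the function name, the 1000 ms slow threshold and the hash-based bucketing are assumptions; the 10% success ratio follows the figure used in Phase 3.

```python
import hashlib


def keep_trace(
    trace_id: str,
    has_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 1000.0,  # assumed definition of "slow"
    success_ratio: float = 0.10,        # share of ordinary traces to keep
) -> bool:
    """Tail-based decision: keep every error and slow trace, sample the rest."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Map the trace ID to a stable bucket in [0, 1) so the same trace
    # always gets the same decision, whichever process evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_ratio


decisions = {
    "error": keep_trace("trace-a", has_error=True, duration_ms=20.0),
    "slow": keep_trace("trace-b", has_error=False, duration_ms=2500.0),
}
```

A head-based sampler can only run the last three lines of the function, because at entry it knows neither the error status nor the duration. That is exactly why it can drop the traces you care about most.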

Pitfall 3 — Vendor “supports OTLP” with an asterisk

Not every vendor supports OTLP equally. Read the fine print:

Some do not accept logs via OTLP yet (only metrics + traces)
Some require specific tags for certain dashboards
Some limit custom attributes in spans

Before betting on vendor X, run a 1-week POC with your real data.

Pitfall 4 — Cardinality explosion

OpenTelemetry does not protect you from creating metrics with high-cardinality labels (user_id, trace_id, request_id in metrics). That kills backends.

Rule: in metrics, labels must be enumerable in fewer than 10,000 distinct values. For the rest, use span attributes (in traces) or structured logs.
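A rule like this is easiest to enforce with a check you can run against a sample of metric points in CI or in a pre-production environment. The sketch below counts distinct values per label and reports the ones at or above the limit; the function name and the sample data are hypothetical, and only the 10,000 threshold comes from the rule above.

```python
from collections import defaultdict

MAX_DISTINCT_VALUES = 10_000  # threshold from the rule above


def high_cardinality_labels(samples, limit: int = MAX_DISTINCT_VALUES) -> dict:
    """Return {label: distinct_count} for labels with `limit` or more distinct values."""
    seen = defaultdict(set)
    for labels in samples:
        for key, value in labels.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items() if len(values) >= limit}


# Tiny made-up sample; the limit is lowered so the effect is visible.
samples = [
    {"http.method": "GET" if i % 2 else "POST", "user_id": f"user-{i}"}
    for i in range(5)
]
offenders = high_cardinality_labels(samples, limit=3)
```

Here `user_id` is flagged and `http.method` is not, which matches the intent of the rule: bounded, enumerable labels stay on metrics, and per-user or per-request identifiers move to span attributes or structured logs.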

6. Practical comparison: OTel-first vs. vendor-SDK

Aspect	Vendor SDK (Datadog, etc.)	OTel + Vendor
Initial setup	1 hour	3-5 hours
Auto coverage	Excellent	Good (improving fast)
Vendor lock-in	High	Minimal
Cost to switch vendor	High (refactor)	Low (change config)
Cost of multi-destination	High	Low
Compliance / local redact	Depends on vendor	Native in Collector
Community / ecosystem	Closed	Open, CNCF

OTel is slower to set up initially, but pays the investment back in 3-6 months when the first vendor contract needs to be renegotiated.

7. Recommended stack by size

Startup / mid-market (< 20 services): Auto-instrumentation + Grafana Cloud Free/Pro. Low cost, fast onboarding. Add a Collector once ingest exceeds 50 GB/month.
Enterprise (20-200 services): Collector deployment + main vendor (Datadog/New Relic/Grafana Cloud Enterprise) + cold storage on S3/GCS for cheap long-term retention.
Regulated enterprise (banking, healthcare): Collector with redaction in the pipeline + approved vendor + duplicated export to a SIEM (Splunk/Elastic).

8. Question you should be asking yourself now

“If my Datadog/New Relic contract doubles at renewal, how much time and effort would it take me to migrate?”

If the answer is “6-12 months of work”, you have lock-in. Starting to adopt OTel in the next 90 days would cut that number to ~1 month.

Observability is decision infrastructure. Lock-in in decision infrastructure is strategic risk.

At Redgator: we have run 7 OpenTelemetry-based observability projects in the last 18 months, from e-commerce to banking. If you want a diagnosis of your current stack and a transition plan, talk to a specialist.

Published on 2026-02-28. Last review: 2026-02-28.