What’s this OTCA exam?
The Linux Foundation offers the OpenTelemetry Certified Associate (OTCA) credential to confirm expertise in OpenTelemetry and observability practices. The exam might be of interest to you if you know and love observability best practices and are one of the following:
- Software developer
- DevOps and SRE practitioner
- Cloud architect
- Technical professional with a passion for OpenTelemetry
With the OTCA certification, your toolbox will be overflowing with an understanding of:
- Core observability concepts and data types
- OpenTelemetry API and SDK implementations
- Deploying and managing OpenTelemetry Collector pipelines
- Debugging and maintaining observability setups
Sound like something you're into? Let’s break down the certification structure and then get into how to prepare.
OTCA Domains & Competencies
According to The Linux Foundation’s OTCA certification page, an OpenTelemetry Certified Associate understands the basic concepts of observability, and how the OpenTelemetry project enables these concepts. The exam focuses on the major components of OpenTelemetry, how they are used, and the best practices for instrumenting cloud-native applications for observability with OpenTelemetry.
The exam is broken down into four main sections:
- Fundamentals of Observability (18%):
  - Telemetry data
  - Semantic conventions
  - Instrumentation
  - Analysis and outcomes
- The OpenTelemetry API and SDK (46%):
  - Data model
  - Composability and extension
  - Configuration
  - Signals (metrics, logs, traces)
  - SDK pipelines
  - Context propagation
  - Agents
- The OpenTelemetry Collector (26%):
  - Configuration
  - Deployment
  - Scaling
  - Pipelines
  - Transforming data
- Maintaining and Debugging Observability Pipelines (10%):
  - Context propagation
  - Debugging pipelines
  - Error handling
  - Schema management
The details
To pass the test, you’ll need a minimum score of 75%. There are 60 multiple choice questions, and you’ll have 90 minutes to complete the test. The exam is online and proctored, and you’ll need a clean, quiet workspace to be able to take the test. You’ll chat with your proctor to make sure your testing environment meets all the stated requirements.
You’ll be allowed two testing attempts, and both attempts are included in the $250 test cost.
Still sounds good? Let’s jump into what and how to study.
Study guide
Based on my research, my teammate Moss’ study guide, and a splash of AI, here are some key components and areas to focus on when studying for the exam. This is by no means comprehensive, and I strongly encourage you to use it as a jumping-off point in your own studying journey.
Domain 1: Fundamentals of Observability (18%)
- Core Signals: Traces, Metrics, and Logs ("Events" are often part of logs/spans but are not a standalone core signal in the same way).
- The "Why": OpenTelemetry is vendor-agnostic, allowing users to switch backends without re-instrumenting.
- Semantic Conventions: A shared naming scheme (e.g., service.name, http.response.status_code, messaging.system); see the short sketch after this list.
- Schema URL: Included in telemetry to declare which version of semantic conventions is being followed.
- If schemas don’t match:
| Situation | Recommended Tool | Action |
| --- | --- | --- |
| OTel Version Mismatch | schema processor | Set a target version; the processor handles the rest. |
| Custom/Legacy Mapping | transform processor | Write OTTL statements to rename or move attributes. |
| Simple Renaming | attributes processor | Use the upsert or insert actions for basic key changes. |
| Data Cleaning | filter processor | Drop attributes or spans that don't fit the destination schema. |
- Observability Backends: Tools like Jaeger (traces), Zipkin (traces), Prometheus (metrics), Grafana Loki (logs) that receive and store OTel data.
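Here’s a minimal Python sketch of what semantic conventions look like in practice (the span name and attribute values are made up, and it assumes an SDK is configured elsewhere; the attribute keys follow the HTTP semantic conventions):

```python
from opentelemetry import trace

tracer = trace.get_tracer("payments")

# Using shared semantic-convention keys (rather than ad-hoc names like "status")
# means any backend can interpret the data the same way.
with tracer.start_as_current_span("GET /checkout") as span:
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.response.status_code", 200)
    span.set_attribute("server.address", "shop.example.com")
```

The opentelemetry-semantic-conventions package also ships these keys as constants if you’d rather not hard-code strings.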
Domain 2: The OpenTelemetry API and SDK (46%)
API vs. SDK
- API: Defines how telemetry is generated. Use only the opentelemetry-api package when writing libraries, so you don't force an SDK dependency on downstream users.
- SDK: The language-specific implementation of the API. It handles the processing and exporting. The SDK is separate from the API and can be swapped at runtime without changing instrumentation code.
- In existing legacy code bases, you can use the API and the SDK together to manage and export the data.
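Here’s a minimal Python sketch of that split (the function and names are hypothetical): the library instruments itself against the API only, and the application wires up the SDK once at startup.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


# --- Library code: depends only on the opentelemetry-api package ---
def charge_card(amount: float) -> None:
    tracer = trace.get_tracer("payments.library")
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("payment.amount", amount)
        # ... business logic ...


# --- Application code: wires up the SDK once at startup ---
# Swapping the exporter or processor here never touches the library above.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

charge_card(42.00)
provider.shutdown()  # flush the batch processor before exit
```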
Traces and Spans
- Span: Represents a single operation. A SpanID identifies a unique span; a TraceID links spans across services. TraceIDs must be passed to services to create traces that can be viewed in observability tooling (see Context Propagation, below).
- Span Hierarchy: Spans form a parent-child hierarchy. A Child Span is created when a service receives a request from an upstream traced service.
- Span Kinds: Internal, Server, Client, Producer, and Consumer. For messaging:
  - Producer: Sending a message to a broker.
  - Consumer: Processing a message from a broker.
- Span Links: Used to associate spans that are causally related but not in a direct parent-child relationship (e.g., fan-out/batch processing).
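A quick, hypothetical Python sketch of those ideas (assumes an SDK is configured): a parent-child pair, producer/consumer span kinds, and a span link tying a batch-processing span back to the producer.

```python
from opentelemetry import trace
from opentelemetry.trace import Link, SpanKind

tracer = trace.get_tracer("orders.worker")

# Parent-child hierarchy within one service.
with tracer.start_as_current_span("handle-order"):
    with tracer.start_as_current_span("validate-order"):
        pass

# Producer span around publishing a message to a broker.
with tracer.start_as_current_span("publish", kind=SpanKind.PRODUCER) as producer:
    producer_context = producer.get_span_context()

# Consumer span for batch processing: causally related to the producer,
# but not its direct child, so we attach a span link instead.
with tracer.start_as_current_span(
    "process-batch",
    kind=SpanKind.CONSUMER,
    links=[Link(producer_context)],
):
    pass
```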
Metrics Instruments
- Counter: Monotonic (only goes up).
- UpDownCounter: Can increase or decrease (e.g., queue length).
- Gauge: Captures a non-additive value at a point in time (e.g., CPU usage).
- Histogram: Records a distribution of values into "buckets" (e.g., request latency).
- Meter: The API component responsible for creating these instruments. The MeterProvider is the factory for Meters. The Meter is the factory for Instruments (Counters, Histograms, etc.).
- Views: Used in the SDK to rename, aggregate, or filter attributes of metrics before they are exported.
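Here’s a short Python sketch of the MeterProvider → Meter → Instrument chain, plus a View; the instrument and attribute names are hypothetical.

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View

# View: keep only the http.route attribute on the latency histogram.
latency_view = View(instrument_name="request.latency", attribute_keys={"http.route"})

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader], views=[latency_view]))

# MeterProvider is the factory for Meters; the Meter is the factory for instruments.
meter = metrics.get_meter("checkout")

orders = meter.create_counter("orders.placed")                   # monotonic: only goes up
queue_len = meter.create_up_down_counter("queue.length")         # can increase or decrease
latency = meter.create_histogram("request.latency", unit="ms")   # distribution of values


def read_cpu(options: CallbackOptions):
    yield Observation(0.42)  # point-in-time, non-additive value


cpu = meter.create_observable_gauge("cpu.utilization", callbacks=[read_cpu])

orders.add(1, {"payment.method": "card"})
queue_len.add(5)
queue_len.add(-2)
latency.record(123.4, {"http.route": "/checkout", "http.request.method": "GET"})
```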
Context Propagation & Baggage
- Context Propagation: Passing trace/context info across service boundaries (generic for all signals).
- Default propagators:
  - W3C Trace Context: The default standard; uses the traceparent header.
- Baggage: A key-value store that travels with the context across services. Unlike span attributes, baggage is propagated over the wire.
- Propagators: Configured via OTEL_PROPAGATORS (e.g., tracecontext, baggage).
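A small Python sketch of context propagation and baggage between two hypothetical services, using a plain dict as the header carrier (assumes the default tracecontext + baggage propagators):

```python
from opentelemetry import baggage, context, trace
from opentelemetry.propagate import extract, inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("demo")

# --- Service A: outgoing request ---
headers = {}
token = context.attach(baggage.set_baggage("user.tier", "premium"))
with tracer.start_as_current_span("call-backend"):
    inject(headers)  # writes traceparent + baggage headers into the carrier
context.detach(token)

# --- Service B: incoming request ---
incoming_ctx = extract(headers)  # rebuild the remote context from the headers
with tracer.start_as_current_span("handle-request", context=incoming_ctx) as span:
    # The new span becomes a child of the remote parent, and baggage is readable.
    tier = baggage.get_baggage("user.tier", context=incoming_ctx)
    span.set_attribute("user.tier", str(tier))
```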
Resources & Instrumentation
- Resources: Attributes describing the entity producing telemetry (e.g., KubernetesResourceDetector for pod info adds resource attributes like container.id or k8s.pod.uid). If two detectors attempt to set the same attribute, the last detector wins.
- Zero-Code Instrumentation: Collecting data without modifying application code (e.g., Java Agent or OTel Operator).
- Code-Based: Manual instrumentation using the API.
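A minimal Python sketch of attaching a Resource to the SDK (the attribute values here are hypothetical; resource detectors would add things like k8s.pod.uid automatically):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes describe the entity producing the telemetry.
# Resource.create() also merges in environment-based values such as
# OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME.
resource = Resource.create(
    {
        "service.name": "checkout",
        "service.version": "1.4.2",
        "deployment.environment": "production",
    }
)

trace.set_tracer_provider(TracerProvider(resource=resource))
```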
Domain 3: The OpenTelemetry Collector (26%)
The Pipeline
- Components: A pipeline consists of Receivers, Processors, and Exporters.
- Connectors: A unique component that acts as an Exporter in one pipeline and a Receiver in another. Connectors are used to "bridge" different types of telemetry data or to route data based on specific logic. They are the primary way to turn one signal (like a Trace) into another (like a Metric) within the Collector.
- Extensions: Components that live outside the data pipeline (e.g., health checks, zpages).
Key Processors
- Batch Processor: Buffers data to reduce overhead. send_batch_size triggers a flush.
- Memory Limiter: Prevents OOM crashes by refusing incoming data when memory limits are hit. Look for the otelcol_processor_refused_spans metric.
- Filter Processor: Used to drop specific data (e.g., only keep HTTP 500 errors).
- Transform Processor: Used to sanitize or normalize attribute values.
- K8s Attributes: Automatically adds pod/container metadata to all signals.
Exporters & Protocols
- OTLP (OpenTelemetry Protocol): the “native language” of OTel. It is a general-purpose telemetry delivery protocol that standardizes how traces, metrics, and logs are encoded and transported between your applications, the OTel Collector, and observability backends (see the sketch after this list).
  - Encoding formats: Binary Protobuf, JSON
  - Supports gRPC (Port 4317) and HTTP (Port 4318)
  - Servers MUST support Gzip compression
  - Exporters MUST use exponential back-off with jitter for retries
  - Used for: standardization, vendor neutrality, efficiency (it uses Protocol Buffers for binary serialization), reliability (built-in features for backpressure signaling and retry logic)
- Prometheus Remote Write: Used to push metrics directly to a Prometheus-compatible backend.
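As a concrete example, here’s a small Python sketch that points the SDK at a Collector’s OTLP/gRPC receiver on the default port 4317 (the endpoint and service names are assumptions):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# OTLP over gRPC talks to the Collector's default gRPC port, 4317.
# (The otlp-proto-http exporter targets port 4318 with a /v1/traces path.)
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("place-order"):
    pass

provider.shutdown()  # flush pending batches before the process exits
```

The same endpoint can also be set without code changes via OTEL_EXPORTER_OTLP_ENDPOINT (see the environment variable table in Domain 4).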
Deployment & Scaling
- Agent: Co-located with the app (Sidecar/DaemonSet). Improves gRPC load balancing.
- Gateway: Centralized instances for a cluster/region.
- Kubernetes: the Collector's default mode is deployment; the available modes are:
| Mode | Kubernetes Resource | Best For... |
| --- | --- | --- |
| deployment | apps/v1.Deployment | Centralized Gateways & stateless processing. |
| daemonset | apps/v1.DaemonSet | Collecting host metrics and logs from all nodes. |
| statefulset | apps/v1.StatefulSet | Tail-sampling and predictable scaling. |
| sidecar | Injected into Pod | Fargate, serverless, or per-app isolation. |
- Kubernetes-specific:
  - resourcedetection Processor: used to detect cluster-level or host-level information (k8s_node detector)
  - Kubeletstats Receiver: pulls node, pod, and container-level metrics (CPU, memory, network) directly from the Kubelet API on each node
  - filelog Receiver: Collects logs from the standard /var/log/pods directory on the host
- Scaling:
  - Routing key: you configure the exporter to hash based on a specific key
    - For tail sampling: you set routing_key: traceID
    - For span metrics, you might set routing_key: service
  - Scale up when the exporter queue reaches 60-70% capacity.
  - Do not scale if the backend is the bottleneck (indicated by otelcol_exporter_send_failed_spans).
  - Target Allocator: a specific component used when you want to scale Prometheus-style scraping. Automatically distributes scrape targets across multiple Collector replicas to prevent duplicate data.
  - Loadbalancing Exporter: Uses DNS to distribute OTLP/gRPC traffic across backend collectors.
  - L7 Load Balancer: Required for distributing gRPC traffic properly.
- The loadbalancing exporter is the “secret sauce” for scaling stateful collectors. It doesn’t use simple round-robin; instead, it uses consistent hashing.
| Component | Role in Scaling |
| --- | --- |
| loadbalancing exporter | Uses consistent hashing to keep related spans together. |
| routing_key: traceID | The setting required to make Tail Sampling work at scale. |
| StatefulSet | The K8s workload used for the processing layer to provide stable identities. |
| Headless Service | Allows the load-balancing exporter to discover individual pod IPs via DNS. |
| Target Allocator | Distributes Prometheus scrape targets across multiple collectors. |
| Two-Layer Design | Layer 1 (Load Balancer) → Layer 2 (Stateful Processor). |
Domain 4: Sampling and Maintenance (10%)
Sampling Types
- Head Sampling: Decision (to sample or to drop) made at the start.
  - AlwaysOn: Keeps 100%.
  - TraceIDRatioBased: Keeps a specific percentage (e.g., 10%).
  - ParentBased: Respects the decision of the upstream parent.
- Tail Sampling: Decision made after the trace is complete. This is stateful and requires careful scaling.
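Here’s a small Python sketch of configuring head sampling in the SDK (the 10% ratio is just an example):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: sample ~10% of new root traces, but always respect the
# sampling decision of an upstream parent when one exists.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

# The same behavior can be selected without recompiling via:
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.1
```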
Critical Environment Variables
| Variable | Purpose |
| --- | --- |
| OTEL_SERVICE_NAME | Sets the logical name of the service. |
| OTEL_EXPORTER_OTLP_ENDPOINT | Points the SDK/Agent to the Collector or SaaS backend. |
| OTEL_TRACES_SAMPLER | Changes the active sampler without recompiling. |
| OTEL_EXPORTER_OTLP_HEADERS | Attaches custom HTTP headers to exports. |
| OTEL_JAVAAGENT_ENABLED=false | Disables the Java agent at runtime. |
| OTEL_NODE_RESOURCE_DETECTORS | Used to disable specific detectors in Node.js. |
Debugging & Testing
- Self-Telemetry: Collector metrics are typically exposed (when enabled) on Port 8888.
- If the Collector is dropping data...
  - Monitor internal metrics at Port 8888
  - Enable debug logging
  - Enable the zpages extension to get a live, web-based view of trace and metric buffers
- The Collector’s default behavior depends on where the failure occurred:
| Component | Scenario | Default Action |
| --- | --- | --- |
| Memory Limiter | RAM usage hits the "Hard Limit" | Drop immediately. It refuses to accept any more data from receivers until memory is freed. |
| Exporter Queue | Queue is full (queue_size reached) | Drop immediately. The setting block_on_overflow defaults to false, so new data is rejected rather than making the app wait. |
| Exporter | Transient Error (e.g., 503, 429) | Retry. It uses exponential backoff (starting at 5s, up to 300s total) before finally dropping the data. |
| Exporter | Permanent Error (e.g., 400, 401) | Drop immediately. The Collector assumes the data is invalid and will never be accepted. |
- Unit Testing: Use the InMemorySpanExporter (in Python/other SDKs) to assert that spans are created correctly without a backend.
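For instance, a minimal Python test sketch using InMemorySpanExporter (the function under test is hypothetical):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture spans in memory instead of shipping them to a backend.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("tests")


def do_work() -> None:
    with tracer.start_as_current_span("do-work") as span:
        span.set_attribute("work.items", 3)


def test_do_work_creates_span() -> None:
    exporter.clear()
    do_work()
    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    assert spans[0].name == "do-work"
    assert spans[0].attributes["work.items"] == 3
```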
Read the Docs
At the end of the day, the only real way to study for the OTCA exam is to read the docs (that or use OpenTelemetry in a production environment every day and gain experience with every type of OTel scenario).
Hopefully, the above guide helps distill some of the details provided in the docs, but, again, it’s no substitute for digging into the docs yourself.
Need some docs links to get started? Here are some pages to focus on:
Best of luck on your exam!
Additional Resources
Never miss a new post. Check out this short guide on how to subscribe to the blog and get updates.