My role as an Observability Specialist at Splunk provides me with the opportunity to work with customers of all sizes as they implement OpenTelemetry in their organizations.
If you read my earlier article, 3 Things I Love About OpenTelemetry, you'll know that I'm a huge fan of OpenTelemetry. But like any technology, there's always room for improvement.
In this article, I'll share three areas where I think OpenTelemetry could be improved to make it even better.
While OpenTelemetry has come a long way in the past few years, making it even easier to use would allow more organizations to adopt it and result in a faster time to value for everyone. I'll share a few specific examples below.
One example where ease of use could be improved is in the instrumentation of languages that don’t support auto-instrumentation, such as Golang.
The good news is that efforts are already underway to build a solution that provides auto-instrumentation for Go using eBPF. While this is still a work in progress, you can learn more about it on GitHub and even try it out yourself (on a non-production app, please!).
Many practitioners I work with find it challenging to troubleshoot OpenTelemetry-related issues. While some of these issues involve the OpenTelemetry Collector, the majority occur at the instrumentation level. For example, they may not see the spans they're expecting, or the application may even crash on startup when auto-instrumentation is added.
Regardless of the specific instrumentation issue, there’s frequently confusion about where to start troubleshooting. More often than not, the focus is on the OpenTelemetry Collector logs, due to a lack of understanding of where the instrumentation occurs vs. what role the collector plays.
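One practical first step, at least for Java, is to confirm that the application is actually producing telemetry before digging into the Collector at all. Here's a minimal sketch, assuming a hypothetical service packaged as myapp.jar and instrumented with the OpenTelemetry Java agent:

    # Enable the agent's debug logging, and print spans to the console instead of
    # exporting them, to check instrumentation independently of the Collector.
    java -javaagent:opentelemetry-javaagent.jar \
         -Dotel.javaagent.debug=true \
         -Dotel.traces.exporter=logging \
         -jar myapp.jar

If spans show up in the application's console output, the instrumentation side is healthy and attention can shift to the Collector configuration and its logs; if they don't, there's no point in looking at the Collector yet.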
I believe the OpenTelemetry community as a whole would benefit from further guidance on troubleshooting techniques. This could take the form of expanded documentation, videos demonstrating the troubleshooting process for real-world issues, or tutorials that let practitioners go through these processes themselves with a mock application.
Generative AI can also play a role in providing troubleshooting guidance. As of July 2024, ChatGPT is already able to help with collector configuration tasks and provide direction on how to troubleshoot OpenTelemetry-related issues. For example, if we provide the following prompt:
I'm having trouble with the OpenTelemetry collector. Can you help me to troubleshoot this error?
warn kubelet/accumulator.go:102 failed to fetch container metrics {"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "pod": "mypod-5888f4d9fb-lbbww", "container": "mypod", "error": "failed to set extra labels from metadata: pod \"a3473219-4ab1-427c-b0f2-226a6e5271e5\" with container \"mypod\" has an empty containerID"}
ChatGPT was able to dissect the error message and provide the following interpretation:
“The key part of the error message is has an empty containerID. This suggests that the OpenTelemetry collector is attempting to fetch metrics for a container within a pod (mypod-5888f4d9fb-lbbww), but it cannot proceed because the container ID is empty.”
It also provided suggested troubleshooting steps such as confirming that the Kubernetes API Server is up and running, and ensuring the container has a valid Container ID associated with it.
While it’s not perfect, Generative AI is already helpful for troubleshooting OpenTelemetry issues today, and it will only continue to get better in the future.
One of my favorite aspects of traditional APM solutions that rely on proprietary APM agents is their ability to apply code-level instrumentation without requiring code changes. It would be great to see similar “no-code change required” capabilities added to OpenTelemetry. I’ve provided a few examples below.
It’s sometimes helpful to capture spans that go above and beyond what auto-instrumentation provides. With OpenTelemetry today, this typically requires making a code change. And while the code change itself is straightforward, with a few additional lines of code at most, it can take time to get this prioritized in a team’s sprint, tested, and pushed out to production.
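As a rough illustration, here's the kind of change involved, using the OpenTelemetry Java API; the OrderProcessor class, its doWork method, and the tracer name are all hypothetical:

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Scope;

    public class OrderProcessor {
        private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("com.example.orders");

        public void processOrder(String orderId) {
            // Create an extra span around a unit of work that auto-instrumentation
            // doesn't capture on its own.
            Span span = tracer.spanBuilder("processOrder").startSpan();
            try (Scope scope = span.makeCurrent()) {
                doWork(orderId);   // hypothetical business logic
            } finally {
                span.end();
            }
        }

        private void doWork(String orderId) { /* ... */ }
    }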
OpenTelemetry does provide some support for this today in Java, where it's possible to create additional spans by adding a system property to the JVM at startup. See Creating spans around methods with otel.instrumentation.methods.include for further details.
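For example, assuming a hypothetical com.example.orders.OrderProcessor class with a processOrder method (and no manual instrumentation in it), a minimal sketch of that approach looks like this:

    java -javaagent:opentelemetry-javaagent.jar \
         -Dotel.instrumentation.methods.include="com.example.orders.OrderProcessor[processOrder]" \
         -jar myapp.jar

At startup, the agent wraps processOrder in its own span, with no change to the application source.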
It would be great to see similar capabilities added to languages beyond Java.
As an observability enthusiast, I believe it’s critical to capture span attributes to ensure engineers have the context they need during troubleshooting.
For example, let’s say we have a user profile service, and one of the endpoints is /get-profile, which retrieves the profile of a particular user. We may find that the response time of this service varies widely. Sometimes it responds in a few milliseconds, and other times it takes upwards of 1-2 seconds. Adding span attributes to provide context about the request, such as the user ID and the number of items in that user’s history, is critical to ensure the engineer has the information they need for troubleshooting. The /get-profile operation might run slowly for users that have a large number of items in their history, but it wouldn’t be possible to determine this without having those attributes included with the trace.
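To make that concrete, here's a rough sketch of the code change involved, using the OpenTelemetry Java API inside a hypothetical /get-profile handler; the helper methods and attribute names are illustrative, not official semantic conventions:

    import io.opentelemetry.api.trace.Span;
    import java.util.List;

    public class ProfileHandler {

        // Handles GET /get-profile; the surrounding server span is created by the
        // Java agent's auto-instrumentation, so we only enrich it here.
        public String getProfile(String userId) {
            List<String> history = loadHistory(userId);   // hypothetical data access

            Span span = Span.current();
            span.setAttribute("app.user.id", userId);
            span.setAttribute("app.user.history_size", history.size());

            return renderProfile(userId, history);        // hypothetical rendering
        }

        private List<String> loadHistory(String userId) { return List.of(); }

        private String renderProfile(String userId, List<String> history) { return "{}"; }
    }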
While some traditional APM solutions provide similar capabilities without code changes, capturing span attributes with OpenTelemetry currently requires code changes. As with creating spans, these code changes aren’t difficult. But it can be challenging to prioritize these types of changes amongst competing feature requests and get them into production in a timely manner.
There is some support in OpenTelemetry today for capturing HTTP headers as span attributes with the Java agent.
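For example, at the time of writing, the Java agent can be told to record selected request headers as span attributes via a system property; the sketch below assumes the agent's otel.instrumentation.http.server.capture-request-headers property, a hypothetical X-Tenant-Id header, and a service packaged as myapp.jar:

    java -javaagent:opentelemetry-javaagent.jar \
         -Dotel.instrumentation.http.server.capture-request-headers=X-Tenant-Id \
         -jar myapp.jar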
It would be great to see OpenTelemetry extend these capabilities, and add further support for capturing span attributes without requiring code changes.
Spans that are captured with OpenTelemetry’s auto-instrumentation will tell us how long calls between services are taking, and whether any errors are occurring. This can be supplemented with manually created spans to provide insight into long-running tasks that require a deeper level of visibility.
But when it comes to finding the exact method or line of code that's causing an application to run slowly, we need to move beyond spans and look at profiling instead. Many traditional APM tools provide some form of profiling, but this level of detail hasn't been available with OpenTelemetry.
The good news is that the process of adding profiling to OpenTelemetry is already underway. In fact, OpenTelemetry announced upcoming support for profiling in March 2024. Please see OpenTelemetry announces support for profiling for details.
Note: while profiling is in the process of being added to OpenTelemetry, Splunk distributions of OpenTelemetry already include AlwaysOn Profiling capabilities.
OpenTelemetry provides a wealth of information about applications and the infrastructure they run on. This includes apps running on traditional host-based environments as well as containerized apps running on Kubernetes. It also includes observability data from other components which applications depend on, such as databases, caches, and message queues.
This data goes a long way in determining why an application is performing slowly, or why the error rate for a particular service has suddenly spiked. But sometimes, issues go beyond the application code and server infrastructure that the applications run on. It would be helpful to see OpenTelemetry broaden its scope and provide visibility into additional domains.
We’ve all heard the joke about software engineers blaming the network whenever something goes wrong in their application. Well, the truth is that sometimes the problem *is* caused by the network. Yet engineers responsible for building and maintaining these applications rarely have direct insights into how the network is performing.
So for more holistic visibility into anything that could be impacting application performance, it would be wonderful to see OpenTelemetry add support for the network domain in the future. This could include ingesting metric and log data from existing network monitoring solutions, or pulling data directly from network devices themselves.
While it’s important for an application to be available and performant, none of that matters if the application isn’t secure.
Since OpenTelemetry already has a wealth of information about the applications it instruments, having OpenTelemetry expand into the security domain would open up a whole new set of use cases for observability data.
For example, OpenTelemetry could gather information about what specific packages and versions are used by an instrumented application. This could take the form of a new “security” signal, with a corresponding set of security-related semantic conventions that ensure this data is captured in a consistent manner across different languages and runtimes.
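To illustrate the idea, here's a purely hypothetical sketch; since no security signal or related semantic conventions exist today, it simply expresses a package inventory as resource attributes using the existing Java SDK, with invented attribute names:

    import io.opentelemetry.api.common.Attributes;
    import io.opentelemetry.api.common.AttributesBuilder;
    import io.opentelemetry.sdk.resources.Resource;

    public class DependencyInventory {
        // Attach the packages (and versions) visible to the application as attributes.
        public static Resource withPackageInventory(Resource base) {
            AttributesBuilder attrs = Attributes.builder();
            for (Package pkg : Package.getPackages()) {
                String version = pkg.getImplementationVersion();
                if (version != null) {
                    // Invented attribute naming; no semantic convention exists for this yet.
                    attrs.put("security.package." + pkg.getName() + ".version", version);
                }
            }
            return base.merge(Resource.create(attrs.build()));
        }
    }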
The security signals could then be analyzed by an observability backend, and engineers could be alerted when a security vulnerability is present in their application. And by correlating security-related signals with existing signals such as traces and the upcoming profiling signal, observability backends could also determine when a particular vulnerability is actually being exercised, by analyzing which code paths are actively executed.
Thanks for taking the time to hear my thoughts on OpenTelemetry. Please leave a comment or reach out to let us know how you would make OpenTelemetry even better.