If you've read our previous post on self-service observability, you already know what it is and why it matters. Self-service observability empowers teams to own their observability practice: it eliminates cross-team dependencies and bottlenecks, streamlines delivery, and speeds up incident resolution. It also lets your engineers tailor observability to their application's unique needs. In this post, we'll walk through how to implement self-service observability using Splunk Observability Cloud.

Start with Team Structure and Access Control

To keep teams from fumbling through an incident while sifting through an explosion of dashboards and charts, team structure and organization within your observability platform are critical. A successful self-service observability practice starts with team structure.

In Splunk Observability Cloud, Teams are a way to organize users into functional groups and efficiently connect them to the dashboards, charts, detectors, and alerts that matter to them. To create a Team:

1. Select Settings in the left navigation menu
2. Go to Team Management
3. Select Create team, enter a name and description, and then Add members
4. Save the Team by selecting Create team

Team members get access to a dedicated landing page that contains relevant information tailored to their needs. A landing page brings together team-specific:

- Dashboard groups
- Detectors triggered by team-linked alerts
- Favorited and mirrored dashboards (centralized dashboards that are editable across teams but customizable per team without affecting other mirrors)

This setup helps everyone stay aligned and focused, whether during incidents or proactive monitoring.

For each team, you can assign team managers and configure the proper roles and user management, including Admin, Power, Usage, and Read-only, so that people only have access to the things they need. Access controls can be set using Splunk Observability Cloud Enterprise Edition's built-in RBAC by:

- Navigating to Settings and selecting Access Tokens
- Creating team-specific access tokens with the appropriate permissions
- Setting token usage limits to cap how much each token can ingest

This granular token provisioning protects your environment, enables per-team usage tracking, and lets you proactively alert on usage thresholds so your teams can build and use the resources they need without unknowingly driving up cost.
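To make that concrete as code, here is a sketch of team-scoped token provisioning using the org token resource from the signalfx Terraform provider (more on observability as code in the next section). The token name, notification address, and limit values are placeholders, and the argument names should be checked against the current provider documentation:

```hcl
# Illustrative team-scoped ingest token with usage limits and
# notification thresholds. Names and values are placeholders; verify
# the signalfx_org_token schema against the provider docs.
resource "signalfx_org_token" "payments_team_token" {
  name        = "payments-team-ingest"
  description = "Ingest token owned by the payments team"

  # Notify the team before it reaches its limits.
  notifications = ["Email,payments-oncall@example.com"]

  host_or_usage_limits {
    host_limit                            = 100
    host_notification_threshold          = 90
    custom_metrics_limit                  = 1000
    custom_metrics_notification_threshold = 900
  }
}
```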
Standardized Templates & Observability as Code

Templating and automating the creation of observability resources increases the likelihood of adoption and makes standardization easier through easily shareable resources. There are different ways to build templates, such as using the Splunk Observability Cloud API or the Splunk Terraform provider. As long as your resources are in source control, you get checks and balances, rollback capabilities, and higher developer velocity, because teams can focus on writing feature code rather than hand-building observability resources.

Here's an example of how to use the Splunk Terraform provider to create a dashboard and chart in Splunk Observability Cloud:
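Treat this as a minimal sketch rather than a drop-in configuration: the dashboard group, metric name, SignalFlow program, and layout values are placeholders to adapt to your own services.

```hcl
# A dashboard group, a chart, and a dashboard that ties them together.
# Metric names, filters, and layout values are illustrative only.
resource "signalfx_dashboard_group" "payments" {
  name        = "Payments Team Dashboards"
  description = "Dashboards owned by the payments team"
}

resource "signalfx_time_chart" "checkout_latency" {
  name      = "Checkout latency"
  plot_type = "LineChart"

  # SignalFlow program that drives the chart.
  program_text = <<-EOF
    data("http.server.duration", filter=filter("service.name", "checkout")).mean().publish(label="latency")
  EOF
}

resource "signalfx_dashboard" "checkout_overview" {
  name            = "Checkout Overview"
  dashboard_group = signalfx_dashboard_group.payments.id

  chart {
    chart_id = signalfx_time_chart.checkout_latency.id
    width    = 12
    height   = 1
    row      = 0
    column   = 0
  }
}
```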
Teams can also be specified from within the Terraform definitions so that dashboards show up on the specified Team's landing page. If you'd like to learn more about Observability as Code, check out our blog posts Observability as Code and Let's Talk Terraform.

Implement OpenTelemetry

To enable self-service observability, you need to standardize the collection, processing, and export of telemetry data. OpenTelemetry provides this standardization through semantic conventions, including:

- Tagging conventions: service.name, deployment.environment, team
- Metric and span naming standards: http.server.duration
- Attribute requirements: cloud.provider, region, k8s.cluster.name

When telemetry data is standardized, teams can confidently find and understand their data, dashboards, and alerts. These resources become reusable and familiar to users across teams, making it possible for incident responders to jump in quickly and provide assistance in emergencies. Semantic conventions also help your observability tooling aggregate across different cloud providers, frameworks, and programming languages to give a unified view of your business.

Measure, Iterate, Expand

Achieving self-service observability is a journey: an incremental process of measuring adoption, identifying gaps, and iterating. Tracking adoption metrics, usage metrics, and cost helps you fine-tune your practice and deliver complete visibility into your systems. We recommend tracking the key metrics below to verify that teams are using observability, to understand and control costs, and to help justify your investment in observability.

Adoption metrics to track:

- Percentage of teams with at least one active detector
- Number of dashboards or detectors created per team
- Time from alert trigger to acknowledgement
- Support ticket volume (this should decrease as your observability practice matures)

Usage metrics to track:

- Metric cardinality
- Time series volume
- Whether specific metrics are actually being used

Cost optimizations:

- Filter out unused or overly granular metrics
- Set rules for aggregating or dropping metrics
- Route data to reduce ingestion and storage costs

Avoid Common Pitfalls

Finally, when implementing a self-service observability practice, it's essential to avoid common problem areas that can lead to chaos, increased cost, and operational silos. Below are some key pitfalls to watch out for, along with strategies to avoid them.

Don't skip documentation

Every observability resource should have a clear description, contact information for the owning team, and links to runbooks where applicable. It might sound extreme, but a runbook should accompany every detector. Here's a Terraform example of a detector with proper documentation:
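The sketch below shows the idea; the SignalFlow program, thresholds, notification target, and runbook URL are placeholders for your own service and documentation.

```hcl
# Detector with an owner, a description, and a runbook link on every rule.
# Metric, thresholds, and URLs are placeholders.
resource "signalfx_detector" "checkout_high_latency" {
  name        = "Checkout service: high latency"
  description = "Owned by the payments team (payments-oncall@example.com). Fires when mean request latency stays above 500ms."

  program_text = <<-EOF
    signal = data("http.server.duration", filter=filter("service.name", "checkout")).mean().publish(label="latency")
    detect(when(signal > 500, lasting="5m")).publish("Checkout latency too high")
  EOF

  rule {
    detect_label  = "Checkout latency too high"
    severity      = "Major"
    description   = "Mean latency above 500ms for 5 minutes"
    runbook_url   = "https://wiki.example.com/runbooks/checkout-latency"
    tip           = "Check recent deployments and downstream dependency health first."
    notifications = ["Email,payments-oncall@example.com"]
  }
}
```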
Don't ignore costs

With each team managing its own observability, there is the potential for a telemetry data explosion: a rapid increase in the amount of telemetry data being exported to your observability backend. That explosion of data is not only difficult to process and analyze, it also increases costs. Within Splunk Observability Cloud, you can always get detailed insight into subscription usage and billing to make sure you stay within your defined limits. In the Enterprise Edition of Splunk Observability Cloud, Metrics Pipeline Management (MPM) provides a centralized way to manage metrics and easily identify metric usage in-product. You can then address metric explosion in-product or adjust the metric in code, and even adjust storage settings to lower costs and improve monitoring performance.

Don't create silos

Giving every team control over its observability practice can quickly dissolve into chaos if standards aren't aligned. Avoid silos and promote a unified strategy by:

- Using shared naming conventions
- Providing shared templates
- Promoting best practices through enablement (office hours, onboarding sessions, communication channels, internal demos, lunch-and-learns)

Always encourage cross-team collaboration and create opportunities for shared learning so your self-service observability practice grows and thrives.

Wrap Up

With an organized team structure, consistent telemetry powered by OpenTelemetry, and observability as code, your teams will not only move faster, they will also be empowered with the insights they need to respond to issues with greater efficiency and confidence.

Ready to take the first step in your self-service observability journey? Start building your self-service observability practice today with Splunk Observability Cloud's 14-day free trial.

Resources

- Self-Service Observability: How to Scale Observability Adoption Through Self-Service
- Building a Self-Service Observability Practice
- Let's Talk Terraform
- Introduction to the Splunk Terraform Provider
- Observability as Code
- Enable self-service observability