Comprehensive Observability Dashboard using error logs

tinsky · ‎06-07-2024

Role/Use Case: Incident Management

Background: As Incident Managers, we are on the front lines when high-priority incidents occur. Our role demands rapid triage to assess impact, urgency, and scale. Because incidents are unpredictable, whether they are caught by monitoring or reported by users, we often lack comprehensive information at the outset. This gap makes it hard to determine an incident's scope and severity.

What is it?: An observability dashboard that aggregates data from 14 distinct indexes (Datasources). These Datasources, traditionally siloed by specific operational areas such as Network and Middleware, are unified in one dashboard. It actively scans for errors, including HTTP Response codes and instances of the word "error," across all Datasources. This integration enables a proactive "eyes on glass" approach and facilitates reactive incident triage through a versatile "Wildcard Search." For example, inputting "billing" triggers a comprehensive search across all indexes for related errors, revealing trends like a backend error spike that could explain front-end issues or tickets.
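To make the idea concrete, here is a minimal Python sketch of the matching logic described above: flag events that contain an HTTP error code or the word "error", then optionally narrow them with a search term such as "billing". It is not the dashboard's actual query. The sample events, the index names, the `search_errors` function, and the choice to treat only 4xx/5xx codes as errors are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample events; in practice these would come from each datasource's index.
EVENTS = [
    {"index": "network", "message": "GET /billing/invoice 502 upstream timeout"},
    {"index": "middleware", "message": "billing-service ERROR: connection pool exhausted"},
    {"index": "middleware", "message": "GET /health 200 ok"},
    {"index": "frontend", "message": "POST /login 401 unauthorized"},
]

# Assumption: an "error" is the word "error" (any case) or a standalone 4xx/5xx code.
# This is deliberately crude; any three-digit number starting with 4 or 5 will match.
ERROR_PATTERN = re.compile(r"\berror\b|\b[45]\d{2}\b", re.IGNORECASE)


def search_errors(events, term="*"):
    """Count error events per index, optionally filtered by a search term.

    A term of "*" (or an empty string) matches every error event.
    """
    needle = term.strip("*").lower()
    counts = Counter()
    for event in events:
        message = event["message"]
        if not ERROR_PATTERN.search(message):
            continue  # not an error event
        if needle and needle not in message.lower():
            continue  # error event, but unrelated to the search term
        counts[event["index"]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(search_errors(EVENTS, "billing"))
```

Grouping the counts by index is what lets a single search show which operational area the errors are coming from, for example a backend spike behind a front-end symptom.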

Results: This dashboard has proven its value, providing pertinent information in 65% of our most critical incidents. Its implementation has significantly decreased the mean time to restore service, underscoring its effectiveness in our incident management process.
