On software engineering teams, culture extends beyond work experience and coffee roast preferences. Team culture also includes opinions on ticket creation and completion, merge request reviews, on-call rotations, and incident response. When it comes to incident response, a Don’t Repeat Incident (DRI) culture on a team or across a company can be hugely beneficial in maintaining a reliable codebase, a positive user experience, and overall team happiness.
What is DRI? When an incident occurs, on-call team members are notified of triggered alerts and are often pulled into an incident response role to troubleshoot and resolve the incident. Traditionally, once the incident is resolved, the on-call team’s work is over. Contrast this with a DRI approach: when the incident is successfully mitigated, the work is not over. Post-incident, work is tasked out and prioritized to put in place the code changes or safeguards that will prevent the same incident from occurring again. In other words, every incident results in one or more actionable tickets or issues.

Sidenote: post-incident reviews (or postmortems) are also a great idea. They serve as a retrospective on the incident: why it occurred, how it could have been prevented, who was impacted, and how it will be prevented in the future.
When this whole process works, it helps create a proactive culture and reduces incident frequency. However, manual steps always leave room for error: someone forgets to create a ticket, someone forgets to follow up, and soon alerts for errors that have been seen before are firing yet again. Taking manual intervention out of the equation helps guarantee that tickets are created, added to the right repositories or boards, and prioritized to prevent incidents from repeating. How can technology help us with this? Cue webhooks.
Webhooks are a great way for systems to communicate when specific events occur. They can be used for a range of interactions: building out CI/CD pipelines, sharing data in real time (think confirmation emails after an online order), initiating two-factor authentication at login, and more.
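Under the hood, a webhook is simply an HTTP POST with a JSON payload sent to a URL that the receiving system exposes. As a rough sketch (not part of the Splunk setup below, and with a made-up /alerts endpoint and Flask as the web framework), a minimal receiver might look like this:

```python
# Minimal webhook receiver sketch (illustrative only).
# The /alerts path and the payload contents are hypothetical.
from flask import Flask, request

app = Flask(__name__)

@app.route("/alerts", methods=["POST"])
def handle_alert():
    event = request.get_json(silent=True) or {}
    # A real integration would verify a shared secret or signature here,
    # then act on the event: create a ticket, page someone, kick off a script.
    print(f"Received webhook event: {event}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```

The sending system only needs to know the receiver’s URL and the payload format it expects, which is what makes webhooks so easy to wire between otherwise unrelated tools.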
Because they’re lightweight and efficient, many platforms support webhooks: Gmail, Slack, GitHub, Jira, and the list goes on. In this post, we’ll look at how Splunk Observability Cloud webhooks can support the DRI practice described above to reduce incidents, increase code resiliency, and remove the need for manual intervention.
Your environment is unique, and Splunk Observability Cloud provides many out-of-the-box integrations (Slack, Jira, ServiceNow) that may suit your incident response needs. For situations the built-in options don’t cover, custom webhook integrations have your back.
Say we keep track of our product’s code issues using GitHub Issues. We can create a webhook integration in Splunk Observability Cloud to track active incidents by automatically opening an issue anytime an alert fires. Again, this helps ensure that human eyes land on the root cause and that the mitigation work is identified and recorded in an issue, so the incident does not repeat.
Let’s look at how we can go about setting up a GitHub webhook in Splunk Observability Cloud.
First, we’ll navigate to Data Management in Splunk Observability Cloud and search for the Webhook integration:
We can follow along with the guided setup and configure our GitHub connection:
Select Next to customize the auto-populated payload (but first, notice how much data you can send and act on in the remote system):
We’ll update the fields to match those required by the GitHub API for creating a new issue, and use the messageTitle and description variables to populate our issue:
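For context, the request our webhook ends up making is roughly equivalent to the following sketch against the GitHub REST API. OWNER, REPO, and the token are placeholders, and in the real integration Splunk Observability Cloud sends the request for us using the payload template we just customized:

```python
# Roughly what the webhook does on our behalf when an alert fires.
# OWNER, REPO, and GITHUB_TOKEN are placeholders; the title and body values
# come from the alert's messageTitle and description variables.
import os
import requests

def create_issue(message_title: str, description: str) -> str:
    response = requests.post(
        "https://api.github.com/repos/OWNER/REPO/issues",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        json={"title": message_title, "body": description},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["html_url"]  # link to the newly created issue
```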
After selecting Next, we can review and save our webhook:
With our GitHub webhook saved, we next need to add it as a notification recipient on the desired detectors.
Note: to add a webhook as a detector recipient, you must have administrator access.
We can configure webhooks by editing the detector of choice:
We can add Alert recipients and select Webhook:
Then select our newly created GitHub Issue Webhook as the recipient and activate our updated alert notification:
Note: our GitHub webhook hits the GitHub REST API to create issues, and those issues are scoped to the repository specified in the request URL (in our case, the Worms in Space repository). That means we want to attach this webhook only to detectors related to the repository in that URL. Thankfully, when we create detectors and alert rules, we can easily scope them to specific services.
That’s it! When an alert rule is triggered (and, in this case, also when an alert is cleared or resolved), we’ll see a new issue automatically appear in our repo’s GitHub Issues:
This is a simple example, but you can imagine integrating this with a CI/CD system: monitor a critical metric, fire off a webhook that runs a script to check whether there has been a recent release, and roll back that release if the metric goes out of bounds. The exact setup varies because there are so many systems out there, but the possibilities are nearly endless.
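As a sketch of that idea (everything here is hypothetical: the /metric-alert endpoint, the payload’s status field, the release-lookup helper, and the rollback script are all stand-ins for whatever your pipeline provides):

```python
# Hypothetical sketch: roll back a very recent release when a critical-metric
# detector fires. All names below are stand-ins for your own CI/CD tooling.
import subprocess
from datetime import datetime, timedelta, timezone

from flask import Flask, request

app = Flask(__name__)
ROLLBACK_WINDOW = timedelta(hours=2)  # only roll back releases this recent

def last_release_time() -> datetime:
    # Stand-in: in practice, query your deployment system or CI/CD API.
    return datetime.now(timezone.utc) - timedelta(minutes=30)

@app.route("/metric-alert", methods=["POST"])
def metric_alert():
    event = request.get_json(silent=True) or {}
    if event.get("status") != "triggered":  # assumed field name; match your payload
        return "", 204
    if datetime.now(timezone.utc) - last_release_time() < ROLLBACK_WINDOW:
        # Roll back with whatever your pipeline supports (kubectl, Helm, a deploy API, ...).
        subprocess.run(["./rollback-latest-release.sh"], check=True)
    return "", 204
```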
Now that our alerts create issues in our GitHub repository, we can prioritize and resolve their root causes and eliminate repeat incidents. Rather than being pulled away by recurring alerts, our focus can stay on delivering high-priority code, our applications can stay resilient, and our users can stay happy.
Want to accomplish something similar with Jira? Check out the Splunk Observability Cloud built-in Jira integration to easily connect Jira projects and create issues based on Splunk Observability Cloud alerts. Interested in building out a custom webhook? That works too! Once you’ve built a custom webhook, follow the same steps above so it can listen for and receive Splunk Observability Cloud alert notifications.
Don’t yet have Splunk Observability Cloud? Try it free for 14 days!