
Almost Too Eventful Assurance: Part 2

Connor_Tye
Splunk Employee

Work While You Sleep

Before you can rely on any autonomous remediation measures, you need to close the loop between detection and action by codifying your best practices. Bundling your performance metrics, remediation scripts, and operational workflows doesn’t just reduce MTTR, it depressurizes troubleshooting. Let’s take a look at an example:

What happened? Your company’s ThousandEyes agents flagged intermittent packet loss on a CDN provider serving U.S. West customers. With many CDNs fronting HTTP traffic, packet loss can make connections fail outright rather than just load slowly – so there’s a real risk that users abandon their shopping carts and support tickets. Instead of checking all your dashboards and hopping into a war room, both ITOps and NetOps teams see a single, clear incident tied to shared SLAs and business impact. Before your hands even reach the keyboard, either automatically or through predefined logic, ITSI sets off a chain of workflows to stop the cascade of alerts and proactively optimize resources if the problem persists. Here’s a high-level flow:

  1. Event Analytics groups raw packet‑loss alerts with concurrent HTTP 5XX errors, so you can quickly spot that ‘oh, this looks like a possible problem with the server itself.’

  2. Trigger playbooks to run a scripted sequence - pull the edge‑node health, refresh DNS cache, and open a ticket if thresholds remain breached for more than 5 minutes.

  3. Set Notifications to alert only on persistent failures - no 3 AM wake‑ups for transient glitches.

With these steps, ITOps and NetOps teams can eliminate midnight pages for fleeting issues and surface problems in full context, with recommended remediation steps ready to guide the way. A rough sketch of what the scripted playbook sequence from step 2 could look like follows.
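To make that playbook step concrete, here is a minimal Python sketch of the scripted sequence: pull edge-node health, try a low-risk DNS cache refresh first, and open a ticket only if the breach persists past five minutes. The function names, threshold, and polling interval are all assumptions (placeholders for your monitoring, automation, and ticketing integrations), not ITSI or SOAR APIs.

```python
"""Hypothetical remediation playbook sketch.

The helpers below (fetch_edge_node_health, refresh_dns_cache, open_ticket)
are placeholders, not real ThousandEyes or ITSI APIs; in practice they would
wrap your monitoring API, automation tooling, and ITSM integration.
"""
import time

PACKET_LOSS_THRESHOLD = 2.0      # percent; assumed SLA threshold
BREACH_WINDOW_SECONDS = 5 * 60   # escalate only after 5 sustained minutes
POLL_INTERVAL_SECONDS = 30

def fetch_edge_node_health(node: str) -> float:
    """Placeholder: return current packet loss (%) for a CDN edge node."""
    raise NotImplementedError("wire this to your monitoring API")

def refresh_dns_cache(node: str) -> None:
    """Placeholder: trigger a DNS cache refresh on the affected node."""
    raise NotImplementedError("wire this to your automation tooling")

def open_ticket(summary: str) -> None:
    """Placeholder: open an incident ticket in your ITSM system."""
    raise NotImplementedError("wire this to your ticketing integration")

def run_playbook(node: str) -> None:
    """Pull health, attempt a cheap fix, escalate only if the breach persists."""
    breach_started = None
    while True:
        loss = fetch_edge_node_health(node)
        if loss <= PACKET_LOSS_THRESHOLD:
            break  # recovered; no escalation needed
        if breach_started is None:
            breach_started = time.time()
            refresh_dns_cache(node)  # first response: low-risk remediation
        elif time.time() - breach_started > BREACH_WINDOW_SECONDS:
            open_ticket(f"Sustained {loss:.1f}% packet loss on {node}")
            break  # escalated; hand off to on-call
        time.sleep(POLL_INTERVAL_SECONDS)
```

In practice each placeholder would be an action in your automation platform; the “try a cheap fix first, escalate only on a sustained breach” ordering is what keeps transient glitches from paging anyone.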

Predict Tomorrow’s Problems Today

Catching problems before they impact customers is kind of the ultimate idea behind fast detection and remediation, right? By applying advanced time-series forecasting and anomaly scoring to ThousandEyes network tests and application KPIs, ITOps and NetOps teams can turn historical data into forward-looking insights. That could mean getting an early warning of capacity bottlenecks, degrading performance, or routing instabilities, and taking preemptive action to keep those problems from occurring in the first place. Let’s make a prediction ourselves on what this could look like for some of you later this year:

8AM: *Click – You flip on the local Wisconsin news, feeling excited but anxious to host so many relatives for the upcoming holiday break, especially so given how exhausted you were during this time last year. The meteorologists on TV warn that a ‘great blizzard will slam the Northeast U.S. on December 24th with up to 32 inches of snow’... ‘Oh no!’ you think to yourself – ‘I’ll probably be up all night trying to keep services running…’ – but this year you're prepared.  

Now at your computer, you start forecasting network telemetry from ThousandEyes in ITSI, and see that historical data from similar storms shows a 180% surge in e-commerce traffic between 10 PM and midnight CST. With insight into the service performance impact of last-minute shoppers, you prepare to run more tests and pre-scale resources.

Try forecasting network telemetry from ThousandEyes in ITSI (a rough standalone sketch follows below):

  • Apply time-series models to ThousandEyes synthetic transaction times and checkout API throughput.

  • Project KPI trends like hour-by-hour load increases, DNS query times, request throughput and more, hours in advance.

  • Generate a color-coded “risk index,” like a heat‑map overlay on your service map, showing which backend components (e.g., a payment gateway or inventory service) will need preemptive attention.

Rather than scrambling to add resources during the storm, ITOps teams can pre‑scale services, validate CDN failover settings, and run synthetic smoke tests, transforming a risky night into a smooth, controlled event.
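As a standalone illustration of the forecasting idea (not how ITSI implements it internally), here is a small sketch using Holt-Winters exponential smoothing from statsmodels. The CSV file, column name, forecast horizon, and 800 ms latency budget are all hypothetical stand-ins for hourly KPI data exported from your ThousandEyes tests.

```python
"""Minimal KPI forecasting sketch using Holt-Winters exponential smoothing.

The data source is assumed: in practice the hourly series would come from
ThousandEyes test results available in Splunk.
"""
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Assumed input: hourly checkout transaction times (ms), indexed by timestamp.
history = pd.read_csv("checkout_transaction_times.csv",   # hypothetical export
                      index_col="timestamp", parse_dates=True)["avg_ms"]

# Fit an additive trend + daily-seasonality model and look 12 hours ahead.
model = ExponentialSmoothing(history, trend="add",
                             seasonal="add", seasonal_periods=24).fit()
forecast = model.forecast(12)

# Derive a crude "risk index": how far each forecast hour sits above an
# assumed 800 ms latency budget, scaled into the 0-1 range.
SLO_MS = 800.0
risk_index = ((forecast - SLO_MS) / SLO_MS).clip(lower=0.0, upper=1.0)
print(risk_index.round(2))
```

A risk index like this is what a heat-map overlay would be built on: anything trending toward 1.0 in the next few hours is a candidate for pre-scaling before the traffic surge arrives.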

Focus on Shared Intelligence, Workflows, and Automation

Let’s bring everything we’ve already covered together for one last scenario, and imagine a mid‑day network degradation somewhere in Northeast Asia:

  1. Your NetOps team’s ThousandEyes synthetic DNS tests detect a 30% increase in resolution time at edge nodes in Tokyo.

  2. An ITSI Detector then correlates that with a rise in checkout‑page errors, grouping them into a single incident and prioritizing it by business impact – we’ll call it “APAC Checkout Latency”.

  3. Next, ITSI’s Predictive capabilities forecast that, if current trends continue, user latency will breach SLAs in less than three hours. It's time to prepare.

  4. The Automated Runbook then kicks off (a sketch of these steps follows the list):
    • Execute an API‑driven switch to an alternative DNS provider.

    • Run post‑switch synthetic tests and post the results back to the incident ticket.

    • Finally, notify both NetOps and ITOps via a single, annotated alert – complete with remediation logs and forecast graphs.
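
For illustration, here is a hedged sketch of those runbook steps as a plain Python sequence. Every function is a hypothetical placeholder for an integration (DNS provider API, synthetic test trigger, ticketing, notifications); in ITSI or Splunk SOAR these would be playbook actions rather than hand-rolled code, and the region and provider names are assumptions.

```python
"""Sketch of the automated runbook as a plain sequence of steps.

All helpers are hypothetical placeholders for real integrations.
"""

def switch_dns_provider(region: str, provider: str) -> None:
    """Placeholder: API-driven cutover to an alternative DNS provider."""
    raise NotImplementedError

def run_synthetic_dns_tests(region: str) -> dict:
    """Placeholder: re-run DNS resolution tests and return the results."""
    raise NotImplementedError

def post_to_ticket(incident_id: str, note: str) -> None:
    """Placeholder: append remediation evidence to the incident ticket."""
    raise NotImplementedError

def notify_teams(message: str) -> None:
    """Placeholder: send one annotated alert to both NetOps and ITOps."""
    raise NotImplementedError

def apac_checkout_latency_runbook(incident_id: str) -> None:
    # 1. Cut over DNS for the affected region.
    switch_dns_provider(region="ap-northeast", provider="secondary")
    # 2. Validate the fix with post-switch synthetic tests.
    results = run_synthetic_dns_tests(region="ap-northeast")
    # 3. Attach the evidence to the incident ticket.
    post_to_ticket(incident_id, f"Post-switch DNS test results: {results}")
    # 4. Close the loop with a single shared notification.
    notify_teams(f"Incident {incident_id}: DNS failover executed, test results attached.")
```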

 

By the time teams log in for their shift, the incident is already mitigated – and customer experience remains uninterrupted. 
