Clearing the Cloudy Skies for PipeStorm

swesmason · ‎11-30-2021

The PipeStorm challenge is our latest challenge, ending on December 10th. Developed in four weeks, the game gave us a way to help the community learn about OpenTelemetry, also known in the industry as OTel. OpenTelemetry is a vendor-agnostic way of getting data (traces, spans, metrics, and soon logs) from your infrastructure and applications. OTel is now the standard way to get data into Splunk’s O11y Cloud. The latest brainchild of Splunk’s Advocacy Team, PipeStorm is a fun way to learn the OTel concepts and enter to win a t-shirt and possibly an Oculus Quest 2.

Splunk’s Advocacy team has created a dedicated effort to build interactive content and to use Splunk technology to monitor it. Our core responsibility is to connect with our practitioners and to share what we know about DevOps, Site Reliability Engineering (SRE), and monitoring of cloud-native applications. With the support of our product marketing, campaigns, and brand teams, we were the team behind the technical implementation of two of our games, O11y Quest and PipeStorm, and we are currently working on a third. Not only is it fun to take concepts around Observability and OpenTelemetry and put them into a game, but we also get to experience some of the same cloud, monitoring, and other technical challenges our users do.

But the fun is not just in building the games. It is in running them as well. Although the games are small, we hold them to the same expectations as any application: performant, available, and bug-free. Deployed worldwide, they have drawn over 10K active players, and we see peak loads every time there is a new promotion. The only way we were able to keep these high standards is by, of course, using Splunk!

Our Observability Stack

To support PipeStorm and O11y Quest, we turned to the capabilities of Splunk’s O11y Cloud. In particular, we used Splunk Infrastructure Monitoring, Splunk On-Call, and Synthetics to make sure the game was performant and that, if something broke (which it did a few times), we could respond quickly.

Infrastructure Monitoring

The core of our monitoring was the Infrastructure Monitoring solution in the Observability Suite. No matter where our application was hosted, whether on Heroku or AWS, we were able to quickly and easily see all aspects of our environment. From CPU load to the latency of serving data to our end users, we had all the metrics we needed to sustain the standards we set for an application of this scale.

Splunk On-Call

As applications grow more complex, the demands on developers and SREs grow with them. From release velocity to SLOs (Service Level Objectives), our team needed to respond to an ever-changing environment quickly, effectively, and at any time. Efficiency was key in deploying an effective IR (Incident Response) strategy, and enabling the team to focus with the right tooling is the right [On]call.

Synthetics

We had players from 82 countries, and we wanted to make sure that each of them had a great experience. We used Splunk Synthetic Monitoring to set up Real Browser tests from locations all over the world so we could see exactly what our users saw. Each test collects and exposes more than 30 performance metrics. Reviewing the test results, we discovered high latency in remote regions that caused response times to jump to as much as 9 seconds. With the help of the Splunk Web Optimization tool, we investigated all defects, used the provided guidance to optimize our code, and improved the app’s Lighthouse performance score to 97 points.
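To illustrate the kind of per-location review described above, here is a minimal sketch in plain Node.js. It is not the Splunk Synthetic Monitoring API: the `findSlowLocations` helper, the result shape (`location`, `responseTimeMs`), the sample numbers, and the 2000 ms threshold are all hypothetical, chosen only to show how averaging results by location can surface a slow region.

```javascript
// Hypothetical per-location test results. The field names and values are
// illustrative only and do not come from any Splunk API or from our real data.
const sampleResults = [
  { location: "frankfurt", responseTimeMs: 850 },
  { location: "frankfurt", responseTimeMs: 950 },
  { location: "sydney", responseTimeMs: 9000 },
  { location: "sydney", responseTimeMs: 7000 },
];

// Return the locations whose average response time exceeds thresholdMs,
// sorted slowest first.
function findSlowLocations(results, thresholdMs) {
  // Accumulate a running sum and count per location.
  const totals = new Map();
  for (const { location, responseTimeMs } of results) {
    const entry = totals.get(location) ?? { sum: 0, count: 0 };
    entry.sum += responseTimeMs;
    entry.count += 1;
    totals.set(location, entry);
  }

  // Convert to averages, keep only the slow locations, and sort descending.
  return [...totals]
    .map(([location, { sum, count }]) => ({ location, averageMs: sum / count }))
    .filter(({ averageMs }) => averageMs > thresholdMs)
    .sort((a, b) => b.averageMs - a.averageMs);
}

console.log(findSlowLocations(sampleResults, 2000));
```

With the sample data, only the location averaging above the threshold is reported. In practice the threshold would follow from whatever response-time target the team has set.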

Our Application Stack

Our application is deployed on AWS Elastic Beanstalk in three separate regions, chosen according to where each promotion runs. Our back-end is written in Node.js, with the front-end in React. Some user session variables (to track those cheaters) and scoreboards are written to a Postgres database, and we use a combination of logs in Splunk Cloud and Google Analytics for game usage telemetry.

No matter how simple the application, the monitoring challenges and the objective of maximizing the user experience are the same; all that changes is scale. For all our interactive content, now and in the future, we plan on sticking with Observability methodologies to make sure we deliver a consistent user experience and respond to incidents as they happen in production. This gives us a great opportunity to share the concepts of Observability and also to experience some of the challenges that you as a practitioner face day in and day out. If you want to learn more about how we built and monitor these games, let us know. The PipeStorm Challenge is still live, so test your OpenTelemetry skills today!
