Case Study: Complex Data Transformation Using Real-time Stream Processing

nzambo_splunk
Splunk Employee

If you’ve been onboarding data to Splunk for any amount of time, you have likely encountered formatting that initially appears straightforward but quickly becomes complicated as you dig deeper into the use case with your stakeholders. In this post, I’ll introduce a data scenario, the challenges and opportunities it presents, and then share a new way to make that data more valuable and usable.

On the surface, onboarding data seems straightforward: Where do I need to line break? Where is the appropriate timestamp, and what's its format? And if you're performance-minded, you'll be thinking about all of those "magic 8" settings. Over time you see patterns emerge, and these settings become fairly trivial to define.
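
To make that concrete, here's a minimal props.conf sketch for the kind of event we'll look at below, assuming each event arrives as a single line of JSON with an epoch-millisecond timestamp (the sourcetype name is a placeholder of my own):

[router:json]
# Single-line JSON, so never merge lines and break on newlines
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# The timestamp field is epoch milliseconds
TIME_PREFIX = "timestamp":\s*
TIME_FORMAT = %s%3N
MAX_TIMESTAMP_LOOKAHEAD = 13
TRUNCATE = 10000
# Let universal forwarders break the stream on event boundaries
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)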

However, when you consider who will be using the data and why, you'll often find an opportunity to pre-process the data, making it easier to consume and therefore more valuable. Some questions you might tackle in addition to line breaking and timestamping:

  • Why is this data valuable?
  • Who is searching the data?
  • Who are the consumers of the output?
  • What searches or results are most important to the consumers?
  • Is the raw data valuable, or are summaries enough? Metrics or events?
  • Is the data sensitive? Do we need to redact, deduplicate, or enrich it?

We’ll look at our stakeholder’s feedback on these questions after we review the raw data source. In the sample below, assume that those initial onboarding best practices have been followed, and what we're left with is a well-formatted JSON event.

{
  "device": {
    "deviceId": "127334527887",
    "deviceSourceId": "be:f3:af:c2:01:f1",
    "deviceType": "IPGateway"
  },
  "timestamp": 1489095004000,
  "rawAttributes": {
    "WIFI_TX_2_split": "325,650,390,150,150,780,293,135,325",
    "WIFI_RX_2_split": "123,459,345,643,234,534,123,134,656",
    "WIFI_SNR_2_split": "32, 18, 13, 43, 32, 50, 23, 12, 54",
    "ClientMac_split": "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53,1F:A7:42:DE:C1:4B,40:32:5D:4E:C3:A1,80:04:15:73:1F:D9,85:B2:15:B3:04:69,34:04:13:AA:4A:EC,4D:CB:0F:6B:3F:71,12:2A:21:13:25:D8"
  }
}

At first glance, this onboarded data looks great:

  • We have a structured format.
  • Splunk will expose field names for easy data discovery.
  • The timestamp has its own field, so we can easily designate it as the event time.

And now the extra detail from our stakeholder: 

"These events are from our router. The device field at the top describes the router itself, and then rawAttributes describes all of the downstream devices (ClientMac_split) that connect to the router and their respective performance values, like transmit, receive, and signal-to-noise values. We want to be able to report on these individual downstream devices, associate them with the router that serviced them, and investigate the metrics over time. We use this data to triage customer complaints and, over time, improve the resiliency of our network."

This context helps us make some key decisions:

  • We now know that the SPL required to process this data at search time would be extensive, possibly complex, and would have to be executed every time the data is searched (see the sketch after this list). We should pre-process these events, turning a single record containing many values into distinct records that carry the pertinent metadata. This simplifies the end-user search experience and reduces resource utilization.
  • Near-real-time data is important for investigations; the key data are the performance metrics, and those metrics get their dimensionality from within the record. As part of pre-processing, we should ensure that Splunk consumes the data as metrics and, as above, that those metrics have the proper dimensions attached. The resulting metrics improve search performance through fast mstats commands and shorten investigations through simpler searches on more timely data.
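
To give a sense of that search-time burden, here is a rough sketch of the SPL a user might otherwise have to run against the raw event above every single time (the index and sourcetype names are placeholders of my own, and your extracted field paths may differ):

index=router_telemetry sourcetype=router:json
| spath
``` turn each comma-separated *_split attribute into a multivalue field ```
| eval mac = split('rawAttributes.ClientMac_split', ","),
       tx  = split('rawAttributes.WIFI_TX_2_split', ","),
       rx  = split('rawAttributes.WIFI_RX_2_split', ","),
       snr = split('rawAttributes.WIFI_SNR_2_split', ",")
``` zip the parallel lists together, then fan out to one row per client device ```
| eval zipped = mvzip(mvzip(mvzip(mac, tx), rx), snr)
| mvexpand zipped
| eval parts      = split(zipped, ","),
       client_mac = mvindex(parts, 0),
       wifi_tx    = tonumber(mvindex(parts, 1)),
       wifi_rx    = tonumber(mvindex(parts, 2)),
       wifi_snr   = tonumber(trim(mvindex(parts, 3)))
| table _time, device.deviceId, client_mac, wifi_tx, wifi_rx, wifi_snr

Multiply that by every dashboard panel and ad hoc search that touches this data, and the case for doing the work once, up front, is easy to make.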

While some of this can be done with traditional props or transforms, either on a heavy forwarder or on the indexers themselves, there is (I think) a better way to address these requirements. Stream Processing, either Data Stream Processor (DSP) for on-prem or Stream Processor Services (SPS) on Splunk Cloud, offers us the ability to author powerful data pipelines to solve these complex data processing challenges.

With Stream Processing we can use the familiar search processing language to apply the needed transformations in the stream before the data is indexed. This removes complexity from the data, reduces search-time and index-time resource consumption, and improves data quality.
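
As a concrete target, the single sample event above would fan out into nine records, one per entry in ClientMac_split, each carrying its own measurements plus the router's identifiers as dimensions. Something roughly like this (the field names here are illustrative, not anything DSP or SPS prescribes):

{
  "timestamp": 1489095004000,
  "deviceId": "127334527887",
  "deviceType": "IPGateway",
  "client_mac": "BD:A2:C9:CB:AC:F3",
  "wifi_tx": 325,
  "wifi_rx": 123,
  "wifi_snr": 32
}

With records shaped like this, wifi_tx, wifi_rx, and wifi_snr can land in a metrics index as measurements, while client_mac, deviceId, and deviceType ride along as dimensions for mstats to group and filter by.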

You can learn more about Splunk stream processing here or here. Then follow me over to Splunk Lantern for the step-by-step walkthrough of the pipeline I created to address this fun challenge of aligning the incoming data to business value.

So what do you think? Have you had similar data challenges? Let me know below in the comments, as I’d love to hear about your use cases!

— Nick Zambo, Platform Architect

bjennewein
Retired

My friend... this is some solid Splunking, right here! Thanks for sharing your keen insights, and timely insights, too, imho. Hearing more and more, here and there, about the challenges getting the most value out of data but also being a good steward with the effort and costs associated. And I especially loved that some of your guiding questions ask us to think through who will be using the data and what might be important to them. 🙂

nzambo_splunk
Splunk Employee

Thank you!  I'm so happy I can share some of the cool stuff I see in the field with this awesome community.  I really appreciate the feedback and opportunity.
