Case Study: Complex Data Transformation Using Real-time Stream Processing

Splunk Employee

If you’ve been onboarding data to Splunk for any amount of time, you have likely encountered formatting that initially appears straightforward but quickly becomes complicated as you dig deeper into the use case with your stakeholders. In this post, I’ll introduce you to a data scenario, walk through the challenges and opportunities it presents, and then share a new way to make that data more valuable and usable.

On the surface, onboarding data seems straightforward: Where do I need to break lines? Where is the timestamp, and what is its format? And if you are performance-minded, you'll be thinking about all of those "magic 8" settings. Over time, you see patterns emerge and these settings become fairly trivial to define.

However, when you consider who will be using the data and why, you’ll find that there’s often an opportunity to pre-process the data, making it easier to consume and therefore more valuable. Some questions you might consider tackling in addition to line breaking and timestamping:

  • Why is this data valuable?
  • Who is searching the data?
  • Who are the consumers of the output?
  • What searches or results are most important to the consumers?
  • Is the raw data valuable or are just summaries needed? Metrics or Events?
  • Is the data sensitive? Do we need to redact, deduplicate, or enrich it?

We’ll review feedback from our stakeholder about these questions after we review the raw data source. In the sample below, assume that those initial onboarding best practices have been followed, and what we're left with is a well-formatted JSON event.



{
  "device": {
    "deviceId": "127334527887",
    "deviceSourceId": "be:f3:af:c2:01:f1",
    "deviceType": "IPGateway"
  },
  "timestamp": 1489095004000,
  "rawAttributes": {
    "WIFI_TX_2_split": "325,650,390,150,150,780,293,135,325",
    "WIFI_RX_2_split": "123,459,345,643,234,534,123,134,656",
    "WIFI_SNR_2_split": "32, 18, 13, 43, 32, 50, 23, 12, 54",
    "ClientMac_split": "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53,1F:A7:42:DE:C1:4B,40:32:5D:4E:C3:A1,80:04:15:73:1F:D9,85:B2:15:B3:04:69,34:04:13:AA:4A:EC,4D:CB:0F:6B:3F:71,12:2A:21:13:25:D8"
  }
}



At first glance, this onboarded data looks great:

  • We have a structured format.
  • Splunk will expose field names for easy data discovery.
  • The timestamp has its own field so we can easily designate that for our record.
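To make the timestamp point concrete, here is a minimal Python sketch (not part of the Splunk configuration itself) that reads the field and converts it to a UTC time. It assumes the 13-digit value is epoch milliseconds, which the sample suggests but the source system does not state; the helper name event_time is made up for this illustration.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample event; only the fields used here are kept.
raw_event = """
{
  "device": {"deviceId": "127334527887", "deviceType": "IPGateway"},
  "timestamp": 1489095004000
}
"""


def event_time(event: dict) -> datetime:
    """Convert the event's timestamp to a UTC datetime.

    Assumption: the value is epoch milliseconds, so divide by 1000.
    """
    return datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)


event = json.loads(raw_event)
parsed_time = event_time(event)
print(parsed_time.isoformat())  # 2017-03-09T21:30:04+00:00
```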

And now the extra detail from our stakeholder: 

"These events are from our router. The device field at the top describes the router itself, and rawAttributes describes all of the downstream devices (ClientMac_split) that connect to the router, along with their respective performance values: transmit, receive, and signal-to-noise. We want to be able to report on these individual downstream devices, associate them with the router that serviced them, and investigate the metrics over time. We use this data to triage customer complaints and, over time, improve the resiliency of our network."

This context helps us make some key decisions:

  • We now know that the SPL required to process this data would be extensive, possibly complex, and would have to be executed every time this data is searched. We should pre-process these events from a single record containing many values to distinct records that contain the pertinent metadata. This would simplify the end-user search experience, and reduce resource utilization.
  • Near-real-time data is important for investigations. The key data are the performance metrics, and those metrics take their dimensions from within the record itself. As part of pre-processing, we should ensure that Splunk consumes the data as metrics and, as noted above, that those metrics have the proper dimensions attached. The resulting metrics will improve search performance through fast mstats commands and reduce time to investigate through easier searches on more timely data.
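The first decision, turning one record with many values into distinct records, can be sketched outside of Splunk in a few lines of Python. This is only an illustration of the reshaping, not the pipeline from the walkthrough. It assumes the four _split lists are positionally aligned (the i-th value in each list belongs to the i-th client MAC), and the output field names (router_id, client_mac, wifi.tx, and so on) are hypothetical.

```python
from typing import Iterator, List

# Source list field -> hypothetical metric name used in the output records.
METRIC_FIELDS = {
    "WIFI_TX_2_split": "wifi.tx",
    "WIFI_RX_2_split": "wifi.rx",
    "WIFI_SNR_2_split": "wifi.snr",
}

sample_event = {
    "device": {
        "deviceId": "127334527887",
        "deviceSourceId": "be:f3:af:c2:01:f1",
        "deviceType": "IPGateway",
    },
    "timestamp": 1489095004000,
    "rawAttributes": {
        "WIFI_TX_2_split": "325,650,390,150,150,780,293,135,325",
        "WIFI_RX_2_split": "123,459,345,643,234,534,123,134,656",
        "WIFI_SNR_2_split": "32, 18, 13, 43, 32, 50, 23, 12, 54",
        "ClientMac_split": (
            "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53,1F:A7:42:DE:C1:4B,"
            "40:32:5D:4E:C3:A1,80:04:15:73:1F:D9,85:B2:15:B3:04:69,"
            "34:04:13:AA:4A:EC,4D:CB:0F:6B:3F:71,12:2A:21:13:25:D8"
        ),
    },
}


def split_values(csv_text: str) -> List[str]:
    """Split a comma-separated string and trim stray whitespace."""
    return [item.strip() for item in csv_text.split(",")]


def explode_clients(event: dict) -> Iterator[dict]:
    """Yield one record per downstream client, tagged with router metadata."""
    attrs = event["rawAttributes"]
    macs = split_values(attrs["ClientMac_split"])
    columns = {
        metric: split_values(attrs[source])
        for source, metric in METRIC_FIELDS.items()
    }
    # Assumption: lists are positionally aligned. Fail loudly if they are not.
    for metric, values in columns.items():
        if len(values) != len(macs):
            raise ValueError(f"{metric} has {len(values)} values for {len(macs)} clients")
    for index, mac in enumerate(macs):
        record = {
            "time": event["timestamp"] / 1000,  # assumed epoch milliseconds
            "router_id": event["device"]["deviceId"],
            "router_mac": event["device"]["deviceSourceId"],
            "router_type": event["device"]["deviceType"],
            "client_mac": mac,
        }
        for metric, values in columns.items():
            record[metric] = int(values[index])
        yield record


records = list(explode_clients(sample_event))
print(len(records), records[0])
```

Each resulting record carries the router metadata alongside one client's readings, so a search no longer has to split and re-align the lists every time it runs.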

While some of this can be done with traditional props and transforms, either on a heavy forwarder or on the indexers themselves, there is (I think) a better way to address these requirements. Stream processing, either Data Stream Processor (DSP) for on-prem or Stream Processor Service (SPS) on Splunk Cloud, offers us the ability to author powerful data pipelines that solve these complex data processing challenges.

With stream processing, we can use the familiar Search Processing Language (SPL) to apply the needed transformations in the stream, before the data is indexed. This removes complexity from the data, reduces resource consumption at both search time and index time, and improves data quality.
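As a rough picture of what a single data point might look like just before it is indexed, the Python sketch below shapes one per-client record (one router-to-client pairing with its three readings) into a metric with dimensions. The layout follows the multiple-metric JSON format accepted by the Splunk HTTP Event Collector, which is an assumption about the destination; the actual pipeline may write to the metrics index differently. The dimension and metric names are hypothetical.

```python
import json

# Hypothetical names: which keys are dimensions and which are measurements.
DIMENSIONS = ("router_id", "router_mac", "client_mac")
MEASUREMENTS = ("wifi.tx", "wifi.rx", "wifi.snr")

# One per-client record, using the first client from the sample event.
client_record = {
    "time": 1489095004.0,
    "router_id": "127334527887",
    "router_mac": "be:f3:af:c2:01:f1",
    "client_mac": "BD:A2:C9:CB:AC:F3",
    "wifi.tx": 325,
    "wifi.rx": 123,
    "wifi.snr": 32,
}


def to_metric_payload(record: dict) -> dict:
    """Build a multiple-metric style payload: dimensions plus metric_name:<name> keys."""
    fields = {dim: record[dim] for dim in DIMENSIONS}
    for name in MEASUREMENTS:
        fields[f"metric_name:{name}"] = record[name]
    return {"time": record["time"], "event": "metric", "fields": fields}


payload = to_metric_payload(client_record)
print(json.dumps(payload, indent=2))
```

With the readings stored as metrics and the router and client identifiers stored as dimensions, a report per client or per router becomes a grouping on those dimensions rather than a parsing exercise.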

You can learn more about Splunk stream processing here or here. Then follow me over to Splunk Lantern for the step-by-step walkthrough of the pipeline I created to address this fun challenge of aligning the incoming data to business value.

So what do you think? Have you had similar data challenges? Let me know below in the comments, as I’d love to hear about your use cases!

— Nick Zambo, Platform Architect
