If you’ve been onboarding data to Splunk for any amount of time, you have likely encountered formatting that initially appears straightforward but becomes complicated as you dig deeper into the use case with your stakeholders. In this post, I’ll introduce you to a data scenario, the challenges and opportunities it presents, and then share a new way to make that data more valuable and usable.
At the surface level, onboarding data seems straightforward: Where do I need to line break? Where is the appropriate timestamp and what’s its format? And if you are performance minded, you’ll be thinking about all of those “magic 8” settings. Over time, you see patterns emerge and these settings become fairly trivial to define.
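For a single-line JSON source like the one we’ll look at below, those settings might land in props.conf along these lines (the sourcetype name is illustrative, and the exact values depend on your source):

```ini
# props.conf -- the "magic 8" for a single-line JSON source
[router:telemetry]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)
TRUNCATE = 10000
# The sample event carries epoch time in milliseconds, e.g. 1489095004000
TIME_PREFIX = "timestamp":\s*
TIME_FORMAT = %s%3N
MAX_TIMESTAMP_LOOKAHEAD = 13
```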
However, when you consider who will be using the data and why, you’ll often find an opportunity to pre-process the data, making it easier to consume and therefore more valuable. Some questions you might consider tackling in addition to line breaking and timestamping: Should a single raw event be split into multiple events? Should fields be renamed or reshaped for the people consuming them? Would some of the values be better served as metrics than as event data?
We’ll review feedback from our stakeholder about these questions after we review the raw data source. In the sample below, assume that those initial onboarding best practices have been followed, and what we're left with is a well-formatted JSON event.
{
  "device": {
    "deviceId": "127334527887",
    "deviceSourceId": "be:f3:af:c2:01:f1",
    "deviceType": "IPGateway"
  },
  "timestamp": 1489095004000,
  "rawAttributes": {
    "WIFI_TX_2_split": "325,650,390,150,150,780,293,135,325",
    "WIFI_RX_2_split": "123,459,345,643,234,534,123,134,656",
    "WIFI_SNR_2_split": "32, 18, 13, 43, 32, 50, 23, 12, 54",
    "ClientMac_split": "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53,1F:A7:42:DE:C1:4B,40:32:5D:4E:C3:A1,80:04:15:73:1F:D9,85:B2:15:B3:04:69,34:04:13:AA:4A:EC,4D:CB:0F:6B:3F:71,12:2A:21:13:25:D8"
  }
}
At first glance, this onboarded data looks great: the events break cleanly, the timestamp is explicit, and because the event is well-formed JSON, the fields extract without extra work.
And now the extra detail from our stakeholder:
“These events are from our router. The device field at the top describes the router itself, and then the rawAttributes describes all of the downstream devices (ClientMac_split) that connect to the router and their respective performance values like transmit, receive, and signal to noise values. We want to be able to report on these individual downstream devices and associate those individual devices with the router that serviced them as well as investigate the metrics over time. We use this data to triage customer complaints and over time, improve the resiliency of our network.”
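To make that requirement concrete, here is a minimal Python sketch of the fan-out the stakeholder is describing — not a Splunk API, just an illustration. The field names come from the sample event above, abbreviated to two clients for readability; note that each *_split field is a comma-separated list whose positions line up by client.

```python
import json

# Sample router event, abbreviated to two downstream clients.
event = {
    "device": {"deviceId": "127334527887", "deviceType": "IPGateway"},
    "timestamp": 1489095004000,
    "rawAttributes": {
        "WIFI_TX_2_split": "325,650",
        "WIFI_RX_2_split": "123,459",
        "WIFI_SNR_2_split": "32, 18",
        "ClientMac_split": "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53",
    },
}

def split_event(evt):
    """Fan one router event out into one record per downstream client."""
    attrs = evt["rawAttributes"]
    # The *_split fields are index-aligned: position N in each list
    # describes the same downstream client.
    macs = [v.strip() for v in attrs["ClientMac_split"].split(",")]
    tx = [int(v) for v in attrs["WIFI_TX_2_split"].split(",")]
    rx = [int(v) for v in attrs["WIFI_RX_2_split"].split(",")]
    snr = [int(v) for v in attrs["WIFI_SNR_2_split"].split(",")]
    for mac, t, r, s in zip(macs, tx, rx, snr):
        # Carry the router's identity onto every client record so the
        # downstream device can be associated with the router serving it.
        yield {
            "timestamp": evt["timestamp"],
            "routerId": evt["device"]["deviceId"],
            "clientMac": mac,
            "wifi_tx": t,
            "wifi_rx": r,
            "wifi_snr": s,
        }

records = list(split_event(event))
print(json.dumps(records[0], indent=2))
```

One router event with nine clients would become nine records, each independently reportable and each still tied back to its router.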
This context helps us make some key decisions: each incoming event should be split into one event per downstream client device; each of those events should carry the router’s identifying fields so a client can be tied back to the router that serviced it; and the transmit, receive, and signal-to-noise values are best treated as metrics so they can be investigated over time.
While some of this can be done with traditional props and transforms, either on a heavy forwarder or on the indexers themselves, there is (I think) a better way to address these requirements. Stream processing — the Data Stream Processor (DSP) for on-prem deployments, or the Stream Processor Service (SPS) on Splunk Cloud — offers us the ability to author powerful data pipelines to solve these complex data processing challenges.
With stream processing we can use the familiar search processing language to apply the needed transformations in the stream, before the data is indexed. This removes complexity from the data, reduces search-time and index-time resource consumption, and improves data quality.
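If the split-out client records are destined for a metrics index, each one can also be reshaped in the stream into Splunk’s HEC metrics JSON format, where measurements become metric_name:-prefixed fields and everything else rides along as dimensions. A quick sketch of that reshaping — the record layout is the hypothetical one from the fan-out above, and the metric names are illustrative:

```python
import json

def to_hec_metric(record):
    """Reshape one per-client record into an HEC metrics-format payload.

    Measurements become metric_name:* fields; the router and client
    identifiers remain as dimensions on every data point.
    """
    return {
        "time": record["timestamp"] / 1000,  # epoch millis -> seconds
        "event": "metric",
        "fields": {
            "metric_name:wifi.tx": record["wifi_tx"],
            "metric_name:wifi.rx": record["wifi_rx"],
            "metric_name:wifi.snr": record["wifi_snr"],
            "routerId": record["routerId"],
            "clientMac": record["clientMac"],
        },
    }

record = {
    "timestamp": 1489095004000,
    "routerId": "127334527887",
    "clientMac": "BD:A2:C9:CB:AC:F3",
    "wifi_tx": 325,
    "wifi_rx": 123,
    "wifi_snr": 32,
}
payload = to_hec_metric(record)
print(json.dumps(payload))
```

Storing these values as metrics rather than raw events is what makes the stakeholder’s “investigate the metrics over time” requirement cheap at search time.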
You can learn more about Splunk stream processing in the Splunk documentation. Then follow me over to Splunk Lantern for the step-by-step walkthrough of the pipeline I created to address this fun challenge of aligning the incoming data to business value.
So what do you think? Have you had similar data challenges? Let me know below in the comments, as I’d love to hear about your use cases!
— Nick Zambo, Platform Architect