Getting Data In

Complex data manipulation before indexing

eregon
Path Finder

Dear fellow Splunkthusiasts, is there a way to put my own data-manipulating script between the forwarder and the indexer?

To be specific: I have XML logs from SmartMeter/jMeter looking like this:

<?xml version="1.0" encoding="UTF-8"?>
<testResults version="1.2">
<httpSample t="86" it="0" lt="37" ts="1553000000000" s="true" lb="openLoginPage" rc="200" rm="OK #subresults:3" tn="10.20.30.40:1234_TestCases 1-2" dt="text" de="ISO-8859-1" by="3999" sc="1" ec="0" ng="2" na="2" hn="sm-generator2">
 <httpSample t="37" it="0" lt="37" ts="1553000000000" s="true" lb="openLoginPage-0" rc="200" rm="OK" tn="10.20.30.40:1234_TestCases 1-2" dt="text" de="ISO-8859-1" by="1578" sc="1" ec="0" ng="2" na="2" hn="sm-generator2">
    <responseHeader class="java.lang.String"></responseHeader>
    <requestHeader class="java.lang.String"></requestHeader>
    <responseData class="java.lang.String"></responseData>
    <responseFile class="java.lang.String"></responseFile>
    <cookies class="java.lang.String"></cookies>
    <method class="java.lang.String">GET</method>
    <queryString class="java.lang.String"></queryString>
    <java.net.URL>https://some.host/path/</java.net.URL>
  </httpSample>
  <httpSample t="17" it="0" lt="17" ts="1553000000001" s="true" lb="openLoginPage-1" rc="200" rm="OK" tn="10.20.30.40:1234_TestCases 1-2" dt="text" de="" by="578" sc="1" ec="0" ng="2" na="2" hn="sm-generator2">
    <responseHeader class="java.lang.String"></responseHeader>
    <requestHeader class="java.lang.String"></requestHeader>
    <responseData class="java.lang.String"></responseData>
    <responseFile class="java.lang.String"></responseFile>
    <cookies class="java.lang.String">some_cookie_name=some_cookie_value</cookies>
    <method class="java.lang.String">GET</method>
    <queryString class="java.lang.String"></queryString>
    <java.net.URL>https://some.host/path/</java.net.URL>
  </httpSample>
</httpSample>
...

That is way too verbose for my needs, so I wrote a script transforming the XML to the following:

httpSession sessionId="123" t="86" it="0" lt="37" ts="1553000000000" s="true" lb="openLoginPage" rc="200" rm="OK #subresults:3" tn="10.20.30.40:1234_TestCases 1-2" dt="text" de="ISO-8859-1" by="3999" sc="1" ec="0" ng="2" na="2" hn="sm-generator2"
httpRequest sessionId="123" t="37" it="0" lt="37" ts="1553000000000" s="true" lb="openLoginPage-0" rc="200" rm="OK" tn="10.20.30.40:1234_TestCases 1-2" dt="text" de="ISO-8859-1" by="1578" sc="1" ec="0" ng="2" na="2" hn="sm-generator2" method="GET" url="https://some.host/path/"
httpRequest sessionId="123" t="17" it="0" lt="17" ts="1553000000001" s="true" lb="openLoginPage-1" rc="200" rm="OK" tn="10.20.30.40:1234_TestCases 1-2" dt="text" de="" by="578" sc="1" ec="0" ng="2" na="2" hn="sm-generator2" cookies="some_cookie_name=some_cookie_value" method="GET" url="https://some.host/path/"

Please note that the output is enriched with a sessionId field that links each request to its session, which can't be done with sed alone.
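
For readers following along, here is a minimal Python sketch of this kind of transformation. It is not the script described above (that script is not shown in this thread). The function names, the choice of child elements to keep (`cookies`, `method`, `java.net.URL`), and the use of a simple counter for `sessionId` are all assumptions inferred from the sample output. It also parses a complete document in one go rather than working as a stream, and it does not escape quotes inside values.

```python
import xml.etree.ElementTree as ET

# Child elements copied onto the request line, mapped to output field names.
# This selection is an assumption based on the sample output in the post.
CHILD_FIELDS = {"cookies": "cookies", "method": "method", "java.net.URL": "url"}


def format_fields(pairs):
    """Render (name, value) pairs as space-separated name="value" tokens."""
    return " ".join('{}="{}"'.format(name, value) for name, value in pairs)


def flatten(xml_text, first_session_id=1):
    """Return one httpSession line per top-level sample and one
    httpRequest line per nested sample, all tagged with a sessionId."""
    root = ET.fromstring(xml_text)
    lines = []
    sessions = root.findall("httpSample")
    for session_id, session in enumerate(sessions, start=first_session_id):
        # Hypothetical ID scheme: a running counter, one value per session.
        sid = [("sessionId", str(session_id))]
        attrs = list(session.attrib.items())
        lines.append("httpSession " + format_fields(sid + attrs))
        for request in session.findall("httpSample"):
            # Keep only the mapped child elements that carry non-empty text.
            extra = [
                (CHILD_FIELDS[child.tag], child.text.strip())
                for child in request
                if child.tag in CHILD_FIELDS and child.text and child.text.strip()
            ]
            attrs = list(request.attrib.items())
            lines.append("httpRequest " + format_fields(sid + attrs + extra))
    return lines


# Shortened stand-in for the log above (fewer attributes, one request).
SAMPLE = """<testResults version="1.2">
<httpSample t="86" lb="openLoginPage" rc="200">
  <httpSample t="37" lb="openLoginPage-0" rc="200">
    <cookies class="java.lang.String"></cookies>
    <method class="java.lang.String">GET</method>
    <java.net.URL>https://some.host/path/</java.net.URL>
  </httpSample>
</httpSample>
</testResults>"""

for line in flatten(SAMPLE, first_session_id=123):
    print(line)
```

The point of the sketch is that the parent's identity has to be remembered while its children are emitted, which is exactly the state a line-by-line substitution does not keep.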

I would like to collect the original log in XML format with a universal forwarder, have it processed by my script (possibly on a HFW?), and finally index the simplified output. Is something like that possible?

Scripted inputs are not exactly what I am looking for, as this method would introduce data lag and require a way to avoid re-reading the same data (both are already solved by the monitor:// input method).

0 Karma

DavidHourani
Super Champion

Hi @eregon,

You can run the script you made as a scripted input. Whatever the script writes to standard output goes straight into Splunk.

It's quite straightforward: add the script to the bin folder of an app, then create the input that goes with it.

You can find the details here:
https://docs.splunk.com/Documentation/Splunk/7.2.6/AdvancedDev/ScriptSetup

Let me know if that helps and if you need further help.

Cheers,
David

0 Karma

eregon
Path Finder

Hi @DavidHourani, thanks for your advice! I did read about scripted inputs, and unfortunately they are not what I am looking for (as mentioned at the end of my question). In my specific case I see these disadvantages:

  • it introduces lag (data is ingested once per interval), whereas the monitor:// method reacts virtually immediately
  • it introduces unnecessary load (the script is executed periodically even when no perftests are running and no data is produced)
  • it requires additional measures to prevent re-reading the same data: the source simply appends new data to the end of the existing log file and my script works in a streaming manner, so running the current script periodically would read the whole file over and over again; I would have to implement some kind of file pointer similar to what the monitor:// method already does, or try to tweak SmartMeter's logging behaviour
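
To illustrate what such a file pointer could involve, here is a minimal Python sketch that stores a byte offset in a side file and returns only the data appended since the previous run. The function name and the checkpoint file are hypothetical, and this is not how monitor:// tracks files internally; it only shows the general idea.

```python
import os


def read_new_data(log_path, checkpoint_path):
    """Return the bytes appended to log_path since the last call
    and persist the new offset in checkpoint_path."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            offset = int(fh.read().strip() or 0)

    # A file shorter than the stored offset was truncated or replaced,
    # so start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()

    # Consume only up to the last complete line; a partially written
    # line is left in place and picked up on the next run.
    data = data[: data.rfind(b"\n") + 1]

    with open(checkpoint_path, "w") as fh:
        fh.write(str(offset + len(data)))
    return data
```

Even with this in place, a session's opening httpSample tag and its nested requests may arrive in different runs, so the transformation would also need to carry its current session state from one run to the next.
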
0 Karma

eregon
Path Finder

The closest Splunk feature I could find is the SEDCMD option in props.conf. It could possibly solve my problem, if I am able to read a value from the parent-level httpSample tag and then insert it into the subsequent lines.

0 Karma

DavidHourani
Super Champion

It does add latency and isn't as efficient as monitor, you're right.

There are three points at which you can apply that data cleansing:
1- Before reading the data: the easiest way. Run your script on your data, store the results in files, then read those files directly with your UF.
2- On read: scripted inputs.
3- After reading, at index time: SEDCMD could be an option, but you will need to write a complex sed command to get the cleansing done. It might have a similar impact to scripted inputs.

0 Karma