Solved: Combining Fields into a Table

TylerJVitale · ‎07-17-2019

I want to combine the data from a prediction algorithm on crashed applications with additional data about the crashed applications, namely the count of affected users. Right now, index=index sourcetype="sourcetype" APPLICATION="app*"| timechart span=1d sum(VOLUME) | predict "sum(VOLUME)" as prediction algorithm="LLP5" future_timespan="30" holdback="0" period=7 lower"95"=lower"95" upper"95"=upper"95" | eval isOutlier = if(prediction!="" AND 'sum(VOLUME)' !="" AND ('sum(VOLUME)' < 'lower95(prediction)' OR 'sum(VOLUME)' > 'upper95(prediction)'), 1, 0) | where isOutlier=1 | fields - isOutlier|rename sum(VOLUME) as "Actual Crashes", prediction as "Predicted Crashes" gives me what I want for the prediction data,

and index=index sourcetype="sourcetype" APPLICATION="app*"| timechart span=1d dc(USERNAME) as "Users"gives me the count of users,

but index=index sourcetype="sourcetype" APPLICATION="app*"| timechart span=1d sum(VOLUME) | predict "sum(VOLUME)" as prediction algorithm="LLP5" future_timespan="30" holdback="0" period=7 lower"95"=lower"95" upper"95"=upper"95" | eval isOutlier = if(prediction!="" AND 'sum(VOLUME)' !="" AND ('sum(VOLUME)' < 'lower95(prediction)' OR 'sum(VOLUME)' > 'upper95(prediction)'), 1, 0) | where isOutlier=1 | fields - isOutlier|rename sum(VOLUME) as "Actual Crashes", prediction as "Predicted Crashes"|join[search index=index sourcetype="index" APPLICATION="app*"| timechart span=1d dc(USERNAME) as "Users"] doesn't give me the correct count of users in the table.

Any ideas on how to combine these searches properly?

rbechtold · ‎07-17-2019

Hey Tyler!

I've recreated what I believe your dataset may look like (copy and paste into any Splunk instance):

| gentimes start=6/18/2019 end=7/18/2019 increment=10m 
| eval modifier = random()%5 
| eval VOLUME = random()%400 + 100 
| eval VOLUME = if(modifier = 1 OR modifier = 2, VOLUME * -1, VOLUME) 
| eval _time = starttime 
| timechart span=1d sum(VOLUME) AS sumv 
| predict sumv as prediction algorithm="LLP5" future_timespan=30 holdback=0 period=7 lower95=lower95 upper95=upper95 
| streamstats count as test 
| eval sumv = if(test = 20 OR test = 13, sumv + 20000, if(test = 4 OR test = 25, sumv - 24000 , sumv)) 
| fields - test 
| eval isOutlier = if(prediction!="" AND 'sumv' !="" AND ('sumv' < 'lower95(prediction)' OR 'sumv' > 'upper95(prediction)'), 1, 0) 
| where isOutlier=1 
| fields - isOutlier 
| rename sumv as "Actual Crashes", prediction as "Predicted Crashes"
| join [| gentimes start=6/18/2019 end=7/18/2019 increment=1d
| eval USERNAME = random()%200
| eval _time = starttime
| timechart span=1d values(USERNAME) AS "Users"]

If this is correct, I think I understand your problem.

Without specifying a field to join on or a join type, the automatically defaults to an inner join and joins on any fields present in both datasets. For some reason, I don't believe the join command is using the "_time" field as default field to look at when joining the two searches. My best guess is that maybe Splunk doesn't automatically look at internal fields when using the join command.

In order to get around this, we can explicitly state that we want to join on the _time field like this:

index=index sourcetype="sourcetype" APPLICATION="app*" 
| timechart span=1d sum(VOLUME) 
| predict "sum(VOLUME)" as prediction algorithm="LLP5" future_timespan="30" holdback="0" period=7 lower"95"=lower"95" upper"95"=upper"95" 
| eval isOutlier = if(prediction!="" AND 'sum(VOLUME)' !="" AND ('sum(VOLUME)' < 'lower95(prediction)' OR 'sum(VOLUME)' > 'upper95(prediction)'), 1, 0) 
| where isOutlier=1 
| fields - isOutlier 
| rename sum(VOLUME) as "Actual Crashes", prediction as "Predicted Crashes" 
| join _time
    [ search index=index sourcetype="index" APPLICATION="app*" 
    | timechart span=1d dc(USERNAME) as "Users"]

Although I would strongly recommend adding a join type= argument to your joins in the future as they can greatly effect your results.

If you would like to know more about join types, check out the documentation on the join command here under the type subheading:
https://docs.splunk.com/Documentation/Splunk/7.3.0/SearchReference/Join

Let me know if you're still having any problems!

View solution in original post

rbechtold · ‎07-17-2019

Hey Tyler!

I've recreated what I believe your dataset may look like (copy and paste into any Splunk instance):

| gentimes start=6/18/2019 end=7/18/2019 increment=10m 
| eval modifier = random()%5 
| eval VOLUME = random()%400 + 100 
| eval VOLUME = if(modifier = 1 OR modifier = 2, VOLUME * -1, VOLUME) 
| eval _time = starttime 
| timechart span=1d sum(VOLUME) AS sumv 
| predict sumv as prediction algorithm="LLP5" future_timespan=30 holdback=0 period=7 lower95=lower95 upper95=upper95 
| streamstats count as test 
| eval sumv = if(test = 20 OR test = 13, sumv + 20000, if(test = 4 OR test = 25, sumv - 24000 , sumv)) 
| fields - test 
| eval isOutlier = if(prediction!="" AND 'sumv' !="" AND ('sumv' < 'lower95(prediction)' OR 'sumv' > 'upper95(prediction)'), 1, 0) 
| where isOutlier=1 
| fields - isOutlier 
| rename sumv as "Actual Crashes", prediction as "Predicted Crashes"
| join [| gentimes start=6/18/2019 end=7/18/2019 increment=1d
| eval USERNAME = random()%200
| eval _time = starttime
| timechart span=1d values(USERNAME) AS "Users"]

If this is correct, I think I understand your problem.

Without specifying a field to join on or a join type, the automatically defaults to an inner join and joins on any fields present in both datasets. For some reason, I don't believe the join command is using the "_time" field as default field to look at when joining the two searches. My best guess is that maybe Splunk doesn't automatically look at internal fields when using the join command.

In order to get around this, we can explicitly state that we want to join on the _time field like this:

index=index sourcetype="sourcetype" APPLICATION="app*" 
| timechart span=1d sum(VOLUME) 
| predict "sum(VOLUME)" as prediction algorithm="LLP5" future_timespan="30" holdback="0" period=7 lower"95"=lower"95" upper"95"=upper"95" 
| eval isOutlier = if(prediction!="" AND 'sum(VOLUME)' !="" AND ('sum(VOLUME)' < 'lower95(prediction)' OR 'sum(VOLUME)' > 'upper95(prediction)'), 1, 0) 
| where isOutlier=1 
| fields - isOutlier 
| rename sum(VOLUME) as "Actual Crashes", prediction as "Predicted Crashes" 
| join _time
    [ search index=index sourcetype="index" APPLICATION="app*" 
    | timechart span=1d dc(USERNAME) as "Users"]

Although I would strongly recommend adding a join type= argument to your joins in the future as they can greatly effect your results.

If you would like to know more about join types, check out the documentation on the join command here under the type subheading:
https://docs.splunk.com/Documentation/Splunk/7.3.0/SearchReference/Join

Let me know if you're still having any problems!

Combining Fields into a Table

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Think Like an Architect: Introducing the Splunk Certified Cybersecurity Defense ...

Index This | What has goals but no motivation?

Deep Dive: Accelerate threat investigation with Splunk’s AI Assistant in Security

Join the Conversation