Given log data, can I calculate accuracy and outpu...

ruhtraeel · ‎10-10-2019

Hello,
My events look like this:

2019-10-10T17:51:40+00:00 action="updateDate->saveDatesFromDataMining",
0={"urlupdateid":1371955,"datetype":"Review date","datevalue":"10/03/2019"}, 
1={"urlupdateid":1371955,"datetype":"somethingElse","datevalue":"10/03/2020"}, 
2={"urlupdateid":1371955,"datetype":"False datetype","datevalue":"10/03/2019"},` approved=true
host = stuff.localsource = PCDAccuracy.txtsourcetype = Text

2019-10-10T17:51:40+00:00 action="updateDate->saveDates", 
0={"urlupdateid":1371955,"datetype":"Review date","datevalue":"10/03/2019"}, 
1={"urlupdateid":1371955,"datetype":"Next review date (est.)","datevalue":"10/03/2020"}, approved=true
host = stuff.localsource = PCDAccuracy.txtsourcetype = Text

My question is, can I create a query that looks at every 0, 1, 2... N field in the event where action="updateDate->saveDatesFromDataMining" and checks to see if there is a corresponding entry in the most recent event where action="updateDate->saveDates" that has a matching datetype and datevalue, and then calculates the overall accuracy?

So in the example above, accuracy would be 1/3 (0th entry is correct, 1st entry is incorrect, 2nd entry is extra and therefore incorrect)

Accuracy = # of correctly extracted dates where action = "updateDate->saveDatesFromDataMining" / (# of dates where action = "updateDate->saveDates") UNION (# of extra dates where action = "updateDate->saveDatesFromDataMining")

woodcock · ‎10-13-2019

Like this:

| makeresults 
| eval _raw="2019-10-10T17:51:40+00:00 action=\"updateDate->saveDatesFromDataMining\",
0={\"urlupdateid\":1371955,\"datetype\":\"Review date\",\"datevalue\":\"10/03/2019\"}, 
1={\"urlupdateid\":1371955,\"datetype\":\"somethingElse\",\"datevalue\":\"10/03/2020\"}, 
2={\"urlupdateid\":1371955,\"datetype\":\"False datetype\",\"datevalue\":\"10/03/2019\"}, approved=true
host = stuff.local, source = PCDAccuracy.txt, sourcetype = Text"
| append [
| makeresults
| eval _raw="
2019-10-10T17:51:40+00:00 action=\"updateDate->saveDates\", 
0={\"urlupdateid\":1371955,\"datetype\":\"Review date\",\"datevalue\":\"10/03/2019\"}, 
1={\"urlupdateid\":1371955,\"datetype\":\"Next review date (est.)\",\"datevalue\":\"10/03/2020\"}, approved=true
host = stuff.local, source = PCDAccuracy.txt, sourcetype = Text"]

| rename COMMENT AS "Everything above generates sample events; everything below is your solution"

| rex max_match=0 "(?ms)^\d+=(?<raw>\{[^\}]+\})"
| kv
| mvexpand raw
| streamstats count AS which BY _raw
| eval which = which - 1
| spath input=raw
| fields - _* raw
| stats count dc(action) AS actions dc(datetype) AS dates BY urlupdateid which
| eval state=if((actions==2 AND dates==1), "1", "0")
| stats count(eval(state==1)) AS valid count AS total BY urlupdateid
| eval accuracy = round(valid/total,2)

ololdach · ‎10-11-2019

Hi,
this query should give you an idea, on how this can be done:

| makeresults | eval _raw="2019-10-10T17:51:40+00:00 action=\"updateDate->saveDatesFromDataMining\", 0={\"urlupdateid\":1371955,\"datetype\":\"Review date\",\"datevalue\":\"10/03/2019\"},1={\"urlupdateid\":1371955,\"datetype\":\"somethingElse\",\"datevalue\":\"10/03/2020\"},2={\"urlupdateid\":1371955,\"datetype\":\"False datetype\",\"datevalue\":\"10/03/2019\"}`, approved=true host=stuff.localsource PCDAccuracy.txtsourcetype=Text" 
| rex field=_raw "action=\"(?<action>[^\"]+)\"" | rex field=_raw "(?<values>\d+\=\{.*)`" 
| eval values=replace(values,"},","}|") 
| makemv delim="|" values | mvexpand values 
| rex field=values "urlupdateid\":(?<urlupdateid>[^,]+),.*:\"(?<datetype>[^\"]+)\".*:\"(?<datevalue>.*)\"" 
| fields - _raw,values

Let me explain:
1. The messy part on top is your event data. In order to reproduce it in search, I assign it to a field called _raw. In your events, this field is already present and filled with the event data.
2. The line with the two rex commands, one for action and one for values extracts two fields from your events. Cut & Paste from the start including only this line to take a look at the content. You should extract those fields in your sourcetype. Doing it with a rex at search time is somewhat ok, the sourcetype is better.
3. Now the fun part starts: Values contains a list of results. Since the list is dynamic, it can hold an arbitrary number of results, one per datetype. A static mapping is useless. Therefore we trick the machine by making the individual datetype values clearly indentifyable. That's the replace for: We replace the , that separates two datatype records by | in order to distinguish it from the other commas around.
4. Now we make the values field a multi-value field based on the |that we introduced and we split the event into multiple events. The result is, that for every datetype in the original single event, we now have one event.
5. Almost done: We extract the relevant fields from the single datetype values field
6. With the last fields command we remove the unwanted noise from the result

This was the heavy lifting. From here on, you can do your further statistics and analysis:
- Choose the timeframe for the query, so that it has 3 results ( 3 days, if you want 3 results for every datetype and get them daily)
- Do a transaction on the urlupdateid field, this will group all events based on the urlupdateid into one. You may use the startswith and endswith parameter to include the action field
- Analyse the datetype and datevalue fields to calculate the Accuracy

Hope it helps
Oliver

ruhtraeel · ‎10-11-2019

Thank you for the reply. Two questions:
1: How would I eval _raw to be the result of a search, rather than the hardcoded fake data? Doing this gives me the error Error in 'eval' command: Failed to parse the provided arguments. Usage: eval dest_key = expression:

| eval _raw=[ search saveDates | return $result ] | rex field=_raw "action=\"(?<action>[^\"]+)\"" | rex field=_raw "(?<values>\d+\=\{.*)`" 
 | eval values=replace(values,"},","}|") 
 | makemv delim="|" values | mvexpand values 
 | rex field=values "urlupdateid\":(?<urlupdateid>[^,]+),.*:\"(?<datetype>[^\"]+)\".*:\"(?<datevalue>.*)\"" 
 | fields - _raw,values

2: Can the accuracy calculation be done in the search string to be shown on a dashboard? I'm beginning to think that the accuracy calculation should be done in our backend, and the result should be returned for Splunk, as these data calculations and transformations don't seem very natural to do in Splunk.

ololdach · ‎10-13-2019

To answer 1: _raw is already set by your own search that selects the events. Just omit the first line of the query and you are good to go. Your own search would likely look like this index=... | ...<the query starting with line 2 from above>
To answer 2: Yes, it can be done and it is quite easy from where I left off or taken the answer from woodcock below. The challenge is to think about your problems in a splunky way. There are usually many solutions, all deliver the same results and all having their pro's and con's. Whatever it is, though, Splunk always works on sets of events that have something in common and works either by doing stats on some values inside those events or some statistical analysis on the common characteristics like: Of the n events that show failures, please list the type of failure in percent by failure type/category. My approach above would be the latter: combine all results together and get the number of fails in comparison to the total number.

Given log data, can I calculate accuracy and output it in a dashboard?

AppDynamics Summer Webinars

SOCin’ it to you at Splunk University

Credit Card Data Protection & PCI Compliance with Splunk Edge Processor

Are you a member of the Splunk Community?

Given log data, can I calculate accuracy and output it in a dashboard?

AppDynamics Summer Webinars

SOCin’ it to you at Splunk University

Credit Card Data Protection & PCI Compliance with Splunk Edge Processor