How to use a summary index and collect command to pull large data by month?
Hi all. I'm fairly new to Splunk. I have data by day: the response time for each API call. I want to run a search automatically every day, collecting the results into a summary index (I cannot run this by month since it is too much data). Then, every month, I want to use the summary index to calculate the 95th percentile, average, and standard deviation of the response times for each API call. The summary index should allow me to do that faster, although I am not sure of the mechanics of how to use it.
- For instance, do I need to re-add my filters for the monthly pull?
- Does the SPL below look correct so far to pull in all the information (events)?
So, I want to understand if I am doing this correctly. I have the below SPL by day:
index=virt [other search parameters]
| rename msg.sessionId as sessionId
| rename msg.apiName as apiName
| rename msg.processingTime as processingTime
| rename msg.responseCode as responseCode
| eval session_id= coalesce(a_session_id, sessionId)
| fields …
| stats values(a_api_responsetime) as responsetime, values(processingTime) as BackRT by session_id
| eval PlatformProcessingTime = (responsetime - BackRT) | where PlatformProcessingTime>0
| collect index=virt_summary
Then I have the below SPL by month:
index=virt_summary
| bucket _time span=1mon
| stats count as Events, avg(PlatformProcessingTime), stdev(PlatformProcessingTime), perc95(PlatformProcessingTime) by _time
Any assistance is much appreciated! Let me know if you need more clarification. I have attached the results, and it looks like it is not working properly. I tested the results by day.

When you say you want stats for each API call, is that per distinct apiName? Currently you are collecting based on session_id and do not have apiName in the collected results.
Is there only a SINGLE request (response_time/processing_time) per session_id? If not, then your event counts, percentiles, and averages may be wrong, as values() will remove duplicates; so if you have 10 requests each taking 16ms and 1,000 taking 160ms, your averages and percentiles will be completely off.
If you have 10 requests for a single session_id, then the monthly summary is only counting 1 for that session, not 10.
Another thing to consider is the error status of the request. Depending on the response codes of your API, some calls may return codes like 404 for a valid request for which there was simply no relevant data, so having the response code available to differentiate the performance times can be useful. Typically, error states return faster than non-error states, so they can corrupt the metrics.
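To make the per-API breakdown possible, apiName would need to be part of the collected events. A minimal sketch of the change, reusing the field names from the original search (group by apiName as well as session_id):

```
| stats values(a_api_responsetime) as responsetime, values(processingTime) as BackRT by session_id apiName
```

With apiName carried into the summary index, the monthly search could then add it to its own by clause.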
Thank you. Apologies, I do not want the stats broken down by apiName, so I can remove that line. I want stats across all API calls.
Yes, there is only a single request (response_time/processing_time) per session id.
Currently the query is only looking at 2xx responses. So I will definitely consider another query for errors.
Do you know if the mechanics of the collect command and the summary index are properly set up?

As additional info to @ITWhisperer's comments about time, see this post for info on how time can be set in collected data. It does not behave as written in the documentation, which is buggy:
https://community.splunk.com/t5/Getting-Data-In/How-to-log-results-to-an-index/m-p/653651#M110866

Possibly not. In your first search, the stats command does not record the time (_time) of the event (e.g. earliest or latest), so when the result gets added to the summary index, the _time value you retrieve in your second search probably isn't the time of the events; it is more likely to be either the time the search was run, or possibly responseTime. I suggest you check the data in your summary index to ensure the _time value is the one you are expecting. That said, this doesn't really explain why you are getting an event count but no avg or stdev.
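One way to preserve event time, sketched against the search from the question (and assuming a single request per session_id, as stated above), is to carry _time through the stats and into the summary index:

```
| stats latest(_time) as _time, values(a_api_responsetime) as responsetime, values(processingTime) as BackRT by session_id
| eval PlatformProcessingTime = responsetime - BackRT
| where PlatformProcessingTime > 0
| collect index=virt_summary
```

With _time present in each collected event, the monthly bucket/stats should then group by the actual event time rather than the search time.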

Presumably, this line
| stats values(a_api_responsetime) as responsetime, values(processingTime) as BackRT by session_id
gives you two single-value numerics (i.e. not multi-values and not strings) for each unique session_id; if so, it looks like it should work.
However, you could try using max() instead of values().
Also, you could try
| where PlatformProcessingTime >= 0
as, depending on the resolution of your time values, PlatformProcessingTime could be zero.
Also, with millions of events per day, you might want to consider running the collect search over shorter periods of time to reduce the number of events being added to the summary index each time.
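Putting those suggestions together, a hedged sketch of the collecting search run on a shorter schedule (here hourly over the previous hour, with field names assumed from the original post):

```
index=virt [other search parameters] earliest=-1h@h latest=@h
| rename msg.processingTime as processingTime
| eval session_id = coalesce(a_session_id, sessionId)
| stats max(a_api_responsetime) as responsetime, max(processingTime) as BackRT, latest(_time) as _time by session_id
| eval PlatformProcessingTime = responsetime - BackRT
| where PlatformProcessingTime >= 0
| collect index=virt_summary
```

Scheduling this hourly keeps each collect run small, and since only one request exists per session_id, max() here behaves the same as values() but guarantees a single numeric value.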
