I have a simple web service with a request and response called DeliverySchedule. The request and response have a unique identifier called a GUID which are in pairs. I'm trying to find the duration (response time) between the response and request.
Here's my search which is returning some correct durations but also listing 0 for more than 400 events . I'm also missing a lot of events
index=unleashed "deliveryschedule" | transaction GUID startswith="Request" endswith="Reply" | timechart avg(duration)
GUID = "33efb817-1948-4c8f-bdf4-4111aa1941cc" in this case. We will only have 2 matching GUID's which are attached to the req/resp. They will be different for each req/resp. I also did a field extraction called 'GUID' to capture the GUID between the pipes..
REQUEST
DEBUG 2015-05-13 15:31:40,590
fterReceiveRequest - Request Record : |33efb817-1948-4c8f-bdf4-4111aa1941cc |
<GetDeliverySchedule xmlns="http://tempuri.org/">
<request>
<DeliveryType /DotCom_Delivery">A</DeliveryType>
<EndDate/DotCom_Delivery">2015-05-15</EndDate>
<Region_StoreNo/DotCom_Delivery">970</Region_StoreNo>
<Region_zip/DotCom_Delivery">11111</Region_zip>
<StartDate/DotCom_Delivery">2015-05-13</StartDate>
</request>
</GetDeliverySchedule>
RESPONSE
DEBUG 2015-05-13 15:31:41,276 BeforeSendReply - Response Record : |33efb817-1948-4c8f-bdf4-4111aa1941cc |
<GetDeliveryScheduleResponse xmlns="http://tempuri.org/">
<GetDeliveryScheduleResult xmlns:a="http://schemas.datacontract.org/2004/07/DotCom_Delivery" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:DeliveryCalendar>
<a:Ranges>
<a:DeliveryRange i:type="a:DeliveryDate">
<a:IsAvailableForDelivery>false</a:IsAvailableForDelivery>
<a:Date>2015-05-13T00:00:00-04:00</a:Date>
</a:DeliveryRange>
<a:DeliveryRange i:type="a:DeliveryDate">
<a:IsAvailableForDelivery>false</a:IsAvailableForDelivery>
<a:Date>2015-05-14T00:00:00-04:00</a:Date>
</a:DeliveryRange>
</a:Ranges>
<a:TypeOfCalendar>DayType</a:TypeOfCalendar>
<a:ZoneId>110</a:ZoneId>
</a:DeliveryCalendar>
<a:StatusCode>200</a:StatusCode>
<a:StatusMessage>OK</a:StatusMessage>
</GetDeliveryScheduleResult>
</GetDeliveryScheduleResponse>
Transaction does a ton of things that you don't appear to need. Have you tried something like this:
index=unleashed "deliveryschedule" | stats earliest(_time) AS firstTime latest(_time) AS lastTime by GUID | eval duration=lastTime-firstTime | timechart avg(duration)
Transaction does a ton of things that you don't appear to need. Have you tried something like this:
index=unleashed "deliveryschedule" | stats earliest(_time) AS firstTime latest(_time) AS lastTime by GUID | eval duration=lastTime-firstTime | timechart avg(duration)
Additionally, consider specifying TIME_FORMAT
, TIME_PREFIX
, and MAX_TIMESTAMP_LOOKAHEAD
to mitigate the risk of detecting timestamps within the XML as new events.
I went ahead and added TIME_FORMAT = %s to my props.conf and it almost worked!
My events are getting split from the request and response but not the reqyest field does not have a timestamp (must have gone somewhere else) and the request has the first line (that includes the timestamp) of the response below it
Here are the configurations for timestamping that will work for your events (get rid of everything else related to timestamps and linebreaking):
Inside props.conf
:
TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
SHOULD_LINEMERGE=true
No luck with that. Do you think I need to adjust my LINE_BREAK?
Try adding this:
BREAK_ONLY_BEFORE = ^(?:\w+\s+)?\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}
Works exactly as expected! Thank you so much! You are the man with Splunk
Remove the LINE_BREAKER
and let Splunk break on timestamps instead (default behaviour)?
I Just tried this and it cut off about 30% of my request and 90% of my response
If your events are large you may also need to increase MAX_EVENTS
and TRUNCATE
.
Without a full sample of the actual log file rather than a few lines it's impossible to provide actual values for these settings.
That certainly sounds like an event breaking issue. Do post the props.conf settings for that sourcetype along with a longer sample of the source file before splunking it (pastebin.com?).
Looks like answers.splunk isn't letting me embed the images so I'll post the links here
The Unleashed_Message_Log is the sourcetype. I also have the following for that sourcetype
TRUNCATE = 20000
MAX_EVENTS = 20000
XML
Props.conf
count=1
suggests only one event was found for that GUID
, so the duration must be zero.
Make sure both events are in the searched time range, and that both events have the correct GUID value extracted - including case and possible spaces at both ends.
I just looked up a GUID in the event which was coming back as duration=0 and it's in 2 events both happened today within a half second of each other.
I compared a GUID which has duration=0 to a GUID which has a valid duration.
Correct Duration
This has the request and response in separate events
Duration = 0
This has the request and response in the same event
Do I have to modify my props.conf file so the line breaks for the response so I have separate events for the request and response?
Yes, this is definitely a problem that you need to fix if you are going to anything like what we have been discussing.
Why would some of the events have the request and response in the same event while others have ONLY a request or response in the event?
Do you have any suggestions as to how I should configure my regex for LINE_BREAKER so only 1 request or response is present for each event?
Here's what I currently have in my props.conf file.
[Unleashed_Message_Log]
LINE_BREAKER = |([\r\n]+)
There's a backslash before the first pipe, before r and before n. It looks like answers.splunk didnt include it in my response when I submitted it
Note: This would affect all methods of calculating the duration, whether based on transaction
, stats
, or anything else.
Include a count
in the stats to see if both Request and Response are actually counted. If you get a count of 1, see if the GUID has been extracted correctly. If you get a count of 2, re-check those extracted timestamps.
I just added a count after stats and count = 1 for all duration=0.
And count = 2 for all durations that equal the real difference in timestamps. So this has to be an issue with my timestamp then?