I'm not sure if I can get any help here, but I am going to try cause I've been wrestling with this search/data for a week now.
The setup:
I have log files that have fields and I need to determine the amount of time since a field is x to when a field is y. I am using transaction and concurrency. Here is my search:
index=xxx ALPXX=* SCBXX=* abc12.prod04 | transaction host startswith="ALPXX=re-connecting" endswith="ALPXX=online" | concurrency duration=duration | eval duration=tostring(duration,"duration") | sort 0 _time | where concurrency < 2
Example logs:
[1456793400] SERVICE ALERT: abc12.prod04;xx_yy;WARNING;SOFT;1;ALPXX=re-connecting, SCBXX=online Total=1/2
[1456793700] SERVICE ALERT: abc12.prod04.dqx;xx_yy;OK;SOFT;2;ALPXX=online SCBXX=online Total=2/2
The search works great if the two logs that it's determining the duration for are close to each other, but if there are several ALPXX=re-connecting
between them, it will skip sometimes over the first instance of ALPXX=online
and go to one a few logs further along in time. (so basically, our server is reporting these at specific time intervals so when its down, it will say ALPXX=re-connecting
every 5 mins and when its up it will say ALPXX=online
every 5 mins)
Can anyone spot something in the search that may be causing me to skip a valid endswith value? Thanks for any insight.
What about this?
index=xxx ALPXX="re-connecting" SCBXX=* abc12.prod04
| dedup consecutive=true keepempty=true host sortby +_time
| append [ index=xxx ALPXX="online" SCBXX=* abc12.prod04
| dedup consecutive=true keepempty=true host ]
| sort 0 host _time
| transaction host startswith="ALPXX=re-connecting" endswith="ALPXX=online" maxevents=2
| concurrency duration=duration
| eval duration=tostring(duration,"duration")
| sort 0 _time
| where concurrency < 2
You can make the search more efficient if you do this:
index=xxx (ALPXX="re-connecting" OR ALPXX="online") SCBXX=* abc12.prod04
| transaction host startswith="ALPXX=re-connecting" endswith="ALPXX=online" unifyends= true
| concurrency duration=duration
| eval duration=tostring(duration,"duration")
| sort 0 _time
| where concurrency < 2
I did two things: first, I only searched for events that match the start/end conditions, as you are only interested in the duration. This may reduce the number of events returned from the search, which will help the transaction command work better (it is memory-intensive). I also specified "unifyends".
However, think about this: The transaction command looks at the events in reverse-time order, so it searches first for the end condition and then looks for a start condition to match. If you have multiple "starts" and "ends", how does Splunk know to match the earliest possible start with the latest possible end for a transaction?
I think that the dedup command may be needed, to reduce a series of "starts" to the earliest start, and the "ends" to the latest end. Try this:
index=xxx ALPXX="re-connecting" SCBXX=* abc12.prod04
| dedup consecutive=true keepempty=true host sortby +_time
| append [ index=xxx ALPXX="online" SCBXX=* abc12.prod04
| dedup consecutive=true keepempty=true host ]
| transaction host startswith="ALPXX=re-connecting" endswith="ALPXX=online" unifyends= true
| concurrency duration=duration
| eval duration=tostring(duration,"duration")
| sort 0 _time
| where concurrency < 2
This collapses a sequence of "re-connecting" events to the earliest one found, and a sequence of "online" events to the latest one found. Check it carefully to make sure that it does not collapse multiple transactions for a single host into a single transaction, but I think it will work.
@EricLloyd79 - what do you get if you simply run the first part of the search?
index=xxx ALPXX="re-connecting" SCBXX=* abc12.prod04
| dedup consecutive=true keepempty=true host sortby +_time
| append [ index=xxx ALPXX="online" SCBXX=* abc12.prod04
| dedup consecutive=true keepempty=true host ]
This should give you the most recent "re-connecting" event for each host, and the oldest "online" event for each host.
Yeah it does, you are correct. As soon as I add the transaction line, it returns no results... which is mind-boggling to me. I add this line and it returned no results:
| transaction host startswith="ALPXX=re-connecting" endswith="ALPXX=online"
The more I think about it, taking the query results with the append and not with the transaction does return only two events, the first re-connecting and the last online for ALPXX. Running a transaction on those two events would maybe at least return one transaction event for those two events but even if it did that, that's not ultimately what I want. I'm feeling pretty stumped on this. I have this working via a Python SDK which outputs to an csv but I need to create a dashboard with visualizations for it. I'm considering taking the csv instead and doing an input lookup on it to get the visualizations. I feel like I understand what transactions and concurrency does but for some reason it skips over necessary stopping points.
Sometimes I wish that we could inject hybrid code into SPL kinda like PL-SQL.
Thank you for replying. For some reason your solution causes the query to return no results.
It is baffling to me that transaction searches that way. Does it start at least at the earliest time and then proceed forward searching for the endswith and then step back to find the paired startswith to create a transaction?
When I took out this line: | dedup consecutive=true keepempty=true host sortby +_time
It did return results but they seem to skip even more logs till it gets to the one it flaga as the end.
Its baffling cause I am looking at the results of which ones it picks to be the endswith and it chooses one that is surrounded by ALPXX=online log entries, so its netiher the latest or earliest in that specific transaction time. I feel like I'm missing something pivotal that will be obvious once I see it. Thanks for your input. I am going to manipulate your query and see if I can figure out how to put it to my advantage.
As a side note, when I took out this line: | dedup consecutive=true keepempty=true host sortby +_time
I also had to do the search for ALPXX=* instead of ALP='re-connecting' or it wouldn't return any results. Im very curious to hear of the nature of how transaction processes the logs. Thanks for any guidance.