We have several problems that we weren't able to resolve with Splunk's SPL. They are listed below. Any suggestions are greatly appreciated.
Let's say we have several event types: A, B, C, D, E.
Each event posted is in a _time ordered sequence which we need to maintain.
PROBLEM 1:
And we want transactions that begin with A and end with D.
We can't do the simple transaction below:
... | transaction startswith="A" endswith="D"
because it gives us wrong data, with A sometimes missed inside the Splunk recursive search:
For example, if the event sequence was (AB)[AB]CDA.., Splunk returns the transaction ABCD from the outer sequence (AB)(CD) as well as the inner sequence [AB][CD], which is incorrect. To be correct, the first (AB) should be a discarded transaction, and [AB]CD should be returned, then a new transaction starts again at A...
So we decided to find transactions that ALWAYS start with A. For instance:
... | transaction startswith="A" 
This will give us sample events like:
A
ABCB
AB
ABCE
ABCDCBCBCE
ABCDCDE
ABCDE
ABCDCDCBCDE
ABCBCE
ABCDCDCDE
Now I want to find transactions that start with A, and end in D, but since A is the marker for the beginning of a transaction, we do not want an A grouped into a wrong transaction like above (A starts a new transaction), so we do this:
... | transaction startswith="A" | transaction startswith="A" endswith="D"
This should give us only a subset of the above: transactions that contain D, but do not necessarily end with D:
ABCDCDE
ABCDE
ABCDCDCBCDE
ABCDCDCDE
PROBLEM 2:
We would rather have the transactions trimmed at the last D, but Splunk does not do this; it leaves whatever the superset contained. Piping one transaction into another may not be the solution either. The result we want is:
ABCDCD
ABCD
ABCDCDCBCD
ABCDCDCD
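To make the desired trimming concrete, here is a minimal sketch in Python (not SPL) that cuts each transaction right after its last D. It assumes each transaction has already been flattened to a string with one letter per event; the helper name `trim_after_last` is made up for illustration.

```python
def trim_after_last(path, end="D"):
    """Cut a path right after its last `end` event; None if `end` never occurs."""
    idx = path.rfind(end)
    return path[: idx + 1] if idx != -1 else None

# The four transactions returned by the piped transaction search above.
segments = ["ABCDCDE", "ABCDE", "ABCDCDCBCDE", "ABCDCDCDE"]

trimmed = [t for t in map(trim_after_last, segments) if t is not None]
print(trimmed)  # ['ABCDCD', 'ABCD', 'ABCDCDCBCD', 'ABCDCDCD']
```

Transactions with no D at all come back as None and are dropped, which matches the "must contain D" filter.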
Now, let's say we want a transaction that starts with A, is followed by one or more Bs, and ends in E. There could be zero or more Cs and Ds in the mix, but as long as A*B*E is satisfied, every transaction that meets this requirement should be returned. In this example we are using only 5 event types, but we could have dozens of event types, so we might be looking for a sequence such as A*B*G*M*V*Z.
For A*B*E, the PROBLEM 1 example above should return this subset:
ABCE
ABCDCBCBCE
ABCDCDE
ABCDE
ABCDCDCBCDE
ABCBCE
ABCDCDCDE
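As a rough illustration of the A*B*E filter (in Python, not SPL), the check can be written as a regular expression over the flattened transaction string. This assumes one letter per event; the variable names are only for the example.

```python
import re

# A-delimited transactions from the PROBLEM 1 example (one letter per event).
segments = ["A", "ABCB", "AB", "ABCE", "ABCDCBCBCE", "ABCDCDE",
            "ABCDE", "ABCDCDCBCDE", "ABCBCE", "ABCDCDCDE"]

# Starts with A, has at least one B somewhere after it, and ends in E.
a_b_e = re.compile(r"^A.*B.*E$")

matches = [s for s in segments if a_b_e.search(s)]
print(matches)
# ['ABCE', 'ABCDCBCBCE', 'ABCDCDE', 'ABCDE', 'ABCDCDCBCDE', 'ABCBCE', 'ABCDCDCDE']
```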
PROBLEM 3:
We need to specify the order of specific events, while allowing many unknown events to be intermixed as long as the sequence is satisfied. Example: A*B*G*M*V*Z. Is there any way to do this?
It should return something like this as a transaction:
ABCBCDGJKMSRSVYZ
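With dozens of event types the names will probably not be single letters, so a regex over a flattened string gets awkward. A sketch of the same ordered-marker check over a list of event names, in Python rather than SPL, could look like this; `follows_sequence` and the multi-character names in the second call are hypothetical.

```python
def follows_sequence(events, markers):
    """True if events start with markers[0], end with markers[-1], and
    contain every marker in order, with any other events in between."""
    if not events or events[0] != markers[0] or events[-1] != markers[-1]:
        return False
    remaining = iter(events)
    # `m in remaining` advances the iterator until it finds m, so each
    # marker has to appear after the previous one.
    return all(m in remaining for m in markers)

print(follows_sequence(list("ABCBCDGJKMSRSVYZ"), list("ABGMVZ")))                 # True
print(follows_sequence(["login", "home", "cart", "pay"], ["login", "cart", "pay"]))  # True
print(follows_sequence(list("ABCD"), list("ABE")))                                # False
```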
Thanks in advance.
 
					
				
		
Splunk transactions are built in reverse order, and the transaction command actually requires that the events are ordered by descending time. So Splunk looks first for the end of the transaction and then works backwards to the beginning. When you think about this, it may change your approach. You might try this:
yoursearchhere 
| transaction id endswith="D"
| where mvindex(somefield, 0)=="A"
But this example causes Splunk to build a new transaction each time it sees an instance of "D" - and what you want is for Splunk to start with the earliest instance of "A" and end with the latest instance of "D", with no intervening "A"s.
Also, Splunk cannot deal with interleaved transactions unless there is a unique identifier for each transaction. If you have multiple transactions with an id of "35", then one transaction must end before another begins.
Finally, the transaction command is very memory intensive. When Splunk runs short of memory, it may "evict" transactions that it otherwise would have kept. Test your searches with the smallest reasonable time range to avoid this problem. You may want to look at the Search Job Inspector for any warnings that would indicate that Splunk was not able to form the transactions completely.
I think what would work best is
yoursearchhere 
| transaction id startswith="A"
| then go backwards through the transaction's events to the latest event that is "D"
As you can see, I haven't figured out the last part yet. But I think that some of these other issues may be inhibiting your progress. I hope this helps you to get closer to a solution.
 
					
				
		
		Do you still have this issue?
 
					
				
		
Yes. The issue is Splunk transactions are recursive in nature so it dives in and comes back out of the recursion. We would need a linear option for this scenario to work.
Given your example in problem 1 of your question ( (AB)[AB]CDA... ), and given the structure of the sample data you've provided in the above comment, what in the data would tell you to discard the first (AB) in problem 1? Anything?
 
					
				
		
To clarify, "A" is the starting point for any transaction, so anytime an "A" is found before the ending point "D" is found, it should start a new transaction and discard the prior one.
But the way Splunk handles transactions, when another startswith "A" is found (2nd transaction) before the prior transaction (1st) is complete, Splunk opens a new transaction (2nd), which in turn looks for its own endswith. When the 2nd transaction finds its endswith "D", it is complete, and Splunk returns to the 1st transaction. This is where the problem lies.
The 1st transaction will continue to look for the endswith "D", and when it finds it, it will complete the 1st transaction. (AB)(CD) is returned as a valid transaction, which is incorrect, because (AB) is not contiguous with (CD) according to the timestamps. Many events could have inserted themselves between (AB) and (CD), but we kept it to only 4 extraneous events in this case.
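The linear behaviour described here can be sketched in a few lines of Python (not SPL): a single pass in time order where every A restarts the open transaction and the first D after it closes it. The function name and the one-letter events are assumptions for illustration.

```python
def linear_transactions(events, start="A", end="D"):
    """Single forward pass: each `start` discards any unfinished
    transaction; the first `end` after it closes the transaction."""
    done = []
    current = None  # events of the open transaction, or None if none is open
    for ev in events:
        if ev == start:
            current = [ev]  # restart, dropping whatever was open
        elif current is not None:
            current.append(ev)
            if ev == end:
                done.append("".join(current))
                current = None  # closed; wait for the next start event
    return done

print(linear_transactions("ABABCDA"))  # ['ABCD']
```

Note that this closes at the first D. Trimming at the last D before the next A, as PROBLEM 2 asks, would need the transaction to stay open until the next A arrives.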
Do you have any kind of transaction ID in the data? Or some sample data? Also, to what end do you need to join the events into transactions like this? To count success vs failure, visualization, etc?
 
					
				
		
Basically, we need to determine the path a person took from point X to point Y, always beginning at X. Anytime a person reaches a point, an event is logged saying that he reached that point. Sample data could be simple JSON, and the returned transaction would be a list of these events.
{ "location": "A", "id":35, "time": 1454532214 }
{ "location": "B", "id":35, "time": 1454532215 }
{ "location": "C", "id":35, "time": 1454532216 }
{ "location": "B", "id":35, "time": 1454532217 }
{ "location": "C", "id":35, "time": 1454532218 }
{ "location": "D", "id":35, "time": 1454532219 }
{ "location": "C", "id":35, "time": 1454532220 }
{ "location": "D", "id":35, "time": 1454532221 }
{ "location": "E", "id":35, "time": 1454532222 }
{ "location": "A", "id":35, "time": 1454532223 }
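For reference, here is a small Python sketch (not SPL) of how this sample would be split into paths: sort by time, group by id, and begin a new path at every "A". The function name `paths_by_id` is hypothetical, and events seen before the first "A" for an id are simply ignored, which is an assumption.

```python
import json
from collections import defaultdict

raw = """
{ "location": "A", "id":35, "time": 1454532214 }
{ "location": "B", "id":35, "time": 1454532215 }
{ "location": "C", "id":35, "time": 1454532216 }
{ "location": "B", "id":35, "time": 1454532217 }
{ "location": "C", "id":35, "time": 1454532218 }
{ "location": "D", "id":35, "time": 1454532219 }
{ "location": "C", "id":35, "time": 1454532220 }
{ "location": "D", "id":35, "time": 1454532221 }
{ "location": "E", "id":35, "time": 1454532222 }
{ "location": "A", "id":35, "time": 1454532223 }
"""

def paths_by_id(lines, start="A"):
    """Group events by id in time order and begin a new path at each `start`."""
    events = sorted((json.loads(line) for line in lines if line.strip()),
                    key=lambda e: e["time"])
    segments = defaultdict(list)
    for ev in events:
        segs = segments[ev["id"]]
        if ev["location"] == start:
            segs.append([ev["location"]])
        elif segs:  # skip events that arrive before the first start event
            segs[-1].append(ev["location"])
    return {key: ["".join(s) for s in segs] for key, segs in segments.items()}

print(paths_by_id(raw.splitlines()))  # {35: ['ABCBCDCDE', 'A']}
```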
I am thinking that once I get the transaction that is separated by say "A", then I could use regex to grab the transactions that would fit my criteria, say "A*B*C*D" where an "asterisk" represents possible loops of any OTHER location than the location following the "asterisk" (which would be NOT "B" for A*B).
The final outcome would be to find:
1) The count of the number of transactions that match the criteria. ie A*B*C*D
2) Compare that transaction count to the previous supersets to determine how many people never reached the next location.
   a) A* - count 500
   b) A*B - count 490 (10 persons never reached B)
   c) A*B*C - count 300 (190 persons never reached C)
   d) A*B*C*D - count 130 (170 persons never reached D)
3) List the most likely transactions that match the criteria and group them. ie A*B*C*D returned the following possible transactions. There will likely be more than 6 transaction types but for brevity...
   a) ABCD - count 10
   b) AXBCXD - count 16
   c) AYXBXYCXYXYD - count 49 - this would be something I would like to know as the most favorite path.
   d) ABYXCYXYXYXYXYD - count 3
   e) AXYXYBCYXYXD - count 22
   f) ABCXYXYXYXYXD - count 30  - this would be something I would like to know as the 2nd most favorite path.
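Once the paths exist as strings, outcomes 2) and 3) are mostly counting. A rough Python sketch (not SPL), assuming one letter per location; the `paths` list is invented sample data, not real results, and `depth_reached` is a hypothetical helper.

```python
from collections import Counter

def depth_reached(path, markers):
    """How many of the leading markers the path hits, in order."""
    pos = 0
    depth = 0
    for m in markers:
        pos = path.find(m, pos)
        if pos == -1:
            break
        depth += 1
        pos += 1
    return depth

# Hypothetical paths, each already split so that it starts at A.
paths = ["ABCD", "AXBCXD", "AXBCXD", "AXB", "AX", "ABYC"]
markers = "ABCD"

depths = [depth_reached(p, markers) for p in paths]

# funnel[k-1] = number of paths that reached at least the first k markers.
funnel = [sum(d >= k for d in depths) for k in range(1, len(markers) + 1)]
print(funnel)  # [6, 5, 4, 3]

# Rank the paths that satisfy the full A*B*C*D criteria by frequency.
favorites = Counter(p for p, d in zip(paths, depths) if d == len(markers))
print(favorites.most_common())  # [('AXBCXD', 2), ('ABCD', 1)]
```

The difference between neighbouring funnel entries is the number of people who never reached the next location.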
