Splunk Search

What's wrong with my rex?

Communicator

I'm trying to time an event. The event occurs within a burst of log events, which arrive at my Splunk server and usually end up being contained as one Splunk event. This burst of log events contains date-time stamps for each individual transaction, so I'm rex'ing in an attempt to both isolate both the beginning and the end of the subsection I'm interested in, as well as the times for beginning and end. Then I convert those strings to epoch, so I can get the total time it took to complete that part. (...with some help from other people on this community to create the rex string.)

So here is both the beginning event and the rex string:

1206 090527 (3185) 3 SparkPostOutConnector: Log initialized

 rex "^(?<log_i>.+?)\s\(3185\)\s3\sSparkPostOutConnector:\sLog\sinitialized"

And here is the end event, and the rex string:

1206 090528 (3185) 3 SparkPostOutConnector: EMail Sent Successfully!! Accepted recipients: 2

rex "^(?<job_m>.+?)\s\(3185\)\s3\sSparkPostOutConnector:\sEMail\sSent\sSuccessfully"

And the good news it that it works...sometimes. For instance, sometimes I get the value I'm looking for:

log_i = 1204 102211
job_m = 1204 102212

And other times...maddeningly...it absolutely ignores what is supposed to be an obvious start and end to the event, it completely ignores them. For instance, in this log entry...it will not find the start and end section:

1206 090526 (1719) 2 JobBegin:servicerequest=RestInput
1206 090526 (1726) 3 Collect Event:ReceiptData < servicerequest=RestInput
1206 090526 (5026) 3 Sender of job set to:
1206 090526 (1730) 3 preproc message:ReceiptData < servicerequest=RestInput
1206 090526 (1731) 3 Doing message:ReceiptData < servicerequest=RestInput
1206 090526 (1739) 3   Process:ProcessingEngine_New_job_1_exstream (v. 1) > PDFOutput(null)
1206 090527 (0089) 2 EX001012W (1) The system key expires in 58 days on 2019-02-01. Send an email to "software.keys@opentext.com" for information on updating your key or contact OpenText Customer Support if email is unavailable.
1206 090527 (0089) 2 EX002522W (1) The locale is not set for this application. The language and locale will be set to default.
1206 090527 (0089) 3 EX002697I (1) Data aggregation has been enabled due to XML end tag processing.
1206 090527 (0089) 3 EX001800I (1) >>>>>>>>>>>> Driver File: opening the customer driver file named ReceiptData. <<<<<<<<<<<<
1206 090527 (0089) 3 EX001000I (1) ********************* Customer 1:    Starts at input record 3 (byte 144)
1206 090527 (0089) 1 EX004108E (1) The PNG pass-through file ($$CNT$$k7iFrf4NoInN9jSQT9WfcQ==) for variable RES MANAGER FILE is not a supported file or is damaged. File is ignored.
1206 090527 (0089) 1 EX004108E (2) The PNG pass-through file ($$CNT$$J+jQzJSMMF_9TsTKtJIJrg==) for variable RES MANAGER FILE is not a supported file or is damaged. File is ignored.
1206 090527 (0089) 2 EX004146W (1) Table rows set to split across pages are not supported in UDL output types. Rows will print on a single page and may cause output inconsistencies.
1206 090527 (0089) 2 EX004146W (2) Table rows set to split across pages are not supported in UDL output types. Rows will print on a single page and may cause output inconsistencies.
1206 090527 (0089) 2 EX004146W (3) Table rows set to split across pages are not supported in UDL output types. Rows will print on a single page and may cause output inconsistencies.
1206 090527 (0089) 2 EX004146W (4) Table rows set to split across pages are not supported in UDL output types. Rows will print on a single page and may cause output inconsistencies.
1206 090527 (0090) 2 OpenText Exstream Engine exited with exit code 8.
1206 090527 (5105) 3 Receiver of created document set (Document, reciever):E1C4B98A-04E6-3646-9FA7-7CBA2721A97E,
1206 090527 (5105) 3 Receiver of created document set (Document, reciever):17F9CFC3-D762-AA43-AB81-AA86DAA6CB77,
1206 090527 (3185) 3 SparkPostOutConnector: Log initialized
1206 090527 (0576) 3 Queue item E1C4B98A-04E6-3646-9FA7-7CBA2721A97E with external ID '' completed successfully.
1206 090527 (3185) 3 SparkPostOutConnector: There are 2 attachments
1206 090527 (3185) 3 SparkPostOutConnector: DataItem: id=9FFE7869-1511-F44C-A5CC-82520E4B30E9name=HTMLImage9.png, type=attachment, contentType=image/png
1206 090527 (3185) 3 SparkPostOutConnector: DataItem: id=FDC3FF95-E8C3-C24F-864A-4E66830E9F77name=RTGLogo2.png, type=attachment, contentType=image/png
1206 090527 (3185) 3 SparkPostOutConnector: Processing a request
1206 090528 (3185) 3 SparkPostOutConnector: EMail Sent Successfully!! Accepted recipients: 2
1206 090528 (0576) 3 Queue item 17F9CFC3-D762-AA43-AB81-AA86DAA6CB77 with external ID '' completed successfully.
1206 090528 (5105) 3 Receiver of created document set (Document, reciever):FD540AAC-0978-A146-8E85-331926BA46C7,
1206 090528 (1723) 2 JobEnd:servicerequest=RestInput
1206 090528 (0576) 3 Queue item 8B552D3D-97B6-0A46-881B-282137BFDF92 with external ID '' completed successfully.
1206 090528 (0576) 3 Queue item FD540AAC-0978-A146-8E85-331926BA46C7 with external ID '' completed successfully.
1206 090528 (0572) 3 Tracker 'B27EF624-8FF2-5C46-B408-3327A2B4D5A3' (StrsID 5207606, external ID '') completed successfully.

Executing this search comes up with nada:

index=foo "1719" "0572" 
| rex "^(?<log_i>.+?)\s\(3185\)\s3\sSparkPostOutConnector:\sLog\sinitialized"
| rex "^(?<job_m>.+?)\s\(3185\)\s3\sSparkPostOutConnector:\sEMail\sSent\sSuccessfully"
| eval job_start_epoch = strptime(log_i, "%m%d %H%M%S") 
| eval job_end_epoch = strptime(job_m, "%m%d %H%M%S") 
| eval job_dur_in_sec = job_end_epoch - job_start_epoch
| table job_start_epoch, job_end_epoch, job_dur_in_sec

If I manage to get any results out of this search, the statistics table will put the begin event on one line, end event on another line, and absolutely refuses to do the eval for the duration between the two. (So that's another problem.)

I'm stumped. Ideally, what I'd like the table to show (on the same line) would be the begin time, the end time, and the time it took (in seconds) to complete the sub-transaction within the larger transaction. Oh, and if at all possible, to eliminate cases where there were incomplete events (because these do happen). I know my rex's work...because if I break down the search to only look for that single event, I can get it occasionally - so SOMETHING about them works (and I developed it on regex101). What I can't figure out is why it won't work every time. Assuming the data is consistent, the only thing I can figure is that my rex is faulty.

Anyone feel like telling this newb where he went wrong?

0 Karma

SplunkTrust
SplunkTrust

What we need is a copy of the raw log, before any muddling by anything else, so that line breaking and white space is preserved.

One thing I noted in the regex was the very beginning with your named capture groups:

^(?<log_i>.+?)

I would not use ".+" here, in this small sample it can take 8000+ steps to match. Since you have a defined format you could re-write as follows:

^\s*(?<log_i>\d+\s+\d+)\s+\(3185\)\s+3\s+SparkPostOutConnector:\s+Log\s+initialized
^\s*(?<job_m>\d+\s+\d+)\s+\(3185\)\s+3\s+SparkPostOutConnector:\s+EMail\s+Sent\s+Successfully

This gets it down to ~500 steps to match and reduces the time it takes to process by 25-85%.

I'm not sure without the actual raw data samples, but this might help with your matching problem...also, I added a zero or more whitespace to the beginning...from your paste it looks like maybe there's whitespace at the start of the line and your original regex doesn't account for that (but that could also just be from the wonky formatting in the post...)

To be honest, your original regex did make the capture in rege101...so...really need the actual raw...

0 Karma

Communicator

Thank you for the input! Yes, I had the search running on rege101, so I started pulling my hair out...I just knew I was doing something wrong, but not enough experience to figure out what.

SplunkTrust
SplunkTrust

shameless plug, would you mind accepting my answer if you feel like that's appropriate 🙂

0 Karma

Communicator

I don't mind accepting your answer, but I've already accepted and awarded points to JensT. Is it even possible to accept another one?

0 Karma

SplunkTrust
SplunkTrust

strange, its not showing as answered...but thats fine lol, we don't need 2 answers! 😛

0 Karma

Communicator
index=foo
| rex "^(?<datetime>\d+\s\d+)\s\((?<job_id>\d+)\)"
| eval timestamp = strptime(datetime,"%m%d %H%M%S")
| eval job_start_time = if(match(_raw, "Log initialized"), timestamp, null())
| eval job_end_time = if(match(_raw, "EMail Sent Successfully"), timestamp, null())
| stats values(job_start_time) AS Start, values(job_end_time) AS End by job_id
| eval durationInSec = End-Start
| eval Start = strftime(Start, "%D %T"), End = strftime(End, "%D %T")
0 Karma

Communicator

This is a fascinating search. Right off, it extracts the time and job_ID from each line. Then, it assigns an epoch date to each line. Then it evaluates to see if the line it's working on is the correct job ID to start on...except I think at this point it falls down. This search (an extracted part),

index=foo
| rex "^(?<datetime>\d+\s\d+)\s\((?<job_id>\d+)\)"
| eval timestamp = strptime(datetime,"%m%d %H%M%S")
| eval mail_start_time = if(match(_raw, "Log initialized"), timestamp, null())
| eval mail_end_time = if(match(_raw, "EMail Sent Successfully"), timestamp, null())
| table job_id, timestamp, mail_start_time, mail_end_time

Generates this table:

job_id,     timestamp,          mail_start_time,    mail_end_time
1719,       1544130116.000000   ,           ,
1719,       1544130114.000000
0089,       1544130112.000000
0089,       1544130106.000000
1719,       1544130112.000000
1719,       1544130088.000000
1719,       1544130083.000000
0089,       1544130080.000000,  1544130080.000000,  1544130080.000000
1719,       1544130106.000000
0576,       1544130076.000000
1719,       1544130076.000000
1719,       1544130067.000000
1719,       1544130080.000000
0089,       1544130064.000000
1719,       1544130064.000000
1719,       1544130064.000000
1719,       1544130060.000000
1719,       1544130061.000000
1719,       1544130025.000000
0089,       1544130022.000000
1719,       1544130022.000000
1719,       1544130023.000000
0089,       1544130012.000000
1719,       1544130012.000000

It doesn't appear to be finding this event:

 1206 090527 (3185) 3 SparkPostOutConnector: Log initialized

...or this event:

1206 090528 (3185) 3 SparkPostOutConnector: EMail Sent Successfully!! Accepted recipients: 2

...so I don't think it's figuring out the time difference between them.

0 Karma

Path Finder

The query provided by @JensT is the query you are looking for.
If you are sure that your sample dataset is as per real data, then verify if you don't have any event sampling turned on by coincidence.
Here is a link describing how to turn the event sampling off:
https://docs.splunk.com/Documentation/Splunk/7.2.1/Search/Retrieveasamplesetofevents#Specify_a_sampl...

0 Karma

Communicator

Thanks, that is good advice, and I did check it. And, no, sampling is not turned on. I can say this because I am unable to find the ui-prefs.conf in my /$SPLUNK_HOME/etc/apps/<app_name>/local/ directory, and according to the link you gave me, that would be there if it were set.

I would be more than willing to upload a sample of the logs...is there a way to take an export and then upload it for use?

0 Karma

Path Finder

In Splunk web UI right below the search bar is menu for sampling.
Do you have there "No Event Sampling" ? Or is there something like "Sampling 1:10" ?

And you can upload sample logs by attaching an attachment, I guess. I am a bit new to this forum as a registered user and me myself I don't have enough points yet to attach an attachment. I would attach a screenshot of the sampling settings if I can 😉

0 Karma

Communicator

It is set as "No Event Sampling".

My priorities have shifted, and I've been pointed at a new project, so now I can't follow up and upload a file. I've decided rather than spend points uploading the log, I'll award the points to JensT, for an honest effort and I'll hope that if I have to come back to this, he'll be charitable and continue helping me with it.

Thank you, JensT.

0 Karma

Communicator

You're most welcome!
A pleasure.

0 Karma

Communicator

Hi,

not sure what you mean. I used exactly your dataset.
The table output does not make sense. As it indicates that one event is start and end at the same time.
The trick is the stats ! It will not work with a table.

0 Karma

Can you show us a screenshot of how the data looks in splunk? is the example you posted one large event in splunk, or are there multiple events?

0 Karma

Communicator

"a screenshot of how the data looks in Splunk"? I'm not certain how to post screenshots, would it be more useful, perhaps to describe it? Because these bursts of log data about the transactions happen typically within a few seconds, so each burst ends up in separate Splunk events, sometimes even spanning the minute-mark (and ending up as two Splunk events, despite being single bursts).

That data I posted was a single event in Splunk (despite being multiple log entries for the application which generated them). However, as I said, sometimes these do span multiple Splunk events. Rarely will one of these bursts take longer than a few seconds, though, so typically they end up as single Splunk events. That is why I resorted to using rex to parse both the log entries for both start/begin time and stop/end time.

0 Karma