Solved: How to optimize the regular expression for our rex statement to extract Java errors from our sample data?

JDukeSplunk · ‎08-19-2016

So, we have a really nasty regex that runs against a customized version of a tomcat log. The rex finds certain strings within the _raw data and grabs the last bit of the error message. I am just looking for a more elegant solution, and one that will most likely not kill the search heads. If we find one that is good enough, we can get it out of inline and put it in a transforms/props.

 |rex field=_raw "(com.pega.apache.http.conn|java.sql|com.pega.pegarules.pub.clipboard|java.net|com.pega.pegarules.pub.services|com.pega.pegarules.pub.context|com.pega.pegarules.pub| com.pega.pegarules.pub.database|com.pega.pegarules.pub.generator|java.lang|com.sun.jersey.api.client).(?<type>\w+)(\s|:)"

The number of periods varies from event to event, and the match sometimes ends with a space and sometimes with a colon.

Some examples of the source. Highlighted are the bits we are currently extracting.

9:59:01.300 PM

2016-08-13 21:59:01,300 [http-bio-8004-exec-1] [ STANDARD] [ ] Portal:01.50 ERROR TTAPPPEGAAPP05.company.com|172.22.101.10|HTTP|PortalFeatures|Services|PostChallengeData|A660C7C3D30428FBD26529DE9859DEB5F - LookupList : error reading from file file://llc:/LLC/Rule-Obj-FieldValue/getFieldValue.xml. java.io.IOException: Exception 'com.pega.pegarules.pub.clipboard.InvalidStreamError: Invalid clipboard stream detected in module com.pega.pegarules.data.internal.clipboard.XMLStream.new

2016-08-13 14:32:56,776 [http-bio-8001-exec-5] [ STANDARD] [ ] PHSInt:01.01 ERROR TTAPPPEGAAPP02.company.com|172.22.101.10|HTTP|AssessmentServices|Services|SaveAssessmentAnswers|AEFBBD97AEE6CED837A732AD77C6C437F - Exception
com.pega.pegarules.pub.PRRuntimeException: Unable to identify default schema for the connection to Device_Staging
at com.pega.pegarules.data.internal.access.DatabaseTableImpl.getSchemaName(DatabaseTableImpl.java:360)
at com.pega.pegarules.data.internal.access.DatabaseTableImpl.getFullyQualifiedTableName(DatabaseTableImpl.java:416)
at com.pega.pegarules.data.internal.access.rdb.SQLParser.directive(SQLParser.java:653)

2016-08-13 13:17:25,746 [http-bio-8003-exec-5] [ STANDARD] [ ] PHSInt:01.01 ERROR TTAPPPEGAAPP02.company.com|172.22.101.10|HTTP|UserActivityInt|Services|SavePartUserActivityReq - HCIncentiveEvent failed for MemberEligID:69691976Params are ObjectiveID:103021210ActivityType:2::** Caught unhandled exception: java.net.SocketTimeoutException: Read timed out

2016-08-12 10:46:40,992 [http-bio-8003-exec-4] [ STANDARD] [ ] PHSInt:01.01 ERROR TTAPPPEGAAPP08.company.com|172.22.101.10|HTTP|MessageCenter|Services|SavePtNotifPreferences|A32A2BB43A9ABBCD410AAB8D6AC3D6FD3 - Not returning connection 2 for database "pegadata" to the pool as it previously encountered the following error
User ID: (unknown)
Last SQL: call SECUREMESSAGING_PKG.InsertUpdatePtPreference( ?, ?, ?, ?, ?, ?, ?, ? )
java.sql.SQLException: ORA-06502: PL/SQL: numeric or value error: character string buffer too small

mhpark · ‎08-19-2016

Judging only by the given examples, I would go with this:

 rex field=_raw "\.(?<error_type>[^\.\:]+(Exception|Error))\:"
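For anyone who wants to try the pattern outside Splunk, here is a minimal Python sketch that runs it against shortened excerpts of the sample events from the question. It assumes Python's `re` module, which spells named groups as `(?P<name>...)` rather than the `(?<name>...)` that rex accepts. The helper name `extract_error_type` is made up for illustration. Python's engine is not the one Splunk uses, so this only checks what gets extracted, not how fast.

```python
import re

# mhpark's pattern, unchanged except for Python's (?P<name>...) group syntax.
ERROR_TYPE = re.compile(r"\.(?P<error_type>[^\.\:]+(Exception|Error))\:")

# Excerpts shortened from the sample events in the question.
samples = [
    "getFieldValue.xml. java.io.IOException: Exception "
    "'com.pega.pegarules.pub.clipboard.InvalidStreamError: Invalid clipboard stream",
    "com.pega.pegarules.pub.PRRuntimeException: Unable to identify default schema",
    "Caught unhandled exception: java.net.SocketTimeoutException: Read timed out",
    "java.sql.SQLException: ORA-06502: PL/SQL: numeric or value error",
]


def extract_error_type(event):
    """Return the first class name ending in Exception/Error, or None."""
    match = ERROR_TYPE.search(event)
    return match.group("error_type") if match else None


extracted = [extract_error_type(s) for s in samples]
for value in extracted:
    print(value)
```

Note that rex, like `search` here, keeps only the first match per event unless told otherwise, so the first sample yields the outer exception rather than the nested one.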


gabriel_vasseur · ‎08-22-2016

I like mhpark's answer, but I thought I would comment on your original regex too.

First, is your main problem with it performance or elegance? I think the Job Inspector might help measure the performance; maybe there's a line dedicated to regexes. If performance isn't an issue, then elegance should not keep you awake at night as much as maintainability. In that respect, your regex isn't particularly nasty.

About the regex itself: first, all your dots should be escaped, especially the one outside the parentheses. As I'm sure you know, a dot matches any character, so for instance this bit of your regex, (com.pega.pegarules.pub).(?<type>\w+)(\s|:), would match the string com.pega.pegarules.public: and extract "ic" as a type... 🙂
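The unescaped-dot pitfall is easy to reproduce. This is a small Python sketch, assuming Python's `re` module and its `(?P<name>...)` spelling for named groups; the variable names are only for illustration.

```python
import re

# Same fragment as in the post: once with bare dots, once with escaped dots.
UNESCAPED = re.compile(r"(com.pega.pegarules.pub).(?P<type>\w+)(\s|:)")
ESCAPED = re.compile(r"(com\.pega\.pegarules\.pub)\.(?P<type>\w+)(\s|:)")

bogus = "com.pega.pegarules.public:"
real = "com.pega.pegarules.pub.PRRuntimeException: Unable to identify default schema"

# The bare dot after the group swallows the "l" of "public", leaving "ic".
loose_type = UNESCAPED.search(bogus).group("type")
# With escaped dots the bogus string no longer matches at all.
strict_match = ESCAPED.search(bogus)
# The genuine event is still extracted as intended.
real_type = ESCAPED.search(real).group("type")

print(loose_type)    # ic
print(strict_match)  # None
print(real_type)     # PRRuntimeException
```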

You can also speed things up a bit by starting with a word boundary: \b(com\.pega\.apache\.http\.conn|java\.sql|........

Finally, you could regroup similar alternatives together. So for instance, you could replace com\.pega\.apache\.http\.conn|...|com\.pega\.pegarules\.pub\.clipboard with com\.pega\.(apache\.http\.conn|pegarules\.pub\.clipboard)|.... That should speed things up a bit, but again you need to benchmark it to see if it's worth the loss in readability.
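As a sanity check before benchmarking, the sketch below builds both forms in Python and confirms they extract the same values from excerpts of the sample events. It assumes Python's `re` module (named groups spelled `(?P<name>...)`), drops the stray space before one alternative in the original, and uses one possible regrouping of my own choosing; other factorings are just as valid. It says nothing about speed in Splunk, which still needs to be measured there.

```python
import re

# Package prefixes taken from the original rex (stray leading space dropped).
PREFIXES = [
    "com.pega.apache.http.conn", "java.sql",
    "com.pega.pegarules.pub.clipboard", "java.net",
    "com.pega.pegarules.pub.services", "com.pega.pegarules.pub.context",
    "com.pega.pegarules.pub", "com.pega.pegarules.pub.database",
    "com.pega.pegarules.pub.generator", "java.lang",
    "com.sun.jersey.api.client",
]
TAIL = r"\.(?P<type>\w+)(?:\s|:)"

# Flat alternation: every dot escaped, anchored on a word boundary.
FLAT = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in PREFIXES) + ")" + TAIL)

# One possible regrouping that factors out the shared prefixes.
GROUPED = re.compile(
    r"\b(?:"
    r"com\.pega\.(?:apache\.http\.conn"
    r"|pegarules\.pub(?:\.(?:clipboard|services|context|database|generator))?)"
    r"|java\.(?:sql|net|lang)"
    r"|com\.sun\.jersey\.api\.client"
    r")" + TAIL
)


def extract(pattern, event):
    """Return the captured type for the first match, or None."""
    match = pattern.search(event)
    return match.group("type") if match else None


events = [
    "Exception 'com.pega.pegarules.pub.clipboard.InvalidStreamError: Invalid clipboard stream",
    "com.pega.pegarules.pub.PRRuntimeException: Unable to identify default schema",
    "Caught unhandled exception: java.net.SocketTimeoutException: Read timed out",
    "java.sql.SQLException: ORA-06502: PL/SQL: numeric or value error",
    "at com.pega.pegarules.data.internal.access.SQLParser.directive(SQLParser.java:653)",
]

flat_types = [extract(FLAT, e) for e in events]
grouped_types = [extract(GROUPED, e) for e in events]
print(flat_types == grouped_types)
```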

That's assuming you're not going with something a lot simpler (but is it faster? :-P) like mhpark suggested.

mhpark · ‎08-22-2016

Writing out all your terms would be faster for sure.
I was assuming there might be cases that the terms already given would not cover.

Thank you for your comment 🙂

JDukeSplunk · ‎08-22-2016

Thanks guys. mhpark's works pretty well, and it runs faster, although it extracts some of the exceptions a little differently than the original.
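One case where the two differ can be reproduced with an excerpt of the first sample event, where one exception is nested inside another. This is a Python sketch assuming the `re` module and its `(?P<name>...)` group syntax; other events may differ in other ways that are not shown here.

```python
import re

# The original inline pattern from the question, dots left unescaped as posted.
ORIGINAL = re.compile(
    r"(com.pega.apache.http.conn|java.sql|com.pega.pegarules.pub.clipboard"
    r"|java.net|com.pega.pegarules.pub.services|com.pega.pegarules.pub.context"
    r"|com.pega.pegarules.pub| com.pega.pegarules.pub.database"
    r"|com.pega.pegarules.pub.generator|java.lang|com.sun.jersey.api.client)"
    r".(?P<type>\w+)(\s|:)"
)
# mhpark's generic pattern.
GENERIC = re.compile(r"\.(?P<error_type>[^\.\:]+(Exception|Error))\:")

# Excerpt of the first sample event.
event = (
    "java.io.IOException: Exception "
    "'com.pega.pegarules.pub.clipboard.InvalidStreamError: "
    "Invalid clipboard stream detected"
)

# java.io is not in the original prefix list, so the original skips ahead
# to the nested Pega error; the generic pattern stops at the first hit.
original_type = ORIGINAL.search(event).group("type")
generic_type = GENERIC.search(event).group("error_type")

print(original_type)  # InvalidStreamError
print(generic_type)   # IOException
```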

Gabe,

I like your comment, but the flexibility of not having to update the preceding strings every time a new one is added made me shy away from it. That was another of my goals. So if tomorrow a new error showed up under java.some.bs.string.like.this, I wouldn't have to edit the dashboards/reports to catch it.

-JD
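The flexibility point can be illustrated with a made-up event. In this Python sketch (assuming the `re` module and `(?P<name>...)` group syntax), the class name `BrandNewException` and the shortened three-prefix list are hypothetical and only stand in for a package that nobody has added to the list yet.

```python
import re

# Generic pattern: no package names to maintain.
GENERIC = re.compile(r"\.(?P<error_type>[^\.\:]+(Exception|Error))\:")
# Hypothetical, shortened fixed-prefix pattern for comparison.
FIXED_LIST = re.compile(r"\b(?:java\.sql|java\.net|java\.lang)\.(?P<type>\w+)(?:\s|:)")

# Made-up event under a package that is not in the fixed list.
new_event = "java.some.bs.string.like.this.BrandNewException: something failed"

generic_hit = GENERIC.search(new_event)
fixed_hit = FIXED_LIST.search(new_event)

print(generic_hit.group("error_type"))  # BrandNewException
print(fixed_hit)                        # None
```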

gabriel_vasseur · ‎08-23-2016

Yes, that is best. I mostly commented for the educational value!

gabriel_vasseur · ‎08-22-2016

That's a good point, I don't know how easy it is to gather an exhaustive list.

