Use positive lookahead in regex when applying field transformation at index time

Path Finder

I am trying to normalize the URLs in a Tomcat access log so that I can analyze how request performance evolves over time.

Example URLs:


192.33.20.22 2014-10-15 13:47:16,477 "POST /test/rest/1.0/payments/reimbursement/164653 HTTP/1.1" 400 240 773 10282
192.33.20.22 2014-10-15 13:46:27,062 "POST /test/rest/1.0/payments/reimbursement/164653 HTTP/1.1" 400 241 2068 10282
192.33.20.22 2014-10-23 12:45:26,197 "GET /test/rest/1.0/applications/10113 HTTP/1.1" 200 507 97 110860
192.33.20.22 2014-10-23 11:54:05,302 "GET /test/rest/1.0/applications/10114 HTTP/1.1" 200 507 92 110860
192.33.20.22 2014-10-23 11:53:54,313 "GET /test/rest/1.0/applications/10115/generateKey HTTP/1.1" 200 509 1236 110860
192.33.20.22 2014-10-23 11:53:54,313 "GET /test/rest/1.0/applications/10116/generateKey HTTP/1.1" 200 509 1236 110860

There are many different types of URLs. These are just a few examples, so the solution must be generic.

I want to replace every occurrence of an ID in the URL with a common element (like "byId") so that requests to the same endpoint are grouped together when I analyze performance.

What I have done so far:

.../system/props.conf:


[test-access-log]
TRANSFORMS-fix-urls = remove-trailing-id

.../system/transforms.conf:

[remove-trailing-id]
REGEX = ^(.*)(GET|HEAD|POST|PUT|OPTIONS|CONNECT)(\s\/test)((\/.*?)+)\/(?=[0-9]{1,}\s)([0-9]{1,})(\sHTTP.*)$
FORMAT = $1$2$3$4/byId$6
DEST_KEY = _raw

I am using a regex positive lookahead, (?=[0-9]{1,}\s), to detect when an ID is coming. As you can see, the fifth group ($5) should be the ID in each case (example: /12345). I have tested my regular expression in a regular expression tester and it works there. However, I am unable to make it work with Splunk.
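One way to check which capture group holds what is to run the same pattern outside Splunk. The sketch below uses Python's `re` module, on the assumption that it numbers groups the same way Splunk's PCRE engine does for this pattern: by opening parenthesis, with the lookahead not counted. Under that assumption the nested group `(\/.*?)` takes number 5, the ID lands in group 6, and the trailing ` HTTP...` text is group 7. This is offered as something to verify, not as a confirmed diagnosis of the Splunk behavior.

```python
import re

# Pattern copied from the transforms.conf stanza in the question.
PATTERN = re.compile(
    r'^(.*)(GET|HEAD|POST|PUT|OPTIONS|CONNECT)(\s\/test)((\/.*?)+)'
    r'\/(?=[0-9]{1,}\s)([0-9]{1,})(\sHTTP.*)$'
)

# One of the sample lines from the question.
LINE = ('192.33.20.22 2014-10-23 12:45:26,197 '
        '"GET /test/rest/1.0/applications/10113 HTTP/1.1" 200 507 97 110860')

match = PATTERN.match(LINE)

# Print every capture group with its number.
for number in range(1, PATTERN.groups + 1):
    print(number, repr(match.group(number)))

# Equivalent of FORMAT = $1$2$3$4/byId$6 (as posted) and of a variant ending in $7.
with_group_6 = match.expand(r'\1\2\3\4/byId\6')
with_group_7 = match.expand(r'\1\2\3\4/byId\7')
print(with_group_6)
print(with_group_7)
```

In this Python run, the variant ending in group 7 produces the desired line, while the one ending in group 6 appends the ID after "byId" and drops the rest of the event.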

Is there something that I am missing, or is there a better way of accomplishing such a task?

This is what I want the urls to look like:


192.33.20.22 2014-10-15 13:47:16,477 "POST /test/rest/1.0/payments/reimbursement/byId HTTP/1.1" 400 240 773 10282
192.33.20.22 2014-10-23 12:45:26,197 "GET /test/rest/1.0/applications/byId HTTP/1.1" 200 507 97 110860
192.33.20.22 2014-10-23 12:45:26,197 "GET /test/rest/1.0/applications/generateConfigById HTTP/1.1" 200 507 97 110860

Any help would be very much appreciated.

Thanks!

Motivator
| rex mode=sed "s/\/\d+([\s\/])/\/byId\1/g"

SEDCMD in props.conf applies the same substitution at index time, so the change ends up in the indexed data. Transforms are not the only way to change data before indexing.

SEDCMD-byId = s/\/\d+([\s\/])/\/byId\1/g

If you merely want the digits deleted, remove byId from the replacement.
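To try this kind of substitution without touching Splunk, here is a minimal Python sketch of the same idea. The function name `replace_ids` is made up for illustration, and the pattern keeps the leading slash and the trailing space or slash, which is an assumption about the intended result rather than something the answer states.

```python
import re


def replace_ids(line: str) -> str:
    """Replace each all-digit path segment with "byId".

    The character after the digits (whitespace or a slash) is captured
    and put back, so the rest of the URL stays intact.
    """
    return re.sub(r"/\d+([\s/])", r"/byId\1", line)


# Sample lines taken from the question.
trailing_id = ('192.33.20.22 2014-10-15 13:47:16,477 '
               '"POST /test/rest/1.0/payments/reimbursement/164653 HTTP/1.1" 400 240 773 10282')
middle_id = ('192.33.20.22 2014-10-23 11:53:54,313 '
             '"GET /test/rest/1.0/applications/10115/generateKey HTTP/1.1" 200 509 1236 110860')

print(replace_ids(trailing_id))
print(replace_ids(middle_id))
```

Segments such as `/1.0` and the `/1.1` in `HTTP/1.1` are left alone because the digits are followed by a dot, not by whitespace or a slash.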


Splunk Employee

This modification at search time seems to give what you are looking for.

If not, please give before and after examples of what you are looking for.

  | rex field=_raw mode=sed "s/(\w+\/)\d+( |\/)/\1byId\2/g"

Splunk Employee

I know you said you wanted to sed the data at index time, but let me try and dissuade you.

Once you lose that ID, you can't get it back. And you may want it later.

This will extract the data you want at search time.

       |  rex field=_raw "(GET|HEAD|POST|PUT|OPTIONS|CONNECT)\s(?.+)/\d+ HTTP"

Path Finder

This answer doesn't help me. I would be willing to use a search-time transform if it solved my issue, but maybe I wasn't clear enough in my question. I need to group together all URLs that contain an ID. Simply removing the ID is not enough; I need to put something in its place so that the group stays distinct. In each case, the same URL without the ID already exists and means "get all" (the two are different methods). If the ID were just removed, the two different method invocations would be merged into one group, which is incorrect and would skew my performance figures.
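The difference can be sketched in a few lines of Python. The helper names `normalize` and `count_requests` are invented for this illustration, and the third log line (a "get all" request with no ID) is hypothetical, since the sample data above does not include one. The point is only that replacing the ID with "byId" keeps the two kinds of request in separate groups.

```python
import re
from collections import Counter

# Pulls the HTTP method and the URL out of the quoted request part of a line.
REQUEST = re.compile(r'"(\w+) (\S+) HTTP')


def normalize(url: str) -> str:
    """Swap each all-digit path segment for "byId" instead of deleting it."""
    return re.sub(r"/\d+(?=/|$)", "/byId", url)


def count_requests(lines):
    """Count requests per method and normalized URL."""
    counts = Counter()
    for line in lines:
        found = REQUEST.search(line)
        if found:
            method, url = found.groups()
            counts[f"{method} {normalize(url)}"] += 1
    return counts


LINES = [
    '192.33.20.22 2014-10-23 12:45:26,197 "GET /test/rest/1.0/applications/10113 HTTP/1.1" 200 507 97 110860',
    '192.33.20.22 2014-10-23 11:54:05,302 "GET /test/rest/1.0/applications/10114 HTTP/1.1" 200 507 92 110860',
    # Hypothetical "get all" request, not part of the original sample data.
    '192.33.20.22 2014-10-23 11:50:00,000 "GET /test/rest/1.0/applications HTTP/1.1" 200 507 97 110860',
]

counts = count_requests(LINES)
for key, total in sorted(counts.items()):
    print(total, key)
```

With the IDs deleted outright, all three requests would fall under `GET /test/rest/1.0/applications`; with the placeholder, the two by-ID requests form their own group.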


Motivator

And that rex doesn't actually extract anything. splunkguy, what are you trying to do? It seems complicated, and there may be a much simpler solution if we can see the entire picture.
