Grouping URLs by their path variable pattern

kronite13 · ‎07-15-2021

I need to do an analysis of API calls using logs: avg, min, max, 95th percentile, and 99th percentile response time, and also hits per second.

So, if I have events like below :

/data/users/1443 | 0.5 sec
/data/users/2232 | 0.2 sec
/data/users/39 | 0.2 sec

Expectation: I want them to be grouped like below, as per their API pattern :

proxy              max_response_time
/data/users/{id} | 0.5 sec

These path variables (like {id}) can be numerical or can be a string with special characters

I have about 3000 such API patterns with path variables in them. They fall into 3 categories: those with a path variable only at the end, those with 1 or more path variables only in the middle, and those with 1 or more path variables in the middle as well as at the end. Note: there are no query arguments after the path (nothing like /data/view/{name}/pagecount?age=x); there is just the URI path.

proxy                              method   request_time
/data/users/{id}                    POST    0.046
/server/healthcheck/check/up        GET     0.001
/data/commons/people/multi_upsert   POST    0.141
/store/org/manufacturing/multi_read POST    0.363
/data/users/{id}/homepage/{name}    POST    0.084
/data/view/{name}/pagecount         PUT     0.043

Category 1 (path variable only at the end) :
/data/users/{id}                    POST    0.046

Category 2 (1 or more path variables only in the middle) :
/data/view/{name}/pagecount                         PUT     0.043
/data/view/{name}/details/{type}/pagecount          PUT     0.043

Category 3 (1 or more path variables in the middle and also at the end) :
/data/users/{id}/homepage/{name}    POST    0.084
/data/users/{id}/homepage/{type}/details/{name} POST    0.084

Current Query :

index="*myindex*" host="*abc*" host!=*ftp* sourcetype!=infra* sourcetype!=linux* sourcetype = "nginx:plus:access" 
| bucket span=1s _time| stats count by env,tenant,uri_path,request_method,_time

I need the uri_path to be grouped as per the API patterns I have.

One option is to add 3000 regex replace statements to the query, one per API pattern, like the one below, but that makes the query too heavy to parse. I tried something like this for a sample pattern, /api/data/users/{id} :

|rex mode=sed field=uri_path "s/\/api\/data\/users\/([^\/]+)$/\/api\/data\/users\/{id}/g"

richgalloway · ‎07-16-2021

I've done that sort of normalization using patterns within a case function. Like this:

index="*myindex*" host="*abc*" host!=*ftp* sourcetype!=infra* sourcetype!=linux* sourcetype = "nginx:plus:access" 
| eval path=case(like(uri_path, "/data/users/%/homepage/%/details/%"), "/data/users/{id}/homepage/{type}/details/{name}",
like(uri_path, "/data/users/%/homepage/%"), "/data/users/{id}/homepage/{name}",
like(uri_path, "/data/users/%"), "/data/users/{id}",
like(uri_path, "/data/view/%/details/%/pagecount"), "/data/view/{name}/details/{type}/pagecount",
like(uri_path, "/data/view/%/pagecount"), "/data/view/{name}/pagecount",
1==1, uri_path)
| bin span=1s _time
| stats count by env,tenant,path,request_method,_time

---
If this reply helps you, Karma would be appreciated.

kronite13 · ‎07-16-2021

Hi @richgalloway ,

Thanks for the reply, I tried using it like this, but it gives a warning that having a wildcard in the middle might be an issue in matching if there are special characters in place of it. Do you think there is a way to prevent any misses due to this and be sure that it will be working correctly 100%?

kronite13 · ‎07-16-2021

@richgalloway : I think using match() and exact regex will do the trick, I am testing it. Will let you know if it handles all the cases, I will mark this as the accepted answer.

Thanks!
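
For anyone following along, here is a minimal sketch of what the match() approach could look like. It is not the query that was eventually tested. It assumes the five sample patterns from this thread, a shortened base search, and that a path variable never contains a "/" itself. Each regex is anchored with ^ and $, and [^/]+ matches exactly one path segment, so special characters inside a segment do not cause misses and the order of the clauses does not matter.

```spl
index="*myindex*" sourcetype="nginx:plus:access"
| eval path=case(
    match(uri_path, "^/data/users/[^/]+/homepage/[^/]+/details/[^/]+$"), "/data/users/{id}/homepage/{type}/details/{name}",
    match(uri_path, "^/data/users/[^/]+/homepage/[^/]+$"), "/data/users/{id}/homepage/{name}",
    match(uri_path, "^/data/users/[^/]+$"), "/data/users/{id}",
    match(uri_path, "^/data/view/[^/]+/details/[^/]+/pagecount$"), "/data/view/{name}/details/{type}/pagecount",
    match(uri_path, "^/data/view/[^/]+/pagecount$"), "/data/view/{name}/pagecount",
    true(), uri_path)
| bin span=1s _time
| stats count by env, tenant, path, request_method, _time
```

URIs that match none of the patterns fall through to the true() clause and keep their original uri_path, so static endpoints such as /server/healthcheck/check/up are grouped as they are.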

venkatasri · ‎07-16-2021

Hi @kronite13

Unless you have a unique identification field associated with each dynamic URL pattern, what you are trying to do is correct and gives you the desired result. Having a unique ID for each dynamic URL is very rare in logs.

Know your URL patterns upfront.
Replace the dynamic portion of the URL using rex in sed mode.
Apply stats aggregation functions (max, min, avg) on response_time.

Another approach I can think of, instead of rex mode=sed: match the URL patterns into categories, assign each category a unique value, then group by that value.

Example pseudo code (you can use if, case, like, or any other conditional; it's up to the coder):

if url is like /data/user/something-1 then set category="url-1"

if url is like /data/users/some-id/homepage/some-name then set category="url-2"

stats earliest(url) as url_sample, max(response_time)... by category

Then change url_sample to the format you want to display for readability, e.g. /data/user/{id}
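
The steps above can be tried out end to end with a small self-contained sketch. It uses makeresults to fabricate the three sample events from the original question, so the field names proxy and request_time, and the aliases avg_rt, min_rt, max_rt, p95_rt and p99_rt, are only illustrative assumptions; in a real search the first three lines would be replaced by the actual base search.

```spl
| makeresults count=3
| streamstats count as n
| eval uri_path=case(n==1, "/data/users/1443", n==2, "/data/users/2232", n==3, "/data/users/39"),
       request_time=case(n==1, 0.5, n==2, 0.2, n==3, 0.2)
| eval proxy=replace(uri_path, "^/data/users/[^/]+$", "/data/users/{id}")
| stats count as hits,
        avg(request_time) as avg_rt,
        min(request_time) as min_rt,
        max(request_time) as max_rt,
        perc95(request_time) as p95_rt,
        perc99(request_time) as p99_rt
        by proxy
```

All three events collapse into a single /data/users/{id} row with hits of 3 and max_rt of 0.5, which is the grouping asked for in the question.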

---

An upvote would be appreciated and Accept solution if this reply helps !

efika · ‎07-15-2021

Hi @kronite13 ,

You don't necessarily need to end up with 3,000 regexes but I think you will have to have some kind of reference to the exposed api endpoint that you then need to import, possibly into a multivalue field and check for the similarity in order to do the grouping you wish to do.

Hope this helps.

kronite13 · ‎07-15-2021

Hi @efika ,

Thanks for your response!
Could you give an example for what you are saying?

What I understand is: I need to add a new eval column containing all 3000 API endpoint patterns, comma separated (e.g. /data/user/details/{id}, /data/places/{place_name}/street, etc.), and then check whether the API endpoint in the event matches any of the patterns in that column. Is that what you mean?
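
One possible reading of the "reference list" idea, offered only as a sketch and not as what efika had in mind, is to keep the patterns in a lookup instead of in the query. It assumes a hypothetical lookup definition named api_patterns, backed by a CSV with two columns: uri_pattern (for example /data/users/*) and proxy (for example /data/users/{id}). The lookup definition would need its match type set to WILDCARD(uri_pattern) and its maximum matches set to 1.

```spl
index="*myindex*" sourcetype="nginx:plus:access"
| lookup api_patterns uri_pattern AS uri_path OUTPUT proxy
| eval proxy=coalesce(proxy, uri_path)
| bin span=1s _time
| stats count by env, tenant, proxy, request_method, _time
```

A caveat with this design: the * in a wildcard lookup also matches "/", so /data/users/* would also match /data/users/39/homepage/abc. With a maximum of 1 match the first matching row in file order is used, so more specific patterns should be listed before more general ones in the CSV. The coalesce() keeps the original uri_path for anything that matches no row.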
