topic Re: Regex for URL parsing in Splunk Search

Regex for URL parsing

ChhayaV — Thu, 27 Jun 2013 09:46:05 GMT

Hi,

I want to extract URLs from the events as a separate field.

Here is the log file:

04/15/2013 17:51:58.09  w3wp.exe (0x113C)                           0x3D50  SharePoint Foundation           Monitoring                      nasq    Medium      Entering monitored scope (Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))
04/15/2013 17:51:58.26  w3wp.exe (0x113C)                           0x4AA0  SharePoint Foundation           Monitoring                      nasq    Medium      Entering monitored scope (Request (GET:https://www.abc.co.in:443/PublicSite/images/header.jpg)) 
04/15/2013 17:59:25.20  w3wp.exe (0x113C)                           0x14B0  SharePoint Foundation           Monitoring                      nasq    Medium      Entering monitored scope (Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))

Here I just want to extract the URLs that end with .aspx or .xap, for example:
https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx
https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19

If I write the regex as (?i)\(GET:(?P< FIELDNAME>[^\?]+), the URL is not extracted properly.

Please help with the regex.

Re: Regex for URL parsing

MHibbin — Thu, 27 Jun 2013 10:21:45 GMT

I'm not sure your second example is an aspx file, but I'm not a web developer. However, the following regex will capture those that end in ".aspx":

"GET:\w+://(?P<url>[^\)]+\.aspx)"

You can try out regular expressions on the following site, which is a handy tool:

http://gskinner.com/RegExr/

Hope this helps.

Re: Regex for URL parsing

kristian_kolb — Thu, 27 Jun 2013 10:28:39 GMT

This should work:

rex "\(GET:(?<fieldname>[^\)]+\.(xap|aspx))"
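
To see what this pattern captures, here is a minimal sketch in Python rather than Splunk. It assumes Python's `re` module behaves like Splunk's regex engine for this pattern, and it rewrites the named group as `(?P<fieldname>...)` because Python does not accept the `(?<fieldname>...)` spelling. The events are the message portions of the three sample lines from the question, with the leading columns left out.

```python
import re

# Same pattern as the rex above, with the named group in Python's (?P<name>...) syntax.
PATTERN = re.compile(r"\(GET:(?P<fieldname>[^\)]+\.(xap|aspx))")

# Message portions of the sample events from the question (leading columns omitted).
EVENTS = [
    "Entering monitored scope (Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))",
    "Entering monitored scope (Request (GET:https://www.abc.co.in:443/PublicSite/images/header.jpg))",
    "Entering monitored scope (Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))",
]


def extract(event):
    """Return the captured URL, or None when the event has no .aspx/.xap request."""
    match = PATTERN.search(event)
    return match.group("fieldname") if match else None


results = [extract(event) for event in EVENTS]
for url in results:
    print(url)
```

The .jpg event yields no match, and for the .xap event the capture stops at ".xap", so the "?ver=5.19" part is not included in the field.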

Re: Regex for URL parsing

burkmat — Thu, 27 Jun 2013 12:13:25 GMT

All current answers rely on the HTTP request being a GET-request. HTTP has several types (GET/POST/HEAD being most common), and if you want all URLs to be captured, you need to take this into consideration.

The following regex would probably be a better choice to catch all HTTP methods and all URLs, regardless of unusual formats. It assumes no GET parameters are appended to the URL; if they are, you need to account for them in the pattern.

(?i)\(Request \([A-Z]+:(?<fieldname>.*\.(aspx|xap))\)\)$
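
The caveat about GET parameters can be checked with a short Python sketch. It is only an illustration under assumptions: Python's `re` stands in for Splunk's engine, the named group uses Python's `(?P<fieldname>...)` spelling, the POST event is a made-up example that does not come from the log, and `WITH_QUERY` is one possible variant that tolerates a query string, not the only way to do it.

```python
import re

# The regex above, in Python syntax.
ORIGINAL = re.compile(r"(?i)\(Request \([A-Z]+:(?P<fieldname>.*\.(aspx|xap))\)\)$")

# Hypothetical variant: stop the capture at '?' and allow an optional query string
# before the closing parentheses. The trailing '$' is dropped so text may follow.
WITH_QUERY = re.compile(
    r"(?i)\(Request \([A-Z]+:(?P<fieldname>[^?)]+\.(?:aspx|xap))(?:\?[^)]*)?\)\)"
)

plain = "Entering monitored scope (Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))"
with_params = "Entering monitored scope (Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))"
# Made-up POST event, only to show that the method name is not hard-coded.
posted = "Entering monitored scope (Request (POST:https://www.abc.co.in:443/GEOMETRIC/SitePages/Save.aspx))"


def capture(pattern, event):
    """Return the 'fieldname' group, or None if the pattern does not match."""
    match = pattern.search(event)
    return match.group("fieldname") if match else None


for event in (plain, with_params, posted):
    print(capture(ORIGINAL, event), "|", capture(WITH_QUERY, event))
```

With these inputs the original pattern matches the plain GET and the POST event but returns nothing for the URL that carries "?ver=5.19", while the variant captures the URL up to ".xap" in that case.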

Re: Regex for URL parsing

ChhayaV — Fri, 28 Jun 2013 05:57:12 GMT

Hi,
It's working, but how can I extract word.aspx, word.word.word.xap, word.xap, and all other possible combinations of words and dots (.)?

Re: Regex for URL parsing

Ayn — Fri, 28 Jun 2013 08:04:35 GMT

The regex should cover that. It does not cover parameters though, like burkmat said.

Re: Regex for URL parsing

ChhayaV — Tue, 02 Jul 2013 09:38:38 GMT

Hi,
I want to restrict my regex to the first match only.

Leaving Monitored Scope (Request (GET:https://www.abc/_layouts/ClientPortal/abc/CustomPages/LoginPage.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F)). Execution Time=17.1800154751023
If this is my log entry, I should get only "LoginPage.aspx", but the result is "LoginPage.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx".
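
The longer result is probably caused by a greedy quantifier that runs on to the last ".aspx" in the line. One possible way to stop at the first match is to forbid "?" and ")" inside the capture, so the match cannot continue into the query string. The Python sketch below illustrates the idea under assumptions: the group name `page` and the dotted file name in the second event are made up for the example, and in a Splunk rex the group could be written as `(?<page>...)` instead of Python's `(?P<page>...)`.

```python
import re

# Skip everything up to the last '/' before any '?', then capture a file name that
# contains no '/', '?' or ')' and ends in .aspx or .xap. The lookahead requires that
# the name is followed directly by the query string or the closing parenthesis.
PAGE = re.compile(
    r"(?i)\(Request \([A-Z]+:[^?)]*/(?P<page>[^/?)]+\.(?:aspx|xap))(?=[?)])"
)

# Log entry from the post above.
login_event = (
    "Leaving Monitored Scope (Request (GET:https://www.abc/_layouts/ClientPortal/abc/"
    "CustomPages/LoginPage.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx"
    "%3fSource%3d%252F&Source=%2F)). Execution Time=17.1800154751023"
)
# Made-up event with several dots in the file name.
dotted_event = "Entering monitored scope (Request (GET:https://www.abc/_layouts/My.Benefits.v2.xap?ver=5.19))"


def page_name(event):
    """Return only the page file name, or None if the event has no .aspx/.xap request."""
    match = PAGE.search(event)
    return match.group("page") if match else None


print(page_name(login_event))
print(page_name(dotted_event))
```

For the log entry above this yields "LoginPage.aspx", and file names with several dots are captured whole because only "/", "?" and ")" are excluded from the name.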