Solved: Re: Regex for URL parsing

ChhayaV · ‎06-27-2013

Hi,

I want to extract URLs from the events as a separate field.

Here is the log file:

04/15/2013 17:51:58.09  w3wp.exe (0x113C)                           0x3D50  SharePoint Foundation           Monitoring                      nasq    Medium      Entering monitored scope (Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))
04/15/2013 17:51:58.26  w3wp.exe (0x113C)                           0x4AA0  SharePoint Foundation           Monitoring                      nasq    Medium      Entering monitored scope (Request (GET:https://www.abc.co.in:443/PublicSite/images/header.jpg)) 
04/15/2013 17:59:25.20  w3wp.exe (0x113C)                           0x14B0  SharePoint Foundation           Monitoring                      nasq    Medium      Entering monitored scope (Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))

Here I just want to extract the URLs that end with .aspx and .xap, like:
https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx
https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19

If I write the regex as (?i)\(GET:(?P< FIELDNAME>[^\?]+), the URL is not extracted properly.

Please help with the regex.

MHibbin · ‎06-27-2013

I'm not sure your second example is an .aspx file, but I'm not a web developer. However, the following regex will capture those that end in ".aspx":

"GET:\w+://(?P<url>[^\)]+\.aspx)"

You can try out regular expressions on the following site, which is a handy tool:

http://gskinner.com/RegExr/

Hope this helps.
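
To see what this pattern actually captures, here is a minimal sketch in Python rather than SPL (Python's `re` is only a stand-in for Splunk's regex engine, and the `extract_url` helper is hypothetical). The events are shortened versions of the log lines in the question. Note that the scheme (`https://`) sits outside the named group, so it is not part of the extracted value, and that .xap requests are not matched at all.

```python
import re

# The pattern from the answer above, unchanged.
pattern = re.compile(r"GET:\w+://(?P<url>[^\)]+\.aspx)")

# Request portions of the three sample log lines from the question.
events = [
    "(Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))",
    "(Request (GET:https://www.abc.co.in:443/PublicSite/images/header.jpg))",
    "(Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))",
]


def extract_url(event):
    """Return the captured 'url' group, or None when the event does not match."""
    match = pattern.search(event)
    return match.group("url") if match else None


results = [extract_url(event) for event in events]
for result in results:
    print(result)
```

Only the first event yields a value; the .jpg and .xap requests return None.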


ChhayaV · ‎07-02-2013

Hi,
I want to restrict my regex to the first match only.

Leaving Monitored Scope (Request (GET:https://www.abc/_layouts/ClientPortal/abc/CustomPages/LoginPage.aspx?ReturnUrl=%2f_layouts%2fAuthent...). Execution Time=17.1800154751023
If this is my log entry, then I should get only "LoginPage.aspx", but the result is "LoginPage.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx".
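
One possible way to stop at the first match is to exclude "?" from the character class, so the match cannot run into the query string. This is sketched below in Python rather than SPL. Several things here are assumptions: the full event is assembled from the truncated log entry and the result quoted above, the group names `url` and `page` are illustrative, and Python needs the `(?P<name>...)` spelling for named groups.

```python
import re

# Greedy style used elsewhere in this thread: [^)]+ is free to cross the "?".
greedy = re.compile(r"\(GET:(?P<url>[^)]+\.(?:aspx|xap))")

# "?" is excluded from both character classes, so matching stops before the
# query string. The inner group keeps only the last path segment, and the
# lookahead requires the extension to be followed by "?" or ")".
first_only = re.compile(
    r"\(GET:(?P<url>[^?)\s]*/(?P<page>[^/?)\s]+\.(?:aspx|xap)))(?=[?)])"
)

# Assumed event: the log entry is truncated in the post, so the tail is taken
# from the result the poster reports.
event = (
    "Leaving Monitored Scope (Request (GET:https://www.abc/_layouts/"
    "ClientPortal/abc/CustomPages/LoginPage.aspx"
    "?ReturnUrl=%2f_layouts%2fAuthenticate.aspx))"
)

greedy_url = greedy.search(event).group("url")
match = first_only.search(event)

print(greedy_url)           # runs on to the second ".aspx"
print(match.group("url"))   # stops at the first ".aspx"
print(match.group("page"))  # page name only
```

The same character-class idea should carry over to a rex pattern, but it has not been run against Splunk here.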

burkmat · ‎06-27-2013

All current answers rely on the HTTP request being a GET request. HTTP has several methods (GET, POST and HEAD being the most common), and if you want all URLs to be captured, you need to take this into consideration.

The following regex would probably be a better choice to catch all HTTP methods and all URLs regardless of unusual formats. It assumes no query parameters are appended to the URL; if they are, you need to account for them.

(?i)\(Request \([A-Z]+:(?<fieldname>.*\.(aspx|xap))\)\)$
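
A small Python sketch of how this pattern behaves, as an illustration only (Python's `re` needs `(?P<fieldname>...)` instead of `(?<fieldname>...)`; the POST event and the `capture` helper are hypothetical and not from the original logs). It shows both points made above: any method is accepted, and a URL with parameters is not matched because of the `\)\)$` anchor.

```python
import re

# Same pattern as above, with the named group written in Python's spelling.
pattern = re.compile(r"(?i)\(Request \([A-Z]+:(?P<fieldname>.*\.(aspx|xap))\)\)$")

events = [
    # From the question's log.
    "Entering monitored scope (Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))",
    # Hypothetical POST request, added to show that the method is not fixed.
    "Entering monitored scope (Request (POST:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))",
    # From the question's log; the "?ver=5.19" parameter prevents a match.
    "Entering monitored scope (Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))",
]


def capture(event):
    """Return the 'fieldname' group, or None when the event does not match."""
    match = pattern.search(event)
    return match.group("fieldname") if match else None


captured = [capture(event) for event in events]
for value in captured:
    print(value)
```

Because the pattern is anchored with `$`, trailing whitespace after the closing parentheses would also prevent a match.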

Ayn · ‎06-28-2013

The regex should cover that. It does not cover parameters though, like burkmat said.

ChhayaV · ‎06-27-2013

Hi,
It's working. But how can I extract word.aspx, word.word.word.xap, word.xap, and all other possible combinations of words and dots (.)?

kristian_kolb · ‎06-27-2013

This should work:

rex "\(GET:(?<fieldname>[^\)]+\.(xap|aspx))"
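
As a rough check of the pattern above, here it is in Python (an illustration, not SPL: the named group is written as `(?P<fieldname>...)`, and the last event with a dotted file name is hypothetical). It handles names with several dots, skips the .jpg request, and drops the `?ver=5.19` parameter from the .xap URL.

```python
import re

# The pattern from the rex command above, in Python's named-group spelling.
pattern = re.compile(r"\(GET:(?P<fieldname>[^\)]+\.(xap|aspx))")

events = [
    "(Request (GET:https://www.abc.co.in:443/GEOMETRIC/SitePages/MyEnrollment.aspx))",
    "(Request (GET:https://www.abc.co.in:443/PublicSite/images/header.jpg))",
    "(Request (GET:https://www.abc.co.in:443/_LAYOUTS/ClientPortal/SilverlightWebParts/PROD/MyBenefits.xap?ver=5.19))",
    # Hypothetical path, added to show a file name containing several dots.
    "(Request (GET:https://www.abc.co.in:443/PROD/word.word.word.xap))",
]

fields = []
for event in events:
    match = pattern.search(event)
    fields.append(match.group("fieldname") if match else None)

for field in fields:
    print(field)
```

The pattern is case-sensitive as written, so an extension such as `.ASPX` would need `(?i)` in front.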
