How to extract a value from a field by ignoring so...

vigneshit · ‎02-19-2020

The first query I run is
index=sec_proxy_web sourcetype="bluecoat:proxysg:access:syslog" | top 10 url

I have web proxy log and a field url but in url it contains tcp, http and some ad blocker sites which i want to remove by entering those site.
for eg :
1. I need to remove the site edge-chat.facebook.com,
2. I just need to remove tcp or http or https if its present in that url

I tried few rex but i was not able to do it.
can you please help me how to get an output without this characters.

vnravikumar · ‎02-19-2020

Hi

Try this, in the below, you can filter out unwanted url in where

| makeresults 
| eval url="tcp://edge-chat.facebook.com/##tcp://www.google.com/##tcp://wynk.in/" 
| makemv delim="##" url 
| mvexpand url 
| where NOT url like("%edge-chat%") 
| rex field=url ":\/\/(?P<site>\S+)\/"

vigneshit · ‎02-20-2020

This works but i cannot use it for realtime and for a particular user in drilldown later.

mydog8it · ‎02-19-2020

Exclude the site you do not wish to include in the results by setting "NOT url=tcp://edge-chat.facebook.com" and remove tcp:// by stripping it with replace "| replace tcp://* WITH * IN url"

index=sec_proxy_web sourcetype="bluecoat:proxysg:access:syslog" NOT url=tcp://edge-chat.facebook.com |replace tcp://* WITH * IN url

anthonymelita · ‎02-19-2020

To filter a specific url from the results, just add a | where url!="tcp://edge-chat.facebook.com/"
There are multiple solutions for stripping tcp, http, https, etc...
example 1 using eval replace: | eval url=replace(url, "tcp://", "")

example 2 using rex to extract then drop old field and rename new (can't just extract and overwrite):
| rex field=url "([chpt]://)(?.*)" | fields - url | rename urla AS url

example 3 in case you want to reformat specific url only, combine multiple eval functions:
| eval url=if(url="tcp://edge-chat.facebook.com/", replace(url, "tcp://", ""), url)

vigneshit · ‎02-20-2020

EG 1 and 3 works fine
But eg2 i get
error : Error in 'rex' command: Encountered the following error while compiling the regex '([chpt]://)(?.*)': Regex: unrecognized character after (? or (?-.

I wanted to write a rex where it can ignore http https and tcp the above expression looks correct but i dont know why splunk throws above error.

Couldnt understand this "rex to extract then drop old field and rename new (can't just extract and overwrite):"

vigneshit · ‎02-20-2020

Hi I used where and replace command.
but it takes more time to run the query.
Is there any other way I can make this query still efficient:
index=sec_proxy_web user_work_country="in" user_bunit="rbei/bsd1" sourcetype="bluecoat:proxysg:access:syslog"
|where category!="Content Delivery Networks" | where category!="Chat (IM)/SMS"
|where category!="Web Ads/Analytics"
| where NOT url like("%edge-chat%")
|where NOT url like("%mtalk.google.com%")
|where NOT url like("%accounts.google%")
|where NOT url like("%doubleclick%")
|where NOT url like("%googlevideo.com%")
|where NOT url like("%clients%")
|where NOT url like("%googlesyndication%")
|where NOT url like("%adservice%")
|where NOT url like("%gstatic%")
|where NOT url like("%api%")
|where NOT url like("%gravatar%")
|where NOT url like("%youboranqs01%")
|replace tcp://* WITH * IN url
|replace http://* WITH * IN url
|replace https://* WITH * IN url
| stats count by url | sort -count

anthonymelita · ‎02-20-2020

sorry, the board formatting removed the field extraction tag from my regex.

| rex field=url "([chpt]://)(?<urla>.*)" | fields - url | rename urla AS url

I was saying you already have the field named 'url' so you cannot simply specify the same field name in the rex command to overwrite the existing values. That's why I gave it name 'urla' and then dropped the old one and then renamed the new field to the 'url'

I forget that you use % instead of * for a where. I would assume all those wildcards are what's increasing the execution time.

anthonymelita · ‎02-19-2020

It may be helpful to mention that |where does not accept wildcards. If you needed to wildcard then change to |search url!=*edge-chat.facebook.com instead. And you will want to put that before your top 10 command

How to extract a value from a field by ignoring some of the characters?

Announcing Scheduled Export GA for Dashboard Studio

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!