Knowledge Management

How to extract a value from a field by ignoring some of the characters?

vigneshit
New Member

The first query I run is
index=sec_proxy_web sourcetype="bluecoat:proxysg:access:syslog" | top 10 url

I have web proxy log and a field url but in url it contains tcp, http and some ad blocker sites which i want to remove by entering those site.
for eg :
1. I need to remove the site edge-chat.facebook.com,
2. I just need to remove tcp or http or https if its present in that url

I tried few rex but i was not able to do it.
can you please help me how to get an output without this characters.

alt text

Tags (1)
0 Karma

vnravikumar
Champion

Hi

Try this, in the below, you can filter out unwanted url in where

| makeresults 
| eval url="tcp://edge-chat.facebook.com/##tcp://www.google.com/##tcp://wynk.in/" 
| makemv delim="##" url 
| mvexpand url 
| where NOT url like("%edge-chat%") 
| rex field=url ":\/\/(?P<site>\S+)\/"
0 Karma

vigneshit
New Member

This works but i cannot use it for realtime and for a particular user in drilldown later.

0 Karma

mydog8it
Builder

Exclude the site you do not wish to include in the results by setting "NOT url=tcp://edge-chat.facebook.com" and remove tcp:// by stripping it with replace "| replace tcp://* WITH * IN url"

index=sec_proxy_web sourcetype="bluecoat:proxysg:access:syslog" NOT url=tcp://edge-chat.facebook.com |replace tcp://* WITH * IN url
0 Karma

anthonymelita
Contributor

To filter a specific url from the results, just add a | where url!="tcp://edge-chat.facebook.com/"
There are multiple solutions for stripping tcp, http, https, etc...
example 1 using eval replace: | eval url=replace(url, "tcp://", "")

example 2 using rex to extract then drop old field and rename new (can't just extract and overwrite):
| rex field=url "([chpt]://)(?.*)"
| fields - url
| rename urla AS url

example 3 in case you want to reformat specific url only, combine multiple eval functions:
| eval url=if(url="tcp://edge-chat.facebook.com/", replace(url, "tcp://", ""), url)

0 Karma

vigneshit
New Member

EG 1 and 3 works fine
But eg2 i get
error : Error in 'rex' command: Encountered the following error while compiling the regex '([chpt]://)(?.*)': Regex: unrecognized character after (? or (?-.

I wanted to write a rex where it can ignore http https and tcp the above expression looks correct but i dont know why splunk throws above error.

Couldnt understand this "rex to extract then drop old field and rename new (can't just extract and overwrite):"

0 Karma

vigneshit
New Member

Hi I used where and replace command.
but it takes more time to run the query.
Is there any other way I can make this query still efficient:
index=sec_proxy_web user_work_country="in" user_bunit="rbei/bsd1" sourcetype="bluecoat:proxysg:access:syslog"
|where category!="Content Delivery Networks" | where category!="Chat (IM)/SMS"
|where category!="Web Ads/Analytics"
| where NOT url like("%edge-chat%")
|where NOT url like("%mtalk.google.com%")
|where NOT url like("%accounts.google%")
|where NOT url like("%doubleclick%")
|where NOT url like("%googlevideo.com%")
|where NOT url like("%clients%")
|where NOT url like("%googlesyndication%")
|where NOT url like("%adservice%")
|where NOT url like("%gstatic%")
|where NOT url like("%api%")
|where NOT url like("%gravatar%")
|where NOT url like("%youboranqs01%")
|replace tcp://* WITH * IN url
|replace http://* WITH * IN url
|replace https://* WITH * IN url
| stats count by url | sort -count

0 Karma

anthonymelita
Contributor

sorry, the board formatting removed the field extraction tag from my regex.

| rex field=url "([chpt]://)(?<urla>.*)" | fields - url | rename urla AS url

I was saying you already have the field named 'url' so you cannot simply specify the same field name in the rex command to overwrite the existing values. That's why I gave it name 'urla' and then dropped the old one and then renamed the new field to the 'url'

I forget that you use % instead of * for a where. I would assume all those wildcards are what's increasing the execution time.

0 Karma

anthonymelita
Contributor

It may be helpful to mention that |where does not accept wildcards. If you needed to wildcard then change to |search url!=*edge-chat.facebook.com instead. And you will want to put that before your top 10 command

0 Karma
Get Updates on the Splunk Community!

Webinar Recap | Revolutionizing IT Operations: The Transformative Power of AI and ML ...

The Transformative Power of AI and ML in Enhancing Observability   In the realm of IT operations, the ...

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...