Bluecoat log with domain-based sorting possible?

supergtom
New Member

For example, I would like to group all the following URLs under google:
docs.google.com,
maps.google.com,
www.google.com,
...
(maybe it is *google*)

Is there a way to do this so that the results are shown grouped by predefined domains?
I would much appreciate it if such predefined rules already exist somewhere.
Thank you.


kristian_kolb
Ultra Champion

Well, I assume that you have an extracted field for the URL (or URI), correct?

That field would contain just a little too much information for your sorting/grouping purposes, right, e.g.

http://www.google.com/search?q=blah
https://secure.bank.co.uk/login

From that field you can extract the domain part (google, bank) as a new field with a regex, either inline in the search, or more 'permanent' by editing a config file (or using the IFX).

Inline, you could have a search that looks something like;

sourcetype=your_bluecoat_sourcetype | rex field=URL "https?://[^\.]+\.(?<domain>[^\.]+)\." | stats c by domain

The final part after the | creates a table counting events by the newly extracted 'domain' field.
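
As a rough illustration of what the rex and stats steps do, here is a minimal Python sketch. It is not Splunk itself: it assumes the capture group is named domain, spells the named group the Python way, (?P<domain>...), and adds two made-up google URLs to the sample list.

```python
import re
from collections import Counter

# Same pattern as in the search above; the group name "domain" is assumed.
DOMAIN_RE = re.compile(r"https?://[^.]+\.(?P<domain>[^.]+)\.")

# Sample values for the URL field (the last two are made up for illustration).
urls = [
    "http://www.google.com/search?q=blah",
    "https://secure.bank.co.uk/login",
    "http://maps.google.com/",
    "http://docs.google.com/",
]

def extract_domain(url):
    """Return the label between the first and second dot of the host, or None."""
    match = DOMAIN_RE.search(url)
    return match.group("domain") if match else None

# Rough equivalent of "| stats c by domain": count events per extracted value.
counts = Counter(d for d in map(extract_domain, urls) if d is not None)

for domain, count in counts.most_common():
    print(domain, count)
```

Note that a URL whose host has only two labels, such as http://google.com/, does not match this pattern at all, so such events would get no domain field.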

Hope this helps,

Kristian


MikeyG
Explorer

Had same problem - this worked for me...

Created field extract named bcoat_proxysg: EXTRACT-cs_uri_authority with regex:

(?)..*?.(?P[a-z]+.[a-z]+(?=/)

then changed the search/view.


supergtom
New Member

Thanks for the reminder.
I doubt that regex (in Splunk) can do if-then-else. Without that, a single regex cannot handle URLs with many levels of subdomains or other variations.


kristian_kolb
Ultra Champion

Also, have you checked how your regex would handle subdomains/ports? I believe that it might fail to handle some cases.

Not saying that the one I provided is perfect, but it will at least pick something out of it, since it does not expect a slash after three groups of characters.

I don't really know what your format looks like, but there are a couple of possible patterns, where ABC is what you want to capture;

http://www.ABC.com
http://ABC.com
http://www.ABC.co.uk
https://ABC.co.uk
ftp://ABC.com:21
http://all.work.and.no.play.ABC.com

..then you also might have trailing slashes....
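
One possible way to cope with all of the patterns listed above is to stop relying on a fixed number of dots and instead work backwards from the end of the host name. The sketch below is a heuristic under stated assumptions, not a Splunk feature: the function name registered_label and the tiny TWO_LABEL_SUFFIXES set are hypothetical, and the URL is assumed to include its scheme.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny list of two-label suffixes.
# Real traffic would need a much fuller list.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}

def registered_label(url):
    """Return the label directly left of the suffix, or None if not found."""
    host = urlsplit(url).hostname  # lower-cased, with any port removed
    if not host:
        return None  # e.g. the value had no scheme, so no host was parsed
    labels = host.split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return None  # the host is nothing but a suffix
    return labels[-suffix_len - 1]

samples = [
    "http://www.ABC.com",
    "http://ABC.com",
    "http://www.ABC.co.uk",
    "https://ABC.co.uk",
    "ftp://ABC.com:21",
    "http://all.work.and.no.play.ABC.com",
    "http://www.ABC.com/",
]

for url in samples:
    print(url, "->", registered_label(url))
```

Every sample yields abc (lower-cased, because urlsplit normalises the host name). Inside Splunk the same idea would have to be expressed differently, for example with a lookup of known suffixes; this only shows the logic.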

/k


kristian_kolb
Ultra Champion

Well, if you have extracted the fields 'bytes' and 'duration', I believe your stats command at the end of the line should read:

...| stats c sum(bytes) sum(duration) by domain
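
To make the effect of that stats clause concrete, here is a small Python sketch of the same aggregation: one output row per domain, with an event count and the summed bytes and duration. The numbers are invented for illustration, and the events are assumed to already carry an extracted domain field.

```python
from collections import defaultdict

# Invented events; "domain" is assumed to have been extracted already.
events = [
    {"domain": "google", "bytes": 1200, "duration": 3},  # e.g. maps.google.com
    {"domain": "google", "bytes": 800, "duration": 5},   # e.g. docs.google.com
    {"domain": "bank", "bytes": 400, "duration": 2},
]

# One accumulator row per domain, like "stats c sum(bytes) sum(duration) by domain".
totals = defaultdict(lambda: {"count": 0, "bytes": 0, "duration": 0})
for event in events:
    row = totals[event["domain"]]
    row["count"] += 1
    row["bytes"] += event["bytes"]
    row["duration"] += event["duration"]

for domain, row in sorted(totals.items()):
    print(domain, row["count"], row["bytes"], row["duration"])
```

So two google events collapse into a single google row whose bytes and duration are the sums of the individual values.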

/k


supergtom
New Member

Thanks to Kristian, the question is solved.
The regex I used is rex field=Url "[http|https|ftp|tcp]?\:\/\/[^\.]+\.(?<domain>[^\.]+).[^\.]+\/"
The regex is aimed at matching the format ://xxx.domain.xxx/ (I don't know if there are any errors).
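
On the question of errors in that regex, one point can be checked quickly outside Splunk: square brackets form a character class, so [http|https|ftp|tcp] matches a single character rather than one of the listed words. The Python sketch below shows the difference and a possible tightened variant. The name TIGHTER, the helper extract_with_tighter, and the group name domain are assumptions for illustration only.

```python
import re

# The bracketed form is a character class: it matches ONE character out of
# h, t, p, s, f, c and "|", not one of the listed words.
char_class = re.compile(r"[http|https|ftp|tcp]?")
alternation = re.compile(r"(?:https?|ftp|tcp)")

print(char_class.fullmatch("http"))   # None: four characters, the class allows one
print(char_class.fullmatch("|"))      # a match: "|" is a member of the class
print(alternation.fullmatch("http"))  # a match covering the whole scheme name

# Hypothetical tightened variant: alternation for the scheme, every dot
# escaped, and "/" excluded from each label.
TIGHTER = re.compile(r"(?:https?|ftp|tcp)://[^./]+\.(?P<domain>[^./]+)\.[^./]+/")

def extract_with_tighter(url):
    """Return the middle label of a three-label host, or None."""
    match = TIGHTER.search(url)
    return match.group("domain") if match else None

print(extract_with_tighter("http://maps.google.com/"))
```

The original still finds the domain in simple cases because the search is not anchored and simply starts matching at ://. Both versions only fit hosts of the form xxx.domain.xxx followed by a slash, so a host such as www.abc.co.uk is not matched.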


supergtom
New Member

While I am still working on the regex, I actually have a second question.

For example, suppose there are two events:
maps.google.com bytes_a duration_a
docs.google.com bytes_b duration_b

Will they be combined as follows?
google bytes_a+b duration_a+b


kristian_kolb
Ultra Champion

Yeah, well, the IFX may have a hard time trying to find the correct regex. It isn't perfect, but you often get an idea of how to craft your own.

If this answered your question, please mark as "answered" a/o upvote. Thanks, K.


supergtom
New Member

In case anyone would like to test a URL regex quickly: http://gskinner.com/RegExr/ (I suppose you need some basic knowledge of regex).


supergtom
New Member

Btw, I have used the IFX, and it does not seem to be good at making a custom regex for URLs (I am not good at regex either).


supergtom
New Member

Thank you very much. That is exactly what I would like to achieve.


supergtom
New Member

I am not sure if the term "grouping" is appropriate.


supergtom
New Member

I have downloaded the log from a Bluecoat server and would like to import it into Splunk for log analysis. Normally, Splunk treats each line of the Bluecoat log as an event. Each event contains several fields, one of which is URL-related. I would like to group events with a similar URL characteristic (i.e. under the same domain; in the example above, google). The log may be huge, so such grouping will reduce the size. In addition, the result (or the report) looks simpler.


kristian_kolb
Ultra Champion

Sorry, are you talking about configuration of BlueCoat or Splunk? Not sure exactly what you want to do, though.

/k
