Solved: Bluecoat log with domain-based sorting possible?

supergtom · ‎03-22-2012

For example, I would like to group all the following URLs under google:
docs.google.com,
maps.google.com,
www.google.com,
...
(maybe the pattern is *google*)

Is there a way to show results grouped by pre-defined domains?
I would much appreciate it if such pre-defined rules already exist somewhere.
Thank you.

kristian_kolb · ‎03-26-2012

Well, I assume that you have an extracted field for the URL (or URI), correct?

That field would contain just a little too much information for your sorting/grouping purposes, right, e.g.

http://www.google.com/search?q=blah
https://secure.bank.co.uk/login

From that field you can extract the domain part (google, bank) as a new field with a regex, either inline in the search, or more permanently by editing a config file (or using the IFX, the Interactive Field Extractor).

Inline, you could have a search that looks something like:

sourcetype=your_bluecoat_sourcetype | rex field=URL "https?://[^\.]+\.(?<domain>[^\.]+)\." | stats c by domain

The final part after the | creates a table counting events by the newly extracted 'domain' field.
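To see what that extraction does outside Splunk, here is a minimal Python sketch of the same idea. It assumes the capture group is named domain (Python's re module writes that as (?P<domain>...)), and the last two URLs are made up for illustration. It is only an approximation of rex plus "stats c by domain", not Splunk itself.

```python
import re
from collections import Counter

# Same pattern as the rex above; the group name "domain" is spelled
# (?P<domain>...) in Python.
DOMAIN_RE = re.compile(r"https?://[^.]+\.(?P<domain>[^.]+)\.")

def extract_domain(url):
    """Return the label after the first host label, or None if no match."""
    m = DOMAIN_RE.search(url)
    return m.group("domain") if m else None

urls = [
    "http://www.google.com/search?q=blah",
    "https://secure.bank.co.uk/login",
    "http://maps.google.com/",   # hypothetical sample
    "http://docs.google.com/",   # hypothetical sample
]

# Rough equivalent of "| stats c by domain".
counts = Counter(extract_domain(u) for u in urls)
for domain, c in counts.most_common():
    print(domain, c)
```

Note that the pattern always skips exactly one host label, so it only fits URLs shaped like xxx.domain.yyy.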

Hope this helps,

Kristian

MikeyG · ‎03-30-2012

Had same problem - this worked for me...

Created field extract named bcoat_proxysg: EXTRACT-cs_uri_authority with regex:

(?)..*?.(?P[a-z]+.[a-z]+(?=/)

then changed the search/view.

supergtom · ‎03-26-2012

Thanks for the reminder.
I doubt that regex (in Splunk) can do if-then-else. Without that, a single regex cannot handle URLs with many levels of subdomains or other variations.

kristian_kolb · ‎03-26-2012

Also, have you checked how your regex would handle subdomains/ports? I believe that it might fail to handle some cases.

Not saying that the one I provided is perfect, but it will at least pick something out of it, since it does not expect a slash after three groups of characters.

I don't really know what your format looks like, but there are a couple of possible patterns, where ABC is what you want to capture:

http://www.ABC.com
http://ABC.com
http://www.ABC.co.uk
https://ABC.co.uk
ftp://ABC.com:21
http://all.work.and.no.play.ABC.com

..then you also might have trailing slashes....
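One way to cope with all of these shapes is to stop counting labels from the left and count from the right instead. The Python sketch below is only an illustration under stated assumptions: site_label is a hypothetical helper, and TWO_PART_SUFFIXES is a deliberately tiny, incomplete list of two-part suffixes that you would have to extend for your own traffic. It is not something Splunk provides.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of suffixes that span two labels.
TWO_PART_SUFFIXES = {"co.uk"}

def site_label(url):
    """Return the label just left of the suffix (e.g. 'abc'), or None.

    urlsplit().hostname drops the scheme, port and path, and lowercases
    the host, so the result is always lowercase.
    """
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_PART_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return None
    return labels[-suffix_len - 1]

samples = [
    "http://www.ABC.com",
    "http://ABC.com",
    "http://www.ABC.co.uk",
    "https://ABC.co.uk",
    "ftp://ABC.com:21",
    "http://all.work.and.no.play.ABC.com",
    "http://www.ABC.com/",
]
for s in samples:
    print(s, "->", site_label(s))
```

With the list above every sample yields abc. A suffix that is missing from the list (say, com.au) would give the wrong label, which is the price of keeping the sketch short.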

/k

kristian_kolb · ‎03-26-2012

Well, if you have extracted the fields 'bytes' and 'duration', I believe your stats command at the end of the line should read:

...| stats c sum(bytes) sum(duration) by domain
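To make the effect of that stats clause concrete, here is a small Python sketch of the same aggregation. The events and their numbers are invented for illustration, and stats_by_domain is a hypothetical helper that only imitates what "stats c sum(bytes) sum(duration) by domain" would report.

```python
from collections import defaultdict

# Hypothetical events, after 'domain', 'bytes' and 'duration' were extracted.
events = [
    {"domain": "google", "bytes": 1200, "duration": 3},  # e.g. maps.google.com
    {"domain": "google", "bytes": 800, "duration": 5},   # e.g. docs.google.com
    {"domain": "bank", "bytes": 500, "duration": 2},
]

def stats_by_domain(events):
    """Count events and sum bytes/duration per domain."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0, "duration": 0})
    for e in events:
        row = totals[e["domain"]]
        row["count"] += 1
        row["bytes"] += e["bytes"]
        row["duration"] += e["duration"]
    return dict(totals)

result = stats_by_domain(events)
for domain, row in result.items():
    print(domain, row["count"], row["bytes"], row["duration"])
```

So two events for different google hosts end up as one google row carrying the summed bytes and duration.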

/k

supergtom · ‎03-26-2012

Thanks to Kristian, the question is solved.
The regex I used is rex field=Url "[http|https|ftp|tcp]?\:\/\/[^\.]+\.(?<domain>[^\.]+).[^\.]+\/"
The regex is meant to handle the format ://xxx.domain.xxx/ (I don't know if there are any errors in it).
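On the "any error" question, one thing worth checking: square brackets form a character class, so [http|https|ftp|tcp]? matches at most one character out of h, t, p, s, f, c or |, not a whole scheme name. The Python sketch below compares the posted pattern with a variant that uses an alternation group. It assumes the capture group is named domain, GROUPED is a hypothetical rewrite rather than the original poster's regex, and the URLs are made up.

```python
import re

# The posted pattern, with the group named "domain" ((?P<domain>...) in Python).
POSTED = re.compile(r"[http|https|ftp|tcp]?\:\/\/[^\.]+\.(?P<domain>[^\.]+).[^\.]+\/")

# Hypothetical variant: alternation group for the scheme, escaped dot.
GROUPED = re.compile(r"(?:https?|ftp|tcp)://[^.]+\.(?P<domain>[^.]+)\.[^.]+/")

def domain(pattern, url):
    """Return the captured domain, or None when the pattern does not match."""
    m = pattern.search(url)
    return m.group("domain") if m else None

# The posted pattern still finds the domain, but the match only starts
# at the last letter of the scheme.
posted_match = POSTED.search("http://www.google.com/")
print(posted_match.group(0), posted_match.group("domain"))

# Because the class is optional, any scheme gets through.
print(domain(POSTED, "gopher://www.example.org/"))
print(domain(GROUPED, "gopher://www.example.org/"))

# Both patterns require a slash after the host.
print(domain(POSTED, "http://www.google.com"))
```

For the ://xxx.domain.xxx/ format the posted pattern does capture the right label; the class simply does not restrict the scheme the way it appears to.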

supergtom · ‎03-26-2012

While I am still working on the regex, there is actually a second question.

For example, there are two events:
maps.google.com bytes_a duration_a
docs.google.com bytes_b duration_b

Will they be combined as follows?
google bytes_a+b duration_a+b

kristian_kolb · ‎03-26-2012

Yeah, well, the IFX may have a hard time trying to find the correct regex. It isn't perfect, but you often get an idea on how to craft your own.

If this answered your question, please mark as "answered" a/o upvote. Thanks, K.

supergtom · ‎03-26-2012

In case anyone would like to test a URL regex quickly: http://gskinner.com/RegExr/ (I suppose you need some basic knowledge of regex).

supergtom · ‎03-26-2012

Btw, I have used the IFX and it does not seem good at making a custom regex for URLs (I am not good at regex either).

supergtom · ‎03-26-2012

Thank you very much. That is exactly what I would like to achieve.

supergtom · ‎03-25-2012

I am not sure if the term "grouping" is appropriate.

supergtom · ‎03-25-2012

I have downloaded the log from the Bluecoat server and would like to import it into Splunk for log analysis. Normally, Splunk will treat each line of the Bluecoat log as an event. Each event contains some fields, and one of them is URL-related. I would like to group events with similar URL characteristics (i.e. under the same domain; in the example above, google). The log may be huge, so such grouping will reduce the size of the results. In addition, the result (or the report) looks simpler.

kristian_kolb · ‎03-25-2012

Sorry, are you talking about configuration of BlueCoat or Splunk? Not sure exactly what you want to do, though.

/k
