All Apps and Add-ons

What is the fastest way to extract fields from Blue Coat proxy logs that the Splunk Add-on for Blue Coat ProxySG didn't extract?

daniel_augustyn
Contributor

What is the fastest way to extract fields from Blue Coat proxy logs in Splunk?
Is it better to do a simple regex extraction for a single field, like:

^(?:[^,\n]*,){5}(?P<hostname>[^\.]+) 

or is it better to do a longer, more specific regex that extracts many fields at once, something like:

^(?P<date>\d+\-\d+\-\d+)[^ \n]* (?P<time>\d+:\d+:\d+)\s+(?P<time_taken>[^ ]+)(?:[^\.\n]*\.){3}\d+\s+\-\s+\-\s+\-\s+(?P<filter_result>\w+)[^"\n]*"(?P<category>[^"]+)[^"\n]*"\s+(?P<http_referrer>[^ ]+)\s+(?P<http_response>[^ ]+)\s+(?P<action>[^ ]+)\s+(?P<cs_method>\w+)[^ \n]* (?P<http_content_type>\w+/\w+)[^ \n]* (?P<protocol>\w+)\s+(?P<dest_host>[^ ]+)[^ \n]* (?P<dest_port>\d+)\s+(?P<cs_uri_path>[^ ]+)[^\-\n]*\-\s+(?P<cs_uri_extension>\w+)\s+\-\s+(?P<dvc_ip>[^ ]+)[^ \n]* (?P<bytes_in>\d+)[^ \n]* (?P<bytes_out>\d+)[^\-\n]*\-\s+"(?P<http_content>[^"]+)

I assume doing a longer regex would take a really long time, since these are really specific queries. Any other ideas on how to extract fields from Blue Coat proxy logs? I am asking this for the logs which the Blue Coat add-on didn't extract.

0 Karma
1 Solution

Richfez
SplunkTrust
SplunkTrust

Have you tried the Field Extractor??

In addition to the above (great advice, if I do say so myself), some folks find the Field Extractor very useful for quickly generating a lot of this. Just drill down in a search until you have the right records showing, then below all the listed fields click Extract New Fields. Follow the wizard.

One hint - it doesn't write the greatest of regex, but it seems to do better if you need to map everything in the event. In your case start from the left (after the timestamp) and just do each bit all in a row. Be careful of selecting spaces. Give that a try and see if that doesn't help you.

View solution in original post

0 Karma


daniel_augustyn
Contributor

I tried that too, and it wasn't really helpful. I will need to dump all of the logs that the BC add-on doesn't recognize and sort them and see how many rex queries I will need to write. Thanks for your help!

0 Karma

Richfez
SplunkTrust
SplunkTrust

If you do that, you might find some repeated near-patterns between a lot of them which may make your job easier.

It may be easier to develop the regexes using something like regex101. Also, you may have just found yourself a use case for actually looking at the Splunk built-in punctuation field, punct. It may prove useful to do something like

...mysearch... | stats count by punct | sort - count

That may help you focus on the bigger bang for the buck. Note, it's possible you have one or two particular fields defined in everything you have already parsed, like, oh, client_ip or server_ip or something, right? So you could make your root search exclude those and see what's left:

... mysearch NOT client_ip=* NOT server_ip=* | stats count by punct | sort - count

That will keep only the rows where neither of those fields is defined.
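To illustrate why grouping by punct works so well here, below is a rough Python approximation of the idea. The sample events and the exact algorithm are stand-ins for illustration; Splunk's real punct field has its own truncation and escaping rules.

```python
import re
from collections import Counter

# Hypothetical sample events, standing in for raw proxy log lines.
events = [
    "2016-01-14 00:42:32 284 10.130.16.102 - - proxy.domain.net",
    "2016-01-14 00:42:03 1 10.130.0.156 - - 0.0.0.0",
    "2016-01-14 00:43:10 7 10.130.0.200 - - 0.0.0.0",
]

def punct(event, limit=30):
    """Rough approximation of Splunk's punct field: strip the
    alphanumerics and keep the punctuation/whitespace skeleton."""
    return re.sub(r"[A-Za-z0-9]", "", event)[:limit]

# Equivalent in spirit to: ... | stats count by punct | sort - count
counts = Counter(punct(e) for e in events).most_common()
for signature, count in counts:
    print(count, repr(signature))
```

Events with the same field structure collapse to the same signature (the second and third events above), so the most common signatures tell you which rex to write first.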

Good luck! If you get stuck on anything specific, be sure to ask!

0 Karma

Richfez
SplunkTrust
SplunkTrust

If the question is between one big regex or the equivalent several small ones, I would think one big one would be best.

There are several reasons.

First, if you have to make the same extractions in either case, you'll have the same basic amount of work involved in the extractions themselves so the only difference is in rooting (beginning and/or end) and in context. In one big one, you have the entire thing rooted to the beginning of the string but with the several smaller ones, only one will be rooted to the beginning of the string. There's more context for each field extraction in the longer one as well. I've never done any formal testing, but generally I've never noticed that adding more extractions to a single rex takes any longer, but adding an entirely separate second rex often can.

Since it's very easy to convert from one big regex to several small ones and vice versa, it could be interesting to test. Post back with what you find if you do!
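If you do test it, a quick way to compare the two shapes outside Splunk is Python's re module. The event and field names below are simplified stand-ins, not the real Blue Coat layout.

```python
import re
import timeit

# Hypothetical, simplified event in the Blue Coat style from the question.
event = "2016-01-14 00:42:32 284 10.130.16.102 GET 200"

# One combined, anchored regex extracting all the fields at once.
combined = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<time_taken>\d+) "
    r"(?P<clientip>\S+) (?P<method>\S+) (?P<status>\d+)$"
)

# The same work split across three separate regexes, each of which
# must scan the event independently.
parts = [
    re.compile(r"^(?P<date>\S+) (?P<time>\S+)"),
    re.compile(r"^(?:\S+ ){2}(?P<time_taken>\d+) (?P<clientip>\S+)"),
    re.compile(r"(?P<method>[A-Z]+) (?P<status>\d+)$"),
]

def one_big():
    return combined.match(event).groupdict()

def several_small():
    d = {}
    for p in parts:
        d.update(p.search(event).groupdict())
    return d

# Both shapes extract identical fields; only the timing differs.
assert one_big() == several_small()
print("one big: ", timeit.timeit(one_big, number=100_000))
print("several: ", timeit.timeit(several_small, number=100_000))
```

This doesn't measure Splunk's rex pipeline directly, but it shows the per-event overhead of running several anchored scans instead of one.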

daniel_augustyn
Contributor

If I did one long rex for each different type of proxy log, that would be a lot of work: since each long rex is really specific to a single type of proxy log, I would need to create tons of them. I was looking for something more general, and wasn't sure how to create it. In my logs, I've been seeing a lot of different field structures. Not sure why.

2016-01-14 00:42:32 284 10.130.16.102 - - proxy.domain.net x.x.x.x None - - PROXIED "Social Networking" http://www.domain.com/article/david-bowie-blackstar-album-sales-networth 200 TCP_NC_MISS GET application/javascript;%20charset=utf-8 http platform.domain.com 80 /widgets.js - js "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36" 172.16.130.10 27626 726 - "Widgets" - none
OR
2016-01-14 00:42:03 1 10.130.0.156 - - 0.0.0.0 - Invalid - invalid_request PROXIED - - 400 TCP_NC_MISS unknown - - - 0 / - - - 172.16.130.10 842 152 - "none" "none" none

I have plenty of others that don't line up together. I will work on it and try to figure something out, but if you have any ideas on how to approach it, I would really appreciate it.

0 Karma

Richfez
SplunkTrust
SplunkTrust

There are techniques, but I'm not sure how applicable those would be here. Still, some may help.

One technique (I used with Barracuda logs) was that they came in 3 flavors depending on a particular field. The initial 6 or 10 fields were all the same, then in the middle was "SEND" or "RECV" and everything after that had to be parsed based on if it was SEND or RECV. So, I used one rex to get everything up to but not including the SEND or whatever, and created a field with that SEND plus everything that was left. Then there were three additional rexes that used the field I just created, checking "^SEND ...." and "^RECV ..." to parse the additional information correctly in each case.
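A sketch of that two-stage idea in Python, with completely made-up Barracuda-style events and field names (in Splunk itself this would be a base extraction plus per-direction extractions against the captured remainder):

```python
import re

# Hypothetical events in the shape described above: a fixed prefix,
# then SEND or RECV, then a direction-specific tail.
events = [
    "2016-01-14 00:42:32 host1 SEND recipient@example.com 1024",
    "2016-01-14 00:42:33 host2 RECV sender@example.com 10.0.0.5",
]

# Stage 1: one rex for the common prefix, capturing the remainder.
prefix_rx = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<host>\S+) (?P<rest>.*)$"
)

# Stage 2: direction-specific rexes applied to the remainder only.
send_rx = re.compile(r"^SEND (?P<rcpt>\S+) (?P<bytes_out>\d+)$")
recv_rx = re.compile(r"^RECV (?P<sender>\S+) (?P<src_ip>\S+)$")

def parse(event):
    fields = prefix_rx.match(event).groupdict()
    rest = fields.pop("rest")
    for rx in (send_rx, recv_rx):
        m = rx.match(rest)
        if m:
            fields.update(m.groupdict())
            break
    return fields

for e in events:
    print(parse(e))
```

The payoff is that the common prefix is written (and maintained) once, and each variant rex stays short and anchored at the start of its own remainder.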

If you have a piece at the front that's always the same, a similar technique could be used. Like, if the first 7 fields are always domain, dest_ip, something, the name of the king of England ("-"), and so on, then perhaps you can get all of those for nearly all your logs with only one rex.

Keep in mind that you have to define what's around the field you are extracting in order to "anchor" the rex in the event. A good anchor is the beginning or end of the event, ^ or $. Also good is anchoring on static text, like hostname-(?P<myfield>[^,]+), because the parser can figure that out really easily, so it's not grabbing random non-comma things all over the place and trying to call them myfield.

But your examples don't appear to be like that. They appear to be just something (space) something (space) ...
For example, in the example above of ...

2016-01-14 00:42:03 1 10.130.0.156 - - 0.0.0.0 - Invalid - invalid_request PROXIED...

it looks like the "reason" of "invalid_request" is surrounded by spaces. Umm. They're all surrounded by spaces. Maybe if you knew it was the 11th "field" you could write something like ^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(?P<reason>[^ ]+). But if you've done that in order to give it context, why not just extract where the \S+'s were? (P.S. I was probably wrong with "11", and I probably didn't count right in my fake rex above! 🙂 )
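To make the counting concrete, here is that positional approach in Python against the second sample event from above (with the named-group syntax spelled out; in this particular sample the reason really is the 11th field):

```python
import re

event = ("2016-01-14 00:42:03 1 10.130.0.156 - - 0.0.0.0 - "
         "Invalid - invalid_request PROXIED")

# Positional rex: skip ten whitespace-delimited fields, capture the 11th.
rx = re.compile(r"^(?:\S+\s+){10}(?P<reason>\S+)")
m = rx.match(event)
print(m.group("reason"))   # invalid_request

# ...but if everything is space-delimited anyway, a plain split gets
# every field at once, which is the point being made above.
fields = event.split()
assert fields[10] == m.group("reason")
```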

In the case of regular patterns like that space field space field space field one, you could also check the FIELD_DELIMITER = \s in props.conf. It's worth a try, at least on some of those events. 🙂
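As a hedged sketch, a props.conf stanza along those lines might look like the following. The sourcetype name is hypothetical, and FIELD_DELIMITER only takes effect when INDEXED_EXTRACTIONS is enabled, so check props.conf.spec for the exact values your Splunk version accepts:

```
# props.conf -- hypothetical sourcetype for the unparsed proxy events
[bluecoat:unparsed]
INDEXED_EXTRACTIONS = w3c
FIELD_DELIMITER = whitespace
```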

So, it seems like in most cases you just have repeated things like \s+(?P<field1>[^ ]+)\s+(?P<field2>[^ ]+)... and, while not quite trivial, once you get a few under your belt they're really not very hard.

Does that make sense? I think what I'm getting at is that there are some tips and tricks, and I've tried to provide a few, but they often come down to not being a tip or a trick so much as something that gets you thinking about it the right way. And thinking that it's fun to build these can be half the battle. I've been told I'm odd, though.

0 Karma

sloshburch
Splunk Employee
Splunk Employee

Adding some commentary to the discussion:

  • I have noticed that one long regex performs better than many small ones. I assume this is because of the reduced overhead of passing the results back into the Python command.
  • @rich7177's general idea is an approach I've used before as well: Bundle together some generic regexes that match common formats and ignore variable portions of the event. Then have whatever is needed for the variable portions. That means the common formats portions are extracted the same and are easier to manage.
0 Karma

Richfez
SplunkTrust
SplunkTrust

And, I think your "answer" should have been a comment on my answer, or a comment on your own question. Could you move it to keep things tidy and easy to read? I can move it for you and do that cleanup if you'd like.

0 Karma

daniel_augustyn
Contributor

Thanks for moving it.

0 Karma