Custom Logfile extract fields help

tomeki
New Member

Hi,

I am very new to Splunk and a total greenhorn in regex. I have a log file with the following format:

Jul 31 12:23:32 BALTHAZAR squid[7415]: 1375237412.537     93 10.110.40.144 TCP_MISS/200 1214 GET somewebsite ftropea FIRST_UP_PARENT/content1 application/x-javascript
Jul 30 23:59:13 BALTHAZAR squid[7415]: 1375192753.517      0 10.110.40.113 TCP_DENIED/407 3646 GET somewebsite - NONE/- text/html

It is a firewall/proxy access.log. When I import the data I choose access.log as the type, but then I need to customize, since Splunk gets only the Date/Time part correct and treats the whole rest as the event.

I would like to extract the following fields:

Date = Jul 31 12:23:32 
Servername = BALTHAZAR 
IP = 10.110.40.144 
Code= TCP_MISS/200 
RequestType = GET 
Website = somewebsite (includes the http://)
User = ftropea 

I also have to note that the username is sometimes empty and sometimes filled out.
I used the inbuilt field extractor and could extract almost all of the fields above except the User.

What I have so far is something like

(?:[^ \n]* ){3}(?P<_Servername_>[^ ]+)[^\.\n]*\.\d+\s+\d+\s+(?P<_IP_>[^ ]+)\s+(?P<_Code_>[^ ]+)\s+\d+\s+(?P<_RequestType_>[^ ]+)\s+(?P<_Website_>[^ ]+)

and I think even this is not correct... Any idea what I could do?

Or do I have to write my own app/plugin and write a parser (PHP/C# or whatever) for this file?

Gilberto_Castil
Splunk Employee

You do not need to program anything specific to get these fields out of your data. These look like SYSLOG style messages. It should be noted that Splunk recommends using the Common Information Model to standardize the naming convention for fields extracted from your data. You are not mandated to do this but it is a best practice recommendation.

In your case, the following will extract most fields appropriately. If necessary, just rename them.

^(?<date>\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s(?<hostname>[a-zA-Z0-9]+)\s(?<message_type>\w+)\[(?<message_id>\d+)\]\:\s+(?<epoch>\d{10}\.\d{3})\s+\d+\s(?<src>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(?<action>[A-Z_]+?)\/(?<status>\d{3})\s+(?<bytes_in>\d+)\s+(?<method>[A-Z]+)\s+(?<url>.+?)\s+(?<user>[a-z]+|\-)\s+(?<other>[A-Z_]+)/(?<http_user_agent>\w+|-)\s+(?<http_content_type>.+?)$
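If you want to sanity-check the pattern outside Splunk first, here is a minimal Python sketch that runs it against the two sample lines from the question. It assumes Python's `re` module, which spells named groups `(?P<name>...)` instead of `(?<name>...)`; apart from that spelling the pattern is unchanged. The names `PATTERN` and `SAMPLE_LINES` are just illustrative.

```python
import re

# The two sample events from the question, copied verbatim.
SAMPLE_LINES = [
    "Jul 31 12:23:32 BALTHAZAR squid[7415]: 1375237412.537     93 10.110.40.144 "
    "TCP_MISS/200 1214 GET somewebsite ftropea FIRST_UP_PARENT/content1 application/x-javascript",
    "Jul 30 23:59:13 BALTHAZAR squid[7415]: 1375192753.517      0 10.110.40.113 "
    "TCP_DENIED/407 3646 GET somewebsite - NONE/- text/html",
]

# Same pattern as above, with (?<name>...) rewritten as (?P<name>...) for Python.
PATTERN = re.compile(
    r"^(?P<date>\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s(?P<hostname>[a-zA-Z0-9]+)\s"
    r"(?P<message_type>\w+)\[(?P<message_id>\d+)\]:\s+(?P<epoch>\d{10}\.\d{3})\s+\d+\s"
    r"(?P<src>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(?P<action>[A-Z_]+?)/(?P<status>\d{3})\s+"
    r"(?P<bytes_in>\d+)\s+(?P<method>[A-Z]+)\s+(?P<url>.+?)\s+(?P<user>[a-z]+|-)\s+"
    r"(?P<other>[A-Z_]+)/(?P<http_user_agent>\w+|-)\s+(?P<http_content_type>.+?)$"
)

for line in SAMPLE_LINES:
    match = PATTERN.match(line)
    if match is None:
        print("no match:", line)
        continue
    fields = match.groupdict()
    # Show the fields the question asked about.
    print(fields["date"], fields["hostname"], fields["src"],
          fields["action"], fields["status"], fields["method"],
          fields["url"], fields["user"])
```

Both sample lines match, including the one where the user is `-`. Keep in mind that `[a-z]+` only accepts lowercase letters, so user names containing digits or capitals would need a wider class such as `\S+`.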

So, you could run a search using this:

sourcetype=squid | rex "^(?<date>\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s(?<hostname>[a-zA-Z0-9]+)\s(?<message_type>\w+)\[(?<message_id>\d+)\]\:\s+(?<epoch>\d{10}\.\d{3})\s+\d+\s(?<src>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(?<action>[A-Z_]+?)\/(?<status>\d{3})\s+(?<bytes_in>\d+)\s+(?<method>[A-Z]+)\s+(?<url>.+?)\s+(?<user>[a-z]+|\-)\s+(?<other>[A-Z_]+)/(?<http_user_agent>\w+|-)\s+(?<http_content_type>.+?)$"

Which will render you something like this:

[screenshot: search results with the extracted fields]

Once you've confirmed that this is what you need, automate the extraction by navigating SplunkWeb to

Manager >> Fields >> Field Extractions

Click "New"

Fill in the blanks


[screenshot: the new field extraction form]

And enjoy your automatic extractions. There are multiple ways to accomplish this, but this is the most straightforward.

--gc

rsennett_splunk
Splunk Employee

Nope, you don't need a special parser for this.
I think you might be trying to do two things at once, so I'll address the extraction first.
I notice you have line break characters in your regex, so it looks like you're trying to make a multiline event.
First, that's not how you do that... so let's set that aside for a moment.

Extracting fields the way you've begun will pull the fields and make them available for you to use in searching. It will not populate your index with visible field=value pairs.

Here is the extraction syntax you're looking for to pull the fields you've indicated:

(?i)(?P<Date>\w+\s+\d+\s+\d+:\d+:\d+)\s+(?P<Servername>\w+[^ ]+)\s+\S+\s+\S+\s+\S+\s+(?P<IP>\S+[^ ]+)\s+(?P<Code>\S+[^ ]+)\s+\S+\s+(?P<RequestType>\S+[^ ]+)\s+(?P<Website>\S+[^ ]+)\s+(?P<User>\S+[^ ]+)

That will work in the field extractor (which is creating a search time field extraction) or you can put it in your props.conf for the same effect preceded like this:

EXTRACT-all_fields = (?i)(?P<Date>\S+\s+\d+\s+\d+:\d+:\d+)\s+(?P<Servername>\S+[^ ]+)\s+\S+\s+\S+\s+\S+\s+(?P<IP>\d+.\d+.\d+.\d+[^ ]+)\s+(?P<Code>\S+[^ ]+)\s+\S+\s+(?P<RequestType>\S+[^ ]+)\s+(?P<Website>\S+[^ ]+)\s+(?P<User>\S+[^ ]+)\s+(?P<Rest>\S+.[^ ]+)

The second iteration has a slightly different regex. I took out the \w and replaced them with \S just for the heck of it... and I added an extra field to grab what's left at the end of the string, also just for the heck of it.
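One thing worth checking, since the question mentions that the user is sometimes empty: a group written as `\S+[^ ]+` needs at least two characters, so a lone `-` in the user column would not satisfy it, while `\S+` on its own accepts a single character. Here is a rough Python sketch of that difference against the two sample lines. It uses Python's `re` rather than Splunk's own engine, the group names are borrowed from the field list in the question (an assumption), and `build_pattern` is a hypothetical helper written only for this comparison.

```python
import re

# The two sample events from the question, copied verbatim.
SAMPLE_LINES = [
    "Jul 31 12:23:32 BALTHAZAR squid[7415]: 1375237412.537     93 10.110.40.144 "
    "TCP_MISS/200 1214 GET somewebsite ftropea FIRST_UP_PARENT/content1 application/x-javascript",
    "Jul 30 23:59:13 BALTHAZAR squid[7415]: 1375192753.517      0 10.110.40.113 "
    "TCP_DENIED/407 3646 GET somewebsite - NONE/- text/html",
]


def build_pattern(token: str) -> re.Pattern:
    """Build a token-by-token pattern, using `token` as the body of each named group."""
    skip = r"\S+"  # a column we step over without capturing
    parts = [
        r"(?P<Date>\w+\s+\d+\s+\d+:\d+:\d+)",
        rf"(?P<Servername>{token})",
        skip, skip, skip,  # squid[pid]:, epoch timestamp, elapsed time
        rf"(?P<IP>{token})",
        rf"(?P<Code>{token})",
        skip,              # byte count
        rf"(?P<RequestType>{token})",
        rf"(?P<Website>{token})",
        rf"(?P<User>{token})",
    ]
    # Columns are separated by one or more whitespace characters.
    return re.compile(r"\s+".join(parts))


two_char_min = build_pattern(r"\S+[^ ]+")  # every captured column needs 2+ characters
one_char_min = build_pattern(r"\S+")       # every captured column needs 1+ characters

for line in SAMPLE_LINES:
    for label, pattern in (("two_char_min", two_char_min), ("one_char_min", one_char_min)):
        match = pattern.search(line)
        user = match.group("User") if match else None
        print(f"{label}: User={user!r}")
```

In this sketch the two-character form matches the `ftropea` line but not the line where the user is `-`, while the plain `\S+` form matches both.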

If you are trying to create a multiline event with value pairs like your example so it looks more like a windows log and you want the index to look like that... that would be a whole different question.

Ask it again separately.

With Splunk... the answer is always "YES!". It just might require more regex than you're prepared for!