Splunk Search

Regex for field extraction is not working properly

daniel_augustyn
Contributor

I just wrote a regex to extract fields from proxy logs, and it isn't working as expected. Fields are extracted for some of the proxy logs but not for others. The odd thing is that the regex works fine on regex101.com. In addition, when I redo the extraction using the "Extract Fields" option in the drop-down under an event, I am redirected to the "Extract Fields" page, which highlights in color the fields that should have been extracted. I'm not sure why they are not being extracted. Any thoughts?

0 Karma
1 Solution

Richfez
SplunkTrust

daniel_augustyn,

One thing I just thought of is to try pretending this data is space-separated and let Splunk process this as an indexed extraction and provide your field names for it. This could be the shortcut you were looking for before on doing all this faster/easier, too, so if this works, it may solve two problems at once.

The docs provide examples and help for extracting fields from files with structured data. You might need to redefine sourcetypes as you read events so that each unique type of event is tagged with its own unique sourcetype; that may take a bit of thinking. But once you have that done...

I believe you'll want props.conf to have

[MySourcetypeForThoseEvents]
INDEXED_EXTRACTIONS = CSV
FIELD_DELIMITER=space
FIELD_QUOTE="
FIELD_NAMES=field1_date,field2_time,field3,field4,field5_ip...

And you may need to add to that

TIMESTAMP_FIELDS = field1_date,field2_time
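
To get a feel for what a delimiter-plus-quote split does to one of these events before touching props.conf, here is a rough Python sketch. It uses Python's csv module purely as a stand-in, which is an assumption on my part: it is not how Splunk parses the data, and the positional names (field1, field2, ...) are placeholders rather than real field names. The event is the sample posted elsewhere in this thread.

```python
import csv

# Sample event copied from this thread (note the doubled space before 302).
EVENT = (
    '2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - '
    'OBSERVED "Technology/Internet" -  302 TCP_NC_MISS GET text/html http '
    'portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 '
    '454f7877563349f7-00000000027dbbf5-00000000569927aa'
)


def split_event(line):
    """Split one event on single spaces, treating a "quoted value" as one field."""
    return next(csv.reader([line], delimiter=" ", quotechar='"'))


def name_fields(tokens):
    """Pair each token with a placeholder positional name (field1, field2, ...)."""
    return {f"field{i}": tok for i, tok in enumerate(tokens, start=1)}


tokens = split_event(EVENT)
fields = name_fields(tokens)

# Print every position so the names for FIELD_NAMES can be worked out by eye.
for name, value in fields.items():
    print(f"{name:>8} = {value!r}")
```

With this parser the doubled space before 302 produces an empty field, so every position after it shifts by one. Whether Splunk treats that spot the same way is something to check against your own data.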

Give that a shot and let us know how it goes and if you find it useful!

0 Karma

woodcock
Esteemed Legend

On regex101.com, make sure you select PCRE, which is the flavor of regex that Splunk uses.

0 Karma

daniel_augustyn
Contributor

Haha, I did that already. I had to adjust the Bluecoat add-on's transforms.conf file a little: the last two fields in the regex were breaking the extraction, so I changed them to also accept dashes. It was mostly the UA (user agent) field. When the UA was missing from a log, the regex didn't recognize the dash in its place, so the fields were not extracted from that event.
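
As a rough illustration of that kind of fix (not the actual transforms.conf change, which isn't shown in this thread), here is a small Python sketch. The two input strings, the group names dest_port and user_agent, and both patterns are made up for the example. The point is only that a pattern requiring a quoted value fails outright on a bare dash, while an alternation that also accepts the dash keeps matching.

```python
import re

# Hypothetical fragments of an event tail: a port followed by the UA field.
WITH_UA = '80 "Mozilla/5.0"'
WITHOUT_UA = '80 -'          # the proxy writes a dash when the UA is missing

# Strict: the UA must be a quoted string, so a bare dash breaks the whole match.
STRICT = re.compile(r'(?P<dest_port>\d+)\s+"(?P<user_agent>[^"]*)"')

# Tolerant: accept either a quoted string or a bare dash in the UA position.
TOLERANT = re.compile(r'(?P<dest_port>\d+)\s+(?:"(?P<user_agent>[^"]*)"|-)')


def extract(pattern, text):
    """Return the named groups as a dict, or None when nothing matches."""
    m = pattern.search(text)
    return m.groupdict() if m else None


for label, pattern in (("strict", STRICT), ("tolerant", TOLERANT)):
    for text in (WITH_UA, WITHOUT_UA):
        print(label, repr(text), "->", extract(pattern, text))
```

With the tolerant pattern the dash case still matches and user_agent simply comes back empty, so the other fields in the same regex are not lost.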

0 Karma

daniel_augustyn
Contributor

The issue was caused by different proxy versions. I had to create a single regex that picks up most of the logs, and I will need to extract the rest manually, which is not much.

0 Karma

Richfez
SplunkTrust

It would help if we could see the regex you are using and a sample event that it won't work against. Could you post those?

daniel_augustyn
Contributor

Under the "Extract Fields" page, it shows that these fields should have been extracted. All fields are recognized correctly and circled with different colors.

( ?=[^p]*(?:portal.threatpulse.net|p.*portal.threatpulse.net))^(?P[^ ]+)\s+(?P[^ ]+)[^ \n]* (?P\d+)(?:[^ \n]* ){9}(?P\w+)[^"\n]*"(?P[^"]+)"\s+\-\s+(?P\d+)[^ \n]* (?P[^ ]+)\s+(?P\w+)\s+(?P[^ ]+)[^ \n]* (?P[a-z]+)\s+(?P\w+\.\w+\.\w+)\s+(?P[^ ]+)(?:[^ \n]* ){5}(?P\d+\.\d+\.\d+\.\d+)\s+(?P\d+)\s+(?P\d+\s+\-)

Here is a log event whose fields are not being extracted:

2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - OBSERVED "Technology/Internet" -  302 TCP_NC_MISS GET text/html http portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 454f7877563349f7-00000000027dbbf5-00000000569927aa

Well, to be more specific, only half of the fields are getting extracted: protocol, dest_host, and dest_port are not. I'm not sure why this event isn't handled like the others. It should be.

0 Karma

daniel_augustyn
Contributor

Not sure why the regex is getting cut off. I attached the regex below.

0 Karma

Richfez
SplunkTrust

Didn't get the attachment (though I swear I saw it on my phone when I saw this first).

Repost where you linked before, or pastebin.com? I should have some time to check it in a bit if no one else does.

0 Karma

jluo_splunk
Splunk Employee

I think it's worth noting that the Interactive Field Extractor is generally not going to write good regular expressions. More often than not, people are able to write better, more efficient regular expressions. If you're comfortable writing the regex yourself, I recommend doing that.
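
For a sense of what a short hand-written pattern can look like, here is a sketch in Python against the sample event posted in this thread. The pattern is hypothetical and is not the poster's regex. It assumes the protocol is the only all-lowercase token that is directly followed by a dotted host name and a numeric port, which may not hold for other proxy versions. The named-group syntax (?P<name>...) is the same one Splunk accepts, but the pattern has only been reasoned about in Python's re module, not in Splunk.

```python
import re

# Sample event copied from this thread.
EVENT = (
    '2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - '
    'OBSERVED "Technology/Internet" -  302 TCP_NC_MISS GET text/html http '
    'portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 '
    '454f7877563349f7-00000000027dbbf5-00000000569927aa'
)

# Hypothetical pattern for the three fields reported as missing.
PATTERN = re.compile(
    r"\s(?P<protocol>[a-z]+)"                  # all-lowercase token, e.g. http
    r"\s+(?P<dest_host>[\w-]+(?:\.[\w-]+)+)"   # dotted host name
    r"\s+(?P<dest_port>\d+)\s"                 # numeric port
)

match = PATTERN.search(EVENT)
extracted = match.groupdict() if match else {}
print(extracted)
```

A pattern this small is easier to reason about than one long expression covering every field, at the cost of relying on the assumption above about where the protocol sits.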

0 Karma

daniel_augustyn
Contributor

thanks for the input

0 Karma