Splunk Search

Regex for field extraction is not working properly

daniel_augustyn
Contributor

I just wrote a regex for proxy field extractions, and it is not working as it should. I'm not sure why. Fields get extracted for some of the proxy logs but not for others. The weird thing is that the regex works fine on regex101.com. In addition, when I redo the extraction using the "Extract Fields" option in the drop-down under an event, I get redirected to the "Extract Fields" page, and its color highlighting shows that these fields should have been extracted. Any thoughts?

1 Solution

Richfez
SplunkTrust

daniel_augustyn,

One thing I just thought of: treat this data as space-separated, let Splunk process it as an indexed extraction, and supply your own field names for it. This could also be the shortcut you were looking for earlier to do all this faster and more easily, so if it works, it may solve two problems at once.

The docs provide examples and help to extract fields from files with structured data. You might need to redefine sourcetypes as you read events to make sure they're tagged with a sourcetype that's unique for each unique type of event - that may take a bit of thinking. But, once you have that done...

I believe you'll want props.conf to have

[MySourcetypeForThoseEvents]
INDEXED_EXTRACTIONS = CSV
FIELD_DELIMITER=space
FIELD_QUOTE="
FIELD_NAMES=field1_date,field2_time,field3,field4,field5_ip...

And you may need to add to that

TIMESTAMP_FIELDS = field1_date,field2_time

Give that a shot and let us know how it goes and if you find it useful!
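
To get a feel for what a delimiter-based extraction would see, here is a minimal Python sketch (not Splunk itself) that splits the sample event posted in this thread on whitespace while keeping double-quoted values together, then assigns positional names. The names `field1`, `field2`, ... are placeholders, and the helper names `split_event` and `name_fields` are made up for this illustration.

```python
import shlex

# Sample event copied from this thread (note the double space before "302").
EVENT = (
    '2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - '
    'OBSERVED "Technology/Internet" -  302 TCP_NC_MISS GET text/html http '
    'portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 '
    '454f7877563349f7-00000000027dbbf5-00000000569927aa'
)


def split_event(line):
    """Split on runs of whitespace, treating "quoted text" as one value."""
    return shlex.split(line)


def name_fields(values):
    """Assign placeholder positional names: field1, field2, ..."""
    return {f"field{i}": value for i, value in enumerate(values, start=1)}


fields = name_fields(split_event(EVENT))
print(len(fields))
print(fields["field19"], fields["field20"], fields["field21"])
```

In this sketch the values `http`, `portal.threatpulse.net` and `80` land in positions 19 to 21. Mapping them to protocol, dest_host and dest_port is an assumption based on the values, not something the log format itself confirms.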

woodcock
Esteemed Legend

Make sure you select PCRE on regex101.com, which is the flavor of regex that Splunk uses.

daniel_augustyn
Contributor

Haha, I did that already. I had to adjust the Blue Coat add-on's transforms.conf file a little: the last two fields in the regex were breaking the extraction, so I changed them to also accept dashes. It was mostly the UA (user agent) field that broke the logic. When the UA was missing from a log line, the regex did not recognize the dash in its place, so the fields were not extracted from that event.
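
The dash problem described above can be reproduced with a small Python sketch. The two event tails and both patterns below are hypothetical and heavily simplified; they are not the add-on's actual regex. The `(?P<name>...)` group syntax used here is also accepted by PCRE.

```python
import re

# Hypothetical, simplified tails of proxy events: <bytes> <user-agent>
WITH_UA = '177 "Mozilla/5.0"'
WITHOUT_UA = '177 -'  # the proxy logs a bare dash when the UA is missing

# Strict pattern: requires a quoted user agent, so a bare dash never matches.
STRICT = re.compile(r'(?P<bytes>\d+)\s+"(?P<ua>[^"]*)"')

# Tolerant pattern: accepts either a quoted value or a bare dash.
TOLERANT = re.compile(r'(?P<bytes>\d+)\s+(?P<ua>"[^"]*"|-)')


def extract(pattern, line):
    """Return the named groups as a dict, or None when the pattern fails."""
    match = pattern.search(line)
    return match.groupdict() if match else None


print(extract(STRICT, WITH_UA))
print(extract(STRICT, WITHOUT_UA))
print(extract(TOLERANT, WITHOUT_UA))
```

Because one failing group makes the whole pattern fail, the strict version extracts nothing at all from the event without a UA, which matches the symptom of every field going missing for some events.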

daniel_augustyn
Contributor

The issue was with different proxy versions. I had to create a single regex that picks up most of the logs, and I will need to extract the rest manually, which is really not that much.

Richfez
SplunkTrust

It would help if we could see the regex you are using and a sample event that it won't work against. Could you post those?

daniel_augustyn
Contributor

On the "Extract Fields" page, it shows that these fields should have been extracted. All fields are recognized correctly and highlighted in different colors.

( ?=[^p]*(?:portal.threatpulse.net|p.*portal.threatpulse.net))^(?P[^ ]+)\s+(?P[^ ]+)[^ \n]* (?P\d+)(?:[^ \n]* ){9}(?P\w+)[^"\n]*"(?P[^"]+)"\s+\-\s+(?P\d+)[^ \n]* (?P[^ ]+)\s+(?P\w+)\s+(?P[^ ]+)[^ \n]* (?P[a-z]+)\s+(?P\w+\.\w+\.\w+)\s+(?P[^ ]+)(?:[^ \n]* ){5}(?P\d+\.\d+\.\d+\.\d+)\s+(?P\d+)\s+(?P\d+\s+\-)

Here is a log event whose fields are not getting extracted:

2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - OBSERVED "Technology/Internet" -  302 TCP_NC_MISS GET text/html http portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 454f7877563349f7-00000000027dbbf5-00000000569927aa

Well, to be more specific, only about half of the fields are getting extracted: protocol, dest_host, and dest_port are not. I'm not sure why this event does not work like the others. It should.
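
One way to narrow down a partial extraction like this is to build the regex up piece by piece and find the first piece that stops the cumulative pattern from matching. The Python sketch below is a hypothetical illustration: the pieces and group names (`date`, `time`, `col3`, ...) are made up, the fifth piece is deliberately too strict, and only the first few fields of the sample event are used.

```python
import re

# First few fields of the sample event from this thread.
EVENT = '2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net'

# Hypothetical pattern pieces; group names are placeholders.
PIECES = [
    r'^(?P<date>[^ ]+)',
    r'\s+(?P<time>[^ ]+)',
    r'\s+(?P<col3>\d+)',
    r'\s+(?P<col4_ip>\d+\.\d+\.\d+\.\d+)',
    r'\s+(?P<col5>\w+)',    # too strict on purpose: the event has "-" here
    r'\s+(?P<col6>[^ ]+)',
]


def first_failing_piece(pieces, event):
    """Return the index of the first piece that breaks the match, else None."""
    for count in range(1, len(pieces) + 1):
        cumulative = ''.join(pieces[:count])
        if re.match(cumulative, event) is None:
            return count - 1
    return None


print(first_failing_piece(PIECES, EVENT))
```

Here the function reports index 4, the `\w+` piece, because `\w` does not match a dash. Replacing it with `[^ ]+` lets the cumulative pattern match through the last piece.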

daniel_augustyn
Contributor

Not sure why the regex is getting cut off. I attached the regex below.

Richfez
SplunkTrust

Didn't get the attachment (though I swear I saw it on my phone when I saw this first).

Repost where you linked before, or pastebin.com? I should have some time to check it in a bit if no one else does.

jluo_splunk
Splunk Employee

I think it's worth noting that the Interactive Field Extractor generally does not write good regular expressions. More often than not, people can write better, more efficient regular expressions by hand. If you're comfortable writing the regex yourself, I recommend doing that.

daniel_augustyn
Contributor

Thanks for the input.
