I just wrote a regex for proxy field extractions, and it is not working as it should. Fields are extracted from some of the proxy logs but not from others. The odd part is that the regex works fine on regex101.com. Also, when I redo the extraction using the "Extract Fields" option in the drop-down under an event, I am taken to the "Extract Fields" page, which highlights in color the fields that should have been extracted. I am not sure why they are not getting extracted. Any thoughts?
daniel_augustyn,
One thing I just thought of is to try pretending this data is space-separated and let Splunk process this as an indexed extraction and provide your field names for it. This could be the shortcut you were looking for before on doing all this faster/easier, too, so if this works, it may solve two problems at once.
The docs provide examples and help to extract fields from files with structured data. You might need to redefine sourcetypes as you read events to make sure they're tagged with a sourcetype that's unique for each unique type of event - that may take a bit of thinking. But, once you have that done...
I believe you'll want props.conf to have
[MySourcetypeForThoseEvents]
INDEXED_EXTRACTIONS = CSV
FIELD_DELIMITER = space
FIELD_QUOTE="
FIELD_NAMES=field1_date,field2_time,field3,field4,field5_ip...
And you may need to add to that
TIMESTAMP_FIELDS = field1_date,field2_time
Give that a shot and let us know how it goes and if you find it useful!
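To make the idea concrete, here is a minimal Python sketch of what a space-delimited, quote-aware split does to the sample event posted in this thread. It only illustrates the splitting; it is not how Splunk implements indexed extractions. The positions used for protocol, dest_host, and dest_port are read off that single sample, so treat them as assumptions that may differ between proxy versions.

```python
import csv

# Sample event copied from this thread.
event = ('2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - '
         'OBSERVED "Technology/Internet" - 302 TCP_NC_MISS GET text/html http '
         'portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 '
         '454f7877563349f7-00000000027dbbf5-00000000569927aa')

def split_event(line):
    """Split one event on single spaces, treating double-quoted text as one value."""
    return next(csv.reader([line], delimiter=' ', quotechar='"'))

values = split_event(event)

# Assumed positions (0-based), taken from this one sample only.
positions = {'protocol': 18, 'dest_host': 19, 'dest_port': 20}
fields = {name: values[index] for name, index in positions.items()}

print(len(values))  # 33
print(fields)       # {'protocol': 'http', 'dest_host': 'portal.threatpulse.net', 'dest_port': '80'}
```

Because the split is purely positional, a missing value that is logged as a dash still occupies its slot, so later fields do not shift.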
Make sure you select PCRE, which is the flavor of regex that Splunk uses.
Haha, I did that already. I had to adjust the Blue Coat add-on's transforms.conf file a little: the last two fields in the regex were breaking the extraction, and I had to change them to accept dashes. It was mostly the UA field that broke the logic. When the UA was missing from a log, the field held a dash, which the regex did not recognize, so it would not extract the fields from that log.
The issue was with different proxy versions. I had to create a single regex that picks up most of the logs, and I will need to extract the rest manually, which is really not much.
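The dash problem can be shown with a small Python sketch. The two patterns and the shortened events below are hypothetical and are not the add-on's actual transforms; they only show how a pattern that insists on a quoted UA fails on a dash, and how an alternation fixes it. Python's re module accepts the same (?P<name>...) named-group syntax as PCRE for this case.

```python
import re

# Hypothetical, simplified event tail: a status code followed by the UA field.
strict = re.compile(r'(?P<status>\d+) "(?P<user_agent>[^"]*)"')
# Same pattern, but the UA may also be a bare dash when the value is missing.
tolerant = re.compile(r'(?P<status>\d+) (?:"(?P<user_agent>[^"]*)"|-)')

with_ua = '200 "Mozilla/5.0"'
without_ua = '200 -'

def extract(pattern, text):
    """Return the named groups as a dict, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.groupdict() if match else None

print(extract(strict, with_ua))       # {'status': '200', 'user_agent': 'Mozilla/5.0'}
print(extract(strict, without_ua))    # None: the whole match fails, so no field is extracted
print(extract(tolerant, without_ua))  # {'status': '200', 'user_agent': None}
```

Note that with the strict pattern the missing UA costs every field in the event, not only the UA, because the regex has to match as a whole.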
It would help if we could see the regex you are using and a sample event that it won't work against. Could you post those?
The "Extract Fields" page shows that these fields should have been extracted. All fields are recognized correctly and highlighted in different colors.
( ?=[^p]*(?:portal.threatpulse.net|p.*portal.threatpulse.net))^(?P[^ ]+)\s+(?P[^ ]+)[^ \n]* (?P\d+)(?:[^ \n]* ){9}(?P\w+)[^"\n]*"(?P[^"]+)"\s+\-\s+(?P\d+)[^ \n]* (?P[^ ]+)\s+(?P\w+)\s+(?P[^ ]+)[^ \n]* (?P[a-z]+)\s+(?P\w+\.\w+\.\w+)\s+(?P[^ ]+)(?:[^ \n]* ){5}(?P\d+\.\d+\.\d+\.\d+)\s+(?P\d+)\s+(?P\d+\s+\-)
Here is a log event whose fields are not getting extracted:
2016-01-15 17:08:58 56 10.167.7.93 - - portal.domain.net x.x.x.x None - - OBSERVED "Technology/Internet" - 302 TCP_NC_MISS GET text/html http portal.threatpulse.net 80 / - - - 172.16.167.104 177 80 - "none" "none" 3 454f7877563349f7-00000000027dbbf5-00000000569927aa
Well, to be more specific, only about half of the fields are getting extracted: protocol, dest_host, and dest_port are not. I am not sure why this event does not behave like the others. It should.
Not sure why regex is getting cut off. I attached the regex below.
Didn't get the attachment (though I swear I saw it on my phone when I saw this first).
Repost where you linked before, or pastebin.com? I should have some time to check it in a bit if no one else does.
I think it's worth noting that the Interactive Field Extractor generally does not write good regular expressions. More often than not, people can write better, more efficient ones by hand. If you're comfortable writing the regex yourself, I recommend doing that.
thanks for the input