Splunk Enterprise

REGEX not being applied even though it is working in REX

jto13
Explorer

Hi all,

We have ingested some logs using a heavy forwarder as below in /opt/splunk/etc/apps/test_inputs/local/:

inputs.conf

[monitor:///opt/splunk/test/test.log]

index=test
sourcetype=aws:elb:accesslogs
disabled=0
start_from=oldest
_meta = splunk_orig_fwd::splunkfwd_hostname

props.conf

[aws:elb:accesslogs]
TRANSFORMS-aws_elb_accesslogs = aws_elb_accesslogs_extract_all_fields

transforms.conf

[aws_elb_accesslogs_extract_all_fields]
REGEX = ^(?P<Protocol>\S+)\s+(?P<Timestamp>\S+)\s+(?P<ELB>\S+)\s+(?P<ClientPort>\S+)\s+(?P<TargetPort>\S+)\s+(?P<RequestProcessingTime>\S+)\s+(?P<TargetProcessingTime>\S+)\s+(?P<ResponseProcessingTime>\S+)\s+(?P<ELBStatusCode>\S+)\s+(?P<TargetStatusCode>\S+)\s+(?P<ReceivedBytes>\S+)\s+(?P<SentBytes>\S+)\s+\"(?P<Request>[^\"]+)\"\s+\"(?P<UserAgent>[^\"]+)\"\s+(?P<SSLCipher>\S+)\s+(?P<SSLProtocol>\S+)\s+(?P<TargetGroupArn>\S+)\s+\"(?P<TraceId>[^\"]+)\"\s+\"(?P<DomainName>[^\"]+)\"\s+\"(?P<ChosenCertArn>[^\"]+)\"\s+(?P<MatchedRulePriority>\S+)\s+(?P<RequestCreationTime>\S+)\s+\"(?P<ActionExecuted>[^\"]+)\"\s+\"(?P<RedirectUrl>[^\"]+)\"\s+\"(?P<ErrorReason>[^\"]+)\"\s+(?P<AdditionalInfo1>\S+)\s+(?P<AdditionalInfo2>\S+)\s+(?P<AdditionalInfo3>\S+)\s+(?P<AdditionalInfo4>\S+)\s+(?P<TransactionId>\S+)

Before we applied the props.conf and transforms.conf settings, we used the rex command on the search head to test the logs as below, and the fields appeared in the search results:

index=test sourcetype=aws:elb:accesslogs
| rex field=_raw "^(?P<Protocol>\S+)\s+(?P<Timestamp>\S+)\s+(?P<ELB>\S+)\s+(?P<ClientIP>\S+)\s+(?P<TargetIP>\S+)\s+(?P<RequestProcessingTime>\S+)\s+(?P<TargetProcessingTime>\S+)\s+(?P<ResponseProcessingTime>\S+)\s+(?P<ELBStatusCode>\S+)\s+(?P<TargetStatusCode>\S+)\s+(?P<ReceivedBytes>\S+)\s+(?P<SentBytes>\S+)\s+\"(?P<Request>[^\"]+)\"\s+\"(?P<UserAgent>[^\"]+)\"\s+(?P<SSLCipher>\S+)\s+(?P<SSLProtocol>\S+)\s+(?P<TargetGroupArn>\S+)\s+\"(?P<TraceId>[^\"]+)\"\s+\"(?P<DomainName>[^\"]+)\"\s+\"(?P<ChosenCertArn>[^\"]+)\"\s+(?P<MatchedRulePriority>\S+)\s+(?P<RequestCreationTime>\S+)\s+\"(?P<ActionExecuted>[^\"]+)\"\s+\"(?P<RedirectUrl>[^\"]+)\"\s+\"(?P<ErrorReason>[^\"]+)\"\s+(?P<AdditionalInfo1>\S+)\s+(?P<AdditionalInfo2>\S+)\s+(?P<AdditionalInfo3>\S+)\s+(?P<AdditionalInfo4>\S+)\s+(?P<TransactionId>\S+)"

However, when we ingested the logs as usual, the fields weren't extracted at search time the way they were with rex. Is there anything missing, or is there a reason why the regex isn't being applied to the logs?

Appreciate if anyone has any advice on this.

Thank you in advance.

hieuba6868
Explorer

You can try REPORT instead of TRANSFORMS in props.conf.

PickleRick
SplunkTrust

You're using TRANSFORMS, which means you're defining indexed fields (which you should generally avoid) without declaring those fields as indexed in fields.conf.

You should instead define them as search-time extractions and move the definitions to the SH layer.
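
As a minimal sketch of what that could look like (assuming the transforms.conf stanza from the original post is copied unchanged to the search heads; the class name aws_elb_accesslogs is arbitrary), the props.conf entry switches from TRANSFORMS- to REPORT-, so the same regex runs at search time:

```ini
# props.conf on the search head (not on the HF)
[aws:elb:accesslogs]
# REPORT- points at a transforms.conf stanza and applies it at search time
REPORT-aws_elb_accesslogs = aws_elb_accesslogs_extract_all_fields
```

Because the regex uses named capture groups, no FORMAT line should be needed for the search-time case. An inline EXTRACT- setting in props.conf is an alternative when the regex does not need to be shared with other stanzas.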

jto13
Explorer

Hi Rick,

Thanks for the response, but just wondering what would be the disadvantages of index-time extractions? Our search head is quite overloaded so we are making changes at the heavy forwarder side and trying to reduce the load by parsing it there.

We have also tried changing TRANSFORMS to EXTRACT in props.conf and putting the regex there directly, but for some reason that is not working either, even after restarting Splunk, so we're wondering if any additional config lines are required.

PickleRick
SplunkTrust

There is no single good answer to this question.

Generally, indexed fields cause additional storage overhead, can (if bloated) counterintuitively have a negative impact on performance, and for straight event searches do not give much of a performance gain over a well-written raw event search.

Having said that, there are some scenarios when adding some indexed fields helps.

One is when you do a lot of summarizing on some fields. Not searching but summarizing. Then indeed tstats is lightning fast compared to search | stats. (OTOH you can usually get similar results by report acceleration or summary indexing so indexed fields might not be needed).
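
To illustrate the difference (a sketch only, assuming Protocol has actually been written as an indexed field for this sourcetype), the summarizing search can read the index files directly:

```spl
| tstats count where index=test sourcetype=aws:elb:accesslogs by Protocol
```

whereas the search-time equivalent has to fetch and parse every matching event first:

```spl
index=test sourcetype=aws:elb:accesslogs
| stats count by Protocol
```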

Another case is when you have values that can appear often in multiple fields. Splunk searches by finding values first and then parsing the events containing those values to find out whether the value parses out to the given field. So if you have 10k events of which only 10 contain the string "whatever", and in nine of those ten it is the value of a field named comment, a search for "comment=whatever" only needs to check 10 events out of the 10k, and 90% of the events considered will match. So the search will be quite efficient. But if your data contained the word "whatever" in 3k events, of which only 9 had it in the comment field, Splunk would have to fetch all 3k events, parse them, and check whether the comment field indeed contained that word. Since only 9 of those 3k events contain the word in the right spot, this search would be very inefficient.

So there is no one size fits all. But the general rule is that adding indexed fields can sometimes help and it's not a thing that should never be used at all but should be only done when indeed needed. Not just added blindly for all possible fields in all your data because then you're effectively transforming Splunk into something it is not - a document database with schema on index. And for that you don't need Splunk.

And if your SH is already overloaded, that usually means one of two things (as always, it depends on the particular case; yours might be completely different, but I'm speaking from general experience). Either you simply have too many concurrently running searches, in which case creating indexed fields won't help much, or you have badly written searches (which is nothing to be ashamed of; Splunk is easy to start working with but can be tricky to master, and writing efficient searches requires a significant amount of knowledge).

jto13
Explorer

Hi Rick,

Thanks for the info, understood. For now, we are trying to get it working so that we at least get the fields and understand where we misconfigured it. Is there any problem with our props.conf and transforms.conf that would prevent it from working?

PickleRick
SplunkTrust

Try searching for

field::value

instead of

field=value

to test whether the fields are getting indexed (remember that field names _are_ case sensitive).

jto13
Explorer

Hi Rick,

When you say to search for field::value, do you mean in the rex part or in the search itself? Apologies if my wording was confusing, but the rex part worked and we did see the fields when we just searched the index (index=index_name) in verbose mode. However, we did not see those fields when relying only on props.conf and transforms.conf.

PickleRick
SplunkTrust

Just use key::value as your search term. Like

index=something somekey::somevalue

You can also check whether fields are indexed with walklex (an example looking for Protocol):

| walklex index=your_index type=all
| search term=" Protocol::*"
| table term

(You need to give it quite a big time range.)

jto13
Explorer

Hi Rick,

Instead of props.conf and transforms.conf in the HF (index-time extraction), we have moved the regex settings to the props.conf in all of our search heads (search-time extraction) manually in the /opt/splunk/etc/system/local directory as below:
 
props.conf

[aws:elb:accesslogs]
EXTRACT-aws_elb_accesslogs = ^(?P<Protocol>\S+)\s+(?P<Timestamp>\S+)\s+(?P<ELB>\S+)\s+(?P<ClientPort>\S+)\s+(?P<TargetPort>\S+)\s+(?P<RequestProcessingTime>\S+)\s+(?P<TargetProcessingTime>\S+)\s+(?P<ResponseProcessingTime>\S+)\s+(?P<ELBStatusCode>\S+)\s+(?P<TargetStatusCode>\S+)\s+(?P<ReceivedBytes>\S+)\s+(?P<SentBytes>\S+)\s+\"(?P<Request>[^\"]+)\"\s+\"(?P<UserAgent>[^\"]+)\"\s+(?P<SSLCipher>\S+)\s+(?P<SSLProtocol>\S+)\s+(?P<TargetGroupArn>\S+)\s+\"(?P<TraceId>[^\"]+)\"\s+\"(?P<DomainName>[^\"]+)\"\s+\"(?P<ChosenCertArn>[^\"]+)\"\s+(?P<MatchedRulePriority>\S+)\s+(?P<RequestCreationTime>\S+)\s+\"(?P<ActionExecuted>[^\"]+)\"\s+\"(?P<RedirectUrl>[^\"]+)\"\s+\"(?P<ErrorReason>[^\"]+)\"\s+(?P<AdditionalInfo1>\S+)\s+(?P<AdditionalInfo2>\S+)\s+(?P<AdditionalInfo3>\S+)\s+(?P<AdditionalInfo4>\S+)\s+(?P<TransactionId>\S+)

This is working now, but it is odd that the props.conf and transforms.conf configuration wouldn't work, since the regexes are the same.

PickleRick
SplunkTrust

1. That's good. You should use search-time extractions as I said from the beginning.

2. And as I said before, without additional configurations indexed fields are not searchable the same way search-time fields are. It doesn't mean "transforms don't work".

jto13
Explorer

Hi Rick,

Ok understood, to sum it up as below:

The search-time extraction settings are much simpler and there is less load to our environment compared to the index-time extraction.

For our index-time extraction, there should be additional configurations as well in our props and transforms conf files and most likely that's why our existing ones didn't work.

We resolved it by moving the regex settings to the props.conf in our search heads (search-time extraction) manually in the /opt/splunk/etc/system/local directory as below:

[aws:elb:accesslogs]
EXTRACT-aws_elb_accesslogs = ^(?P<Protocol>\S+)\s+(?P<Timestamp>\S+)\s+(?P<ELB>\S+)\s+(?P<ClientPort>\S+)\s+(?P<TargetPort>\S+)\s+(?P<RequestProcessingTime>\S+)\s+(?P<TargetProcessingTime>\S+)\s+(?P<ResponseProcessingTime>\S+)\s+(?P<ELBStatusCode>\S+)\s+(?P<TargetStatusCode>\S+)\s+(?P<ReceivedBytes>\S+)\s+(?P<SentBytes>\S+)\s+\"(?P<Request>[^\"]+)\"\s+\"(?P<UserAgent>[^\"]+)\"\s+(?P<SSLCipher>\S+)\s+(?P<SSLProtocol>\S+)\s+(?P<TargetGroupArn>\S+)\s+\"(?P<TraceId>[^\"]+)\"\s+\"(?P<DomainName>[^\"]+)\"\s+\"(?P<ChosenCertArn>[^\"]+)\"\s+(?P<MatchedRulePriority>\S+)\s+(?P<RequestCreationTime>\S+)\s+\"(?P<ActionExecuted>[^\"]+)\"\s+\"(?P<RedirectUrl>[^\"]+)\"\s+\"(?P<ErrorReason>[^\"]+)\"\s+(?P<AdditionalInfo1>\S+)\s+(?P<AdditionalInfo2>\S+)\s+(?P<AdditionalInfo3>\S+)\s+(?P<AdditionalInfo4>\S+)\s+(?P<TransactionId>\S+)

Thank you for the help.

PickleRick
SplunkTrust

It's a bit more complicated than just saying that "search-time extractions are simpler". But "the Splunk way" is to use search-time extractions when possible. That sums it up without getting too deeply into technical intricacies of the indexing process. 😉

And yes, if you don't add fields.conf entries for indexed fields, Splunk won't know that it has to look for indexed fields instead of search-time extracted ones. That's why you wouldn't find your data when you had those TRANSFORMS.
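
For reference, a hedged sketch of what a complete index-time setup typically looks like, shown for a single field only. The stanza name aws_elb_protocol_indexed is hypothetical, and the regex is cut down to the first token of the event; the FORMAT and WRITE_META lines follow the documented procedure for index-time fields and were not part of the configuration posted above:

```ini
# transforms.conf on the HF (where the events are parsed)
[aws_elb_protocol_indexed]
REGEX = ^(\S+)\s
# Write the captured value into the index as Protocol::<value>
FORMAT = Protocol::$1
WRITE_META = true

# props.conf on the HF
[aws:elb:accesslogs]
TRANSFORMS-aws_elb_protocol = aws_elb_protocol_indexed

# fields.conf on the search heads
[Protocol]
# Tell Splunk that Protocol=value should be looked up as an indexed field
INDEXED = true
```

With only the TRANSFORMS reference and no fields.conf entry, a search for Protocol=value would still look for a search-time field, which matches the behaviour described in this thread.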

jto13
Explorer

Hi Rick,

Makes sense, thanks a lot for your help. 🙏

Bhumi
Explorer

Hi @jto13,

Can you please share a few samples of the raw data you are trying to extract the fields from? If there is any sensitive information, mask it before sharing.

Also, I wanted to confirm the data flow: is it from HF->indexer?

jto13
Explorer

Hi Bhumi,

Yes, it is from HF->indexer
