
Performance impact of log field anonymization

strive
Influencer

Hi,

We have a requirement to anonymize two fields in log events depending on a condition: if field A's value is 'a', then anonymize fields B and C.

We achieved this using regex.

In our test environment, where events are indexed at a rate of 0.1 MB/s, we did not see any delay in indexing. Unfortunately, we do not have a replica of the production system, where log events are indexed at a rate of 1.18 MB/s.

I have two questions:

1. Is there any better way to anonymize data?

2. What is the performance impact of log field anonymization on the system?

Thanks

Strive

1 Solution


martin_mueller
SplunkTrust

Do post your regular expressions along with sample data. It's very easy to stumble over regex performance pitfalls; maybe there's room for improvement.

As for your questions: in principle, regex at index time is a great way to anonymize data. However, depending on how complex your conditions are, bending the regex into shape to accommodate those conditions can ruin performance. As an alternative, you could anonymize the data straight at the source - provided you can control what is logged there.
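
For simple, unconditional masking, Splunk also supports sed-style replacement in props.conf. A minimal sketch, assuming the values always run from aap=/uid= up to the next ~ (delimiters taken from the config posted further down in this thread) - note it cannot express the "only when field A is 'a'" condition:

[web_source]
# Sketch only: masks every event, with no way to check the condition
SEDCMD-mask_aap = s/aap=[^~]+/aap=####/g
SEDCMD-mask_uid = s/uid=[^~]+/uid=####/g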

The impact on indexing performance greatly depends on the regular expression and the data. As an example, I've had a customer indexing at around 7 MB/s on average, with regular expressions sifting through the data at index time to determine the appropriate index. With the original expressions, his (well-sized) hardware was brought to its knees, the regex processor occupying 99% of the CPU. After changes to the regex, it's chugging along nicely with lots of headroom for additional data.


martin_mueller
SplunkTrust

The quickest way to improve this is to make the quantifiers non-greedy by changing from .* to .*?. Whether that still yields the same results depends on your data.
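
Applied to the transform posted in this thread, that change would look like this (a sketch only - whether the lazy quantifiers still land on the intended aap=, uid=, and A=1 occurrences must be verified against real events):

REGEX = (?i)^(([^ ]+ ){2})([^ ]+ )(([^ ]+ ){13})(.*?aap=)([^~]+)(.*?uid=)([^~]+)(.*?A=1)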

As another thought, you could compare this full-length regex with splitting the individual parts into three regexes - one for each to-be-replaced bit. Whether that'd be faster or not is quite hard to tell without testing and timing it.
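
A hypothetical sketch of that split (stanza names invented here; each regex repeats the A=1 check so the masking stays conditional, and only the first one still pins down the word positions):

[mask_third_word]
# Keep the first 2 words, mask the 3rd, keep the rest
REGEX = (?i)^(([^ ]+ ){2})[^ ]+ (.*A=1)
DEST_KEY = _raw
FORMAT = $1#### $3

[mask_aap]
# Mask the aap= value up to the next ~
REGEX = (?i)^(.*aap=)[^~]+(.*A=1)
DEST_KEY = _raw
FORMAT = $1####$2

[mask_uid]
# Mask the uid= value up to the next ~
REGEX = (?i)^(.*uid=)[^~]+(.*A=1)
DEST_KEY = _raw
FORMAT = $1####$2

The props.conf reference would then become TRANSFORMS-include = include_eventcount, mask_third_word, mask_aap, mask_uid; the transforms run in that order, each one seeing the previous one's output.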

Did you run some performance comparisons yet with vs without the anonymization?
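
One way to measure this, assuming you can search the _internal index: metrics.log records per-processor CPU time in the indexing pipeline, and index-time regex work is accounted to the regexreplacement processor. A sketch of a comparison search:

index=_internal source=*metrics.log group=pipeline processor=regexreplacement
| timechart span=5m sum(cpu_seconds) AS regex_cpu_seconds

Run it over comparable time windows with the transform enabled and disabled and compare the curves.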

martin_mueller
SplunkTrust

So, to translate that into English:

  • keep the first 2 words
  • replace the 3rd word
  • keep the next 13 words
  • keep everything until aap=
  • replace everything until ~
  • keep everything until uid=
  • replace everything until ~
  • keep the rest

Looking at the regex, its performance is going to depend on the data. You might run into lots of memory consumption and erroneously taken branches from the sixth and eighth capturing group - those asterisk quantifiers are greedy, hence they will match until the end of the string and backtrack from there. That may or may not be expensive.
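
To make the backtracking concrete, here is a hypothetical trace of (.*uid=) against an event ending in ...uid=abc~ A=1:

  1. .* first consumes everything to the end of the event
  2. uid= cannot match at the end of the string, so the engine gives characters back one at a time
  3. uid= finally matches at the last occurrence of uid= in the event

The lazy variant (.*?uid=) expands from the left instead and stops at the first uid=, which avoids the scan-to-the-end-and-backtrack - but it can pick a different occurrence if uid= appears more than once in an event.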


strive
Influencer

[anonymize_fields]
# Match: first 2 words, the 3rd word, the next 13 words, everything up to aap=,
# the aap value (up to ~), everything up to uid=, the uid value (up to ~),
# and the rest through A=1
REGEX = (?i)^(([^ ]+ ){2})([^ ]+ )(([^ ]+ ){13})(.*aap=)([^~]+)(.*uid=)([^~]+)(.*A=1)
DEST_KEY = _raw
# Rewrite _raw, masking the 3rd word and the aap= and uid= values with ####
FORMAT = $1#### $4$6####$8####$10

[web_source]
SHOULD_LINEMERGE = false
TRANSFORMS-include = include_eventcount, anonymize_fields
# Skip 8 space-delimited fields, a quoted field (or -), and 5 more fields
# before the literal [ that precedes the timestamp
TIME_PREFIX=^(?:[^ ]*( {1,2})){8}(?:\"[^\"]*\"|-)(?:[^ ]*( {1,2})){5}\[
MAX_TIMESTAMP_LOOKAHEAD=35