Hi,
We have a requirement to anonymize two fields in log events depending on a condition: if field A's value is 'a', then anonymize fields B and C.
We achieved this using regex.
In our test environment we did not see any delay in events getting indexed at a rate of 0.1 MBps. Unfortunately we do not have a replica of the production system, where log events are indexed at a rate of 1.18 MBps.
I have two questions:
1. Is there any better way to anonymize data?
2. What is the impact on performance of the system due to log fields anonymization?
Thanks
Strive
Do post your regular expressions along with sample data. It's very easy to stumble over regex performance pitfalls; maybe there's room for improvement.
As for your questions, in principle regex at index time is a great way to anonymize data. However, depending on how complex your conditions are, bending regex into shape to accommodate them can wreck performance. As an alternative, you could anonymize the data straight at the source - provided you can control what is logged there.
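Just for completeness: if the masking did not depend on a condition, the simplest index-time option would be sed-style replacement via SEDCMD in props.conf, with no transform needed at all. A minimal sketch - the class names and patterns below are illustrative, not taken from your config:

[web_source]
# unconditionally mask the value between aap= and the next ~ at index time
SEDCMD-mask_aap = s/aap=[^~]+/aap=####/g
# and likewise for uid=
SEDCMD-mask_uid = s/uid=[^~]+/uid=####/g

Since your replacement does depend on another field, a TRANSFORMS-based rewrite like yours is probably still the better fit; SEDCMD is listed only for comparison.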
The impact on indexing performance greatly depends on the regular expression and the data. As an example, I've had a customer indexing at around 7 MBps on average, with regular expressions sifting through the data at index time to determine the appropriate index. With the original expressions, his (well-sized) hardware was on its knees, the regex processor occupying 99% CPU. After changes to the regex it's chugging along nicely, with lots of headroom for additional data.
The quickest way to improve this is to make the quantifiers non-greedy by changing `.*` to `.*?`. Whether that still yields the same results depends on your data; a revised version of the transform is sketched at the end of this thread.
As another thought, you could compare this full-length regex with splitting the individual parts into three regexes - one for each to-be-replaced bit. Whether that'd be faster or not is quite hard to tell without testing and timing it.
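Roughly, that split could look like this in transforms.conf - an untested sketch with made-up stanza names. Note that each regex then has to re-check the A=1 condition on its own, and that the non-greedy `.*?` grabs the first occurrence of the marker rather than the last:

[anon_field3]
# mask the third space-delimited field, but only when A=1 appears later in the event
REGEX = (?i)^(([^ ]+ ){2})[^ ]+( .*A=1.*)
DEST_KEY = _raw
FORMAT = $1####$3

[anon_aap]
# mask the value between aap= and the next ~, same condition
REGEX = (?i)^(.*?aap=)[^~]+(.*A=1.*)
DEST_KEY = _raw
FORMAT = $1####$2

[anon_uid]
# mask the value between uid= and the next ~, same condition
REGEX = (?i)^(.*?uid=)[^~]+(.*A=1.*)
DEST_KEY = _raw
FORMAT = $1####$2

props.conf would then chain them in order:

TRANSFORMS-include = include_eventcount, anon_field3, anon_aap, anon_uid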
Did you run some performance comparisons yet with vs without the anonymization?
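If not, one low-effort way to compare is Splunk's own metrics.log, which records per-sourcetype indexing throughput and per-processor CPU time. Something along these lines (assuming the sourcetype is web_source), run once with the transforms enabled and once with them disabled:

index=_internal source=*metrics.log group=per_sourcetype_thruput series=web_source | timechart span=1m avg(kbps)

index=_internal source=*metrics.log group=pipeline processor=regexreplacement | timechart span=1m sum(cpu_seconds)

The second search shows how much CPU the regex replacement processor burns, which is roughly where these index-time rewrites happen.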
So, to translate that into English: the regex skips ahead to `aap=` and captures the value up to the next `~`, then skips ahead to `uid=` and captures that value up to the next `~`.
Looking at the regex, its performance is going to depend on the data.
You might run into lots of memory consumption and erroneously taken branches from the sixth and eighth capturing groups - their asterisk quantifiers are greedy, so they will match all the way to the end of the string and backtrack from there. That may or may not be expensive.
[anonymize_fields]
REGEX = (?i)^(([^ ]+ ){2})([^ ]+ )(([^ ]+ ){13})(.*aap=)([^~]+)(.*uid=)([^~]+)(.*A=1)
DEST_KEY = _raw
FORMAT = $1#### $4$6####$8####$10
[web_source]
SHOULD_LINEMERGE = false
TRANSFORMS-include = include_eventcount, anonymize_fields
TIME_PREFIX=^(?:[^ ]*( {1,2})){8}(?:\"[^\"]*\"|-)(?:[^ ]*( {1,2})){5}\[
MAX_TIMESTAMP_LOOKAHEAD=35
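For reference, applying the non-greedy suggestion from above to that transform would look roughly like this - untested, so do verify it still matches your events (only the sixth and eighth groups change):

[anonymize_fields]
# .*aap= and .*uid= are now non-greedy (.*?); everything else is unchanged
REGEX = (?i)^(([^ ]+ ){2})([^ ]+ )(([^ ]+ ){13})(.*?aap=)([^~]+)(.*?uid=)([^~]+)(.*A=1)
DEST_KEY = _raw
FORMAT = $1#### $4$6####$8####$10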