I have some logs that can include any one of 50,000+ users, but I only need to index and keep a subset of them: approximately 2,000 users. I'm looking for the most efficient way to include only the logs associated with these users.
I thought of using transforms.conf with a ridiculously long regex to match those users, but I'm looking for better ideas.
props.conf
[host::blah]
TRANSFORMS-null= setnull
transforms.conf
[setnull]
REGEX=
DEST_KEY=queue
FORMAT=nullQueue
I have an automatic lookup table of all Oracle returncodes/descriptions, which is a few times larger than what you’re looking to do, and I see zero performance impact.
The Splunk docs (http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Indextimeversussearchtime) say there is a performance hit from index-time extractions, so you should avoid them if you can: something about making the index larger, which makes all searches slower. However, it looks like you're routing events to the nullQueue rather than adding a new field, so it may work just fine.
If you really need to do this at index time, then you should figure out a way to automate the management of the regex and then just drop it into what Kristian posted.
It will be far easier to manage a CSV lookup table than it would be to manage a regex of that size.
Please post your results if you do go with index-time filtering by regex, because I am curious about the impact.
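For what it's worth, here is a minimal sketch of how the regex management could be automated. It is only an illustration: the function names are made up, the stanza name `keepsome` is borrowed from the two-transform nullQueue pattern, and it assumes that the username appears in the raw event as a whole whitespace-delimited token (as in space-separated IIS logs). If your usernames show up as DOMAIN\user, the list would need to contain them in that form. I have not measured how an alternation of ~2,000 names performs at index time.

```python
import re


def build_keep_regex(usernames):
    """Build a regex matching any listed username as a whole
    whitespace-delimited token (assumption: space-separated log fields)."""
    # Drop blanks and duplicates; longest names first, then alphabetical,
    # so the generated alternation is deterministic between runs.
    names = sorted({u.strip() for u in usernames if u.strip()},
                   key=lambda n: (-len(n), n))
    if not names:
        raise ValueError("no usernames supplied")
    # Escape regex metacharacters such as '.' or '\' in each name.
    alternation = "|".join(re.escape(n) for n in names)
    return r"(?:^|\s)(?:" + alternation + r")(?:\s|$)"


def render_stanza(usernames, stanza="keepsome"):
    """Render a transforms.conf stanza that routes matching events
    to the indexQueue (stanza name is a placeholder)."""
    return "\n".join([
        f"[{stanza}]",
        f"REGEX = {build_keep_regex(usernames)}",
        "DEST_KEY = queue",
        "FORMAT = indexQueue",
        "",
    ])


if __name__ == "__main__":
    # In practice the list would be read from a file, one username per line.
    print(render_stanza(["crane", "horse"]))
```

The output can be pasted (or written by a scheduled job) into transforms.conf. One caveat of matching on a bare token: the regex will also keep an event if the same string happens to appear as a whole token in some other field.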
These are IIS logs that include usernames (cs_username).
Do these accounts have some sort of distinguishing pattern, like da_xxxxx, admxxxxx, sys-xxxxxx?
Otherwise the regex would be awful to maintain.
Is there perhaps some other field in the events that can be used to filter on a broader scope?
Also, as per the docs on nullQueueing, you'll need to add an extra transform to keep some of the events:
props.conf
[your_host]
TRANSFORMS-blah = setnull, keepsome
transforms.conf
[setnull]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue
[keepsome]
REGEX = here is where you write your super regex
DEST_KEY = queue
FORMAT = indexQueue
K
That pretty much answers the question I was asking. Are there any other distinguishing features that can be used for filtering, e.g. the c-ip, if the users you want to keep come from a certain IP range?
Are you constrained license-wise? Otherwise you might index more data than you need and use tags or automatic lookups to your advantage. Not sure that it would consume fewer resources, but it would likely be more manageable.
/k
These are AD usernames, so they are all different, if that answers what you are trying to ask.
My question was rather, what differs between the usernames you want to keep, and those you want to throw out?
Are all the usernames just arbitrary strings, e.g. bob, apple, horse, crane, alice, with no pattern that can be used to filter out the unwanted ones? In other words, do you simply have to know that 'crane' and 'horse' are the ones to keep?