This is about parsing time, not search time.
Is there a way to use a lookup during the parsing phase and write the result directly into the event?
For example, DHCP information, since it is volatile: look up "src_ip" in a CSV (or similar) and add the result to the event as "src_name".
And can I subsequently filter out events based on that,
e.g. drop the event if "src_name" = "unknown"?
In general, resolving IPs to hostnames with DHCP logs is the "reference" example of a Splunk time-based lookup. However...
Assuming that (for some reason) you cannot dump and store the data required for your lookup (e.g. DHCP leases), but you can perform the lookup "at the moment" (for example, by running an external command like "netsh"), I believe you should create a scripted input that collects/receives your events, performs the lookup (and probably additional event processing routines), and returns the extended event records to Splunk.
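A minimal sketch of what such an enrich-and-filter step could look like, written in Python. Everything specific here is an assumption for illustration: that events carry the address as "src_ip=<IPv4>", that the lookup is a CSV with columns src_ip and src_name, and the function names themselves. A real scripted input would read its events from wherever they originate and print the surviving ones to stdout for Splunk to pick up.

```python
import csv
import io
import re

# Assumed event format: the address appears as "src_ip=<IPv4>".
SRC_IP = re.compile(r"\bsrc_ip=(\d{1,3}(?:\.\d{1,3}){3})\b")


def load_lookup(fh):
    """Read a CSV with (assumed) columns src_ip,src_name into a dict."""
    return {row["src_ip"]: row["src_name"] for row in csv.DictReader(fh)}


def enrich(lines, lookup):
    """Yield events with src_name appended; skip events with no known name."""
    for line in lines:
        line = line.rstrip("\n")
        match = SRC_IP.search(line)
        name = lookup.get(match.group(1), "unknown") if match else "unknown"
        if name == "unknown":
            continue  # dropped here, so it is never sent on for indexing
        yield f'{line} src_name="{name}"'


# In-memory demo data (hypothetical); a real input would read files or a socket.
lookup = load_lookup(io.StringIO("src_ip,src_name\n10.0.0.5,laptop-01\n"))
events = [
    "t=1 src_ip=10.0.0.5 action=allow\n",
    "t=2 src_ip=10.0.0.9 action=allow\n",
]
for event in enrich(events, lookup):
    print(event)
```

Only the first demo event survives, because 10.0.0.9 is not in the lookup. The point of doing this outside Splunk is that the dropped events never count against ingestion.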
The "Scripted inputs overview" section of the "Developing Views and Apps for Splunk Web" manual could be a good starting point, I guess.
If by "volatile" information you mean information that changes over time, you might want to have a look at time-based lookups. They can be used to look up different DHCP info for the same src_ip at different times.
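To illustrate the idea behind a time-based lookup (not Splunk's actual implementation), here is a small Python sketch: for a given src_ip and event time, it returns the name from the lease that most recently started at or before that time. The lease data, the epoch-second timestamps, and the function names are all made up for the example.

```python
from bisect import bisect_right

# Hypothetical lease history: (lease start in epoch seconds, src_ip, src_name).
LEASES = [
    (1000, "10.0.0.5", "laptop-01"),
    (2000, "10.0.0.5", "printer-07"),
    (1500, "10.0.0.6", "phone-03"),
]


def build_index(leases):
    """Group leases per IP, sorted by start time, so they can be binary-searched."""
    index = {}
    for start, ip, name in sorted(leases):
        starts, names = index.setdefault(ip, ([], []))
        starts.append(start)
        names.append(name)
    return index


def lookup_at(index, ip, event_time):
    """Return the name whose lease most recently started at or before event_time."""
    starts, names = index.get(ip, ([], []))
    pos = bisect_right(starts, event_time)
    return names[pos - 1] if pos else "unknown"


index = build_index(LEASES)
print(lookup_at(index, "10.0.0.5", 1200))  # laptop-01
print(lookup_at(index, "10.0.0.5", 2500))  # printer-07
print(lookup_at(index, "10.0.0.5", 500))   # unknown
```

The same src_ip resolves to different names depending on the event time, which is exactly the DHCP situation.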
Thanks for that input
These solutions would work well if I could first dump everything into Splunk and then query it. That usually works within a single enterprise/organisation and/or if you have sufficient storage.
But this does not work for us. Depending on the use case, we are quickly talking about multiple TB/day, only to drop 99.999% of it later ...
While this would be really fun to do ... the cost-benefit ratio is not in my favor 🙂
Therefore, I want to look up some information based on what is in the event, and then decide on that basis whether to keep or drop the event.
DHCP is just a simple example because everyone knows the problem ...
What you can do with Splunk is use regular expressions to decide whether to index an event or not. If your events do not contain something you can detect with regular expressions (maybe certain subnets?), then that info needs to be added before the data reaches Splunk.
Yeah I thought of that ... it would be like "hardcoding" the list(s) as REGEX rather than defining a config, i.e. lookup, and updating the list, but it could do the job ...
... given that it scales to several hundred entries and way beyond 100k events/s
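For what it's worth, the "list hardcoded as REGEX" idea could be generated from the list rather than maintained by hand. The Python sketch below builds one alternation from a set of IPs and renders a transforms.conf fragment that sends everything to nullQueue and then routes matching events back to indexQueue. REGEX, DEST_KEY = queue, and FORMAT = nullQueue / indexQueue are the standard settings for this kind of routing; the stanza names, the "src_ip=" event format, and the helper functions are assumptions for illustration, and the stanzas would still need to be referenced from props.conf. Nothing here says how well it would hold up at several hundred entries and that event rate.

```python
import re


def build_keep_regex(ips):
    """Build one alternation matching 'src_ip=' followed by any listed IP."""
    alternation = "|".join(re.escape(ip) for ip in sorted(ips))
    # (?!\d) stops 10.0.0.5 from also matching 10.0.0.50.
    return rf"src_ip=(?:{alternation})(?!\d)"


def render_transforms(ips):
    """Render a transforms.conf fragment (stanza names are hypothetical)."""
    return "\n".join([
        "[drop_everything]",
        "REGEX = .",
        "DEST_KEY = queue",
        "FORMAT = nullQueue",
        "",
        "[keep_known_hosts]",
        f"REGEX = {build_keep_regex(ips)}",
        "DEST_KEY = queue",
        "FORMAT = indexQueue",
    ])


print(render_transforms(["10.0.0.5", "10.0.0.6"]))
```

Regenerating the fragment whenever the list changes keeps the "config" in one place, but the regex is still evaluated against every single event.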
I would not recommend that. Splunk's parsing is certainly built to handle that amount of data, and sending events to the nullQueue instead of indexing them will not cause any problems. But running such regexes over each and every individual event is not what you want to throw at this problem.
If you can, I'd suggest using some other system in front of Splunk to determine the relevant hosts and route their output to Splunk. Splunk itself is made for ingesting data and dropping individual events; it is not a platform for permanently and dynamically enabling and disabling inputs.
I would never touch the raw event and would retain its purity.
For volatile information, what I tend to do is "index" that information on a daily basis into another index. In your case, I would index the DHCP information into a separate index with that day's timestamp as _time.
You can always correlate your events with this indexed/volatile data at any later time.