
Part-time on-demand Indexing

maverick — Tue, 08 Nov 2011 14:50:38 GMT

Currently, I have a forwarder monitoring a directory of files that are being logged in real time. My indexer is receiving all of the latest info from the forwarder as expected.

I now have the following requirement requested of me:

Only index the events being logged in real time to these files between the hours of 8:00p and 6:00a each night.
Allow all events logged between those hours each night to be searchable at any time of the day.

My thought is to set up the followTail = True option and just turn off the forwarder when I don't need to log real-time events (e.g. between the hours of 6:01a and 7:59p).

Does anyone else have a better idea?

Re: Part-time on-demand Indexing

kristian_kolb — Sat, 12 Nov 2011 22:33:57 GMT

Hmm, I'm not 100% sure that followTail=1 will be honoured in the way one may think. The following is from the docs for inputs.conf:

followTail = [0|1]
* Determines whether to start monitoring at the beginning of a file or at the end (and then index all events 
  that come in after that). 
* If set to 1, monitoring begins at the end of the file (like tail -f).
* If set to 0, Splunk will always start at the beginning of the file. 
* This only applies to files the first time Splunk sees them. After that, Splunk's internal file position 
  records keep track of the file. 
* Defaults to 0.

You have a few options I guess, some of which may not be feasible;

a) Prevent the log files from being written to during the daytime (6am-8pm), or possibly write to a daytime directory which is not being monitored. Not a very neat solution.

b) Stop the forwarder as you suggested, and delete (parts of) the fishbucket, which should give your forwarder a convenient case of amnesia, thus allowing followTail=1 to work again. Depending on your setup, i.e. what else is being monitored by the forwarder, this is perhaps not so easy and/or may produce strange results. Then again, it may work just fine.

c) Route all events originating during the day to the nullQueue, so they do not get indexed. You would have to craft a regex to match event timestamps for 6am-8pm, but I'm not sure what fields are available to you at this part of the process. Would probably be the neatest way of doing it, but I haven't tried anything similar, so it may not work at all.

UPDATE:

d) as _d_ pointed out, you could work with ignoreOlderThan to control which files will be read by the monitor. The option here would then be to
i) ensure all logs are rotated at 7.59PM
ii) use ignoreOlderThan=1m for the directory monitor stanza
iii) start the forwarder through cron or whatever at 8.01PM
iv) stop the forwarder through cron or whatever at 6.00AM

This ensures that events logged between 6AM and 8PM will not get indexed, since ignoreOlderThan goes by the modtime of the file.
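
For option c), a regex over the raw event text may be easier to reason about than one over epoch values. The following is only a rough sketch in Python, assuming each event starts with a timestamp like 2011-11-08 14:50:38 (a hypothetical layout, yours may differ). It matches hours 06 through 19, so an event stamped exactly 06:00:00 would be dropped too. Whether the same pattern behaves identically as a REGEX in transforms.conf is untested here.

```python
import re

# Assumption: every event begins with "YYYY-MM-DD HH:MM:SS".
# Hours 06..19 cover 06:00:00 through 19:59:59, the window to discard.
DAYTIME = re.compile(r"^\d{4}-\d{2}-\d{2} (?:0[6-9]|1[0-9]):\d{2}:\d{2}")


def should_drop(event: str) -> bool:
    """Return True if the event's leading timestamp falls in the daytime window."""
    return DAYTIME.match(event) is not None


if __name__ == "__main__":
    for line in (
        "2011-11-08 14:50:38 daytime event",
        "2011-11-08 20:00:00 first event of the night",
        "2011-11-09 05:59:59 last event of the night",
    ):
        print(should_drop(line), line)
```

Events without a leading timestamp in that layout are kept, which is the safer default for a filter like this.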

Hope this helps, or at least serves as inspiration to somebody more knowledgeable than me to work out the exact steps to take.

regards,

kristian

Re: Part-time on-demand Indexing

_d_ — Sat, 12 Nov 2011 23:27:02 GMT

Perhaps a better option than completely turning off the forwarder would be to simply disable that input. The assumption here is that you may need the forwarder to monitor other files.

I normally pack an app.conf and an inputs.conf in a (rather conveniently named) input app; both files reside under $SPLUNK_HOME/etc/apps/my_input_app/local. The inputs.conf contains the monitor stanza that points to where your files reside and other options including followTail=1; the app.conf contains the following:

[install]
state = enabled

I would then have a cron job that runs according to your schedule and does the following:

overwrites or swaps out the normal app.conf with one that has state=disabled
restarts forwarder's splunkd
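
A minimal sketch of those two steps, written in Python so cron can call it. The function names and the idea of passing the directories as arguments are illustrative assumptions, not anything Splunk ships. The only Splunk command used is "splunk restart" from $SPLUNK_HOME/bin.

```python
import subprocess
from pathlib import Path


def write_app_state(app_local_dir, enabled):
    """Overwrite app.conf so its [install] stanza enables or disables the app."""
    state = "enabled" if enabled else "disabled"
    conf = Path(app_local_dir) / "app.conf"
    conf.write_text(f"[install]\nstate = {state}\n")
    return conf


def restart_forwarder(splunk_home):
    """Restart splunkd on the forwarder so the new app state is picked up."""
    splunk_bin = Path(splunk_home) / "bin" / "splunk"
    subprocess.run([str(splunk_bin), "restart"], check=True)


# Example wiring for a cron-driven script (paths are assumptions):
#   write_app_state("/opt/splunkforwarder/etc/apps/my_input_app/local", False)
#   restart_forwarder("/opt/splunkforwarder")
```

One cron entry would call this with enabled=True at 8:00p and another with enabled=False at 6:00a. This covers only the swap and restart. It does nothing about where the forwarder resumes reading.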

EDIT_1: As Kristian points out, followTail=1 only applies to files the first time they are picked up. After that, Splunk's internal file position records keep track of the file. This means that the fishbucket files will tell Splunk where it left off, and it will pick up the old, unnecessary data as well as the real-time data. As I remark below, I would try playing with the ignoreOlderThan setting (using seconds for better resolution).

Hope this helps.


Re: Part-time on-demand Indexing

kristian_kolb — Sun, 13 Nov 2011 01:45:44 GMT

Does the enabling/disabling of an app/monitor stanza actually clear the fishbucket for the inputs involved, i.e. won't the forwarder pick up where it left off?

Re: Part-time on-demand Indexing

_d_ — Sun, 13 Nov 2011 01:55:42 GMT

No, you're right - the fishbucket won't be purged. He can, though, try to play with the ignoreOlderThan setting (in minutes or seconds for better resolution). But, yes, it is not trivial and requires a lot of testing.

Re: Part-time on-demand Indexing

kristian_kolb — Sun, 13 Nov 2011 11:39:20 GMT

I UPDATED my original answer, since you may be on to something here.

Re: Part-time on-demand Indexing

dwaddle — Sun, 13 Nov 2011 17:00:47 GMT

The ignoreOlderThan and followTail options are definitely interesting and might work. But it sounds like the most straightforward approach is to have the originating system rotate logs at 20:00 and 06:00. Or, even hourly, if the system is producing enough logs to justify it. (And hourly might be easier to configure in something like log4j). And then use blacklist and whitelist, both of which are well-known and surprise free. The other options are complicated enough to worry me about long-term reliability.
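
As an illustration of the whitelist idea, here is a sketch of a pattern that accepts only the night-time files under hourly rotation. The file naming (app-YYYY-MM-DD-HH.log) is purely an assumption for the example, and the pattern is exercised in Python rather than in inputs.conf, so treat it as a starting point to test.

```python
import re

# Assumption: hourly rotation produces files named app-YYYY-MM-DD-HH.log,
# where HH is the hour the file covers. Night hours are 20..23 and 00..05.
NIGHT_FILES = re.compile(r"app-\d{4}-\d{2}-\d{2}-(?:2[0-3]|0[0-5])\.log$")


def is_night_file(path: str) -> bool:
    """Return True if the path names a file covering an hour between 20:00 and 05:59."""
    return NIGHT_FILES.search(path) is not None


if __name__ == "__main__":
    for hour in ("05", "06", "19", "20"):
        path = f"/var/log/app/app-2011-11-08-{hour}.log"
        print(is_night_file(path), path)
```

The file for hour 05 holds events up to 05:59:59, so the 06:00 boundary falls exactly on a rotation.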

As Kristian mentioned, if you could nullQueue this data that would be the ideal approach and wouldn't require application changes. From the docs on transforms.conf, _time is a valid field to use as a SOURCE_KEY. So, in theory, you could precompute a series of regular expressions expressing periods of 06:01 - 19:59 in time_t format for future dates. Such regexes would probably be nontrivial and would need to be maintained for the life of the system to add in new time_t values. I wouldn't suggest trying this at home.
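
To make the maintenance burden concrete, this small sketch computes the time_t bounds of the 06:01 - 19:59 window for a few consecutive dates. It assumes UTC for simplicity. A real system would have to use the local time zone and account for daylight saving changes, which makes things worse. The function names are made up for the example.

```python
from datetime import date, datetime, time, timedelta, timezone


def daytime_window_epoch(day):
    """Return (start, end) epoch seconds for 06:01:00 to 19:59:59 UTC on the given day."""
    start = datetime.combine(day, time(6, 1, 0), tzinfo=timezone.utc)
    end = datetime.combine(day, time(19, 59, 59), tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def windows(first_day, days):
    """Return the daytime windows for a run of consecutive days."""
    return [daytime_window_epoch(first_day + timedelta(days=i)) for i in range(days)]


if __name__ == "__main__":
    for start, end in windows(date(2011, 11, 8), 3):
        print(start, end)
```

Each day's window is a different pair of ten-digit numbers, 86400 apart from the previous day's. No single digit pattern covers them all, so a separate numeric-range regex would be needed for every future date.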

If you can't get the providers of the logfiles to do rotation to help you, then I would suggest filing an ER to ask for something like a _time_of_day key (In the format of HH:MM:SS.ssssss or similar) that would be usable in transforms.conf for the purpose of sending data to the nullQueue.

Re: Part-time on-demand Indexing

kristian_kolb — Mon, 28 Sep 2020 10:05:55 GMT

Exactly my point regarding the regex for _time - I haven't had time/reason to figure out if date_hour (which is derived from _time) is computed at the parsing stage, or rather if it's computed before the nullQueue routing would take place. Dealing directly with epoch time is more likely than not going to give headaches in the long run.

Re: Part-time on-demand Indexing

dwaddle — Mon, 14 Nov 2011 02:54:50 GMT

According to the docs for transforms.conf, date_hour is not a supported field for SOURCE_KEY. So, I'm quite confident it is computed too late. Agreed that dealing with epoch time would be insanely difficult.