Solved: Put Data in Separate Index Based on Timestamp II

chris · ‎09-07-2010

I know this question has been asked before (http://answers.splunk.com/questions/712/put-data-in-separate-index-based-on-timestamp), but we will start our end-of-year tests soon. Some of our test servers will simulate what will happen on Dec 31st at midnight. We would like the data from those test servers to go into a different index somehow.

I'd like to know if anyone has done anything similar before. We're thinking about setting up a temporary indexer and then reconfiguring syslog and our Splunk forwarders to make sure that our main data does not get polluted.

Any Ideas?

Thanks Chris

Lowell · ‎09-07-2010

Another option is to use a transformer to set the _MetaData:Index property. I would only suggest this if you have very similar timestamps across all of your events; otherwise writing a proper regular expression will be very difficult.

This example assumes that only events for Dec 31 2010 and Jan 1 2011 will occur for this test. In other words, if you forget to correct your clock and the system rolls over to Jan 2, 2011, then your events will end up in your current index. Here is an example set of config files. (I would recommend you put them in an app that you disable as soon as your testing period is done. You obviously don't want your real events on Dec 31 and Jan 1 to end up in your testing index.)

props.conf:

[syslog]
TRANSFORMS-year_end_testing = route_index_YE_testing

[sourcetype-n]
TRANSFORMS-year_end_testing = route_index_YE_testing
...

transforms.conf:

[route_index_YE_testing]
REGEX = ^(Dec\s+31|Jan\s+1)\s
FORMAT = test_ye
DEST_KEY = _MetaData:Index

In this example, "test_ye" is the name of your testing index, which you must create. Also, "sourcetype-n" is a placeholder. You must explicitly list all sourcetypes that will be involved. Each sourcetype must use this transformer (or a similar transformer, if you create a different one for each of your timestamp formats), or only part of your data will be routed to the correct location.

If you aren't very familiar with index routing like this, aren't fluent in writing and testing regular expressions, or don't have full control over your sourcetypes, then one of the other options would probably be better. They all have different pros and cons, and this could be rather tricky to get right on the first try.
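For reference, creating the index could look like the following minimal sketch of an indexes.conf stanza. The paths follow the usual $SPLUNK_DB layout and are an assumption, not something taken from the setup described here:

```ini
# indexes.conf (sketch; paths assume the conventional $SPLUNK_DB layout)
[test_ye]
homePath   = $SPLUNK_DB/test_ye/db
coldPath   = $SPLUNK_DB/test_ye/colddb
thawedPath = $SPLUNK_DB/test_ye/thaweddb
```

The index can also be created from the Manager UI instead; either way it has to exist on the indexer before any events are routed to it.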

chris · ‎09-09-2010

Yes that makes a lot of sense

maverick · ‎09-08-2010

Another option, assuming that all syslog events being sent in from your specific test servers contain the range of future timestamps, would be to use this SAME config shown above, but regex match the host value and re-route to your test index based on that instead of the timestamp. Make sense?
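A rough sketch of what that host-based variant might look like. The host names "testsrv01" and "testsrv02" and the stanza name "route_index_by_host" are hypothetical placeholders, and "test_ye" is the test index from the config above:

```ini
# props.conf (sketch)
[syslog]
TRANSFORMS-year_end_testing = route_index_by_host

# transforms.conf (sketch; host names are hypothetical)
[route_index_by_host]
# Match against the host metadata instead of the raw event text.
# The value of this key is prefixed with "host::".
SOURCE_KEY = MetaData:Host
REGEX = ^host::(testsrv01|testsrv02)$
FORMAT = test_ye
DEST_KEY = _MetaData:Index
```

This assumes the host field already holds the test server's name when the transform runs. If the host is extracted from the syslog event itself at index time, it is worth verifying that on a small sample before relying on it.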

chris · ‎09-08-2010

Cool, this is what I was hoping for. The challenge is that the test is a bit longer than just Dec 31/Jan 1 (I only wanted to illustrate what we are doing), and there will be several time jumps on those test servers. We will discuss whether we want to try this or set up the separate instance (which you don't consider to be the wrong approach). Thank you for your input.

Lowell · ‎09-07-2010

Also keep in mind that in Splunk 4.0 and newer it is possible to have multiple "hot" buckets per index which helps in this kind of situation where you have data being loaded from different points in time (although more often this is used for historical data, there is no reason why future data would be handled differently.) I think the default bucket span is 90 days, so as of right now, loading any data for Dec 31, 2010 should cause a new bucket to be created (as the date approaches, this will no longer be true... it all depends on the rotation of your buckets.)

With that said, using a separate index would be best. And if you have any concern about missing inputs or not being able to separate everything out, then perhaps setting up a temporary "test" Splunk instance may be worth the effort. (If you've ever dealt with the results of messed-up timestamps before, you know how painful it can be to fix this after the fact.) Some of this will depend on whether or not you want to keep this test data around after you're done testing. You have to decide what you're comfortable with.

Config settings to consider:

Make sure you review the following settings in props.conf. You may need to customize these in order for Splunk to accept your future dates:

MAX_DAYS_HENCE = <integer>
MAX_DIFF_SECS_HENCE = <integer>

Also see the following settings in indexes.conf:

quarantineFutureSecs = <non-negative number>

I would suggest that you read the docs related to these settings and understand what is going on before trying this.
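As an illustration only, here is how those settings might be raised to allow timestamps roughly six months ahead. The numbers are assumptions picked for the example (180 days = 15552000 seconds), not recommendations, and "test_ye" is a hypothetical test index name:

```ini
# props.conf (sketch; values are illustrative assumptions)
[syslog]
# Accept timestamps up to about 180 days in the future.
MAX_DAYS_HENCE = 180
# Allow an event's timestamp to jump ahead of the previous event's
# by up to 180 days (180 * 86400 seconds).
MAX_DIFF_SECS_HENCE = 15552000

# indexes.conf (sketch; add this to the stanza that defines the test index)
[test_ye]
# Do not quarantine events dated up to about 180 days ahead.
quarantineFutureSecs = 15552000
```

Check the props.conf and indexes.conf spec files for the defaults and limits in your version before changing any of these.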

chris · ‎09-08-2010

Yes we will look at those settings, thank you

maverick · ‎09-07-2010

Unless there is some piece of this setup that I am unaware of, it's pretty simple to do what you are asking because the index is set when you add the Data Input monitor in Splunk Manager.

The default index is set as 'main', but you can override that and specify a new test index that you create yourself in the Manager>>Indexes page.

Therefore, on your test servers, add the data inputs and be sure to specify the test index you create and all of that data will go into the index you specify.
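The same thing can be expressed directly in inputs.conf on the test server's forwarder. A minimal sketch, where the monitored path and the index name "test_ye" are hypothetical:

```ini
# inputs.conf on the test server (sketch; path and index name are hypothetical)
[monitor:///var/log/messages]
sourcetype = syslog
# Send everything from this input to the test index instead of 'main'.
index = test_ye
```

This only helps when the test data arrives through its own input; it does not separate events that share an input with production data.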

I hope this makes sense.

chris · ‎09-09-2010

We will change the system time on our test servers, so all the events will be in the future. The suggestion in your comment on Lowell's answer will help us. Thank you very much.

maverick · ‎09-08-2010

Oh I see now. So you may have another option, in this case. But need to confirm: Will you have specific test hosts that will ONLY send syslog containing future timestamps? OR will your test hosts send both real and future timestamped syslog events?

chris · ‎09-08-2010

I didn't describe our setup properly ... that is what I meant to say

chris · ‎09-08-2010

I did describe our setup properly. Changing the Splunk LWFs' configuration will be easy. We have a central syslog server (we had that before we had Splunk) that collects all the syslog data in a directory (with a lot of subdirectories for facilities and severities per server), and we just index that directory in Splunk recursively. So the syslog data from the tests will get mixed with our real data if we don't do anything. I was hoping for an easy switch that would separate everything that is in the future into a different place, so we won't end up with a mess in our main data.
