Index Best Practice

ipicbc
Explorer

I am ingesting events from log files. There are 50 log files, each with 10,000 lines a day, and they get rolled daily with retention of 10 days. The file formats are identical, so there is only 1 source type. So I have 500 files in total of which 50 are changing at any time, and maybe 5,000,000 total events in Splunk.

My question relates to best practice for indexing for query performance. I don't believe there are good reasons in my use case to go in any particular direction because of access control or retention.

At the moment I just have 1 index for everything. But I could create a new index each day across all log files, including the date in the index name. Alternatively I could have a separate index for each log file. Or both.

I would like to hear about what would be best practice in terms of theory and your practical experience, please.

1 Solution

mattymo
Splunk Employee

Hi ipicbc!

Based on what you have described, I would suggest you are already set up for success.

If the data is truly all the same format, then one sourcetype is the way to go.

If there are any logical segregations in the hosts/files, such as the service or function the hosts provide, the group who will be searching the data (although you already alluded to there being no need for access control), or any other grouping, then maybe I'd split up the indexes accordingly.

Otherwise I would keep the one index and rely on writing searches that explicitly target the events I want to see. Creating tons of indexes will lead to a bad time. Whatever perf you might gain will easily be outweighed by admin overhead.

Splunk creates index-time fields like _time, host, sourcetype, and source that allow you to filter your events down efficiently. With the Search Processing Language (SPL) you should be able to write very efficient searches that make sifting through those events easy and performant.
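A minimal sketch of what "explicit" means here, written as a small Python helper that assembles the SPL search string. The index name main, the sourcetype app_log, and the file path are hypothetical placeholders, not values from this thread; the field names and the earliest/latest time modifiers are standard SPL.

```python
def build_search(index, sourcetype, source=None, host=None,
                 earliest="-24h", latest="now"):
    """Assemble an SPL search string that filters on index-time fields."""
    terms = [f"index={index}", f"sourcetype={sourcetype}"]
    if source is not None:
        # Quote the path so slashes and dots are treated as one value.
        terms.append(f'source="{source}"')
    if host is not None:
        terms.append(f"host={host}")
    # Bound the time range so only the relevant buckets are opened.
    terms.append(f"earliest={earliest}")
    terms.append(f"latest={latest}")
    return " ".join(terms)


# Hypothetical values, for illustration only.
query = build_search("main", "app_log", source="/var/log/app/app01.log")
print(query)
```

The more of these terms a search pins down, the fewer events Splunk has to pull off disk, and that is where most of the gain you would hope to get from splitting indexes comes from anyway.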

If the events contain fields that you want to report on and the searches need to be even faster, the next levers to pull for quick search/report results would be summary indexing and data model acceleration (which builds tsidx summary files). Both prepare the information you want to work with ahead of time and shed the data you don't need for your reports.
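As a rough illustration of the tsidx idea, the tstats command can count events using only the index-time fields stored in the tsidx files, without reading the raw events. The helper below is a sketch that builds such a search string; the index and sourcetype values are hypothetical placeholders, and it assumes a single group-by field.

```python
def build_tstats_count(index, sourcetype, by_field="source"):
    """Build a tstats search that counts events per index-time field."""
    return (
        f"| tstats count where index={index} "
        f"sourcetype={sourcetype} by {by_field}"
    )


# Hypothetical values, for illustration only.
tstats_query = build_tstats_count("main", "app_log")
print(tstats_query)
```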

There are many ways to ensure your search performance is optimal, but in short, based on what you have described, I wouldn't chase segmenting indexes as one of them.

mattymo
Splunk Employee

As a follow-up topic that can also help ensure your indexes are configured as well as they can be, definitely get comfortable with the concept of buckets and how they age:

http://docs.splunk.com/Documentation/Splunk/6.5.2/Indexer/HowSplunkstoresindexes

As a general rule, if the index is doing more than 10 GB a day, you want to ensure maxDataSize is set to auto_high_volume for that index in indexes.conf.

See docs for the full story.
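A back-of-the-envelope check against that rule, using the numbers from the question. The average event size of 200 bytes is purely an assumption for illustration (the thread does not state one), so substitute a figure measured from your own logs.

```python
FILES = 50                        # from the question
LINES_PER_FILE_PER_DAY = 10_000   # from the question
RETENTION_DAYS = 10               # from the question
ASSUMED_BYTES_PER_EVENT = 200     # assumption, not stated in the thread
HIGH_VOLUME_THRESHOLD_GB = 10     # the general rule quoted above

events_per_day = FILES * LINES_PER_FILE_PER_DAY
retained_events = events_per_day * RETENTION_DAYS
daily_gb = events_per_day * ASSUMED_BYTES_PER_EVENT / 1e9  # decimal GB

needs_high_volume = daily_gb > HIGH_VOLUME_THRESHOLD_GB

print(f"events per day:  {events_per_day:,}")
print(f"retained events: {retained_events:,}")
print(f"daily volume:    {daily_gb:.2f} GB")
print(f"above 10 GB/day: {needs_high_volume}")
```

Under that assumption the daily volume is about 0.1 GB, far below the 10 GB/day mark, so a single index with default bucket sizing looks like a reasonable fit for this use case.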

ipicbc
Explorer

Great advice, very much appreciated!

ryan_gates
Explorer

I downvoted this post because this should be a comment rather than an answer.

ppablo
Community Manager

Hey @ryan_gates

Just FYI, please reserve downvoting for proposed solutions that could be harmful in a Splunk environment or that go against known best practices, not for something posted in the wrong area. We want to encourage an environment in the forum where people don't feel afraid to contribute. Just commenting that the answer should have been a comment would have been fine for something like this, and we can get it converted from there.

Thanks for being a part of the Answers community, and hope to see some questions and answers from you in the near future 🙂

Patrick
