
Indexing multiple sourcetypes - 1 index vs multiple

shahzadarif
Path Finder

I would like to know the best approach to this.
I need to index various logs in Splunk for our web servers. These servers run nginx, celery, supervisord, nodejs, a custom application, etc. I'm thinking of creating a separate index for each of these types of logs; for example, all nginx (access/error) logs go to one index, celery logs to another, and so on. My reasons: it gives me more flexibility with retention policies, searches are faster when an index name is provided, and different access levels can be applied if required.
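For illustration, this is roughly the kind of indexes.conf I have in mind on the indexers (index names and retention values are just examples, not settled decisions):

    # indexes.conf -- one index per log type, each with its own retention
    [web_nginx]
    homePath   = $SPLUNK_DB/web_nginx/db
    coldPath   = $SPLUNK_DB/web_nginx/colddb
    thawedPath = $SPLUNK_DB/web_nginx/thawedb
    # roll data to frozen (deleted by default) after ~90 days
    frozenTimePeriodInSecs = 7776000

    [web_celery]
    homePath   = $SPLUNK_DB/web_celery/db
    coldPath   = $SPLUNK_DB/web_celery/colddb
    thawedPath = $SPLUNK_DB/web_celery/thawedb
    # shorter retention (~30 days) for the noisier celery logs
    frozenTimePeriodInSecs = 2592000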
Is this the best approach? How are you currently doing it in your environment?
Thanks


adayton20
Contributor

I recently went through the Architecting and Deploying Splunk course, and the instructor touched on this subject. There are multiple right answers and multiple wrong answers; it really depends on your environment, preferences, use case, and the volume of data per index. If your reasoning centers on data retention and you aren't worried about things like correlation or the complexity of multi-index/sourcetype search queries, then go for it.

More indexes might help search performance in some circumstances, but only if you use them: when you divide similar data types into different indexes, you need to specify the index in your searches to see the benefit, because naming the index lets Splunk skip every other index's buckets entirely.
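For example (index and sourcetype names assumed from the question), a scoped search like

    index=web_nginx sourcetype=nginx_access status=500

only has to open the buckets of web_nginx, whereas the same terms without an index restriction get checked against every index your role searches by default.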

If you're creating indexes mainly for organization and search performance, in my opinion I'd stick with separating each of those log sources into individual sourcetypes instead. For example, I'm using a syslog server in my environment to collect logs from various appliances in my infrastructure. I created rules in my syslog.conf file to organize the data into individual directory structures, and a universal forwarder sits on the server monitoring each directory, forwarding the events with their respective sourcetypes while keeping everything in a single syslog index. My data is organized in a way that makes sense to me (and really, that's what matters in your case, since you're the one managing it). I can simply specify sourcetype=cisco or sourcetype=juniper to search different logs instead of specifying a different index, which also helps when I'm correlating data. There's no drastic performance benefit to either technique from what I've experienced so far, but then again, I haven't experimented with creating a massive repository of indexes.
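A rough sketch of the forwarder side of that layout, assuming syslog.conf rules already write each appliance's events into its own directory (paths and names are examples, not my actual config):

    # inputs.conf on the universal forwarder
    [monitor:///var/log/remote/cisco]
    sourcetype = cisco
    index = syslog

    [monitor:///var/log/remote/juniper]
    sourcetype = juniper
    index = syslog

Everything lands in the one syslog index, so a search only needs sourcetype=cisco or sourcetype=juniper to narrow down.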

I touched on the syslog stuff here: https://answers.splunk.com/answers/504420/forward-syslogs-with-correct-sourcetypes.html#answer-50445...

On the other hand, I do have some special cases where I forward data from an appliance to its own index because of the volume it generates. My proxy logs are massive, so I keep those separate from my syslog index. Again, I think it really comes down to preference and use case.
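That exception is just another monitor stanza pointing at a dedicated index (names illustrative again):

    # inputs.conf -- high-volume proxy logs get their own index
    [monitor:///var/log/remote/proxy]
    sourcetype = proxy
    index = proxy

Giving the proxy data its own index lets it have its own retention and sizing without affecting anything else in syslog.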

Every index you create means Splunk has more buckets whose tsidx files may need checking against your search terms. If you need to find a term over a 30-day period across all indexes, Splunk must look in every one of them, and the speed of the results depends on the hardware specs of your server (especially disk type), how many indexes you have, their volume, the complexity of your search, and the time range you're searching. If you intend to search multiple large-volume indexes over long periods, multiple indexes can add latency at search time. On the other hand, if you're searching specific, smaller indexes over shorter periods, your searches should be much faster than looking in one index with multiple sourcetypes.
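As a concrete (hypothetical) contrast, a broad search like

    index=* "connection timed out" earliest=-30d

has to consult the tsidx files of every bucket in every index covering those 30 days, while

    index=web_nginx "connection timed out" earliest=-24h

touches only one index's buckets for one day.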

There is also the added difficulty of correlating searches. In your case, you want to create indexes for nginx, celery, supervisord, nodejs, a custom application, etc. If you ever need to correlate results across those indexes, you might have to write a search containing several Boolean operators, multiple subsearches, transactions, joins, and/or appends. That hurts search performance and makes it harder to get fast, accurate, meaningful results. It can also become a problem when a team of people is watching dashboards with multiple panels built on complex queries like that: bear in mind that each search ties up a CPU core and won't release it until the search finishes. Then again, if you have no use case for correlating across these indexes/sourcetypes, I wouldn't worry too much about it.
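For instance (field, index, and sourcetype names are assumptions for illustration), correlating a request across two indexes might require a subsearch:

    index=web_nginx sourcetype=nginx_access status=502
        | join request_id [ search index=app_custom sourcetype=custom_app log_level=ERROR ]

whereas the same data in one index with two sourcetypes can often be correlated in a single pass:

    index=web (sourcetype=nginx_access OR sourcetype=custom_app)
        | transaction request_id

Subsearches like the join above also carry their own result and runtime limits, which is another thing that can bite you at scale.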

I also encountered a customer who did something similar by creating several indexes. They had around 150 indexes and over 50 dashboards, each with 5 or more panels of tables and visualizations. While the panels were all meaningful representations of their data, the setup quickly overwhelmed analysts and admins, created confusion, made searches and panels execute slowly, and led them to miss important events in their infrastructure.

Just some food for thought.

nickhills
Ultra Champion

I think you are on exactly the right track.

You can use different indexes to apply different retention periods and security restrictions.
Your approach is sensible and spot on.
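For the access side, roles can be limited to specific indexes in authorize.conf (role and index names here are just examples):

    # authorize.conf -- this role can only search the nginx index
    [role_web_ops]
    srchIndexesAllowed = web_nginx
    srchIndexesDefault = web_nginx

Retention is then simply frozenTimePeriodInSecs (or size limits) per index in indexes.conf.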

Good luck

If my comment helps, please give it a thumbs up!