I have several questions about data architecture that are rooted in CIM data models and performance considerations.
Background:
We ingest about 2 TB of new log data every day.
Some sourcetypes receive hundreds of millions of new events per day, one receives 1.1 billion new events per day, and quite a few receive a few million new events per day.
From a data architecture standpoint, we generally drop events from a given log generator type into an index and sourcetype for that technology; for example, Windows events go into index=win, sourcetype=win. These are not the real names, but you get the idea.
When we evaluated the CIM data models, we found that Windows events span a range of data models, depending on the event type.
As an example, Windows events can potentially be part of the following CIM data models (incomplete list):
Alerts
Application State
Authentication
Certificates
Inventory
etc...
Questions:
Given that we have massive data volumes that could adversely affect the performance of any given search, wouldn't it be prudent to create a data architecture that sorts data into smaller piles, by index and sourcetype, that more closely mimic the CIM data models?
Would changing our sourcetype for Windows events from sourcetype=win to sourcetype=win-authentication, sourcetype=win-application-state, and so on have significant implications for performance, and potentially reduce the search target area of a given data model from one really big 'pile' to a smaller, more specific 'pile' of event types?
Would such a data architecture give noticeably better performance than data model acceleration, or in addition to data model acceleration, or would it be a wash?
Does anyone else out there use data architecture designs at the index and sourcetype level because of performance concerns? If so, can you give an example of your design and ballpark data volumes? What other considerations led you to that design?
Are there any flaws in this line of thinking? Is it too much work to manage when contrasted with potentially small performance gains? Are the gains worth the overhead of setting up and maintaining such a data architecture?
The key to Splunk performance is reducing a search to the minimum set of results as early as possible, i.e., at the indexer. If you have an index with all Windows events and are searching for just authentication events, you'll be better off doing as you said: adding a sourcetype that separates those auth events from all the other win events. Splitting the events into a different index won't make much of a difference in performance unless it's very sparse data that you're regularly searching, and having a large number of indexes won't help your sanity. It's more about making sure the data you want to search has specific information that eliminates all other unneeded data as early as possible in the search process. 2 TB/day is not a lot of data for Splunk if tuned properly; your search optimization will matter more than anything. Read up on Splunk search optimization here: https://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutoptimization.
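To make that concrete, here is a rough sketch of the difference, using the hypothetical sourcetype names from your question and Windows EventCode 4625 (failed logon) as the example field; the names are placeholders, not a prescription.

Before the split, the indexers have to consider every Windows event in the index:

index=win sourcetype=win EventCode=4625 earliest=-4h

After the split, only the narrower authentication sourcetype is scanned:

index=win sourcetype=win-authentication EventCode=4625 earliest=-4h

And if you accelerate the CIM Authentication data model, a tstats search can skip the raw events entirely:

| tstats summariesonly=true count from datamodel=Authentication where Authentication.action="failure" by Authentication.src, Authentication.user

The first two searches differ only in how much data the indexers have to touch, which is exactly the "smaller pile" effect you're after; the tstats search is what data model acceleration buys you on top of that.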
Make sure you're well tuned for large data ingestion (auto_high_volume buckets, etc.), and that your searches are designed to limit the data being looked at: tight time ranges, specific sourcetypes, explicit indexes, and so on. Don't do index=* sourcetype=win*. Run the data quality reports in the Monitoring Console and make sure your events are clean and that your data's time spans are current. Add tags to data on ingest if necessary.
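As a sketch of both of those pieces (the index name, transform name, and event code list below are made up for illustration, so adjust them to your environment): in indexes.conf the high-volume index can use the larger bucket size, and a props/transforms pair on the parsing tier can rewrite the sourcetype at ingest so the auth events land in their own, smaller pile.

indexes.conf:

[win]
homePath    = $SPLUNK_DB/win/db
coldPath    = $SPLUNK_DB/win/colddb
thawedPath  = $SPLUNK_DB/win/thaweddb
maxDataSize = auto_high_volume

props.conf (on the indexers or heavy forwarders that parse the data):

[win]
TRANSFORMS-split_auth = win_set_auth_sourcetype

transforms.conf:

[win_set_auth_sourcetype]
# Match the Windows logon/logoff event codes in the raw event (hypothetical list).
REGEX    = EventCode=(4624|4625|4634|4672)
DEST_KEY = MetaData:Sourcetype
FORMAT   = sourcetype::win-authentication

One thing to weigh: once events are rewritten at index time, any dashboards and saved searches that assume sourcetype=win need updating, which is part of the maintenance overhead you asked about.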
If you run a search which returns huge volumes of data to the search head, you can quickly make a mess of your environment. For example, say you had one billion events with sourcetype "data" that were not well delimited, so Splunk could not extract fields from them (say, a syslog-formatted message with a JSON object inside), and you wanted to search for a key with the value "pizza" that occurs only once in those billion events. With JSON inside a syslog message field, the individual keys in the JSON will not be extracted by default (you will have a timestamp and a message field, and no fields from within the JSON object). You would have to run spath to extract the fields, something like index=blah sourcetype=data | spath input=message | search fieldname=pizza, and Splunk cannot do that on the indexers; it has to retrieve all billion events in the initial search, return them to the search head, and then run spath there. You'd be hosed. If instead your data is onboarded so the fields are properly extracted (whether at index time or automatically at search time), then your search would be index=blah sourcetype=data fieldname=pizza, and Splunk would find the one event with that key on the indexers very quickly.
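For what it's worth, one way to fix that kind of data at onboarding time, strictly as a sketch (the sourcetype name and the syslog header pattern are assumptions about what the data looks like), is to strip the syslog prefix during parsing and let Splunk's automatic JSON extraction handle the rest at search time:

props.conf:

[data]
# Strip everything before the first { so _raw becomes pure JSON.
# The pattern is a guess; match it to the real syslog header format.
SEDCMD-strip_syslog_header = s/^[^{]+//
# Auto-extract the JSON keys at search time.
KV_MODE = json

Note that this throws away the syslog header text from _raw, so if you need information from that header (host, facility, etc.) you'd want a different approach, such as keeping the header and extracting the JSON portion at search time instead.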