You have a few options, each with its own pros and cons, and without knowing the data I can only make an educated guess about what would work best for you.

Data model acceleration - you could put your data into data models, either existing ones or custom ones that fit your data, and accelerate them. In theory this should significantly boost the speed at which you search; mileage may vary, but you often get searches that are orders of magnitude faster. The con is that you will probably roughly double the size of your indexed data, because acceleration keeps your non-accelerated logs and adds a set of accelerated summaries on the index, meaning you will use more storage space. Additionally, every 5 minutes or so the acceleration search runs to keep those summaries up to date, which permanently occupies RAM and CPU on the box. And depending on your comfort with building a data model or fitting your data into an existing one, it is a little labor intensive to set up the first time. As for RBAC, Splunk applies the same RBAC rules to your accelerated data as exist on the underlying index, so you won't need any special RBAC considerations.

Summary indexing - this is an amazing tool for doing exactly that: summarizing the data. For example, with network logs you have probably seen that in a given time period, when two machines talk to each other you may end up with hundreds of "connection" logs. If your use case is not interested in each of those logs but only in whether the two IPs talked at all (think threat intelligence - did we go to bad site x), then you could create a single summary event that says IP address x talked to IP address y 100 times, and write it to a summary index. Summary data gets its speed advantage not by speeding up how you look for the needle in the haystack, but by shrinking the haystack - in my example it is 1/100th the size of the original index. This is a useful solution if a summary of the logs is good enough for what your analysts are looking for, which may or may not be the case. In the world of threat intel we often have to look back at network traffic from 18 months ago. We look at the summary data; if we get a hit, the summary tells us what day the hit was on, but the analysts may have to go look at the unsummarized logs for that day to get a better idea of what really happened, because summary logs gain their power by being exactly that - a summary.

For RBAC purposes, you can write your summary events to the same index the original logs live in. The term "summary index" implies a special index, but that is not really the case: summary events can be written to any index; they just show up as a new source, with sourcetype stash. So if you summarize your data into the same index the original logs came from, the same RBAC rules apply to them.

Here is a video on how to summarize data: https://youtu.be/mNAAZ3XGSng

Below is a simple SPL concept for summarizing Palo Alto firewall logs:

index=pan sourcetype=connections
| stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out earliest(_time) as _time count by src_ip, dest_ip
| collect index=pan source="summarized_pan_connections"

You now need to decide how often to summarize your logs and set up a scheduled saved search to run that query.
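As a sketch, a savedsearches.conf stanza along these lines would run that summary search shortly after midnight over the previous day's logs. The stanza name and schedule here are just assumptions for illustration - adjust them to your environment, or build the equivalent scheduled report in the UI:

[summarize_pan_connections]
# run every night at 00:30 over yesterday's logs
enableSched = 1
cron_schedule = 30 0 * * *
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
search = index=pan sourcetype=connections \
| stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out earliest(_time) as _time count by src_ip, dest_ip \
| collect index=pan source="summarized_pan_connections"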
Once it runs, you just query the data with index=pan source="summarized_pan_connections".

Another option is to schedule the searches behind your dashboard panels - each panel runs its query once at a specified time, and everyone who comes to the dashboard sees the data produced by that scheduled run. This is relatively simple to set up and keeps your RBAC rules, but if having the latest logs on the dashboard panels is your biggest priority, this one starts to fall apart.

I have given three suggestions. In my environment I have a similar situation to yours - a large amount of data where looking back over long periods is slow. We actually run a little mixture of all of it: we accelerate a day's worth of data, then in the middle of the night we summarize yesterday's logs. When users hit the dashboard, the query is a combination of the accelerated data for today and the summarized data for the previous days (sketched at the end of this answer).

Hope this gives you some ideas of a path forward. There will be plenty of things to consider, particularly how "fresh" the data needs to be. Is a summary of the logs good enough? Can you live with static data in your dashboards that refreshes every day or every hour?
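To make that mixed approach a bit more concrete, here is a rough sketch of the combined query. It assumes the Palo Alto logs are mapped to an accelerated CIM Network_Traffic data model and that you are building the summary shown above - the data model, field names, and time ranges are assumptions to adapt to your environment:

| tstats summariesonly=true sum(All_Traffic.bytes_in) as bytes_in sum(All_Traffic.bytes_out) as bytes_out count
    from datamodel=Network_Traffic.All_Traffic
    where earliest=@d latest=now
    by All_Traffic.src_ip All_Traffic.dest_ip
| rename All_Traffic.src_ip as src_ip, All_Traffic.dest_ip as dest_ip
| append
    [ search index=pan source="summarized_pan_connections" earliest=-30d@d latest=@d ]
| stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out sum(count) as count by src_ip, dest_ip

The tstats portion reads today's traffic from the accelerated data model summaries, the append pulls in the pre-summarized events for the older days, and the final stats merges the two sets into one result per IP pair.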