We added Palo Alto data to our Splunk environment a little over two months ago and installed the Palo Alto Networks App for Splunk (v5.2.0).
After two months, we are seeing PA App datamodel_summary files that approach the size of the total indexed data. I need to plan disk space appropriately and am not sure what I should be asking for. Is this normal for the Palo Alto Networks App for Splunk?
Is there a rule of thumb for how I should think about datamodel_summary size to indexed volume?
What options do I have for containing the PA App datamodel?
We have worked with the Splunk DataModel Team to optimize our data model as best we could. However, since every customer's needs are different, we have included fields in the data model that may not be important in your environment. The Splunk admin has the ability to remove fields that are not of importance, which will help shrink the data model's storage needs.
In an effort to continue optimizing our data model, could you please provide feedback on which fields you removed and why? We would really appreciate it.
by "approach the size of the total indexed data" I mean that I have been told by my Splunk admin that after two months, our pan_logs index is 830GB and he gave me two numbers for the datamodel_summary files: 670GB and 850GB. So the datamodel_summary file disk needs appear to be on the order of the indexed data.
I need to understand if this is a linear trend that will continue and what ability I have to control this trend. If I cannot do this, I cannot help my Splunk admin define storage needs.
ddrilic brings up a good point, what level of data acceleration is your PA App set to? Limiting the amount of historical data to accelerate will significantly reduce the summary index consumption.
I am not so much concerned about generally limiting the amount of data as being able to plan for what I should need. Whatever the number, it's a management decision on cost/benefit. But if I estimate wrong and we run out of space or budget, things will not go great for me or our Splunk implementation.
That said, I just looked at the acceleration stats. They show 7 days of data model acceleration and ~100GB size on disk. So there is either something more, or some way that we can be more aggressive in cleaning up the datamodel_summary files.
The default appears to be 7 Days of acceleration for Firewall Logs, Endpoint Logs, and Wildfire Malware Reports.
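If you need tighter control over that window, the acceleration retention period can be set per data model in datamodels.conf. A minimal sketch, assuming a stanza name of [pan_firewall] (the actual stanza names in your environment are whatever appears under Settings > Data models for the app):

```ini
# datamodels.conf (in the app's local/ directory)
# NOTE: the stanza name below is an assumption - use the real
# data model name from Settings > Data models.
[pan_firewall]
acceleration = 1
# Keep only the last 7 days of accelerated summaries; shortening
# this window directly shrinks datamodel_summary disk usage.
acceleration.earliest_time = -7d
```

Shrinking acceleration.earliest_time trades historical search speed for disk: searches outside the accelerated window still work, they just fall back to raw-event scans.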
I'd run the search below to determine the usage per day against your Palo Alto indexes and summary indexes, so you can project an average monthly usage by Palo Alto. The search returns usage per index on a day-by-day basis.
index=_internal source="*license_usage.log*" type=Usage | eval yearmonthday=strftime(_time, "%Y%m%d") | eval yearmonth=strftime(_time, "%Y%m") | stats sum(eval(b/1024/1024/1024)) AS volume_gb by idx yearmonthday yearmonth | chart sum(volume_gb) over yearmonth by idx
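Once you have the per-day GB figures out of a search like the one above, the projection itself is simple arithmetic. A minimal sketch (the daily_gb values below are placeholders, not numbers from this thread):

```python
# Project monthly index growth from observed daily volumes.
# daily_gb values are hypothetical placeholders - substitute the
# per-day figures returned by the license_usage search above.
daily_gb = [12.5, 14.0, 13.2, 15.1, 12.9, 13.7, 14.4]

avg_daily = sum(daily_gb) / len(daily_gb)
projected_monthly = avg_daily * 30

# If summary size tracks indexed volume roughly 1:1 (as reported
# in this thread), budget about the same again for datamodel_summary.
projected_with_summaries = projected_monthly * 2

print(f"avg daily: {avg_daily:.1f} GB")
print(f"projected monthly (index only): {projected_monthly:.1f} GB")
print(f"projected monthly incl. summaries: {projected_with_summaries:.1f} GB")
```

Whether the 1:1 summary-to-index ratio holds long term is exactly the open question in this thread, so treat that multiplier as an upper-bound assumption until the acceleration window settings are confirmed.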
When you say "...approach the size of the total indexed data", what do you mean? Total indexed data over the last two months or today?
Summary indexes are great for large time window searches such as annual reporting, so they will be a subset of your overall indexed data.
It's interesting here - the Palo Alto Networks App for Splunk documentation says:
-- Datamodel acceleration might rebuild itself after installation due to updated constraints - ...
I just wonder if you use the Datamodel acceleration...