I'm trying to use commands like trendline to write a search that will alert on a predicted license violation for the day. While http://answers.splunk.com/answers/39980/license-violation-prediction.html has some good information, as noted in its comments the search does not return accurate results in terms of data volume.
In writing this search I've realized that what I need to predict is the ever-growing sum of license volume during the day. In other words, say X is my total license volume for the day. Each data point in my search is going to be an ever-increasing value for X as more and more data is indexed (example of data points: 1GB at 1AM, 2GB at 2AM, 3GB at 3AM, 4GB at 4AM, etc). My goal, of course, is to predict what X will be at midnight.
While the search commands I know of (such as those in http://wiki.splunk.com/Community:TroubleshootingIndexedDataVolume) will provide a total indexed volume for a time span, I know of none that will plot a series of data points, each representing a running sum of a value at a different point in the day, as with index volume. In other words, right now I can run a search that spans from midnight today until the current time and sums up the total volume of indexed data. However, the search I would need in order to make a prediction would have to give me the summed volume of indexed data from midnight until 1am, from midnight until 2am, from midnight until 3am, and so on up to the current time.
How would I go about creating such a search that gives me the data points I need to make a prediction?
First, to answer the question at the bottom: assuming you have a search that gives you a figure for midnight to 1am, 1am to 2am, and so on (basically a timechart of indexed volume), you can turn that into an accumulated figure using accum:
earliest=@d latest=now ... | timechart sum(volume) as hourly_volume span=1h | accum hourly_volume as running_total
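Filled in for the license usage data, that could look like the sketch below. It assumes the same license_usage.log events used in the prediction search further down (type=Usage, with the byte count in field b), and it uses @h as latest so the last bucket is a complete hour:

```spl
index=_internal source=*license_usage.log* type=Usage earliest=@d latest=@h
| timechart span=1h sum(b) AS hourly_volume
| accum hourly_volume AS running_total
```

Each row then carries both the volume for that hour (hourly_volume) and the total since midnight (running_total), which is the "1GB at 1AM, 2GB at 2AM" series asked about.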
As for the actual use case of predicting today's volume, consider something like this:
index=_internal source=*license_usage.log* type=Usage | timechart span=1h sum(b) AS volume_b | predict volume_b as prediction future_timespan=24 | addinfo | where _time>=relative_time(info_max_time, "@d") AND _time<relative_time(info_max_time, "+d@d") | fields - info* | eval merged = coalesce(volume_b, prediction) | stats sum(merged) as predicted_volume sum(volume_b) as volume_so_far
If run over a reasonably long time range, this uses historical data to predict the volume for the remaining hours of the day, then sums the actual data for today so far plus the predicted data for the remainder of today.
Make sure to use @h as latest so a partial hour is not counted as a whole hour, or decrease the bucket size. Also make sure the figure for volume_so_far lines up with the LURV (License Usage Report View) figure.
Wow, that's fantastic! Much more than I had anticipated getting. That search worked great too. The one challenge I have is that I can't seem to get any results for "volume_so_far" when I set relative start/end times, with Earliest and Latest set to "@d" and "@h," respectively; I'm guessing that's conflicting with the "where" statements. But that's not a big deal: I've been able to work around it with the scheduled search settings, which don't seem to affect the results much.
Thanks again for the great help and for the complete response.
In case others see this and want the results in GB as opposed to raw bytes, I modified the search ever so slightly:
index=_internal source=*license_usage.log* type=Usage | eval GB=((b/1024)/1024)/1024 | timechart span=1h sum(GB) AS volume_b | predict volume_b as prediction future_timespan=24 | addinfo | where _time>=relative_time(info_max_time, "@d") AND _time<relative_time(info_max_time, "+d@d") | fields - info* | eval merged = coalesce(volume_b, prediction) | stats sum(merged) as predicted_volume sum(volume_b) as volume_so_far
For added efficiency, do the conversion to GB at the end so the calculation runs once rather than per event 😄
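A sketch of that variant: the search stays in bytes all the way through, and only the two final sums are converted. The output field names predicted_gb and so_far_gb and the rounding to two decimals are just illustrative choices, not anything Splunk requires:

```spl
index=_internal source=*license_usage.log* type=Usage
| timechart span=1h sum(b) AS volume_b
| predict volume_b as prediction future_timespan=24
| addinfo
| where _time>=relative_time(info_max_time, "@d") AND _time<relative_time(info_max_time, "+d@d")
| fields - info*
| eval merged = coalesce(volume_b, prediction)
| stats sum(merged) as predicted_volume sum(volume_b) as volume_so_far
| eval predicted_gb = round(predicted_volume/1024/1024/1024, 2)
| eval so_far_gb = round(volume_so_far/1024/1024/1024, 2)
```

This also keeps the name volume_b honest, since it still holds bytes.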
You should run this search over e.g. -7d@d to @h, so your prediction has some data to work with. After the prediction is run, the where cuts off everything from before today, and the stats sums up the data so far as well as the data so far plus the prediction for the remainder of the day.
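Putting that time range into the search itself, and adding a final condition so it can drive an alert, might look roughly like this. The 100 GB quota is a made-up number for illustration; substitute your own license size in bytes, and set the alert to trigger when the search returns a result:

```spl
index=_internal source=*license_usage.log* type=Usage earliest=-7d@d latest=@h
| timechart span=1h sum(b) AS volume_b
| predict volume_b as prediction future_timespan=24
| addinfo
| where _time>=relative_time(info_max_time, "@d") AND _time<relative_time(info_max_time, "+d@d")
| fields - info*
| eval merged = coalesce(volume_b, prediction)
| stats sum(merged) as predicted_volume sum(volume_b) as volume_so_far
| where predicted_volume > 100*1024*1024*1024
```

With earliest set to @d instead, predict would only see today's hours and have no history to work from, so the longer window matters.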