Solved: How to predict disk usage for the future in Splunk Machine Learning Toolkit?

akhil36109 · ‎01-27-2018

I store my customers' device logs in an index. Each customer has many devices, and each device has several filesystem paths.
I have the last 30 days of data. Can I predict disk usage for the next 10 days?
Do I have to use a linear regression algorithm, or should I forecast it? I am using the Machine Learning Toolkit.
I have been trying hard, but I cannot get the prediction broken out per customer, per device, per filesystem.

Please help.

Thank you!

DalJeanis · ‎01-27-2018

This doesn't sound like a machine learning use case to me, although I could be wrong there.

Always do the most basic estimate you can first, because it will tell you if your more complicated attempts have gone wonky somehow.
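One way to read "most basic estimate" is a straight-line fit per series. The search below is only a sketch under several assumptions, not a tested solution: it borrows the index name cus_data and the fields cust_name, device_name, idx_label, and d_used from the query posted later in this thread, treats d_used as a point-in-time reading that can be averaged per day, and fits an ordinary least-squares line per customer, device, and filesystem using nothing but stats and eval. The names day, xy, xx, n, sx, sy, sxy, sxx, slope, intercept, and predicted_d_used are made up for illustration.

```spl
index=cus_data
| bin _time span=1d
| stats avg(d_used) as d_used by _time, cust_name, device_name, idx_label
| eval day = floor(_time / 86400)
| eval xy = day * d_used, xx = day * day
| stats count as n, sum(day) as sx, sum(d_used) as sy, sum(xy) as sxy, sum(xx) as sxx, max(day) as last_day by cust_name, device_name, idx_label
| where n > 1
| eval slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
| eval intercept = (sy - slope * sx) / n
| eval predicted_d_used = round(intercept + slope * (last_day + 10), 0)
| table cust_name, device_name, idx_label, slope, predicted_d_used
```

Here slope is the estimated growth per day and predicted_d_used is the line extended 10 days past the last observed day. A straight line cannot capture weekly cycles or cleanup jobs, but it gives one number per customer, device, and filesystem to compare any fancier model against.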

First approximation is to aggregate by customer, calculate sum and average-per-device and number-of-devices, then predict the trends and see what they say.
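As a rough sketch of that first approximation, the search below rolls the data up to one row per customer per day and then trends a single customer with the built-in predict command. The index and field names are assumed from the query posted later in this thread, "Alex" is just one of the sample customers from that post, and the choice of algorithm=LLT and future_timespan=10 is an assumption rather than a recommendation. The names total_used, device_count, and avg_per_device are hypothetical.

```spl
index=cus_data
| bin _time span=1d
| stats avg(d_used) as d_used by _time, cust_name, device_name, idx_label
| stats sum(d_used) as total_used, dc(device_name) as device_count by _time, cust_name
| eval avg_per_device = round(total_used / device_count, 0)
| search cust_name="Alex"
| timechart span=1d max(total_used) as total_used
| predict total_used as prediction algorithm=LLT future_timespan=10
```

Dropping the last three lines leaves the per-customer daily table (total_used, device_count, avg_per_device), which is worth eyeballing before forecasting anything.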

Second, I'd probably start by trying to figure out workable features and clustering the individual device usage data. That way, if there are perhaps five different kinds of devices that have different usage patterns, you could use that as part of your predictions.
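A minimal sketch of that clustering idea, assuming the Machine Learning Toolkit is installed and using the same assumed index and field names as the query posted later in this thread. The two features here (average usage and average daily growth) are only illustrative guesses at "workable features", and k=5 simply echoes the "perhaps five different kinds" above. The names avg_used, first_used, last_used, days, daily_growth, and series_count are hypothetical.

```spl
index=cus_data
| bin _time span=1d
| stats avg(d_used) as d_used by _time, cust_name, device_name, idx_label
| stats avg(d_used) as avg_used, earliest(d_used) as first_used, latest(d_used) as last_used, count as days by cust_name, device_name, idx_label
| eval daily_growth = (last_used - first_used) / max(days - 1, 1)
| fit StandardScaler avg_used daily_growth
| fit KMeans SS_avg_used SS_daily_growth k=5
| stats count as series_count, avg(avg_used) as avg_used, avg(daily_growth) as avg_daily_growth by cluster
```

StandardScaler writes its output to fields prefixed with SS_, and KMeans adds a cluster field. Scaling first keeps the large raw usage numbers from dominating the distance calculation.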

Third, another approach I might take is to "normalize" all the actual numbers to an arbitrary scale before clustering, for instance so that the numbers on day 15 were all exactly 100. This would give another idea whether there were multiple "trajectories" of usage.
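The normalization could look something like the sketch below, again using the index and field names assumed from the query posted later in this thread. It interprets "day 15" as the 15th day counted from each series' own first observation, which is an assumption, and rescales every series so that its value on that day is exactly 100. The names first_day, day_index, day15_used, base_used, d_used_norm, and series are hypothetical.

```spl
index=cus_data
| bin _time span=1d
| stats avg(d_used) as d_used by _time, cust_name, device_name, idx_label
| eventstats min(_time) as first_day by cust_name, device_name, idx_label
| eval day_index = round((_time - first_day) / 86400) + 1
| eval day15_used = if(day_index == 15, d_used, null())
| eventstats max(day15_used) as base_used by cust_name, device_name, idx_label
| where base_used > 0
| eval d_used_norm = round(100 * d_used / base_used, 2)
| eval series = cust_name . ":" . device_name . ":" . idx_label
| timechart span=1d limit=0 avg(d_used_norm) by series
```

Series with no reading on their 15th day, or a reading of zero, are dropped by the where clause. With many devices the final timechart will be very wide, so it may be more practical to filter to one customer first.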

I'd also see about extending how long the data stays around, because I'd bet that devices may have one pattern when new and another when "mature"... although that depends entirely on your underlying use case, which you haven't told us.

akhil36109 · ‎01-27-2018

Thank you very much! Please go through this and give me more insight if you can.
Thank you in advance!

I can predict for a whole customer, but I am unable to break the prediction down per customer, per device, per filesystem.

Here is an example of the data:

cus_name  device_name  idx_label       disk_used
Alex      pixel        /var            356216
Alex      pixel        /var/log        2576
Alex      pixel        /home           4567
Tom       apple        /var            7656
Tom       apple        /               71928
Mary      Note8        /var/log/audit  69897
Mary      Note8        /var            98709

Each customer has a large number of devices, and each device writes data to several filesystems.

This is the log data that comes into our indexers and that we store.

My team wants a prediction for each customer, per device, per filesystem.

All I get so far is a prediction of the overall avg(d_used): I fit an algorithm and then forecast its averaged output into the future.

index=cus_data splunk_server=CustomerData originalsourcetype=rawData
| bin _time span=1d
| table _time, cust_name, device_name, idx_label, d_used, d_used_percent
| fit RandomForestRegressor "d_used" from "_time" "cust_name" "device_name" "idx_label" into "device_prediction_randomforest"
| table _time, "d_used", "predicted(d_used)"
| rename "predicted(d_used)" as Dused
| timechart span=1d avg(Dused)
| predict "avg(Dused)" as prediction algorithm="LLP5" future_timespan=3

I want the prediction per customer, per device_name, per idx_label.

Thanks in advance!!

DalJeanis · ‎02-12-2018

@akhil36109 - Sorry about the delay, I didn't notice your question. Do you still need this?

Suhailahmed648 · ‎06-25-2019

We have a similar requirement. Please advise!
