## How to predict disk usage for the future in Splunk Machine Learning Toolkit?

New Member

I am storing my customers' device logs in my index. Each customer has many devices, and each device has a file path.
I have the last 30 days of data. Can I predict the next 10 days?
Do I have to use a linear regression algorithm, or should I forecast it? I am using the Machine Learning Toolkit.
I have been trying hard, but I could not get the breakdown per customer, per device, per filesystem.

Thank you!

1 Solution
Legend

This doesn't sound like a machine learning use case to me, although I could be wrong there.

Always do the most basic estimate you can first, because it will tell you if your more complicated attempts have gone wonky somehow.

First approximation is to aggregate by customer, calculate sum and average-per-device and number-of-devices, then predict the trends and see what they say.
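To make that first approximation concrete, here is a minimal offline sketch in plain Python (not SPL). The row layout, the sample numbers, and the helper names `daily_customer_summary` and `linear_trend` are all assumptions for illustration; the trend is an ordinary least-squares line, which is only one of several ways to "predict the trends".

```python
from collections import defaultdict

# Hypothetical rows: (day_index, customer, device, filesystem, disk_used)
rows = [
    (0, "Alex", "pixel", "/var", 100), (0, "Alex", "pixel", "/home", 50),
    (0, "Alex", "nexus", "/var", 30),
    (1, "Alex", "pixel", "/var", 110), (1, "Alex", "pixel", "/home", 55),
    (1, "Alex", "nexus", "/var", 35),
    (2, "Alex", "pixel", "/var", 120), (2, "Alex", "pixel", "/home", 60),
    (2, "Alex", "nexus", "/var", 40),
]

def daily_customer_summary(rows):
    """Return {customer: {day: (total_used, device_count, avg_per_device)}}."""
    totals = defaultdict(lambda: defaultdict(float))
    devices = defaultdict(lambda: defaultdict(set))
    for day, cust, dev, _fs, used in rows:
        totals[cust][day] += used
        devices[cust][day].add(dev)
    summary = {}
    for cust, per_day in totals.items():
        summary[cust] = {}
        for day, total in per_day.items():
            n = len(devices[cust][day])
            summary[cust][day] = (total, n, total / n)
    return summary

def linear_trend(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0  # flat line if all x are equal
    return slope, mean_y - slope * mean_x

summary = daily_customer_summary(rows)
# Trend of the per-customer daily total, extrapolated to day 5.
points = [(day, vals[0]) for day, vals in sorted(summary["Alex"].items())]
slope, intercept = linear_trend(points)
forecast_day_5 = slope * 5 + intercept
```

The same `linear_trend` helper could be pointed at the device count or the average-per-device column instead of the total, to see which of the three is actually driving growth.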

Second, I'd probably start by trying to figure out workable features and clustering the individual device usage data. That way, if there are perhaps five different kinds of devices that have different usage patterns, you could use that as part of your predictions.

Third, another approach I might take is to "normalize" all the actual numbers to an arbitrary scale before clustering, for instance so that the numbers on day 15 were all exactly 100. This would give another idea whether there were multiple "trajectories" of usage.
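A rough sketch of that normalization step, in plain Python rather than SPL. The function name `normalize_to_reference`, the `(customer, device)` keys, and the sample values are hypothetical; the only idea taken from the text above is rescaling each series so its day-15 value becomes 100.

```python
def normalize_to_reference(series, ref_day=15, ref_value=100.0):
    """Rescale a {day: value} series so that series[ref_day] == ref_value.

    Returns None when the reference day is missing or zero, since such a
    series cannot be rescaled this way.
    """
    base = series.get(ref_day)
    if not base:
        return None
    return {day: value * ref_value / base for day, value in series.items()}

# Hypothetical per-device usage, keyed by (customer, device).
usage = {
    ("Alex", "pixel"): {1: 1000, 15: 2000, 30: 3000},
    ("Tom", "apple"): {1: 10, 15: 20, 30: 30},
    ("Mary", "Note8"): {1: 500, 15: 500, 30: 500},
}

normalized = {key: normalize_to_reference(s) for key, s in usage.items()}
```

In this made-up data, the first two devices differ in absolute size by a factor of 100 but end up with the same normalized shape, while the third is flat. That is the kind of grouping the clustering step would then try to pick up.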

I'd also see about extending how long the data stays around, because I'd bet that devices may have one pattern when new and another when "mature"... although that depends entirely on your underlying use case, which you haven't told us.

New Member

Thank you very much! Please go through this and give me more insight if you can.
Thank you in advance!

I can predict for a whole customer, but I am unable to break the prediction down per customer, per device, per filesystem.

Let me give you an example of the data:

| cus_name | device_name | idx_label | disk_used |
|---|---|---|---|
| Alex | pixel | /var | 356216 |
| Alex | pixel | /var/log | 2576 |
| Alex | pixel | /home | 4567 |
| Tom | apple | /var | 7656 |
| Tom | apple | / | 71928 |
| Mary | Note8 | /var/log/audit | 69897 |
| Mary | Note8 | /var | 98709 |

Like this, each customer has a large number of devices, and each device has different filesystems that data is written to.

This is the log data that is coming into our indexers and that we are storing.

So my team wants a prediction for each customer, per device, per filesystem.

All I am getting so far is a prediction of avg(d_used): I fit an algorithm and then forecast that average into the future.

index=cus_data splunk_server=CustomerData originalsourcetype=rawData | bin _time span=1d
| table _time, cust_name, device_name, idx_label, d_used, d_used_percent | fit RandomForestRegressor "d_used" from "_time" "cust_name" "device_name" "idx_label" into "device_prediction_randomforest" | table _time, "d_used", "predicted(d_used)" | rename "predicted(d_used)" as Dused | timechart span=1d avg(Dused) | predict "avg(Dused)" as prediction algorithm=LLP5 future_timespan=3

I want the prediction per customer, per device_name, per idx_label.
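For anyone landing here with the same requirement, one way to think about it is to treat every (cust_name, device_name, idx_label) combination as its own time series and fit a separate trend to each, instead of averaging everything first. The sketch below is plain Python with made-up numbers, not SPL and not what `predict` or `RandomForestRegressor` do internally; `fit_line` and `forecast_per_series` are hypothetical helper names, and a straight-line fit is swapped in as the simplest possible forecaster.

```python
from collections import defaultdict

def fit_line(points):
    """Least-squares line through (day, value) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def forecast_per_series(rows, horizon):
    """Forecast each (cust_name, device_name, idx_label) series separately.

    rows: iterable of (day, cust_name, device_name, idx_label, d_used).
    Returns {(cust, device, fs): [forecast for each of the next `horizon` days]}.
    Series with fewer than two points are skipped, since no trend can be fit.
    """
    series = defaultdict(list)
    for day, cust, dev, fs, used in rows:
        series[(cust, dev, fs)].append((day, used))
    forecasts = {}
    for key, points in series.items():
        if len(points) < 2:
            continue
        slope, intercept = fit_line(points)
        last_day = max(day for day, _ in points)
        forecasts[key] = [slope * (last_day + k) + intercept
                          for k in range(1, horizon + 1)]
    return forecasts

# Hypothetical daily readings; real data would come from the index.
rows = [
    (0, "Alex", "pixel", "/var", 100), (1, "Alex", "pixel", "/var", 110),
    (2, "Alex", "pixel", "/var", 120),
    (0, "Alex", "pixel", "/home", 50), (1, "Alex", "pixel", "/home", 50),
    (2, "Alex", "pixel", "/home", 50),
    (2, "Tom", "apple", "/", 70),  # only one reading, so it is skipped
]

forecasts = forecast_per_series(rows, horizon=3)
```

The point of the design is the grouping key: once the data is split by all three fields before any model is fit, each filesystem gets its own forecast and nothing is averaged away.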

Legend

@akhil36109 - Sorry about the delay, I didn't notice your question. Do you still need this?

Engager

We have a similar requirement. Any suggestions?
