
MLTK - What algorithm should I use?

genesiusj
Builder

Hello,

Here is what I have.

  • Lookup file containing 52K rows
  • Fields: DATE, USER, COUNT
  • Requirement: forecast user access to sensitive data on Sundays
  • 6 months of events to train (Jan-Jun)
  • 6 months to forecast (Jul-Dec)
  • Data is from 2020, so we know the results, but we want to see how close the forecast comes to the actual data
  • DATE format is YYYY-MM-DD, beginning with 2020-01-05 and ending on 2020-12-27 (all Sundays); 52 values
  • USER: 1000 values
  • In the lookup file there are 1000 USER rows for every DATE; COUNT is 0 if the user did not attempt access, otherwise it is the number of attempts

The original lookup is over 1.5 million events (each containing the USER and TIME of an attempt). The original TIME value was in YYYY-MM-DD HH:MM:SS format, but we are only concerned with how many attempts occurred each day.
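
Converting those raw events into one row per USER per Sunday can be done along these lines (a sketch; raw_access.csv and sunday_counts.csv are placeholder names):

```
| inputlookup raw_access.csv
| eval _time=strptime(TIME, "%Y-%m-%d %H:%M:%S")
| where strftime(_time, "%A")="Sunday"
| eval DATE=strftime(_time, "%Y-%m-%d")
| stats count as COUNT by DATE USER
| outputlookup sunday_counts.csv
```

This only produces rows for users with at least one attempt; the zero-count rows for the remaining users still need to be filled in.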

I went to ChatGPT for help coding the SPL; however, it claimed that MLTK needed the data as a count per user for every Sunday, and could not work with the original events.

Thanks in advance for your help.
God bless,
Genesius


genesiusj
Builder

Apologies. I did not see any notification in my email about this question receiving responses.

I get moved from project to project, and this one is now on hold.

@asimit and @isoutamo, I gave you some karma.

God bless.


isoutamo
SplunkTrust

Thanks! If you continue with this and manage to solve it, please let us know what the solution was!

ChatGPT proposes the following algorithms for this case:

| Algorithm in MLTK | When to Use | Strengths | Weaknesses |
| --- | --- | --- | --- |
| StateSpaceForecast (`algo=StateSpaceForecast`) | Small seasonal datasets with trend | Captures seasonality & trend | Not designed for sparse series with many zeros |
| ARIMA (`algo=ARIMA`) | Strong seasonality, autocorrelation | Handles short time series | Needs continuous values (zeros can reduce model quality) |
| LLP (Local Linear Projection) | Simple, quick forecasts | Light-weight | Limited for complex patterns |
| DenseNNRegressor / LSTM | Complex, nonlinear series | Can learn patterns over multiple users at once | Requires the MLTK deep learning toolkit, more tuning |
| One-Class SVM / Isolation Forest (anomaly detection rather than forecasting) | Detecting abnormal future access | Robust to sparse data | Not a forecasting method per se |

Given your case (a weekly time series, a short history of 26 training points, weekly seasonality, and sparsity), the StateSpaceForecast algorithm is usually the safest starting point in MLTK for this type of forecasting.
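
A minimal sketch of what that could look like for a single user, assuming the lookup is named your_lookup.csv and "jdoe" is a placeholder user (StateSpaceForecast requires a _time field, and the forecast should land in a predicted(COUNT) field):

```
| inputlookup your_lookup.csv
| where USER="jdoe" AND DATE <= "2020-06-28"
| eval _time=strptime(DATE, "%Y-%m-%d")
| sort 0 _time
| fit StateSpaceForecast COUNT holdback=0 forecast_k=26 conf_interval=95
| eval DATE=strftime(_time, "%Y-%m-%d")
| table DATE COUNT "predicted(COUNT)"
```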

 


asimit
Path Finder

Hi @genesiusj,

Based on your description, you're dealing with a time series forecasting problem where you want to predict future user access patterns on Sundays. For this type of scenario in MLTK, I would recommend the following algorithms:

## Recommended Algorithms

1. Prophet
   a. Excellent for time series data with strong seasonal patterns (like your Sunday-only data)
   b. Handles missing values well, which is useful since many users may have zero counts on certain days
   c. Can capture multiple seasonal patterns (weekly, monthly, yearly)
   d. Works well when you have 6 months of historical data

2. ARIMA (AutoRegressive Integrated Moving Average)
   a. Good for detecting patterns and generating forecasts based on historical values
   b. Works well for data that shows trends over time
   c. Can handle seasonal patterns with the seasonal variant (SARIMA)
   d. Requires stationary data (you might need to difference your time series); a sketch follows this list
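
For the ARIMA option, a minimal sketch for a single user, assuming the MLTK ARIMA parameter style (order=p-d-q) and "jdoe" as a placeholder user:

```
| inputlookup your_lookup.csv
| where USER="jdoe" AND DATE <= "2020-06-28"
| eval _time=strptime(DATE, "%Y-%m-%d")
| sort 0 _time
| fit ARIMA _time COUNT order=1-1-1 forecast_k=26 conf_interval=95
```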

## Implementation Approach

For your specific use case with 1000 users, I would recommend using a separate model for each user who has sufficient historical data. Here's how you could implement this with Prophet (note that Prophet is not one of MLTK's built-in algorithms, so it has to be added as a custom algorithm):

```
| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| rename DATE as ds, COUNT as y
| fit Prophet future_timespan=26 from ds y by USER
| where isnull(y)
| eval DATE=strftime(ds, "%Y-%m-%d")
| eval predicted_count = round(yhat)
| fields DATE USER predicted_count yhat_lower yhat_upper
```

For comparison with actual values:

```
| inputlookup your_lookup.csv
| where DATE >= "2020-07-05" AND DATE <= "2020-12-27"
| join type=left USER DATE
    [| inputlookup your_lookup.csv
    | where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
    | rename DATE as ds, COUNT as y
    | fit Prophet future_timespan=26 from ds y by USER
    | where isnull(y)
    | eval DATE=strftime(ds, "%Y-%m-%d")
    | eval predicted_count = round(yhat)
    | fields DATE USER predicted_count]
| rename COUNT as actual_count
| eval error = abs(actual_count - predicted_count)
| eval error_percentage = if(actual_count=0, if(predicted_count=0, 0, 100), round((error/actual_count)*100, 2))
```
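
To summarize accuracy per user, something like this could be appended to the comparison search above (MAE and MAPE are discussed under model evaluation below):

```
| stats avg(error) as MAE avg(error_percentage) as MAPE by USER
| sort - MAPE
```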

## Handling Your Data Structure

Since you have 1000 users and 52 Sundays, I have a few recommendations for improving your forecasting:

1. Focus first on users with non-zero access patterns
   a. Many users might have sparse or no access attempts, which can result in poor models
   b. Consider filtering to users who accessed the system at least N times during the training period

2. Consider feature engineering
   a. Add month and quarter features to help the model capture broader seasonal patterns
   b. Include special event indicators if certain Sundays might have unusual patterns (holidays, etc.)
   c. You might want to include a lag feature (access count from previous Sunday)

3. Model evaluation
   a. Compare MAPE (Mean Absolute Percentage Error) across different users and algorithms
   b. For users with sparse access patterns, consider MAE (Mean Absolute Error) instead
   c. Establish a baseline model (like average access count per Sunday) to compare against

4. Alternative approach for sparse data
   a. For users with very sparse access patterns, consider binary classification 
   b. Predict whether a user will attempt access (yes/no) rather than count
   c. Use algorithms like Logistic Regression or Random Forest for this approach; a sketch follows this list
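
A minimal sketch of that classification variant, assuming the same your_lookup.csv and two illustrative engineered features (calendar month, plus the previous Sunday's outcome as a lag feature):

```
| inputlookup your_lookup.csv
| eval accessed=if(COUNT > 0, 1, 0)
| eval month=tonumber(strftime(strptime(DATE, "%Y-%m-%d"), "%m"))
| sort 0 USER DATE
| streamstats window=1 current=f last(accessed) as prev_accessed by USER
| fillnull value=0 prev_accessed
| where DATE <= "2020-06-28"
| fit LogisticRegression accessed from month prev_accessed into sunday_access_model
```

Applying `| apply sunday_access_model` to the July-December rows then gives a `predicted(accessed)` field to compare against the actual flag.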

Hope this helps point you in the right direction! With 6 months of training data focused on weekly patterns, Prophet is likely your best starting point.

Please give 👍 for support 😁 happy splunking .... 😎