
MLTK - What algorithm should I use?

genesiusj
Builder

Hello,

Here is what I have.

  • Lookup file containing 52K rows
  • Fields: DATE, USER, COUNT
  • Requirement: forecast user access to sensitive data on Sundays
  • 6 months of events to train (Jan-Jun)
  • 6 months to forecast (Jul-Dec)
  • Data is from 2020, so we know the results, but we want to see how close the forecast comes to the actual data
  • DATE format is YYYY-MM-DD, beginning with 2020-01-05 and ending on 2020-12-27 (all Sundays); 52 values
  • USER: 1000 values
  • In the lookup file there are 1000 USER rows for every DATE; COUNT is 0 if the user did not attempt access, otherwise it is the number of attempts

The original lookup is over 1.5 million events (each containing the USER and TIME of an attempt). The original TIME value was in YYYY-MM-DD HH:MM:SS format, but we are only concerned with how many attempts occurred each day.
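
Converting those raw events into one row per USER per Sunday can be done along these lines (a sketch; raw_access.csv and sunday_counts.csv are placeholder names):

```
| inputlookup raw_access.csv
| eval _time=strptime(TIME, "%Y-%m-%d %H:%M:%S")
| where strftime(_time, "%A")="Sunday"
| eval DATE=strftime(_time, "%Y-%m-%d")
| stats count as COUNT by DATE USER
| outputlookup sunday_counts.csv
```

This only produces rows for users with at least one attempt; the zero-count rows for the remaining users still need to be filled in.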

I went to ChatGPT for help coding the SPL; however, it claimed that MLTK needed the data as a count per user for every Sunday, and could not work with the original events.

Thanks in advance for your help.
God bless,
Genesius


genesiusj
Builder

Apologies. I did not see any notification in my email about this question receiving responses.

I get moved from project to project, and this one is now on hold.

@asimit and @isoutamo, I gave you some karma.

God bless.


isoutamo
SplunkTrust

Thanks! If you continue with this and manage to solve it, please let us know what the solution was!

ChatGPT proposes the following algorithms for this case:

| Algorithm in MLTK | When to Use | Strengths | Weaknesses |
| --- | --- | --- | --- |
| StateSpaceForecast (`algo=StateSpaceForecast`) | Small seasonal datasets with trend | Captures seasonality & trend | Not designed for sparse series with many zeros |
| ARIMA (`algo=ARIMA`) | Strong seasonality, autocorrelation | Handles short time series | Needs continuous values (zeros can reduce model quality) |
| LLP (Local Linear Projection) | Simple, quick forecasts | Light-weight | Limited for complex patterns |
| DenseNNRegressor / LSTM | Complex, nonlinear series | Can learn patterns over multiple users at once | Requires the MLTK deep learning toolkit, more tuning |
| One-Class SVM / Isolation Forest (anomaly detection rather than forecasting) | Detecting abnormal future access | Robust to sparse data | Not a forecasting method per se |

Given your case (a weekly time series, a short history of 26 training points, weekly seasonality, and sparsity), the StateSpaceForecast algorithm is usually the safest starting point in MLTK for this type of forecasting.
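
A minimal sketch of what that could look like for a single user, assuming the lookup is named your_lookup.csv and "jdoe" is a placeholder user (StateSpaceForecast requires a _time field, and the forecast should land in a predicted(COUNT) field):

```
| inputlookup your_lookup.csv
| where USER="jdoe" AND DATE <= "2020-06-28"
| eval _time=strptime(DATE, "%Y-%m-%d")
| sort 0 _time
| fit StateSpaceForecast COUNT holdback=0 forecast_k=26 conf_interval=95
| eval DATE=strftime(_time, "%Y-%m-%d")
| table DATE COUNT "predicted(COUNT)"
```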

 


asimit
Path Finder

Hi @genesiusj,

Based on your description, you're dealing with a time series forecasting problem where you want to predict future user access patterns on Sundays. For this type of scenario in MLTK, I would recommend the following algorithms:

## Recommended Algorithms

1. Prophet
   a. Excellent for time series data with strong seasonal patterns (like your Sunday-only data)
   b. Handles missing values well, which is useful since many users may have zero counts on certain days
   c. Can capture multiple seasonal patterns (weekly, monthly, yearly)
   d. Works well when you have 6 months of historical data

2. ARIMA (AutoRegressive Integrated Moving Average)
   a. Good for detecting patterns and generating forecasts based on historical values
   b. Works well for data that shows trends over time
   c. Can handle seasonal patterns with the seasonal variant (SARIMA)
   d. Requires stationary data (you might need to difference your time series); a sketch follows this list
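
For the ARIMA option, a minimal sketch for a single user, assuming the MLTK ARIMA parameter style (order=p-d-q) and "jdoe" as a placeholder user:

```
| inputlookup your_lookup.csv
| where USER="jdoe" AND DATE <= "2020-06-28"
| eval _time=strptime(DATE, "%Y-%m-%d")
| sort 0 _time
| fit ARIMA _time COUNT order=1-1-1 forecast_k=26 conf_interval=95
```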

## Implementation Approach

For your specific use case with 1000 users, I would recommend using a separate model for each user who has sufficient historical data. Here's how you could implement this with Prophet (note that Prophet is not one of MLTK's built-in algorithms, so it has to be added as a custom algorithm):

```
| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| rename DATE as ds, COUNT as y
| fit Prophet future_timespan=26 from ds y by USER
| where isnull(y)
| eval DATE=strftime(ds, "%Y-%m-%d")
| eval predicted_count = round(yhat)
| fields DATE USER predicted_count yhat_lower yhat_upper
```

For comparison with actual values:

```
| inputlookup your_lookup.csv
| where DATE >= "2020-07-05" AND DATE <= "2020-12-27"
| join type=left USER DATE
    [| inputlookup your_lookup.csv
    | where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
    | rename DATE as ds, COUNT as y
    | fit Prophet future_timespan=26 from ds y by USER
    | where isnull(y)
    | eval DATE=strftime(ds, "%Y-%m-%d")
    | eval predicted_count = round(yhat)
    | fields DATE USER predicted_count]
| rename COUNT as actual_count
| eval error = abs(actual_count - predicted_count)
| eval error_percentage = if(actual_count=0, if(predicted_count=0, 0, 100), round((error/actual_count)*100, 2))
```
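
To summarize accuracy per user, something like this could be appended to the comparison search above (MAE and MAPE are discussed under model evaluation below):

```
| stats avg(error) as MAE avg(error_percentage) as MAPE by USER
| sort - MAPE
```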

## Handling Your Data Structure

Since you have 1000 users and 52 Sundays, I have a few recommendations for improving your forecasting:

1. Focus first on users with non-zero access patterns
   a. Many users might have sparse or no access attempts, which can result in poor models
   b. Consider filtering to users who accessed the system at least N times during the training period

2. Consider feature engineering
   a. Add month and quarter features to help the model capture broader seasonal patterns
   b. Include special event indicators if certain Sundays might have unusual patterns (holidays, etc.)
   c. You might want to include a lag feature (access count from previous Sunday)

3. Model evaluation
   a. Compare MAPE (Mean Absolute Percentage Error) across different users and algorithms
   b. For users with sparse access patterns, consider MAE (Mean Absolute Error) instead
   c. Establish a baseline model (like average access count per Sunday) to compare against

4. Alternative approach for sparse data
   a. For users with very sparse access patterns, consider binary classification 
   b. Predict whether a user will attempt access (yes/no) rather than count
   c. Use algorithms like Logistic Regression or Random Forest for this approach; a sketch follows this list
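
A minimal sketch of that classification variant, assuming the same your_lookup.csv and two illustrative engineered features (calendar month, plus the previous Sunday's outcome as a lag feature):

```
| inputlookup your_lookup.csv
| eval accessed=if(COUNT > 0, 1, 0)
| eval month=tonumber(strftime(strptime(DATE, "%Y-%m-%d"), "%m"))
| sort 0 USER DATE
| streamstats window=1 current=f last(accessed) as prev_accessed by USER
| fillnull value=0 prev_accessed
| where DATE <= "2020-06-28"
| fit LogisticRegression accessed from month prev_accessed into sunday_access_model
```

Applying `| apply sunday_access_model` to the July-December rows then gives a `predicted(accessed)` field to compare against the actual flag.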

Hope this helps point you in the right direction! With 6 months of training data focused on weekly patterns, Prophet is likely your best starting point.

Please give 👍 for support 😁 happy splunking .... 😎