
MLTK - What algorithm should I use?

genesiusj
Builder

Hello,

Here is what I have.

  • Lookup file containing 52K rows
  • Fields: DATE, USER, COUNT
  • Requirement: forecast user access, on Sundays, to sensitive data, based on:
  • 6 months of events to train on (Jan-Jun)
  • 6 months to forecast (Jul-Dec)
  • Data from 2020, so we know the results, but we want to see how close the forecasting was to the actual data
  • DATE format YYYY-MM-DD, beginning with 2020-01-05 and ending on 2020-12-27 (all Sundays); 52 values
  • USER: 1000 values
  • Lookup file; there are 1000 USER values for every DATE; the COUNT is 0 if they did not attempt access, otherwise it is the number of attempts

The original lookup is over 1.5 million events (each containing the USER and TIME of the attempt). The original TIME value was YYYY-MM-DD HH:MM:SS, but we are only concerned with how many attempts occurred each day.

I went to ChatGPT for help coding the SPL; however, it "claimed" that the data for MLTK needed to be a count for each user for every Sunday, rather than the original per-attempt events.

Thanks in advance for your help.
God bless,
Genesius

asimit
Engager

Hi @genesiusj,

Based on your description, you're dealing with a time series forecasting problem where you want to predict future user access patterns on Sundays. For this type of scenario in MLTK, I would recommend the following algorithms:
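
A quick note on the ChatGPT point first: for time series forecasting in MLTK you do generally want the data shaped as one row per USER per Sunday with a COUNT, rather than the raw per-attempt events. If you ever need to rebuild that table from the original events, a sketch along these lines should work (the lookup name and TIME field are assumptions from your description; it only emits rows for users who actually attempted access, so zero-count rows would have to be filled in separately):

```
| inputlookup original_access_attempts.csv
| eval _time=strptime(TIME, "%Y-%m-%d %H:%M:%S")
| eval DATE=strftime(_time, "%Y-%m-%d")
| where strftime(_time, "%A")="Sunday"
| stats count as COUNT by DATE USER
```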

## Recommended Algorithms

1. Prophet
   a. Excellent for time series data with strong seasonal patterns (like your Sunday-only data)
   b. Handles missing values well, which is useful since many users may have zero counts on certain days
   c. Can capture multiple seasonal patterns (weekly, monthly, yearly)
   d. Works well when you have 6 months of historical data

2. ARIMA (AutoRegressive Integrated Moving Average)
   a. Good for detecting patterns and generating forecasts based on historical values
   b. Works well for data that shows trends over time
   c. Can handle seasonal patterns with the seasonal variant (SARIMA)
   d. Requires stationary data (you might need to difference your time series)
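
If you want to try ARIMA, note that unlike the Prophet example below you would typically fit it one series at a time. Here is a minimal sketch for a single user (the lookup name, the USER value, and the order=1-1-1 p-d-q values are placeholders to tune; check the MLTK ARIMA documentation for the exact options your version supports):

```
| inputlookup your_lookup.csv
| where USER="some_user" AND DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| eval _time=strptime(DATE, "%Y-%m-%d")
| sort 0 _time
| fit ARIMA COUNT order=1-1-1 forecast_k=26
```

The forecast rows can then be compared against the Jul-Dec actuals in the same way as the Prophet comparison below.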

## Implementation Approach

For your specific use case with 1000 users, I would recommend using a separate model for each user who has sufficient historical data. Here's how you could implement this with Prophet (note: depending on your MLTK version, Prophet may not be available out of the box and may need to be added as a custom algorithm):

```
| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| rename DATE as ds, COUNT as y
| fit Prophet future_timespan=26 from ds y by USER
| where isnull(y)
| eval DATE=coalesce(strftime(ds, "%Y-%m-%d"), ds)
| eval predicted_count = round(yhat)
| fields DATE USER predicted_count yhat_lower yhat_upper
```

For comparison with actual values:

```
| inputlookup your_lookup.csv
| where DATE >= "2020-07-05" AND DATE <= "2020-12-27"
| join type=left USER DATE
    [| inputlookup your_lookup.csv
    | where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
    | rename DATE as ds, COUNT as y
    | fit Prophet future_timespan=26 from ds y by USER
    | where isnull(y)
    | eval DATE=coalesce(strftime(ds, "%Y-%m-%d"), ds)
    | eval predicted_count = round(yhat)
    | fields DATE USER predicted_count]
| rename COUNT as actual_count
| eval error = abs(actual_count - predicted_count)
| eval error_percentage = if(actual_count=0, if(predicted_count=0, 0, 100), round((error/actual_count)*100, 2))
```
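
To roll the per-Sunday errors above up into per-user accuracy figures (MAE and MAPE, which I come back to under model evaluation below), you can append something like:

```
| stats avg(error) as MAE avg(error_percentage) as MAPE count as sundays_compared by USER
| sort - MAPE
```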

## Handling Your Data Structure

Since you have 1000 users and 52 Sundays, I have a few recommendations for improving your forecasting:

1. Focus first on users with non-zero access patterns
   a. Many users might have sparse or no access attempts, which can result in poor models
   b. Consider filtering to users who accessed the system at least N times during the training period (see the active-user filter sketch after this list)

2. Consider feature engineering
   a. Add month and quarter features to help the model capture broader seasonal patterns
   b. Include special event indicators if certain Sundays might have unusual patterns (holidays, etc.)
   c. You might want to include a lag feature, i.e., the access count from the previous Sunday (see the feature engineering sketch after this list)

3. Model evaluation
   a. Compare MAPE (Mean Absolute Percentage Error) across different users and algorithms
   b. For users with sparse access patterns, consider MAE (Mean Absolute Error) instead
   c. Establish a baseline model (like average access count per Sunday) to compare against

4. Alternative approach for sparse data
   a. For users with very sparse access patterns, consider binary classification 
   b. Predict whether a user will attempt access (yes/no) rather than count
   c. Use algorithms like Logistic Regression or Random Forest for this approach
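
For the active-user filter (point 1), a sketch like this keeps only users with at least N non-zero Sundays in the training window before you fit anything (your_lookup.csv and the threshold of 4 are placeholders):

```
| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| search
    [| inputlookup your_lookup.csv
    | where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
    | stats count(eval(COUNT>0)) as active_sundays by USER
    | where active_sundays >= 4
    | fields USER]
```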
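
For the feature engineering ideas (point 2), something along these lines adds month/quarter columns and a previous-Sunday lag before fitting (the field names are just suggestions):

```
| inputlookup your_lookup.csv
| eval month=strftime(strptime(DATE, "%Y-%m-%d"), "%m")
| eval quarter=ceiling(tonumber(month)/3)
| sort 0 USER DATE
| streamstats current=f window=1 last(COUNT) as prev_sunday_count by USER
| fillnull value=0 prev_sunday_count
```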
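
And for the sparse-data alternative (point 4), here is a simplified global classification sketch with MLTK's built-in LogisticRegression, using just a month feature and a previous-Sunday flag (the feature set and model name are illustrative):

```
| inputlookup your_lookup.csv
| eval accessed=if(COUNT>0, 1, 0)
| eval month=strftime(strptime(DATE, "%Y-%m-%d"), "%m")
| sort 0 USER DATE
| streamstats current=f window=1 last(accessed) as prev_accessed by USER
| fillnull value=0 prev_accessed
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| fit LogisticRegression accessed from month prev_accessed into sunday_access_clf
```

You can then run the same preparation over the Jul-Dec rows and score them with | apply sunday_access_clf, comparing the predicted accessed values against the actual ones.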

Hope this helps point you in the right direction! With 6 months of training data focused on weekly patterns, Prophet is likely your best starting point.

Please give 👍 for support 😁 happy splunking ... 😎