Hello,
Here is what I have.
The original lookup is over 1.5 million events (each containing the USER and TIME of the attempt). The original TIME value was YYYY-MM-DD HH:MM:SS, but we are only concerned with how many attempts each user made that day.
Went to ChatGPT to help code the SPL; however, it "claimed" MLTK needed the data as a count per user for every Sunday, rather than working with the original events.
Thanks in advance for your help.
God bless,
Genesius
Hi @genesiusj,
Based on your description, you're dealing with a time series forecasting problem where you want to predict future user access patterns on Sundays. For this type of scenario in MLTK, I would recommend the following algorithms:
## Recommended Algorithms
1. Prophet
a. Excellent for time series data with strong seasonal patterns (like your Sunday-only data)
b. Handles missing values well, which is useful since many users may have zero counts on certain days
c. Can capture multiple seasonal patterns (weekly, monthly, yearly)
d. Works well when you have 6 months of historical data
2. ARIMA (AutoRegressive Integrated Moving Average)
a. Good for detecting patterns and generating forecasts based on historical values
b. Works well for data that shows trends over time
c. Can handle seasonal patterns with the seasonal variant (SARIMA)
d. Requires stationary data (you might need to difference your time series); a minimal sketch follows this list
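As a rough illustration of the ARIMA route, here is a sketch under assumptions: as far as I know, MLTK's ARIMA fits a single series at a time, so you would run it one user at a time. The USER value, lookup name, and order values below are placeholders to tune:
```
| inputlookup your_lookup.csv
| where USER="jsmith" AND DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| eval _time=strptime(DATE, "%Y-%m-%d")
| fit ARIMA holdback=0 conf_interval=95 COUNT order=1-1-1 forecast_k=26
```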
## Implementation Approach
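If your lookup still holds the 1.5M raw attempt events rather than daily counts, first roll them up to one row per user per Sunday. A minimal sketch, assuming the raw events sit in a lookup named your_raw_events.csv with USER and TIME fields (both names are placeholders; adjust to your data):
```
| inputlookup your_raw_events.csv
| eval _time=strptime(TIME, "%Y-%m-%d %H:%M:%S")
| where strftime(_time, "%w")="0"
| eval DATE=strftime(_time, "%Y-%m-%d")
| stats count as COUNT by USER DATE
| outputlookup your_lookup.csv
```
The strptime parses your TIME strings into epoch time, the %w filter keeps Sundays only (Sunday is day 0), and stats produces the per-user daily COUNT that the searches below expect.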
For your specific use case with 1000 users, I would recommend a separate model for each user who has sufficient historical data. Here's how you could implement this with Prophet (assuming the Prophet algorithm is available in your MLTK installation):
```
| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| eval ds=strptime(DATE, "%Y-%m-%d")
| rename COUNT as y
| fit Prophet future_timespan=26 y from ds by USER
| where isnull(y)
| eval DATE=strftime(ds, "%Y-%m-%d")
| eval predicted_count = round(yhat)
| fields DATE USER predicted_count yhat_lower yhat_upper
```
For comparison with actual values:
```
| inputlookup your_lookup.csv
| where DATE >= "2020-07-05" AND DATE <= "2020-12-27"
| join type=left USER DATE
[| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| eval ds=strptime(DATE, "%Y-%m-%d")
| rename COUNT as y
| fit Prophet future_timespan=26 y from ds by USER
| where isnull(y)
| eval DATE=strftime(ds, "%Y-%m-%d")
| eval predicted_count = round(yhat)
| fields DATE USER predicted_count]
| rename COUNT as actual_count
| eval error = abs(actual_count - predicted_count)
| eval error_percentage = if(actual_count=0, if(predicted_count=0, 0, 100), round((error/actual_count)*100, 2))
```
## Handling Your Data Structure
Since you have 1000 users and 52 Sundays, I have a few recommendations for improving your forecasting:
1. Focus first on users with non-zero access patterns
a. Many users might have sparse or no access attempts, which can result in poor models
b. Consider filtering to users who accessed the system at least N times during the training period (see sketch 1 after this list)
2. Consider feature engineering
a. Add month and quarter features to help the model capture broader seasonal patterns
b. Include special event indicators if certain Sundays might have unusual patterns (holidays, etc.)
c. You might want to include a lag feature, such as the access count from the previous Sunday (see sketch 2 after this list)
3. Model evaluation
a. Compare MAPE (Mean Absolute Percentage Error) across different users and algorithms
b. For users with sparse access patterns, consider MAE (Mean Absolute Error) instead
c. Establish a baseline model (like the average access count per Sunday) to compare against; sketch 3 after this list computes MAE and MAPE
4. Alternative approach for sparse data
a. For users with very sparse access patterns, consider binary classification
b. Predict whether a user will attempt access (yes/no) rather than count
c. Use algorithms like Logistic Regression or Random Forest for this approach (see sketch 4 after this list)
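Sketch 1 (filtering to active users): a minimal example, assuming a threshold of 5 attempts over the training window and a hypothetical active_users.csv output; adjust both to taste:
```
| inputlookup your_lookup.csv
| where DATE >= "2020-01-05" AND DATE <= "2020-06-28"
| stats sum(COUNT) as total_attempts by USER
| where total_attempts >= 5
| fields USER
| outputlookup active_users.csv
```
You can then restrict the forecasting searches to the users in active_users.csv before fitting.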
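Sketch 2 (calendar and lag features): field names here are illustrative; sort 0 disables the default result limit so all ~52,000 rows are ordered before the lag is computed:
```
| inputlookup your_lookup.csv
| eval _time=strptime(DATE, "%Y-%m-%d")
| eval month=strftime(_time, "%m")
| eval quarter=ceiling(tonumber(month) / 3)
| sort 0 USER DATE
| streamstats current=f last(COUNT) as prev_sunday_count by USER
```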
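Sketch 3 (per-user MAE and MAPE): append this to the actual-vs-predicted comparison search above; rows with actual_count=0 are excluded from MAPE (where it is undefined) but still count toward MAE:
```
| eval abs_error=abs(actual_count - predicted_count)
| eval pct_error=if(actual_count > 0, abs_error / actual_count * 100, null())
| stats avg(abs_error) as MAE avg(pct_error) as MAPE by USER
| sort - MAPE
```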
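Sketch 4 (binary classification for sparse users): a sketch with MLTK's LogisticRegression, using only two illustrative features (the month, and whether the user accessed the previous Sunday); access_model is a hypothetical model name:
```
| inputlookup your_lookup.csv
| eval accessed=if(COUNT > 0, 1, 0)
| eval _time=strptime(DATE, "%Y-%m-%d")
| eval month=strftime(_time, "%m")
| sort 0 USER DATE
| streamstats current=f last(accessed) as prev_accessed by USER
| where isnotnull(prev_accessed)
| fit LogisticRegression accessed from month prev_accessed into access_model
```
Once saved, you can score future Sundays with | apply access_model.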
Hope this helps point you in the right direction! With 6 months of training data focused on weekly patterns, Prophet is likely your best starting point.
Please give 👍 for support 😁 happy splunking .... 😎
Have you looked at these guides?
I suppose those will help you select suitable algorithms for your test data.