<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic MLTK - What algorithm should I use? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/745863#M241556</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Here is what I have.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Lookup file containing&amp;nbsp;52K rows&lt;/LI&gt;&lt;LI&gt;Fields: DATE, USER, COUNT&lt;/LI&gt;&lt;LI&gt;Need to forecast user access to sensitive data on Sundays, based on:&lt;/LI&gt;&lt;LI&gt;6 months of events to train (Jan-Jun)&lt;/LI&gt;&lt;LI&gt;6 months of forecasting (Jul-Dec)&lt;/LI&gt;&lt;LI&gt;Data is from 2020, so we know the results, but we want to see how close the forecast was to the actual data&lt;/LI&gt;&lt;LI&gt;DATE format is YYYY-MM-DD, beginning with 2020-01-05 and ending on 2020-12-27 (Sundays): 52 values&lt;/LI&gt;&lt;LI&gt;USER: 1000 values&lt;/LI&gt;&lt;LI&gt;In the lookup file there are 1000 USER values for every DATE; the COUNT is 0 if they did not attempt access, otherwise it is the number of attempts&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The original lookup is over 1.5 million events (each containing the USER and TIME of the attempt). The original TIME value was YYYY-MM-DD HH:MM:SS, but we are only concerned with how many attempts were made that day.&lt;/P&gt;&lt;P&gt;I went to ChatGPT for help coding the SPL; however, it "claimed" that MLTK needed the data as a count for each user for every Sunday, and could not work with the original events.&lt;/P&gt;&lt;P&gt;Thanks in advance for your help.&lt;BR /&gt;God bless,&lt;BR /&gt;Genesius&lt;/P&gt;</description>
    <pubDate>Fri, 09 May 2025 17:33:52 GMT</pubDate>
    <dc:creator>genesiusj</dc:creator>
    <dc:date>2025-05-09T17:33:52Z</dc:date>
    <item>
      <title>MLTK - What algorithm should I use?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/745863#M241556</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Here is what I have.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Lookup file containing&amp;nbsp;52K rows&lt;/LI&gt;&lt;LI&gt;Fields: DATE, USER, COUNT&lt;/LI&gt;&lt;LI&gt;Need to forecast user access to sensitive data on Sundays, based on:&lt;/LI&gt;&lt;LI&gt;6 months of events to train (Jan-Jun)&lt;/LI&gt;&lt;LI&gt;6 months of forecasting (Jul-Dec)&lt;/LI&gt;&lt;LI&gt;Data is from 2020, so we know the results, but we want to see how close the forecast was to the actual data&lt;/LI&gt;&lt;LI&gt;DATE format is YYYY-MM-DD, beginning with 2020-01-05 and ending on 2020-12-27 (Sundays): 52 values&lt;/LI&gt;&lt;LI&gt;USER: 1000 values&lt;/LI&gt;&lt;LI&gt;In the lookup file there are 1000 USER values for every DATE; the COUNT is 0 if they did not attempt access, otherwise it is the number of attempts&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The original lookup is over 1.5 million events (each containing the USER and TIME of the attempt). The original TIME value was YYYY-MM-DD HH:MM:SS, but we are only concerned with how many attempts were made that day.&lt;/P&gt;&lt;P&gt;I went to ChatGPT for help coding the SPL; however, it "claimed" that MLTK needed the data as a count for each user for every Sunday, and could not work with the original events.&lt;/P&gt;&lt;P&gt;Thanks in advance for your help.&lt;BR /&gt;God bless,&lt;BR /&gt;Genesius&lt;/P&gt;</description>
      <pubDate>Fri, 09 May 2025 17:33:52 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/745863#M241556</guid>
      <dc:creator>genesiusj</dc:creator>
      <dc:date>2025-05-09T17:33:52Z</dc:date>
    </item>
    <item>
      <title>Re: MLTK - What algorithm should I use?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/746344#M241620</link>
      <description>&lt;P&gt;Have you looked at these guides?&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://machinelearningmastery.com/practical-guide-choosing-right-algorithm-your-problem/" target="_blank"&gt;https://machinelearningmastery.com/practical-guide-choosing-right-algorithm-your-problem/&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://www.geeksforgeeks.org/choosing-a-suitable-machine-learning-algorithm/" target="_blank"&gt;https://www.geeksforgeeks.org/choosing-a-suitable-machine-learning-algorithm/&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://labelyourdata.com/articles/how-to-choose-a-machine-learning-algorithm" target="_blank"&gt;https://labelyourdata.com/articles/how-to-choose-a-machine-learning-algorithm&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://scikit-learn.org/stable/machine_learning_map.html" target="_blank"&gt;https://scikit-learn.org/stable/machine_learning_map.html&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I expect these will help you select suitable algorithms to try against your test data.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 16:15:59 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/746344#M241620</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-05-16T16:15:59Z</dc:date>
    </item>
    <item>
      <title>Re: MLTK - What algorithm should I use?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/746367#M241622</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/116827"&gt;@genesiusj&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;Based on your description, you're dealing with a time series forecasting problem where you want to predict future user access patterns on Sundays. For this type of scenario in MLTK, I would recommend the following algorithms:

## Recommended Algorithms

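1. Prophet
   a. Good for time series with strong seasonal patterns, provided the history covers several full cycles
   b. Tolerates gaps in the series; note that your lookup has no gaps, because a COUNT of 0 is a real observation rather than a missing value
   c. Can capture multiple seasonal patterns (weekly, monthly, yearly) in general, but with one data point per week and 26 training points only a trend and possibly a monthly pattern can realistically be estimated
   d. Not one of the algorithms bundled with MLTK as far as I know, so it would have to be added as a custom algorithm or run through the Splunk App for Data Science and Deep Learning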

2. ARIMA (AutoRegressive Integrated Moving Average)
   a. Good for detecting patterns and generating forecasts based on historical values
   b. Works well for data that shows trends over time
   c. Can handle seasonal patterns with the seasonal variant (SARIMA)
   d. Requires stationary data (you might need to difference your time series)
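
The differencing mentioned for ARIMA can be sketched in a few lines. This is a minimal Python illustration, run outside Splunk, of first-order differencing and its inverse; the function names and the weekly counts are hypothetical and are not taken from the lookup.

```python
# Illustrative only: first-order differencing, the "I" (integrated) step in ARIMA.
# The weekly counts below are made-up values, not data from the lookup.


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def undifference(first_values, diffs, lag=1):
    """Rebuild the original series from its first `lag` values and the differences."""
    rebuilt = list(first_values)
    for d in diffs:
        rebuilt.append(rebuilt[-lag] + d)
    return rebuilt


weekly_counts = [4, 6, 8, 10, 12, 14]   # hypothetical counts with a linear trend
diffs = difference(weekly_counts)        # the trend becomes a constant step
restored = undifference(weekly_counts[:1], diffs)

print(diffs)
print(restored)
```

A forecast made on the differenced series has to be passed back through `undifference` to return to the original count scale.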

## Implementation Approach

For your specific use case with 1000 users, I would recommend using a separate model for each user who has sufficient historical data. Here's how you could implement this with Prophet:

```
| inputlookup your_lookup.csv
| where DATE &amp;gt;= "2020-01-05" AND DATE &amp;lt;= "2020-06-28"
| rename DATE as ds, COUNT as y
| fit Prophet future_timespan=26 from ds y by USER
| where isnull(y)
| eval DATE=strftime(ds, "%Y-%m-%d")
| eval predicted_count = round(yhat)
| fields DATE USER predicted_count
```

For comparison with actual values:

```
| inputlookup your_lookup.csv
| where DATE &amp;gt;= "2020-07-05" AND DATE &amp;lt;= "2020-12-27"
| join type=left USER DATE
    [| inputlookup your_lookup.csv
    | where DATE &amp;gt;= "2020-01-05" AND DATE &amp;lt;= "2020-06-28"
    | rename DATE as ds, COUNT as y
    | fit Prophet future_timespan=26 from ds y by USER
    | where isnull(y)
    | eval DATE=strftime(ds, "%Y-%m-%d")
    | fields DATE USER yhat
    | eval predicted_count = round(yhat)]
| rename COUNT as actual_count
| eval error = abs(actual_count - predicted_count)
| eval error_percentage = if(actual_count=0, if(predicted_count=0, 0, 100), round((error/actual_count)*100, 2))
```
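
The zero-handling in the `error_percentage` eval can be easier to review outside SPL. The following is a minimal Python sketch of the same rule, assuming integer counts; the function name and the sample rows are hypothetical.

```python
# Illustrative only: the per-row error rule from the SPL above, written in Python.
# Sample rows are made up; each row is (actual_count, predicted_count).


def error_percentage(actual_count, predicted_count):
    """Percentage error with explicit handling of a zero actual count."""
    error = abs(actual_count - predicted_count)
    if actual_count == 0:
        # No division possible: a correct zero scores 0, anything else scores 100.
        return 0.0 if predicted_count == 0 else 100.0
    return round(error / actual_count * 100, 2)


rows = [(4, 3), (0, 0), (0, 2), (3, 4)]
for actual, predicted in rows:
    print(actual, predicted, error_percentage(actual, predicted))
```

Because every miss on a zero-count Sunday is scored as 100 regardless of its size, this metric can look harsh for sparse users, which is one reason to also report an absolute error.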

## Handling Your Data Structure

Since you have 1000 users and 52 Sundays, I have a few recommendations for improving your forecasting:

1. Focus first on users with non-zero access patterns
   a. Many users might have sparse or no access attempts, which can result in poor models
   b. Consider filtering to users who accessed the system at least N times during the training period

2. Consider feature engineering
   a. Add month and quarter features to help the model capture broader seasonal patterns
   b. Include special event indicators if certain Sundays might have unusual patterns (holidays, etc.)
   c. You might want to include a lag feature (access count from previous Sunday)

3. Model evaluation
   a. Compare MAPE (Mean Absolute Percentage Error) across different users and algorithms
   b. For users with sparse access patterns, consider MAE (Mean Absolute Error) instead
   c. Establish a baseline model (like average access count per Sunday) to compare against

4. Alternative approach for sparse data
   a. For users with very sparse access patterns, consider binary classification 
   b. Predict whether a user will attempt access (yes/no) rather than count
   c. Use algorithms like Logistic Regression or Random Forest for this approach
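
The baseline comparison from the model evaluation point can be sketched as follows. This is a minimal Python illustration, run outside Splunk, that scores a forecast against a "training-period average" baseline using MAE; the function names and all counts are hypothetical.

```python
# Illustrative only: compare a forecast with a mean baseline using MAE.
# All counts are made-up values for a single hypothetical user.
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def baseline_forecast(train_counts, horizon):
    """Repeat the training-period average for every future Sunday."""
    return [mean(train_counts)] * horizon


train = [0, 2, 0, 1, 0, 3]       # hypothetical Jan-Jun counts
actual = [1, 0, 2, 0]            # hypothetical Jul-Dec counts
model_forecast = [1, 1, 1, 0]    # hypothetical output of a fitted model

baseline = baseline_forecast(train, len(actual))
print("baseline MAE:", mean_absolute_error(actual, baseline))
print("model MAE:", mean_absolute_error(actual, model_forecast))
```

A model is only worth keeping for a given user if its MAE is lower than the baseline MAE for that user.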

Hope this helps point you in the right direction! With 6 months of training data focused on weekly patterns, Prophet is likely your best starting point.

Please give &lt;span class="lia-unicode-emoji" title=":thumbs_up:"&gt;👍&lt;/span&gt; for support &lt;span class="lia-unicode-emoji" title=":beaming_face_with_smiling_eyes:"&gt;😁&lt;/span&gt; happy splunking .... &lt;span class="lia-unicode-emoji" title=":smiling_face_with_sunglasses:"&gt;😎&lt;/span&gt;&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 16 May 2025 19:50:48 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/746367#M241622</guid>
      <dc:creator>asimit</dc:creator>
      <dc:date>2025-05-16T19:50:48Z</dc:date>
    </item>
    <item>
      <title>Re: MLTK - What algorithm should I use?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/751543#M242519</link>
      <description>&lt;P&gt;Apologies. I did not see any notification in my email about this question receiving responses.&lt;/P&gt;&lt;P&gt;I am moved from project to project, and this one is now on hold.&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/310230"&gt;@asimit&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/214410"&gt;@isoutamo&lt;/a&gt;&amp;nbsp; I gave you some karma.&lt;/P&gt;&lt;P&gt;God bless.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Aug 2025 13:35:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/751543#M242519</guid>
      <dc:creator>genesiusj</dc:creator>
      <dc:date>2025-08-13T13:35:30Z</dc:date>
    </item>
    <item>
      <title>Re: MLTK - What algorithm should I use?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/751553#M242523</link>
      <description>&lt;P&gt;Thanks. If you continue with this and solve it, please let us know what the solution was!&lt;/P&gt;&lt;P&gt;ChatGPT proposes the following algorithms for this case:&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;Algorithm in MLTK&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;When to Use&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;Strengths&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;Weaknesses&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;StateSpaceForecast&lt;/STRONG&gt; (algo=StateSpaceForecast)&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Small seasonal datasets with trend&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Captures seasonality &amp;amp; trend&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Not designed for sparse series with many zeros&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;ARIMA&lt;/STRONG&gt; (algo=ARIMA)&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Strong seasonality, autocorrelation&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Handles short time series&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Needs continuous values (zeros can reduce model quality)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;LLP (Local Linear Projection)&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Simple, quick&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Light-weight&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Limited for complex patterns&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;DenseNNRegressor / LSTM&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Complex, nonlinear series&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Can learn patterns over multiple users at once&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Requires MLTK deep learning toolkit, more tuning&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;STRONG&gt;One-Class SVM / Isolation Forest&lt;/STRONG&gt; (for anomaly detection instead of forecasting)&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Detecting abnormal future access&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Robust to sparse data&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Not a forecasting method per se&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Given your case (weekly time series, a short history of 26 training points, weekly seasonality, and sparsity), the &lt;STRONG&gt;StateSpaceForecast&lt;/STRONG&gt; algorithm is usually the safest starting point in MLTK for this type of forecasting.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Aug 2025 14:03:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/MLTK-What-algorithm-should-I-use/m-p/751553#M242523</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-08-13T14:03:44Z</dc:date>
    </item>
  </channel>
</rss>

