
tfidf+anomalydetection

Contributor

Hello,

I would like to scan my logs for anomalies. Say I want to evaluate a new software kernel by scanning the corresponding logs for unusual occurrences of phrases that did not appear before, and, based on that, alert and perform a root cause analysis.

For that, I would like to build a model that calculates the daily word/phrase frequencies based on the logs from the past 90 days. Then I would compare the last day's frequencies to those from the model and decide whether they are anomalous or not.
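Outside of SPL, the core of that baseline idea can be sketched in plain Python (my own illustration with a simple z-score rule and invented names, not how MLTK implements it):

```python
from collections import Counter
from statistics import mean, pstdev

def anomalous_tokens(baseline_days, last_day, z_threshold=3.0):
    """Compare the last day's token counts against a per-day baseline.

    baseline_days: list of Counter, one per day (e.g. 90 entries)
    last_day: Counter for the most recent day
    Returns tokens whose last-day count deviates strongly from the
    baseline, including tokens never seen before.
    """
    vocab = set().union(*baseline_days) | set(last_day)
    flagged = {}
    for tok in vocab:
        history = [day.get(tok, 0) for day in baseline_days]
        mu, sigma = mean(history), pstdev(history)
        today = last_day.get(tok, 0)
        if sigma == 0:
            if today != mu:  # e.g. a phrase that never occurred before
                flagged[tok] = today
        elif abs(today - mu) / sigma > z_threshold:
            flagged[tok] = today
    return flagged
```

A phrase that occurred on none of the 90 baseline days but appears today is always flagged, which is exactly the "did not happen before" case described above.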

If I skip the time constraints (90 days, daily frequencies, last day), I guess the SPL that describes more or less what I want to achieve is:

index=mlbso sourcetype=ISP_abaptraces ERROR | eval text = _raw | table text | fit TFIDF text | anomalydetection

So, how would I now apply the time constraint from above to achieve what I want?

The TFIDF fit should create a kind of model with the daily frequencies from the past 90 days and store it somewhere; anomalydetection should then run on a daily basis and compare the last day to what TFIDF created ...

Kind Regards,
Kamil


SplunkTrust

So if I understand this correctly, you want to count the number of unique phrases/words and compare them against a baseline of what you've had in the past? Then you want to alert if the unique number of words/phrases increases relative to a point in time? If so, then yes, I can walk you through that.


SplunkTrust

Any update on this? @damucka


Contributor

Hello @skoelpin

Thank you for reminding me ...
Unfortunately, after thinking about it for a bit, I am not able to find a reasonable way to utilize TFIDF for this; perhaps you can help. The point is that the output of TFIDF is a frequency matrix plus the n-gram structure. Now, if I apply this model as a baseline to my events from the last day, I get a second matrix with the n-gram frequencies, okay. But if I make that the input to anomalydetection, will it be enough for the algorithm? In the end these are only two frequency vectors for a given event (pattern): the one from the model and the current one.
Also, choosing the proper n-gram ranges is not trivial, I think. Do you have any experience with that?

In the end, I skipped TFIDF and relied on the "anomalies" and "anomalydetection" commands applied directly to the pre-parsed text or _raw field. I have two cases:

1/
Select the errors from the last 180 days, clean the events (remove dates and digits), and apply the "anomalies" command. Take the 200 most unexpected events and keep only those that happened in the last 24 hours.

index=mlbso sourcetype=ISP_abaptraces ( mtx OR mmx OR mm_diagmode OR sigigenaction OR thierrhandle OR mutex OR "ca blocks" ) AND ERROR 
earliest=-180d@d latest=now  
| rename comment AS "1.) ------------------------- SPL Search string based on the expert domain knowledge --------------------------"   

| rex field=_raw "^(?<firstLine>.*)\n.(?<remainingLines>[\s\S]*)$"
| eval text=replace(remainingLines,"\d+","")
| rename comment AS "2.) ------------------------- Pre-parsing: get rid of first line with date / time and the digits ---------------"

| anomalies field=text
| rename comment AS "3.) ------------------------- Anomalies -----------------------------------------------------------------------"

| table _time _raw text unexpectedness
| sort 200 -unexpectedness
| cluster field=text
| eval time_ago_24 = relative_time(now(), "-24h")
| where _time > time_ago_24
| sort 3 -unexpectedness
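For what it's worth, the pre-parsing in step 2.) (drop the first line with the timestamp, then blank out digit runs) corresponds to roughly this Python logic (my own rendition of the rex/replace pair, not Splunk code):

```python
import re

def preparse(raw_event):
    """Mirror the SPL pre-parsing: drop the first line (date/time),
    skip the single character the rex consumes after the newline,
    then remove every run of digits."""
    parts = raw_event.split("\n", 1)
    remaining = parts[1][1:] if len(parts) > 1 else ""
    return re.sub(r"\d+", "", remaining)
```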

2/
Alert, executed hourly, checking the last 7 days of error events, on the same principle as 1/ but with a shorter time horizon and using the "anomalydetection" command:

index=mlbso sourcetype=ISP_hanatraces loglevel="e" 
earliest=-7d@d latest=now  
| rename comment AS "1.) ------------------------- SPL Search string based on the expert domain knowledge --------------------------"   

| anomalydetection _raw
| rename comment AS "2.) ------------------------- Anomalydetection -----------------------------------------------------------------------"

| eval time_ago_last_h = relative_time(now(), "-1h")
| where _time > time_ago_last_h
| cluster 
| table _time _raw text log_event_prob max_freq probable_cause probable_cause_freq

| rename comment AS "3.) ------------------------- Only if occurs within the last ----------------------------------------------------------"

These are quite simple ideas; we are now testing the quality of the outcome.
What also came to my mind is something like creating the output of "cluster" each day, saving it somewhere, and comparing it to the series of previous outputs using anomalydetection. The idea is to check whether each day there is the same number of errors/entries per cluster and, if not, alert on it. Perhaps one could also use kmeans/xmeans for that. However, I would not know how to store this daily data; in the end there would have to be a kind of cluster-count timeseries. I am not sure, though, whether that would be better than just anomalydetection on _raw.
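A hypothetical sketch of that cluster-count timeseries check in Python (the storage side, e.g. a summary index or lookup in Splunk, is omitted, and all names here are invented):

```python
from statistics import mean, pstdev

def cluster_count_alerts(daily_counts, today_counts, z_threshold=3.0):
    """daily_counts: dict cluster_id -> list of per-day event counts.
    today_counts: dict cluster_id -> today's count.
    Flags clusters whose count today deviates from their history,
    and brand-new clusters with no history at all."""
    alerts = {}
    for cid, today in today_counts.items():
        history = daily_counts.get(cid)
        if not history:
            alerts[cid] = ("new cluster", today)
            continue
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            if today != mu:
                alerts[cid] = ("count changed", today)
        elif abs(today - mu) / sigma > z_threshold:
            alerts[cid] = ("count anomaly", today)
    return alerts
```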

... so you see, my ideas are a pretty big mess. It would be great if you could give your thoughts on them.
What is your experience from other users/customers?
Is log/text-based anomaly detection case-specific, or are there any general hints you could give?

Also, I requested the Splunk User Group access on Slack as recommended by @niketnilay.

Kind Regards,
Kamil


Esteemed Legend

Maybe you should pre-parse the words and separate them out?

index=mlbso sourcetype=ISP_abaptraces ERROR
| rex max_match=0 "(?<word>[A-Za-z]+)"
| stats count BY word

Contributor

@woodcock

Thank you.
I am pre-parsing by skipping the first event line with the date / time, which I find irrelevant, and also getting rid of all digits. At the moment it looks as follows:
index=mlbso sourcetype=ISP_abaptraces ( mtx OR mmx OR mm_diagmode OR sigigenaction OR thierrhandle OR mutex OR "ca blocks" ) AND (WARNING OR ERROR)
earliest=-14d@d latest=-1d@d
| rename comment AS "1.) ------------------------- SPL Search string based on the expert domain knowledge ----------------------------------------------"

| rex field=_raw "^(?<firstLine>.*)\n.(?<remainingLines>[\s\S]*)$"
| eval text=replace(remainingLines,"\d+","")
| table _raw text
| rename comment AS "2.) ------------------------- Pre-parsing: get rid of first line with date / time and the digits ----------------------------------"

| fit TFIDF text max_features=15 stop_words=english ngram_range=3-5
| rename comment AS "3.) ------------------------- Fit TFIDF and store the model -----------------------------------------------------------------------"

| anomalydetection
| rename comment AS "4.) ------------------------- Anomaly Detection -----------------------------------------------------------------------------------"

The idea of counting the words is interesting, but then I would not know how to apply anomalydetection to the result. I mean, I get back the list of words and their frequencies, which is what TFIDF already does (in a more sophisticated way, with n-grams, max_features, exclusion lists and its own way of weighting frequencies), and anomalydetection itself calculates histograms. But both need the event list as input, not an aggregate.
Or is there something I misunderstood completely?
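To make the "event list, not an aggregate" point concrete, here is a minimal hand-rolled TF-IDF over word n-grams (my own simplified weighting; scikit-learn, which MLTK wraps, smooths and normalizes differently):

```python
import math
from collections import Counter

def ngrams(text, n):
    """All contiguous word n-grams of a text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tfidf(events, ngram_range=(1, 2)):
    """events: list of raw event strings.
    Returns one {ngram: tf-idf weight} dict per event.
    Needs every event: the idf term depends on how many
    documents contain each n-gram, which an aggregated
    count table no longer tells you."""
    lo, hi = ngram_range
    docs = [Counter(g for n in range(lo, hi + 1) for g in ngrams(e, n))
            for e in events]
    df = Counter(t for d in docs for t in d)  # document frequency
    n_docs = len(events)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in d.items()}
            for d in docs]
```

An n-gram that occurs in every event gets weight log(n_docs/n_docs) = 0, which is why common boilerplate phrases drop out.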

Kind Regards,
Kamil


Esteemed Legend

It is really me misunderstanding you. I have no idea what TFIDF means or anything on that entire line.


SplunkTrust

@damucka, you can try the following steps with run anywhere example based on Splunk's _internal index.

Step 1: Clean the data as per your needs. The following is an independent search where I look for Error splunkd logs as an example (I ran it for the last 7 days; it seems you want to run it for the last 90 days, which would require serious compute).

index=_internal sourcetype=splunkd component IN ("Search*","Dispatch*", "Exec*")  log_level=ERROR earliest=-7d@d latest=-1d@d
| eval text = _raw
| table text

Step 2: Once you have the set of required fields and values, you can apply the fit command to create an internal model based on TFIDF (it is actually a matrix of token occurrences stored in an internal lookup file, in the case below the file __mlspl_ml_experiment_tfidf_text.csv).

| fit TFIDF text into ml_experiment_tfidf_text max_features=10000 stop_words=english ngram_range=3-5

Step 3: Finally, apply the above model to your test/current data as needed:

index=_internal sourcetype=splunkd component IN ("Search*","Dispatch*", "Exec*") earliest=-0d@d latest=now
| eval text = _raw
| table text
| apply ml_experiment_tfidf_text
| anomalydetection

While this is just an example, I am pretty sure a real-world implementation would not be as easy as this 🙂 Check out the Machine Learning Toolkit app Showcase examples to see whether any existing example fits your needs.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

Contributor

@niketnilay

Thank you, it is a good starting point for me. As you correctly guessed, I have tons of questions.

1/
First, how would I pass only specific columns / column-number ranges of the TFIDF output on to the anomalydetection command? I would like to make it generic, of course, and not select by name, since the names can change depending on the n-gram and frequency settings ...
If I do not do that, anomalydetection detects anomalies mostly based on the event text itself (probable_cause), meaning I would not need TFIDF at all in that case, and I would like to test it as a pre-processor.

2/
Are there rule-of-thumb values for max_features when the output feeds anomalydetection? In your example it is 10000, which means anomalydetection deals with a multivariate problem of 10000 series; I am not sure about the quality of that. Would it be enough for one series to have an anomaly for it to be reported? I know I can read the documentation and test it, but I thought you might already have experience with this.

3/
If I have a test event that is known to be anomalous, is it possible to make the whole procedure somewhat automatic, so that:
- the TFIDF parameters are varied (max_features / ngram_range)
- the anomalydetection parameters are varied (e.g. the algorithm)
- the TFIDF model gets saved and applied, and anomalydetection is run over it
- ... and if the known anomalous event is returned by a specific parameter combination, I get it presented in a reasonable way (output)?
I know this already sounds like programming, but still, let me ask the naive question: is this possible using SPL, or do I have to go for Python in this case?
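For what it's worth, a pure-SPL loop over parameter combinations is awkward; in Python the sweep itself is trivial once the fit/apply/detect pipeline is callable as a function. A sketch, where `detector` is a hypothetical stand-in for that pipeline (in practice it would wrap the MLTK fit/apply/anomalydetection calls):

```python
from itertools import product

def sweep(detector, events, known_anomaly, max_features_opts, ngram_opts):
    """Try every parameter combination and report those where the
    detector flags the known anomalous event.

    detector: callable(events, max_features=..., ngram_range=...)
              returning the list of flagged events (a stand-in for
              the real fit/apply/anomalydetection pipeline).
    """
    hits = []
    for mf, ng in product(max_features_opts, ngram_opts):
        flagged = detector(events, max_features=mf, ngram_range=ng)
        if known_anomaly in flagged:
            hits.append({"max_features": mf, "ngram_range": ng})
    return hits
```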

Kind Regards,
Kamil


SplunkTrust

@damucka I would recommend getting on the #machinelearning channel on Slack. Request Splunk User Group access so that you can work directly with ML experts: https://answers.splunk.com/answers/443734/is-there-a-splunk-slack-channel.html

Paging ... @skoelpin , @astein_splunk


Contributor

@niketnilay

Thank you.
I submitted the form to join the Splunk-Usergroups on Slack (Splunk-Usergroups Slack Signup) twice, last Friday and yesterday. The message there says I should get an e-mail within 30 minutes, which did not happen.
Now I do not know whether this is an automatic process and something went wrong, or whether someone on the Splunk side has to confirm it and I should just wait.

Could you advise?

Kind Regards,
Kamil
