I have a field of titles, each a sentence describing why a test failed in a security audit, but they are recorded separately for each asset. So two different assets can have the same failure reason listed in different words. For example, one might say "Login Password is empty" and another asset's failure might say "Login password did not meet requirements". If I could aggregate them based on words like "password", I could get more value from the data. I can't hardcode the keywords because I don't know all the possible aggregates.
Here is what I have so far, and I'm open to any feedback:
earliest=-1d@d latest=@d index=cdb_summary sourcetype=cfg_summary source=CDM_*_Daily_Summary
| search hva=*
| eval FailedSTIGs=mvsort(split(FailedSTIGs,","))
| stats values(fismaid) as fismaid dc(asset_id) as Affected by FailedSTIGs,hva
| lookup DHS_Expected_Checks "STIG ID" as FailedSTIGs output "Rule Title"
| fit TFIDF "Rule Title" as rule_tfidf ngram_range=1-12 max_df=0.6 min_df=0.2 stop_words=english
| fit KMeans rule_tfidf* k=8
| fields cluster "Rule Title"
| sample 6 by cluster
| sort by cluster
This is similar to dealing with varied logs from different applications that share common business and technology domains, only more "freehand". We tried to encourage standardization, but that only went so far: I still couldn't predict what the developers would throw at me. I had to tune my aggregation strategies manually and update them from time to time.
Ideally, you would have a natural language model to deal with these titles. Failing that, you can use ML to do some clustering and start tuning from there. Either way, this is going to be dynamic: expect to keep adjusting it as new titles appear.
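One simple way to start that tuning, before or alongside clustering, is to let the titles themselves suggest the candidate keywords. The following is a minimal sketch in plain Python, outside Splunk and not part of the original search. It is not ML clustering, just a shared-term grouping, and it assumes a hand-picked stop-word list. The function names and the third sample title are hypothetical; the first two titles come from the question.

```python
import re
from collections import defaultdict

# Assumed stop-word list for illustration; tune it against your own titles.
STOP_WORDS = {"a", "an", "the", "is", "did", "not", "to", "of", "be", "must"}


def tokenize(title):
    """Lowercase a title and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def group_by_shared_terms(titles, min_titles=2):
    """Map each term found in at least `min_titles` titles to those titles."""
    by_term = defaultdict(list)
    for title in titles:
        for term in sorted(tokenize(title)):
            by_term[term].append(title)
    return {t: found for t, found in by_term.items() if len(found) >= min_titles}


titles = [
    "Login Password is empty",
    "Login password did not meet requirements",
    "Audit log retention is too short",  # hypothetical title for contrast
]

groups = group_by_shared_terms(titles)
for term, found in sorted(groups.items()):
    print(term, "->", found)
```

Terms that survive the `min_titles` threshold are candidate aggregates to review by hand. Raising the threshold or extending the stop-word list is the kind of periodic tuning described above.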