I've recently begun exploring the FieldSelector command to better understand which fields are the best predictors for an ML model. From my research, I think I have a decent understanding of what makes a good predictor field: a low p-value (anything below .05) and a high score (the higher the better). While running some tests, I noticed that the fields FieldSelector selects don't look like the optimal selection to me.

Here is the fit command I'm using:

| fit FieldSelector num from PC_* value_hashed_* type=numeric mode=k_best param=10 into combined_field_selector

Once this has run, I compare its output to the summary of the combined_field_selector model, which provides the score and p-value for every field:

| summary combined_field_selector

One of the ten fields selected by FieldSelector was PC_2, with a score of .3293 and a p-value of .5661. Of the 132 fields passed to this fit command, PC_2 ranked 115th by score and had the 15th-highest p-value. That suggests it is not a good predictor for the model, and more than ten other fields had better score/p-value combinations.

I know this type of question falls in the no man's land between the underlying Python, the statistical algorithms, and Splunk, but Splunk is really my only means of applying ML to this data and troubleshooting the results. I'm hoping someone with a better understanding of what's going on can explain why these fields are being selected.
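To make the comparison concrete, here is a minimal Python sketch of the check I'm doing by hand. It assumes that mode=k_best is meant to keep the k fields with the highest scores, which is my understanding rather than something I've confirmed. The summary rows are hypothetical stand-ins for the summary output (only the PC_2 values are real), and the function names and the selected list are made up for illustration.

```python
# Hypothetical rows standing in for the output of `| summary combined_field_selector`.
# Only the PC_2 values come from my actual results; the rest are invented.
summary = [
    {"field": "PC_1", "score": 41.7, "p_value": 0.0001},
    {"field": "PC_2", "score": 0.3293, "p_value": 0.5661},
    {"field": "PC_3", "score": 12.4, "p_value": 0.0005},
    {"field": "value_hashed_0", "score": 6.8, "p_value": 0.0093},
    {"field": "value_hashed_1", "score": 0.9, "p_value": 0.3430},
]


def expected_k_best(rows, k):
    """Return the k field names with the highest scores (ties keep input order)."""
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [r["field"] for r in ranked[:k]]


def score_rank(rows, field):
    """Return the 1-based rank of `field` when rows are sorted by descending score."""
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [r["field"] for r in ranked].index(field) + 1


def unexpected_selections(rows, selected, k):
    """Return selected fields that are not among the k highest-scoring fields."""
    expected = set(expected_k_best(rows, k))
    return [f for f in selected if f not in expected]


# Hypothetical list of fields kept by the fit command (assumption for illustration).
selected = ["PC_1", "PC_2", "PC_3"]

print(expected_k_best(summary, 3))
print(score_rank(summary, "PC_2"))
print(unexpected_selections(summary, selected, 3))
```

With my real data, the equivalent of unexpected_selections is not empty, and PC_2 is in it, which is what I can't explain.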