
How can I prepare my data set to be ready for machine learning?

Explorer

I am starting to learn machine learning on Splunk through the embedded Machine Learning Toolkit. In order to fit a model, you first need to prepare your data set, which I am trying to do by preparing the fields I will use as factors to predict the error response code. I will be using Predict Categorical Fields.
The factor used to predict is the response duration for each API call, which comes from a different index than the one providing the response code.
I would like to know how I can combine the two indexes so that I have both fields: the response code (which I am extracting with a regex) and the duration.
My query is as follows; please note that I am also using an accelerated data model for one of the data sets.

| tstats count as total,
    max(Root_Transaction.duration) as "Max Duration",
    perc95(Root_Transaction.duration) as "95th Percentile Duration"
    FROM datamodel=X
    WHERE nodename=Root_Transaction
    by Root_Transaction.requesting_system, Root_Transaction.service_name
| appendcols
    [ search index=Y membername=APINAME "Response: StatusCode"
    | rex "status..:(?<respcode>\d+)"
    | join traceid
        [ search index=Y membername=Sla
        | stats sum(timing) as Duration by traceid
        | eval Duration = Duration/10000 ]
    | stats max(Duration) as MAX_Duration by respcode ]

This query gets me the max duration taken by a certain API call, and from the other component (index) I am getting the response code for that API call, which is what I need to predict.
So I am saying: based on the duration, predict the response code.
I am not able to get a good correlation between the two indexes, so I cannot build a data set to fit to the model.
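
For reference, a minimal sketch of a training table that carries both fields per call, built entirely from index=Y (this assumes, as the query above does, that every traceid has both an APINAME event and an Sla event):

```spl
index=Y membername=APINAME "Response: StatusCode"
| rex "status..:(?<respcode>\d+)"
| join type=inner traceid
    [ search index=Y membername=Sla
    | stats sum(timing) as Duration by traceid
    | eval Duration = Duration/10000 ]
| table traceid, Duration, respcode
```

A table with one row per traceid, holding Duration and respcode side by side, is the kind of shape a categorical-prediction fit needs.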

1 Solution

SplunkTrust

What is missing in your problem description is how you expect the system to predict anything from what you were doing.

Are you trying to say, "I want the system to predict the slowest response code for each requesting_system and service_name, based on the maximum response time for that system compared to the maximum duration for each response code"?

Below is a description of what your code is currently doing, which I don't think is what you meant it to do.


This part (if working) is going to give you one record per combination of requesting_system and service_name, with the max and 95th percentile...

| tstats count as total, 
    max(Root_Transaction.duration) as "Max Duration", 
    perc95(Root_Transaction.duration) as "95th Percentile Duration" 
    FROM datamodel=X  
    WHERE nodename=Root_Transaction 
    by Root_Transaction.requesting_system, Root_Transaction.service_name

This part (if working) is going to give you a table of respcode and MAX_Duration... I'm guessing that
the respcode is the field you were rexing out of there.

[search  index=Y  membername=APINAME  "Response: StatusCode" 
| rex "status..:(?<respcode>\d+)"
| join traceid 
       [ search index=Y   membername=Sla 
       | stats sum(timing) as Duration by traceid  
       | eval Duration= Duration/10000
       ]
| stats max(Duration) as MAX_Duration by respcode
]

This part is going to arbitrarily assign the first result from tstats to the first respcode from the table, the second result from tstats to the second respcode of the table, and so on.

| appendcols

Notice - There is no attempt to match requesting_system and service_name to any relevant respcode. They are just paired up based on the order in which they happen to be returned.

Moral of the story - appendcols doesn't do anything you should find useful. There's always a better way.
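
For example, if both data sets did share a key (service_name is used here as a hypothetical common field that you would need to extract from index=Y yourself), one safer pattern is to append the two searches and re-aggregate on that key, so rows are matched by value rather than by position:

```spl
| tstats max(Root_Transaction.duration) as max_duration
    FROM datamodel=X WHERE nodename=Root_Transaction
    by Root_Transaction.service_name
| rename Root_Transaction.service_name as service_name
| append
    [ search index=Y membername=APINAME "Response: StatusCode"
    | rex "status..:(?<respcode>\d+)"
    | stats count by service_name, respcode ]
| stats values(max_duration) as max_duration, values(respcode) as respcode by service_name
```

This is only a sketch of the pattern; the point is that the final stats groups by an actual shared value instead of relying on row order.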



Explorer

Thanks so much for all the details, appreciate it.

SplunkTrust

@delgendy - You have to step back and apply first principles to the problem set first. If all the data you are going to have is the 95th percentile and max duration, then there's no point in using the machine learning stuff, however cool it is. If you timecharted that, you might be able to see a trend. If you clustered it, you might be able to detect groupings of indexes. But machine learning is inappropriate for that data.

If you want to predict response times, and you want to use machine learning, and you want to take EVERYTHING into account, then you can start by thinking of all the things that could affect response times (time of day, day of week, cpu load, index, type of API call, etc) and figure out how to collect and clean and prep and correlate all that data.

Then, having collected and prepped a decently-sized mess of data, the machine learning may be able to explain something interesting to you. (For instance, you may find that a particular call at a particular time of weekday predicts a slow response, and from investigating that you may find that there is a competing process scheduled at that point that needs to be optimized.)
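
As a rough sketch of that kind of prep, using the timing and traceid fields from the earlier query (hour and weekday are illustrative, assumed features; anything else you can measure becomes another column):

```spl
index=Y membername=Sla
| bin _time span=1h
| stats sum(timing) as Duration by traceid, _time
| eval Duration = Duration/10000
| eval hour = strftime(_time, "%H"), weekday = strftime(_time, "%A")
| table traceid, hour, weekday, Duration
```

With a table like this, the toolkit can be asked which of the candidate factors actually predicts Duration.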


SplunkTrust

Please remember to mark your code as code, for example using the 101 010 button, so that the interface does not delete or mangle anything that looks like HTML. I've done that for you on this one, but you will notice that it has still deleted the name of the field you were extracting from "Response: StatusCode".
