Performance very slow for index join on 200MB data...

jonu4u · ‎07-06-2019

Hi I've json data in one index containing student_id,grade_id,class_id . and another static dataset(csv) in another index which contains a the names based on ids. Say student_name,student_id,grade_name,grade_id.
I need to show a dashboard in realtime which will update the count of students per grade per class , however the filter are on the names and not the ids. So I need to join these two datasets. The grade_id,class_id are inside nested json and are extracted at search time before they are joined with the static index. This query is taking 20seconds!!! too slow. What techniques can be done to make it under 2-3seconds since data volume is very small.

woodcock · ‎07-06-2019

Even better, I would use a scheduled search to turn the 2nd CSV dataset back into a lookup like this:

(index=<student data> AND sourcetype=<student data>)
| dedup student_name student_id grade_name grade_id
| table student_name student_id grade_name grade_id
| outputlookup MyStudentLookup.csv

Then do this:

 index=<json data> AND sourcetype=<json data>
| lookup MyStudentLookup.csv student_id grade_id

woodcock · ‎07-06-2019

This is one of the main reasons you keep hearing us say never use join; instead use stats like this:

(index=<json data> AND sourcetype=<json data> OR (index=<student data> AND sourcetype=<student data>)
| stats values(*) AS * BY student_id grade_id

Then do your stuff from here.

richgalloway · ‎07-06-2019

Please share your search.
Also, please explain why a dashboard needs to be real-time. Is the number of students in a class changing so fast a 20-second query can't keep up? If a human is processing the results of the query then real-time is not necessary.

Perhaps by "realtime" you mean "fast". That's different and by seeing your query we may be able to suggest improvements.
What is the size of the "very small" data volume?

---
If this reply helps you, Karma would be appreciated.

jonu4u · ‎07-07-2019

So we need to monitor student activity. So any instant if a student is inactive for less than 5 seconds we see that. So we want it realtime.

Performance very slow for index join on 200MB data only

Harnessing Splunk’s Federated Search for Amazon S3

Infographic provides the TL;DR for the 2024 Splunk Career Impact Report

Enterprise Security Content Update (ESCU) | New Releases