I am doing some online course completion analysis and am trying to get a report which seems to me to be quite difficult to achieve. I have data from 2 sources, one containing the learners in the system and which company they belong to and the other a list of courses completed by those learners.
A simplified example of a learners record is
and course completions is
There are 3 different learner types, "Direct sales", "Indirect sales", "Other" and I want to ignore Other in this report.
I have 2 sourcetypes,
What I want to get out is a report that shows
Number of passes
Total number of learners in the company
Percentage complete as total learners by # passes
It needs to group the report by course, so the course name is only reported once and there is a line for each of the companies with the 3 additional columns of data.
I have got as far as two types of splunk command, each of which goes part way to solve the problem. The first
sourcetype=completions learner_type!="other" | stats count(learner_type) as Passed by course, company | chart list(company) AS Company, list(Passed) as Passed by course
That gives me the courses and within each course there are 2 columns, 1 the company and 2, the number of courses passed for that company. Cool, but not enough...
The second uses transaction to group by course
sourcetype=completions learner_type!="other" | transaction course | sort course | table course, company
That gives me a table with course and companies, with just a single line for course.
So, the first is missing the count of learners for each company, which must somehow come from the learners source_type as well as the % complete.
The second is missing most stuff, but is an example of partially achieving what I wanted with the transaction command, which I had not used before.
I feel like I need to do an initial count of the learners for each company and pass that onto the rest of the search pipeline, but am now stuck on how to progress...
Any help much appreciated.
There are several ways to approach this.
First approach: Use a lookup table. This method will scale nicely. It also leverage the fact that Splunk is well-suited to manipulate time-based data. That might not be important today, but... In this method, you continue to index the completions, but the learner information becomes a lookup table (which is not indexed).
If you place the learner data in a lookup table, you could have a
learner.csv file that looks like this
userid,name,learner_type,company djones,Dee Jones,employee,Big Data Inc. etc.
Once you set up the lookup (tutorial here), you could run the search like this:
sourcetype=completions | lookup learnerLookup userid OUTPUT learner_type company | where learner_type!="Other" | eventstats dc(userid) as LearnerCount by company | stats count as Passed avg(LearnerCount) as Learners by course, company | eval PercentComplete = round(Passed*100/Learners,2)
Second approach: Use a join Joins in Splunk are a little different than joins in SQL. It won't work if there are more than 10,000 results in the subsearch. For this, you would keep your learner data in Splunk as you do now.
sourcetype=completions | join userid [ search sourcetype=learners learner_type!="Other" | fields userid company ] | eventstats dc(userid) as LearnerCount by company | stats count as Passed avg(LearnerCount) as Learners by course, company | eval PercentComplete = round(Passed*100/Learners,2)
A final point - how will you update your learner data in the future? Splunk does not allow you to update indexed data - you would have to delete it and then index new data. A lookup table can be reloaded at any time, without any impact on your Splunk license.
Thanks Iguinn, option 1 is not possible, the learners change all the time and a monthly snapshot of learners is taken and added to the index.
So, option 2 is taking me forward, but not giving me quite what I want. I need to look at why it is giving me what it is then come back to this, but thanks for far, that's really helpful.