I want to create a plain and simple histogram in Splunk, like everyone used to draw on graph paper in school. I have selected the "id" and "pr" fields. I want "id" on the x-axis and its corresponding value of "pr" on the y-axis. How should I do that? Splunk isn't allowing me to do it. I don't want to use Sum, Count, Max, Min, Standard Deviation, or Mode.
source="HVR_1 PageRank.csv" id="*" pr="*" | chart pr over id
Can anyone correct my code? Please!
I've had this sort of question come up a lot, and I thought maybe I'd give a different kind of answer, in case it was helpful or complementary.
The questions are more or less: "I want to just chart the raw values as points on a screen," typically as a timechart.
And they come up in two ways:
a) I don't want to bucket the times, and I don't want to think about avg/min/max, because there aren't very many of them! I just want the values on the screen
b) I don't want to bucket the times and/or think about avg/min/max because I want the human eye to see the storm of points as a scatter plot and I think that'll be better than some clever statistic.
And there are a few ways to answer it.
1) OK, you can throw the raw points at the chart, you just have to use no actual transforming command at all!
Here's a good canonical answer
Con - If the number of distinct time values exceeds (or greatly exceeds) the number of pixels across the screen, you're not going to have a good time, i.e. the "storm of points" may just be a weird fuzzy block of noise.
Con - The charting framework doesn't really like to graph tens or hundreds of thousands of points. Now or down the road, you might get truncated results and warning messages about the truncation.
2) Sometimes the correct answer is to really come back and use some statistical aggregation, and resign yourself to a particular bucketing of the time values. Here's a good, if verbose question that covers this:
3) And there are sometimes other outlier answers, like this one here, which uses first() as a shoot-from-the-hip heuristic.
But this seems, IMO, pretty problematic and potentially misleading. Use with caution.
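Applied to the original question, option 1 might look something like the sketch below. It assumes the "id" and "pr" fields are already extracted from the CSV source named in the question, and it uses no aggregate function at all: table only selects the two fields, and sort 0 num(id) orders all rows by id numerically (the 0 lifts the default row limit on sort).

```
source="HVR_1 PageRank.csv" id="*" pr="*"
| table id pr
| sort 0 num(id)
```

With results shaped like this, picking a column chart on the Visualization tab should use the first column (id) for the x-axis and pr for the bar heights. The truncation caveat above still applies if there are a lot of ids.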
Kind of sprawling answer. Perhaps not really an "answer" at all and more of a "further reading" post. 😃
You are rejecting the methods that work. WHY?
You are focused on creating a histogram, which means that for each value of id, there must be a single unique numeric value of pr that constitutes how tall the bar will be.
What, exactly, does the value of pr mean? It must be a number, for the one-dimensional histogram you are asking for to exist.
If pr is not a number, then COUNT is the only aggregate function that makes sense. Use that. (If there are multiple possible values of pr for each id, you could use distinct count also, or you could abandon the single-dimension histogram in favor of something else.)
If pr is a number, and if there is only one event (and therefore only one pr value) for each value of id, then SUM, MAX, MIN, AVG will all work and will all get the same answer.
If pr is a number, and there are multiple possible events for each combination of pr and id, then you need to decide exactly what you are trying to graph. Figure out the math for "how do I know how tall the bar needs to be?" and then code that into the chart command (or any other command).
On the other hand, if you want to do an x-y plot of various values, try visualizations that are not bar charts. Specifically, try the bubble chart and other x-y plots to see if they meet your need.
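To find out which of the cases above applies, one possible check is to count events and distinct pr values per id. This is only a sketch: the source name and field names are taken from the question, and the output names events and distinct_pr are made up for illustration.

```
source="HVR_1 PageRank.csv" id="*" pr="*"
| stats count AS events dc(pr) AS distinct_pr by id
| where events > 1
```

If this returns no rows, every id has exactly one event, and any of SUM, MAX, MIN, or AVG will give the same bar heights. If it does return rows, those are the ids for which you need to decide how tall the bar should be.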
"pr" is the PageRank of "id" node. Every "id" has only one "pr". Is not "Sum of pr" is addition of all nodes "pr"? Or, does it just plot the histogram on 1:1 basis, like one value from x-axis pointing to only one value on y-axis?
That depends on your query. If each value of id has only one value of pr returned by the query, and that value is numerical, then that value is indistinguishable from the result of most aggregate functions: mathematically, it is equal to the sum, the max, the min, the mode, the median, and the mean; sequentially, it is the first, the last, the earliest, and the latest; set-wise, it is completely equivalent to the list() and the values(). So, for the case where the id-pr relationship is 1:1, almost any meaningful aggregate function will serve. (Okay, not the stdev, but that wouldn't be meaningful.)
If you use | chart sum(pr) over id, then for each id, Splunk will calculate the sum of the pr values for that id only, not the sum across all ids.
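For the 1:1 case, a corrected version of the search from the question could look like the sketch below. The choice of max() is arbitrary here; with one pr per id, sum(), min(), or avg() should draw the same bars. The AS pr rename is optional and only keeps the series label short.

```
source="HVR_1 PageRank.csv" id="*" pr="*"
| chart max(pr) AS pr over id
```

The original chart pr over id fails because chart expects an aggregate function around the value field, even when there is only one value to aggregate.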
So what if there's more than one pr for one id? Which pr value should it use? How would Splunk know that?
Is there a time aspect to this data? Or is it only a "most recent value" type dataset?
Yes, I can provide that. "pr" is the PageRank of the "id" node. Each node has only one "pr". Here is a sample.
name  id   pr           count  name2  id2
148   148  0.199162542  64     148    148
243   243  1.126083355  29     243    243
31    31   0.17263125   55     31     31
85    85   0.16646875   136    85     85
137   137  0.207598883  51     137    137
251   251  0.505910879  26     251    251
65    65   0.729124137  25     65     65
53    53   0.38208409   55     53     53