Hi all,
I have gathered into Splunk sales information of store branches across the US.
The data is in the format:
<City Name> <Branch ID> <Number of Workers> <Sales>
I would like to create a search that select the top 10 cities, sorted by their number of branches, with all of their branches sales statistics.
If possible, the results should not be grouped by the city name but just all records that match.
For example, this could be the data of the first city (that was selected first because it has 4 branches, more than the rest of the cities) :
<New York> <1> <10> <6000>
<New York> <2> <10> <5000>
<New York> <3> <5> <5000>
<New York> <4> <5> <5000>
How could I do that?
If I use stats to group all the records by city, I can use "dedup 10 City_Name" but then all statistics are gathered if they are equals. For example, I would get:
City_Name Branch_ID Number_Of_Workers Sales
1 10 6000
New York 2 5 5000
3
4
If I don't use stats, I don't know how I can select only 10 cities, since the number of records of those 10 cities is unknown and can only be resolved inside the search.
Thanks,
Ori.
Whenever you want to run stats, but you don't want to actually transform the set, you should look to eventstats
and streamstats
for example, foo | eventstats dc(branchId) as branchCount by cityName
will actually look at every row, calculate the number of branches for each city, and then write back onto the original incoming rows, one new field called branchCount
, without actually transforming the rows.
If we then sort the set by that number, we can use streamstats
to work through row by row and count the number of distinct values of cityName we see. (Streamstats works as it goes, as opposed to eventstats which calculates the statistics for the entire search result set. )
foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank
This will give us a field called rank that is "1" for the city that has the most branches, "2" for the city with the secondmost branches, and so on and so forth.
Then we just throw on a | where rank<5 | sort - _time
, to filter to just the cities that are in the top 4, and we sort by time descending again, to basically restore our original sort order.
foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank | where rank<5 | sort - _time
(Note that streamstats dc(branchCount)
would have been problematic, for two cities in the top 4 might have the same number of branches, and that would lead to us displaying the top 5 or 6 cities by mistake. However assuming your cityName values are unique, dc(cityName) wont suffer that problem. )
http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/Streamstats
http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/EventStats
Whenever you want to run stats, but you don't want to actually transform the set, you should look to eventstats
and streamstats
for example, foo | eventstats dc(branchId) as branchCount by cityName
will actually look at every row, calculate the number of branches for each city, and then write back onto the original incoming rows, one new field called branchCount
, without actually transforming the rows.
If we then sort the set by that number, we can use streamstats
to work through row by row and count the number of distinct values of cityName we see. (Streamstats works as it goes, as opposed to eventstats which calculates the statistics for the entire search result set. )
foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank
This will give us a field called rank that is "1" for the city that has the most branches, "2" for the city with the secondmost branches, and so on and so forth.
Then we just throw on a | where rank<5 | sort - _time
, to filter to just the cities that are in the top 4, and we sort by time descending again, to basically restore our original sort order.
foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank | where rank<5 | sort - _time
(Note that streamstats dc(branchCount)
would have been problematic, for two cities in the top 4 might have the same number of branches, and that would lead to us displaying the top 5 or 6 cities by mistake. However assuming your cityName values are unique, dc(cityName) wont suffer that problem. )
http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/Streamstats
http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/EventStats
Amazing.
Thank you very much for the answer and the excellent explanation.
Good Explanation