The manual entry for the metadata command says "...in environments with large numbers of values per category, the data might not be complete. This is intentional and allows the metadata command to operate within reasonable time and memory usage."
I know that the metadata command reads the sourcetype, host, and source information stored within each bucket, rather than reading the individual events. I also think that the metadata command ignores the time range picker.
Does anyone know what "large numbers of values per category" actually means? Would the command be accurate for 1000 sourcetypes? Or is it really related to the number of buckets that must be retrieved?
If the metadata command works like search, then it would process buckets in reverse time sequence (newest to oldest) - so even if the data was incomplete, the lastTime and recentTime should be accurate for the objects reported. I guess that older objects, which had not been updated recently, might be missed. Is this true, or did I just make up something that sounds logical?
Finally, would restricting the metadata command to a single index help? In other words, would the results be more likely to be complete if metadata does not have as many indexes to search?
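For reference, this is the restricted form I'm considering (the index name `main` here is just a placeholder):

```
| metadata type=sourcetypes index=main
```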
Use `| tstats min(_time) as firstTime max(_time) as lastTime max(_indextime) as recentTime count by host` instead.
It will be accurate to the time range and works across large configurations/cardinalities. It might be a bit slower, though, and it doesn't support real-time search.
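The same pattern extends to the other metadata categories; as a sketch (the index name `main` is just an example), splitting by sourcetype instead of host:

```
| tstats min(_time) as firstTime max(_time) as lastTime max(_indextime) as recentTime count where index=main by sourcetype
```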
Based on other answers and comments, I wonder if the time range picker is identifying the buckets, but the metadata command is using the `*.data` files to compile the results. The `*.data` files could contain data outside the time range, and so the results might exceed the time range... Or something like that.
I think `metadata` will touch every bucket that has events within the time range, and then grab the entire bucket's data. In a reasonably small environment you could verify that using `dbinspect`.
Are you sure, @martin_mueller? I think it should touch every bucket in the time range, but it shouldn't need to grab the entire bucket's data - the metadata that the command needs is all in `Hosts.data`, `Sources.data`, and `Sourcetypes.data`.
Yeah, I'm sure. The metadata within one bucket has no time-series data, so it's take all of the bucket's metadata or nothing.
You can verify this by running metadata over the last minute, and then comparing the counts with the hot buckets returned by dbinspect.
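A minimal sketch of that comparison (the index name `main` is just an example), run with the time range picker set to the last minute:

```
| metadata type=hosts index=main
```

versus the hot buckets reported by:

```
| dbinspect index=main
| search state=hot
```

If metadata only read events within the time range, its counts should be much smaller than the totals implied by the hot buckets; if it grabs whole buckets, they should line up.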
A possible alternate way to solve my problem is to use `metasearch` instead, although it may be slower... cf. this answer
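As a sketch of that alternative (the index name `main` is just a placeholder), metasearch scans event metadata within the actual time range, so the results respect the time range picker:

```
| metasearch index=main
| stats min(_time) as firstTime max(_time) as lastTime count by host
```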
So I've been awarding points for strong answers and good ideas. I've almost given up on truly understanding how `metadata` works - at some point maybe I will catch a Splunk engineer and ask. But the bounty is still available if you have the ultimate answer!