The manual entry for the metadata command says "...in environments with large numbers of values per category, the data might not be complete. This is intentional and allows the metadata command to operate within reasonable time and memory usage."
I know that the metadata command is reading the data on sourcetypes, hosts and sources that is stored within each bucket, rather than reading the individual events. I also think that the metadata command ignores the time range picker.
Does anyone know what "large numbers of values per category" actually means? Would the command be accurate for 1000 sourcetypes? Or is it really related to the number of buckets that must be retrieved?
If the metadata command works like search, then it would process buckets in reverse time sequence (newest to oldest) - so even if the data was incomplete, the lastTime and recentTime should be accurate for the objects reported. I guess that older objects, which had not been updated recently, might be missed. Is this true, or did I just make up something that sounds logical?
Finally, would restricting the metadata command to a single index help? In other words, would the results be more likely to be complete if metadata does not have as many indexes to search?
So I've been awarding points for strong answers and good ideas. I've almost given up on truly understanding how metadata works - at some point maybe I will catch a Splunk engineer and ask.
But the bounty is still available if you have the ultimate answer!
Here are my observations of the metadata command in Splunk 6.2.2:
1) The metadata command does respect the time range picker (you can see the firstTime and totalCount values change between "last 15 minutes" and "last 30 days", provided data is available in those ranges. It may operate at the level of buckets rather than the events themselves).
2) The metadata command has categories (the type parameter) of hosts/sources/sourcetypes, and generally a larger number of values in these categories requires the metadata command to parse more buckets to get the information, which is less efficient. I've seen it work fine for my 30,000 hosts (retention period 1 year), so I do believe it should work for 1000 sourcetypes, provided the number of buckets scanned is reasonable.
3) The metadata command does work like search, and it should report lastTime and recentTime correctly, while totalCount and firstTime can be incorrect because it does not scan through the oldest buckets (my guess is that Splunk finalizes the search after a certain memory limit is reached).
4) Restricting the metadata search to one index would definitely help, but again it depends on the content of the index (I had 53 indexes, with one index containing 70% of the data, so it may not help for that particular index).
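For point 4, scoping the command to a single index is just a matter of the index argument - something like this (the index name "main" is only a placeholder, substitute your own):

| metadata type=sourcetypes index=main
| fieldformat firstTime=strftime(firstTime,"%x %X")
| fieldformat lastTime=strftime(lastTime,"%x %X")
| fieldformat recentTime=strftime(recentTime,"%x %X")

Fewer indexes means fewer buckets to visit, which is presumably why it makes the results more likely to be complete.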
Hope this helps.
Thanks for the thoughts.
1) I did not realize that the timerange made a difference for the metadata command, but it does. I just can't figure out which timestamp the timerange picker considers, or how it works. I have things showing up in the list where the firstTime, lastTime and recentTime are all outside the timerange of the search.
2) I also hope it would work, but I am looking for the actual limit...
3&4) I am looking for the specific conditions that must be met for the information to be reliable.
Oh, to see interesting stuff, this is the search I ran (as admin, of course) on a test Splunk instance:
| metadata index=_* type=sources
| addinfo
| fields - info_sid info_search_time
| eval gt_lastTime=if(info_min_time > lastTime,"Yes","-")
| eval gt_recentTime=if(info_min_time > recentTime,"Yes","-")
| eval gt_firstTime=if(info_min_time > firstTime,"Yes","-")
| table info_min_time info_max_time firstTime lastTime recentTime gt* source
| fieldformat firstTime=strftime(firstTime,"%x %X")
| fieldformat lastTime=strftime(lastTime,"%x %X")
| fieldformat recentTime=strftime(recentTime,"%x %X")
| fieldformat info_max_time=strftime(info_max_time,"%x %X")
| fieldformat info_min_time=strftime(info_min_time,"%x %X")
I had to look at the data closely, but there were many sources with timestamps out of the search range...
metadata will touch every bucket that has events within the time range, and then grab the entire bucket's data. In a reasonably small environment you could verify that using dbinspect.
Are you sure, @martinmueller? I think it _should_ touch every bucket in the time range, but it shouldn't need to grab the entire bucket's data - the metadata that the command needs is all in the *.data files.
Yeah, I'm sure. The metadata within one bucket has no time-series data, so it's take all or nothing.
You can verify this by running metadata over the last minute, and then comparing the counts with the hot buckets returned by dbinspect.
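A sketch of that comparison (field names eventCount and state come from dbinspect's output; "main" is a placeholder index) - run both over "last 1 minute" and compare the totals:

| metadata type=hosts index=main
| stats sum(totalCount) AS metadata_count

| dbinspect index=main
| search state=hot
| stats sum(eventCount) AS hot_bucket_count

If metadata really takes whole buckets, the metadata total should match the full event count of the hot buckets, not just the events from the last minute.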
Based on other answers and comments, I wonder if the timerange picker is identifying the buckets, but the metadata command is using the *.data files to compile the results. The *.data files could have data outside the timerange, and so the results might exceed the timerange... Or something like that.
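If that theory is right, the out-of-range timestamps should line up with bucket boundaries. dbinspect exposes each bucket's time span (startEpoch/endEpoch are real dbinspect fields; the index is a placeholder), so you could check whether buckets overlapping the search window extend well beyond it:

| dbinspect index=_internal
| eval startTime=strftime(startEpoch,"%x %X"), endTime=strftime(endEpoch,"%x %X")
| table bucketId state startTime endTime eventCount

Any source whose firstTime falls inside one of these wider bucket spans, but outside the picker's range, would support the bucket-level explanation.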