Splunk Search

Accuracy of metadata command in large environments?

lguinn2
Legend

The manual entry for the metadata command says "...in environments with large numbers of values per category, the data might not be complete. This is intentional and allows the metadata command to operate within reasonable time and memory usage."

I know that the metadata command is reading the data on sourcetypes, hosts and sources that is stored within each bucket, rather than reading the individual events. I also think that the metadata command ignores the time range picker.

Does anyone know what "large numbers of values per category" actually means? Would the command be accurate for 1000 sourcetypes? Or is it really related to the number of buckets that must be retrieved?

If the metadata command works like search, then it would process buckets in reverse time sequence (newest to oldest) - so even if the data was incomplete, the lastTime and recentTime should be accurate for the objects reported. I guess that older objects, which had not been updated recently, might be missed. Is this true, or did I just make up something that sounds logical?

Finally, would restricting the metadata command to a single index help? In other words, would the results be more likely to be complete if metadata does not have as many indexes to search?

1 Solution

martin_mueller
SplunkTrust

Use | tstats min(_time) as firstTime max(_time) as lastTime max(_indextime) as recentTime count by host instead.

Will be accurate-to-the-timerange and work across large configurations/cardinalities. Might be a bit slower though, and doesn't do real-time.


pj
Contributor

You could probably also use the license log - it is likely to be faster than metasearch. Something like:

index=_internal sourcetype=splunkd source="/opt/splunk/var/log/splunk/license_usage.log*" component=LicenseUsage
| stats count latest(_time) as latest_time by st, h, idx

The caveat here is that the h value is sometimes squashed and not present (it is usually present in 98%+ of events, though). For st and idx, it is my understanding that it is very accurate. Additionally, you can model the license data and it would be lightning fast. Of course, this log doesn't give you the latest/earliest timestamps of the actual events, but it does give you the earliest/latest times at which the data hit the Splunk license server. Either way, if you are using it as a way to determine whether you saw events in the last x hours, it is useful.

It is a shame that there does not seem to be a definitive, accurate method in Splunk to do this (in a fast manner) out of the box for host, sourcetype and index. It comes up all the time when folks are trying to use the data to write metrics and alerts related to whether data sources or hosts have dropped, etc.

pj
Contributor

I should add that the | tstats count method in one of the other answers does seem like a fairly nice approach and is reasonably quick. In a large environment, you could potentially maintain a KV store-based state table that is frequently populated by a | tstats search; then you would be able to read the KV store data nearly instantly.
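To sketch that state-table idea (the lookup name meta_state_table is an illustrative assumption, not from any particular app), a scheduled search could refresh the table:

| tstats count min(_time) as firstTime max(_time) as lastTime max(_indextime) as recentTime where index=* by index host sourcetype
| outputlookup meta_state_table

Dashboards and alerts could then read the table nearly instantly with | inputlookup meta_state_table, trading a small refresh delay for near-zero query cost.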

lguinn2
Legend

Overall, I like this idea. It is a novel approach to the problem. And the comment is good too.

This is definitely something to think about.


Runals
Motivator

Depending on the size of the environment, I might do something like the following, probably in fast mode (though fast mode is unlikely to make much difference). To your point, though, depending on the volume of data in the indexes and the timeframe of your search, this might be ugly (backgrounding your search FTW).

| metasearch index=* | fields index host sourcetype | stats count min(_time) as earliest max(_time) as latest by index host sourcetype

To address this at scale in my environment, I have the following search which runs once an hour. As background, I have a lookup which contains the index name, department name, whether there is data worthy of being counted (i.e. not a summary index), and a few other pieces of data. Of course, there is another query that runs once a week to alert on new indexes that aren't on this list =). Foo, in this case, is really the tag for the centralized IT shop, and there is a second query that populates a separate summary index where Foo=f. In the rex command I'm stripping out the trailing elements of the sourcetype if it ends in a dash plus a number or "-too_small", with the idea that at some point in the future I will magically have enough time to clean them up. The query output also ties into some of the uses I had/have for the data taxonomy stuff I baked into the Data Curator app.

| metasearch [|inputlookup index_list | search Foo=t reportable=t | fields index] | eval host=lower(host) | fields host index sourcetype | rex field=sourcetype "(?<sourcetype>.+)(?:-\d+|-too_small)" | stats count by host index sourcetype

The search itself is fairly performant (7-8am: 193M events > 28k results > 68s runtime in the UI, so faster when it runs under the covers). At any rate, to answer the question you pose: in my environment I'd go to the summary indexes populated by this query and its mate. This is somewhat more of an administrative solution than simply a query. If folks want to spin up something similar, they should consider creating their query(ies) and then using the backfill summary script that ships with Splunk to seed the summary index with their historical data.
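As a rough sketch of the summary-index plumbing described above (the index name meta_summary is illustrative, not the actual one in use), the hourly search would simply end with a collect:

| metasearch [|inputlookup index_list | search Foo=t reportable=t | fields index]
| eval host=lower(host) | fields host index sourcetype
| stats count by host index sourcetype
| collect index=meta_summary

Reports and alerts then run against index=meta_summary rather than the raw data, which is what makes the lookups near-instant.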

lguinn2
Legend

A search mode of "fast" is not likely to help at all, since Splunk is only returning metadata (host, source, sourcetype, index, splunk_server, etc.), so the only field extraction at all would be for the conditions of the metasearch...


lguinn2
Legend

This is my favorite answer so far - so 25 Karma points to you! I am not going to "accept" this answer yet, because an "unanswered question" generates more traffic and therefore more discussion.


Runals
Motivator

I always thought the metadata command had the same or similar limitations as the internal metrics log, in that it can only keep reasonable track of something like 2k unique host/source/sourcetype combinations. The metrics log certainly can't keep up with an environment with 1k sourcetypes, so I've somewhat rolled my own /shrug. In playing around with metadata early on, I found the time range picker seemed to be ignored, so I dumped its use, as metasearch is both more accurate and more useful. I also have 130+ indexes, so while knowing how many events we have for one sourcetype, host, or source has value, I generally need those counts by index (i.e. having 10 systems named DC3, or other hostname collisions for servers named Gimli, Thor, etc.).

At any rate, while an answer to your original question could prove valuable, if you could shed light on what it is you are trying to accomplish, other options might come to light.

jrodman
Splunk Employee

The limitation was that pulling the set of all sources, for example, from all search peers, and then merging all that data on the search head led to Bad Situations. The total amount of information being processed here is certainly a data quantity that Splunk could handle, but | metadata was built to be a quick preview tool in the UI and (used to) operate largely in-memory, so these large datasets were being handled in-memory inside main splunkd, which would cause it to explode and fall over.

At some point between then and now, I believe metadata was banished to a proper search process, which causes some of the concerns to go away.

However, there was also a sort of unreasonable overhead in maintaining complete index-level information on all sources/sourcetypes/etc., so at some point (~5.0 or so?) we stopped maintaining complete index-level records of all sources, for example, which meant we couldn't efficiently answer the query "what are all the sources?". At that time, the intent was to build an intentionally incomplete index-level dataset with low maintenance overhead and answer queries from that.

I'm not sure of the current status.


lguinn2
Legend

I am just looking for an accurate overall summary of the environment that shows a simple report by index, host and sourcetype, with a count and the first/last update timestamp for each.

While index=* | stats count earliest(_time) latest(_time) by index host sourcetype works, it is terribly inefficient except over short time ranges.

martin_mueller
SplunkTrust

That search screams for tstats. Based on rough back-of-my-Splunk measurements it should be in the 100x faster region:

| tstats count earliest(_time) latest(_time) where index=* by index host sourcetype

lguinn2
Legend

@martin_mueller - I would give you karma for that, but I can't give karma for comments - I can only "like".
Thanks! Nice tip.

martin_mueller
SplunkTrust

There's an almost identical answer too 🙂


martin_mueller
SplunkTrust

Use | tstats min(_time) as firstTime max(_time) as lastTime max(_indextime) as recentTime count by host instead.

Will be accurate-to-the-timerange and work across large configurations/cardinalities. Might be a bit slower though, and doesn't do real-time.

lguinn2
Legend

So the answer could be

| tstats min(_time) as firstTime max(_time) as lastTime max(_indextime) as recentTime count by index host sourcetype 
| fieldformat firstTime=strftime(firstTime,"%x %X")
| fieldformat lastTime=strftime(lastTime,"%x %X")
| fieldformat recentTime=strftime(recentTime,"%x %X")

I think I have a winner. Best accuracy and reasonable speed over a chosen time range.


pj
Contributor

I would probably recommend writing it like this:

| tstats count min(_time) as firstTime, max(_time) as lastTime, max(_indextime) as recentTime where index=* by host, sourcetype, index

The approach you use above doesn't search over all indexes - on my system, at least, it just provides results from the main index.

martin_mueller
SplunkTrust

The key there would be the indexes searched by default for your roles. lguinn's approach takes those, yours takes the indexes searchable for your roles. Both valid approaches, just different.


pj
Contributor

On the basis of this question/answer and something I am working on; I just developed an app called Meta Woot! and put it in the app store. It basically leverages the tstats command above and maintains a near real-time KV store based state table of host, sourcetype and index metadata. Not only that, I have it summarizing down, such that it also offers event count trending over time. Enjoy! https://splunkbase.splunk.com/app/2949/


somesoni2
Revered Legend

Here are my observations of the metadata command in Splunk 6.2.2:

1) The metadata command does respect the time range picker (you can see the firstTime and totalCount values change between "last 15 minutes" and "last 30 days", provided the data is available; it may operate at the granularity of buckets rather than individual events).
2) The metadata command has categories (the type parameter) of hosts/sources/sourcetypes, and a larger number of values in these categories generally requires the metadata command to parse more buckets to get the information, hence it is less efficient. I've seen it work fine for my 30,000 hosts (retention period one year), so I believe it should work for 1000 sourcetypes, provided the number of buckets scanned is reasonable.
3) The metadata command does work like search, and it should report lastTime and recentTime correctly, while totalCount and firstTime can be incorrect, as it does not scan through the oldest buckets (my guess is that Splunk finalizes the search after a certain memory limit is reached).
4) Restricting the metadata search to one index would definitely help, but again it depends on the content of the index (I had 53 indexes, with one index containing 70% of the data, so it may not help for that particular index).
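For reference, the single-index restriction in point 4 just adds an index argument to the command (the index name here is illustrative):

| metadata type=sourcetypes index=my_index

This limits the set of buckets the command has to read to those of a single index, which is where the efficiency gain comes from.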

Hope this helps.

lguinn2
Legend

Thanks for the thoughts.

1) I did not realize that the time range made a difference for the metadata command, but it does. I just can't figure out which timestamp the time range picker considers, or how it works. I have things showing up in the list where the firstTime, lastTime and recentTime are all outside the time range of the search.

2) I also hope it would work, but I am looking for the actual limit...

3&4) I am looking for the specific conditions that must be met for the information to be reliable.

Thanks!


lguinn2
Legend

Oh, to see interesting stuff, this is the search I ran (as admin of course) on a test splunk instance:

| metadata index=_* type=sources | addinfo | fields - info_sid info_search_time
| eval gt_lastTime=if(info_min_time > lastTime,"Yes","-")
| eval gt_recentTime=if(info_min_time > recentTime,"Yes","-")
| eval gt_firstTime=if(info_min_time > firstTime,"Yes","-")
| table info_min_time info_max_time firstTime lastTime recentTime gt* source
| fieldformat firstTime=strftime(firstTime,"%x %X")
| fieldformat lastTime=strftime(lastTime,"%x %X")
| fieldformat recentTime=strftime(recentTime,"%x %X")
| fieldformat info_max_time=strftime(info_max_time,"%x %X")
| fieldformat info_min_time=strftime(info_min_time,"%x %X")

I had to look at the data closely, but there were many sources with timestamps out of the search range...
