My company's Splunk data set is getting large. (Although I know some people would consider the numbers I'm talking about pretty trivial :))
For example, here is our data summary:
Here is a quick search of Host/Source pairs:
Our infrastructure is coping fine, more or less. My problem, though, is that it is hard to give a sensible answer to "What data do we have?". If someone asks about a specific datum, it's trivial for me to provide metadata. For example, if you want to know about web traffic logs I can tell you how much data we have, for which hosts, and even show them on a map. It's easy to know whether something is in Splunk if you know what you are looking for. But how can you say, in general terms, what is in Splunk? If a user comes to you and asks what you've got, how do you answer that in a meaningful way without handing them a list of 500,000 rows? I suppose this is a general library-management question, but I'm hoping some of you out there have solved the problem, or at least tried some things that didn't work.
If you have set meaningful sourcetypes (not the Splunk defaults), then the search below should be a start:
|tstats count where index=* by index host source sourcetype
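If the full host/source breakdown is too noisy, the same idea rolled up to index/sourcetype with time bounds may be easier to skim (just a sketch of a variant):

| tstats count min(_time) as earliest max(_time) as latest where index=* by index sourcetype
| convert ctime(earliest) ctime(latest)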
Thanks for that, but as I said, the specific metadata is not something we have an issue with. What we're trying to create are more generalized catalog-browsing tools. For example, the command you suggest generates almost half a million rows, and there's only so much information you can pack into a sourcetype name.
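One direction we might try (just a sketch, and data_dictionary here is a hypothetical, hand-maintained lookup mapping sourcetype to a plain-English description and an owner) is joining curated documentation onto the metadata roll-up:

| tstats count where index=* by index sourcetype
| lookup data_dictionary sourcetype OUTPUT description owner ``` data_dictionary is a hypothetical hand-maintained lookup ```
| fillnull value="(undocumented)" description owner
| stats sum(count) as events values(index) as indexes by sourcetype description owner

That at least collapses the half-million rows into one row per documented sourcetype, and the "(undocumented)" bucket shows where the dictionary still needs work.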
Cheers,
JP
As mentioned above, if we have meaningful sourcetypes set, then from source and sourcetype we should be able to tell at least "from where" and "what" data is being pushed to Splunk. This is what we normally do. I am not sure you can get a more generalized catalog out of Splunk, especially when you have millions and millions of events. Looking forward to getting a better idea 🙂
~3 trillion events and counting 🙂
Our sourcetype naming is fairly good, but it's quite domain-specific to the individual services being monitored, many of which are in-house and therefore bewildering to new staff. This is a big-company problem rather than a specific Splunk issue - I'm just wondering how other people have solved (or failed to solve) it.
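One lighter-weight starting point (still not a catalog, but easier to skim) is the built-in metadata command, which returns per-sourcetype first/last event times and counts without enumerating every host/source pair:

| metadata type=sourcetypes index=*
| convert ctime(firstTime) ctime(lastTime) ctime(recentTime)
| sort - totalCount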