I am pulling together a checklist for ensuring initial and ongoing log data quality. This is a broad, universal topic, and I'm certain a PhD or two has been written on it. My plan was to first list the things to check for (the checklist) and then work out how to automate some of that checking within Splunk. I started some research and did plenty of thinking, and then it occurred to me that plenty of people in this community have probably already done this and come up with something much better than I will. So I put it to you (the Splunk>answers community):
So you don't feel I am entirely shirking my own cerebral responsibilities, this is what I have:
What to look for
1 - All expected logs are being indexed
2 - All logs are being indexed as expected (e.g. they are complete, there is no truncation and/or concatenation)
3 - _time appropriately matches log generation time-stamp
4 - I have appropriately applied field extractions (e.g. they are complete, there is no truncation and/or concatenation)
How do I ensure initial log data quality?
1 - I perform a manual check on the log source to see what has been generated and then run a query within the Splunk Search Head to confirm I can see all the different types of logs
2 - I manually perform ad hoc queries across the log source and all relevant source types to see if there is any obvious truncation or concatenation. Following this, and provided I am confident in the quality of my regex, I use field extractions which I expect to appear on every log (e.g. Active Directory Event ID) and check to see if they appear on 100% of my events.
3 - I manually check each source and sourcetype to ensure Splunk is interpreting the correct time-stamp (especially if the log is not being ingested directly from the log source, e.g. if it arrives via a syslog repository where the log data may be prepended with the syslog time-stamp)
4 - I individually test every field extraction over "All time" to ensure it is appearing in 100% of my events (if expected to do so), and then I visually check the values by rendering them in a table (after applying a dedup, of course).
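As a rough sketch of the "appears on 100% of my events" check in steps 2 and 4, a search along these lines reports how many events carry a given field. The names your_index, your_sourcetype and critical_field are placeholders I have made up for illustration, and the 24-hour window is arbitrary:

```spl
index=your_index sourcetype=your_sourcetype earliest=-24h
| stats count AS total_events count(critical_field) AS events_with_field
| eval coverage_pct = round(100 * events_with_field / total_events, 2)
| where coverage_pct < 100
```

count(critical_field) only counts events where the field has a value, so any row returned means the extraction is missing from at least some events. If the search matches no events at all it returns nothing, so it does not double as a check that the data is arriving.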
How do I ensure the data quality is maintained over time?
1 - I have scheduled a daily query which looks to identify when a log has not been ingested in over 25 hours
| metadata index=* type=sourcetypes | eval age = now()-lastTime | where age > 90000 | sort age d | convert ctime(lastTime) | fields age,sourcetype,lastTime
2 - No ongoing automated process for this one
3 - No ongoing automated process for this one
4 - No ongoing automated process for this one
In terms of how I identify data quality issues, I feel my approach is too reliant on me visually detecting an anomaly. I appreciate that in some circumstances this is the best I can expect, but in others I'm sure there's a better way.
Please, please, I invite you to critique my approach and to pass on what you do, as I think this will be of significant value to many people besides me.
I think the sort of thing you might look at doing is an Icon Only rangemap, which you can use to display a Red Amber Green status for all of the above tasks.
If you can provide some example data, and exactly what you're comparing against, I'm sure someone in the community, if not me, will be able to provide you with some searches.
Thanks for your response (and apologies for the sizable delay in mine).
I suppose this is probably more of a theoretical question so I don't really have any specific example data. I am interested to see what people in the community are doing to ensure the quality of their log data and the field extractions when they are initially ingested AND over time. I am particularly interested in any automated checks which may be performed.
The types of things I believe will affect the quality of log data may include:
I hope this clarifies my request.
There really is no rule or set of procedures to follow here, as everyone has different needs and there are so many different types of log data out there.
You are on the right track though with
(1) Data stopped ingestion - Your metadata search should definitely cover this check.
(2) Data truncation - I can't think of anything specific here but the tests in item (4) might catch these errors.
(3) Correct timestamp recognition - This would be the hardest, but you could, say, compare _indextime with _time to check for large drifts. This will only work for logs that are coming in in real time. Historic or delayed logs would show a large gap even when their timestamps are parsed correctly.
(4) Field extraction integrity - You would likely have to craft a scheduled search for each data type and check for specific critical fields that are missing.
e.g. sourcetype="fred" NOT criticalField=*
If that search returns results, then your field extractions may be broken.
Or you could check for a change in the number of fields extracted for a sourcetype (I have not tried this and it doesn't look too friendly).
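To make the drift comparison in (3) concrete, here is a minimal sketch. The index name your_index and the thresholds (more than an hour behind, or more than five minutes ahead) are assumptions chosen purely for illustration and would need tuning per data source:

```spl
index=your_index earliest=-7d latest=+7d _index_earliest=-24h
| eval lag_seconds = _indextime - _time
| stats count avg(lag_seconds) AS avg_lag min(lag_seconds) AS min_lag max(lag_seconds) AS max_lag BY sourcetype
| where max_lag > 3600 OR min_lag < -300
| sort - max_lag
```

The search selects events by when they were indexed (_index_earliest) rather than by _time, with a deliberately wide _time window, so that events stamped well into the past or future are still picked up. A negative min_lag means events appear to have been generated after they were indexed, which usually points at a time zone or timestamp parsing problem. As noted above, sourcetypes that are legitimately backfilled or delayed will also show up here and would need to be excluded.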
Hope that helps