Hello,
So I am pulling together a checklist of things to ensure initial and ongoing log data quality. This is obviously a pretty broad and universal topic, and one I'm certain a PhD or two has been written on. My plan was to first pull together a list of things to check for (the checklist) and then work out how I might apply some automated checking within Splunk. I started some research and plenty of thinking, and it occurred to me that there are probably plenty of people in this community who have already done this and have most likely come up with something much better than I will. So I put it to you (the Splunk>answers community):
What do you look for, and how, to ensure initial log data quality? And
How do you ensure that quality is maintained over time?
So you don't feel I am entirely shirking my own cerebral responsibilities, this is what I have:
What to look for
1 - All expected logs are being indexed
2 - All logs are being indexed as expected (e.g. they are complete, there is no truncation and/or concatenation)
3 - _time appropriately matches log generation time-stamp
4 - I have appropriately applied field extractions (e.g. they appear on every event where expected and the extracted values are complete, with no truncation or concatenation)
How do I ensure initial log data quality?
1 - I perform a manual check on the log source to see what has been generated and then run a query on the Splunk Search Head to confirm I can see all the different types of logs (see the sketch for point 1 after this list)
2 - I manually perform ad hoc queries across the log source and all relevant source types to see if there is any obvious truncation or concatenation (sketch for point 2 below). Following this, and provided I am confident in the quality of my regex, I use field extractions which I expect to appear on every log (e.g. Active Directory Event ID) and check to see whether they appear on 100% of my events.
3 - I manually check each source and sourcetype to ensure Splunk is interpreting the correct time-stamp, especially if the log is not being ingested directly from the log source, e.g. via a syslog repository where the log data may be prepended with the syslog time-stamp (sketch for point 3 below)
4 - I individually test every field extraction over "All time" to ensure it appears in 100% of my events (if expected to do so) and then visually check the values by rendering them in a table, after applying a dedup of course (sketch for point 4 below).
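Sketch for point 1 - a tstats search to confirm which hosts and source types are actually reporting, for comparison against your inventory of expected log sources (the index name is just a placeholder for your own):
| tstats count latest(_time) as lastTime where index=your_index by host, sourcetype
| convert ctime(lastTime)
| table host, sourcetype, count, lastTime
Anything on your expected-sources list that doesn't show up here hasn't been indexed at all.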
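Sketch for point 2 - hunting for truncation or badly broken events; the 10000-byte figure assumes the default TRUNCATE value in props.conf, and the index/sourcetype are placeholders:
index=your_index sourcetype=your_sourcetype
| eval raw_length=len(_raw)
| stats count max(raw_length) as max_length avg(linecount) as avg_linecount max(linecount) as max_linecount by sourcetype
| eval possible_truncation=if(max_length>=10000, "check TRUNCATE in props.conf", "looks ok")
Events sitting right at the truncation limit, or an unexpectedly large linecount, are usually the first sign of truncation or concatenation.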
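Sketch for point 3 - comparing the parsed time-stamp (_time) against the time Splunk actually indexed the event (_indextime); a large or negative gap usually points to a time-stamp or timezone problem (the index name is a placeholder and the 24-hour window is arbitrary):
index=your_index earliest=-24h
| eval lag_seconds=_indextime-_time
| stats count avg(lag_seconds) as avg_lag min(lag_seconds) as min_lag max(lag_seconds) as max_lag by sourcetype, host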
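Sketch for point 4 - measuring field coverage rather than eyeballing it; EventCode here is just an example of a field that should be present on every event of the sourcetype:
index=your_index sourcetype=your_sourcetype
| stats count as total_events count(EventCode) as events_with_EventCode
| eval coverage_pct=round(100*events_with_EventCode/total_events, 2)
The fieldsummary command is also handy here, as it returns a count and distinct_count for every field in one pass.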
How do I ensure the data quality is maintained over time?
1 - I have scheduled a daily query which identifies any sourcetype that has not had anything indexed in over 25 hours:
| metadata index=* type=sourcetypes | eval age = now()-lastTime | where age > 90000
| sort age d | convert ctime(lastTime) | fields age,sourcetype,lastTime
2 - No ongoing automated process for this one yet, though there is a possible approach sketched after this list
3 - No ongoing automated process for this one either (sketched after this list)
4 - No ongoing automated process for this one (also sketched after this list)
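Sketch for point 2 - Splunk writes its own parsing complaints to _internal, so a scheduled alert on the line-breaking components catches truncation and event-breaking problems as they start (the component names are the ones I understand splunkd uses for these warnings, so worth verifying against your own _internal data first):
index=_internal source=*splunkd.log* (log_level=WARN OR log_level=ERROR) (component=LineBreakingProcessor OR component=AggregatorMiniQueue)
| stats count by host, component, log_level
Schedule it hourly or daily and trigger the alert when the number of results is greater than zero.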
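Sketch for point 3 - the same _indextime comparison as in the initial checks, scheduled and reduced to a threshold; the one-hour threshold and four-hour window are arbitrary starting points:
index=* earliest=-4h
| eval lag_seconds=abs(_indextime-_time)
| stats max(lag_seconds) as max_lag by index, sourcetype
| where max_lag > 3600
DateParserVerbose warnings in _internal are also worth alerting on, as they flag events where Splunk struggled to parse the time-stamp at all.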
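Sketch for point 4 - a scheduled version of the coverage check, alerting when a field that should always be present drops below 100% coverage (EventCode, the index and the sourcetype are placeholders):
index=your_index sourcetype=your_sourcetype earliest=-24h
| stats count as total_events count(EventCode) as events_with_EventCode by sourcetype
| eval coverage_pct=round(100*events_with_EventCode/total_events, 2)
| where coverage_pct < 100
Run it daily and alert when it returns any rows.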
In terms of how I identify data quality issues, I feel my approach is too reliant on me visually detecting an anomaly. I appreciate that in some circumstances this is the best I can expect, but in others I'm sure there's a better way.
Please, please, I invite you to critique my approach and to pass on what you do, as I think this will provide significant value to many more people than just myself.
Many thanks,
P