Getting Data In

What do you look for, and how, to ensure initial log data quality (including field extractions) and ensure it is ongoing?

pjb2160
Path Finder

Hello,

So I am pulling together a checklist of things to ensure initial and ongoing log data quality. This is obviously a broad and universal topic, and one on which I'm certain a PhD or two has been written. I was hoping to first pull together a list of things to check for (the checklist) and then work out how I might apply some automated checking within Splunk. I started some research and plenty of thinking, and it occurred to me that there are probably plenty of people in this community who have already done this and have most likely come up with something much better than I will. So I put it to you (the Splunk>answers community):

  • What do you look for, and how, to ensure initial log data quality?; and
  • How do you ensure it is ongoing?

So you don't feel I am entirely shirking my own cerebral responsibilities, this is what I have:

What to look for
1 - All expected logs are being indexed
2 - All logs are being indexed as expected (e.g. they are complete, there is no truncation and/or concatenation)
3 - _time appropriately matches log generation time-stamp
4 - I have appropriately applied field extractions (e.g. they are complete, there is no truncation and/or concatenation)

How do I ensure initial log data quality?
1 - I perform a manual check on the log source to see what has been generated and then run a query within the Splunk Search Head to confirm I can see all the different types of logs
2 - I manually perform ad hoc queries across the log source and all relevant source types to see if there is any obvious truncation or concatenation. Following this, and provided I am confident in the quality of my regex, I use field extractions which I expect to appear on every log (e.g. Active Directory Event ID) and check to see if they appear on 100% of my events.
3 - I manually check each source and sourcetype to ensure the correct time-stamp is being interpreted by Splunk (especially if the log is not being directly ingested from the log source, e.g. if via a syslog repository where the log data may be prepended with the syslog time-stamp)
4 - I individually test every field extraction over "All time" to ensure it appears in 100% of my events (if expected to do so) and then I visually test the values by rendering them in a table (after applying a dedup of course).
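
A rough sketch of how the "appears in 100% of events" test in step 4 might be expressed as a search rather than checked by eye. The index, sourcetype and field names (your_index, your_sourcetype, EventCode) are placeholders I have assumed for illustration, and the 24 hour window is arbitrary:

```spl
index=your_index sourcetype=your_sourcetype earliest=-24h
| stats count AS total_events, count(EventCode) AS events_with_field
| eval coverage_pct = round(100 * events_with_field / total_events, 2)
| where coverage_pct < 100
```

count(EventCode) only counts events where the field has a value, so the search returns a row only when some events are missing it. The same search could be saved per sourcetype and alerted on when it returns results.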

How do I ensure the data quality is maintained over time?
1 - I have scheduled a daily query which looks to identify when a log has not been ingested in over 25 hours

| metadata index=* type=sourcetypes | eval age = now()-lastTime | where age > 90000
| sort age d | convert ctime(lastTime) | fields age,sourcetype,lastTime

2 - No ongoing automated process for this one
3 - No ongoing automated process for this one
4 - No ongoing automated process for this one

In terms of how do I identify data quality issues, I feel my approach is too reliant on me visually detecting an anomaly. I appreciate in some circumstances this is the best I can expect but in others I'm sure there's a better way.

Please, please, I invite you to critique my approach and to pass on what you do as I think this is probably going to provide many more people other than myself some significant value.

Many thanks,
P

phoenixdigital
Builder

Hi P,

There really is no single rule or set of procedures to follow here, as everyone has different needs and there are so many different types of log data out there.

You are on the right track, though:

(1) Data stopped ingestion - Your scheduled search should be enough to catch sources that have stopped sending data.

(2) Data truncation - I can't think of anything specific here but the tests in item (4) might catch these errors.

(3) Correct timestamp recognition - This would be the hardest, but you could, say, compare _indextime with _time to check for large drifts. This will only work for logs that are coming in in real time. Historic or delayed logs legitimately have a large gap between the two, so they would show up as false positives.

(4) Field extraction integrity - You would likely have to craft a scheduled search for each data type and check for specific critical fields that are missing,
i.e. sourcetype="fred" NOT criticalField=*

If that search returns results then your field extractions may be broken.

Or you could check for a change in the number of fields extracted for a sourcetype (I have not tried this and it doesn't look too friendly)
http://answers.splunk.com/answers/91632/getting-a-count-of-the-number-of-fields-associated-with-a-so...
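
As a starting point for the _indextime vs _time comparison in (3), something along these lines might work. Treat it as a sketch only: your_index is a placeholder, and the thresholds (an average lag of more than an hour, or events stamped more than five minutes in the future) are assumptions you would tune to your own sources:

```spl
index=your_index earliest=-24h
| eval lag_seconds = _indextime - _time
| stats count AS events, avg(lag_seconds) AS avg_lag, min(lag_seconds) AS min_lag, max(lag_seconds) AS max_lag BY sourcetype
| where abs(avg_lag) > 3600 OR min_lag < -300
| sort - max_lag
```

A negative lag means the event timestamp is later than the time it was indexed, which often points to a time zone or timestamp parsing problem. As noted above, sourcetypes that are loaded in batches or backfilled will trip the positive threshold even when nothing is wrong.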

Hope that helps

markthompson
Builder

PJB,
I think the sort of thing you might look at doing is an Icon Only rangemap, which you can use to display a Red Amber Green status for all of the above tasks.

If you can provide some example data, and exactly what you're comparing against, I'm sure that someone in the community, if not me, will be able to provide you with some searches.

pjb2160
Path Finder

Hello Mark,

Thanks for your response (and apologies for the sizable delay in mine).

I suppose this is probably more of a theoretical question so I don't really have any specific example data. I am interested to see what people in the community are doing to ensure the quality of their log data and the field extractions when they are initially ingested AND over time. I am particularly interested in any automated checks which may be performed.

The types of things I believe will affect the quality of log data may include:

  • changes to log outputs (e.g. revising the format of a time stamp or the introduction of new data may break the regex)
  • poorly developed and/or inadequately tested regex
  • unexpected failure to ingest log data (e.g. infrastructure renames the host; or the password is changed on a database connection)
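
For the first of these in particular, one possible automated check is to watch Splunk's own internal logs for parsing warnings. This is only a sketch: the component names below are the ones I understand splunkd to use for timestamp parsing, line breaking and line merging warnings, and they should be verified against your own version:

```spl
index=_internal sourcetype=splunkd log_level=WARN (component=DateParserVerbose OR component=LineBreakingProcessor OR component=AggregatorMiningProcessor)
| stats count, latest(_time) AS last_seen BY component
| convert ctime(last_seen)
| sort - count
```

The raw warning text normally names the affected source, host and sourcetype, so a jump in the count for one component would be the prompt to drill into the underlying events.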

I hope this clarifies my request.

Many thanks,
P
