My system executes unique tasks that each produce two logs: a summary log and a raw log. The sources are named something like job-JOBID123-summary.log and job-JOBID123-raw.log.
In summary.log, I have a high-level "success"/"failure" that indicates whether the task failed.
In raw.log, the pass/fail information is not available, but specific ERRORs are logged.
The problem is that some of the successes also have ERRORs logged. For example, I see error="socket" in both passing and failing jobs, but I only need to investigate further when the summary is "failure" and error="socket".
How would I go about counting the number of specific errors (from one source) only if my summary (another source) reports failure?
Thanks in advance,
@kelvinmarr it would be easier for the community to assist you if you could provide some sample data from the two sources, with any sensitive information mocked/anonymized, along with your existing Splunk search and the required fields. While posting Splunk searches/data, use the code button (101010) or the shortcut Ctrl+K so that special characters are not escaped.
Following is a sample query based on what I can gather from the question; it may not work if the details provided are incorrect or missing.
<yourMainSearch> source IN ("*job-*-summary.log","*job-*-raw.log")
| rex field=source "job-(?<jobID>[^-]+)-"
| stats count(eval(like(source,"%summary.log") AND searchmatch("failure"))) as summaryFailCount count(eval(like(source,"%raw.log") AND searchmatch("error"))) as rawFailCount by jobID
| search summaryFailCount>0 AND rawFailCount>0
First things first, you want to get the job ID into both logs. Assuming your source is the filename/path, maybe something like this:
| rex field=source "(?<job_id>[^\-]+)\-[^\-]+\.log"
I don't want to keep typing that out, so I'll assume you do that in a field extraction and that the job_id field is generally available. You can of course just put that rex command in the searches/subsearches as needed.
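If you do want a permanent field extraction, a minimal props.conf sketch might look like the following. The sourcetype names jobs:raw and jobs:summary are assumptions to match the searches below; adjust them to your own sourcetypes.

```ini
# props.conf -- hypothetical sourcetype names, adjust to yours.
# Pulls job_id out of the source path, e.g. job-JOBID123-raw.log

[jobs:raw]
EXTRACT-job_id = job-(?<job_id>[^-]+)- in source

[jobs:summary]
EXTRACT-job_id = job-(?<job_id>[^-]+)- in source
```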
Depending on how many logs you have, one option may be better than the others. And since I don't know any details about your logs, this is really all pseudo-SPL for you, just giving you ideas on what to try.
I would probably use the last option, maybe the first.
You could use a subsearch to only include raw logs from jobs with a failed summary. The subsearch returns a list of failed job_ids, which is then used as a filter in the main search. From there, just count the errors by job_id, since the remaining events all belong to failed jobs:
index=jobs sourcetype=jobs:raw [ search index=jobs sourcetype=jobs:summary | where result="failed" | table job_id ] | stats dc(error) as distinct_errors by job_id
Join is not the most efficient command and there are some default limitations, but with a smaller set of events I think it's fine. In this case we use the job_id to join events from the raw data to the result in the summary data:
index=jobs sourcetype=jobs:raw | join type=left job_id [ search index=jobs sourcetype=jobs:summary | table job_id, result ] | where result="failed" | <do whatever with the failed job_ids>
Most things you can do with join, you can also do with append, which is more efficient but still has some default limits. In this case we'll get all the raw data and all the summary data, then manipulate them with the common job_id field as needed:
index=jobs sourcetype=jobs:raw | append [search index=jobs sourcetype=jobs:summary] | stats values(result) as result, values(error) as error by job_id | where result="failed"
Append isn't strictly needed here because, in my version of your data, it's all in one index, so we can just search for it all initially. Append is useful when the data sets you're appending are too difficult to search together, for example when you need a lot of SPL to actually get the things you want to append. For just a bunch of raw data, just search for it all:
index=jobs (sourcetype=jobs:raw OR sourcetype=jobs:summary) | stats values(result) as result, values(error) as error by job_id | where result="failed"
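Outside Splunk, the collapse that stats values(...) by job_id followed by where result="failed" performs is easy to sanity-check. Here's a Python sketch over mocked events; the field names job_id/result/error and the sample values are assumptions based on the discussion above, not your real data:

```python
from collections import defaultdict

# Mocked events from both sourcetypes, keyed by the shared job_id field.
# Summary events carry "result"; raw events carry "error".
events = [
    {"job_id": "A1", "result": "failed"},   # summary
    {"job_id": "A1", "error": "socket"},    # raw
    {"job_id": "A1", "error": "timeout"},   # raw
    {"job_id": "B2", "result": "success"},  # summary
    {"job_id": "B2", "error": "socket"},    # raw (dropped later: job passed)
]

# Equivalent of: | stats values(result) as result values(error) as error by job_id
by_job = defaultdict(lambda: {"result": set(), "error": set()})
for e in events:
    for field in ("result", "error"):
        if field in e:
            by_job[e["job_id"]][field].add(e[field])

# Equivalent of: | where result="failed"
failed = {jid: sorted(f["error"]) for jid, f in by_job.items()
          if "failed" in f["result"]}

print(failed)  # only job A1 survives, with its distinct errors
```

Because values() collapses each job's events into one row, the error="socket" on the passing job B2 is filtered out along with its summary row, which is exactly the behavior asked for in the question.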