I posted a comment on https://answers.splunk.com/answers/468612/how-to-search-a-lookup-table-and-return-the-matchi.html but didn't get an answer, so I'm opening a new question about this issue.
The solution provided there works fine, but it uses a lot of resources as the number of rows in the CSV file and the size of the index grow. In my case, I have a structured data file like this:
Field-ID,Field-SourceType,Field-Substring
1,sourcetype1,Some text goes here
2,sourcetype1,Another other text with WILDCARD * here
3,sourcetype2,This is a different text for different sourcetype
...
I run the query mentioned there (returning the "Field-Substring" field) against some indexed data/events to count the number of occurrences of the substrings:
index="application_error" [| inputlookup my_lookup_table.csv | rename Field-Substring as search | fields search | format]
| rename _raw as rawText
| eval foo=[| inputlookup my_lookup_table.csv | stats values(Field-Substring) as query | eval query=mvjoin(query,",") | fields query | format "" "" "" "" "" ""]
| eval foo=split(foo,",")
| mvexpand foo
| eval foo=lower(foo)
| eval rawText=lower(rawText)
| where like(rawText,"%"+foo+"%")
| stats count by foo
As there is a huge number of unstructured events and quite a large number of substrings in the CSV file, it takes ages to return the result. I'm wondering whether there's another method to expedite searching the unstructured log files for all the values in my lookup CSV file and returning the stats/counts.
These unstructured indexed data/logs are only categorised by sourcetype, and as you can see in the lookup CSV file, each line shows the substring and its corresponding sourcetype which needs to be searched (which is not being used in the above query).
You have created an architecture that will make it very difficult to code an efficient search. In addition, including a wildcard in the CSV means you need to strip the wildcard out before line 8 of your search wraps foo in the concatenated %, or your results will be in error. Also, I'm not sure how you got that to work at all with a hyphen in the middle of the field names; I would expect Splunk to have replaced that with an underscore.
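For what it's worth, here is one way to deal with that wildcard. This is purely a sketch against the foo field in your own query (not tested): convert the asterisk into a % so that like() still treats it as a wildcard rather than a literal character.
| eval foo=replace(foo,"[*]","%")
| where like(rawText,"%"+foo+"%")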
Here are some suggestions.
First, limit the initial selection by sourcetype. This should reduce the number of records involved. Something like this should speed the query up a bit...
index="application_error"
[| inputlookup my_lookup_table.csv
| rename Field_SourceType as sourcetype, Field_Substring as killname
| format
| rex field=search mode=sed "s/killname=//g"
]
If your sourcetypes have multiple extensions on them, then you will want to concatenate a "*" onto them above, after the rename and before the format.
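Purely as a sketch of where that extra line would go (field names assumed from your CSV, not tested):
[| inputlookup my_lookup_table.csv
| rename Field_SourceType as sourcetype, Field_Substring as killname
| eval sourcetype=sourcetype."*"
| format
| rex field=search mode=sed "s/killname=//g"
]
The wildcard works inside the quoted value that format produces, so the outer search will still match sourcetype1.log, sourcetype1.old and so on.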
Next, kill all unneeded fields, thereby limiting the number of bytes involved. We're saving the field _time just in case you want a timechart or something, and sourcetype to try something later. Come back and strip those out if you don't need them; a byte's a byte.
| rename _time as Time
| eval Raw = lower(_raw)
| fields - _*
| fields Time Raw sourcetype
Run that into your current search and see what happens to your run time.
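Just as a sketch of how it fits together (using the underscore field names and your own mvexpand/where/stats logic against the pruned Raw field, not tested):
index="application_error"
[| inputlookup my_lookup_table.csv
| rename Field_SourceType as sourcetype, Field_Substring as killname
| format
| rex field=search mode=sed "s/killname=//g"
]
| rename _time as Time
| eval Raw = lower(_raw)
| fields - _*
| fields Time Raw sourcetype
| eval foo=[| inputlookup my_lookup_table.csv | stats values(Field_Substring) as query | eval query=mvjoin(query,",") | fields query | format "" "" "" "" "" ""]
| eval foo=split(foo,",")
| mvexpand foo
| eval foo=lower(foo)
| where like(Raw,"%"+foo+"%")
| stats count by foo
The wildcard caveat above still applies to foo here.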
One other potential architecture for this next section is to do a join on sourcetype. This could be much more efficient as long as the lookup table has fewer than a few thousand lines and as long as the sourcetype does NOT have multiple potential extensions on it.
Adapt the method as needed. For example, you could create a synthetic match field of a certain length and then do a full test of sourcetype for each of the remaining potential matches (a rough sketch of that variation follows the join example below).
| join max=0 sourcetype
[| inputlookup my_lookup_table.csv
| rename Field_SourceType as sourcetype, Field_Substring as teststring
| table sourcetype teststring
| eval teststring=lower(teststring)
]
| rename COMMENT as "change asterisks to % and add % at front and end of teststring, then test and stats it up"
| rex mode=sed field=teststring "s/[*]/%/g s/^(.)/%\1/g s/(.)$/\1%/g"
| where like(Raw,teststring)
| stats count by sourcetype, teststring
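And here is a very rough sketch of the synthetic-key variation mentioned above, in case your sourcetypes do carry extra extensions. The key length of 10 and the field names are assumptions, not tested, so adjust to taste. Join on a fixed-length prefix of the sourcetype, then confirm the full match afterwards:
| eval joinkey=substr(sourcetype,1,10)
| join max=0 joinkey
[| inputlookup my_lookup_table.csv
| rename Field_SourceType as lookupsourcetype, Field_Substring as teststring
| eval joinkey=substr(lookupsourcetype,1,10)
| table joinkey lookupsourcetype teststring
| eval teststring=lower(teststring)
]
| rename COMMENT as "keep only rows where the full event sourcetype really matches the lookup value, which may end in a wildcard"
| rex mode=sed field=lookupsourcetype "s/[*]/%/g"
| where like(sourcetype,lookupsourcetype)
| rex mode=sed field=teststring "s/[*]/%/g s/^(.)/%\1/g s/(.)$/\1%/g"
| where like(Raw,teststring)
| stats count by sourcetype, teststring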
By the way, sorry we missed your earlier question.
@DalJeanis Thanks for your reply, and sorry for the long-overdue update. Your proposal worked pretty well; I tweaked it to match my criteria and the performance improved considerably.
About the hyphens in the field names: they're underscores, I just typed hyphens by mistake while preparing the question.
Please post the query you are using.
Ah, I had a hyperlink in my question and didn't notice it was missing. I've updated my question.