Getting Data In

CSV lookup, search unstructured index, find matches and return the statistics

msichani
Explorer

I posted a comment on https://answers.splunk.com/answers/468612/how-to-search-a-lookup-table-and-return-the-matchi.html but didn't get any answer. So I'm opening a new question about this issue.

The provided solution works fine, but it uses a lot of resources as the number of rows in the csv file and the index size grow. In my case, I have a structured data file like this:

Field-ID,Field-SourceType,Field-Substring
1,sourcetype1,Some text goes here
2,sourcetype1,Another other text with WILDCARD * here
3,sourcetype2,This is a different text for different sourcetype
...

I run the query mentioned there (returning the "Field-Substring" field) against some indexed data/events to count the number of occurrences of the substrings:

index="application_error" [| inputlookup my_lookup_table.csv | rename Field-Substring as search | fields search | format] 
 | rename _raw as rawText
 | eval foo=[| inputlookup my_lookup_table.csv | stats values(Field-Substring) as query | eval query=mvjoin(query,",") | fields query | format "" "" "" "" "" ""] 
 | eval foo=split(foo,",") 
 | mvexpand foo 
 | eval foo=lower(foo)     
 | eval rawText=lower(rawText) 
 | where like(rawText,"%"+foo+"%")
 | stats count by foo 

As there is a huge number of unstructured events and quite a large number of substrings in the csv file, it takes ages to return the result. I'm just wondering if there's another method to expedite searching unstructured log files for all the values in my lookup csv file and return the stats/count/etc.
These unstructured indexed data/logs are only categorised by sourcetype, and as you can see in the lookup csv file, each line shows the substring and its corresponding sourcetype that needs to be searched (which is not being used in the above query).

0 Karma
1 Solution

DalJeanis
Legend

You have created an architecture that will make it very difficult to code an efficient search. In addition, including a wildcard in the csv means you need to strip the wildcard out before line 8 of your query adds the concatenated %, or your results will be in error. Also, I'm not sure how you got that to work at all with a hyphen in the middle of the field names; I would expect Splunk to have replaced it with an underscore.
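
For illustration, one minimal way to handle that in your original query is to convert the literal * into like()'s % wildcard rather than stripping it (the same thing the sed expression further down does), just before the concatenation on line 8:

 | eval foo=replace(foo,"[*]","%") 
 | where like(rawText,"%"+foo+"%") 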

Here are some suggestions.

First, limit the initial selection by sourcetype. This should reduce the number of records involved. Something like this should speed the query up a bit...

 index="application_error" 
    [| inputlookup my_lookup_table.csv 
     | rename Field_SourceType as sourcetype, Field_Substring as killname 
     | format 
     | rex field=search mode=sed "s/killname=//g"
     ] 

If your sourcetypes have multiple extensions on them, then you will want to concatenate a "*" on them above, after the rename and before the format.
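
A sketch of that variant (the only change from the subsearch above is the added eval; assuming a simple trailing wildcard covers your extensions):

 index="application_error" 
    [| inputlookup my_lookup_table.csv 
     | rename Field_SourceType as sourcetype, Field_Substring as killname 
     | eval sourcetype=sourcetype."*"
     | format 
     | rex field=search mode=sed "s/killname=//g"
     ] 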


Next, kill all unneeded fields, thereby limiting the number of bytes involved. We're saving the field _time just in case you want a timechart or something, and sourcetype to try something later.

Come back and strip those out if you don't need them; a byte's a byte.

 | rename _time as Time
 | eval Raw = lower(_raw)
 | fields - _*
 | fields Time Raw sourcetype

Run that into your current search and see what happens to your run time.
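
For reference, a rough sketch of how your existing mvexpand approach would sit on top of that trimmed output (a sketch only: Raw is already lowercased above, so the lower(rawText) step goes away, and the replace() is one way to deal with the wildcard issue mentioned earlier):

 | eval foo=[| inputlookup my_lookup_table.csv 
             | stats values(Field_Substring) as query 
             | eval query=mvjoin(query,",") 
             | fields query 
             | format "" "" "" "" "" ""] 
 | eval foo=split(foo,",") 
 | mvexpand foo 
 | eval foo=lower(foo) 
 | eval foo=replace(foo,"[*]","%") 
 | where like(Raw,"%"+foo+"%") 
 | stats count by foo 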


One other potential architecture for this next section is to do a join on sourcetype. This could be much more efficient as long as the lookup table has less than a few thousand lines and as long as the sourcetype does NOT have multiple potential extensions on it.

Adapt the method as needed. For example, you could create a synthetic match field of a certain length and then do a full test of sourcetype for each of the remaining potential matches (a sketch of that variant follows the join example below).

 | join max=0 sourcetype 
       [| inputlookup my_lookup_table.csv 
        | rename Field_SourceType as sourcetype, Field_Substring as teststring
        | table sourcetype teststring
        | eval teststring=lower(teststring)
       ]

 | rename COMMENT as "change asterisks to % and add % at front and end of teststring, then test and stats it up"
 | rex mode=sed field=teststring "s/[*]/%/g s/^(.)/%\1/g s/(.)$/\1%/g"
 | where like(Raw,teststring) 
 | stats count by sourcetype, teststring
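
If the sourcetype does have extensions, one way to realize the synthetic-match-field idea above is to join on a fixed-length prefix and then re-test the full sourcetype afterwards. This is only a rough sketch: joinkey, lookupsourcetype and the prefix length of 10 are arbitrary illustrations, not tested code.

 | eval joinkey=substr(sourcetype,1,10)
 | join max=0 joinkey 
       [| inputlookup my_lookup_table.csv 
        | rename Field_SourceType as lookupsourcetype, Field_Substring as teststring
        | eval joinkey=substr(lookupsourcetype,1,10)
        | table joinkey lookupsourcetype teststring
        | eval teststring=lower(teststring)
       ]
 | rename COMMENT as "keep only events whose full sourcetype actually starts with the lookup value"
 | where like(sourcetype,lookupsourcetype."%")
 | rex mode=sed field=teststring "s/[*]/%/g s/^(.)/%\1/g s/(.)$/\1%/g"
 | where like(Raw,teststring) 
 | stats count by sourcetype, teststring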

By the way, sorry we missed your earlier question.


0 Karma


msichani
Explorer

@DalJeanis Thanks for your reply, and sorry for the long-belated update. Your proposal worked pretty well; I tweaked it to match my criteria and the performance really improved.
About the hyphens in the field names: they're underscores, I just put hyphens here by mistake while preparing the question.

0 Karma

richgalloway
SplunkTrust

Please post the query you are using.

---
If this reply helps you, Karma would be appreciated.
0 Karma

msichani
Explorer

Ah, I had a hyperlink in my question and didn't notice the link was missing. I've updated my question anyway.

0 Karma