Splunk Search

reduce events by matching with a csv, but more than 10k matches

loganramirez
Path Finder

OK, I've been learning a lot about reducing event counts from a recent conversation (here) and got linked a great article on search performance (this one). An obvious key is reducing the number of events that come back (the first line of the search is the most important).

For a lot of the reports I'll need to write, the way to do this would be to match on DIRECTORY INFORMATION, but that DOES NOT EXIST IN THE UNDERLYING DATA, and this gets complicated by what I wrote in that other post about the (2) streams of data.

Here is what I mean (specifics).

1. DS 1 (call data, JSON)
2. DS 2 (policy data, JSON)
3. directory.csv (inputlookup file with data, or I could query a DB using dbxquery)


So if I want to match 'mylist' in that csv then I have to do it AFTER the first line, like this:

 

index="my_data" resourceId="enum*" ("disposition.disposition"="TERMINATED" OR "connections{}.left.facets{}.number"=*)
| stats values(sourcenumber) as sourcenumber values(disposition) as disposition by guid
| lookup directory_listings.csv number AS sourcenumber OUTPUT lists 
| search lists="mylist"

 

This brings back both data sources (the first line), but then I have to read through 100% of the events, match them against the directory, and only then filter, so there is a huge 'false positive' rate (a poor event-to-scan count ratio).

I've read about using a subsearch, and this works great, but it then leaves out one of the data sources.

In other words this:

index="policyguru_data" resourceId="enum*" ("disposition.disposition"="TERMINATED" OR sourcenumber=*)
[  | inputlookup pg_directory_listings.csv
| search lists="*mylist*"
| fields number
| rename number as sourcenumber
| format
]
| table *

This runs fast and is 1:1 event-to-scan, BUT it OMITS the disposition events entirely, because it's not 'joining' data; it just sends the sourcenumber values up to the first line, which then EXCLUDES the disposition events because they don't match.

Does that make sense?

I suppose I could use this entire search AS a subsearch to get back 'guid' values and then pass those UP into another search, but that feels very...INCEPTION at that point!
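A rough sketch of that two-level idea, reusing the index, lookup, and field names from the searches above. It assumes the call events and the disposition events share the same guid, and the inner subsearch is still subject to the usual subsearch result and time limits, so treat it as something to test rather than a known-good answer.

```spl
index="policyguru_data" resourceId="enum*"
    [ search index="policyguru_data" resourceId="enum*"
        [ | inputlookup pg_directory_listings.csv
          | search lists="*mylist*"
          | fields number
          | rename number AS sourcenumber
          | format ]
      | stats count BY guid
      | fields guid ]
| stats values(sourcenumber) AS sourcenumber values(disposition) AS disposition BY guid
```

The innermost subsearch turns the directory into sourcenumber terms, the middle search reduces the matching events to a list of guid values, and the outer search pulls every event (from either data source) that carries one of those guids.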

Anyway, looking for ideas.


Thank you!


dtburrows3
Builder

Does this query return the events with disposition in addition to events with the specific sourcenumber?

index="policyguru_data" resourceId="enum*" ("disposition.disposition"="TERMINATED" OR [  | inputlookup pg_directory_listings.csv | search lists="*mylist*" | fields number | rename number as sourcenumber | format])

 
I think the last search you shared was actually skipping over the disposition events because of how the subsearch was formatted into the parent search. An expanded version of your last search I believe would look like this.

index="policyguru_data" resourceId="enum*" 
    AND 
    ("disposition.disposition"="TERMINATED" OR sourcenumber=*) 
    AND
    ( ( sourcenumber="<val_1>" ) OR ( sourcenumber="<val_2>" ) OR ... OR ( sourcenumber="<val_n>" ) )

where val_1, val_2, ..., val_n are the sourcenumbers from the lookup that you are trying to filter on.
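By the same expansion, the first query above (with the subsearch inside the parentheses) would come out roughly like this. The two numbers are made-up placeholders standing in for whatever the lookup actually returns.

```spl
index="policyguru_data" resourceId="enum*"
    AND
    ( "disposition.disposition"="TERMINATED"
      OR
      ( ( sourcenumber="5550100" ) OR ( sourcenumber="5550101" ) ) )
```

Here an event passes if it is a TERMINATED disposition event or if its sourcenumber is on the list, so neither data source is filtered out by the other's condition.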



PickleRick
SplunkTrust

Just remember the caveats about using subsearches (the limits on execution time and on the number of returned results).

loganramirez
Path Finder

I've run into that 10k limit before, for sure, but this is also something that I thought | format helped with?

"mylist" directory, for example, might have 50k entries, but it's returned as a single line (1 row).

thank you!

 


PickleRick
SplunkTrust

No. The format command is only responsible for formatting the data on output (and if you don't include it explicitly, it's performed implicitly with default settings). The limit is the limit.
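One way to see this for yourself is to run the subsearch body on its own, using the lookup and field names from earlier in the thread. The output should be a single row whose search field holds one OR term per input row, so the row count going in still matters even though only one row comes out.

```spl
| inputlookup pg_directory_listings.csv
| search lists="*mylist*"
| fields number
| rename number AS sourcenumber
| format
```

Swapping the final `| format` for `| stats count` shows how many rows are feeding into it, which is the number to compare against the subsearch limit.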

loganramirez
Path Finder

and for clarity is that limit 10k and 60s per this tech link?

https://docs.splunk.com/Documentation/Splunk/9.1.2/SearchTutorial/Useasubsearch

(i think i remember chatting with you before and the actual limit is 50k?)

i just tested a search and got back 20,878 rows so I think it's more than 10k (Splunk v9.06)

 


PickleRick
SplunkTrust

If you dig through the limits.conf file spec - https://docs.splunk.com/Documentation/Splunk/latest/Admin/Limitsconf - you'll see there are several separate limits. Some aspects of subsearching can hit a 10k results limit, while others have a default limit of 50k. If I remember correctly, the join command has a limit of 50k, but the "direct subsearch" can only return 10k results.

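If you'd rather check what your own instance is using than rely on documented defaults, something along these lines may work. It assumes your role is allowed to run the rest command and that the limits.conf stanzas are exposed through the configs endpoint on your search head; the setting names are the ones listed in the limits.conf spec under the subsearch and join stanzas.

```spl
| rest /services/configs/conf-limits splunk_server=local
| search title="subsearch" OR title="join"
| table title maxout maxtime ttl subsearch_maxout subsearch_maxtime
```

The subsearch row should show the result and time limits for a bracketed subsearch, and the join row the separate limits that apply to join.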

loganramirez
Path Finder

Hi (again!),

YES! Your search worked and I get it! The long-form way you wrote it out, with the "AND" condition, shows exactly why it was excluding the disposition events, which is what I meant.

I suppose I didn't think to put the 'bracket subsearch' INSIDE the parenthesized OR clause, and this dramatically reduces the hits to (a) terminated + (b) the | format results.


Thank you!

 
