We need to run the same query over a list of values (10k to 100k) without knowing the exact key across various indexes where they might show up.
What's the best way to do this in a scalable way. For example, same search over a list of users
(user1 OR user2 OR user3) | stats count by index
The desired output should be a table with both user and index as columns, not just the index. But again, the field is not known ahead of time as it varies by index (or could simply be in _raw)
Should one use map or how to assign each of user_n to a variable?
Also, I don't have the ability to upload a CSV. The query must be composed programmatically via the python SDK.
This isn't scalable, but let's use it as an example to explore the topic. You can combine walklex with a subsequent subsearch to return a list of indexes containing an indexed term, with or without indexed field names. This will not find values derived at search time, e.g. a lookup that converts an user identifier to a user name.
| walklex index=* index=_*
| search
[| inputlookup alancalvitti_users.csv
| eval term=mvappend(user, "*::".user)
| fields term
| mvexpand term]
| table index term
The output may look something like this:
index | term |
main | user::user1 |
main | user2 |
I.e.:
index=main (TERM(user::user1) OR TERM(user2))
will return two or more events (at least one for each term).
Couple of questions
First, we are querying splunk using python SDK. How can we pass the input data that in the reply is in the CSV file?
Second, while the question as posed was meant to be a minimal example, in reality we need to combine the match with arbitrary queries, eg regex or stats on the _raw events that contain any matching users in addition to the user itself. - does your answer allow this flexibility?
Last, you mention this is not a scalable solution? But are there scalable solutions (eg using map or for_each)? The enterprise logging we're querying is very high volume.
It's important to remember that Splunk is schema-less. Fields can exist in an index, but more often than not, fields are created dynamically at search time.
To use a lookup file, you must upload it to the search head. See Define a CSV lookup in Splunk Web. Splunk provides a REST API to manage lookups, but that's best discussed in a separate question. In any case, I only used a lookup as an example.
If I'm restating your question correctly, it's "How can I find all users in all events in any field?" Nearly all Splunk solutions use a lookup of some form to define known users and depend on normalized events with a field named user to identify arbitrary or unknown users.
Your solution should include a process for normalizing your data, typically performed by someone in your environment serving as a Splunk knowledge manager.
Combining the two in an arbitrary way:
index=* (user=* OR ([| inputlookup alancalvitti_users.csv | return 10000 $user ]))
The 10000 in the return command limits the subsearch result to the first 10,000 users in the lookup file. Unlike most commands with a limit, you can't pass 0 for unlimited. The return command is similar to the head command in this regard.
After the subsearch is executed, the parsed search becomes e.g.:
index=* (user=* OR ((user1) OR (user2) OR (user3)))
If an event in the result does not contain a user field, you could attempt to create one with a very expensive rex command:
index=* (user=* OR ([| inputlookup alancalvitti_users.csv | return 10000 $user ]))
| rex max_match=0 [| inputlookup alancalvitti_users.csv | mvcombine user | eval regex="\"(?<user>(?:".mvjoin(user, "|")."))\"" | return $regex]
These are just examples and not complete solutions, but they'll hopefully feed a bit of creativity.
The map and foreach commands have specific purposes.
The map command executes a subsearch for each result in the main search. It's similar to the appendcols command but executes many subsearches rather than one subsearch.
The foreach command executes a subsearch for each matching field in each result of the main search. It's similar to the appendpipe command but adds columns rather than rows.
"Nearly all Splunk solutions use a lookup of some form to define known users and depend on normalized events with a field named user to identify arbitrary or unknown users" - no this is not feasible. The Splunk instances we're querying are fed by hundreds of apps, some of which are 3rd party and cannot be forced to normalize their data.
The "username" might also appear in _raw as text not part of a key-value pair.
Can you please re-assess given this detail?