Solved: Re: Data flows between index search and lookups

amitlookin · ‎03-24-2020

I have a whole lot of servers data indexed for our project (index=* sourcetype=* source=) that needs to be searched based on my look up file on basis of host. I have a join query mentioned below:
| inputlookup bsl_project_host.csv | table host
| join type=left host[search index= sourcetype=* source=* [ | inputlookup bsl_project_host.csv | table host]

It would be really nice, if some one could help me understand how exactly does the data flows in this query, specially the search statement in the query.

to4kawa · ‎03-25-2020

| inputlookup bsl_project_host.csv 
| table host
| join type=left host [search index= sourcetype= 
source= [ | inputlookup bsl_project_host.csv | table host]]

OK, 2nd subsearch result is host=A OR host=B OR ....
and search index= sourcetype= source= (host=A OR host=B OR ....) have host
at last, If bsl_project_host.csv's host and
subsearch's host are same, these are results.

I think,

index= sourcetype= source= [ | inputlookup bsl_project_host.csv | table host |format] |table host

This could be OK.

View solution in original post

to4kawa · ‎03-25-2020

| inputlookup bsl_project_host.csv 
| table host
| join type=left host [search index= sourcetype= 
source= [ | inputlookup bsl_project_host.csv | table host]]

OK, 2nd subsearch result is host=A OR host=B OR ....
and search index= sourcetype= source= (host=A OR host=B OR ....) have host
at last, If bsl_project_host.csv's host and
subsearch's host are same, these are results.

I think,

index= sourcetype= source= [ | inputlookup bsl_project_host.csv | table host |format] |table host

This could be OK.

amitlookin · ‎03-25-2020

Thank you for the reply, i think this cleared my doubt about how the sub search flow is happening i.e. sub search result is acting like a criteria for the main search and then the results are returned.

DalJeanis · ‎06-16-2020

To be clear -

1) Searches work from the farthest inside braces, and work out, with the exception of the braces that follow an appendpipe (we'll come back to that.)

2) The subsearch in braces will return some set of information, and what set of information is defined by the verb that preceded it, if any.

3) If the braces are before the first pipe in the outside search, then the braces will return data as if the inside search had been run, then piped to the format command. thus, if the end of the search in braces looked like this...

[ search index=foo  ... more search terms ... | table host ]

then it returns as if the search had been run like this

index=foo ... more search  terms ... | table host | format

and the output field called search contains something like this

( host="value1" ) OR ( host="value2" ) OR .... ( host="value99" ) )

Before the first pipe, the value of field search is what will be dropped into the outer search.

In your case, the inner search is just taking the data out of the lookup table to create the above stuff.

The next step in your code, it runs that against the index as a subsearch. This is the place where @woodcock 's alarms went off. There are limits on how much data a subsearch will return, so you may get results that are wonky, depending on how long you run the search over.

But let's assume it works. Your data comes back from that subsearch, and then it gets run through on the right side of a join.

Now, joins in Splunk are always, at their most basic level, left joins. Every record on the left side gets matched to the first matching record on the right side. If you have NOT told it to match every record on the right, then only the first gets matched. Then, if you've told it you only want an inner join, it will throw away unmatched records from the left side.

(These are some of the reasons we suggest people avoid join when they can.)

Finally, what you are matching to on the far left is the same lookup table records you had at the far inside.

On the other hand, if you use woodcock's version, then it works this way

All the records from any host are read.

For each record, the lookup table is accessed in memory for a match. If it is found, then the host value from the lookup table is copied to a new field. Next, if the new field is not found, the record is dropped as not being wanted.

Now, if it was possible that there would be no records for any particular host, and you wanted to make sure they were there, then you could add an append at the end like this

| inputlookup bsl_project_host.csv append=true

And, one final thing. after appendpipe, instead of the braces being processed first, they are processed when the data reaches that point in the search. the entire set of records up until that point is put through the logic inside the braces, and then the output of that (subject to subsearch limitations) is added onto the end of the current set of records.

woodcock · ‎03-24-2020

Whenever possible, use lookup files with the lookup command like this:

index="*"
| lookup bsl_project_host.csv host OUTPUT host AS matched
| where isnotnull(matched)

amitlookin · ‎03-25-2020

Thank you for the response. Can you please explain how the search is working here in the join part:
search [index= sourcetype= source= [ | inputlookup bsl_project_host.csv | table host], i mean the data flow.

Thank in advance.

woodcock · ‎03-25-2020

It is a very inefficient and non-scalable way to do it. That is why I answered the way that I did. It should NOT be done that way.

amitlookin · ‎03-25-2020

As a beginner i appreciate these inputs and i am going to keep them in my mind going forward. Actually my intent was to understand the data flow, guess what i now know the data flow and the best practice as well. Thank you for the replies.

Data flows between index search and lookups

lookup

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Splunk MCP & Agentic AI: Machine Data Without Limits

Join the Conversation

Data flows between index search and lookups

lookup

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Splunk MCP & Agentic AI: Machine Data Without Limits