Solved: How can I "reverse-wildcard" lookups (wildcards in...

woodcock · ‎04-01-2016

I have 2 datasets:
1: Windows events to review that have a DoneBy user and a DoneTo user.
2: Work Orders in a DB that should have been the cause for some/most of the Windows events, each row having 3 fields: DoneBy, Notes, and Description.

The Work Order DB is too big to deal with directly (I won't get into the details), Each Work Order DB row consists mainly of a Note that has new content but other than the modified_date usually has no other changes (although occasionally somebody adjusts the Description field which is usually static row-to-row for every Work Order), so there is no good way to tie them directly together in a single SQL query. What I mean is that maybe there is a WO screener who assigns the WO to somebody else so that DB row has the correct DoneTo embedded in the Notes but the wrong DoneBy (because it is the screener, not the admin who will do the work). Then a later row may have the correct DoneTo but he may not reiterate what he did so there is no DoneBy in the Notes (which may just say something like "Done"). Because of this, I cannot write SQL to say, "Give me the row where DoneBy=DoneBy AND ( Notes contains DoneTo OR Description contains DoneTo ) because these may be fractured among multiple distinct rows.

What I had to do is create 3 SQL filters, one for each single-field match, then joining all the Notes rows together by Work Order ID.

So now I have both datasets together and I need to overlay the Work Order dataset onto the Windows Event dataset and that's really where the problem is. Keep in mind that the Windows Event dataset is necessary to generate the WorK Order dataset so the searches cannot be separated. I need to end up with just the Windows Event dataset where each row has a (sometimes multi-valued) Work Order ID field that is the only vestige of the other dataset.

I see no way of moving forward without saving off the Work Order dataset as a lookup file, which I definitely can do.

However, I need the wildcard capability to work opposite to the match_type capability because the Event data will have a short DoneTo field whose value will be embedded deep inside the lookup file's Notes or Description field. Is this possible with lookup?

The only other option that I can think to use is to encode all of Windows Event data into a single KVP field string, feed this to map, have map reconstitute each Windows Event table row (using rex to break apart the amalgam KVP single-field back into the original constituent multiple-fields row), then pulling in just the portions of the lookup file that match the fields of each single row. This is tedious and inefficient but I do not see any other way; do you?

woodcock · ‎04-03-2016

The answer to this question, is:

You should not ever have to, so don't!

My problem with making the Event dataset the lookup file is that in it, as-is, there is nothing to use as the OUTPUT field of the lookup. So because there is nothing unique, it cannot be done. Then I realized (DUH!) that we can easily manufacture like this:

... | streamstats count AS serial

will manufacture a unique key in the event dataset that can be used as the OUTPUT field of the lookup. I am working on this now.

Then I have an Events dataset lookup that looks like this:

DoneBy,DoneTo,serial
*XXX\gwoodcock*,*XXX\luser1*,1
*XXX\gwoodcock*,*YYY\luser2*,2

Now I can do this from the Work Order dataset

... | lookup LookupName DoneBy OUTPUT serial AS DoneBySerial
... | lookup LookupName DoneTo OUTPUT serial AS DoneToSerial

The only tricky part is that I have to enable both match_type = WILDCARD(DoneBySerial), WILDCARD(DoneToSerial) and also case_sensitive_match = false for it all to work.

I can enhance the Work Order dataset with serial numbers, drop those with no lookup matches, pull back in the Events dataset and use stats to merge on the *serial fields (after some normalizing) to get my final mashup.

View solution in original post

woodcock · ‎04-03-2016

The answer to this question, is:

You should not ever have to, so don't!

My problem with making the Event dataset the lookup file is that in it, as-is, there is nothing to use as the OUTPUT field of the lookup. So because there is nothing unique, it cannot be done. Then I realized (DUH!) that we can easily manufacture like this:

... | streamstats count AS serial

will manufacture a unique key in the event dataset that can be used as the OUTPUT field of the lookup. I am working on this now.

Then I have an Events dataset lookup that looks like this:

DoneBy,DoneTo,serial
*XXX\gwoodcock*,*XXX\luser1*,1
*XXX\gwoodcock*,*YYY\luser2*,2

Now I can do this from the Work Order dataset

... | lookup LookupName DoneBy OUTPUT serial AS DoneBySerial
... | lookup LookupName DoneTo OUTPUT serial AS DoneToSerial

The only tricky part is that I have to enable both match_type = WILDCARD(DoneBySerial), WILDCARD(DoneToSerial) and also case_sensitive_match = false for it all to work.

I can enhance the Work Order dataset with serial numbers, drop those with no lookup matches, pull back in the Events dataset and use stats to merge on the *serial fields (after some normalizing) to get my final mashup.

koshyk · ‎03-06-2017

thanks woodcock. this is very complicated and you answered with breeze

martin_mueller · ‎04-02-2016

So... in what way would a wildcarded lookup for *value* on the Notes field not work?

Sample data would really help illustrate this.

woodcock · ‎04-03-2016

Actually after more thought, it can be made to work. To answer your question (pre-theough):

Because there is no way to map a match in the one dataset back to a particular event. In other words, what will the OUTPUT field of the lookup field be that will be unique to the original dataset? Because there is nothing unique, it cannot be done. That is UNLESS we manufacture one! I realized that I can use |streamstats count AS serial will manufacture a unique key in the event dataset that can be used as the OUTPUT field of the lookup. I am working on this now.

I also had a breakthrough on the other way (using map) and will probably have both solutions done at about the same time and can do side-by-side comparison. I know the lookup solution will beat the map solution but I am curious how severely.

maciep · ‎04-01-2016

How big are the notes fields? Would it be possible/accurate and not too expensive to break each word in that field into an event, eg split the notes field by space and mvexpand to put them in each in their own event? And then join the windows event data to that subsearch on the doneby doneto fields?

woodcock · ‎04-02-2016

No, Notes and Description are up to 4000 characters each.

maciep · ‎04-03-2016

The other thought i had was whether it would help at all to make the lookup from the windows event data, since you could wildcard doneby and doneto, and then that could be used to match against the notes field. Not sure if that would be at all helpful/relevant thought

martin_mueller · ‎04-01-2016

I'm distilling "I have a field Notes that may contain a field DoneTo inside its value, and want to add a lookup to that DoneTo field"?

If so, extract the DoneTo field from the Notes field and do the lookup in the usual manner.
If not, do elaborate - ideally with sample data.

woodcock · ‎04-01-2016

s/may contain a field/may contain the value of field/

How can I "reverse-wildcard" lookups (wildcards inside Event data, not inside Lookup files)?

Tech Talk Recap | Mastering Threat Hunting

Observability for AI Applications: Troubleshooting Latency

Splunk AI Assistant for SPL vs. ChatGPT: Which One is Better?

Are you a member of the Splunk Community?

How can I "reverse-wildcard" lookups (wildcards inside Event data, not inside Lookup files)?

Tech Talk Recap | Mastering Threat Hunting

Observability for AI Applications: Troubleshooting Latency

Splunk AI Assistant for SPL vs. ChatGPT: Which One is Better?