Solved: What is the best way to join/combine/correlate fie...

packet_hunter · ‎03-16-2016

Scenario:
I am searching email event logs. I can find some of the needed fields by a unique id (UID) and I find some fields by diffferent unique id (X-UID). Some events contain both UID and X-UID but not all the fields I need.

Here is a sample of the code:

[search index=mail sourcetype=xemail subject = "Blah" |stats count by UID| fields UID] 
    |stats list(subject) as subj list(sender) as sender list(recipient) as recp list(vendor_action) as status by UID 
[search index=mail sourcetype=xemail sender = "sender@domain.com" |stats count by XUID| fields XUID] 
    |stats list(dest) as dest_ip list(sender) by XUID

Ultimately I would like results to show
subj sender recp status dest_ip

Thank you

martin_mueller · ‎03-22-2016

Re your long comment on the question: That's exactly what transaction does, even spanning multiple chained ID fields.

Here's a working example:

| stats count as raw | eval raw = "subject=foo uid=123
subject=foo2 uid=321
sender=bar2 uid=321 xuid=cba
sender=bar uid=123 xuid=abc
recipient=baz xuid=abc
recipient=baz2 xuid=cba" | makemv delim="
" raw | mvexpand raw | rename raw as _raw | extract
| eval _time = time()-(random()%1000) | sort - _time
| transaction uid xuid | table _time duration eventcount subject sender recipient uid xuid

Make sure you keep the line breaks as they are here, that's important for this dirty kind of dummy data generation from within the search bar.

First I set up six events, three events per email, each event containing only one "email-y" field.
The events for subject and sender are tied together with uid, the events for sender and recipient are tied together with xuid, and the event for sender ties together the uid and the xuid giving you a nice transitive transaction.

If you want to search for subject, sender, etc before building the transaction you can either do that manually:

The good case: You have an event with the field to filter by (say, sender) and both ID fields.

  index=mail sourcetype=xemail
[ search index=mail sourcetype=xemail sender=foo | fields UID XUID | dedup UID XUID | format "(" "(" "OR" ")" "OR" ")" ]
| transaction UID XUID

That will search for events matching your sender and use the UID and XUID field to search all potential matches beyond the "sender-event", then build the transaction from there.

The bad case, #1: You have an event with the field to filter by, but only the UID field.

  index=mail sourcetype=xemail
[ search index=mail sourcetype=xemail 
  [ search index=mail sourcetype=xemail sender=foo | fields UID | dedup UID ]
  | fields UID XUID | dedup UID XUID | format "(" "(" "OR" ")" "OR" ")" ]
| transaction UID XUID

The innermost subsearch will go in, search for your sender, and come back with a list of UIDs. Those are inserted into the outer subsearch, that will go and retrieve all events with that UID - some of those will have the missing XUID! From there it proceeds like the good case above, using UID or XUID to collect together all relevant events and run the transaction.

The bad case, #2: You have an event with the field to filter by, but only the XUID field.
This works like the bad case #1, but you need to add two Xs to the innermost subsearch... so when filtering, you need to know which of the two bad cases to run 😞

Even though the two bad cases mean you have to go through your index thrice, each run should be a fairly rare search. This will be slower if you search for a sender that sent 90% of all emails, mind.

Instead of building those subsearch-monsters manually, there is a much-forgotten search command searchtxn to do just that for you (I think, don't have data handy to actually test).
To use that, you first have to set up a transaction type in transactiontypes.conf like this:

[xemail]
fields = UID, XUID
search = index=mail sourcetype=xemail

To confirm that this type works, run a regular non-filtering search with | transaction name=xemail and see that it returns the same things as manually specifying the fields. Once that's done, run this with nothing else in the search bar:

| searchtxn xemail sender=foo

That should collect together all the required IDs and neatly return only matching transactions without scanning everything.

http://docs.splunk.com/Documentation/Splunk/6.3.3/SearchReference/searchtxn

One caveat about searchtxn, it's not going to honour your time range picker.

View solution in original post

maciep · ‎03-23-2016

I may regret this, but thought I'd take a stab at it too. But first things first, I saw that martin asked for sample data almost immediately, but I didn't see any except what he later came up with. Can you share some of these logs with us? Do you have time obfuscate/mask private info? Or just come up with fake data in that log format for us? When it comes to these more complex searches, you will find (are finding?) there's a lot of "try this"..."nope that didn't work" back-and-forth, because we can't really test these searches to see if they give you what you think you need. Sample data would be helpful is all.

After reading everything (I think?), is it safe to assume the following:

You want results from one session
You want to supply the search with a field relevant to that session (If so, will that always be unique? For example, two separate sessions might have a subject of "test session")
A session will have exactly one single, unique uid
A session will also have exactly one singe, unique xuid
A session will have at least one event that contains both the uid and xuid for that session.

If that's the case, i'd be leaning toward a couple joins. Something like

index=email subject="important subject" | join uid  [search index=email xuid=*] | join xuid sender dest [index=email xuid=* NOT sender=*] | format all everything nice here

Not sure if that would work for you, just another idea. And of course, if I got those assumptions wrong, then you can disregard this answer, because then I still don't understand the rules/scenario of the problem.

packet_hunter · ‎03-24-2016

Thank you, I have considered join but could not get it to work properly. I will give a try too and let you know.

martin_mueller · ‎03-22-2016

Re your long comment on the question: That's exactly what transaction does, even spanning multiple chained ID fields.

Here's a working example:

| stats count as raw | eval raw = "subject=foo uid=123
subject=foo2 uid=321
sender=bar2 uid=321 xuid=cba
sender=bar uid=123 xuid=abc
recipient=baz xuid=abc
recipient=baz2 xuid=cba" | makemv delim="
" raw | mvexpand raw | rename raw as _raw | extract
| eval _time = time()-(random()%1000) | sort - _time
| transaction uid xuid | table _time duration eventcount subject sender recipient uid xuid

Make sure you keep the line breaks as they are here, that's important for this dirty kind of dummy data generation from within the search bar.

First I set up six events, three events per email, each event containing only one "email-y" field.
The events for subject and sender are tied together with uid, the events for sender and recipient are tied together with xuid, and the event for sender ties together the uid and the xuid giving you a nice transitive transaction.

If you want to search for subject, sender, etc before building the transaction you can either do that manually:

The good case: You have an event with the field to filter by (say, sender) and both ID fields.

  index=mail sourcetype=xemail
[ search index=mail sourcetype=xemail sender=foo | fields UID XUID | dedup UID XUID | format "(" "(" "OR" ")" "OR" ")" ]
| transaction UID XUID

That will search for events matching your sender and use the UID and XUID field to search all potential matches beyond the "sender-event", then build the transaction from there.

The bad case, #1: You have an event with the field to filter by, but only the UID field.

  index=mail sourcetype=xemail
[ search index=mail sourcetype=xemail 
  [ search index=mail sourcetype=xemail sender=foo | fields UID | dedup UID ]
  | fields UID XUID | dedup UID XUID | format "(" "(" "OR" ")" "OR" ")" ]
| transaction UID XUID

The innermost subsearch will go in, search for your sender, and come back with a list of UIDs. Those are inserted into the outer subsearch, that will go and retrieve all events with that UID - some of those will have the missing XUID! From there it proceeds like the good case above, using UID or XUID to collect together all relevant events and run the transaction.

The bad case, #2: You have an event with the field to filter by, but only the XUID field.
This works like the bad case #1, but you need to add two Xs to the innermost subsearch... so when filtering, you need to know which of the two bad cases to run 😞

Even though the two bad cases mean you have to go through your index thrice, each run should be a fairly rare search. This will be slower if you search for a sender that sent 90% of all emails, mind.

Instead of building those subsearch-monsters manually, there is a much-forgotten search command searchtxn to do just that for you (I think, don't have data handy to actually test).
To use that, you first have to set up a transaction type in transactiontypes.conf like this:

[xemail]
fields = UID, XUID
search = index=mail sourcetype=xemail

To confirm that this type works, run a regular non-filtering search with | transaction name=xemail and see that it returns the same things as manually specifying the fields. Once that's done, run this with nothing else in the search bar:

| searchtxn xemail sender=foo

That should collect together all the required IDs and neatly return only matching transactions without scanning everything.

http://docs.splunk.com/Documentation/Splunk/6.3.3/SearchReference/searchtxn

One caveat about searchtxn, it's not going to honour your time range picker.

martin_mueller · ‎03-23-2016

I have significantly extended the actual answer, please give the searchtxn approach a shot and let us know if it works. H/T to @sideview!

packet_hunter · ‎03-24-2016

Thank you Martin. I will review and apply your suggestions and definitely let you know if it works.

packet_hunter · ‎03-30-2016

Thank you Martin, your first option works. The searchtxn I have not confirmed yet, but I will keep working on it. Thank you for all the help!!!

packet_hunter · ‎03-23-2016

Thank you Martin for enduring my long response and providing the code. I will try it and let you know.

packet_hunter · ‎03-23-2016

Thank you Martin, I see your strategy, however my objective was to search by one field value, like subject = blah, and automate the subsequent search that would produce the other needed fields. With your suggestion I would need to lookup the uid and xuid with multiple searches. I am trying to avoid the tedious/manual multiple searches.

Do you know how to search for a field value and use the result to automatically launch a subsequent search based on the returned value? That is where I am stuck.

Thank you

martin_mueller · ‎03-23-2016

I see, you want to filter by either sender, subject, etc. without loading all the transactions. That's possible, but a bit more tricky.

The good case: You have an event with the field to filter by (say, sender) and both ID fields.

  index=mail sourcetype=xemail
[ search index=mail sourcetype=xemail sender=foo | fields UID XUID | dedup UID XUID | format "(" "(" "OR" ")" "OR" ")" ]
| transaction UID XUID

That will search for events matching your sender and use the UID and XUID field to search all potential matches beyond the "sender-event", then build the transaction from there.

The bad case, #1: You have an event with the field to filter by, but only the UID field.

  index=mail sourcetype=xemail
[ search index=mail sourcetype=xemail 
  [ search index=mail sourcetype=xemail sender=foo | fields UID | dedup UID ]
  | fields UID XUID | dedup UID XUID | format "(" "(" "OR" ")" "OR" ")" ]
| transaction UID XUID

The innermost subsearch will go in, search for your sender, and come back with a list of UIDs. Those are inserted into the outer subsearch, that will go and retrieve all events with that UID - some of those will have the missing XUID! From there it proceeds like the good case above, using UID or XUID to collect together all relevant events and run the transaction.

The bad case, #2: You have an event with the field to filter by, but only the XUID field.
This works like the bad case #1, but you need to add two Xs to the innermost subsearch... so when filtering, you need to know which of the two bad cases to run 😞

Even though the two bad cases mean you have to go through your index thrice, each run should be a fairly rare search. This will be slower if you search for a sender that sent 90% of all emails, mind.

lguinn2 · ‎03-23-2016

All I think you need to do is to add sender to the list of fields for transaction:

| transaction sender uid xuid | table _time duration eventcount subject sender recipient uid xuid

lguinn2 · ‎03-23-2016

Or perhaps, assume that you have used a form to collect the value of the subject field and have stored in a token named $subject$
Your search could be

index=mail sourcetype=xemail UID=* OR XUID=* sender=* subject="$subject$*"
| transaction sender uid xuid 
| table _time duration eventcount subject sender recipient uid xuid

You could use any field, not just subject. However, the trick to this search is that first you retrieve only the events that have the field of interest. Then use transaction to group them based on the sender and the uids. Note that sender is actually the only field that you named as being part of both events with UID and events with XUID. So sender needs to be part of the transaction command.

packet_hunter · ‎03-23-2016

So as you may have gathered, I am new to splunk...
Not sure what you mean by "using a form" is that like creating a lookup file? Say using the search to store results in a lookup and then use the lookup to find the other field values?

From a high level, is there a way to search and use the results to kick off another search with those results... say eval and searchmatch....? Basically chaining a search... other than transaction, b/c that is too slow...

Thank you

martin_mueller · ‎03-21-2016

Use a short time range for testing.

packet_hunter · ‎03-22-2016

Thank you for the advice, using the short time range, however I was aware of that. The problem is that "transaction " is just too expensive.

But perhaps I should rephrase my question for you again.
So I am working with email logs. The logs are such that subject, sender, recipient, attachment, etc are all in separate events that share a unique ID (uid). If I use:

[search index=mail sourcetype=xemail subject = "Blah" |stats count by UID| fields UID] 

|stats list(subject) as subj list(sender) as sender list(recipient) as recp list(vendor_action) as status by UID

I will get some of the fields that I want (e.g. subject, sender, recipient, status, etc.).

However there are other fields I want that are not associated with UID but rather XUID. If I use:

    [search index=mail sourcetype=xemail sender = "Blah" |stats count by XUID| fields XUID] 

    |stats list(dest) as dest_ip list(recipient) as recp by XUID

I will get some other fields I want (e.g. dest_ip).

There are events that contain both UID and XUID, but I don't know how to create a secondary search to pull up all the events containing the XUID found by the primary search of the UID.

Bottom line I am looking for a way to search a subject and get all the fields associated with the UID, which includes the XUID. Then use the XUID results to find all fields associated with the XUID number to combine all fields by UID and XUID for that email session.

If I am going about this the hard way and you have a better solution please let me know.

Thank you

martin_mueller · ‎03-18-2016

For multiple-chained-keys transactions the transaction command should be ideal. What happens if you run this?

index=mail sourcetype=xemail | transaction UID XUID | table _time duration subject sender recipient vendor_action dest

packet_hunter · ‎03-21-2016

Thank you, I will give it a try, however there are a large number of events so I will have to [subsearch] first and then use transaction otherwise it will take too long.

lguinn2 · ‎03-18-2016

 index=mail sourcetype=xemail UID=* OR XUID=* sender=*
 | stats list(recipient) as recipient  list(subject) as subject  list(vendor_action) as status 
         list(dest_ip) as dest_ip list(UID) as UID list(XUID) as XUID by sender

I would avoid the use of the transaction command if there is a large number of events.
This search is very fast and simple. Use "values" instead of "list" in the stats command if you want remove duplicates from the results. Add "_time" at the end of the stats command if you want to show the time that each event occurred.

packet_hunter · ‎03-18-2016

Thank you I will give it a try.

snoobzilla · ‎03-18-2016

How about using the transaction command on sender?

yoursearch | transaction mvlist=1 sender | table _time subject sender UID X_UID

You may want to through a fillnull command in there. You could also add another case/coalesce based field if you need to conditionally include based on UID X_UID.

martin_mueller · ‎03-16-2016

So, if they share the sender field, how do you distinguish two conversations by the same sender?

What is the best way to join/combine/correlate fields from separate events with separate UIDs?

Earn a $35 Gift Card for Answering our Splunk Admins & App Developer Survey

Continuing Innovation & New Integrations Unlock Full Stack Observability For Your ...

Monitoring Amazon Elastic Kubernetes Service (EKS)