I am working with some email header data, starting with generating some multi-value fields and now get to this point...
index=teed sourcetype="emaildata" | eval max_date_sent= max(DateTimeSent) | dedup message_id, max_date_sent |table FromEmailAddress, Subject, DateTimeSent, ToRecipient, CcRecipient, BccRecipient | where FromEmailAddress != ""
but what I really want is to examine the individual relationship, so the final output will like this...
I've tried this
index=cms sourcetype="exchangeemails" | eval max_date_sent= max(DateTimeSent) | dedup InternetMessageID, max_date_sent | eval tocc=mvzip(ToRecipients,CcRecipients) | eval receivers=mvzip(tocc,CcRecipients) | mvexpand receivers | makemv delim="," receivers | table FromEmailAddress, receivers
but this returns some duplicated records for the receivers, and cannot distinguish if the receiver was originally a to/cc/bcc.
Can someone point direction? Thanks!
Hi,
I'm not sure if this is the 'best' way, as it feels somewhat overly complicated, but I think I can achieve close to what you're looking for with the following...
This first bit of code is just to simulate your data:
| makeresults
| fields - _time
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com")
| append
[| makeresults
| fields - _time
| eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")]
This gives 2 results like this:
Next we need some sort of unique ID for each result. You may well have this in your original data, but if not, you can just hash some (or all) fields:
| eval email_id=sha256(Sender.Subject.Date)
Then we add the 'type' to the different address, so that when we manipulate these address later, we don't lose what track of what they were:
| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g"
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g"
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g"
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g"
Next, put all of the addresses together into a single field and drop the other fields that we don't need:
| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender)
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject
Now we want to create a result for each individual email address. This is where the 'email_id' field is useful, so that we have an association between the addresses of individual emails:
| mvexpand AllEmails
And we can now create a 'Type' field to represent if the address was Sender,To,Cc or Bcc. And at the same time, tidy up the original fields:
| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))"
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g"
At this point, you should now have a result for every single address.
Now the fun bit. Since you want to associate some data that you've already got, with itself, we can use an appendpipe and a bit of stats to 'duplicate' the results. At the same time, we rename a number of fields to Email2 and Email2_Type:
| appendpipe
[| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails
| fields - AllEmails]
Now we use a selfjoin on the email_id to match results from the first part with the second part. Importantly we need to max=0 to ensure that there is an 'everything-to-everything' result:
| selfjoin max=0 email_id
Finally we rename some of the original fields to make more sense. And we filter out the results which didn't have a join, plus a bit of sorting:
| rename AllEmails AS Email1 Email_Type AS Email1_Type
| where isnotnull(Email2) AND Email1!=Email2
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type
And finally, we should have the table which you were looking for:
At this point you can drop the email_id field as it's no longer needed.
IMPORTANT: There is still one problem with this. It doesn't yet deal with the 'inverse' duplication problem.
I.e. Where you have:
Email1 Email1_Type Email2 Email2_Type
yyy@gmail.com Sender yyy@hotmail.com To
yyy@hotmail.com To yyy@gmail.com Sender
Sorry - I ran out of time for that bit, but if I get a chance, I'll see if I can figure it out later!
Hope that helps.
Below is the full search, so it's easier to copy and paste:
| makeresults
| fields - _time
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com")
| append
[| makeresults
| fields - _time
| eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")]
| eval email_id=sha256(Sender.Subject.Date)
| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g"
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g"
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g"
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g"
| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender)
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject
| mvexpand AllEmails
| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))"
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g"
| appendpipe
[| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails
| fields - AllEmails]
| selfjoin max=0 email_id
| rename AllEmails AS Email1 Email_Type AS Email1_Type
| where isnotnull(Email2) AND Email1!=Email2
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type
Hi,
I'm not sure if this is the 'best' way, as it feels somewhat overly complicated, but I think I can achieve close to what you're looking for with the following...
This first bit of code is just to simulate your data:
| makeresults
| fields - _time
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com")
| append
[| makeresults
| fields - _time
| eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")]
This gives 2 results like this:
Next we need some sort of unique ID for each result. You may well have this in your original data, but if not, you can just hash some (or all) fields:
| eval email_id=sha256(Sender.Subject.Date)
Then we add the 'type' to the different address, so that when we manipulate these address later, we don't lose what track of what they were:
| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g"
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g"
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g"
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g"
Next, put all of the addresses together into a single field and drop the other fields that we don't need:
| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender)
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject
Now we want to create a result for each individual email address. This is where the 'email_id' field is useful, so that we have an association between the addresses of individual emails:
| mvexpand AllEmails
And we can now create a 'Type' field to represent if the address was Sender,To,Cc or Bcc. And at the same time, tidy up the original fields:
| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))"
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g"
At this point, you should now have a result for every single address.
Now the fun bit. Since you want to associate some data that you've already got, with itself, we can use an appendpipe and a bit of stats to 'duplicate' the results. At the same time, we rename a number of fields to Email2 and Email2_Type:
| appendpipe
[| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails
| fields - AllEmails]
Now we use a selfjoin on the email_id to match results from the first part with the second part. Importantly we need to max=0 to ensure that there is an 'everything-to-everything' result:
| selfjoin max=0 email_id
Finally we rename some of the original fields to make more sense. And we filter out the results which didn't have a join, plus a bit of sorting:
| rename AllEmails AS Email1 Email_Type AS Email1_Type
| where isnotnull(Email2) AND Email1!=Email2
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type
And finally, we should have the table which you were looking for:
At this point you can drop the email_id field as it's no longer needed.
IMPORTANT: There is still one problem with this. It doesn't yet deal with the 'inverse' duplication problem.
I.e. Where you have:
Email1 Email1_Type Email2 Email2_Type
yyy@gmail.com Sender yyy@hotmail.com To
yyy@hotmail.com To yyy@gmail.com Sender
Sorry - I ran out of time for that bit, but if I get a chance, I'll see if I can figure it out later!
Hope that helps.
Below is the full search, so it's easier to copy and paste:
| makeresults
| fields - _time
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com")
| append
[| makeresults
| fields - _time
| eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com"
| eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")]
| eval email_id=sha256(Sender.Subject.Date)
| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g"
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g"
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g"
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g"
| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender)
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject
| mvexpand AllEmails
| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))"
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g"
| appendpipe
[| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails
| fields - AllEmails]
| selfjoin max=0 email_id
| rename AllEmails AS Email1 Email_Type AS Email1_Type
| where isnotnull(Email2) AND Email1!=Email2
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type
It's not pretty, but adding this at the end of the Search should fix-up the duplicate inverse problem:
| eval Checks=mvappend(sha256(Email1.Email1_Type.Email2.Email2_Type.email_id),sha256(Email2.Email2_Type.Email1.Email1_Type.email_id))
| mvexpand Checks
| dedup Checks
| dedup Email1,Email1_Type,Email2,Email2_Type,email_id
| fields - Checks
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type
But worth trying with a bigger dataset to make sure that the whole thing scales enough for you.
Wow. You're the best. Thank you so much.
If it works, it works! At least it gets me much further down the path... thanks again!
No worries. I'm pleased that it helped. Enjoy your Splunk journey!
(And if you're happy with the answer, just mark the question as 'answered' - as it helps the rest of the community know that this one is sorted.)