Splunk Dev

Unpivot multi-value fields

angliu
Engager

I am working with some email header data, starting with generating some multi-value fields and now get to this point...

 index=teed sourcetype="emaildata" | eval max_date_sent= max(DateTimeSent) | dedup message_id, max_date_sent |table FromEmailAddress, Subject, DateTimeSent, ToRecipient, CcRecipient, BccRecipient | where FromEmailAddress != "" 

alt text

but what I really want is to examine the individual relationship, so the final output will like this...
alt text

I've tried this

 index=cms sourcetype="exchangeemails" | eval max_date_sent= max(DateTimeSent) | dedup InternetMessageID, max_date_sent | eval tocc=mvzip(ToRecipients,CcRecipients) | eval receivers=mvzip(tocc,CcRecipients) | mvexpand receivers | makemv delim="," receivers | table FromEmailAddress, receivers

but this returns some duplicated records for the receivers, and cannot distinguish if the receiver was originally a to/cc/bcc.
Can someone point direction? Thanks!

Tags (1)
0 Karma
1 Solution

gvmorley
Contributor

Hi,

I'm not sure if this is the 'best' way, as it feels somewhat overly complicated, but I think I can achieve close to what you're looking for with the following...

This first bit of code is just to simulate your data:

| makeresults 
| fields - _time 
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com" 
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com") 
| append 
    [| makeresults 
    | fields - _time 
    | eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com" 
    | eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")] 

This gives 2 results like this:

alt text

Next we need some sort of unique ID for each result. You may well have this in your original data, but if not, you can just hash some (or all) fields:

| eval email_id=sha256(Sender.Subject.Date)

Then we add the 'type' to the different address, so that when we manipulate these address later, we don't lose what track of what they were:

| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g" 
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g" 
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g" 
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g" 

Next, put all of the addresses together into a single field and drop the other fields that we don't need:

| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender) 
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject 

Now we want to create a result for each individual email address. This is where the 'email_id' field is useful, so that we have an association between the addresses of individual emails:

| mvexpand AllEmails 

And we can now create a 'Type' field to represent if the address was Sender,To,Cc or Bcc. And at the same time, tidy up the original fields:

| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))" 
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g" 

At this point, you should now have a result for every single address.

Now the fun bit. Since you want to associate some data that you've already got, with itself, we can use an appendpipe and a bit of stats to 'duplicate' the results. At the same time, we rename a number of fields to Email2 and Email2_Type:

| appendpipe 
    [| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails
    | fields - AllEmails]

Now we use a selfjoin on the email_id to match results from the first part with the second part. Importantly we need to max=0 to ensure that there is an 'everything-to-everything' result:

| selfjoin max=0 email_id 

Finally we rename some of the original fields to make more sense. And we filter out the results which didn't have a join, plus a bit of sorting:

| rename AllEmails AS Email1 Email_Type AS Email1_Type 
| where isnotnull(Email2) AND Email1!=Email2 
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type

And finally, we should have the table which you were looking for:

alt text

At this point you can drop the email_id field as it's no longer needed.

IMPORTANT: There is still one problem with this. It doesn't yet deal with the 'inverse' duplication problem.

I.e. Where you have:

Email1 Email1_Type Email2 Email2_Type
yyy@gmail.com Sender yyy@hotmail.com To
yyy@hotmail.com To yyy@gmail.com Sender

Sorry - I ran out of time for that bit, but if I get a chance, I'll see if I can figure it out later!

Hope that helps.


Below is the full search, so it's easier to copy and paste:

| makeresults 
| fields - _time 
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com" 
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com") 
| append 
    [| makeresults 
    | fields - _time 
    | eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com" 
    | eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")] 
| eval email_id=sha256(Sender.Subject.Date) 
| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g" 
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g" 
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g" 
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g" 
| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender) 
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject 
| mvexpand AllEmails 
| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))" 
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g" 
| appendpipe 
    [| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails 
    | fields - AllEmails] 
| selfjoin max=0 email_id 
| rename AllEmails AS Email1 Email_Type AS Email1_Type 
| where isnotnull(Email2) AND Email1!=Email2
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type

View solution in original post

gvmorley
Contributor

Hi,

I'm not sure if this is the 'best' way, as it feels somewhat overly complicated, but I think I can achieve close to what you're looking for with the following...

This first bit of code is just to simulate your data:

| makeresults 
| fields - _time 
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com" 
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com") 
| append 
    [| makeresults 
    | fields - _time 
    | eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com" 
    | eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")] 

This gives 2 results like this:

alt text

Next we need some sort of unique ID for each result. You may well have this in your original data, but if not, you can just hash some (or all) fields:

| eval email_id=sha256(Sender.Subject.Date)

Then we add the 'type' to the different address, so that when we manipulate these address later, we don't lose what track of what they were:

| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g" 
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g" 
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g" 
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g" 

Next, put all of the addresses together into a single field and drop the other fields that we don't need:

| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender) 
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject 

Now we want to create a result for each individual email address. This is where the 'email_id' field is useful, so that we have an association between the addresses of individual emails:

| mvexpand AllEmails 

And we can now create a 'Type' field to represent if the address was Sender,To,Cc or Bcc. And at the same time, tidy up the original fields:

| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))" 
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g" 

At this point, you should now have a result for every single address.

Now the fun bit. Since you want to associate some data that you've already got, with itself, we can use an appendpipe and a bit of stats to 'duplicate' the results. At the same time, we rename a number of fields to Email2 and Email2_Type:

| appendpipe 
    [| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails
    | fields - AllEmails]

Now we use a selfjoin on the email_id to match results from the first part with the second part. Importantly we need to max=0 to ensure that there is an 'everything-to-everything' result:

| selfjoin max=0 email_id 

Finally we rename some of the original fields to make more sense. And we filter out the results which didn't have a join, plus a bit of sorting:

| rename AllEmails AS Email1 Email_Type AS Email1_Type 
| where isnotnull(Email2) AND Email1!=Email2 
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type

And finally, we should have the table which you were looking for:

alt text

At this point you can drop the email_id field as it's no longer needed.

IMPORTANT: There is still one problem with this. It doesn't yet deal with the 'inverse' duplication problem.

I.e. Where you have:

Email1 Email1_Type Email2 Email2_Type
yyy@gmail.com Sender yyy@hotmail.com To
yyy@hotmail.com To yyy@gmail.com Sender

Sorry - I ran out of time for that bit, but if I get a chance, I'll see if I can figure it out later!

Hope that helps.


Below is the full search, so it's easier to copy and paste:

| makeresults 
| fields - _time 
| eval Sender="xxx@gmail.com", Subject="something", Date="3/1/2017", ToRecipient="xxx@hotmail.com", CcRecipient="xxx@msn.com", BccRecipient="xxx@wa.com" 
| eval CcRecipient=mvappend(CcRecipient,"xxx@yahoo.com","xxx@splunk.com") 
| append 
    [| makeresults 
    | fields - _time 
    | eval Sender="yyy@gmail.com", Subject="something else", Date="3/1/2017", ToRecipient="yyy@hotmail.com", CcRecipient="yyy@msn.com", BccRecipient="yyy@wa.com" 
    | eval CcRecipient=mvappend(CcRecipient,"yyy@yahoo.com","yyy@splunk.com")] 
| eval email_id=sha256(Sender.Subject.Date) 
| rex max_match=0 mode=sed field=CcRecipient "s/(.*)/\1|Cc/g" 
| rex max_match=0 mode=sed field=BccRecipient "s/(.*)/\1|Bcc/g" 
| rex max_match=0 mode=sed field=ToRecipient "s/(.*)/\1|To/g" 
| rex max_match=0 mode=sed field=Sender "s/(.*)/\1|Sender/g" 
| eval AllEmails=mvappend(ToRecipient,CcRecipient,BccRecipient,Sender) 
| fields - ToRecipient,CcRecipient,BccRecipient,Sender,Date,Subject 
| mvexpand AllEmails 
| rex field=AllEmails "^[^|]+\|(?<Email_Type>(?:To|Cc|Bcc|Sender))" 
| rex field=AllEmails mode=sed "s/([^|]+).*/\1/g" 
| appendpipe 
    [| stats list(AllEmails) AS Email2 list(Email_Type) AS Email2_Type list(email_id) AS email_id by AllEmails 
    | fields - AllEmails] 
| selfjoin max=0 email_id 
| rename AllEmails AS Email1 Email_Type AS Email1_Type 
| where isnotnull(Email2) AND Email1!=Email2
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type

gvmorley
Contributor

It's not pretty, but adding this at the end of the Search should fix-up the duplicate inverse problem:

| eval Checks=mvappend(sha256(Email1.Email1_Type.Email2.Email2_Type.email_id),sha256(Email2.Email2_Type.Email1.Email1_Type.email_id))
| mvexpand Checks
| dedup Checks
| dedup Email1,Email1_Type,Email2,Email2_Type,email_id
| fields - Checks
| sort + email_id,Email1,Email1_Type,Email2,Email2_Type

But worth trying with a bigger dataset to make sure that the whole thing scales enough for you.

0 Karma

angliu
Engager

Wow. You're the best. Thank you so much.
If it works, it works! At least it gets me much further down the path... thanks again!

0 Karma

gvmorley
Contributor

No worries. I'm pleased that it helped. Enjoy your Splunk journey!

(And if you're happy with the answer, just mark the question as 'answered' - as it helps the rest of the community know that this one is sorted.)

0 Karma
Get Updates on the Splunk Community!

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...

State of Splunk Careers 2024: Maximizing Career Outcomes and the Continued Value of ...

For the past four years, Splunk has partnered with Enterprise Strategy Group to conduct a survey that gauges ...

Data-Driven Success: Splunk & Financial Services

Splunk streamlines the process of extracting insights from large volumes of data. In this fast-paced world, ...