Splunk Search

How to retrieve the latest related event from one sourcetype based on a common identifying field from another sourcetype?

arnol229
Explorer

I have 2 sets of events, sourcetype=user_profiles and sourcetype=app_opened which both share common identifiers ( id / user_id).
the user profile id is on the outer layer, while the user_id is nested within eventData.

user_profiles:

{id:1234,
...}

app_opened:

{
eventData:{
user_id:1234,last_opened_location:'blah',last_opened_date:'5/20/2015'},
}

I want to table out profile data and look up the latest app_opened event for each user and display some fields.
UserID Name email .... last_opened_location last_opened_date
Help me Obi-wan Kenobi, you are my only hope.

Tags (3)
1 Solution

acharlieh
Influencer

Looking at your sample data I am making the assumptions that 1) Each sourcetype=user_profiles event contains one user and contains user information 2) Each sourcetype=app_opened event contains information on a single user opening an app and contains app information. Therefore a potential query for your data in place could be:

sourcetype=user_profiles earliest=0 latest=now | fields id name email | dedup id | rename id as user_id
| append [sourcetype=app_opened | fields user_id last_opened_location last_opened_date | dedup user_id ]
| stats first(*) as * by user_id

What is happening here is the first line we're getting all user profiles (from all time) getting the id, name, and email fields, and renaming id to user_id so we can correlate it later. The dedup ensures we get the latest information for a particular user by keeping only the latest event for each user id. (Note: if your desired fields are not extracted as these exact names both here and in the second line, you can use rename to change them as appropriate).

In the second line we add to the results additional result rows containing the information on the last opened app for each user_id. The dedup ensures that we get only the latest app_opened event for each user id.

Now the third line... this is what brings the results for the first two lines together based on user_id. As we ensured no field names conflict (other than the one we want to join based on) we'll now have essentially the first line results joined to the second.

If I could make a suggestion though. user_profiles data feels like it may be better suited to come from a lookup table. Either one you generate via searches against sourcetype=user_profiles on a regular basis, or using something like DBConnect or a script to pull this data from an external authoritative source.

In which case this query could be as easy as:

sourcetype=app_opened | fields user_id last_opened_location last_opened_date | dedup user_id  | lookup user_data id as user_id OUTPUT email name

(or even simpler if you configured the lookup to happen automatically for sourcetype=app_opened)

View solution in original post

acharlieh
Influencer

Looking at your sample data I am making the assumptions that 1) Each sourcetype=user_profiles event contains one user and contains user information 2) Each sourcetype=app_opened event contains information on a single user opening an app and contains app information. Therefore a potential query for your data in place could be:

sourcetype=user_profiles earliest=0 latest=now | fields id name email | dedup id | rename id as user_id
| append [sourcetype=app_opened | fields user_id last_opened_location last_opened_date | dedup user_id ]
| stats first(*) as * by user_id

What is happening here is the first line we're getting all user profiles (from all time) getting the id, name, and email fields, and renaming id to user_id so we can correlate it later. The dedup ensures we get the latest information for a particular user by keeping only the latest event for each user id. (Note: if your desired fields are not extracted as these exact names both here and in the second line, you can use rename to change them as appropriate).

In the second line we add to the results additional result rows containing the information on the last opened app for each user_id. The dedup ensures that we get only the latest app_opened event for each user id.

Now the third line... this is what brings the results for the first two lines together based on user_id. As we ensured no field names conflict (other than the one we want to join based on) we'll now have essentially the first line results joined to the second.

If I could make a suggestion though. user_profiles data feels like it may be better suited to come from a lookup table. Either one you generate via searches against sourcetype=user_profiles on a regular basis, or using something like DBConnect or a script to pull this data from an external authoritative source.

In which case this query could be as easy as:

sourcetype=app_opened | fields user_id last_opened_location last_opened_date | dedup user_id  | lookup user_data id as user_id OUTPUT email name

(or even simpler if you configured the lookup to happen automatically for sourcetype=app_opened)

View solution in original post

arnol229
Explorer

this was fantastic! thank you!! didn't work right off the bat but i made a few corrections to get it working (it wanted the drilldown inside of eventData:

sourcetype=user_profiles earliest=0 latest=now | fields id name email | dedup id | rename id as user_id
 | append [search sourcetype=app_opened | fields eventData.user_id eventData.opened_location eventData.opened_date | dedup eventData.user_id | rename eventData.user_id as user_id ]
 | stats first(*) as * by user_id

woodcock
Esteemed Legend

This should do it:

sourcetype = user_data | stats first(id) AS id | map search="sourcetype =app_open id=$id$ | dedup id | fields <your field list>"
0 Karma

acharlieh
Influencer

Try again? Overlooking the fact that your sourcetypes are different from what was stated in the question. The first two segments of your search, fetches a single user's id (which doesn't help solve the problem of getting data for each user as was asked for in the question). You then use a map to kick off a search to find that id in the app opening sourcetype (which wouldn't actually qualify since the field here is user_id instead of id) to get a single opening result and only the fields that are present with the opening sourcetype (if I had to guess, name and email come a user profile not the app opening. and while they could be present assuming an automatic lookup as I mention below, presence of such would eliminate the need for querying the user_profiles sourcetype). Additionally while map is a useful command, its usefulness quickly degrades as its performance does as it kicks off a separate search for each result present immediately before the map command. (it warns you if you have too many but this can be really slow).

0 Karma

woodcock
Esteemed Legend

Originally the OPs sample search used the same sourcetypes that I also then used (he has since re-edited). This is what he said:

"I want to table out profile data and look up the latest app_opened event for each user and display some fields."

This is exactly what my search does: It says, "in the first sourcetype userdata, grab all the distinct id field values. Then, using this set, get the latest row for each id value in the second sourcetype 'app_opened(originally listed asapp_open). If this is not what OP desires, then OP should be more clear. Thededupcommand takes the most recent event as the singular representative event based on the field specified, in this caseid`. My search is ruthlessly efficient and does exactly what OP said he desires. I did make a mistake in the field names though, so it should be this:

 sourcetype = user_data | stats first(id) AS id | map search="sourcetype =app_opened user_id=$id$ | dedup id | fields &lt;your field list&gt;"
0 Karma

acharlieh
Influencer

Your search doesn't even do what you say. stats first(id) as id gives a single id it does not give all distinct values. Fixing that, map is still terribly inefficient and by default will bomb out if you ever fed it more than (by default) 10 values of id. Kicking off that many individual searches to get what could be pulled back in one search is terribly inefficient

0 Karma

arnol229
Explorer

Not sure I understand what's going on with that query.. Could you explain what each command is doing here?

0 Karma

woodcock
Esteemed Legend

What configurations do you have in place now? Are these events being linebroken correctly? Are the fields being extracted yet?

0 Karma

arnol229
Explorer

everything is being pumped in as strucutred json. My current attempt:
sourcetype = user_data OR app_open
| eval
| transaction id 'eventData.user_id'
| stats values(first_name) as "First Name" values(last_name) as "Last Name" values(username) as "UserName" values(email) as "Email" values(is_active) as "Active" values(home_address) as "Home Address" values(work_address) as "Work Address" values(last_login) as "Last Login" values(date_joined) as "Date Joined" values('eventData.last_opened_location') as "Last Opened Location" values('eventData.last_opened_date') as "last opened date" by id

..of course that doesnt work, and i dont think thats even accounting for retrieving just the latest app_opened event to get the latest location/date in that event

0 Karma