Splunk Search

Join two indexes on time where time is different by 2 sec

ramuzzini
Path Finder
Hello, I am trying to join two indexes to display data from our local printers.  I have an index getting data from our printer server that contains the following data:
   index=prntserver
_time,                                   prnt_name     username   location
2024-11-04 11:05:32    Printer1           jon.doe         Office
2024-11-04 12:20:56    Printer2           tim.allen       FrontDesk
 
I have an index getting data from our DLP software that contains the following data:
   index=printlogs
_time                                    usersname     directory                          file
2024-11-04 11:05:33    jon.doe             c:/desktop/prints/     document1.doc
2024-11-04 12:20:58    tim.allen  c:/documents/files/   document2.xlsx
 
I am trying to join the two indexes to give me time, printer name, user name and location from the Print Server Index and then give me directory and file name that was recorded from my Print Log Index.  I am wanting to use time to join the two indexes but my issues is that the timestamp is off by 1 if not 2 seconds between the two index records.  I was trying to use the transaction command with a maxspan=3s to be safe but cannot get it to work.  Here is what I have been trying to work with
 
index=printserver
| convert timeformat="%Y-%m-%d %H:%M:%S" ctime(_time) AS servtime
   | join type=inner _time
      [ search index=printlogs
        | convert timeformat="%Y-%m-%d %H:%M:%S" ctime(_time) AS logtime
      ]
| transaction startswith=eval(src="<servtime>") endswith=eval(src="<logtime>") maxspan=3s
| table servtime prnt_name username location directory file
 
Thanks for any assistance given on this one. 
Labels (3)
0 Karma

yuanliu
SplunkTrust
SplunkTrust

As @PickleRick notes, if time is critical for correlation, bin is risky.  This is one of the use cases where transaction is appropriate.  But you cannot use span alone.  Using index alone with _time is also unsafe.  From the context implied in mock data, you want transaction by user.  It is very important that you describe these critical logic clearly, explicitly, and without help from SPL.

The only obstacle is the field names "username" and "Users_Name"; this is so easily overcome with coalesce. (It is never a good idea to illustrate mock data with inaccuracy.  If the field name is Users_Name, you should consistently illustrate it as Users_Name and not usersname.)  One element that distracted people is the servtime and logtime conversion in initial illustrated SPL.  These fields adds no value to the use case.

This is the code that should get you started

 

index=printserver OR index=printlogs
| eval username = coalesce(username, Users_Name)
| fields - usersname
| transaction username maxspan=3s
| table _time prnt_name username location directory file

 

Using your corrected mock data, the above gives

_timeprnt_nameusernamelocationdirectoryfile
2024-11-04 12:20:56Printer2tim.allenFrontDeskc:/documents/files/document2.xlsx
2024-11-04 11:05:32Printer1jon.doeOfficec:/desktop/prints/document1.doc

Here is an emulation of your mock data.  Play with it and compare with real data

 

| makeresults format=csv data="_time,                                   prnt_name,     username,   location, _raw
2024-11-04 11:05:32,    Printer1,           jon.doe,         Office, server event 1
2024-11-04 12:20:56,    Printer2,           tim.allen,       FrontDesk, server event 2"
| eval index = "prntserver"
| append
    [makeresults format=csv data="_time,                                    Users_Name,     directory,                          file, _raw
2024-11-04 11:05:33,    jon.doe,             c:/desktop/prints/,     document1.doc, log event 1
2024-11-04 12:20:58,    tim.allen,  c:/documents/files/,   document2.xlsx, log event 2"
    | eval index = "printlogs"]
| eval _time = strptime(_time, "%F %T")
| sort - _time
``` the above emulates
index=printserver OR index=printlogs
```

 

Hope this helps

Tags (1)

MuS
Legend

Hi there,

like in most cases a simple 'stats' will do the trick.

Try something like this 

index=printserver OR index=printlogs
| bin _time span=3s
| stats values(*) AS * by _time username
| table _time prnt_name username location directory file

 

Hope this helps ...

cheers, MuS

ramuzzini
Path Finder

 

Spoiler
MuS,

Thanks for the response.  I am going to take this and work with what I have.  As I put this in my search, found out that my test data is different then what my _raw data actually is.  The username field from  printserver index is "username" but my username field from my printlogs is "User_Name" but has a domain name is front of it. 

 index=prntserver
_time,                                   prnt_name     username   location
2024-11-04 11:05:32    Printer1           jon.doe         Office
2024-11-04 12:20:56    Printer2           tim.allen       FrontDesk
 
I have an index getting data from our DLP software that contains the following data:
   index=printlogs
_time                                    Users_Name     directory                          file
2024-11-04 11:05:33    cpn/jon.doe             c:/desktop/prints/     document1.doc
2024-11-04 12:20:58    tim.allen  c:/documents/files/   document2.xlsx

I am going ot Rex the User_Name field from my print logs to match it with my printserver logs.  This is what I am going to work with and see if I get the results I need.  Thank you for your insight. 

index=printserver OR index=printlogs
| rex field="User_Name" "(?<domain>\S)+\\\\(?<username>\S+)"
| bin _time span=3s | stats values(*) AS * by _time username
| table _time prnt_name username location directory file


 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

That was my initial reaction but as usual - binning has issues when your events cross the bin boundary. For example - one event at :11, another at :14. They will be binned into separate buckets and won't match.

MuS
Legend

HaHA, I was answering based on the provided information not trying to work around every single possible outlier 😄 

PickleRick
SplunkTrust
SplunkTrust

Actually it's not that much of an outlier. Assuming 2s split between first and second event you have 2/3 chance of splitting them into two separate bins.

PickleRick
SplunkTrust
SplunkTrust

Join on _time doesn't make sense if the time is off in one of data sets. You noticed it yourself.

This transaction doesn't make sense either since you don't have a field called src which could contain one of those strings.

If this is indeed all the data you have it's a very tricky problem in general. Because what if the same user requests two print jobs within a second? How can you tell which one went to which printer if the only common field is the username?

If you can make some assumptions about the data the problem can be solved one way or another. You can use transaction on username field and indeed maxspan=3s or something like that (maybe add a startswith="index=printserver" endswith="index=printlogs". But transaction is a relatively resource-intensive command and is best avoided if possible. So if you can make some other assumptions maybe the solution could be better.

ramuzzini
Path Finder

Hey,

Thanks again for giving me your insight on this one.  I did come across the bin command but thought the transaction might be better to try in this situation.  As I am still learning the power and uses of many of the commands that can be used in Splunk, this does help me get a better understanding of how to use and when to use the transaction command.  

As you pointed out and is my true problem in this case, there are only two common/semi common variables I have between my two indexes,  that being "_time" and "username".  

I have compared the raw logs from both indexes and it appears that at most, the print jobs are separated by 2 secs and I haven't seen any print jobs by the same user that have been closer than 10 seconds apart.  But as to your point, I will make note that there could be some issue with my output if a user prints two jobs seconds apart from each other. 

As always, appreciate your input and clarification on my questions. 

0 Karma
Get Updates on the Splunk Community!

Take Your Breath Away with Splunk Risk-Based Alerting (RBA)

WATCH NOW!The Splunk Guide to Risk-Based Alerting is here to empower your SOC like never before. Join Haylee ...

SignalFlow: What? Why? How?

What is SignalFlow? Splunk Observability Cloud’s analytics engine, SignalFlow, opens up a world of in-depth ...

Federated Search for Amazon S3 | Key Use Cases to Streamline Compliance Workflows

Modern business operations are supported by data compliance. As regulations evolve, organizations must ...