Splunk Search

Join two indexes on time where time is different by 2 sec

ramuzzini
Path Finder
Hello, I am trying to join two indexes to display data from our local printers.  I have an index getting data from our printer server that contains the following data:
   index=prntserver
_time                  prnt_name    username    location
2024-11-04 11:05:32    Printer1     jon.doe     Office
2024-11-04 12:20:56    Printer2     tim.allen   FrontDesk
 
I have an index getting data from our DLP software that contains the following data:
   index=printlogs
_time                  usersname    directory              file
2024-11-04 11:05:33    jon.doe      c:/desktop/prints/     document1.doc
2024-11-04 12:20:58    tim.allen    c:/documents/files/    document2.xlsx
 
I am trying to join the two indexes to give me the time, printer name, user name, and location from the print server index, plus the directory and file name recorded in my print log index.  I want to use time to join the two indexes, but my issue is that the timestamp is off by 1, if not 2, seconds between the two index records.  I was trying to use the transaction command with maxspan=3s to be safe but cannot get it to work.  Here is what I have been trying to work with:
 
index=printserver
| convert timeformat="%Y-%m-%d %H:%M:%S" ctime(_time) AS servtime
   | join type=inner _time
      [ search index=printlogs
        | convert timeformat="%Y-%m-%d %H:%M:%S" ctime(_time) AS logtime
      ]
| transaction startswith=eval(src="<servtime>") endswith=eval(src="<logtime>") maxspan=3s
| table servtime prnt_name username location directory file
 
Thanks for any assistance given on this one. 

yuanliu
SplunkTrust

As @PickleRick notes, if time is critical for correlation, bin is risky.  This is one of the use cases where transaction is appropriate.  But you cannot use maxspan alone, and using index alone with _time is also unsafe.  From the context implied in the mock data, you want to transaction by user.  It is very important that you describe this critical logic clearly and explicitly, in plain language rather than only in SPL.

The only obstacle is the differing field names "username" and "Users_Name"; this is easily overcome with coalesce. (It is never a good idea to illustrate mock data inaccurately.  If the field name is Users_Name, you should consistently illustrate it as Users_Name and not usersname.)  One element that distracted people is the servtime and logtime conversion in the initially illustrated SPL.  These fields add no value to the use case.

This is the code that should get you started:

 

index=printserver OR index=printlogs
| eval username = coalesce(username, Users_Name)
| fields - Users_Name
| transaction username maxspan=3s
| table _time prnt_name username location directory file

 

Using your corrected mock data, the above gives

_time                  prnt_name   username    location    directory             file
2024-11-04 12:20:56    Printer2    tim.allen   FrontDesk   c:/documents/files/   document2.xlsx
2024-11-04 11:05:32    Printer1    jon.doe     Office      c:/desktop/prints/    document1.doc

Here is an emulation of your mock data.  Play with it and compare with real data:

 

| makeresults format=csv data="_time,                                   prnt_name,     username,   location, _raw
2024-11-04 11:05:32,    Printer1,           jon.doe,         Office, server event 1
2024-11-04 12:20:56,    Printer2,           tim.allen,       FrontDesk, server event 2"
| eval index = "prntserver"
| append
    [makeresults format=csv data="_time,                                    Users_Name,     directory,                          file, _raw
2024-11-04 11:05:33,    jon.doe,             c:/desktop/prints/,     document1.doc, log event 1
2024-11-04 12:20:58,    tim.allen,  c:/documents/files/,   document2.xlsx, log event 2"
    | eval index = "printlogs"]
| eval _time = strptime(_time, "%F %T")
| sort - _time
``` the above emulates
index=printserver OR index=printlogs
```

 

Hope this helps


MuS
Legend

Hi there,

Like in most cases, a simple 'stats' will do the trick.

Try something like this:

index=printserver OR index=printlogs
| bin _time span=3s
| stats values(*) AS * by _time username
| table _time prnt_name username location directory file

 

Hope this helps ...

cheers, MuS

ramuzzini
Path Finder

 

MuS,

Thanks for the response.  I am going to take this and work with what I have.  As I put this into my search, I found out that my test data is different from what my _raw data actually is.  The username field from the printserver index is "username", but the username field from my printlogs is "Users_Name" and has a domain name in front of it.

 index=prntserver
_time                  prnt_name    username    location
2024-11-04 11:05:32    Printer1     jon.doe     Office
2024-11-04 12:20:56    Printer2     tim.allen   FrontDesk
 
I have an index getting data from our DLP software that contains the following data:
   index=printlogs
_time                  Users_Name     directory              file
2024-11-04 11:05:33    cpn/jon.doe    c:/desktop/prints/     document1.doc
2024-11-04 12:20:58    tim.allen      c:/documents/files/    document2.xlsx

I am going to rex the Users_Name field from my print logs to match it with my printserver logs.  This is what I am going to work with to see if I get the results I need.  Thank you for your insight.

index=printserver OR index=printlogs
| rex field=Users_Name "(?<domain>[^\\\\]+)\\\\(?<username>\S+)"
| bin _time span=3s
| stats values(*) AS * by _time username
| table _time prnt_name username location directory file
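A quick way to sanity-check a rex like that before running it against real data is an emulation in the same makeresults style used earlier in the thread.  This sketch assumes the domain separator is a backslash (e.g. cpn\jon.doe); the forward slash in the pasted mock data may just be how the forum rendered it, so swap the separator in the pattern if your raw data really uses /:

```
| makeresults
| eval Users_Name = "cpn\\jon.doe"
``` assumed sample value; replace with a real Users_Name from printlogs ```
| rex field=Users_Name "(?<domain>[^\\\\]+)\\\\(?<username>\S+)"
| table Users_Name domain username
```

If domain and username come back populated as expected, the same rex should drop into the bin/stats search above.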


 


PickleRick
SplunkTrust

That was my initial reaction, but as usual binning has issues when your events cross the bin boundary. For example, one event at :11 and another at :14 with span=3s will be binned into separate buckets and won't match.
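A minimal emulation of that failure mode, in the same makeresults style as above (sample timestamps and usernames are made up for illustration) — the two events land in separate 3-second buckets, so stats produces two rows instead of one correlated row:

```
| makeresults format=csv data="_time, username, _raw
2024-11-04 11:05:11, jon.doe, print server event
2024-11-04 11:05:14, jon.doe, print log event"
| eval _time = strptime(_time, "%F %T")
| bin _time span=3s
| stats values(_raw) AS events by _time username
```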

MuS
Legend

HaHA, I was answering based on the provided information, not trying to work around every single possible outlier 😄

PickleRick
SplunkTrust

Actually it's not that much of an outlier. Assuming a 2-second gap between the first and second event and 3-second bins, the pair is split whenever the first event falls in the last 2 seconds of its bin - so you have a 2/3 chance of splitting them into two separate bins.

PickleRick
SplunkTrust

Join on _time doesn't make sense if the time is off in one of the data sets. You noticed it yourself.

This transaction doesn't make sense either since you don't have a field called src which could contain one of those strings.

If this is indeed all the data you have it's a very tricky problem in general. Because what if the same user requests two print jobs within a second? How can you tell which one went to which printer if the only common field is the username?

If you can make some assumptions about the data, the problem can be solved one way or another. You can use transaction on the username field with maxspan=3s or something like that (maybe add startswith/endswith conditions so a transaction always opens on a printserver event and closes on a printlogs event). But transaction is a relatively resource-intensive command and is best avoided if possible. So if you can make some other assumptions, maybe the solution could be better.
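A sketch of that transaction variant, untested against real data - transaction's startswith/endswith accept eval() expressions, so the conditions can match on the index field rather than on _raw (field names as in the mock data, with coalesce as suggested earlier in the thread):

```
index=printserver OR index=printlogs
| eval username = coalesce(username, Users_Name)
| transaction username maxspan=3s startswith=eval(index=="printserver") endswith=eval(index=="printlogs")
| table _time prnt_name username location directory file
```

The startswith/endswith ordering guards against a transaction opening on a printlogs event, but it does not solve the two-jobs-within-a-second ambiguity described above.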

ramuzzini
Path Finder

Hey,

Thanks again for giving me your insight on this one.  I did come across the bin command but thought transaction might be better in this situation.  As I am still learning the power and uses of many Splunk commands, this does help me get a better understanding of how and when to use the transaction command.

As you pointed out, and as is my true problem in this case, there are only two common (or semi-common) fields between my two indexes: "_time" and "username".

I have compared the raw logs from both indexes and it appears that, at most, the print jobs are separated by 2 seconds, and I haven't seen any print jobs by the same user closer than 10 seconds apart.  But to your point, I will make note that there could be some issue with my output if a user prints two jobs within seconds of each other.

As always, appreciate your input and clarification on my questions. 
