Joining two large datasets without a Join

splunk_eval — Fri, 23 Nov 2012 21:50:54 GMT

I have two data sources: one is a very large file listing with *nix timestamps, and the other holds a text description of each file.
source 1: ls -lR -> permissions ... owner ... filesize timestamp filename
example: -rw-r--r-- 1 ownername 123 1234 Jan 01 23:45 file name with spaces.txt
source 2: text file -> filename (category)= one-line description
example: file name with spaces.txt = this is the text description here for the file

I would like to merge these to get:
ownername category timestamp filesize filename description

Source 2 will not have a useful timestamp; the only one available is the modified date of the text file itself.

A join or a [subsearch] fails because of the large size of both data sources (> 50,000 entries).

I would like to be able to chart the data. For example, I'd like to chart the occurrence of files over time with ownername=johndoe and description is not empty. ownername comes from source 1, description comes from source 2.

I would also like to stat/summarize the data to get the number of files in 2012 of category "asdf". The category comes from source 2, filename is common to both, and the date comes from source 1.

Re: Joining two large datasets without a Join

jonuwz — Sat, 24 Nov 2012 00:10:55 GMT

If source 2 is reasonably static, you could use outputcsv to create a lookup file.

Then run your search on source 1 and enrich the data with the description using lookup.

You will need to define the lookup in transforms.conf.
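
A minimal sketch of the lookup approach, under some assumptions: the field names FILENAME, CATEGORY, DESCRIPTION, OWNERNAME, FILESIZE and DATETIME are taken to be already extracted, the sourcetype names are placeholders, and file_descriptions.csv is a hypothetical file name. It uses outputlookup rather than outputcsv, since outputlookup writes straight into the lookups directory. The first search builds the lookup file from source 2:

```
sourcetype="my_source2"
| table FILENAME CATEGORY DESCRIPTION
| outputlookup file_descriptions.csv
```

The second search enriches source 1 from that file:

```
sourcetype="my_source1"
| lookup file_descriptions.csv FILENAME OUTPUT CATEGORY DESCRIPTION
| table OWNERNAME CATEGORY DATETIME FILESIZE FILENAME DESCRIPTION
```

Here the lookup command is pointed at the CSV file by name. If you create a lookup definition in transforms.conf instead, use the stanza name in place of file_descriptions.csv.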

Re: Joining two large datasets without a Join

splunk_eval — Mon, 28 Sep 2020 12:51:36 GMT

The data will keep changing, and the search will also be run against a number of different files, so this could be tricky.

"| transaction ..." would be perfect, but it's very slow. Each data source has hundreds of thousands, if not millions, of lines. Is there a better or more efficient way to run this command:

(sourcetype="my_source1" OR sourcetype="my_source2") | transaction FILENAME | table FILESIZE,FILENAME,DATETIME,DESCRIPTION

This pulls FILESIZE and DATETIME from one file (using the timestamp in the file, not _time) and DESCRIPTION from the other file, using FILENAME as the shared ID.
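
One commonly suggested alternative to transaction for this kind of merge is a stats grouped by the shared field. This is a sketch, not something proposed in this thread, and it assumes the same sourcetype and field names as the search above:

```
(sourcetype="my_source1" OR sourcetype="my_source2")
| stats values(FILESIZE) AS FILESIZE values(DATETIME) AS DATETIME values(DESCRIPTION) AS DESCRIPTION BY FILENAME
| table FILESIZE FILENAME DATETIME DESCRIPTION
```

This produces one row per FILENAME. If a filename appears more than once in either source, values() returns every distinct value as a multivalue field, so those rows would need a closer look.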

Re: Joining two large datasets without a Join

splunk_eval — Mon, 28 Sep 2020 12:51:38 GMT

Piping the output of the table command through "| selfjoin FILENAME" seems to work, and the selfjoin is pretty quick. I'll do some more testing with this to make sure it's merging the way I want.
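
For reference, one reading of that pipeline is sketched below. It assumes the transaction step is dropped entirely and selfjoin is applied directly to the tabled events; the sourcetype and field names are the same placeholders as before:

```
(sourcetype="my_source1" OR sourcetype="my_source2")
| table FILESIZE FILENAME DATETIME DESCRIPTION
| selfjoin FILENAME
```

With its default settings, selfjoin merges each result with at most one other result sharing the same FILENAME, which is why it suits a one-to-one mapping.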

Re: Joining two large datasets without a Join

jonuwz — Sat, 24 Nov 2012 01:36:20 GMT

That'll work well if there's a 1-1 mapping between source1 and source2.
If not, the description will only attach to the 1st matching filename.

If you set max=0 in selfjoin, your results will explode ...
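
A quick way to check whether the mapping really is 1-1 before relying on selfjoin is to count events per filename in each source. This is a sketch under the same assumed sourcetype and field names:

```
(sourcetype="my_source1" OR sourcetype="my_source2")
| stats count BY FILENAME sourcetype
| where count > 1
```

Any rows returned are filenames that occur more than once within a single source. Those are the cases where the description would attach to only the first match.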