Splunk Search

Joining two large datasets without a Join

splunk_eval
Explorer

I have two data sources: a very large file listing with *nix timestamps, and a text file that holds a one-line description of each file.
source 1: ls -lR -> permissions ... owner ... filesize timestamp filename
example: -rw-r--r-- 1 ownername 123 1234 Jan 01 23:45 file name with spaces.txt
source 2: text file -> filename (category)= one-line description
example: file name with spaces.txt = this is the text description here for the file

I would like to merge these to get:
ownername category timestamp filesize filename description

Source 2 will not have a useful timestamp; its only timestamp is the modified date of the text file itself.

A join or a [subsearch] fails because of the large size of both data sources (> 50,000 entries).

I would like to be able to chart the data. For example, I'd like to chart the occurrence of files over time with ownername=johndoe and description is not empty. ownername comes from source 1, description comes from source 2.

I would also like to stat/summarize the data to get the number of files in 2012 of category "asdf". The category comes from source 2, filename is common to both, and the date comes from source 1.

splunk_eval
Explorer

"| transaction ..." would be perfect, but it's very slow. Each data source has hundreds of thousands, if not millions, of lines. Is there a better/more efficient way to run this command:

(sourcetype="my_source1" OR sourcetype="my_source2") | transaction FILENAME | table FILESIZE,FILENAME,DATETIME,DESCRIPTION

This pulls FILESIZE, DATETIME from one file (using the timestamp in the file, not _time), and DESCRIPTION from the other file, using FILENAME as the shared ID.

Then, running the table command through a "| selfjoin FILENAME" seems to work, and the selfjoin is pretty quick. I'll do some more testing with this and make sure it's merging the way I want.
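
For what it's worth, a commonly suggested alternative to transaction for this kind of merge is to group with stats on the shared field. This is a sketch, not something tested against this data: it reuses the sourcetype and field names from the search above and assumes FILENAME is extracted identically in both sourcetypes.

```
(sourcetype="my_source1" OR sourcetype="my_source2")
| stats values(FILESIZE) AS FILESIZE, values(DATETIME) AS DATETIME, values(DESCRIPTION) AS DESCRIPTION BY FILENAME
| table FILESIZE, FILENAME, DATETIME, DESCRIPTION
```

If a filename appears more than once in either source, values() returns every distinct value as a multivalue field, so duplicates show up in the output rather than being silently dropped.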

jonuwz
Influencer

If source 2 is reasonably static, you could use outputcsv to create a lookup file.

Then do your search on source1 and enrich the data with the description using lookup

You will need to define the lookup in transforms.conf.
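
A rough sketch of that approach, under some assumptions: file_descriptions.csv and the lookup definition name file_descriptions are made-up names, the field names are taken from the search earlier in the thread, and outputlookup is used here in place of outputcsv because it writes the CSV into the app's lookups directory. First, build the lookup file from source 2:

```
sourcetype="my_source2"
| table FILENAME, DESCRIPTION
| outputlookup file_descriptions.csv
```

Then, with a transforms.conf lookup definition named file_descriptions pointing at that file, enrich source 1:

```
sourcetype="my_source1"
| lookup file_descriptions FILENAME OUTPUT DESCRIPTION
| table FILESIZE, FILENAME, DATETIME, DESCRIPTION
```

The first search would need to be rerun, for example as a scheduled search, whenever source 2 changes.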

jonuwz
Influencer

That'll work well if there's a 1-1 mapping between source1 and source2.
If not, the description will only attach to the 1st matching filename.

If you set max=0 in selfjoin, your results will explode ...

splunk_eval
Explorer

The data will be changing and also be run on a number of different files, so this could be tricky.