<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Joining two large datasets without a Join in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44130#M10433</link>
    <description>&lt;P&gt;If source 2 is reasonably static, you could use &lt;A href="http://docs.splunk.com/Documentation/Splunk/5.0.1/SearchReference/Outputcsv"&gt;outputcsv&lt;/A&gt; to create a lookup file.&lt;/P&gt;

&lt;P&gt;Then do your search on source1 and enrich the data with the description using &lt;A href="http://docs.splunk.com/Documentation/Splunk/5.0.1/SearchReference/Lookup"&gt;lookup&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;You will need to define the lookup in transforms.conf - details &lt;A href="http://docs.splunk.com/Documentation/Splunk/5.0.1/Knowledge/Addfieldsfromexternaldatasources"&gt;here&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 24 Nov 2012 00:10:55 GMT</pubDate>
    <dc:creator>jonuwz</dc:creator>
    <dc:date>2012-11-24T00:10:55Z</dc:date>
    <item>
      <title>Joining two large datasets without a Join</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44129#M10432</link>
      <description>&lt;P&gt;I have two data sources, one that is a very large file listing with *nix timestamps, and one that has a text description of the file.&lt;BR /&gt;
source 1: ls -lR -&amp;gt; permissions ... owner ... filesize timestamp filename&lt;BR /&gt;
   example: -rw-r--r--  1 ownername  123  1234 Jan 01 23:45 file name with spaces.txt&lt;BR /&gt;
source 2: text file -&amp;gt; filename (category)= one-line description&lt;BR /&gt;
   example: file name with spaces.txt = this is the text description here for the file&lt;/P&gt;

&lt;P&gt;I would like to merge these to get:&lt;BR /&gt;
ownername category timestamp filesize filename description&lt;/P&gt;

&lt;P&gt;source 2 will not have a useful timestamp, it is just the modified date of the text file.&lt;/P&gt;

&lt;P&gt;A join or a [subsearch] fails because of the large size of both data sources (&amp;gt; 50,000 entries).&lt;/P&gt;

&lt;P&gt;I would like to be able to chart the data. For example, I'd like to chart the occurrence of files over time with ownername=johndoe and description is not empty. ownername comes from source 1, description comes from source 2.&lt;/P&gt;

&lt;P&gt;I would also like to stat/summarize the data to get the number of files in 2012 of category "asdf". The category comes from source 2, filename is common to both, and the date comes from source 1.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Nov 2012 21:50:54 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44129#M10432</guid>
      <dc:creator>splunk_eval</dc:creator>
      <dc:date>2012-11-23T21:50:54Z</dc:date>
    </item>
    <item>
      <title>Re: Joining two large datasets without a Join</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44130#M10433</link>
      <description>&lt;P&gt;If source 2 is reasonably static, you could use &lt;A href="http://docs.splunk.com/Documentation/Splunk/5.0.1/SearchReference/Outputcsv"&gt;outputcsv&lt;/A&gt; to create a lookup file.&lt;/P&gt;

&lt;P&gt;Then do your search on source1 and enrich the data with the description using &lt;A href="http://docs.splunk.com/Documentation/Splunk/5.0.1/SearchReference/Lookup"&gt;lookup&lt;/A&gt;&lt;/P&gt;
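
&lt;P&gt;As a rough sketch of that approach, not a tested recipe: the sourcetype names my_source1 and my_source2 and the fields FILENAME, FILESIZE, DATETIME and DESCRIPTION are borrowed from elsewhere in this thread, while the file name file_descriptions.csv and the lookup name file_descriptions are made up for illustration. It also swaps in outputlookup for outputcsv, on the assumption that you want the CSV written straight into the app's lookups directory. First build the lookup file from source 2:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype="my_source2"
| dedup FILENAME
| table FILENAME, DESCRIPTION
| outputlookup file_descriptions.csv&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;A lookup definition along these lines would go in transforms.conf:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;# hypothetical stanza name; filename must match the CSV written above
[file_descriptions]
filename = file_descriptions.csv&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Then enrich source 1 at search time:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype="my_source1"
| lookup file_descriptions FILENAME OUTPUT DESCRIPTION
| table FILESIZE, FILENAME, DATETIME, DESCRIPTION&lt;/CODE&gt;&lt;/PRE&gt;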

&lt;P&gt;You will need to define the lookup in transforms.conf - details &lt;A href="http://docs.splunk.com/Documentation/Splunk/5.0.1/Knowledge/Addfieldsfromexternaldatasources"&gt;here&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Nov 2012 00:10:55 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44130#M10433</guid>
      <dc:creator>jonuwz</dc:creator>
      <dc:date>2012-11-24T00:10:55Z</dc:date>
    </item>
    <item>
      <title>Re: Joining two large datasets without a Join</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44131#M10434</link>
      <description>&lt;P&gt;The data will keep changing, and the search will also be run against a number of different files, so a static lookup could be tricky.&lt;/P&gt;

&lt;P&gt;"| transaction ..." would be perfect but it's very slow. Each data source has hundreds of thousands, if not millions, of lines. Is there a better/more efficient way to run this command:&lt;/P&gt;

&lt;P&gt;(sourcetype="my_source1" OR sourcetype="my_source2") | transaction FILENAME | table FILESIZE,FILENAME,DATETIME,DESCRIPTION&lt;/P&gt;

&lt;P&gt;This pulls FILESIZE, DATETIME from one file (using the timestamp in the file, not _time), and DESCRIPTION from the other file, using FILENAME as the shared ID.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 12:51:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44131#M10434</guid>
      <dc:creator>splunk_eval</dc:creator>
      <dc:date>2020-09-28T12:51:36Z</dc:date>
    </item>
    <item>
      <title>Re: Joining two large datasets without a Join</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44132#M10435</link>
      <description>&lt;P&gt;"| transaction ..." would be perfect but it's very slow. Each data source has hundreds of thousands, if not millions, of lines. Is there a better/more efficient way to run this command:&lt;/P&gt;

&lt;P&gt;(sourcetype="my_source1" OR sourcetype="my_source2") | transaction FILENAME | table FILESIZE,FILENAME,DATETIME,DESCRIPTION&lt;/P&gt;

&lt;P&gt;This pulls FILESIZE, DATETIME from one file (using the timestamp in the file, not _time), and DESCRIPTION from the other file, using FILENAME as the shared ID.&lt;/P&gt;

&lt;P&gt;Then, running the table command through a "| selfjoin FILENAME" seems to work, and the selfjoin is pretty quick. I'll do some more testing with this and make sure it's merging the way I want.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 12:51:38 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44132#M10435</guid>
      <dc:creator>splunk_eval</dc:creator>
      <dc:date>2020-09-28T12:51:38Z</dc:date>
    </item>
    <item>
      <title>Re: Joining two large datasets without a Join</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44133#M10436</link>
      <description>&lt;P&gt;That'll work well if there's a 1-1 mapping between source1 and source2.&lt;BR /&gt;
If not, the description will only attach to the 1st matching filename.&lt;/P&gt;
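
&lt;P&gt;If the mapping is not 1-1, one possible alternative is eventstats, which copies the description onto every event sharing a filename. This is only a sketch: it reuses the sourcetype and field names from your search, FILE_DESCRIPTION is a made-up output field name, and eventstats has its own memory limits, so it would need testing at your data volumes.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(sourcetype="my_source1" OR sourcetype="my_source2")
| eventstats values(DESCRIPTION) AS FILE_DESCRIPTION by FILENAME
| search sourcetype="my_source1"
| table FILESIZE, FILENAME, DATETIME, FILE_DESCRIPTION&lt;/CODE&gt;&lt;/PRE&gt;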

&lt;P&gt;If you set max=0 in selfjoin to lift the limit on how many other events each result can join with, the number of results will explode.&lt;/P&gt;</description>
      <pubDate>Sat, 24 Nov 2012 01:36:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Joining-two-large-datasets-without-a-Join/m-p/44133#M10436</guid>
      <dc:creator>jonuwz</dc:creator>
      <dc:date>2012-11-24T01:36:20Z</dc:date>
    </item>
  </channel>
</rss>

