<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Comparing two very large files in Python in Splunk Dev</title>
    <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490243#M8793</link>
    <description>&lt;P&gt;I am not sure if your desired behavior is correct.  Here is what I got &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;the result is &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;instead of these two you were expecting:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;twotwo.nl
five.nl
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The algorithm in your code takes on the order of len(file1)*len(file2) steps, because every "not in" test scans the whole list, so it is not surprising that it took forever. You only need to load the files into two sets and do a symmetric_difference, which is roughly linear in the size of the sets on average; its &lt;A href="https://wiki.python.org/moin/TimeComplexity"&gt;worst case time complexity of len(file1)*len(file2)&lt;/A&gt; only arises when many hashes collide.  &lt;/P&gt;

&lt;P&gt;From your description, each file would only be around 40MB-100MB to load, which should not be a problem. &lt;/P&gt;

&lt;P&gt;Good luck! &lt;/P&gt;</description>
    <pubDate>Sun, 03 May 2020 01:30:31 GMT</pubDate>
    <dc:creator>tauliang</dc:creator>
    <dc:date>2020-05-03T01:30:31Z</dc:date>
    <item>
      <title>Comparing two very large files in Python</title>
      <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490242#M8792</link>
      <description>&lt;P&gt;I want to compare two big files (200.000.00 and 150.000.000 lines). These are lists of domain names. I want to produce a list of the differences. The first file is an export from Splunk.&lt;BR /&gt;Example:&lt;BR /&gt;tmpzonefile:&lt;BR /&gt;twotwotwo.nl&lt;BR /&gt;two.nl&lt;BR /&gt;three.nl&lt;BR /&gt;four.nl&lt;BR /&gt;five.nl&lt;/P&gt;
&lt;P&gt;tmpingestedzonefile:&lt;BR /&gt;twotwo.nl&lt;BR /&gt;three.nl&lt;BR /&gt;four.nl&lt;/P&gt;
&lt;P&gt;Diff file must be:&lt;BR /&gt;twotwo.nl&lt;BR /&gt;five.nl&lt;/P&gt;
&lt;P&gt;The following script yields too many domains. Any idea what goes wrong here? It also takes forever to process large files.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 07 Jun 2020 18:13:40 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490242#M8792</guid>
      <dc:creator>lovik00</dc:creator>
      <dc:date>2020-06-07T18:13:40Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two very large files in Python</title>
      <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490243#M8793</link>
      <description>&lt;P&gt;I am not sure if your desired behavior is correct.  Here is what I got &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;the result is &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;instead of these two you were expecting:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;twotwo.nl
five.nl
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The algorithm in your code takes on the order of len(file1)*len(file2) steps, because every "not in" test scans the whole list, so it is not surprising that it took forever. You only need to load the files into two sets and do a symmetric_difference, which is roughly linear in the size of the sets on average; its &lt;A href="https://wiki.python.org/moin/TimeComplexity"&gt;worst case time complexity of len(file1)*len(file2)&lt;/A&gt; only arises when many hashes collide.  &lt;/P&gt;

&lt;P&gt;From your description, each file would only be around 40MB-100MB to load, which should not be a problem. &lt;/P&gt;
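
&lt;P&gt;A minimal sketch of the set-based approach, assuming one domain per line in each file. The names load_domains and write_missing_domains and the path arguments are hypothetical, not taken from your script. It uses a plain difference (in the first file but not in the second), which is what your loop computes; swap in symmetric_difference if you want the domains that appear in exactly one file.&lt;/P&gt;

```python
def load_domains(path):
    """Read one domain per line into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def write_missing_domains(zone_path, ingested_path, diff_path):
    """Write domains present in zone_path but absent from ingested_path.

    Returns the number of domains written.
    """
    zone = load_domains(zone_path)
    ingested = load_domains(ingested_path)
    # Each membership test is a hash lookup, so on average the work grows
    # with the file sizes rather than with their product.
    missing = sorted(zone.difference(ingested))
    with open(diff_path, "wt") as out:
        for domain in missing:
            out.write(domain + "\n")
    return len(missing)
```

&lt;P&gt;Sorting the result is optional; it only makes the output file stable between runs, since sets have no defined order.&lt;/P&gt;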

&lt;P&gt;Good luck! &lt;/P&gt;</description>
      <pubDate>Sun, 03 May 2020 01:30:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490243#M8793</guid>
      <dc:creator>tauliang</dc:creator>
      <dc:date>2020-05-03T01:30:31Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two very large files in Python</title>
      <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490244#M8794</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.&lt;/P&gt;</description>
      <pubDate>Sun, 03 May 2020 06:52:06 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490244#M8794</guid>
      <dc:creator>lovik00</dc:creator>
      <dc:date>2020-05-03T06:52:06Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two very large files in Python</title>
      <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490245#M8795</link>
      <description>&lt;P&gt;Hello @lovik00 &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;tmpzonefile:          tmpingestedzonefile:
                      twotwo.nl
twotwotwo.nl
two.nl
three.nl              three.nl
four.nl               four.nl
five.nl
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile: &lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;&lt;P&gt;common domains: four.nl, three.nl&lt;/P&gt;

&lt;P&gt;comm -1 -2 &amp;lt;(sort tmpzonefile) &amp;lt;(sort tmpingestedzonefile) | sort&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;in tmpzonefile but not in tmpingestedzonefile:  five.nl, two.nl,  twotwotwo.nl&lt;/P&gt;

&lt;P&gt;comm -2 -3 &amp;lt;(sort tmpzonefile) &amp;lt;(sort tmpingestedzonefile) | sort&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;in tmpingestedzonefile but not in tmpzonefile : twotwo.nl&lt;/P&gt;

&lt;P&gt;comm -1 -3 &amp;lt;(sort tmpzonefile) &amp;lt;(sort tmpingestedzonefile) | sort&lt;BR /&gt;
You wrote: &lt;EM&gt;Diff file must be: twotwo.nl, five.nl&lt;/EM&gt; - I'm not sure how to get this result&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;
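
&lt;P&gt;For a Python version, the three comm invocations above map onto set operations. This is only a sketch under the assumption that both lists fit in memory; the function name compare_domains is hypothetical.&lt;/P&gt;

```python
def compare_domains(zone_domains, ingested_domains):
    """Return (common, only_in_zone, only_in_ingested) as sorted lists."""
    zone = set(zone_domains)
    ingested = set(ingested_domains)
    common = sorted(zone.intersection(ingested))          # like comm -1 -2
    only_in_zone = sorted(zone.difference(ingested))      # like comm -2 -3
    only_in_ingested = sorted(ingested.difference(zone))  # like comm -1 -3
    return common, only_in_zone, only_in_ingested


tmpzonefile = ["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"]
tmpingestedzonefile = ["twotwo.nl", "three.nl", "four.nl"]
common, only_in_zone, only_in_ingested = compare_domains(
    tmpzonefile, tmpingestedzonefile
)
print(common)            # ['four.nl', 'three.nl']
print(only_in_zone)      # ['five.nl', 'two.nl', 'twotwotwo.nl']
print(only_in_ingested)  # ['twotwo.nl']
```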

&lt;P&gt;&lt;STRONG&gt;Edit&lt;/STRONG&gt;: the real challenge is to do it with Splunk only&lt;/P&gt;</description>
      <pubDate>Sun, 03 May 2020 10:38:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490245#M8795</guid>
      <dc:creator>PavelP</dc:creator>
      <dc:date>2020-05-03T10:38:37Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two very large files in Python</title>
      <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490246#M8796</link>
      <description>&lt;P&gt;Oops, made a typo.&lt;/P&gt;

&lt;P&gt;Diff file should be:&lt;BR /&gt;
twotwo.nl&lt;BR /&gt;
four.nl&lt;/P&gt;

&lt;P&gt;I have a bash script which does the job. I'm using comm (great diff tool), but I want a Python script to do the job.&lt;/P&gt;</description>
      <pubDate>Sun, 03 May 2020 14:07:34 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490246#M8796</guid>
      <dc:creator>lovik00</dc:creator>
      <dc:date>2020-05-03T14:07:34Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two very large files in Python</title>
      <link>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490247#M8797</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I am trying to understand, did you mean you would like to know &lt;/P&gt;

&lt;BLOCKQUOTE&gt;
&lt;P&gt;which domains are IN &lt;STRONG&gt;tmpzonefile&lt;/STRONG&gt; but NOT in tmpingestedzonefile&lt;BR /&gt;
? &lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;If this is the case, the comparison logic would be &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;    temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
    temp2 = ['twotwo.nl', 'three.nl','four.nl']
    list(set(temp1) - set(temp2))
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;with the following results&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;    ['two.nl', 'twotwotwo.nl', 'five.nl']
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 03 May 2020 15:46:27 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Dev/Comparing-two-very-large-files-in-Python/m-p/490247#M8797</guid>
      <dc:creator>tauliang</dc:creator>
      <dc:date>2020-05-03T15:46:27Z</dc:date>
    </item>
  </channel>
</rss>

