Re: Comparing two very large files in Python

lovik00 · ‎05-02-2020

I want to compare two big files (200.000.00 and 150.000.000 lines). Both are lists of domain names, and I want to produce a list of the differences. The first file is an export from Splunk.
Example:
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl

tmpingestedzonefile:
twotwo.nl
three.nl
four.nl

Diff file must be:
twotwo.nl
five.nl

The following script yields too many results. Any idea what goes wrong here? It also takes forever to process large files.

if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()

PavelP · ‎05-03-2020

Hello @lovik00

tmpzonefile:          tmpingestedzonefile:
                      twotwo.nl
twotwotwo.nl
two.nl
three.nl              three.nl
four.nl               four.nl
five.nl

which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile:

common domains: four.nl, three.nl

comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl

comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

in tmpingestedzonefile but not in tmpzonefile: twotwo.nl

comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

You wrote: Diff file must be: twotwo.nl, five.nl - I'm not sure how to get this result

Edit: the real challenge is to do it with splunk only

lovik00 · ‎05-03-2020

Oops, I made a typo.

Diff file should be:
twotwo.nl
four.nl

I have a bash script which does the job. I'm using comm (great diff tool), but I want a Python script to do the job.
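
For reference, here is a minimal sketch of what a comm-style comparison could look like in Python. It is only a rough analogue of comm -2 -3, not a drop-in replacement: it assumes both inputs are already sorted in plain code-point order (what LC_ALL=C sort produces), and unlike comm it does not count duplicate lines. The function name lines_only_in_first is made up for this illustration.

```python
def lines_only_in_first(sorted_a, sorted_b):
    """Yield items that appear in sorted_a but not in sorted_b.

    Assumption: both inputs are iterables of strings sorted in plain
    code-point order. Only one item per input is held in memory at a time.
    """
    it_b = iter(sorted_b)
    b = next(it_b, None)
    for a in sorted_a:
        # Advance the second stream until it has caught up with `a`.
        while b is not None and b < a:
            b = next(it_b, None)
        # `a` is missing from the second stream if that stream is
        # exhausted or its current item is different from `a`.
        if b is None or a != b:
            yield a


# The example lists from this thread, sorted first.
zone = sorted(["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"])
ingested = sorted(["twotwo.nl", "three.nl", "four.nl"])

only_in_zone = list(lines_only_in_first(zone, ingested))
print(only_in_zone)
```

For real files, the two arguments could be generators such as (line.rstrip("\n") for line in f) over files that were sorted beforehand, so neither file has to be loaded completely.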

tauliang · ‎05-02-2020

I am not sure if your desired behavior is correct. Here is what I got

temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))

the result is

['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']

instead of these two you were expecting:

twotwo.nl
five.nl

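The algorithm in your code takes len(file1)*len(file2) steps, because every "line not in new" check scans the whole list, so it is not surprising that it took forever. Theoretically you only need to load the files into two sets and do a symmetric_difference, which on average takes time proportional to len(file1)+len(file2); the worst case is still len(file1)*len(file2), but that only happens with heavy hash collisions.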

From your description, each file would only be around 40MB-100MB to load, which should not be a problem.
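
To illustrate the difference between the two approaches, here is a small sketch using the example domains from the question. The function names missing_with_list and missing_with_set are made up for this illustration; the first mirrors the loop in the original script, the second only changes what the membership test runs against.

```python
def missing_with_list(old, new):
    # Same logic as the original loop: each `not in` scans the whole
    # list, so total work grows like len(old) * len(new).
    return [d for d in old if d not in new]


def missing_with_set(old, new):
    # Build the set once; each membership test is then a hash lookup,
    # so total work grows like len(old) + len(new) on average.
    new_set = set(new)
    return [d for d in old if d not in new_set]


old = ["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"]
new = ["twotwo.nl", "three.nl", "four.nl"]

slow = missing_with_list(old, new)
fast = missing_with_set(old, new)
print(fast)
```

Both functions return the same domains in the same order; only the cost of each lookup changes.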

Good luck!

lovik00 · ‎05-02-2020

Hi,

Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

tauliang · ‎05-03-2020

You wrote: "I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile."

I am trying to understand: did you mean you would like to know which domains are IN tmpzonefile but NOT in tmpingestedzonefile?

If this is the case, the comparison logic would be

    temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
    temp2 = ['twotwo.nl', 'three.nl','four.nl']
    list(set(temp1) - set(temp2))

with the following results

    ['two.nl', 'twotwotwo.nl', 'five.nl']
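
Applied to files instead of in-memory lists, the same idea might look like the sketch below. It assumes one domain per line and that the ingested file fits in memory as a set; note that a set of Python strings takes considerably more RAM than the file does on disk. The function name write_missing_domains and the temporary demo files are made up for this illustration.

```python
import os
import tempfile


def write_missing_domains(zone_path, ingested_path, diff_path):
    """Write domains found in zone_path but not in ingested_path.

    Returns the number of domains written to diff_path.
    """
    # Only the ingested file is held in memory, as a set.
    with open(ingested_path) as f:
        ingested = {line.strip() for line in f}

    count = 0
    # The zone file is streamed line by line, never loaded as a whole.
    with open(zone_path) as zone_f, open(diff_path, "w") as out:
        for line in zone_f:
            domain = line.strip()
            if domain and domain not in ingested:
                out.write(domain + "\n")
                count += 1
    return count


# Small demo with the example data, written to temporary files.
tmpdir = tempfile.mkdtemp()
zone_path = os.path.join(tmpdir, "tmpzonefile")
ingested_path = os.path.join(tmpdir, "tmpingestedzonefile")
diff_path = os.path.join(tmpdir, "diff.txt")

with open(zone_path, "w") as f:
    f.write("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
with open(ingested_path, "w") as f:
    f.write("twotwo.nl\nthree.nl\nfour.nl\n")

count = write_missing_domains(zone_path, ingested_path, diff_path)
with open(diff_path) as f:
    result = f.read().splitlines()
print(count, result)
```

Unlike list(set(temp1) - set(temp2)), this keeps the order of the zone file in the output, because the zone file is read sequentially rather than turned into a set.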
