Comparing two very large files in Python

lovik00 · ‎05-02-2020

I want to compare two big files (200.000.00 and 150.000.000 lines). Both are lists of domain names, and I want to produce a list of the differences. The first file is an export from Splunk.
Example:
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl

tmpingestedzonefile:
twotwo.nl
three.nl
four.nl

Diff file must be:
twotwo.nl
five.nl

The following script yields too many results. Any idea what goes wrong here? It also takes forever to process large files.

if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()

PavelP · ‎05-03-2020

Hello @lovik00

tmpzonefile:          tmpingestedzonefile:
                      twotwo.nl
twotwotwo.nl
two.nl
three.nl              three.nl
four.nl               four.nl
five.nl

which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile:

common domains: four.nl, three.nl
comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl
comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

in tmpingestedzonefile but not in tmpzonefile: twotwo.nl
comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

You wrote: Diff file must be: twotwo.nl, five.nl - I'm not sure how to get this result

Edit: the real challenge is to do it with Splunk only
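For a Python counterpart to the comm approach, a minimal sketch could walk two already sorted inputs in step, so that neither file has to be held in memory. The function name comm_sorted and its column numbering are made up for this illustration; it assumes both inputs are sorted in plain string order (for files, something like LC_ALL=C sort beforehand) and that lines have had their trailing newline stripped.

```python
def comm_sorted(lines_a, lines_b):
    """Yield (column, line) pairs in the spirit of comm.

    column 1 = only in lines_a, 2 = only in lines_b, 3 = in both.
    Assumption: both iterables are sorted in plain string order.
    """
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            yield 3, a
            a, b = next(iter_a, None), next(iter_b, None)
        elif a < b:
            yield 1, a
            a = next(iter_a, None)
        else:
            yield 2, b
            b = next(iter_b, None)
    # One side is exhausted; whatever remains is unique to the other side.
    while a is not None:
        yield 1, a
        a = next(iter_a, None)
    while b is not None:
        yield 2, b
        b = next(iter_b, None)


# Sample data from the thread, sorted first as the comm commands do.
zone = sorted(["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"])
ingested = sorted(["twotwo.nl", "three.nl", "four.nl"])

result = list(comm_sorted(zone, ingested))
common = [line for col, line in result if col == 3]
only_zone = [line for col, line in result if col == 1]
only_ingested = [line for col, line in result if col == 2]

print("common:", common)
print("only in tmpzonefile:", only_zone)
print("only in tmpingestedzonefile:", only_ingested)
```

With real files, the two arguments could be generators such as (line.rstrip("\n") for line in open_file), which keeps memory use roughly constant at the cost of sorting the files first.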

lovik00 · ‎05-03-2020

Oops, I made a typo.

Diff file should be:
twotwo.nl
four.nl

I have a bash script which does the job. I'm using comm (a great diff tool), but I want a Python script to do the job.

tauliang · ‎05-02-2020

I am not sure if your desired behavior is correct. Here is what I got

temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))

the result is

['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']

instead of these two you were expecting:

twotwo.nl
five.nl

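The algorithm in your code is O(len(file1)*len(file2)), because each "line not in new" test scans the whole list, so it is not surprising that it took forever. In theory you only need to load the files into two sets and take a symmetric_difference, which on average takes time proportional to len(file1)+len(file2); the len(file1)*len(file2) worst case only arises with pathological hash collisions.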

From your description, each file would only be around 40MB-100MB to load, which should not be a problem.
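As a rough sketch of the two-set idea applied to files rather than in-memory lists: the helper names load_domains and symmetric_diff are hypothetical, and the temporary files only stand in for the real tmpzonefile and tmpingestedzonefile. Note that a Python set of strings is likely to need noticeably more memory than the file occupies on disk.

```python
import os
import tempfile


def load_domains(path):
    """Read one domain per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def symmetric_diff(path_a, path_b):
    """Return the domains that appear in exactly one of the two files."""
    return load_domains(path_a) ^ load_domains(path_b)


# Demo with the sample data from the question, written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    zone_path = os.path.join(tmp, "tmpzonefile")
    ingested_path = os.path.join(tmp, "tmpingestedzonefile")
    with open(zone_path, "w", encoding="utf-8") as f:
        f.write("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
    with open(ingested_path, "w", encoding="utf-8") as f:
        f.write("twotwo.nl\nthree.nl\nfour.nl\n")
    differing = symmetric_diff(zone_path, ingested_path)

print(sorted(differing))
```

Set membership tests take roughly constant time on average, which is what removes the quadratic behaviour of the list-based loop.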

Good luck!

lovik00 · ‎05-02-2020

Hi,

Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

tauliang · ‎05-03-2020

I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

I am trying to understand: did you mean you would like to know which domains are IN tmpzonefile but NOT in tmpingestedzonefile?

If this is the case, the comparison logic would be

    temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
    temp2 = ['twotwo.nl', 'three.nl','four.nl']
    list(set(temp1)-( set(temp2)))

with the following results

    ['two.nl', 'twotwotwo.nl', 'five.nl']
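Applied to the actual files, the same one-way difference might look like the sketch below. It loads only the ingested file into a set and streams the zone file line by line, so just one of the two files has to fit in memory. The function name write_missing and the temporary paths are made up for the example, and lower-casing each line is an assumption (domain names are case-insensitive) rather than something the original script did.

```python
import os
import tempfile


def write_missing(zone_path, ingested_path, diff_path):
    """Write domains found in zone_path but not in ingested_path.

    Returns the number of domains written to diff_path.
    """
    with open(ingested_path, encoding="utf-8") as ingested_f:
        # Assumption: normalise case and whitespace so equal domains match.
        ingested = {line.strip().lower() for line in ingested_f}

    count = 0
    with open(zone_path, encoding="utf-8") as zone_f, \
            open(diff_path, "w", encoding="utf-8") as diff_f:
        for line in zone_f:
            domain = line.strip().lower()
            if domain and domain not in ingested:
                diff_f.write(domain + "\n")
                count += 1
    return count


# Demo with the sample data from the question, written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    zone_path = os.path.join(tmp, "tmpzonefile")
    ingested_path = os.path.join(tmp, "tmpingestedzonefile")
    diff_path = os.path.join(tmp, "diff.txt")
    with open(zone_path, "w", encoding="utf-8") as f:
        f.write("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
    with open(ingested_path, "w", encoding="utf-8") as f:
        f.write("twotwo.nl\nthree.nl\nfour.nl\n")

    count = write_missing(zone_path, ingested_path, diff_path)
    with open(diff_path, encoding="utf-8") as f:
        missing = f.read().splitlines()

print(count, missing)
```

Unlike the set subtraction above, this keeps the missing domains in the order in which they appear in the zone file.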
