Splunk Dev

Comparing two very large files in Python

lovik00
New Member

I want to compare two large files (200.000.00 and 150.000.000 lines). Both are lists of domain names, and I want to produce a list of the differences. The first file is an export from Splunk.
Example:
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl

tmpingestedzonefile:
twotwo.nl
three.nl
four.nl

Diff file must be:
twotwo.nl
five.nl

The following script writes too many domains to the diff file. Any idea what goes wrong here? It also takes forever to process large files.

if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()

PavelP
Motivator

Hello @lovik00

tmpzonefile:          tmpingestedzonefile:
                      twotwo.nl
twotwotwo.nl
two.nl
three.nl              three.nl
four.nl               four.nl
five.nl

which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile:

  • common domains: four.nl, three.nl

    comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

  • in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl

    comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

  • in tmpingestedzonefile but not in tmpzonefile : twotwo.nl

    comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

You wrote that the diff file must be twotwo.nl and five.nl. I'm not sure how to get this result.

Edit: the real challenge is to do it with Splunk only.
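
For a Python take on the comm -2 -3 command above, one option is a merge-style walk over two already-sorted inputs, so that neither file has to be held in memory. This is only a minimal sketch: the helper name only_in_first is made up for illustration, and it assumes both inputs are sorted in plain byte order (for example with LC_ALL=C sort) and stripped of trailing newlines.

```python
def only_in_first(sorted_a, sorted_b):
    """Yield items of sorted_a that do not occur in sorted_b.

    Assumption: both inputs are iterables of strings in ascending
    order (Python string ordering), e.g. lines from pre-sorted files.
    """
    it_b = iter(sorted_b)
    b = next(it_b, None)  # None marks "sorted_b is exhausted"
    for a in sorted_a:
        # Skip entries of sorted_b that sort before the current item.
        while b is not None and b < a:
            b = next(it_b, None)
        if b is None or a != b:
            yield a


# Sample data from the question, sorted first as comm requires.
zone = sorted(['twotwotwo.nl', 'two.nl', 'three.nl', 'four.nl', 'five.nl'])
ingested = sorted(['twotwo.nl', 'three.nl', 'four.nl'])
missing = list(only_in_first(zone, ingested))
print(missing)  # ['five.nl', 'two.nl', 'twotwotwo.nl']
```

With real files, the two arguments could be generators that strip each line of an open file object, as long as the files were sorted beforehand.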

lovik00
New Member

Oops, I made a typo.

Diff file should be:
twotwo.nl
four.nl

I have a bash script which does the job. I'm using comm (a great diff tool), but I want a Python script to do the job.


tauliang
Communicator

I am not sure your desired result is correct. Here is what I got:

temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))

the result is

['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']

instead of these two you were expecting:

twotwo.nl
five.nl

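The loop in your code does a list lookup (line not in new) for every line of the first file, so it needs on the order of len(file1)*len(file2) comparisons. It is not surprising that it took forever. In principle you only need to load the two files into sets and take the symmetric_difference. Set lookups are hash based, so on average this takes time roughly proportional to len(file1)+len(file2); len(file1)*len(file2) is only the worst case, when many hashes collide.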

From your description, each file would only be around 40MB-100MB to load, which should not be a problem.
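
Applied to the loop from the question, the change could look like the sketch below. It is only an illustration under some assumptions: the function name write_missing and the demo file names are made up, only the ingested file is loaded into a set, and that set still has to fit in memory.

```python
import os
import tempfile


def write_missing(zone_path, ingested_path, diff_path):
    """Write lines of zone_path that are absent from ingested_path.

    Returns the number of lines written. Only the ingested file is
    held in memory (as a set); the zone file is streamed line by line.
    """
    with open(ingested_path) as f:
        ingested = {line.strip() for line in f}
    count = 0
    with open(zone_path) as zone_f, open(diff_path, 'wt') as diff_f:
        for line in zone_f:
            domain = line.strip()
            if domain not in ingested:  # hash lookup, not a list scan
                diff_f.write(domain + '\n')
                count += 1
    return count


# Small demo with the sample data from the question (hypothetical paths).
with tempfile.TemporaryDirectory() as tmp:
    zone_path = os.path.join(tmp, 'tmpzonefile')
    ingested_path = os.path.join(tmp, 'tmpingestedzonefile')
    diff_path = os.path.join(tmp, 'diff.txt')
    with open(zone_path, 'w') as f:
        f.write('twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n')
    with open(ingested_path, 'w') as f:
        f.write('twotwo.nl\nthree.nl\nfour.nl\n')
    count = write_missing(zone_path, ingested_path, diff_path)
    with open(diff_path) as f:
        diff_lines = f.read().splitlines()

print(count, diff_lines)
```

Streaming the zone file keeps the output in the original file order, which a plain set difference would not.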

Good luck!

lovik00
New Member

Hi,

Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.


tauliang
Communicator

I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

I am trying to understand: did you mean you would like to know which domains are IN tmpzonefile but NOT in tmpingestedzonefile?

If this is the case, the comparison logic would be

    temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
    temp2 = ['twotwo.nl', 'three.nl','four.nl']
    list(set(temp1)-( set(temp2)))

with the following results

    ['two.nl', 'twotwotwo.nl', 'five.nl']
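
For comparison, the three groups that the comm commands earlier in this thread produce can be sketched with set operators on the same sample lists. The variable names below are only illustrative, and sorted() is used because sets have no fixed order.

```python
temp1 = ['twotwotwo.nl', 'two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl', 'four.nl']

set1, set2 = set(temp1), set(temp2)

# In tmpzonefile but not in tmpingestedzonefile (like comm -2 -3).
only_in_zone = sorted(set1 - set2)
# In tmpingestedzonefile but not in tmpzonefile (like comm -1 -3).
only_in_ingested = sorted(set2 - set1)
# In both files (like comm -1 -2).
common = sorted(set1 & set2)

print(only_in_zone)      # ['five.nl', 'two.nl', 'twotwotwo.nl']
print(only_in_ingested)  # ['twotwo.nl']
print(common)            # ['four.nl', 'three.nl']
```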