Splunk Dev

Comparing two very large files in Python

lovik00
New Member

I want to compare two files (big 200.000.00 and 150.000.000 lines). These are lists of domain names. I want to make the difference list. The first file is from an export from Splunk.
Example:
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl

tmpingestedzonefile:
twotwo.nl
three.nl
four.nl

Diff file must be:
twotwo.nl
five.nl

The following script yields too much. Any idea what goes wrong here? And it takes forever to process large files.

if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()
Labels (1)
0 Karma

PavelP
Motivator

Hello @lovik00

tmpzonefile:          tmpingestedzonefile:
                      twotwo.nl
twotwotwo.nl
two.nl
three.nl              three.nl
four.nl               four.nl
five.nl

which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile:

  • common domains: four.nl, three.nl

    comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

  • in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl

    comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

  • in tmpingestedzonefile but not in tmpzonefile : twotwo.nl

    comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
    You wrote: Diff file must be: twotwo.nl, five.nl - I'm not sure how to get this result

Edit: the real challenge is to do it with splunk only

lovik00
New Member

Oeps, made a typo.

Diff file should be:
twotwo.nl
four.nl

I have a bash script which does the job. I'm using comm (great diff tool), but I want to a python script to do the job.

0 Karma

tauliang
Communicator

I am not sure if your desired behavior is correct. Here is what I got

temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))

the result is

['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']

instead of these two you were expecting:

twotwo.nl
five.nl

The algorithm in your code is len(file1)*len(file2) so it is not surprising that it took forever. Theoretically you only need to load them into two sets, and do a symmetric_difference, which has a worst case time complexity of len(file1)*len(file2).

From your description, each file would only around 40MB-100MB to load, which should not be a problem.

Good luck!

lovik00
New Member

Hi,

Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

0 Karma

tauliang
Communicator

I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

I am trying to understand, did you mean you would like to know

which domains are IN tmpzonefile but NOT in tmpingestedzonefile
?

If this is the case, the comparison logic would be

    temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
    temp2 = ['twotwo.nl', 'three.nl','four.nl']
    list(set(temp1)-( set(temp2)))

with the following results

    ['two.nl', 'twotwotwo.nl', 'five.nl']
Get Updates on the Splunk Community!

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...