Splunk Dev

Comparing two very large files in Python

lovik00
New Member

I want to compare two big files (200.000.00 and 150.000.000 lines). They are lists of domain names, and I want to produce a list of the differences. The first file is an export from Splunk.
Example:
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl

tmpingestedzonefile:
twotwo.nl
three.nl
four.nl

Diff file must be:
twotwo.nl
five.nl

The following script writes too many domains to the diff file. Any idea what goes wrong here? It also takes forever to process large files.

if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domains written to difffile: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()

PavelP
Motivator

Hello @lovik00

tmpzonefile:          tmpingestedzonefile:
                      twotwo.nl
twotwotwo.nl
two.nl
three.nl              three.nl
four.nl               four.nl
five.nl

Which domains are in tmpingestedzonefile and which are not:

  • common domains: four.nl, three.nl

    comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

  • in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl

    comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

  • in tmpingestedzonefile but not in tmpzonefile : twotwo.nl

    comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort

You wrote that the diff file must be twotwo.nl, five.nl. I'm not sure how to get this result.
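
For comparison, the three comm invocations above can be sketched in Python with set operations. This is only an illustration on the example data from the question; the variable names are chosen here, and the results are sorted so they line up with the sorted comm output.

```python
# Example data copied from the question.
tmpzonefile = ["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"]
tmpingestedzonefile = ["twotwo.nl", "three.nl", "four.nl"]

zone = set(tmpzonefile)
ingested = set(tmpingestedzonefile)

# Sets are unordered, so sort each result for stable, readable output.
common = sorted(zone & ingested)         # like: comm -1 -2
only_zone = sorted(zone - ingested)      # like: comm -2 -3
only_ingested = sorted(ingested - zone)  # like: comm -1 -3

print(common)
print(only_zone)
print(only_ingested)
```

None of these three results equals twotwo.nl plus five.nl, which matches the doubt raised above about the expected diff file.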

Edit: the real challenge is to do it with Splunk only.

lovik00
New Member

Oops, I made a typo.

Diff file should be:
twotwo.nl
four.nl

I have a bash script which does the job using comm (a great diff tool), but I want a Python script to do the job.
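
One way to mimic comm in Python is a single merge pass over two inputs that are already sorted, which keeps memory use low because only one line per file is held at a time. The sketch below is not comm itself: the function name comm_like and the tags 'a', 'b' and 'both' are made up for illustration, and it assumes both inputs are sorted in the same order Python uses to compare strings (for plain ASCII domain names that is what `LC_ALL=C sort` produces).

```python
def comm_like(sorted_a, sorted_b):
    """Merge two sorted iterables of lines, yielding (tag, line) pairs.

    tag is 'a' (only in the first input), 'b' (only in the second)
    or 'both'. Hypothetical helper, not part of any library.
    """
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            yield "both", a
            a = next(it_a, None)
            b = next(it_b, None)
        elif a < b:
            yield "a", a
            a = next(it_a, None)
        else:
            yield "b", b
            b = next(it_b, None)
    # One input is exhausted; everything left in the other is unique to it.
    while a is not None:
        yield "a", a
        a = next(it_a, None)
    while b is not None:
        yield "b", b
        b = next(it_b, None)


# Example data from the question, sorted first as comm requires.
zone = sorted(["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"])
ingested = sorted(["twotwo.nl", "three.nl", "four.nl"])
result = list(comm_like(zone, ingested))
for tag, line in result:
    print(tag, line)
```

With real files, the two arguments could be generators such as `(line.strip() for line in open_file)`, so neither file is ever loaded completely.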


tauliang
Communicator

I am not sure your expected output is correct. Here is what I got:

temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))

the result is

['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']

instead of these two you were expecting:

twotwo.nl
five.nl

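Your code does a linear scan of the new list for every line in old, so its running time is proportional to len(file1)*len(file2); it is not surprising that it takes forever. You only need to load the two files into two sets and take the symmetric_difference, which on average takes time proportional to len(file1)+len(file2). The len(file1)*len(file2) worst case only occurs with pathological hash collisions.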

From your description, each file would only be around 40MB-100MB to load, which should not be a problem.
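
A minimal sketch of that idea applied to files rather than in-memory lists is shown below. The helper names load_domains and symmetric_diff are invented for this example, and the temporary files only stand in for the real tmpzonefile and tmpingestedzonefile. Note that a Python set of strings can take considerably more memory than the file does on disk, so whether this fits depends on the machine.

```python
import os
import tempfile


def load_domains(path):
    """Read one domain per line into a set, skipping blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def symmetric_diff(path_a, path_b):
    """Return the domains that appear in exactly one of the two files."""
    return load_domains(path_a) ^ load_domains(path_b)


# Demo on the example data from the question, written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "tmpzonefile")
    path_b = os.path.join(tmp, "tmpingestedzonefile")
    with open(path_a, "w") as f:
        f.write("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
    with open(path_b, "w") as f:
        f.write("twotwo.nl\nthree.nl\nfour.nl\n")
    diff = sorted(symmetric_diff(path_a, path_b))

print(diff)
```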

Good luck!

lovik00
New Member

Hi,

Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.


tauliang
Communicator

I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.

I am trying to understand: did you mean you would like to know which domains are IN tmpzonefile but NOT in tmpingestedzonefile?

If this is the case, the comparison logic would be

    temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
    temp2 = ['twotwo.nl', 'three.nl','four.nl']
    list(set(temp1) - set(temp2))

with the following results

    ['two.nl', 'twotwotwo.nl', 'five.nl']
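
Applied to the original script, the same set difference could look roughly like the sketch below: only the ingested file is loaded into a set, and the zone file is streamed line by line, so the output keeps the zone file's order. The names write_missing and normalize are hypothetical. The lower-casing in normalize is an assumption about why the original script reported too many domains (for example, case differences in the Splunk export); the thread does not confirm the actual cause.

```python
import os
import tempfile


def normalize(line):
    # ASSUMPTION: surrounding whitespace and letter case are not significant.
    return line.strip().lower()


def write_missing(zone_path, ingested_path, out_path):
    """Write domains found in zone_path but not in ingested_path.

    Returns the number of domains written.
    """
    with open(ingested_path) as f:
        ingested = {normalize(line) for line in f}
    count = 0
    with open(zone_path) as zone_f, open(out_path, "w") as out_f:
        for line in zone_f:
            domain = normalize(line)
            # Set membership is a hash lookup, not a scan of the whole list.
            if domain and domain not in ingested:
                out_f.write(domain + "\n")
                count += 1
    return count


# Demo on the example data from the question, using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    zone_path = os.path.join(tmp, "tmpzonefile")
    ingested_path = os.path.join(tmp, "tmpingestedzonefile")
    out_path = os.path.join(tmp, "diff.txt")
    with open(zone_path, "w") as f:
        f.write("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
    with open(ingested_path, "w") as f:
        f.write("twotwo.nl\nthree.nl\nfour.nl\n")
    written = write_missing(zone_path, ingested_path, out_path)
    with open(out_path) as f:
        missing = f.read().split()

print(written, missing)
```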