I want to compare two large files (200.000.00 and 150.000.000 lines). These are lists of domain names, and I want to produce the list of differences. The first file is an export from Splunk.
Diff file must be: twotwo.nl, five.nl
The following script yields too much. Any idea what goes wrong here? And it takes forever to process large files.
if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))

# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')

old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]

count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))

tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl

tmpingestedzonefile:
twotwo.nl
three.nl
four.nl
which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile:
common domains: four.nl, three.nl
comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl
comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
in tmpingestedzonefile but not in tmpzonefile: twotwo.nl
comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
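If you would rather stay in Python, the same three groups can be sketched with set operations. This is only a minimal illustration using the sample data from the question; the function name classify and the dictionary keys are my own, not something from your script.

```python
def classify(first, second):
    """Split two iterables of domain names into the three groups comm reports."""
    a, b = set(first), set(second)
    return {
        "common": sorted(a & b),        # like comm -1 -2
        "only_first": sorted(a - b),    # like comm -2 -3
        "only_second": sorted(b - a),   # like comm -1 -3
    }


# Sample data from the question
tmpzonefile = ["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"]
tmpingestedzonefile = ["twotwo.nl", "three.nl", "four.nl"]

groups = classify(tmpzonefile, tmpingestedzonefile)
for name, domains in groups.items():
    print(name, domains)
```

Unlike comm, this needs no sorted input, but both lists have to fit in memory as sets.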
You wrote: Diff file must be: twotwo.nl, five.nl - I'm not sure how to get this result
Edit: the real challenge is to do it with Splunk only.
I am not sure if your desired behavior is correct. Here is what I got
temp1 = ['twotwotwo.nl', 'two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl', 'four.nl']
list(set(temp1).symmetric_difference(set(temp2)))
the result is
['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']
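instead of the two you were expecting: twotwo.nl, five.nl
Your code takes roughly len(file1)*len(file2) steps, because "line not in new" scans the whole list new for every line in old, so it is not surprising that it takes forever. You only need to load both files into two sets and take the symmetric_difference. Set lookups are constant time on average, so the comparison takes about len(file1)+len(file2) steps on average; len(file1)*len(file2) is only the theoretical worst case, when many hashes collide.
From your description, each file should only take around 40MB-100MB to load, which should not be a problem.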
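As a rough sketch of that idea applied to files rather than in-memory lists: the helper names read_domains and write_symmetric_difference, and the temporary file names in the demo, are made up for illustration. I am also assuming one domain per line and that only surrounding whitespace needs stripping.

```python
import os
import tempfile


def read_domains(path):
    """Return the set of non-empty, whitespace-stripped lines in a file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def write_symmetric_difference(first_path, second_path, out_path):
    """Write domains found in exactly one of the two files; return the count."""
    diff = read_domains(first_path) ^ read_domains(second_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for domain in sorted(diff):
            out.write(domain + "\n")
    return len(diff)


# Demo with the sample data from the question, written to temporary files
with tempfile.TemporaryDirectory() as workdir:
    zone = os.path.join(workdir, "tmpzonefile")
    ingested = os.path.join(workdir, "tmpingestedzonefile")
    out = os.path.join(workdir, "diff.txt")
    with open(zone, "w", encoding="utf-8") as f:
        f.write("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
    with open(ingested, "w", encoding="utf-8") as f:
        f.write("twotwo.nl\nthree.nl\nfour.nl\n")
    written = write_symmetric_difference(zone, ingested, out)
    with open(out, encoding="utf-8") as f:
        result = f.read().split()

print(written, result)
```

Sorting the output is optional; it just makes the diff file stable between runs.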
I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.
I am trying to understand: did you mean you would like to know
which domains are IN tmpzonefile but NOT in tmpingestedzonefile
If this is the case, the comparison logic would be
temp1 = ['twotwotwo.nl', 'two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl', 'four.nl']
list(set(temp1) - set(temp2))
with the following results
['two.nl', 'twotwotwo.nl', 'five.nl']
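For the one-sided case you do not even need both files in memory. A possible sketch, assuming one domain per line: load only the ingested list into a set and stream the zone file past it. The name missing_domains is hypothetical, and io.StringIO stands in for the real open() file objects here.

```python
import io


def missing_domains(zone_lines, ingested_lines):
    """Yield each zone domain absent from the ingested list, in zone-file order."""
    ingested = {line.strip() for line in ingested_lines}
    for line in zone_lines:
        domain = line.strip()
        if domain and domain not in ingested:
            yield domain


# Stand-ins for open(tmpzonefile) and open(tmpingestedzonefile)
zone_file = io.StringIO("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
ingested_file = io.StringIO("twotwo.nl\nthree.nl\nfour.nl\n")

missing = list(missing_domains(zone_file, ingested_file))
print(missing)
```

This keeps the order of the zone file. Note that a domain listed twice in the zone file would also be reported twice.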