Comparing two very large files in Python
I want to compare two big files (200.000.00 and 150.000.000 lines). Both are lists of domain names, and I want to produce the list of differences. The first file is an export from Splunk.
Example:
tmpzonefile:
twotwotwo.nl
two.nl
three.nl
four.nl
five.nl
tmpingestedzonefile:
twotwo.nl
three.nl
four.nl
Diff file must be:
twotwo.nl
five.nl
The following script writes too many lines to the diff file. Any idea what goes wrong here? It also takes forever to process large files.
if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))
# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')
old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]
count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))
tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()
Hello @lovik00
tmpzonefile:      tmpingestedzonefile:
                  twotwo.nl
twotwotwo.nl
two.nl
three.nl          three.nl
four.nl           four.nl
five.nl
which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile:
common domains: four.nl, three.nl
comm -1 -2 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
in tmpzonefile but not in tmpingestedzonefile: five.nl, two.nl, twotwotwo.nl
comm -2 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
in tmpingestedzonefile but not in tmpzonefile : twotwo.nl
comm -1 -3 <(sort tmpzonefile) <(sort tmpingestedzonefile) | sort
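For a Python counterpart, the three comm invocations above map onto set operations. This is only a minimal sketch: the helper name `comm_like` is made up for illustration, and it assumes both inputs fit in memory and that duplicate lines do not matter.

```python
def comm_like(left, right):
    """Return (only_left, only_right, common) as sorted lists.

    `left` and `right` are any iterables of lines, e.g. open file objects.
    Hypothetical helper, not part of any library.
    """
    left_set = {line.strip() for line in left if line.strip()}
    right_set = {line.strip() for line in right if line.strip()}
    only_left = sorted(left_set - right_set)    # like: comm -2 -3
    only_right = sorted(right_set - left_set)   # like: comm -1 -3
    common = sorted(left_set & right_set)       # like: comm -1 -2
    return only_left, only_right, common


zone = ["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"]
ingested = ["twotwo.nl", "three.nl", "four.nl"]
only_zone, only_ingested, common = comm_like(zone, ingested)
print(only_zone)      # ['five.nl', 'two.nl', 'twotwotwo.nl']
print(only_ingested)  # ['twotwo.nl']
print(common)         # ['four.nl', 'three.nl']
```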
You wrote: Diff file must be: twotwo.nl, five.nl - I'm not sure how to get this result
Edit: the real challenge is to do it with Splunk only
Oops, I made a typo.
Diff file should be:
twotwo.nl
four.nl
I have a bash script which does the job. I'm using comm (a great diff tool), but I want a Python script to do the job.
I am not sure your desired result is correct. Here is what I got:
temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1).symmetric_difference( set(temp2)))
the result is
['twotwo.nl', 'two.nl', 'five.nl', 'twotwotwo.nl']
instead of these two you were expecting:
twotwo.nl
five.nl
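The algorithm in your code is O(len(file1)*len(file2)), because `line not in new` scans the whole `new` list for every line of `old`, so it is not surprising that it took forever. Theoretically you only need to load the files into two sets and do a symmetric_difference. Set lookups are hash-based, so this takes about len(file1)+len(file2) steps on average; the len(file1)*len(file2) worst case only arises with pathological hash collisions.
From your description, each file would only be around 40MB-100MB to load, which should not be a problem.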
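To make the set idea concrete for files, here is a rough sketch. The function names `load_domains` and `symmetric_diff` are invented for this example, and it assumes one domain per line and enough memory to hold both sets.

```python
def load_domains(lines):
    """Return the set of non-empty, whitespace-stripped lines."""
    return {line.strip() for line in lines if line.strip()}


def symmetric_diff(lines_a, lines_b):
    """Domains that appear in exactly one of the two inputs."""
    return load_domains(lines_a) ^ load_domains(lines_b)


# With real files you would pass the open file objects, for example:
#   with open(tmpzonefile) as a, open(tmpingestedzonefile) as b:
#       diff = symmetric_diff(a, b)
temp1 = ["twotwotwo.nl", "two.nl", "three.nl", "four.nl", "five.nl"]
temp2 = ["twotwo.nl", "three.nl", "four.nl"]
diff = symmetric_diff(temp1, temp2)
print(sorted(diff))  # ['five.nl', 'two.nl', 'twotwo.nl', 'twotwotwo.nl']
```

Sets are unordered, so sort the result if the output order matters.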
Good luck!
Hi,
Thanks for your answers, but that's not exactly what I meant. I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.
I would like to know which domains are in tmpingestedzonefile and which are not in tmpingestedzonefile.
I am trying to understand: did you mean you would like to know which domains are IN tmpzonefile but NOT in tmpingestedzonefile?
If this is the case, the comparison logic would be
temp1 = ['twotwotwo.nl','two.nl', 'three.nl', 'four.nl', 'five.nl']
temp2 = ['twotwo.nl', 'three.nl','four.nl']
list(set(temp1) - set(temp2))
with the following results
['two.nl', 'twotwotwo.nl', 'five.nl']
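Applied to the original script, the same one-directional comparison could look roughly like the sketch below. It keeps only the ingested file in a set and streams the zone file line by line, so the larger file never has to sit in memory as a list. `write_missing` is a hypothetical name, and the `io.StringIO` objects stand in for the real file handles.

```python
import io


def write_missing(zone_lines, ingested_lines, out):
    """Write every zone domain that is absent from the ingested set to `out`.

    Returns the number of domains written. Input order is preserved.
    """
    ingested = {line.strip() for line in ingested_lines}
    count = 0
    for line in zone_lines:
        domain = line.strip()
        if domain and domain not in ingested:  # set lookup, not a list scan
            out.write(domain + "\n")
            count += 1
    return count


# Stand-ins for open(tmpzonefile), open(tmpingestedzonefile) and the diff file.
zone = io.StringIO("twotwotwo.nl\ntwo.nl\nthree.nl\nfour.nl\nfive.nl\n")
ingested = io.StringIO("twotwo.nl\nthree.nl\nfour.nl\n")
out = io.StringIO()
missing = write_missing(zone, ingested, out)
print(missing)         # 3
print(out.getvalue())  # twotwotwo.nl, two.nl, five.nl (one per line)
```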
