We need to identify the country of origin of IPs hitting our firewalls, notably those from "unfriendly" countries. To that end, I have collected a list of IPs in CIDR notation for each of these countries. The Splunk lookup search works, but our 8-CPU server is choking, which drastically slows down other searches. One file has about 7k lines (142352 bytes), the other about 11k lines (231628 bytes). Each IP entry in the lookup table is in CIDR notation. I have reason to believe that matching an IP against a subnet in CIDR notation is CPU-intensive. Linux "top" and "vmstat" show that all 8 CPUs remain steady at 99-100%. Memory and swap space usage is quite low. Question: are there ways to improve search performance? (1) I can have these small CSV tables indexed by setting the "max_memtable_bytes=142350" parameter. This setting is system-wide and may have an impact that I am not aware of. (2) I can develop an external lookup that does a binary search rather than a sequential search on the lookup files.
Your help in clarifying (1), (2), and other options would be much appreciated.
Thuan-
I did not see that this question had been answered... I've tested this extensively and the max_memtable_bytes setting is unlikely to improve performance so long as you are using the CIDR lookup type, for lookup tables of your size.
I think your best bet is to either:
a) Use an external Python lookup to perform a binary search. If you are on Windows, the improvement from this method will be smaller than what you would see on a Linux system, since the cost of forking Python processes is larger on Windows, in my experience.
b) Expand all CIDR subnets in your lookup table into individual IP addresses and use a string-based lookup instead of a CIDR lookup. This will increase your lookup table size substantially, but if the lookup table is likely to be static, you will only incur a one-time cost when the lookup is indexed (assuming max_memtable_bytes is set to a value smaller than the lookup table size), and the lookup should be very fast thereafter. I would only recommend this approach if you are using subnets smaller than /24 in your lookup table; anything larger will probably cause the size of your lookup table to expand too dramatically, and the performance increase you will see from using the string-based lookup will be offset by the size of the lookup table.
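For option (a), here is a minimal sketch of the binary-search part only. It is not a complete Splunk external lookup script: the wiring into Splunk is left out, and the function names (build_index, lookup_country) and the sample rows are made up for illustration. It assumes the table is IPv4 only and that the CIDR blocks do not overlap or nest; if they do, the ranges would need to be merged or split first.

```python
import bisect
import ipaddress


def build_index(rows):
    """Turn (cidr, country) pairs into integer ranges sorted by start address.

    Assumes IPv4 networks that do not overlap or nest.
    """
    ranges = []
    for cidr, country in rows:
        net = ipaddress.ip_network(cidr.strip(), strict=False)
        ranges.append(
            (int(net.network_address), int(net.broadcast_address), country)
        )
    ranges.sort()
    starts = [r[0] for r in ranges]  # parallel list of start addresses for bisect
    return starts, ranges


def lookup_country(index, ip):
    """Return the country for ip, or None if no range contains it."""
    starts, ranges = index
    addr = int(ipaddress.ip_address(ip))
    # Find the last range whose start address is <= addr.
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        _start, end, country = ranges[i]
        if addr <= end:
            return country
    return None


# Hypothetical sample rows (documentation address ranges, placeholder countries).
SAMPLE_ROWS = [
    ("203.0.113.0/24", "country3"),
    ("192.0.2.0/24", "country1"),
    ("198.51.100.0/24", "country2"),
]

index = build_index(SAMPLE_ROWS)
print(lookup_country(index, "198.51.100.7"))
print(lookup_country(index, "192.0.3.0"))
```

With a table of about 10k entries, each lookup takes on the order of 14 comparisons instead of a scan over every line. The index should be built once per script invocation, not once per event.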
There is a "geoip" command available with the MAXMIND apps. Can it be used just to display the IPs of selected countries, rather than a location on a map?
(1) I don't have any feel for how fast the query is prior to the lookup. (2) The number of log records that are parsed in the NOT() statement may easily amount to 1/3 of all log records. (3) I understand that the use of NOT is not recommended for performance reasons, but I have not been able to find a better way to filter out addresses before they are processed by the lookup.
How fast is the part of the query prior to the lookup and how many events actually make it past the NOT statement?
I have changed the names of the countries searched in the query, and the addresses as well.
The query is: "index=*_out_index NOT (s_ip=X.X.0.0/16 OR s_ip=10.0.0.0/8 OR s_ip=X.0.0.0/8 OR s_ip=172.16.0.0/16) | lookup ENetflow s_ip OUTPUT country | search country=country1 OR country=country2 OR country=country3 | transaction s_ip | table _time host s_ip country d_ip d_port action eventcount duration"
The lookup table in CIDR notation includes 10077 entries/lines.
The number of netflow events is about 250 million.
How many events are being processed/passed to the lookup? Perhaps there's optimization that can be done there.
Can you share your actual search query?
Each entry in the lookup table is in IP CIDR notation along with its country of origin. I am basically trying to match netflow or proxy data IPs against the corresponding lookup table. The search is sparse, as there are few matches within very large data sets. Besides options (1) and (2) suggested in a previous note: as indexed data is compressed, decompressing it is bound to be CPU-intensive. Option (3) would be to NOT compress data at index time. Is that feasible? Disk space usage will be much higher with uncompressed indexed data.
What search are you running?