I have set up a table in a view. However, with the search in place, memory on the Splunk server is gradually consumed until Splunk eventually crashes. The server has 75 GB of memory, and there are millions of logs per day.
I could use some help optimizing this search. As you can see from the config below, I'm using a 'join' between two tscollections (all_traffic and malware_traffic_pattern). However, all_traffic can be very, very large, bringing in thousands of events every few minutes. I had hoped the 'join' would be limited to the time period specified by a TimeRangePicker on the dashboard, but while the results are limited to this time range, it appears that the 'join' is not limited, so it is 'joining' the entire all_traffic tscollection, then only showing the results for the time range. This is responsible for the extreme memory usage and eventual crash of Splunk.
How can I correlate these two tscollections, either without a join, or with a join that is limited by the TimeRangePicker? What other techniques can I use to make this more efficient?
savedsearches.conf:
# Collect the traffic logs for the last 5 minutes and use tscollect to
# add them to a collection called all_traffic
# This collects a large amount of data. There could be thousands of
# logs in a 5 minute period.
[All Traffic]
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m@m
displayview = flashtimeline
enableSched = 1
realtime_schedule = 0
request.ui_dispatch_view = flashtimeline
search = `all_traffic` | table _time log_subtype action bytes bytes_sent bytes_received dst_ip egress_interface ingress_interface dst_port dst_user packets protocol src_ip src_user | tscollect namespace=all_traffic
disabled = 0
# Collect the known malware traffic patterns for the last 5 minutes and use tscollect to
# add them to a collection called malware_traffic_pattern.
# This collects very little data. There may be 5-10 logs in a 5 minute period.
[Malware Traffic Pattern]
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m@m
displayview = flashtimeline
enableSched = 1
realtime_schedule = 0
request.ui_dispatch_view = flashtimeline
search = `malware_traffic_pattern` malware=yes | table _time report_id dst_ip dst_port protocol | tscollect namespace=malware_traffic_pattern
disabled = 0
Relevant part of data/ui/views/my_dashboard.xml:
<module name="HiddenSearch" layoutPanel="panel_row1_col1" group="Possible Malware Traffic">
<param name="search"> |`tstats` count(dst_ip) AS cdip FROM malware_traffic_pattern WHERE * NOT (protocol=udp AND dst_port=53) groupby dst_ip dst_port report_id protocol | table report_id dst_ip dst_port protocol |
join protocol dst_ip dst_port [ |`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) $src_ip$ $dst_ip$ $src_user$ $vsys$ $app$ groupby _time src_ip dst_ip dst_port protocol app src_user | dedup 1 src_ip dst_ip dst_port protocol app src_user | table _time src_ip src_user dst_port dst_ip protocol app | rename _time AS traffic_time ] |
rename src_user AS "User" | rename src_ip AS "Source IP" |
table traffic_time "Source IP" "User" dst_ip dst_port protocol app report_id |
rename traffic_time AS _time |
rename dst_ip AS "Dst_IP" |
rename dst_port AS "Dst_Port" |
rename protocol AS "Protocol" |
rename app AS "Application" |
rename report_id AS "Report ID"</param>
<module name="Paginator">
<param name="count">10</param>
<param name="entityName">results</param>
<module name="SimpleResultsTable">
<param name="allowTransformedFieldSelect">True</param>
<param name="drilldown">all</param>
<param name="displayMenu">true</param>
<module name="SimpleDrilldown">
<param name="links">
<param name="*">./flashtimeline?earliest=$earliest$&amp;latest=$latest$&amp;q=`all_traffic` src_ip="$row.Source IP$" dst_ip="$row.Dst_IP$" dst_port="$row.Dst_Port$" protocol="$row.Protocol$"</param>
</param>
</module>
</module>
</module>
</module>
This is a tough one, because profiling process memory is not trivial.
My first questions are:
If that can be established with certainty, I would suggest taking a divide-and-conquer approach to the search string: break it down into pieces until you can figure out which part is causing the memory blowout.
Typically, I would recommend starting by yanking out the "join" directive entirely and seeing how things run under those circumstances. If that doesn't trigger the memory issue, I would then take the search inside the "join" brackets and run it as a standalone search, again monitoring memory usage (you can use the S.o.S app for that purpose, by the way).
Whatever you find out, I would suggest opening a case with Splunk Support to report this issue and your findings. If at all possible, provide a reproducible test case, ideally with sample events / TSIDX data stores that are sufficient to reproduce the problem.
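A minimal sketch of that breakdown, reusing only pieces of the search string from the question. It assumes the dashboard tokens ($src_ip$, $dst_ip$, $src_user$, $vsys$, $app$) are simply left out when running outside the view, and that each search is run over the same time range the dashboard would use.

First, the outer search with the join removed:

```spl
|`tstats` count(dst_ip) AS cdip FROM malware_traffic_pattern WHERE * NOT (protocol=udp AND dst_port=53) groupby dst_ip dst_port report_id protocol
| table report_id dst_ip dst_port protocol
```

Then the content of the join brackets as a standalone search:

```spl
|`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) groupby _time src_ip dst_ip dst_port protocol app src_user
| dedup 1 src_ip dst_ip dst_port protocol app src_user
| table _time src_ip src_user dst_port dst_ip protocol app
| rename _time AS traffic_time
```

If only the second search drives memory up, the problem lies in the subsearch rather than in the join itself.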
For reference, here's the search string:
|`tstats` count(dst_ip) AS cdip FROM malware_traffic_pattern WHERE * NOT (protocol=udp AND dst_port=53) groupby dst_ip dst_port report_id protocol
| table report_id dst_ip dst_port protocol
| join protocol dst_ip dst_port [
|`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) $src_ip$ $dst_ip$ $src_user$ $vsys$ $app$ groupby _time src_ip dst_ip dst_port protocol app src_user
| dedup 1 src_ip dst_ip dst_port protocol app src_user
| table _time src_ip src_user dst_port dst_ip protocol app
| rename _time AS traffic_time]
| rename src_user AS "User"
| rename src_ip AS "Source IP"
| table traffic_time "Source IP" "User" dst_ip dst_port protocol app report_id
| rename traffic_time AS _time
| rename dst_ip AS "Dst_IP"
| rename dst_port AS "Dst_Port"
| rename protocol AS "Protocol"
| rename app AS "Application"
| rename report_id AS "Report ID"
PS: My money is on "dedup" against 6 distinct fields.
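One way to test that suspicion, offered as a sketch rather than a confirmed fix: run the subsearch with the dedup, table and rename steps replaced by a single stats. This assumes that keeping the latest _time per combination is acceptable; dedup instead keeps whichever row comes first, so the timestamps may differ. Dashboard tokens are omitted here.

```spl
|`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) groupby _time src_ip dst_ip dst_port protocol app src_user
| stats max(_time) AS traffic_time by src_ip dst_ip dst_port protocol app src_user
```

stats still has to track every distinct combination of the six fields, so a similar memory profile with this variant would point at the cardinality of the data rather than at dedup specifically.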