Search causes Splunk to crash

btorresgil
Builder

I have set up a table in a view. With its search in place, memory on the Splunk server is gradually consumed until Splunk eventually crashes. The server has 75 GB of memory, and there are millions of logs per day.

I could use some help optimizing this search. As you can see from the config below, I'm using a 'join' between two tscollections (all_traffic and malware_traffic_pattern). However, all_traffic can be very large, bringing in thousands of events every few minutes. I had hoped the 'join' would be limited to the time period specified by a TimeRangePicker on the dashboard. The results are limited to this time range, but the 'join' itself does not appear to be: it seems to join against the entire all_traffic tscollection and only then show the results for the time range. I believe this is responsible for the extreme memory usage and the eventual crash of Splunk.

How can I correlate these two tscollections, either without a join, or with a join that is limited by the TimeRangePicker? What other techniques can I use to make this more efficient?

savedsearches.conf:

# Collect the traffic logs for the last 5 minutes and use tscollect to
# add them to a collection called all_traffic
# This collects a large amount of data.  There could be thousands of
# logs in a 5 minute period.
[All Traffic]
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m@m
displayview = flashtimeline
enableSched = 1
realtime_schedule = 0
request.ui_dispatch_view = flashtimeline
search = `all_traffic` | table _time log_subtype action bytes bytes_sent bytes_received dst_ip egress_interface ingress_interface dst_port dst_user packets protocol src_ip src_user | tscollect namespace=all_traffic
disabled = 0

# Collect the known malware traffic patterns for the last 5 minutes and use tscollect to
# add them to a collection called malware_traffic_pattern.
# This collects very little data.  There may be 5-10 logs in a 5 minute period.
[Malware Traffic Pattern]
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m@m
displayview = flashtimeline
enableSched = 1
realtime_schedule = 0
request.ui_dispatch_view = flashtimeline
search = `malware_traffic_pattern` malware=yes | table _time report_id dst_ip dst_port protocol | tscollect namespace=malware_traffic_pattern
disabled = 0

Relevant part of data/ui/views/my_dashboard.xml:

<module name="HiddenSearch" layoutPanel="panel_row1_col1" group="Possible Malware Traffic">
  <param name="search"> |`tstats` count(dst_ip) AS cdip FROM malware_traffic_pattern WHERE * NOT (protocol=udp AND dst_port=53) groupby dst_ip dst_port report_id protocol | table report_id dst_ip dst_port protocol |
    join protocol dst_ip dst_port [ |`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) $src_ip$ $dst_ip$ $src_user$ $vsys$ $app$ groupby _time src_ip dst_ip dst_port protocol app src_user | dedup 1 src_ip dst_ip dst_port protocol app src_user | table _time src_ip src_user dst_port dst_ip protocol app | rename _time AS traffic_time ] |
    rename src_user AS "User" | rename src_ip AS "Source IP" |
    table traffic_time "Source IP" "User" dst_ip dst_port protocol app report_id |
    rename traffic_time AS _time |
    rename dst_ip AS "Dst_IP" |
    rename dst_port AS "Dst_Port" |
    rename protocol AS "Protocol" |
    rename app AS "Application" |
    rename report_id AS "Report ID"</param>
    <module name="Paginator">
        <param name="count">10</param>
        <param name="entityName">results</param>
        <module name="SimpleResultsTable">
          <param name="allowTransformedFieldSelect">True</param>
          <param name="drilldown">all</param>
          <param name="displayMenu">true</param>
          <module name="SimpleDrilldown">
            <param name="links">
              <param name="*">./flashtimeline?earliest=$earliest$&amp;latest=$latest$&amp;q=`all_traffic` src_ip="$row.Source IP$" dst_ip="$row.Dst_IP$" dst_port="$row.Dst_Port$" protocol="$row.Protocol$"</param>
            </param>
          </module>
        </module>
    </module>
</module>

hexx
Splunk Employee

This is a tough one because profiling process memory is far from trivial.

My first questions are:

  • Are you positively certain that this particular search ends up consuming the entirety of your system memory?
  • Is the memory consumption attributed to the "splunkd search" process associated with the ID of this particular search?

If that can be established with certainty, I would suggest taking a divide-and-conquer approach to the search string: break it down into pieces until you can figure out which part is causing the memory blowout.

Typically, I would recommend starting by removing the "join" directive entirely to see how things run without it. If that doesn't trigger the memory issue, I would then take the search in the "join" brackets and run it as a standalone search, again monitoring memory usage (you can use the S.o.S app for that, by the way).
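As a rough sketch of that breakdown, the two searches below are lifted from the dashboard search in the question and are meant to be run one at a time from the search bar while watching memory. The first is the outer search with the join removed; the second is the contents of the join brackets on their own. The dashboard tokens ($src_ip$, $dst_ip$, $src_user$, $vsys$, $app$) are left out here on the assumption that they expand to nothing when no filter is selected; substitute literal filters if that is not the case in your view.

```
|`tstats` count(dst_ip) AS cdip FROM malware_traffic_pattern WHERE * NOT (protocol=udp AND dst_port=53) groupby dst_ip dst_port report_id protocol
| table report_id dst_ip dst_port protocol

|`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) groupby _time src_ip dst_ip dst_port protocol app src_user
| dedup 1 src_ip dst_ip dst_port protocol app src_user
| table _time src_ip src_user dst_port dst_ip protocol app
| rename _time AS traffic_time
```

Run each with the same time range the TimeRangePicker would apply, so the comparison with the dashboard is fair.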

Whatever you find out, I would suggest opening a case with Splunk Support to report this issue and your findings. If at all possible, provide a reproducible test case, ideally with sample events / TSIDX data stores that are sufficient to reproduce the problem.

For reference, here's the search string:

|`tstats` count(dst_ip) AS cdip FROM malware_traffic_pattern WHERE * NOT (protocol=udp AND dst_port=53) groupby dst_ip dst_port report_id protocol
| table report_id dst_ip dst_port protocol
| join protocol dst_ip dst_port [ 
      |`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) $src_ip$ $dst_ip$ $src_user$ $vsys$ $app$ groupby _time src_ip dst_ip dst_port protocol app src_user
      | dedup 1 src_ip dst_ip dst_port protocol app src_user
      | table _time src_ip src_user dst_port dst_ip protocol app
      | rename _time AS traffic_time] 
| rename src_user AS "User"
| rename src_ip AS "Source IP"
| table traffic_time "Source IP" "User" dst_ip dst_port protocol app report_id
| rename traffic_time AS _time
| rename dst_ip AS "Dst_IP"
| rename dst_port AS "Dst_Port"
| rename protocol AS "Protocol"
| rename app AS "Application"
| rename report_id AS "Report ID"

PS: My money is on "dedup" against 6 distinct fields.
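One way to check that guess (a sketch, not a confirmed diagnosis) is to run the inner search with the dedup stage swapped for a plain row count. Here `stats count` simply reports how many grouped rows the tstats stage hands downstream; because the groupby includes _time, that number may be far larger than the number of distinct six-field combinations. The dashboard tokens are omitted, on the assumption that they are empty when no filter is set.

```
|`tstats` count(src_ip) FROM all_traffic WHERE * (NOT (protocol=udp AND dst_port=53)) groupby _time src_ip dst_ip dst_port protocol app src_user
| stats count
```

If memory stays flat with this version but climbs once the dedup line is put back, that would point at dedup rather than at the tstats stage or the join.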
