Rapid7 TA Python script gets stuck and blocks next job until killed

guilmxm
SplunkTrust

Hi,

I recently implemented the TA for Rapid7 for a customer of mine, and ran into issues with the backend Python script.

A few days after the initial deployment, I realised that assets and vulnerabilities were no longer being ingested, while the other job (vuln exceptions) kept working fine day after day.

I found out that the nexpose Python script systematically gets stuck at the end of its execution: it generates the data properly, but it never terminates and remains running on the machine.

Killing the process allows the next execution to work properly as expected.
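For anyone wanting to confirm the same symptom, a quick way is to list running instances of the script together with their elapsed time (the `[r]` in the grep pattern is just a trick to keep grep from matching its own process line):

```shell
# Show any running rapid7nexpose.py processes with their elapsed time (etime);
# a stuck instance shows up with an etime far beyond the normal job duration.
ps -eo pid,etime,args | grep "[r]apid7nexpose.py" || echo "no rapid7nexpose.py process found"
```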

Logs clearly show that the script remains stuck at the last query:

2018-10-26 02:15:34,873 INFO    nx_logger:38 - Processing asset report for site(s) <['158']>
2018-10-26 02:15:34,892 INFO    nx_logger:38 - Finished processing asset report for site(s) <['158']>
2018-10-26 02:16:54,924 INFO    nx_logger:38 - Connecting Nexpose client
2018-10-26 02:16:54,988 INFO    nx_logger:38 - Executing vuln query for site(s) <['158']>
2018-10-26 02:16:54,989 INFO    nx_logger:38 - In AdHoc generate
2018-10-26 02:16:54,989 INFO    nx_logger:38 - Making Query:
<ReportAdhocGenerateRequest session-id="0AFAE635E5D96FE30C40579A14EF63DA135303F9" sync-id="22499"><AdhocReportConfig format="sql"><Filters><filter type="version" id="2.3.0"/><filter type="query" id="SELECT asset_id, da.ip_address, da.mac_address, site_id, &#10;                          favf.vulnerability_instances, favf.vulnerability_id, &#10;                          fasva.first_discovered, fasva.most_recently_discovered, dv.title, dv.severity, dvc.categories, &#10;                          dve.skill_levels, dvr.sources, favf.scan_id, &#10;                          dv.cvss_score, dv.date_added, dv.cvss_vector&#10;                    from dim_site_asset&#10;                    RIGHT OUTER JOIN (select favf.asset_id, favf.vulnerability_instances, favf.vulnerability_id, favf.scan_id FROM fact_asset_vulnerability_finding favf) favf USING (asset_id)&#10;                    LEFT OUTER JOIN (select dv.vulnerability_id, dv.title, dv.severity, dv.cvss_score, dv.cvss_vector, dv.date_added FROM dim_vulnerability dv) dv USING (vulnerability_id)&#10;                    LEFT OUTER JOIN (select dvc.vulnerability_id, (string_agg(DISTINCT '<' || dvc.category_name, '>') || '>') as categories FROM dim_vulnerability_category dvc GROUP BY dvc.vulnerability_id) dvc USING (vulnerability_id)&#10;                    LEFT OUTER JOIN (select dve.vulnerability_id, (string_agg(DISTINCT '<' || dve.skill_level, '>') || '>') as skill_levels FROM dim_vulnerability_exploit dve GROUP BY dve.vulnerability_id) dve USING (vulnerability_id)&#10;                    LEFT OUTER JOIN (select dvr.vulnerability_id, (string_agg(DISTINCT '<' || dvr.source || ':' || dvr.reference,'>') || '>') as sources FROM dim_vulnerability_reference dvr GROUP BY dvr.vulnerability_id) dvr USING (vulnerability_id)&#10;                    LEFT OUTER JOIN (select fasva.asset_id, fasva.vulnerability_id, fasva.first_discovered, fasva.most_recently_discovered FROM fact_asset_vulnerability_age fasva) fasva USING(asset_id, 
vulnerability_id) &#10;                    LEFT OUTER JOIN (select da.asset_id, da.ip_address, da.mac_address FROM dim_asset da) da USING (asset_id)&#10;&#10;                    WHERE site_id=158&#10;&#10;                    GROUP BY asset_id, da.ip_address, da.mac_address, fasva.first_discovered, fasva.most_recently_discovered, site_id, favf.scan_id, favf.vulnerability_id, favf.vulnerability_instances, dv.title, dv.vulnerability_id, dv.severity, dvc.categories, dve.skill_levels, dvr.sources, dv.cvss_score, dv.cvss_vector, dv.date_added&#10;                    "/><filter type="site" id="158"/></Filters></AdhocReportConfig></ReportAdhocGenerateRequest>
2018-10-26 02:17:14,996 INFO    nx_logger:38 - Connecting Nexpose client
2018-10-26 02:17:15,069 INFO    nx_logger:38 - Executing asset query for site(s) <['213']>
2018-10-26 02:17:15,069 INFO    nx_logger:38 - In AdHoc generate
2018-10-26 02:17:15,069 INFO    nx_logger:38 - Making Query:
<ReportAdhocGenerateRequest session-id="5C663DF004F33281B210A0B1EE8B062C34F859B5" sync-id="49944"><AdhocReportConfig format="sql"><Filters><filter type="version" id="2.3.0"/><filter type="query" id="SELECT dsa.asset_id as asset_id, dsa.site_id as site_id, &#10;                    ds.site_name, da.mac_address, da.ip_address, da.host_name,&#10;                    da.operating_system_id, da.host_type_id, &#10;                    dos.os_description, dos.architecture, dos.system, dos.cpe,&#10;                    dht.host_description, dagc.asset_group_accounts, &#10;                    fa.vulnerabilities, fa.critical_vulnerabilities, &#10;                    fa.severe_vulnerabilities, fa.moderate_vulnerabilities, &#10;                    fa.malware_kits, fa.exploits, fa.vulnerability_instances, &#10;                    fa.riskscore, fa.pci_status, dsoft.installed_software, &#10;                    dserv.services, dserv.protocols, dta.tags, &#10;                    dta.tag_association, fa.scan_finished, fad.last_discovered&#10;                    from dim_site_asset dsa&#10;                    JOIN (select asset_id, last_discovered FROM fact_asset_discovery) fad USING (asset_id)&#10;                    LEFT OUTER JOIN (select da.asset_id, da.ip_address, da.mac_address, da.host_name, da.operating_system_id, da.host_type_id FROM dim_asset da) da USING (asset_id)&#10;                    LEFT OUTER JOIN (select dos.operating_system_id, dos.description as os_description, dos.architecture, dos.system, dos.cpe FROM dim_operating_system dos) dos using (operating_system_id)&#10;                    LEFT OUTER JOIN (select dht.host_type_id, dht.description as host_description FROM dim_host_type dht) dht using (host_type_id)&#10;                    LEFT OUTER JOIN (select dagc.asset_id, (string_agg(DISTINCT '<' || dagc.name, '>') || '>') as asset_group_accounts FROM dim_asset_group_account dagc GROUP BY dagc.asset_id) dagc USING (asset_id)&#10;                    LEFT OUTER JOIN 
(select fa.asset_id, fa.vulnerabilities, fa.scan_finished, fa.critical_vulnerabilities, fa.severe_vulnerabilities, fa.moderate_vulnerabilities, fa.malware_kits, fa.exploits, fa.vulnerability_instances, fa.riskscore, fa.pci_status FROM fact_asset fa) fa USING (asset_id)&#10;                    LEFT OUTER JOIN (select dasoft.asset_id, (string_agg(DISTINCT '<' || dsoft.name, '>') || '>') as installed_software&#10;                    FROM dim_asset_software dasoft&#10;                    JOIN dim_software dsoft on dasoft.software_id = dsoft.software_id&#10;                    GROUP BY dasoft.asset_id) dsoft using (asset_id)&#10;                    LEFT OUTER JOIN (select daserv.asset_id, (string_agg(DISTINCT '<' || dserv.name, '>') || '>') as services, (string_agg(DISTINCT '<' || dp.name, '>') || '>') as protocols&#10;                    FROM dim_asset_service daserv&#10;                    JOIN dim_service dserv on daserv.service_id = dserv.service_id&#10;                    JOIN dim_protocol dp on daserv.protocol_id = dp.protocol_id&#10;                    GROUP BY daserv.asset_id) dserv using (asset_id)&#10;                    LEFT OUTER JOIN (select dta.asset_id, (string_agg(DISTINCT '<' || dta.association, '>') || '>') as tag_association, (string_agg(DISTINCT '<' || dt.tag_name, '>') || '>') as tags&#10;                    FROM dim_tag_asset dta&#10;                    JOIN dim_tag dt ON  dta.tag_id = dt.tag_id&#10;                    GROUP BY dta.asset_id) dta USING (asset_id)&#10;                    LEFT OUTER JOIN (select ds.site_id, ds.name as site_name FROM dim_site ds) ds on (ds.site_id = dsa.site_id)&#10;                    &#10;                    WHERE dsa.site_id=213&#10;&#10;                    GROUP BY asset_id, dsa.site_id, ds.site_name, mac_address, da.ip_address, &#10;                      host_name, operating_system_id, host_type_id, dos.os_description, dos.architecture, dos.system, dos.cpe, dht.host_description, &#10;                      
dagc.asset_group_accounts, fa.vulnerabilities, fa.critical_vulnerabilities, fa.severe_vulnerabilities, fa.moderate_vulnerabilities, &#10;                      fa.malware_kits, fa.exploits, fa.vulnerability_instances, fa.riskscore, fa.pci_status, dsoft.installed_software, dserv.services, dserv.protocols, dta.tags, dta.tag_association, fa.scan_finished, fad.last_discovered&#10;                    "/><filter type="site" id="213"/></Filters></AdhocReportConfig></ReportAdhocGenerateRequest>

I put in place a "workaround" that works perfectly: a small scheduled shell script that checks the age of the process and kills it once it exceeds a static limit in hours. Since then, all the data is ingested with no problem at all.

Still, there is an issue with the TA's script that should be fixed.

Splunk version: 7.0.x
Guest OS: CentOS 7.x
TA version: 1.1.8

For the record, for anyone who needs it, here is the simple shell script I am using:

#!/bin/sh

# set -x

# Program name: rapid7_clean.sh
# Purpose - Clean rapid7 stuck processes
# Author - Guilhem Marchand

# Version 1.0

#################################################
##  Your Customizations Go Here            ##
#################################################

# format date output to strftime dd/mm/YYYY HH:MM:SS
log_date () {
    date "+%d-%m-%Y %H:%M:%S"
}

# hostname
HOST=`hostname`

if [ -z "${SPLUNK_HOME}" ]; then
    echo "`log_date`, ${HOST} ERROR, SPLUNK_HOME variable is not defined"
    exit 1
fi

####################################################################
#############       Main Program            ############
####################################################################

###### Maintenance tasks ######

#
# Maintenance task1
#

# Nexpose rapid7 stuck processes: somehow the rapid7 nexpose Python script gets stuck and
# never terminates, blocking the next scheduled execution until it is killed

    # maximal time in seconds the process is allowed to be in machine
    endtime=7200

    echo "`log_date`, ${HOST} INFO, starting maintenance task 1: verify rapid7 stuck processes (processes in machine for more than $endtime seconds)"

    # get the list of running processes
    # get the list of PIDs of running rapid7nexpose.py processes
    oldPidList=`ps -eo user,pid,command,etime,args | grep "splunk" | grep "$SPLUNK_HOME/etc/apps/TA-rapid7_nexpose/bin/rapid7nexpose.py" | grep -v rapid7_clean.sh | grep -v grep | awk '{ print $2 }'`

    if [ -n "$oldPidList" ]; then

        for pid in $oldPidList; do

            pid_runtime=0
            # only proceed if the process is still running
            if [ -d /proc/${pid} ]; then
                # get the process runtime in seconds

                pid_runtime=`ps -p ${pid} -oetime= | tr '-' ':' | awk -F: '{ total=0; m=1; } { for (i=0; i < NF; i++) {total += $(NF-i)*m; m *= i >= 2 ? 24 : 60 }} {print total}'`

                # additional protection
                case ${pid_runtime} in
                "")
                 ;;
                *)
                 if [ ${pid_runtime} -gt ${endtime} ]; then
                     echo "`log_date`, ${HOST} WARN, old process found due to: `ps auxwww | grep $pid | grep -v grep` killing (SIGTERM) process $pid"
                     kill $pid

                     # Allow some time for the process to end
                     sleep 5

                     # re-check the status
                     ps -p ${pid} -oetime= >/dev/null

                     if [ $? -eq 0 ]; then
                         echo "`log_date`, ${HOST} WARN, old process found due to: `ps auxwww | grep $pid | grep -v grep` failed to stop, killing (-9) process $pid"
                         kill -9 $pid
                     fi

                 fi
                ;;
                esac
            fi

        done

    fi

###### End maintenance tasks ######

exit 0
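If you adapt this, the etime-to-seconds conversion is the trickiest part: `ps` prints etime as `[[dd-]hh:]mm:ss`, the `tr` turns the day separator `-` into `:`, and awk walks the fields right-to-left, multiplying by 60, 60, then 24. It can be sanity-checked in isolation (sample values below are illustrative):

```shell
#!/bin/sh
# Same etime-to-seconds conversion as used in rapid7_clean.sh above.
to_seconds() {
    echo "$1" | tr '-' ':' | awk -F: '{ total=0; m=1; } { for (i=0; i < NF; i++) {total += $(NF-i)*m; m *= i >= 2 ? 24 : 60 }} {print total}'
}

to_seconds "05:30"       # 5m 30s       -> 330
to_seconds "02:00:00"    # 2h           -> 7200
to_seconds "1-02:03:04"  # 1d 2h 3m 4s  -> 93784
```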

Thank you

1 Solution

guilmxm
SplunkTrust

See the script above; not a very satisfying solution, but it works.


manderson7
Contributor

I'm seeing the exact same issue. Thank you very much for your script, I'm sure it will do what we need.
How often do you let the script run?
Thank you


guilmxm
SplunkTrust

Hi @manderson7

It was scheduled at a 4-hour interval, to be gentle.
For us that definitely did the trick; we never had to touch this stuff again and never had any issues at all.
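If it helps, a crontab entry for the splunk user along these lines gives the 4-hour interval (the script path and log path here are only examples; adjust them to your deployment, and note the script needs SPLUNK_HOME set):

```shell
# m h dom mon dow  command  -- run the cleanup every 4 hours (paths illustrative)
0 */4 * * * SPLUNK_HOME=/opt/splunk /opt/splunk/scripts/rapid7_clean.sh >> /opt/splunk/var/log/rapid7_clean.log 2>&1
```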

Guilhem

manderson7
Contributor

Thanks! You may want to move your script from your edit to an answer to show that this has been resolved.


guilmxm
SplunkTrust

Sort of; it's more a workaround than a fix... ideally the maintainer will see this message and look into it.
