Hi,
I recently implemented the TA for Rapid7 for one of my customers, and ran into issues relating to the backend Python script.
Some days after the initial deployment, I realised that assets and vulnerabilities were no longer being ingested, while the other job (vuln exceptions) kept working fine day after day.
I found out that the nexpose Python script systematically gets stuck at the end of its execution: it generates its data properly, but never terminates and remains running on the machine.
Killing the process allows the next execution to work as expected.
Logs clearly show that the script remains stuck at the last query:
2018-10-26 02:15:34,873 INFO nx_logger:38 - Processing asset report for site(s) <['158']>
2018-10-26 02:15:34,892 INFO nx_logger:38 - Finished processing asset report for site(s) <['158']>
2018-10-26 02:16:54,924 INFO nx_logger:38 - Connecting Nexpose client
2018-10-26 02:16:54,988 INFO nx_logger:38 - Executing vuln query for site(s) <['158']>
2018-10-26 02:16:54,989 INFO nx_logger:38 - In AdHoc generate
2018-10-26 02:16:54,989 INFO nx_logger:38 - Making Query:
<ReportAdhocGenerateRequest session-id="0AFAE635E5D96FE30C40579A14EF63DA135303F9" sync-id="22499"><AdhocReportConfig format="sql"><Filters><filter type="version" id="2.3.0"/><filter type="query" id="SELECT asset_id, da.ip_address, da.mac_address, site_id, favf.vulnerability_instances, favf.vulnerability_id, fasva.first_discovered, fasva.most_recently_discovered, dv.title, dv.severity, dvc.categories, dve.skill_levels, dvr.sources, favf.scan_id, dv.cvss_score, dv.date_added, dv.cvss_vector from dim_site_asset RIGHT OUTER JOIN (select favf.asset_id, favf.vulnerability_instances, favf.vulnerability_id, favf.scan_id FROM fact_asset_vulnerability_finding favf) favf USING (asset_id) LEFT OUTER JOIN (select dv.vulnerability_id, dv.title, dv.severity, dv.cvss_score, dv.cvss_vector, dv.date_added FROM dim_vulnerability dv) dv USING (vulnerability_id) LEFT OUTER JOIN (select dvc.vulnerability_id, (string_agg(DISTINCT '<' || dvc.category_name, '>') || '>') as categories FROM dim_vulnerability_category dvc GROUP BY dvc.vulnerability_id) dvc USING (vulnerability_id) LEFT OUTER JOIN (select dve.vulnerability_id, (string_agg(DISTINCT '<' || dve.skill_level, '>') || '>') as skill_levels FROM dim_vulnerability_exploit dve GROUP BY dve.vulnerability_id) dve USING (vulnerability_id) LEFT OUTER JOIN (select dvr.vulnerability_id, (string_agg(DISTINCT '<' || dvr.source || ':' || dvr.reference,'>') || '>') as sources FROM dim_vulnerability_reference dvr GROUP BY dvr.vulnerability_id) dvr USING (vulnerability_id) LEFT OUTER JOIN (select fasva.asset_id, fasva.vulnerability_id, fasva.first_discovered, fasva.most_recently_discovered FROM fact_asset_vulnerability_age fasva) fasva USING(asset_id, vulnerability_id) LEFT OUTER JOIN (select da.asset_id, da.ip_address, da.mac_address FROM dim_asset da) da USING (asset_id) WHERE site_id=158 GROUP BY asset_id, da.ip_address, da.mac_address, fasva.first_discovered, fasva.most_recently_discovered, site_id, favf.scan_id, favf.vulnerability_id, 
favf.vulnerability_instances, dv.title, dv.vulnerability_id, dv.severity, dvc.categories, dve.skill_levels, dvr.sources, dv.cvss_score, dv.cvss_vector, dv.date_added "/><filter type="site" id="158"/></Filters></AdhocReportConfig></ReportAdhocGenerateRequest>
2018-10-26 02:17:14,996 INFO nx_logger:38 - Connecting Nexpose client
2018-10-26 02:17:15,069 INFO nx_logger:38 - Executing asset query for site(s) <['213']>
2018-10-26 02:17:15,069 INFO nx_logger:38 - In AdHoc generate
2018-10-26 02:17:15,069 INFO nx_logger:38 - Making Query:
<ReportAdhocGenerateRequest session-id="5C663DF004F33281B210A0B1EE8B062C34F859B5" sync-id="49944"><AdhocReportConfig format="sql"><Filters><filter type="version" id="2.3.0"/><filter type="query" id="SELECT dsa.asset_id as asset_id, dsa.site_id as site_id, ds.site_name, da.mac_address, da.ip_address, da.host_name, da.operating_system_id, da.host_type_id, dos.os_description, dos.architecture, dos.system, dos.cpe, dht.host_description, dagc.asset_group_accounts, fa.vulnerabilities, fa.critical_vulnerabilities, fa.severe_vulnerabilities, fa.moderate_vulnerabilities, fa.malware_kits, fa.exploits, fa.vulnerability_instances, fa.riskscore, fa.pci_status, dsoft.installed_software, dserv.services, dserv.protocols, dta.tags, dta.tag_association, fa.scan_finished, fad.last_discovered from dim_site_asset dsa JOIN (select asset_id, last_discovered FROM fact_asset_discovery) fad USING (asset_id) LEFT OUTER JOIN (select da.asset_id, da.ip_address, da.mac_address, da.host_name, da.operating_system_id, da.host_type_id FROM dim_asset da) da USING (asset_id) LEFT OUTER JOIN (select dos.operating_system_id, dos.description as os_description, dos.architecture, dos.system, dos.cpe FROM dim_operating_system dos) dos using (operating_system_id) LEFT OUTER JOIN (select dht.host_type_id, dht.description as host_description FROM dim_host_type dht) dht using (host_type_id) LEFT OUTER JOIN (select dagc.asset_id, (string_agg(DISTINCT '<' || dagc.name, '>') || '>') as asset_group_accounts FROM dim_asset_group_account dagc GROUP BY dagc.asset_id) dagc USING (asset_id) LEFT OUTER JOIN (select fa.asset_id, fa.vulnerabilities, fa.scan_finished, fa.critical_vulnerabilities, fa.severe_vulnerabilities, fa.moderate_vulnerabilities, fa.malware_kits, fa.exploits, fa.vulnerability_instances, fa.riskscore, fa.pci_status FROM fact_asset fa) fa USING (asset_id) LEFT OUTER JOIN (select dasoft.asset_id, (string_agg(DISTINCT '<' || dsoft.name, '>') || '>') as installed_software FROM dim_asset_software dasoft 
JOIN dim_software dsoft on dasoft.software_id = dsoft.software_id GROUP BY dasoft.asset_id) dsoft using (asset_id) LEFT OUTER JOIN (select daserv.asset_id, (string_agg(DISTINCT '<' || dserv.name, '>') || '>') as services, (string_agg(DISTINCT '<' || dp.name, '>') || '>') as protocols FROM dim_asset_service daserv JOIN dim_service dserv on daserv.service_id = dserv.service_id JOIN dim_protocol dp on daserv.protocol_id = dp.protocol_id GROUP BY daserv.asset_id) dserv using (asset_id) LEFT OUTER JOIN (select dta.asset_id, (string_agg(DISTINCT '<' || dta.association, '>') || '>') as tag_association, (string_agg(DISTINCT '<' || dt.tag_name, '>') || '>') as tags FROM dim_tag_asset dta JOIN dim_tag dt ON dta.tag_id = dt.tag_id GROUP BY dta.asset_id) dta USING (asset_id) LEFT OUTER JOIN (select ds.site_id, ds.name as site_name FROM dim_site ds) ds on (ds.site_id = dsa.site_id) WHERE dsa.site_id=213 GROUP BY asset_id, dsa.site_id, ds.site_name, mac_address, da.ip_address, host_name, operating_system_id, host_type_id, dos.os_description, dos.architecture, dos.system, dos.cpe, dht.host_description, dagc.asset_group_accounts, fa.vulnerabilities, fa.critical_vulnerabilities, fa.severe_vulnerabilities, fa.moderate_vulnerabilities, fa.malware_kits, fa.exploits, fa.vulnerability_instances, fa.riskscore, fa.pci_status, dsoft.installed_software, dserv.services, dserv.protocols, dta.tags, dta.tag_association, fa.scan_finished, fad.last_discovered "/><filter type="site" id="213"/></Filters></AdhocReportConfig></ReportAdhocGenerateRequest>
I have made a workaround that works perfectly: a small scheduled shell script that checks the process age and kills the process once it exceeds a static limit in hours. Since putting it in place, all the data is ingested with no problem at all.
But the underlying issue in the TA's Python script should still be fixed.
Splunk version: 7.0.x
Guest OS: CentOS 7.x
TA version: 1.1.8
For the record, for anyone who needs it, here is the simple shell script I am using:
#!/bin/sh
# set -x
# Program name: rapid7_clean.sh
# Purpose - Clean rapid7 stuck processes
# Author - Guilhem Marchand
# Version 1.0
#################################################
## Your Customizations Go Here ##
#################################################
# format date output to strftime dd-mm-YYYY HH:MM:SS
log_date () {
    date "+%d-%m-%Y %H:%M:%S"
}
# hostname
HOST=`hostname`
if [ -z "${SPLUNK_HOME}" ]; then
    echo "`log_date`, ${HOST} ERROR, SPLUNK_HOME variable is not defined"
    exit 1
fi
####################################################################
############# Main Program ############
####################################################################
###### Maintenance tasks ######
#
# Maintenance task1
#
# Nexpose rapid7 stuck processes: somehow the rapid7 nexpose Python script gets stuck and
# never terminates; detect and kill any such process
# maximal time in seconds the process is allowed to be running on the machine
endtime=7200
echo "`log_date`, ${HOST} INFO, starting maintenance task 1: verify rapid7 stuck processes (processes running for more than $endtime seconds)"
# get the list of running processes
oldPidList=`ps -eo user,pid,etime,args | grep "splunk" | grep "$SPLUNK_HOME/etc/apps/TA-rapid7_nexpose/bin/rapid7nexpose.py" | grep -v rapid7_clean.sh | grep -v grep | awk '{ print $2 }'`
if [ -n "$oldPidList" ]; then
    for pid in $oldPidList; do
        pid_runtime=0
        # only proceed if the process is still running
        if [ -d /proc/${pid} ]; then
            # get the process runtime in seconds ([[dd-]hh:]mm:ss converted to seconds)
            pid_runtime=`ps -p ${pid} -o etime= | tr '-' ':' | awk -F: '{ total=0; m=1; for (i=0; i < NF; i++) { total += $(NF-i)*m; m *= i >= 2 ? 24 : 60 } print total }'`
            # additional protection against an empty etime value
            case ${pid_runtime} in
            "")
                ;;
            *)
                if [ ${pid_runtime} -gt ${endtime} ]; then
                    echo "`log_date`, ${HOST} WARN, old process found due to: `ps auxwww | grep $pid | grep -v grep` killing (SIGTERM) process $pid"
                    kill $pid
                    # allow some time for the process to end
                    sleep 5
                    # re-check the status, escalate to SIGKILL if still alive
                    if ps -p ${pid} -o etime= >/dev/null; then
                        echo "`log_date`, ${HOST} WARN, old process found due to: `ps auxwww | grep $pid | grep -v grep` failed to stop, killing (-9) process $pid"
                        kill -9 $pid
                    fi
                fi
                ;;
            esac
        fi
    done
fi
###### End maintenance tasks ######
exit 0
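As a side note, the etime-to-seconds conversion used above works because `ps` prints elapsed time as `[[dd-]hh:]mm:ss`; replacing the dash with a colon yields uniform colon-separated fields that can be folded right-to-left with growing multipliers (60, 60, 24). A minimal standalone sketch of that logic (the sample values are illustrative, not from a real process):

```shell
#!/bin/sh
# convert a ps etime value ([[dd-]hh:]mm:ss) to seconds by folding the
# fields right-to-left: seconds*1, minutes*60, hours*3600, days*86400
etime_to_seconds () {
    echo "$1" | tr '-' ':' | awk -F: '{ total=0; m=1; for (i=0; i < NF; i++) { total += $(NF-i)*m; m *= i >= 2 ? 24 : 60 } print total }'
}

etime_to_seconds "1-02:03:04"   # prints 93784 (1 day, 2 h, 3 min, 4 s)
etime_to_seconds "05:30"        # prints 330 (5 min, 30 s)
```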
Thank you
See the script above; not a very satisfying solution, but it works.
I'm seeing the exact same issue. Thank you very much for your script, I'm sure it will do what we need.
How often do you let the script run?
Thank you
Hi @manderson7
It was scheduled with a 4-hour interval, to be gentle.
For us it definitely did the trick: we never had to touch this again and never had any issues at all.
Guilhem
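For reference, a 4-hour interval for this kind of cleanup can be expressed as a simple cron entry; the paths below are examples only and must be adapted to your installation:

```shell
# hypothetical crontab entry for the splunk user: run the cleanup every 4 hours
# SPLUNK_HOME is set inline because the script refuses to run without it
0 */4 * * * SPLUNK_HOME=/opt/splunk /opt/splunk/etc/apps/TA-rapid7_nexpose/bin/rapid7_clean.sh >> /opt/splunk/var/log/splunk/rapid7_clean.log 2>&1
```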
Thanks! You may want to move your script from your edit to an answer to show that this has been resolved.
Sort of, it's more of a workaround than a fix... ideally the maintainer would see this message and look into it.
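Until that happens, a more generic mitigation, assuming coreutils is available (it is on CentOS 7), would be to wrap the long-running command with `timeout`, which sends SIGTERM at a deadline and can escalate to SIGKILL; where exactly to hook this into the TA's invocation is left to the maintainer, so the command below is only a placeholder:

```shell
#!/bin/sh
# sketch: enforce a 2-hour hard deadline on a command; SIGTERM at the
# deadline, SIGKILL 30 seconds later if the process ignores SIGTERM.
# "echo" stands in for the real long-running command.
timeout --kill-after=30 7200 echo "long-running command finished"

# timeout exits with the command's own status on success, 124 on a timeout
timeout 1 sleep 3 || echo "command was killed after the deadline"
```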