Splunk Search

Issues with Splunk data collection using the Python splunk-sdk package

bergen288
Engager

I experienced the following 3 issues when collecting Splunk data with the Python splunk-sdk package.

The 1st issue: during peak hours (10 AM to 4 PM), I may hit the following error. How do I increase concurrency_limit to avoid it? Is concurrency_limit something to modify on the Splunk server?

splunklib.binding.HTTPError: HTTP 503 Service Unavailable -- Search not executed: The maximum number of concurrent historical searches on this instance has been reached., concurrency_category="historical", concurrency_context="instance-wide", current_concurrency=52, concurrency_limit=52
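For what it's worth, my understanding is that the instance-wide historical search limit is derived from settings in limits.conf on the search head (admin access required), roughly max_searches_per_cpu x number_of_CPUs + base_max_searches. A sketch of the relevant stanza, with the default values (illustrative, not recommendations):

```
# limits.conf on the search head -- values shown are the defaults
[search]
# concurrent historical searches ~= max_searches_per_cpu x #CPUs + base_max_searches
base_max_searches = 6
max_searches_per_cpu = 1
```

Raising these trades search concurrency for CPU/memory pressure, so it is usually something a Splunk admin tunes rather than the client side.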


bergen288
Engager

The 3rd issue is data duplicates.  Below is my Python program to collect one hour of data (11 PM to midnight on 11/25) and load it into a Pandas dataframe, sorted by _time with the index as input order.  As you can see in the attached screenshot of the CSV file, there are 184 lines in total, but 88 of them are duplicates.  Although I can use df.drop_duplicates() to drop them, that is not the most efficient way.  Does splunk-sdk have an option to prevent this kind of duplicate?

import pandas as pd
import splunklib.results as results

SEARCH_STRING = """
    search index=pivotal  cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
    earliest="11/25/2021:23:00:00" latest="11/25/2021:24:00:00"
    | eval Message='msg.Properties.LoggingTemplate.Message'
    | eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
    | eval PayeeName='msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
    | sort _time
    | table _time,Message,SessionId,PayeeName
"""
dt_string = "2021_11_25_23"
COLUMNS = ['_time', 'Message', 'SessionId', 'PayeeName']
service = connect_Splunk()
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        pass  # skip diagnostic messages from the server
    elif isinstance(result, dict) and result:
        ord_list.append(result)  # normal events are returned as dicts
if ord_list:
    # Build the frame from the dicts themselves so each value lands in the
    # right column even if the server returns fields in a different order.
    df = pd.DataFrame(ord_list, columns=COLUMNS)
    df = df.sort_values(by=['_time'])
    print('Rows before drop duplicates', df.shape[0])
    df_nodup = df.drop_duplicates()
    print('Rows after drop duplicates', df_nodup.shape[0])
    OUT = f'../data/splunk_cfn_{dt_string}.csv'
    df_nodup.to_csv(OUT)  # write the deduplicated frame, not the original
else:
    print('No valid data available in this period.')
del service
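On the duplicates: with jobs.export the stream can include preview result sets before the final one, and (if I read the SDK correctly) results.ResultsReader exposes an is_preview flag you could check to skip previews. Independent of that, drop_duplicates on this kind of data is cheap. A minimal, Splunk-free sketch with made-up rows standing in for the events the SDK returns:

```python
import pandas as pd

# Hypothetical rows mimicking events from splunk-sdk; the repeated row
# imitates what an interleaved preview + final result set can produce.
rows = [
    {"_time": "2021-11-25T23:00:01", "Message": "PostPayeeAsync ok",
     "SessionId": "s1", "PayeeName": "Acme"},
    {"_time": "2021-11-25T23:00:01", "Message": "PostPayeeAsync ok",
     "SessionId": "s1", "PayeeName": "Acme"},
    {"_time": "2021-11-25T23:05:42", "Message": "PostPayeeAsync ok",
     "SessionId": "s2", "PayeeName": "Globex"},
]

df = pd.DataFrame(rows)
# Drop exact duplicate rows, keep the first occurrence, restore a clean index.
df_nodup = df.drop_duplicates().sort_values("_time").reset_index(drop=True)
print(len(df), len(df_nodup))  # 3 2
```

Building the DataFrame directly from the list of dicts (rather than from `.values()`) also avoids any dependence on dict key order.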

bergen288
Engager

The 2nd issue is a connection reset error when trying to collect a whole day's data over one Splunk connection.  My work-around is to collect one hour of data per connection.  It would be nice to resolve the connection reset error so that I can collect a whole day in one session.  Is this something to modify on the Splunk server or inside the Python splunk-sdk package?

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
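Until the reset is resolved server-side, the hourly work-around can at least be automated. A minimal sketch that generates (earliest, latest) pairs in the same %m/%d/%Y:%H:%M:%S format used in the search string above (the actual per-window query call is omitted):

```python
from datetime import datetime, timedelta

def hourly_windows(day: str):
    """Yield (earliest, latest) string pairs covering one day in 1-hour chunks."""
    start = datetime.strptime(day, "%m/%d/%Y")
    fmt = "%m/%d/%Y:%H:%M:%S"
    for h in range(24):
        lo = start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        yield lo.strftime(fmt), hi.strftime(fmt)

windows = list(hourly_windows("11/25/2021"))
print(windows[0])   # ('11/25/2021:00:00:00', '11/25/2021:01:00:00')
print(windows[-1])  # ('11/25/2021:23:00:00', '11/26/2021:00:00:00')
```

Each pair can be interpolated into the earliest=/latest= terms of the search, one Splunk connection per window, and the 24 partial frames concatenated afterwards.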
