Splunk Search

Issues for Splunk data collection with Python splunk-sdk package

bergen288
Engager

I ran into the following three issues while collecting Splunk data with the Python splunk-sdk package.

The 1st issue: during peak hours, from 10 AM to 4 PM, I may hit the following error. How do I increase concurrency_limit to avoid it? Is concurrency_limit something that has to be changed on the Splunk server?

splunklib.binding.HTTPError: HTTP 503 Service Unavailable -- Search not executed: The maximum number of concurrent historical searches on this instance has been reached., concurrency_category="historical", concurrency_context="instance-wide", current_concurrency=52, concurrency_limit=52
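As far as I understand, the instance-wide historical search concurrency is computed on the search head from limits.conf settings (roughly max_searches_per_cpu × number of CPUs + base_max_searches), so raising it is a server-side admin change and can overload the instance. A client-side alternative is to retry with exponential backoff when the 503 comes back. A minimal sketch, where run_with_backoff is a hypothetical helper and the retryable exception type would be whatever your SDK version raises (splunklib.binding.HTTPError here):

```python
import time

def run_with_backoff(fn, retries=5, base_delay=2.0, retryable=(Exception,)):
    """Call fn(); on a retryable error (e.g. an HTTP 503 from Splunk),
    sleep with exponential backoff and try again."""
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries; re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch, assuming service and SEARCH_STRING are already defined:
# rr = run_with_backoff(lambda: service.jobs.export(SEARCH_STRING),
#                       retryable=(splunklib.binding.HTTPError,))
```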


bergen288
Engager

The 3rd issue is duplicate data. Below is my Python program to collect one hour of data, from 11 PM to midnight on 11/25, and load it into a Pandas dataframe. It is sorted by _time, and the index is the input order. As you can see in the attached screenshot of the CSV file, there are 184 lines in total, but 88 of them are duplicates. Although I can use df.drop_duplicates() to remove them, that is not the most efficient way. Does splunk-sdk have an option to prevent this kind of duplication?

import pandas as pd
import splunklib.results as results

SEARCH_STRING = """
    search index=pivotal  cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
    earliest="11/25/2021:23:00:00" latest="11/25/2021:24:00:00"
    | eval Message='msg.Properties.LoggingTemplate.Message'
    | eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
    | eval PayeeName='msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
    | sort _time
    | table _time,Message,SessionId,PayeeName
"""
dt_string = "2021_11_25_23"
TABLE = '_time,Message,SessionId,PayeeName'
COLUMNS = TABLE.split(',')
service = connect_Splunk()  # helper that returns an authenticated splunklib Service
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        # Server diagnostic messages; skip them
        pass
    elif isinstance(result, dict) and result:
        # Normal events are returned as non-empty dicts
        ord_list.append(result)
if ord_list:
    df = pd.DataFrame([k.values() for k in ord_list], columns=COLUMNS)
    df = df.sort_values(by=['_time'])
    print('Rows before drop duplicates', df.shape[0])
    df_nodup = df.drop_duplicates()
    print('Rows after drop duplicates', df_nodup.shape[0])
    OUT = f'../data/splunk_cfn_{dt_string}.csv'
    df_nodup.to_csv(OUT)  # write the de-duplicated frame
else:
    print('No valid data available in this period.')
del service
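With jobs.export, one possible cause of duplicates (as far as I can tell) is that the export stream can interleave preview result sets with the final ones; deduplicating server-side by adding `| dedup _time, Message, SessionId, PayeeName` to the SPL, or switching to a blocking search, may avoid it. If the duplicates must be handled client-side, dropping them while streaming is cheaper than building the full DataFrame first and then calling drop_duplicates(). A sketch with a hypothetical dedup_events helper:

```python
def dedup_events(events):
    """Return each distinct event dict once, preserving first-seen order.

    Events are keyed by their sorted (key, value) pairs, so two dicts
    with the same fields in a different order count as the same event.
    """
    seen = set()
    out = []
    for e in events:
        key = tuple(sorted(e.items()))
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out
```

Feeding each result through dedup_events before building the DataFrame keeps memory proportional to the distinct events only.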

bergen288
Engager

The 2nd issue is a connection reset error when trying to collect a whole day of data over one Splunk connection. My work-around is to collect one hour of data per connection. It would be nice to resolve the connection reset error so that I can collect a whole day of data in one session. Is this something to change on the Splunk server or inside the Python splunk-sdk package?

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
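If the server or an intermediate proxy is closing long-running connections, the hourly work-around above can at least be automated. A sketch with a hypothetical hourly_windows helper that yields earliest/latest pairs in the mm/dd/yyyy:HH:MM:SS format used in the search string:

```python
from datetime import datetime, timedelta

def hourly_windows(day, fmt="%m/%d/%Y:%H:%M:%S"):
    """Yield (earliest, latest) timestamp strings for each hour of `day`
    (given as mm/dd/yyyy), suitable for Splunk earliest=/latest= clauses."""
    start = datetime.strptime(day, "%m/%d/%Y")
    for h in range(24):
        lo = start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        yield lo.strftime(fmt), hi.strftime(fmt)

# Usage sketch: one search (and one fresh connection) per window
# for earliest, latest in hourly_windows("11/25/2021"):
#     query = f'search index=pivotal ... earliest="{earliest}" latest="{latest}"'
```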
