Splunk Search

Issues for Splunk data collection with Python splunk-sdk package

bergen288
Engager

I experienced the following 3 issues when collecting Splunk data with the Python splunk-sdk package.

The 1st issue: during peak hours (10 AM to 4 PM), I sometimes get the error below.  How do I increase concurrency_limit to avoid this error?  Is concurrency_limit something to be modified on the Splunk server?

splunklib.binding.HTTPError: HTTP 503 Service Unavailable -- Search not executed: The maximum number of concurrent historical searches on this instance has been reached., concurrency_category="historical", concurrency_context="instance-wide", current_concurrency=52, concurrency_limit=52
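As far as I know, the historical-search concurrency limit is computed on the Splunk server from limits.conf settings (the `[search]` stanza's `base_max_searches` and `max_searches_per_cpu`), so raising it is a server-side change rather than an SDK option. Until then, a client-side workaround is to back off and retry when the 503 appears. A minimal sketch, where `run_with_retry`, `run_search`, `is_busy`, and the delay values are illustrative helpers of mine, not splunk-sdk APIs:

```python
import time

def run_with_retry(run_search, is_busy, max_retries=5, base_delay=30):
    """Call run_search(); if it raises an exception that is_busy()
    recognizes as a concurrency 503, back off exponentially and retry."""
    for attempt in range(max_retries):
        try:
            return run_search()
        except Exception as exc:
            if not is_busy(exc) or attempt == max_retries - 1:
                raise  # not a concurrency error, or out of retries
            time.sleep(base_delay * (2 ** attempt))  # 30s, 60s, 120s, ...

# Sketch of use with the post's names (service, SEARCH_STRING assumed):
# stream = run_with_retry(
#     lambda: service.jobs.export(SEARCH_STRING),
#     lambda exc: getattr(exc, "status", None) == 503,
# )
```

Checking the exception's `status` attribute is an assumption about `splunklib.binding.HTTPError`; matching on the message text works as a fallback.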


bergen288
Engager

The 3rd issue is about data duplicates.  Below is my Python program to collect 1 hour of data (11 PM to midnight on 11/25) and load it into a Pandas DataFrame, sorted by _time with the index reflecting input order.  As you can see in the attached screenshot of the CSV file, there are 184 lines in total, but 88 of them are duplicates.  Although I can use df.drop_duplicates() to drop them, that is not the most efficient way.  I wonder if splunk-sdk has an option to prevent this kind of duplicate?

# assumes connect_Splunk() is a local helper that returns an authenticated Service
import pandas as pd
import splunklib.results as results

SEARCH_STRING = """
    search index=pivotal  cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
    earliest="11/25/2021:23:00:00" latest="11/25/2021:24:00:00"
    | eval Message='msg.Properties.LoggingTemplate.Message'
    | eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
    | eval PayeeName='msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
    | sort _time
    | table _time,Message,SessionId,PayeeName
"""
dt_string = "2021_11_25_23"
COLUMNS = '_time,Message,SessionId,PayeeName'.split(',')
service = connect_Splunk()
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        pass  # skip diagnostic messages from the server
    elif isinstance(result, dict) and result:
        ord_list.append(result)  # normal events are returned as dicts
if ord_list:
    # build the frame from the dicts directly so columns match by key
    df = pd.DataFrame(ord_list, columns=COLUMNS)
    df = df.sort_values(by=['_time'])
    print('Rows before drop duplicates', df.shape[0])
    df_nodup = df.drop_duplicates()
    print('Rows after drop duplicates', df_nodup.shape[0])
    OUT = f'../data/splunk_cfn_{dt_string}.csv'
    df_nodup.to_csv(OUT)  # write the deduplicated frame
else:
    print('No valid data available in this period.')
del service
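One possible cause worth checking: the export endpoint can stream preview result sets before the final results arrive, which would show up as duplicated rows; some SDK versions expose an `is_preview` flag on the reader. Independent of that, duplicates can be dropped server-side with `| dedup _time, Message, SessionId, PayeeName` in the SPL, or while streaming, before the DataFrame is ever built. A minimal streaming sketch, where `dedup_stream` is a hypothetical helper of mine, not an SDK API:

```python
def dedup_stream(rows, keys):
    """Yield each row (a dict) at most once, keyed on the given field
    names, so duplicates are dropped while streaming rather than in pandas."""
    seen = set()
    for row in rows:
        key = tuple(row.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            yield row
```

In the program above this would replace the append loop, e.g. `ord_list = list(dedup_stream((r for r in rr if isinstance(r, dict)), COLUMNS))`, keeping memory bounded by the number of unique rows.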

bergen288
Engager

The 2nd issue is a connection reset error when trying to collect a whole day of data in one Splunk connection.  My workaround is to collect 1 hour of data per Splunk connection.  It would be nice to resolve the connection reset error so that I can collect the whole day in one session.  Is this something to be modified on the Splunk server or inside the Python splunk-sdk package?

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
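The hourly workaround can at least be automated so each window opens a fresh connection. A sketch under that assumption; `hourly_windows` and the loop are mine, not SDK APIs, and `connect_Splunk()` is the helper from the program above:

```python
from datetime import date, datetime, timedelta

def hourly_windows(day, hours=24):
    """Yield (earliest, latest) string pairs in Splunk's
    %m/%d/%Y:%H:%M:%S time format, one pair per hour of the given date."""
    fmt = "%m/%d/%Y:%H:%M:%S"
    start = datetime(day.year, day.month, day.day)
    for h in range(hours):
        lo = start + timedelta(hours=h)
        yield lo.strftime(fmt), (lo + timedelta(hours=1)).strftime(fmt)

# for earliest, latest in hourly_windows(date(2021, 11, 25)):
#     service = connect_Splunk()   # fresh connection per one-hour window
#     ...run the export with these earliest/latest values...
#     del service
```

Substituting the generated `earliest`/`latest` pairs into the search string avoids hand-editing 24 time ranges, and a reset mid-stream then only costs one hour of data instead of the whole day.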

 
