Splunk Search

Issues with Splunk data collection using the Python splunk-sdk package

bergen288
Engager

I experienced the following 3 issues when collecting Splunk data with the Python splunk-sdk package.

The 1st issue: during peak hours (10 AM to 4 PM), I may hit the following error. How do I increase concurrency_limit to avoid it? Is concurrency_limit something to modify on the Splunk server?

splunklib.binding.HTTPError: HTTP 503 Service Unavailable -- Search not executed: The maximum number of concurrent historical searches on this instance has been reached., concurrency_category="historical", concurrency_context="instance-wide", current_concurrency=52, concurrency_limit=52
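For what it's worth, my understanding is that the instance-wide historical search limit is derived from settings in limits.conf on the search head (admin access required), roughly max_searches_per_cpu x number_of_CPUs + base_max_searches. A sketch of the relevant stanza, with the default values (illustrative, not recommendations):

```
# limits.conf on the search head -- values shown are the defaults
[search]
# concurrent historical searches ~= max_searches_per_cpu x #CPUs + base_max_searches
base_max_searches = 6
max_searches_per_cpu = 1
```

Raising these trades search concurrency for CPU/memory pressure, so it is usually something a Splunk admin tunes rather than the client side.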


bergen288
Engager

The 3rd issue is data duplicates.  Below is my Python program to collect one hour of data (11 PM to midnight on 11/25) and load it into a Pandas dataframe, sorted by _time with the index as input order.  As you can see in the attached screenshot of the CSV file, there are 184 lines in total, but 88 of them are duplicates.  Although I can use df.drop_duplicates() to drop them, that is not the most efficient way.  Does splunk-sdk have an option to prevent this kind of duplicate?

import pandas as pd
import splunklib.results as results

SEARCH_STRING = """
    search index=pivotal  cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
    earliest="11/25/2021:23:00:00" latest="11/25/2021:24:00:00"
    | eval Message='msg.Properties.LoggingTemplate.Message'
    | eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
    | eval PayeeName='msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
    | sort _time
    | table _time,Message,SessionId,PayeeName
"""
dt_string = "2021_11_25_23"
COLUMNS = ['_time', 'Message', 'SessionId', 'PayeeName']
service = connect_Splunk()
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        pass  # skip diagnostic messages from the server
    elif isinstance(result, dict) and result:
        ord_list.append(result)  # normal events are returned as dicts
if ord_list:
    # Build the frame from the dicts themselves so each value lands in the
    # right column even if the server returns fields in a different order.
    df = pd.DataFrame(ord_list, columns=COLUMNS)
    df = df.sort_values(by=['_time'])
    print('Rows before drop duplicates', df.shape[0])
    df_nodup = df.drop_duplicates()
    print('Rows after drop duplicates', df_nodup.shape[0])
    OUT = f'../data/splunk_cfn_{dt_string}.csv'
    df_nodup.to_csv(OUT)  # write the deduplicated frame, not the original
else:
    print('No valid data available in this period.')
del service
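On the duplicates: with jobs.export the stream can include preview result sets before the final one, and (if I read the SDK correctly) results.ResultsReader exposes an is_preview flag you could check to skip previews. Independent of that, drop_duplicates on this kind of data is cheap. A minimal, Splunk-free sketch with made-up rows standing in for the events the SDK returns:

```python
import pandas as pd

# Hypothetical rows mimicking events from splunk-sdk; the repeated row
# imitates what an interleaved preview + final result set can produce.
rows = [
    {"_time": "2021-11-25T23:00:01", "Message": "PostPayeeAsync ok",
     "SessionId": "s1", "PayeeName": "Acme"},
    {"_time": "2021-11-25T23:00:01", "Message": "PostPayeeAsync ok",
     "SessionId": "s1", "PayeeName": "Acme"},
    {"_time": "2021-11-25T23:05:42", "Message": "PostPayeeAsync ok",
     "SessionId": "s2", "PayeeName": "Globex"},
]

df = pd.DataFrame(rows)
# Drop exact duplicate rows, keep the first occurrence, restore a clean index.
df_nodup = df.drop_duplicates().sort_values("_time").reset_index(drop=True)
print(len(df), len(df_nodup))  # 3 2
```

Building the DataFrame directly from the list of dicts (rather than from `.values()`) also avoids any dependence on dict key order.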

bergen288
Engager

The 2nd issue is a connection reset error when trying to collect a whole day's data over one Splunk connection.  My work-around is to collect one hour of data per connection.  It would be nice to resolve the connection reset error so that I can collect a whole day in one session.  Is this something to modify on the Splunk server or inside the Python splunk-sdk package?

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
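Until the reset is resolved server-side, the hourly work-around can at least be automated. A minimal sketch that generates (earliest, latest) pairs in the same %m/%d/%Y:%H:%M:%S format used in the search string above (the actual per-window query call is omitted):

```python
from datetime import datetime, timedelta

def hourly_windows(day: str):
    """Yield (earliest, latest) string pairs covering one day in 1-hour chunks."""
    start = datetime.strptime(day, "%m/%d/%Y")
    fmt = "%m/%d/%Y:%H:%M:%S"
    for h in range(24):
        lo = start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        yield lo.strftime(fmt), hi.strftime(fmt)

windows = list(hourly_windows("11/25/2021"))
print(windows[0])   # ('11/25/2021:00:00:00', '11/25/2021:01:00:00')
print(windows[-1])  # ('11/25/2021:23:00:00', '11/26/2021:00:00:00')
```

Each pair can be interpolated into the earliest=/latest= terms of the search, one Splunk connection per window, and the 24 partial frames concatenated afterwards.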
