Splunk Search

Issues for Splunk data collection with Python splunk-sdk package

bergen288
Engager

I ran into the following three issues when collecting Splunk data with the Python splunk-sdk package.

The 1st issue: during peak hours, from 10 AM to 4 PM, I may get the error below.  How do I increase concurrency_limit to avoid this error?  Is concurrency_limit something to modify on the Splunk server?

splunklib.binding.HTTPError: HTTP 503 Service Unavailable -- Search not executed: The maximum number of concurrent historical searches on this instance has been reached., concurrency_category="historical", concurrency_context="instance-wide", current_concurrency=52, concurrency_limit=52
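The concurrency limit is enforced on the Splunk server side: in limits.conf, the [search] stanza settings base_max_searches and max_searches_per_cpu on the search head determine how many historical searches may run at once, so raising the ceiling means changing those settings (or adding CPU cores), not anything in the SDK. On the client side, a retry with exponential backoff can ride out the 503s during peak hours. A minimal sketch, where run_with_backoff and is_busy are hypothetical names (not part of splunk-sdk):

```python
import time

def run_with_backoff(run_search, max_retries=5, base_delay=2.0, is_busy=None):
    """Call run_search(), retrying with exponential backoff when the
    server reports it is saturated (e.g. HTTP 503 from splunklib).

    run_search -- zero-argument callable that submits the search
    is_busy    -- predicate deciding whether an exception is retryable;
                  by default every exception is treated as retryable
    """
    if is_busy is None:
        is_busy = lambda exc: True
    for attempt in range(max_retries):
        try:
            return run_search()
        except Exception as exc:
            # Give up on non-retryable errors or when retries are exhausted
            if not is_busy(exc) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
```

With splunklib you might pass is_busy=lambda e: getattr(e, 'status', None) == 503 so that only the "server busy" errors are retried and everything else fails fast.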


bergen288
Engager

The 3rd issue is duplicate data.  Below is my Python program to collect one hour of data, from 11 PM to midnight on 11/25, and load it into a Pandas dataframe, sorted by _time with the index in input order.  As you can see in the attached screenshot of the CSV file, there are 184 rows in total, of which 88 are duplicates.  I can use df.drop_duplicates() to remove them, but that is not the most efficient way.  Does splunk-sdk have an option to prevent this kind of duplication?

import pandas as pd
from splunklib import results  # splunk-sdk results reader

SEARCH_STRING = f"""
    search index=pivotal  cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
    earliest="11/25/2021:23:00:00" latest="11/25/2021:24:00:00"
    | eval Message='msg.Properties.LoggingTemplate.Message'
    | eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
    | eval PayeeName= 'msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
    | sort _time
    | table _time,Message,SessionId,PayeeName
"""
dt_string = "2021_11_25_23"
TABLE = '_time,Message,SessionId,PayeeName'
COLUMNS = TABLE.split(',')
service = connect_Splunk()
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        #skip message
        pass
    elif isinstance(result, dict):
        # Normal events are returned as dicts; skip empty ones
        if result:
            ord_list.append(result)
if len(ord_list) > 0:
    # Build the frame from the dicts directly so columns align by name,
    # not by the insertion order of each dict's values
    df = pd.DataFrame(ord_list, columns=COLUMNS)
    df = df.sort_values(by=['_time'])
    print('Rows before drop duplicates', df.shape[0])
    df_nodup = df.drop_duplicates()
    print('Rows after drop duplicates', df_nodup.shape[0])
    OUT = f'../data/splunk_cfn_{dt_string}.csv'
    df_nodup.to_csv(OUT)  # write the deduplicated frame, not the original
else:
    print('No valid data available in this period.')
del service
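As far as I know, splunk-sdk has no deduplication option; jobs.export streams results as they arrive and can interleave preview chunks, which is a common source of duplicate rows. Duplicates can be removed server-side in SPL with | dedup _time Message SessionId PayeeName, or dropped while reading the stream so they never reach the DataFrame. A sketch of the streaming approach, where dedup_stream is a hypothetical helper (not an SDK function):

```python
def dedup_stream(events, key_fields):
    """Yield each event (a dict) only the first time its key_fields
    tuple is seen, preserving the original order.

    Deduplicates while streaming, so duplicate rows never reach the
    DataFrame; key_fields are the columns that define a duplicate.
    """
    seen = set()
    for event in events:
        key = tuple(event.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            yield event
```

In the reading loop above, filtering through this helper (e.g. ord_list = list(dedup_stream(ord_list, COLUMNS))) would make the later drop_duplicates() call unnecessary.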

bergen288
Engager

The 2nd issue is a connection reset error when trying to collect a whole day of data over one Splunk connection.  My workaround is to collect one hour of data per connection.  It would be nice to resolve the connection reset error so that I can collect a whole day of data in one session.  Is this something to change on the Splunk server, or inside the Python splunk-sdk package?

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
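Resets on long-running exports usually come from an idle or session timeout somewhere between client and server (a proxy, a load balancer, or the splunkd session itself), so the fix may involve both sides. Until that is resolved, the one-hour-per-connection workaround can at least be automated by generating the earliest/latest pairs. A sketch, where hourly_windows is a hypothetical helper emitting timestamps in the %m/%d/%Y:%H:%M:%S format the search string already uses:

```python
from datetime import datetime, timedelta

def hourly_windows(day_start, hours=24):
    """Split a day into (earliest, latest) string pairs, one per hour,
    formatted for Splunk's earliest=/latest= time modifiers."""
    fmt = '%m/%d/%Y:%H:%M:%S'
    for h in range(hours):
        lo = day_start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        yield lo.strftime(fmt), hi.strftime(fmt)
```

Each pair can then drive one connect-search-disconnect cycle; note the last window ends at 11/26/2021:00:00:00, which Splunk treats the same as 11/25/2021:24:00:00.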
