Activity Feed
- Posted Re: Issues for Splunk data collection with Python splunk-sdk package on Splunk Search. 11-26-2021 01:02 PM
- Posted Re: Issues for Splunk data collection with Python splunk-sdk package on Splunk Search. 11-26-2021 12:39 PM
- Posted Issues for Splunk data collection with Python splunk-sdk package on Splunk Search. 11-26-2021 12:36 PM
- Posted Re: Mismatch ']' in the search of Python Splunk SDK package on Splunk Search. 11-19-2021 08:19 AM
- Posted Re: Mismatch ']' in the search of Python Splunk SDK package on Splunk Search. 11-18-2021 07:48 AM
- Posted Re: Mismatch ']' in the search of Python Splunk SDK package on Splunk Search. 11-08-2021 07:14 AM
- Posted Re: Mismatch ']' in the search of Python Splunk SDK package on Splunk Search. 11-08-2021 07:02 AM
- Posted Mismatch ']' in the search of Python Splunk SDK package on Splunk Search. 11-05-2021 01:11 PM
- Posted Re: Python script to read Splunk data on Splunk Search. 11-01-2021 07:39 AM
- Posted Re: Python script to read Splunk data on Splunk Search. 11-01-2021 07:22 AM
- Posted Re: Python script to read Splunk data on Splunk Search. 11-01-2021 07:04 AM
- Posted Re: export results to csv on Splunk Search. 10-29-2021 01:54 PM
- Posted Re: Python script to read Splunk data on Splunk Search. 10-29-2021 01:34 PM
- Posted Re: Python script to read Splunk data on Splunk Search. 09-30-2021 06:28 AM
- Posted Re: Python script to read Splunk data on Splunk Search. 09-30-2021 06:25 AM
- Posted Re: Python script to read Splunk data on Splunk Search. 09-28-2021 07:18 AM
- Posted Python script to read Splunk data on Splunk Search. 09-27-2021 11:08 AM
- Tagged Python script to read Splunk data on Splunk Search. 09-27-2021 11:08 AM
- Tagged Python script to read Splunk data on Splunk Search. 09-27-2021 11:08 AM
- Tagged Python script to read Splunk data on Splunk Search. 09-27-2021 11:08 AM
Topics I've Started
11-26-2021
01:02 PM
The 3rd issue is data duplicates. Below is my Python program to collect one hour of data (11 PM to midnight on 11/25) and load it into a Pandas dataframe, sorted by _time with the index in input order. As you can see in the attached screenshot of the CSV file, there are 184 lines in total, but 88 of them are duplicates. Although I can use df.drop_duplicates() to drop them, that is not the most efficient way. Does splunk-sdk have an option to prevent this kind of duplication?

```python
SEARCH_STRING = f"""
search index=pivotal cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
    earliest="11/25/2021:23:00:00" latest="11/25/2021:24:00:00"
| eval Message='msg.Properties.LoggingTemplate.Message'
| eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
| eval PayeeName='msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
| sort _time
| table _time,Message,SessionId,PayeeName
"""

dt_string = "2021_11_25_23"
TABLE = '_time,Message,SessionId,PayeeName'
COLUMNS = TABLE.split(',')

service = connect_Splunk()
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        pass  # skip diagnostic messages
    elif isinstance(result, dict):  # normal events are returned as dicts
        if bool(result):
            ord_list.append(result)

if len(ord_list) > 0:
    df = pd.DataFrame([k.values() for k in ord_list], columns=COLUMNS)
    df = df.sort_values(by=['_time'])
    print('Rows before drop duplicates', df.shape[0])
    df_nodup = df.drop_duplicates()
    print('Rows after drop duplicates', df_nodup.shape[0])
    OUT = f'../data/splunk_cfn_{dt_string}.csv'
    df.to_csv(OUT)
else:
    print('No valid data available in this period.')
del service
```
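As far as I know, splunk-sdk has no deduplication option on export; a `| dedup _time Message SessionId PayeeName` stage in the SPL would stop duplicates server-side. Client-side, the rows can also be deduplicated as they stream in, which avoids building the duplicated DataFrame at all. A minimal sketch (the sample dicts below are stand-ins for ResultsReader events):

```python
def append_unique(rows, seen, result, columns):
    """Append result to rows only if its values for `columns` haven't been seen yet."""
    key = tuple(result.get(c) for c in columns)
    if key not in seen:
        seen.add(key)
        rows.append(result)

COLUMNS = ['_time', 'Message', 'SessionId', 'PayeeName']
rows, seen = [], set()
# In the real loop this would be:  for result in rr: append_unique(rows, seen, result, COLUMNS)
for result in [{'_time': 't1', 'Message': 'a', 'SessionId': 's', 'PayeeName': 'p'},
               {'_time': 't1', 'Message': 'a', 'SessionId': 's', 'PayeeName': 'p'},
               {'_time': 't2', 'Message': 'b', 'SessionId': 's', 'PayeeName': 'p'}]:
    append_unique(rows, seen, result, COLUMNS)
print(len(rows))  # 2
```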
11-26-2021
12:39 PM
The 2nd issue is a connection reset error when trying to collect a whole day of data over one Splunk connection. My workaround is to collect one hour of data per connection. It would be nice to resolve the reset error so I can collect the whole day in one session. Is this something to modify on the Splunk server, or inside the Python splunk-sdk package?

```
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
```
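For what it's worth, the hour-per-connection workaround is easy to automate by generating earliest/latest pairs for each hour and opening a fresh connection per window. A sketch of just the window generation (the collection call per window is the code from the thread and isn't repeated here):

```python
from datetime import datetime, timedelta

def hourly_windows(day: datetime, hours: int = 24):
    """Yield (earliest, latest) strings in Splunk's %m/%d/%Y:%H:%M:%S time format,
    one pair per hour of the given day."""
    fmt = "%m/%d/%Y:%H:%M:%S"
    for h in range(hours):
        start = day + timedelta(hours=h)
        end = start + timedelta(hours=1)
        yield start.strftime(fmt), end.strftime(fmt)

windows = list(hourly_windows(datetime(2021, 11, 25)))
print(windows[0])    # ('11/25/2021:00:00:00', '11/25/2021:01:00:00')
print(len(windows))  # 24
```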
11-26-2021
12:36 PM
I experienced the following three issues when collecting Splunk data with the Python splunk-sdk package. The 1st issue: during peak hours (10 AM to 4 PM), I may hit the error below. How do I increase concurrency_limit to avoid it? Is concurrency_limit something to modify on the Splunk server?

```
splunklib.binding.HTTPError: HTTP 503 Service Unavailable -- Search not executed: The maximum number of concurrent historical searches on this instance has been reached., concurrency_category="historical", concurrency_context="instance-wide", current_concurrency=52, concurrency_limit=52
```
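My understanding is that the historical-search concurrency cap is computed server-side from limits.conf settings (e.g. base_max_searches and max_searches_per_cpu), so raising it is an admin change. A client-side mitigation is to retry with exponential backoff when the 503 comes back. A sketch with the search call injected as a function so the retry logic is testable; the real call would be service.jobs.export(SEARCH_STRING):

```python
import time

def retry_on_busy(run_search, tries=5, base_delay=1.0, sleep=time.sleep):
    """Call run_search(); on an error mentioning the concurrency limit,
    back off exponentially and retry up to `tries` times."""
    for attempt in range(tries):
        try:
            return run_search()
        except Exception as e:
            if "concurrent historical searches" not in str(e) or attempt == tries - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Simulated search: fail twice with the 503 message, then succeed.
calls = {"n": 0}
def fake_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP 503 ... maximum number of concurrent historical searches ...")
    return "results"

outcome = retry_on_busy(fake_search, sleep=lambda s: None)
print(outcome)  # results
```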
Labels:
- field extraction
11-19-2021
08:19 AM
Yes, I got a "null" value for PayeeType after adding "| fillnull value=null PayeeType" to my SEARCH_STRING. Thanks.
11-18-2021
07:48 AM
Rick: I modified my search string based on your hints. In one minute at 9:33 AM today there are 1672 rows. Unfortunately, 23 rows do not have a PayeeType column, so they have 12 columns while all the others have 13, which causes a failure when loading the whole data set into a Pandas dataframe. Below is an example of the _raw column; it doesn't have PayeeType. In addition, there is a chance that AccountNumber has the same issue. Is there a way to make Splunk generate a "null" value for them, so that all rows have 13 columns even when PayeeType and/or AccountNumber are missing from _raw? Thanks.

```
"2021-11-18 09:33:06,900 [59] INFO FiservLog.stdlog - <PayeeAddManager><TenantId>FI05</TenantId><UserId>559852410</UserId><SourceMethodName>LogInfoSecure</SourceMethodName><SourceLineNumber>234</SourceLineNumber><Message>WARNING:Error adding Payee:Subscriber status prevents this action from being completed</Message><Timestamp>2021-11-18T14:33:06.899739Z</Timestamp><Exception /><AdditionalInformation><SessionId>463949F06E9F4B93A57570E8B56489A0201T4Q4P019019D467AADD625BC88A04</SessionId><Timestamp>11/18/2021 2:33:06 PM</Timestamp><CorrelationId>1637245986853</CorrelationId><PayeeName>PNC CARD SERVICES</PayeeName><Address>null</Address><AccountNumber>XXXXXXXXXXXX8590</AccountNumber></AdditionalInformation></PayeeAddManager>"
```

```
search sourcetype="builder:payeeservice" host=JWPP*BLDRBP* "*AdditionalInformation*" earliest=-27m@m latest=-26m@m
| xpath outfield=Timestamp "//NetworkPayeeAddManager/Timestamp"
| xpath outfield=TenantId "//NetworkPayeeAddManager/TenantId"
| xpath outfield=UserId "//NetworkPayeeAddManager/UserId"
| xpath outfield=SourceMethodName "//NetworkPayeeAddManager/SourceMethodName"
| xpath outfield=SourceLineNumber "//NetworkPayeeAddManager/SourceLineNumber"
| xpath outfield=Message "//NetworkPayeeAddManager/Message"
| xpath outfield=Exception "//NetworkPayeeAddManager/Exception"
| xpath outfield=SessionId "//NetworkPayeeAddManager/AdditionalInformation/SessionId"
| xpath outfield=CorrelationId "//NetworkPayeeAddManager/AdditionalInformation/CorrelationId"
| xpath outfield=PayeeName "//NetworkPayeeAddManager/AdditionalInformation/PayeeName"
| xpath outfield=Address "//NetworkPayeeAddManager/AdditionalInformation/Address"
| xpath outfield=AccountNumber "//NetworkPayeeAddManager/AdditionalInformation/AccountNumber"
| xpath outfield=PayeeType "//NetworkPayeeAddManager/AdditionalInformation/PayeeType"
| table Timestamp TenantId UserId SourceMethodName SourceLineNumber Message Exception SessionId CorrelationId PayeeName Address AccountNumber PayeeType
```
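Besides fillnull on the Splunk side, the ragged rows can also be repaired client-side by rebuilding each event against the full column list, so every row ends up with 13 values regardless of what xpath extracted. A small stdlib sketch; the column list matches the table command above, and "null" is just the placeholder chosen in this thread:

```python
COLUMNS = ["Timestamp", "TenantId", "UserId", "SourceMethodName", "SourceLineNumber",
           "Message", "Exception", "SessionId", "CorrelationId", "PayeeName",
           "Address", "AccountNumber", "PayeeType"]

def pad_row(event, columns=COLUMNS, missing="null"):
    """Return the event's values in column order, substituting `missing`
    for any field the event lacks (e.g. PayeeType, AccountNumber)."""
    return [event.get(c, missing) for c in columns]

row = pad_row({"Timestamp": "2021-11-18T14:33:06Z", "PayeeName": "PNC CARD SERVICES"})
print(len(row))  # 13
```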
11-08-2021
07:14 AM
I would expect the output dataframe to have columns from the first "TenantId" to the last "AccountNumber", with values such as 13744 and XX2222.
11-08-2021
07:02 AM
Good advice. Now I keep only the following simple search statement, with the "_raw" column alone, as it contains all my required fields.

```python
SEARCH_STRING = """
search sourcetype="builder:payeeservice" host=JWPP*BLDRBP* "*AdditionalInformation*"
    earliest=-1h@h latest=-0h@h
| table _raw
"""
```

The sample data is in OrderedDict format, as shown below. I need to extract all fields between <NetworkPayeeAddManager> and </NetworkPayeeAddManager>, or between <PayeeAddManager> and </PayeeAddManager>, and save everything to a Pandas DataFrame. What's the best way to do it?

```
OrderedDict([('_raw', '2021-11-08 08:58:23,832 [42] INFO FiservLog.stdlog - <NetworkPayeeAddManager><TenantId>13744</TenantId><UserId>999176993878</UserId><SourceMethodName>LogInfoSecure</SourceMethodName><SourceLineNumber>234</SourceLineNumber><Message>NetworkPayee was added successfully</Message><Timestamp>2021-11-08T13:58:23.831628Z</Timestamp><Exception /><AdditionalInformation><SessionId>F7E65ED4D8C74E6699C62F23ECF5D000200TWNQ9X1AA1754513234A6367FEE06</SessionId><Timestamp>11/8/2021 1:58:23 PM</Timestamp><CorrelationId>2461b5d9839a46739e9a3e918ca0681b-01</CorrelationId><PayeeName>Louisville fire brick</PayeeName><Address>{"Address1":"Po 9229","Address2":null,"City":"Louisville","State":"KY","Zip5":"40209","Zip4":null,"Zip2":null}</Address><PayeeType>UnManagedPayee</PayeeType><AccountNumber>XX2222</AccountNumber></AdditionalInformation></NetworkPayeeAddManager>')])

OrderedDict([('_raw', '2021-11-08 08:58:24,783 [105] INFO FiservLog.stdlog - <PayeeAddManager><TenantId>DI737</TenantId><UserId>344801483</UserId><SourceMethodName>LogInfoSecure</SourceMethodName><SourceLineNumber>234</SourceLineNumber><Message>Payee was added successfully</Message><Timestamp>2021-11-08T13:58:24.7831103Z</Timestamp><Exception /><AdditionalInformation><SessionId>7FC6442718864CE4838E50B026C8D0A0000TWNXSV1721BE0D804F295706DD39E</SessionId><Timestamp>11/8/2021 1:58:24 PM</Timestamp><CorrelationId>ab33b59c-756e-4144-ad62-6f0afadbe8eb</CorrelationId><PayeeName>Gail Nezworski</PayeeName><Address>{"Address1":"2280 S 460 E","Address2":null,"City":"LaGrange","State":"IN","Zip5":"46761","Zip4":null,"Zip2":null}</Address><PayeeType>UnManagedPayee</PayeeType><AccountNumber>XXXXX1888</AccountNumber></AdditionalInformation></PayeeAddManager>')])
```
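One client-side option is to cut the XML fragment out of each _raw string and parse it with the stdlib ElementTree; the manager tag (NetworkPayeeAddManager vs PayeeAddManager) can be matched generically. A sketch, run here on a trimmed-down _raw value rather than the full samples above:

```python
import re
import xml.etree.ElementTree as ET

def parse_raw(raw):
    """Extract the <...AddManager> XML fragment from a _raw line and flatten
    it (including <AdditionalInformation> children) into one dict.
    Note: a tag that appears twice, like Timestamp, keeps its last occurrence."""
    m = re.search(r"<(\w*AddManager)>.*</\1>", raw, re.DOTALL)
    if m is None:
        return {}
    root = ET.fromstring(m.group(0))
    fields = {}
    for el in root.iter():
        if el is root or list(el):  # skip the root and container elements
            continue
        fields[el.tag] = (el.text or "").strip()
    return fields

raw = ('2021-11-08 08:58:23,832 [42] INFO FiservLog.stdlog - '
       '<NetworkPayeeAddManager><TenantId>13744</TenantId>'
       '<Message>NetworkPayee was added successfully</Message>'
       '<AdditionalInformation><AccountNumber>XX2222</AccountNumber>'
       '</AdditionalInformation></NetworkPayeeAddManager>')
fields = parse_raw(raw)
print(fields)
```

From there, a list of such dicts goes straight into pd.DataFrame. Caveat: the real Address field contains JSON with `<`-free braces, which is fine, but any literal `<` or `&` inside a value would need escaping before ET.fromstring.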
11-05-2021
01:11 PM
My Python is 3.8.5 and splunk-sdk is 1.6.16. My Splunk developer gave me a URL, and I copied its search string to retrieve data, as shown below; search/earliest/latest were added after the copy/paste.

```python
SEARCH_STRING = f"""
search sourcetype="builder:payeeservice" host="JWPP*BLDR*P*" "*PayeeAddResponse" "*" "*" "*" "*" "*" "*" "*" earliest=-1h@h latest=-0h@h
|rex d5p1:Description>(?<Description>.*</d5p1:Description>)
|eval Description = replace(Description,"<[/]*[d]5p1:[\S]*>|<[d]5p1:[\S\s\"\=]*/>", "")
|rex "GU\(((?P<SponsorId>[^;]+);(?P<SubscriberId>[^;]+);(?P<SessionId>[^;]*);(?P<CorrelationId>[^;]+);(?P<Version>\w+))\)"
|table _time,SponsorId, SubscriberId,SessionId, CorrelationId,Description
|join type=left CorrelationId [search sourcetype="builder:payeeservice" host="JWPP*BLDR*P*" "*AdditionalInformation*" |xmlkv ]
|eval Timestamp = if((TenantId != ""),Timestamp,_time),PayeeName = if((TenantId != ""),PayeeName,""), Message = if((Description != ""),Description,Message), Exception = if((TenantId != ""),Exception,""), Address = if((TenantId != ""),Address,""), PayeeType = if((TenantId != ""),PayeeType,""),MerchantId = if((TenantId != ""),MerchantId,""),AccountNumber = if((TenantId != ""),AccountNumber,""),SubscriberId = if((TenantId != ""),UserId,SubscriberId),SponsorId = if((TenantId != ""),TenantId,SponsorId)
|table Timestamp, SponsorId,SubscriberId, PayeeName,Message,Exception,CorrelationId,SessionId,PayeeName,Address,PayeeType,MerchantId,AccountNumber
"""

import splunklib.results as results

service = connect_Splunk()
rr = results.ResultsReader(service.jobs.create(SEARCH_STRING))
ord_list = []
for result in rr:
    if isinstance(result, results.Message):
        pass  # skip messages
    elif isinstance(result, dict):  # normal events are returned as dicts
        ord_list.append(result)
```

I get this error, so something must be wrong in my search string. How do I fix it?

```
splunklib.binding.HTTPError: HTTP 400 Bad Request -- Error in 'SearchParser': Mismatched ']'.
```

Thanks.
Labels:
- field extraction
- regex
11-01-2021
07:39 AM
Don't worry, I found a way to load OrderedDict data into dataframe. Thanks.
11-01-2021
07:22 AM
The key question is that the default output, in <class 'collections.OrderedDict'> format, is ugly and hard to convert to a pandas dataframe. Output in CSV format is much easier to load into a dataframe. If there is a new way to convert the output to a dataframe, I don't mind what the output format is. Thanks.
11-01-2021
07:04 AM
I tried both rr = results.ResultsReader(service.jobs.export(SEARCH_STRING, **{"output_mode": "CSV"})) and rr = results.ResultsReader(service.jobs.export(SEARCH_STRING, output_mode="CSV")). Both give me the following invalid-output-mode error:

```
Traceback (most recent call last):
  File "e:\Python_Projects\Payees\Code\get_splunk_sdk.py", line 43, in <module>
    rr = results.ResultsReader(service.jobs.export(SEARCH_STRING, **{"output_mode": "CSV"}))
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\client.py", line 2989, in export
    return self.post(path_segment="export",
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\client.py", line 821, in post
    return self.service.post(path, owner=owner, app=app, sharing=sharing, **query)
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\binding.py", line 290, in wrapper
    return request_fun(self, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\binding.py", line 71, in new_f
    val = f(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\binding.py", line 764, in post
    response = self.http.post(path, all_headers, **query)
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\binding.py", line 1242, in post
    return self.request(url, message)
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\binding.py", line 1262, in request
    raise HTTPError(response)
splunklib.binding.HTTPError: HTTP 400 Invalid output mode specified (CSV). -- Invalid output mode specified (CSV).
```

If I try the following code:

```python
rr = results.ResultsReader(service.jobs.export(SEARCH_STRING, output_mode="csv"))
for result in rr:
    print(result)
```

the "rr" statement seems OK, but iterating over it gives me the following error:

```
Traceback (most recent call last):
  File "e:\Python_Projects\Payees\Code\get_splunk_sdk.py", line 47, in <module>
    for result in rr:
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\results.py", line 210, in next
    return next(self._gen)
  File "C:\ProgramData\Anaconda3\lib\site-packages\splunklib\results.py", line 219, in _parse_results
    for event, elem in et.iterparse(stream, events=('start', 'end')):
  File "C:\ProgramData\Anaconda3\lib\xml\etree\ElementTree.py", line 1227, in iterator
    yield from pullparser.read_events()
  File "C:\ProgramData\Anaconda3\lib\xml\etree\ElementTree.py", line 1302, in read_events
    raise event
  File "C:\ProgramData\Anaconda3\lib\xml\etree\ElementTree.py", line 1274, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 101
```

I also tried adding "| outputcsv myoutput.csv" inside my SEARCH_STRING, but I don't know where that file lands on Windows Server 2016. By the way, your document is pretty hard to understand. Would you mind giving me a direct answer next time? Thanks.
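The two failures above are consistent with each other: the output_mode value is case-sensitive (hence "CSV" is rejected), and ResultsReader parses only the XML output format, so feeding it a lowercase-csv stream crashes the XML parser. If csv output is wanted, one option is to skip ResultsReader and parse the stream with the stdlib csv module. A hedged sketch; the io.StringIO sample stands in for the decoded body that service.jobs.export(..., output_mode="csv") would return:

```python
import csv
import io

def rows_from_csv_stream(text_stream):
    """Parse a csv-format export stream into a list of dicts (one per event)."""
    return list(csv.DictReader(text_stream))

# Stand-in for the decoded body of a csv-mode export response.
sample = io.StringIO('_time,Message\n"2021-11-01T07:00:00","Payee was added successfully"\n')
rows = rows_from_csv_stream(sample)
print(rows[0]["Message"])  # Payee was added successfully

# With a real connection (assumption: the response is a byte stream needing decoding):
# body = service.jobs.export(SEARCH_STRING, output_mode="csv")
# rows = rows_from_csv_stream(io.TextIOWrapper(body, encoding="utf-8"))
```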
10-29-2021
01:54 PM
Where is the default location of the CSV output file defined in the search string on Windows Server 2016?
10-29-2021
01:34 PM
My application developer gave me a correct Splunk search string (see below), but its output is in <class 'collections.OrderedDict'> format, which is pretty ugly. Is there a way to get the output in CSV format? Thanks.

```python
SEARCH_STRING = """
search index=pivotal cf_app_name=ips-challenger-challengerapi-* "*PostPayeeAsync*"
    earliest=-2d latest=-d@d msg.Properties.LoggingTemplate.Exception !="*SubscriberStatus*"
| eval Message='msg.Properties.LoggingTemplate.Message'
| eval SponsorId ='msg.Properties.LoggingTemplate.TenantId'
| eval SubscriberId = 'msg.Properties.LoggingTemplate.UserId'
| eval Exception = 'msg.Properties.LoggingTemplate.Exception'
| eval CorrelationId = 'msg.Properties.LoggingTemplate.AdditionalInformation.CorrelationId'
| eval SessionId='msg.Properties.LoggingTemplate.AdditionalInformation.SessionId'
| eval PayeeName= 'msg.Properties.LoggingTemplate.AdditionalInformation.PayeeName'
| eval Address= 'msg.Properties.LoggingTemplate.AdditionalInformation.Address'
| eval MerchantType= 'msg.Properties.LoggingTemplate.AdditionalInformation.MerchantType'
| eval MerchantId= 'msg.Properties.LoggingTemplate.AdditionalInformation.MerchantId'
| eval AccountNumber= 'msg.Properties.LoggingTemplate.AdditionalInformation.AccountNumber'
| sort _time
| table _time,SponsorId,SubscriberId,Message,Exception,CorrelationId,SessionId,PayeeName,Address,MerchantType,MerchantId,AccountNumber
"""
```
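On the OrderedDict complaint: pandas can consume the reader's dicts directly, so no CSV detour is needed. pd.DataFrame accepts a list of dicts and aligns columns by key, filling gaps with NaN. A small sketch with literal dicts standing in for the events ResultsReader yields:

```python
import pandas as pd
from collections import OrderedDict

# Stand-ins for events yielded by results.ResultsReader(...)
ord_list = [
    OrderedDict([("_time", "2021-10-29T13:00:00"), ("SponsorId", "13744"),
                 ("Message", "Payee was added successfully")]),
    OrderedDict([("_time", "2021-10-29T13:00:05"), ("SponsorId", "DI737"),
                 ("Message", "NetworkPayee was added successfully")]),
]

df = pd.DataFrame(ord_list)  # columns aligned by key, first-seen order preserved
print(df.shape)              # (2, 3)
print(list(df.columns))      # ['_time', 'SponsorId', 'Message']
```

This also sidesteps the `[k.values() for k in ord_list]` pattern seen elsewhere in the thread, which silently misaligns columns when an event is missing a field.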
09-30-2021
06:28 AM
Sorry, I replied to your previous response by mistake. Here it is again: Sorry for the confusion. I am trying two different approaches with the same login credentials. The 1st is regular Web access, which fails with a 401 error; the 2nd is a connection via the splunk-sdk client, which succeeds, as confirmed by <splunklib.client.Service object at 0x0000013682881790> from the print(service) statement. For the Web access connection, my question is how to log in to the Splunk website correctly. For the Splunk client connection, my question is how to modify its "search" string to get correct results. I am fine with either one.
09-30-2021
06:25 AM
Sorry for the confusion. I am trying two different approaches with the same login credentials. The 1st is regular Web access, which fails with a 401 error; the 2nd is a connection via the splunk-sdk client, which succeeds, as confirmed by <splunklib.client.Service object at 0x0000013682881790> from the print(service) statement. For the Web access connection, my question is how to log in to the Splunk website correctly. For the Splunk client connection, my question is how to modify its "search" string to get correct results. I am fine with either one.
09-28-2021
07:18 AM
First, I don't see any valid search result from the print(result) statement. My key question is how to define the search string for https://splunk.usce.l.az.fisv.cloud/en-US/app/epayments/postpayee_success_and_failure?form.SponsorId=*&form.SubscriberId=*&form.CorrelationId=*&form.Status=*&form.Exception=-&form.timespan.earliest=-7d%40h&form.timespan.latest=now after the Splunk client connection. Second, I don't see a Splunk website login example in your link. Thanks.
09-27-2021
11:08 AM
I need to collect specific Splunk data for business analysis. My target URL is https://splunk.usce.l.az.fisv.cloud/en-US/app/epayments/postpayee_success_and_failure?form.SponsorId=*&form.SubscriberId=*&form.CorrelationId=*&form.Status=*&form.Exception=-&form.timespan.earliest=-7d%40h&form.timespan.latest=now. After logging in with my username/password, it shows the "Post Payee Exception List". I am trying to write a Python script to read the last 7 days of Splunk data. Below is my code:

```python
session = requests.Session()
response = session.post(LOGIN_URL, auth=HTTPBasicAuth(user, password), verify=False)
print(response.status_code)
```

The user/password are the same ones used for Web access, and LOGIN_URL is 'https://splunk.usce.l.az.fisv.cloud/en-US/account/login?return_to=%2Fen-US%2F'. However, the response status code is 401, a failure. What's the correct Python way to log in to the Splunk website?

In addition, I am trying to connect to the Splunk server with the splunk-sdk package via port 8089. Below is my Python code:

```python
import splunklib.client as client
import splunklib.results as results

HOST = "splunk.usce.l.az.fisv.cloud"
PORT = 8089
credentials = get_splunk_pwd()
username = credentials['username']
password = credentials['password']
service = client.connect(host=HOST, port=PORT, username=username, password=password)
print(service)

rr = results.ResultsReader(service.jobs.export("search index=_internal earliest=-24h | head 5"))
for result in rr:
    if isinstance(result, results.Message):
        # Diagnostic messages might be returned in the results
        print('%s: %s' % (result.type, result.message))
    elif isinstance(result, dict):
        # Normal events are returned as dicts
        print(result)
```

Below is the output. It looks like the Splunk connection is established successfully, but the search is invalid. What's the valid search string based on my target URL in the 1st line?

```
<splunklib.client.Service object at 0x0000029461421790>
DEBUG: Configuration initialization for /opt/splunk/etc took 91ms when dispatching a search (search ID: 1632765670.57370_31B6A7A0-BF6B-46EF-BD46-2CF0D6AB351A)
DEBUG: Invalid eval expression for 'EVAL-SessionDateTime' in stanza [source::dbmon-tail://*/CCAuditLogSelect]: The expression is malformed. An unexpected character is reached at '“%Y-%m-%d %H:%M:%S.%3N”)'.
DEBUG: Invalid eval expression for 'EVAL-TrxDateTime' in stanza [source::dbmon-tail://*/CCAuditLogSelect]: The expression is malformed. An unexpected character is reached at '“%Y-%m-%d %H:%M:%S.%3N”)'.
DEBUG: base lispy: [ AND index::_internal ]
DEBUG: search context: user="xzhang", app="search", bs-pathname="/opt/splunk/etc"
```
Labels:
- search job inspector