Importing logs from Parquet into Splunk

w344423
Explorer

Hi guys, I am doing a POC to import our Parquet files into Splunk. I have managed to write a Python script that extracts the events (the raw logs) into a DataFrame (df).

I also wrote a Python script that sends the logs via the syslog protocol to a heavy forwarder (HF) and then on to the indexer. I am using syslog because I have many log types, and with [udp://portnumber] inputs I can ingest multiple types of logs at once, each into a different sourcetype.

However, when I do this I am not able to retain the original datetime of the raw event. Splunk uses the datetime at which I sent the event instead. Secondly, I am using Python because all these Parquet files are stored in an S3 bucket, so it is easier for me to loop through the directory and extract the files.

I was hoping someone could help me work out how to keep the original timestamp of the logs. Or is there a more effective way of doing this?

Sample log from Splunk after indexing:

- Nov 10 09:45:50 127.0.0.1 <190>2023-09-01T16:59:12Z server1 server2 %NGIPS-6-430002: DeviceUUID: xxx-xxx-xxx
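For reference, the sample carries two timestamps: the leading "Nov 10 09:45:50", which matches the time the event was sent, and "2023-09-01T16:59:12Z" after the <190> priority, which appears to be the original event time. Below is a minimal sketch for pulling the second one out in Python. It assumes every raw event contains an ISO 8601 UTC timestamp in exactly this form, and original_epoch is a hypothetical helper name, not part of any Splunk library.

```python
import re
from datetime import datetime, timezone

# Assumption: the original timestamp always looks like 2023-09-01T16:59:12Z
ISO_UTC = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def original_epoch(raw_event):
    """Return the first ISO 8601 UTC timestamp in raw_event as epoch seconds.

    Returns None when the event has no timestamp in the assumed format.
    """
    match = ISO_UTC.search(raw_event)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC, so attach the UTC timezone before converting
    return parsed.replace(tzinfo=timezone.utc).timestamp()


sample = ("Nov 10 09:45:50 127.0.0.1 <190>2023-09-01T16:59:12Z "
          "server1 server2 %NGIPS-6-430002: DeviceUUID: xxx-xxx-xxx")
print(original_epoch(sample))
```

The search picks the first match, so it ignores the "Nov 10 09:45:50" prefix, which is not in ISO form.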

Here is my code to push the events via syslog:

import logging
import logging.handlers
import socket
from IPython.display import clear_output

# Create the logger. Please note that this logger is different from the ArcSight logger.
my_loggerudp = logging.getLogger('MyLoggerUDP')
#my_loggertcp = logging.getLogger('MyLoggerTCP')

# We will pass the messages at INFO level
my_loggerudp.setLevel(logging.INFO)

# Define the SysLogHandler.
# address = (IP address of the syslog collector (Connector Appliance, Loggers etc.), port)
# Specify the port you have defined on the collector; the syslog default is 514.

# TCP
#handlertcp = logging.handlers.SysLogHandler(address=('localhost', 1026), socktype=socket.SOCK_STREAM)

# UDP
handlerudp = logging.handlers.SysLogHandler(address=('localhost', 1025), socktype=socket.SOCK_DGRAM)

my_loggerudp.addHandler(handlerudp)
#my_loggertcp.addHandler(handlertcp)

# Send every raw event from the "event" column of df
# (df is the DataFrame built from the Parquet files by the earlier script)
event = df["event"]
count = len(event)
#for x in range(2):
for x in event:
    clear_output(wait=True)
    my_loggerudp.info(x)
    my_loggerudp.handlers[0].flush()
    count -= 1
    print(f"logs left to transmit: {count}")
    print(x)

 


richgalloway
SplunkTrust

IMO, syslog should be the onboarding choice of last resort.  There are too many syslog "standards" and issues always arise (like yours).

Since you're building your own ingestion program, consider sending the data to Splunk using HTTP Event Collector (HEC).  See "To Add Data Directly to an Index" at https://dev.splunk.com/enterprise/docs/devtools/python/sdk-python/howtousesplunkpython/howtogetdatap...
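As a rough illustration of the HEC route, here is a minimal sketch using only the Python standard library. It builds one JSON object per event for the HEC event endpoint (/services/collector/event) and sets the "time" field to the original event time in epoch seconds, which is what HEC uses for the event's timestamp. The function names, the URL, the token and the sourcetype below are placeholders, not values from this thread, and the epoch is assumed to have been extracted from the raw event beforehand.

```python
import json
import urllib.request


def build_hec_event(raw, epoch, sourcetype, index=None, host=None):
    """Build one HEC event payload that keeps the original event time."""
    payload = {"time": epoch, "event": raw, "sourcetype": sourcetype}
    if index is not None:
        payload["index"] = index
    if host is not None:
        payload["host"] = host
    return payload


def encode_batch(payloads):
    """Serialize several events into one request body, one JSON object per line."""
    return "\n".join(json.dumps(p) for p in payloads).encode("utf-8")


def send_to_hec(payloads, hec_url, token, timeout=10):
    """POST a batch of events to HEC and return the parsed JSON reply.

    hec_url and token are placeholders for your own HEC endpoint and token.
    """
    request = urllib.request.Request(
        hec_url,
        data=encode_batch(payloads),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (not executed here, since it needs a reachable HEC endpoint):
# events = [build_hec_event(raw, epoch, "my:sourcetype") for raw, epoch in rows]
# send_to_hec(events, "https://splunk.example.com:8088/services/collector/event", "<token>")
```

Because the sourcetype is set per event, one script can send many log types through a single HEC token, which covers the same need as one UDP port per sourcetype. If the HEC endpoint uses a self-signed certificate, urlopen will reject it unless you supply a suitable SSL context.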
