Getting Data In

Importing logs from Parquet into Splunk

w344423
Explorer

Hi guys, I am performing a POC to import our Parquet files into Splunk. I have managed to write a Python script that extracts the events (i.e. the raw logs) into a DataFrame.

I also wrote a Python script that sends the logs via the syslog protocol to a heavy forwarder (HF), which then forwards them to the indexer. I am using the syslog method because I have many log types, and with a [udp://portnumber] stanza per type I can ingest multiple log types at once, each to a different sourcetype.
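For reference, a sketch of one such stanza in inputs.conf on the HF (the port and sourcetype name are placeholders; my understanding is that setting no_appending_timestamp = true stops the UDP input from prepending its own timestamp and host, which looks like where the "Nov 10 ..." prefix in my sample below comes from):

[udp://1025]
sourcetype = ngips:syslog
no_appending_timestamp = true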

However, when I do this, the events are not indexed with the original datetime from the raw event; Splunk instead uses the datetime at which I sent the event. Secondly, I am using Python because all these Parquet files are stored in an S3 bucket, so it is easier for me to loop through the directory and extract each file.
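For anyone curious, a simplified sketch of that kind of loop (the bucket name and prefix are placeholders; it assumes boto3 and pandas with pyarrow/s3fs installed):

import boto3
import pandas as pd  # reading s3:// paths assumes pyarrow and s3fs are installed

s3 = boto3.client("s3")
bucket = "my-log-bucket"  # placeholder bucket name

# Walk every .parquet object under a prefix and load it into a DataFrame
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="logs/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            df = pd.read_parquet(f"s3://{bucket}/{obj['Key']}")
            # df["event"] now holds the raw log lines for this file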

I was hoping someone could help me out: how can I get Splunk to keep the original timestamp of the logs? Or is there a more effective way of doing this?

Sample log from Splunk after indexing:

- Nov 10 09:45:50 127.0.0.1 <190>2023-09-01T16:59:12Z server1 server2 %NGIPS-6-430002: DeviceUUID: xxx-xxx-xxx
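From what I have read, a props.conf stanza that tells Splunk where the real timestamp sits inside the event might be the fix here, applied on the HF since that is the first parsing tier. A sketch of what I think it would look like for the sample above (the sourcetype name is a placeholder):

[ngips:syslog]
TIME_PREFIX = <\d+>
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ
MAX_TIMESTAMP_LOOKAHEAD = 30
TZ = UTC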

Here's my code to push the events via syslog:

import logging
import logging.handlers
import socket
from IPython.display import clear_output  # only needed when running in a notebook


# Create your logger. (Note: this logger is unrelated to ArcSight Logger.)
my_loggerudp = logging.getLogger('MyLoggerUDP')
#my_loggertcp = logging.getLogger('MyLoggerTCP')

# We will pass the messages at INFO level
my_loggerudp.setLevel(logging.INFO)

# Define the SysLogHandler.
# address = (X.X.X.X, port), where X.X.X.X is the IP address of the syslog
# collector (connector appliance, logger, etc.) and port is the one you have
# defined on the collector; by default syslog uses 514.

#TCP
#handlertcp = logging.handlers.SysLogHandler(address=('localhost', 1026), socktype=socket.SOCK_STREAM)

#UDP
handlerudp = logging.handlers.SysLogHandler(address=('localhost', 1025), socktype=socket.SOCK_DGRAM)

my_loggerudp.addHandler(handlerudp)
#my_loggertcp.addHandler(handlertcp)

# Example: send each raw event from the DataFrame column
event = df["event"]
count = len(event)
for x in event:
    clear_output(wait=True)
    my_loggerudp.info(x)
    my_loggerudp.handlers[0].flush()
    count -= 1
    print(f"logs left to be transmitted: {count}")
    print(x)

 


richgalloway
SplunkTrust

IMO, syslog should be the onboarding choice of last resort. There are too many syslog "standards," and issues like yours always arise.

Since you're building your own ingestion program, consider sending the data to Splunk using HTTP Event Collector (HEC).  See "To Add Data Directly to an Index" at https://dev.splunk.com/enterprise/docs/devtools/python/sdk-python/howtousesplunkpython/howtogetdatap...
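A minimal sketch with the requests library (the URL and token are placeholders; the "time" field in the HEC payload is what preserves your original event timestamp):

import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder token

def send_event(raw_event, epoch_time, sourcetype):
    """Send one raw event to HEC, indexed at its original timestamp."""
    payload = {
        "time": epoch_time,        # original event time as Unix epoch seconds
        "event": raw_event,        # the raw log line from the DataFrame
        "sourcetype": sourcetype,  # one sourcetype per log type, as before
    }
    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        data=json.dumps(payload),
        timeout=10,
    )
    resp.raise_for_status()

Since you're already parsing the events in Python, you can pull the embedded timestamp out of each row and pass it as epoch_time, so no timestamp extraction is needed on the Splunk side.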

---
If this reply helps you, Karma would be appreciated.