What is the best way of moving data from Splunk to HDFS storage for processing using Apache Spark?

manu_mukundan2
Engager

We are currently trying to set up a reliable solution for moving data from Splunk to an HDFS location. This is not for archiving: we want the data in HDFS so that we can process it further in the HDFS cluster using the Apache Spark processing framework. We have looked at these options:

  1. Forward data from a Splunk HF to an Apache NiFi Syslog processor, which pushes the data to HDFS
  2. Forward data from a Splunk HF to an Apache NiFi TcpListener processor, which pushes the data to HDFS
  3. Splunk Hadoop Connect (after looking at the Splunk documentation, it looks like this plug-in does not work with the latest versions)
  4. Splunk DSP, where the data is sent directly to Kafka and moved from there to HDFS

Thanks in advance
Manu Mukundan

koshyk
Super Champion

The best option among yours is Option 1 as you get more "pure" data from that.
But the key question here is: why do you need the data in Splunk at all? Could you split the data before it reaches Splunk?

There is another option, Cribl LogStream (https://cribl.io/), if you want to redirect your data before it reaches Splunk.
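
For what it's worth, here is a rough sketch of the landing step behind Option 1: take syslog lines as they arrive, split off the standard `<PRI>` header, and group the records into date/hour directories that Spark can later read as partitions. This is only an illustration under assumptions, not how NiFi does it internally (in NiFi you would configure processors such as ListenSyslog and PutHDFS instead of writing code). The function names, the `/data/splunk` base path, and the `dt=`/`hour=` layout are all hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Syslog lines normally start with a priority header such as "<13>".
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_syslog_line(line):
    """Split a syslog line into facility, severity and the remaining message."""
    text = line.rstrip("\n")
    match = PRI_RE.match(text)
    if match is None:
        # No priority header: keep the raw line so nothing is lost.
        return {"facility": None, "severity": None, "message": text}
    pri = int(match.group(1))
    # Per the syslog RFCs, PRI = facility * 8 + severity.
    return {"facility": pri // 8, "severity": pri % 8, "message": match.group(2)}


def partition_dir(base, received_at):
    """Build a date/hour directory name (hypothetical layout) for one event."""
    return f"{base}/dt={received_at:%Y-%m-%d}/hour={received_at:%H}"


def batch_by_partition(events, base):
    """Group (received_at, line) pairs by target directory.

    Returns a dict mapping directory -> list of parsed records; a real
    pipeline would then write each list out as one file in HDFS.
    """
    batches = defaultdict(list)
    for received_at, line in events:
        batches[partition_dir(base, received_at)].append(parse_syslog_line(line))
    return dict(batches)


if __name__ == "__main__":
    now = datetime(2020, 5, 1, 13, 30, tzinfo=timezone.utc)
    sample = [(now, "<13>May  1 13:30:00 host app: hello\n")]
    for directory, records in batch_by_partition(sample, "/data/splunk").items():
        print(directory, records)
```

Batching into a few larger files per directory matters on HDFS, since many tiny files put pressure on the NameNode and slow down Spark jobs.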

ledion
Path Finder

Also, if you're thinking of going the NiFi route, I would highly recommend checking out this blog post, where we compare its performance to Cribl LogStream and show that its performance is pretty poor.

jianw223
Loves-to-Learn

I'm guessing you work for Cribl? Anyone who has been around the block knows vendor-executed benchmarks are dishonest.

I know this because Cribl was considerably slower and buggy for our use case. It's written in Node for crying out loud!
