Reporting

Export data to Parquet (Hadoop - Cloudera Stack) | Scheduled job

astone42
Engager

We have a Hadoop cluster based on the Cloudera stack (CDH 5.8.3), and we use the Parquet file format to store the data.
We want to export processed data from Splunk directly to the Parquet tables in the Hadoop cluster.

For example, assume a table named user_sessions exists in the Hadoop cluster and is stored as Parquet.
1. User session log files are pushed to Splunk.
2. A scheduled Splunk query processes the log files and outputs the results in table format.
3. The data from step 2 is appended to the user_sessions table in the Hadoop cluster.

A possible solution for step 3 is to create a Splunk custom command that connects to Impala through pyodbc and writes the data using INSERT INTO. The drawback of that approach is performance.
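
To make the INSERT INTO idea concrete, here is a minimal sketch of the write side only, batching rows into multi-row INSERT statements so that each round trip carries more than one row. It assumes a DB-API cursor that accepts `?` placeholders (as pyodbc does) and that the Impala ODBC driver in use accepts parameterized statements; the column names `user_id` and `duration` and the helper names are hypothetical, not part of the actual table or of any Splunk API.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_insert(table, columns, batch):
    """Return (sql, params) for one multi-row INSERT with '?' placeholders."""
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join(row_placeholder for _ in batch),
    )
    # Flatten the batch row by row, in the same column order as the SQL.
    params = [row[col] for row in batch for col in columns]
    return sql, params


def export_rows(cursor, table, columns, rows, batch_size=500):
    """Append rows (dicts keyed by column name) using the given cursor.

    `cursor` is assumed to be a DB-API cursor, e.g. one obtained from
    pyodbc.connect(...).cursor() against an Impala DSN (hypothetical setup).
    Returns the number of rows sent.
    """
    total = 0
    for batch in chunked(rows, batch_size):
        sql, params = build_insert(table, columns, batch)
        cursor.execute(sql, params)
        total += len(batch)
    return total
```

Inside the custom command, the result rows from step 2 would be passed as `rows`, for example `export_rows(cursor, "user_sessions", ["user_id", "duration"], results)`. Batching reduces the number of statements, but each Impala INSERT ... VALUES statement still writes its own small data file, so this is unlikely to remove the performance concern entirely for large exports.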

Any ideas/suggestions?

Thanks a lot in advance.


rdagan_splunk
Splunk Employee