Export data to Parquet (Hadoop - Cloudera Stack) |...

astone42 · ‎08-08-2017

We have a Hadoop cluster based on the Cloudera stack (CDH 5.8.3), and we store our data in the Parquet file format.
We want to export processed data from Splunk directly into the Parquet tables in the Hadoop cluster.

For example, assume a table named user_sessions, stored as Parquet, exists in the Hadoop cluster.
1. User session log files are pushed to Splunk.
2. A scheduled Splunk search processes the log files and outputs the results in table format.
3. The data from step 2 is appended to the user_sessions table in the Hadoop cluster.

A possible solution for step 3 is to create a Splunk custom command that connects to Impala through pyodbc and writes the data using INSERT INTO. The drawback of that solution is performance: writing the data through INSERT INTO statements is the bottleneck.
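To make the INSERT INTO approach concrete, here is a minimal sketch of how such a custom command could group rows into multi-row INSERT statements instead of issuing one statement per row. It is an illustration under assumptions, not a tested Impala integration: the function names, the column names (user_id, duration), and the batch size are hypothetical, and the cursor is assumed to be a pyodbc-style cursor whose execute(sql, params) accepts "?" placeholders. Whether the Impala ODBC driver in a given setup handles parameterized multi-row inserts well should be verified.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_batches(cursor, table, columns, rows, batch_size=500):
    """Append rows with one multi-row INSERT per batch; return rows written.

    `cursor` is assumed to behave like a pyodbc cursor. `table` and
    `columns` are interpolated into the SQL text, so they must come from
    trusted configuration, never from event data.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    column_list = ", ".join(columns)
    written = 0
    for batch in chunked(rows, batch_size):
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table, column_list, ", ".join([row_placeholder] * len(batch))
        )
        # Flatten the batch so the values line up with the placeholders.
        params = [value for row in batch for value in row]
        cursor.execute(sql, params)
        written += len(batch)
    return written
```

With a real connection this might be called as insert_batches(conn.cursor(), "user_sessions", ["user_id", "duration"], rows), where conn comes from pyodbc.connect with a DSN defined for Impala. Batching reduces the number of statements, but each INSERT ... VALUES statement in Impala still produces its own small data file, so this approach is likely to remain slow for large volumes.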

Any ideas/suggestions?

Thanks a lot in advance.

rdagan_splunk · ‎08-08-2017

Martin, you can use Impala together with Splunk DB Connect.
Connecting to Impala (note the details about DBX 3):
https://answers.splunk.com/answers/489448/can-i-connect-to-impala-sql-engine-on-hadoop-from.html
DB Connect dbxoutput command:
http://docs.splunk.com/Documentation/DBX/latest/DeployDBX/Createandmanagedatabaseoutputs
