Getting Data In

Filtering events from Hadoop unstructured data

sansri7680
Path Finder

I am trying to read log files from Hadoop cluster. These are unstructured files which otherwise can be filtered after indexing using Regex searches. But my input data is huge and the throughput requirement is also very high. The result is only a small portion of the input. Hence is it possible to filter the input data before being indexed by Hunk so that I can avoid searching unnecessary data

Tags (3)
0 Karma

Ledion_Bitincka
Splunk Employee
Splunk Employee

Currently Hunk optimizes data access if the data is partitioned and Hunk is properly configured to recognize those partitions. Two types of partitioning exist: (a) time based, this is when the data is structured hierarchically using some time partitioning and (b) field based partitioning.

For example if your data is organized as follows

/some/path/20140108/server1/...
/some/path/20140108/server2/...
/some/path/20140109/server1/...
/some/path/20140109/server2/...

You can configure Hunk to recognize the third segment in the path as the data and the fourth segment as the server field. You can look at the details of how to do that here

Currently Hunk does not have the ability to optimize data access based on the file content, because we don't create an index - we just access/process the data in it's raw form.

Does this help?

0 Karma
Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

Kick the Tires Before You Commit: A Hands-On Tour of the Splunk Observability Cloud ...

Evaluating an enterprise observability platform usually goes like this: fill out a form, get a free trial with ...

Deep insights, no barriers: Splunk Observability Cloud Free Edition

As software delivery cycles continue to accelerate, observability shouldn’t be a luxury — it should be a ...

Monitoring AI Agents with Splunk Observability Cloud

Let’s say I’m running a travel planning AI app in production. A user asks for three concise hotel options in ...