Knowledge Management

Hunk - Data Model Acceleration - Parquet files getting deleted

prvnks
New Member

I was trying out data model acceleration with Hunk (latest version). This is how my datamodels.conf looks:

cat etc/apps/search/local/datamodels.conf
[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
acceleration.hunk.compression_codec = snappy
acceleration.hunk.dfs_block_size = 134217728
acceleration.hunk.file_format = parquet
acceleration.manual_rebuilds = 0

It starts accelerating, but the parquet-snappy files get deleted after collecting for around 10-20 minutes. Suddenly the parquet files disappear. Maybe summary maintenance is dropping these newly created files.

$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 08:46:56 PDT 2016
0  0  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:04:40 PDT 2016
2.5 G  7.4 G  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:05:47 PDT 2016
75.4 M  226.2 M  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test

I tried playing around with other options as well, but none of them helped:

acceleration.max_time
acceleration.backfill_time
acceleration.manual_rebuilds
acceleration.max_concurrent

Please note that our Hunk deployment needs around 8 hours to process an entire day's data when no other queries are running. I don't know how to catch up and make Hunk accelerate the data model for one full day of data. Is there some switch I can use to retain the parquet-snappy files? I tried adjusting earliest_time and backfill_time (with backfill_time much shorter than earliest_time). It did not help.
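For reference, one of the shorter-window variants I tried looked roughly like this (illustrative values, not the exact ones I used):

[LVSMC]
acceleration = 1
# much shorter window, hoping the build finishes before the files get dropped
acceleration.earliest_time = -4h
acceleration.backfill_time = -1h
acceleration.hunk.file_format = parquet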

Please let me know where it could be going wrong.


hsesterhenn_spl
Splunk Employee

Hi,

This is a very old thread, but it might still be a current problem...

Have you ever tried to switch the file format from "parquet" to "orc"?

parquet-hive-bundle-1.6.0.jar is broken:
https://issues.apache.org/jira/browse/PARQUET-246

It looks like they fixed it in 1.8.0, which was never shipped with Splunk core...

I have done my tests with Hadoop DMA using ORC...
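If you want to try it, switching should only need one line changed in your datamodels.conf stanza, something like this minimal sketch (the rest of the stanza stays as it is, and the summaries get rebuilt in the new format):

[LVSMC]
acceleration = 1
acceleration.hunk.file_format = orc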

Worth a try?

Good luck,

Holger


rdagan_splunk
Splunk Employee

Here is the expected behavior. Do you see this, or something different?

Every 5 minutes, Splunk updates the DMA (Data Model Acceleration) summaries.
Every 30 minutes, Splunk deletes all DMA files that are no longer valid.

See the details in http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Acceleratedatamodels under "After you enable acceleration for a data model", and look in particular at the delete action that runs every 30 minutes.
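To confirm that you are hitting that 30-minute delete cycle, you could poll the directory and log its size over time, for example with a simple loop like this (using the same path from your output above):

while true; do
    date
    /usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
    sleep 60
done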


kschon_splunk
Splunk Employee

In your limits.conf file, try setting maintenance_period to something larger than the default, which is 1800 (i.e. 30 min). That default seems like it would explain the 10-20 minute lifespan you're seeing. If you change it to, say, 5400 (i.e. 90 min) do the DM files last longer? This won't fix your problem, but will help narrow down what is happening.
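For example, something like this (assuming the setting lives under the [summarization] stanza of limits.conf, as in current versions; check the limits.conf spec for your version):

[summarization]
# default is 1800 seconds (30 min); raise it to see if the DMA files survive longer
maintenance_period = 5400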
