Knowledge Management

Hunk - Data Model Acceleration - Parquet files getting deleted

prvnks
New Member

I was trying out data model acceleration with Hunk (latest version). This is how my datamodels.conf looks:

cat etc/apps/search/local/datamodels.conf
[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
acceleration.hunk.compression_codec = snappy
acceleration.hunk.dfs_block_size = 134217728
acceleration.hunk.file_format = parquet
acceleration.manual_rebuilds = 0

It starts accelerating, but the parquet-snappy files get deleted after collecting for around 10-20 minutes. Suddenly the parquet files disappear. Maybe summary maintenance is dropping these newly created files.

$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 08:46:56 PDT 2016
0  0  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:04:40 PDT 2016
2.5 G  7.4 G  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:05:47 PDT 2016
75.4 M  226.2 M  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test

I tried playing around with other options as well, but none of them helped:

acceleration.max_time
acceleration.backfill_time
acceleration.manual_rebuilds
acceleration.max_concurrent

Please note that our Hunk deployment needs around 8 hours to process an entire day's data when no other queries are running. I don't know how to catch up and make Hunk accelerate the data model for one full day of data. Is there some switch I can use to retain the parquet-snappy files? I tried adjusting earliest_time and backfill_time (with backfill_time much shorter than earliest_time). It did not help.
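For reference, one of the shorter-window variants I tried looked roughly like this (illustrative values, not the exact ones I used):

[LVSMC]
acceleration = 1
# much shorter window, hoping the build finishes before the files get dropped
acceleration.earliest_time = -4h
acceleration.backfill_time = -1h
acceleration.hunk.file_format = parquet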

Please let me know where it could be going wrong.


hsesterhenn_spl
Splunk Employee

Hi,

This is a very old thread, but it might still be a current problem...

Have you ever tried to switch the file format from "parquet" to "orc"?

parquet-hive-bundle-1.6.0.jar is broken:
https://issues.apache.org/jira/browse/PARQUET-246

It looks like they fixed it in 1.8.0, which was never shipped with Splunk core...

I have done my tests with Hadoop DMA using ORC...
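If you want to try it, switching should only need one line changed in your datamodels.conf stanza, something like this minimal sketch (the rest of the stanza stays as it is, and the summaries get rebuilt in the new format):

[LVSMC]
acceleration = 1
acceleration.hunk.file_format = orc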

Worth a try?

Good luck,

Holger


rdagan_splunk
Splunk Employee

Here is the expected behavior. Do you see this, or something different?

Every 5 minutes, Splunk updates the DMA (Data Model Acceleration) summaries.
Every 30 minutes, Splunk deletes all DMA files that are no longer valid.

See the details in http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Acceleratedatamodels under "After you enable acceleration for a data model", and look in particular at the delete action that runs every 30 minutes.
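To confirm that you are hitting that 30-minute delete cycle, you could poll the directory and log its size over time, for example with a simple loop like this (using the same path from your output above):

while true; do
    date
    /usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
    sleep 60
done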


kschon_splunk
Splunk Employee

In your limits.conf file, try setting maintenance_period to something larger than the default, which is 1800 (i.e. 30 min). That default seems like it would explain the 10-20 minute lifespan you're seeing. If you change it to, say, 5400 (i.e. 90 min) do the DM files last longer? This won't fix your problem, but will help narrow down what is happening.
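For example, something like this (assuming the setting lives under the [summarization] stanza of limits.conf, as in current versions; check the limits.conf spec for your version):

[summarization]
# default is 1800 seconds (30 min); raise it to see if the DMA files survive longer
maintenance_period = 5400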
