Hunk - Data Model Acceleration - Parquet files getting deleted

prvnks
New Member

I am trying out data model acceleration with Hunk (latest version). This is what my datamodels.conf looks like:

$ cat etc/apps/search/local/datamodels.conf
[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
acceleration.hunk.compression_codec = snappy
acceleration.hunk.dfs_block_size = 134217728
acceleration.hunk.file_format = parquet
acceleration.manual_rebuilds = 0

It starts accelerating, but the Parquet/Snappy files get deleted after about 10-20 minutes of collection. The files suddenly disappear. Maybe summary creation is dropping the newly created files.

$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 08:46:56 PDT 2016
0  0  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:04:40 PDT 2016
2.5 G  7.4 G  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:05:47 PDT 2016
75.4 M  226.2 M  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test

I experimented with the following options, but none of them helped:

acceleration.max_time
acceleration.backfill_time
acceleration.manual_rebuilds
acceleration.max_concurrent

Please note that our Hunk would need around 8 hours to process an entire day’s data when no other queries are running. I don’t know how to catch up and make Hunk accelerate the data model for one day of data. Is there a switch I can use to retain the Parquet/Snappy files? I tried adjusting earliest_time and backfill_time (set much shorter than earliest_time), but it did not help.

Please let me know what could be going wrong.

hsesterhenn_spl
Splunk Employee

Hi,

This is a very old thread, but it might still be a current problem.

Have you tried switching the file format from "parquet" to "orc"?

parquet-hive-bundle-1.6.0.jar is broken; see:
https://issues.apache.org/jira/browse/PARQUET-246

It looks like this was fixed in 1.8.0, which Splunk Core has never shipped.

I ran my tests with Hadoop DMA using ORC.

Worth a try?
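
For reference, a minimal sketch of what that switch might look like, reusing the [LVSMC] stanza from the question and changing only the file_format line. Whether the existing compression codec and block size settings behave the same way with ORC has not been verified here, so treat this as a starting point rather than a confirmed fix:

```ini
# etc/apps/search/local/datamodels.conf
# Same stanza as in the question; only file_format is changed.
[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
acceleration.hunk.compression_codec = snappy
acceleration.hunk.dfs_block_size = 134217728
# Changed from "parquet" to "orc" to sidestep the Parquet bundle issue
acceleration.hunk.file_format = orc
acceleration.manual_rebuilds = 0
```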

Good luck,

Holger

rdagan_splunk
Splunk Employee

Here is the expected behavior. Do you see this behavior, or something different?
Every 5 minutes, the DMA (Data Model Acceleration) summary is updated.
Every 30 minutes, all DMA files that are no longer valid are deleted.
See details: http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Acceleratedatamodels (section "After you enable acceleration for a data model").
In particular, look at the delete action that runs every 30 minutes.

kschon_splunk
Splunk Employee

In your limits.conf file, try setting maintenance_period to something larger than the default, which is 1800 seconds (i.e. 30 min). That default seems like it would explain the 10-20 minute lifespan you're seeing. If you change it to, say, 5400 (i.e. 90 min), do the DM files last longer? This won't fix your problem, but it will help narrow down what is happening.
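
A sketch of that change is below. The [auto_summarizer] stanza name and the file location are assumptions on my part; check limits.conf.spec for your version before relying on them. A restart may be needed for the new value to take effect.

```ini
# limits.conf (assumed location: $SPLUNK_HOME/etc/system/local/limits.conf)
# Assumption: maintenance_period belongs to the auto_summarizer stanza.
[auto_summarizer]
# Summary maintenance interval in seconds; default is 1800 (30 min).
# Raised to 5400 (90 min) purely as a diagnostic, not as a fix.
maintenance_period = 5400
```

If the files then survive for roughly 90 minutes instead of 10-20, that would point at the maintenance pass as the step that removes them.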
