
Removing duplicates while ingesting data into Splunk through heavy forwarder

kthudi6
New Member

Data stopped coming into Splunk, so I checked the disk space (example below). The disk is showing nearly 100% usage because duplicate log files are consuming the extra space; the log.1 through log.5 files shown below are taking up additional disk space. Please suggest whether I should increase the disk space or remove the duplicate log files (and how to stop the duplicate log file ingestion). Please check the examples below for better understanding.

Example:
-rw------- 1 splunkadmin splunkadmin 8486417 Aug 14 19:28 splunk_ta_microsoft-cloudservices_storage_blob_ZG90Y29tcHJlcHJvZA==.log
-rw------- 1 splunkadmin splunkadmin 24829845 Aug 14 19:26 splunk_ta_microsoft-cloudservices_storage_blob_ZG90Y29tcHJlcHJvZA==.log.1
-rw------- 1 splunkadmin splunkadmin 24949031 Aug 14 19:16 splunk_ta_microsoft-cloudservices_storage_blob_ZG90Y29tcHJlcHJvZA==.log.2
-rw------- 1 splunkadmin splunkadmin 24327574 Aug 14 19:04 splunk_ta_microsoft-cloudservices_storage_blob_ZG90Y29tcHJlcHJvZA==.log.3
-rw------- 1 splunkadmin splunkadmin 24787540 Aug 14 18:55 splunk_ta_microsoft-cloudservices_storage_blob_B64_ZG90Y29tcHJlcHJvZA==.log.4
-rw------- 1 splunkadmin splunkadmin 24898758 Aug 14 18:47 splunk_ta_microsoft-cloudservices_storage_blob_B64_ZG90Y29tcHJlcHJvZA==.log.5

Disk usage example:
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 32G 31G 840M 98% /    <- this is the disk in question
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 358M 3.6G 9% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sda1 497M 103M 394M 21% /boot
/dev/sdb1 32G 2.1G 28G 7% /mnt/resource
tmpfs 797M 0 797M 0% /run/user/1000


rkantamaneni_sp
Splunk Employee

Hi @kthudi6,

Did you resolve this? You mentioned duplicate data/log files. It looks like you have a rotating log file and you're indexing both the active log file and the rotated copies together. Is that the case?

If this is the case, you'll probably want to set up your file monitoring to watch specific files, and/or update your log rotation logic (place rotated files in a different location, have an archive script clean out old files, etc.).
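
For example, a monitor stanza in inputs.conf that picks up only the active log file and skips the rotated copies could look something like this. This is just a sketch: the directory is a placeholder for wherever these logs actually live, and the whitelist pattern is based on the filenames in your listing.

[monitor:///var/log/myapp]
# whitelist is a regex matched against the full file path;
# the trailing $ anchor excludes rotated copies like .log.1, .log.2, ...
whitelist = splunk_ta_microsoft-cloudservices_storage_blob_.*\.log$
index = main
disabled = 0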

If you monitor a file that's being constantly written to, Splunk may keep re-ingesting the data from that file because the CRC and end-of-file position keep changing before Splunk reaches the end of the file. In this case, if a delay in the data is acceptable, you can monitor only the rolled log files, or create logic that copies the data from the active log file to a separate file that Splunk monitors (e.g. copy every 30 seconds or 1 minute).
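
The delayed-but-stable variant that monitors only the rolled files could look like this (same assumptions and placeholders as the sketch above):

[monitor:///var/log/myapp]
# pick up only the rotated files (.log.1, .log.2, ...), which are no longer being written to
whitelist = splunk_ta_microsoft-cloudservices_storage_blob_.*\.log\.\d+$
index = main
disabled = 0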


solarboyz1
Builder

If the data is already indexed, there is no way to remove it from the index.
Splunk has a delete command, but that will only hide the data from searches; it will not free up the space.
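
For reference, hiding the duplicate events from searches would look something like this. The source pattern is an assumption based on the filenames in the question, and the delete command can only be run by a role with the can_delete capability:

index=current source="*splunk_ta_microsoft-cloudservices_storage_blob*" | delete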

If you want to keep the current data, you can copy a deduped set of the data to a new index:

index=current | dedup identifier | collect index=new

This will require additional space, at least temporarily, and a new index to be created.
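
Fleshed out, that might look like the following. The index names, time range, and dedup field are placeholders: dedup _raw treats only byte-identical events as duplicates, so dedup on a narrower identifier field if your duplicates differ slightly, and the target index must already exist before collect can write to it.

index=current earliest=-7d@d latest=now
| dedup _raw
| collect index=new_deduped

Once you've verified the copy, you can point your searches at the new index and let the old data age out, or clean the old index with splunk clean eventdata (which requires stopping Splunk first).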
