Getting Data In

Index only first Occurrence of string in events

ips_mandar
Builder

I want to know whether the two things below are possible in Splunk and, if yes, how they can be achieved.
1. Here is a sample of events:

2019-07-16|21:15:43.370|INFO|This is a statement
2019-07-16|21:16:43.370|INFO|Random statement
2019-07-16|21:17:43.370|INFO|Random statement
2019-07-16|21:18:43.370|INFO|This is a statement
2019-07-16|21:19:43.370|INFO|This is a statement

I have a heavy forwarder where I want to index only the first occurrence of the "This is a statement" line, and I do not want other lines containing that string to be indexed. The same line appears multiple times in the log file, and I want to index only its first occurrence.
2. Here is another sample of events:

2019-07-16|21:15:43.370|INFO|Temprature-30
2019-07-16|21:16:43.370|INFO|Temprature-30
2019-07-16|21:17:43.370|INFO|Temprature-30
2019-07-16|21:18:43.370|INFO|Temprature-32
2019-07-16|21:19:43.370|INFO|Temprature-32

Here I want only the two lines with distinct temperature values to be indexed.
Are the above two things possible in Splunk? I want this done before indexing, to reduce indexing volume.
Currently I am using nullQueue and indexQueue to parse out the required data, but now I want to index only the first occurrence.
I appreciate your help.

0 Karma

DavidHourani
Super Champion

Hi @ips_mandar,

This kind of logic is not possible on the HF alone, because the indexing pipeline doesn't keep a history of the indexed events. You can see in more detail how that layer works here:
https://wiki.splunk.com/Community:HowIndexingWorks

My advice in your case is to create a scripted input and configure it in inputs.conf to run with an interval of 5-10 minutes (more or less, depending on your needs). Within this script you can apply the required logic, and its output, the de-duplicated events, is the only thing that will get indexed.
Details on when to use scripted inputs can be found here: https://docs.splunk.com/Documentation/Splunk/7.3.0/AdvancedDev/ScriptedInputsIntro#Use_cases_for_scr...

If you're not comfortable with scripted inputs, you can simply cron a script that cleanses your files and rewrites them into new files without duplicates. Then you would index those files instead of the original ones.
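As a rough illustration of the first-occurrence logic such a script would apply (a minimal sketch, not from the original thread; the sample lines are taken from the question and kept in memory here, whereas a real script would read from and write to files):

```python
# Sketch: keep only the first occurrence of each message, ignoring the
# leading date/time fields so that repeats with different timestamps
# still count as duplicates. Sample data stands in for a real log file.
sample = [
    "2019-07-16|21:15:43.370|INFO|This is a statement\n",
    "2019-07-16|21:16:43.370|INFO|Random statement\n",
    "2019-07-16|21:18:43.370|INFO|This is a statement\n",
]

seen = set()
kept = []
for line in sample:
    # Drop the first two pipe-delimited fields (date and time); what
    # remains ("INFO|This is a statement") is the de-duplication key.
    message = line.split("|", 2)[-1]
    if message not in seen:
        seen.add(message)
        kept.append(line)

# In a real script, `kept` would be written to a new file that Splunk
# then monitors instead of the original.
print("".join(kept))
```

The duplicate "This is a statement" line is dropped even though its timestamp differs from the first occurrence.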

Cheers,
David

ips_mandar
Builder

Thanks @DavidHourani
If I write a script with the duplicate-removal logic, running it will require storing the parsed files in another folder, which I would then watch with a monitor stanza. That will require extra disk space, since all my files are zip files.
Is it possible to do this with the zip files directly, without storing parsed log files in a separate folder, and send the result straight for indexing? If yes, can you please help me with a sample script?

0 Karma

DavidHourani
Super Champion

You're welcome @ips_mandar.
You don't have to unzip your files, then read, then delete. You can simply read them using zcat from the script:
https://www.tecmint.com/linux-zcat-command-examples/

Let me know if that's what you're looking for 🙂

0 Karma

ips_mandar
Builder

Sorry, I didn't mention that I am on a Windows server.
I am very new to scripting, so it would be great if you could share a script I can run to remove duplicates from the zip files, even though the duplicate lines have different timestamps.

0 Karma

DavidHourani
Super Champion

The logic should be as follows:
1- find the unique events
2- write them into new files
For Linux you can very easily do that using:

sort -u your_file > new_file

You could try finding the equivalent for Windows; it surely exists.

Also, you might need to handle the timestamp, because it makes every line different, so you'll need to exclude it from the "unique" logic.
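Putting the pieces together (reading zip files directly, Windows-friendly, timestamp excluded from the uniqueness key), a minimal Python sketch using only the standard library could look like this; the in-memory archive and the file name "app.log" are stand-ins for your real zip files:

```python
import io
import zipfile

# Build a small in-memory archive standing in for a real .zip log file,
# using sample lines from the question.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "app.log",
        "2019-07-16|21:15:43.370|INFO|Temprature-30\n"
        "2019-07-16|21:16:43.370|INFO|Temprature-30\n"
        "2019-07-16|21:18:43.370|INFO|Temprature-32\n",
    )

seen = set()
kept = []
# Read the compressed files directly; nothing is unpacked to disk.
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        with zf.open(name) as src:
            for raw in src:
                line = raw.decode("utf-8")
                # Exclude the date and time fields from the "unique" key,
                # as suggested above.
                message = line.split("|", 2)[-1]
                if message not in seen:
                    seen.add(message)
                    kept.append(line)

# `kept` holds one line per distinct temperature reading; a real script
# would write it to a file for Splunk to monitor, or print it from a
# scripted input.
print("".join(kept))
```

Because only the portion after the timestamp is used as the key, the repeated "Temprature-30" lines collapse to one event while the "Temprature-32" line is kept.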

0 Karma