Knowledge Management

Reducing logs redundancy

PanKokos
Path Finder

Hi,

In my project we use Splunk mainly for performance monitoring of our application, and we have created dedicated logs for that. Currently they have the following format:

1 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ...
2 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ...
3 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="bbb" ...
4 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="bbb" ...

That was the easiest way to write Splunk queries and produce nice graphs. However, we are repeating a lot of data here, and it makes these logs harder for humans to read. Do you know of an easy alternative that removes this redundancy while still letting us query the log files effectively? E.g. logging at least the global context variables only once per task, something like:

1 [Time] TaskId="1" Key="CONTEXT" GlobalContextVariable="xxx" GlobalContextVariable2="vvv"
2 [Time] TaskId="1" Measure1="Value" Measure2="Value" LocalContextVariable="aaa" ...
3 [Time] TaskId="1" Measure1="Value" Measure2="Value" LocalContextVariable="bbb" ...

Please forgive me if I am asking for something obvious 🙂

In the end, we want to be able to present e.g. Measure1 values grouped by GlobalContextVariable(s).
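With the current redundant format that is a one-line query, e.g. (a sketch; the index name perf is just a placeholder for wherever these logs live):

index=perf
| stats avg(Measure1) by GlobalContextVariable, GlobalContextVariable2

That convenience is exactly why we chose the format in the first place.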

Thanks a lot in advance for any help.
Michal


PanKokos
Path Finder

Hi,

I think my question was misunderstood, or I did not ask it precisely enough. We do not have duplicate entries, i.e. entries with the same time and the same values:

1 [Time1] TaskId="1" Measure1="Value1" Measure2="Value2" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ...
2 [Time1] TaskId="1" Measure1="Value1" Measure2="Value2" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ...
Basically, we are (mostly) logging the execution time of critical sections of the application, so it looks like:

1 [Time1] TaskId="1" ElapsedTime="100" TaskSender="ClientA" TaskType="TypeA" SectionName="BuildingEnv"  
2 [Time1] TaskId="1" ElapsedTime="34" TaskSender="ClientA" TaskType="TypeA" SectionName="CalculatingResults" 
3 ....

In each of these logs we have fields with distinct values (e.g. SectionName) and values that repeat in every log entry (e.g. the task type, the requestor, etc.). What we want to achieve is to compress the log files so that we have:

1 [Time1] TaskId="1" TaskSender="ClientA" TaskType="TypeA" SectionName="Context"   <-- static information for a given task logged only once
2 [Time1] TaskId="1" ElapsedTime="100"  SectionName="BuildingEnv"
3 [Time1] TaskId="1" ElapsedTime="34" SectionName="CalculatingResults"
4 ....

Because Splunk is, obviously, not SQL, I am not sure whether the above is possible without reducing query performance, or whether it is possible at all.

Kind regards,


richgalloway
SplunkTrust

To log static information only once, use the dedup command.
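For example (a sketch; "your search" stands for whatever base search returns these events):

your search
| dedup TaskId GlobalContextVariable GlobalContextVariable2

dedup keeps the first event for each combination of the listed fields and discards the rest.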

---
If this reply helps you, Karma would be appreciated.

PanKokos
Path Finder

Hmm... I am not sure I got this. If I modify my log files to the last form from above, for 2 tasks:

 1 [Time1] TaskId="1" TaskSender="ClientA" TaskType="TypeAAA" SectionName="Context"   
 2 [Time1] TaskId="1" ElapsedTime="100"  SectionName="BuildingEnv"
 3 [Time1] TaskId="1" ElapsedTime="34" SectionName="CalculatingResults"
 4 [Time1] TaskId="2" TaskSender="ClientA" TaskType="TypeBBB" SectionName="Context"   
 5 [Time1] TaskId="2" ElapsedTime="100"  SectionName="BuildingEnv"
 6 [Time1] TaskId="2" ElapsedTime="34" SectionName="CalculatingResults"
 7 ....

Will it work for query:

source | stats sum(ElapsedTime) as TotalTime by TaskType

Note that TaskType is logged only once for a given task and is not present in the other log lines (the ones that carry the elapsed time). Where does dedup fit in here?


richgalloway
SplunkTrust

Thanks for clarifying. What you have is not redundant, so dedup does not apply.

I'd look at using the transaction command with the TaskId field to group events together.
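For example (a sketch; it assumes the context event and the timing events for a task share the same TaskId, as in your sample):

source
| transaction TaskId
| stats sum(ElapsedTime) as TotalTime by TaskType

transaction merges all events with the same TaskId into a single event, so the TaskType from the context line and the ElapsedTime values from the timing lines end up together, where the stats that follows can group them.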

---
If this reply helps you, Karma would be appreciated.

PanKokos
Path Finder

Hi,

Yes, that looks like what I am looking for. I will test it and let you know if it does the job.

Do you know if this has a big impact on query execution time?


richgalloway
SplunkTrust

Yes, the transaction command is known to affect performance. I think it's your only option, however.
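If it becomes too slow, constraining the transaction can help, e.g. (a sketch; the one-hour limit is an assumption about how long one of your tasks can run):

source
| transaction TaskId maxspan=1h
| stats sum(ElapsedTime) as TotalTime by TaskType

maxspan bounds how long a transaction stays open, which reduces the state Splunk must keep while searching.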

---
If this reply helps you, Karma would be appreciated.

gokadroid
Motivator

Please check the suggestions made by @richgalloway; however, one way to check whether the data is duplicated is to use stats, as follows:

If the events are duplicates, then their time values in the logs will also repeat. Doing a stats on a minimal set of fields, including the time field, can prove events are duplicated:

your query to return events
| stats count by Time, TaskId, Measure1, Measure2, GlobalContextVariable, GlobalContextVariable2, LocalContextVariable ....
| sort TaskId

That will reveal the duplicates: any table row where the count field value is more than one. The .... in the query above should be replaced with all the remaining fields you have, or whichever fields you want to use to identify duplicate events.

Once you know you are getting the counts as you had hoped, you can remove the count field by appending | fields - count and get the de-duplicated data.
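For example, to show only the duplicated rows (a sketch on top of the query above):

your query to return events
| stats count by Time, TaskId, Measure1, Measure2, GlobalContextVariable, GlobalContextVariable2, LocalContextVariable
| where count > 1
| sort TaskId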


richgalloway
SplunkTrust

You should review your inputs.conf settings to verify each log is ingested only once. If you include source in your query you should be able to see if the same log file is producing the redundant data. Indexing the same sources multiple times wastes storage and consumes your license needlessly.

Please share the search that is resulting in duplicate data. In the meantime, consider adding dedup TaskId Measure1 Measure2 LocalContextVariable to your search to eliminate redundant rows.
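For example (a sketch; "your search" is whatever search shows the duplicates):

your search
| stats dc(source) as distinct_sources, count by TaskId, Measure1, Measure2, LocalContextVariable
| where count > 1

A distinct_sources value above one on a duplicated row suggests the same events are being ingested from more than one source.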

---
If this reply helps you, Karma would be appreciated.