Hi,
In my project we are using Splunk mainly for performance monitoring of our application, and we have created dedicated logs for that. Currently they have the following format:
1 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ....
2 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ...
3 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="bbb" ....
4 [Time] TaskId="1" Measure1="Value" Measure2="Value" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="bbb" ....
That was the easiest way to write Splunk queries and produce nice graphs. However, we are repeating a lot of data here, and it also makes these logs harder for humans to read. Do you know if there is an easy alternative that removes this redundancy while still letting us query the log files effectively? E.g. to have at least the global context variables logged only once for a given task, something like:
1 [Time] TaskId="1" Key="CONTEXT" GlobalContextVariable="xxx" GlobalContextVariable2="vvv"
2 [Time] TaskId="1" Measure1="Value" Measure2="Value" LocalContextVariable="aaa" ....
3 [Time] TaskId="1" Measure1="Value" Measure2="Value" LocalContextVariable="bbb" ...
Please forgive me if I am asking for something obvious 🙂
In the end, we want to be able to present e.g. Measure1 values grouped by GlobalContextVariable(s).
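With the current redundant format that part is simple enough; we run something along the lines of the following, where the source filter is only a placeholder:
source="app_perf.log" | stats avg(Measure1) by GlobalContextVariable, GlobalContextVariable2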
Thanks a lot in advance for any help.
Michal
Hi,
I think that my question was misunderstood - or I did not ask it precisely enough. We do not have duplicate entries, i.e. entries with the same time and the same values:
1 [Time1] TaskId="1" Measure1="Value1" Measure2="Value2" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ....
2 [Time1] TaskId="1" Measure1="Value1" Measure2="Value2" GlobalContextVariable="xxx" GlobalContextVariable2="vvv" LocalContextVariable="aaa" ...
Basically, we are logging (mostly) execution times of critical sections of the application, so it will be:
1 [Time1] TaskId="1" ElapsedTime="100" TaskSender="ClientA" TaskType="TypeA" SectionName="BuildingEnv"
2 [Time1] TaskId="1" ElapsedTime="34" TaskSender="ClientA" TaskType="TypeA" SectionName="CalculatingResults"
3 ....
For each of these logs we have a number of distinct values (e.g. SectionName) and a number of values repeated in every log entry (e.g. type of task, requestor, etc.). What we want to achieve is to compress the log files so that we have:
1 [Time1] TaskId="1" TaskSender="ClientA" TaskType="TypeA" SectionName="Context" <-- logging static information for a given task only once
2 [Time1] TaskId="1" ElapsedTime="100" SectionName="BuildingEnv"
3 [Time1] TaskId="1" ElapsedTime="34" SectionName="CalculatingResults"
4 ....
Because Splunk is, obviously, not SQL, I am not sure whether the above is even possible without reducing query performance.
Kind regards,
To log static information only once, use the dedup command.
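For example, assuming the redundant format from your first post, something along these lines would keep only the first event for each combination of the listed fields:
your base search | dedup TaskId GlobalContextVariable GlobalContextVariable2 LocalContextVariable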
Hmm... Not sure if I got this. If I modify my log files to be in the last form from above, then for 2 tasks I get:
1 [Time1] TaskId="1" TaskSender="ClientA" TaskType="TypeAAA" SectionName="Context"
2 [Time1] TaskId="1" ElapsedTime="100" SectionName="BuildingEnv"
3 [Time1] TaskId="1" ElapsedTime="34" SectionName="CalculatingResults"
4 [Time1] TaskId="2" TaskSender="ClientA" TaskType="TypeBBB" SectionName="Context"
5 [Time1] TaskId="2" ElapsedTime="100" SectionName="BuildingEnv"
6 [Time1] TaskId="2" ElapsedTime="34" SectionName="CalculatingResults"
7 ....
Will it work for a query like:
source | stats sum(ElapsedTime) as TotalTime by TaskType
Note that TaskType is logged only once for a given task and is not present in the other log lines (which carry the elapsed times). Where is the place for dedup here?
Thanks for clarifying. What you have is not redundant, so dedup does not apply.
I'd look at using the transaction command with the TaskId field to group events together.
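Roughly, and untested, something like the following should work against the compressed format, since transaction merges each task's events (including the one carrying TaskType) into a single event:
your base search
| transaction TaskId
| stats sum(ElapsedTime) as TotalTime by TaskType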
Hi,
Yes, that looks like what I am looking for. I will test it and let you know if it does the trick.
Do you know if this has a big impact on query execution time?
Yes, the transaction command is known to affect performance. I think it's your only option, however.
Please check the suggestions made by @richgalloway. However, one way to check whether the data is duplicated is to use stats. If the events are duplicates, then their time value in the logs will also repeat for the duplicate events, so running stats over a minimal set of fields, including the time field, can reveal them:
your query to return events
| stats count by Time, TaskId, Measure1, Measure2, GlobalContextVariable, GlobalContextVariable2, LocalContextVariable ....
| sort TaskId
That will show you the duplicates: any table row where the count field value is more than one is a duplicate. The .... in the above query should be replaced with whatever remaining fields you have or want to use to identify duplicate events.
Once you know you are getting the counts as you had hoped, you can remove the count field by appending | fields - count and get the de-duplicated data.
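Put together, and keeping the field list from the example above (trim or extend it to match your actual fields), the full pipeline would look roughly like:
your query to return events
| stats count by Time, TaskId, Measure1, Measure2, GlobalContextVariable, GlobalContextVariable2, LocalContextVariable
| fields - count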
You should review your inputs.conf settings to verify each log is ingested only once. If you include source in your query you should be able to see whether the same log file is producing the redundant data. Indexing the same sources multiple times wastes storage and consumes your license needlessly.
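For reference, a typical monitor stanza looks like the sketch below (the path, index, and sourcetype are placeholders); check that each log path is covered by only one such stanza across your apps:
[monitor:///var/log/myapp/perf.log]
index = app_perf
sourcetype = app_perf_log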
Please share the search that is resulting in duplicate data. In the meantime, consider adding dedup TaskId Measure1 Measure2 LocalContextVariable to your search to eliminate redundant rows.