Hello - I am trying to troubleshoot an issue and have not had much success in determining a root cause. I was wondering if anybody else has ever seen this issue.
I inherited this cluster about a year ago. It is a distributed deployment hosted mostly in our own cloud environment, with a few parts located on-prem. The indexer tier consists of a CM with 3 peer nodes. The search tier is two SHs (not set up as a cluster): one is the main SH used by the company, and the other runs ES and is also the LM for the deployment. We have a license quota of 300 GB/day, and until very recently I believed we were averaging around 280-320 GB/day during the workweek.
In the past year, we have been told on multiple occasions that we are exceeding our license quota, and it has been a continuous effort, and a significant time investment, to monitor and trim our sourcetypes where possible to keep our daily ingestion volume in check. Doing some log analysis on our sourcetypes, I have discovered what I believe is a duplication scenario somewhere between the ingest point and the indexer tier.
As a rough illustration of my case, I have been taking measurements of our ingest to better understand what is happening. Using the method recommended on the Splunk forums, I started out measuring license usage with this base search:
index=_internal sourcetype=splunkd source=*license_usage.log type=Usage earliest=-1d@d
| stats sum(b) as bytes by idx
| eval gb=round(bytes/1024/1024/1024,3)
Summed across indexes, this came out to between 280 and 300 GB per day.
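If it helps narrow things down, the same measurement broken out by sourcetype (the type=Usage events also carry an st field) shows which sourcetypes contribute most of that volume:
index=_internal sourcetype=splunkd source=*license_usage.log type=Usage earliest=-1d@d
| stats sum(b) as bytes by st, idx
| eval gb=round(bytes/1024/1024/1024,3)
| sort - gb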
I then tried to measure the bytes in the raw events using the len() eval function, with a search like this:
index=* earliest=-1d@d
| eval EventSize=len(_raw)
| stats sum(EventSize) by index
This search sums to around 150 GB/day, which is not at all what I expected. I did not expect the two numbers to match exactly, but the license figure being roughly double the raw size does not make sense to me. If I am ingesting N bytes and Splunk is counting roughly 2*N against the license, that is a big issue for us. Please let me know if I am incorrect in this assumption.
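A per-index, per-sourcetype variant of that raw-size measurement, so it can be lined up against the license breakdown, would look roughly like this:
index=* earliest=-1d@d
| eval EventSize=len(_raw)
| stats sum(EventSize) as bytes by index, sourcetype
| eval gb=round(bytes/1024/1024/1024,3)
| sort - gb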
Here is another example. To reproduce this phenomenon, I created a scenario that ingests a small amount of data, to make measuring the difference easier. I wrote a script that makes an API call to another solution on our network to pull down information about its clients. Since this list is mostly static, I figured it would be a nice, small data source to work with. Running the script locally in a bash shell returns 606 JSON events; redirected to a local file, those 606 events total 396,647 bytes.
Next I put this script in an app on my search head. I created a sourcetype specifically for the JSON data the API call returns, set up a scripted modular input to execute the API call every 60 seconds, enabled it, let it execute once, then disabled it. Searching for the resulting events, Splunk reports 1210 events, and summing len(_raw) shows 792,034 bytes. This looks like a very large issue for us, and it appears to affect every sourcetype we have. I ran a similar test on our DNS logs and our firewall logs, which are our two largest sourcetypes; a close examination of the local log files shows they are almost exactly half the size in bytes of the same data's _raw as measured in Splunk.
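A quick way to confirm those test numbers (with placeholder index and sourcetype names standing in for the real ones) is something like:
index=main sourcetype=api_clients_json earliest=-15m
| stats count as events sum(eval(len(_raw))) as bytes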
Has anyone ever seen an issue like this before? I have a case open with Splunk, but so far it has not yielded any actionable results, and we are going on two weeks of back and forth. Any ideas or insights would be greatly appreciated. TY
See if your events are getting duplicated (i.e. the same event is ingested twice):
index=_internal
| eval IngestTime=_indextime
| stats dc(IngestTime) as countIngested by _raw _time
| where countIngested>1
Thank you for the quick response.
Using the search you provided, there are many rows with countIngested = 2, and a few where countIngested=3.
I am guessing this proves that my events are getting duplicated.
If that is the case, do you have any guidance on how you would go about fixing this type of issue?
The reason for duplication can vary based on the source and the type of monitoring. I would add other metadata to the above search (as shown below) and look at each data source. If the same raw data is coming from the same file on different hosts, it may be OK (for example, heartbeat events arriving from different hosts at the same time; the same can be true for identical data from different files). If the data is duplicated from the same file and host, then look at the inputs configuration for reasons it might monitor the same file twice (e.g. crcSalt is being used, a symlink points at the same file, etc.).
Your Base Search
| eval IngestTime=_indextime
| stats dc(IngestTime) as countIngested values(host) as hosts values(source) as sources by _raw _time index sourcetype
| where countIngested>1
Sorry for the delay - I have been trying to decipher what your search is doing and what the output is indicating.
I would like to confirm I am understanding this correctly.
Looking at the output, is the _raw only from the _internal logs, and not from the data being ingested?
Is countIngested a distinct count of the _indextime values at which those _internal logs were ingested, so a value greater than one would indicate a possible double index?
Should I be focusing on sources that are not internal Splunk logs or sourcetypes like splunkd?
I am sorry if this is fundamental knowledge, I realize I have a lot to learn and I am trying to get educated.
I should have been clearer. You would replace "index=_internal" with your own search, wherever you think duplication is happening (e.g. index=os).
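For example, with a hypothetical index named os, the whole search would be:
index=os
| eval IngestTime=_indextime
| stats dc(IngestTime) as countIngested values(host) as hosts values(source) as sources by _raw _time index sourcetype
| where countIngested>1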
Thank you for clarifying. That makes more sense now.
When I run it that way, I do not get very many results. I take it that is a good thing, but it doesn't explain why I would have the doubled numbers I described above.
On the node that has the scripted modular input, I believe my outputs.conf file might have two indexer groups specified, but they contain the same list of peers. I wonder if that is causing some sort of data cloning for that input?
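To illustrate what I mean (the group names and ports below are made up), I believe the file currently looks roughly like this, and my understanding is that listing two target groups under defaultGroup makes the forwarder clone every event to both groups, even when they point at the same peers:
[tcpout]
defaultGroup = primary_indexers, primary_indexers_backup

[tcpout:primary_indexers]
server = peer1:9997, peer2:9997, peer3:9997

[tcpout:primary_indexers_backup]
server = peer1:9997, peer2:9997, peer3:9997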
If I don't have duplicates being indexed, but these two measuring searches differ by almost exactly a factor of two, do you know of any other reason why that might be?
index=_internal sourcetype=splunkd source=*license_usage.log type=Usage earliest=-1d@d
| stats sum(b) as bytes by idx
| eval gb=round(bytes/1024/1024/1024,3)
The sum of "bytes" equals roughly300 Gb, and
index=* earliest=-1d@d
| eval EventSize=len(_raw)
| stats sum(EventSize) by index
the sum of "sum(EventSize)" comes to roughly 150 GB.
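As a third data point, I could also compare against the per-index thruput recorded in metrics.log on the indexers; by default metrics.log only samples the busiest series each interval, so this is an approximation, but it reflects what was actually written:
index=_internal sourcetype=splunkd source=*metrics.log group=per_index_thruput earliest=-1d@d
| stats sum(kb) as kb by series
| eval gb=round(kb/1024/1024,3)
| sort - gb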
I am going to try to change my outputs.conf to what I think it should be, and see if that fixes my issue for the scripted input.
It turns out the issue with my outputs.conf file was in fact creating a data cloning scenario. I changed the conf setting and now receive the number of events I would expect, which explains the duplicate events from my scripted input. I am hoping I can find another instance like it somewhere that would explain why the license numbers are doubled.
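In case anyone else hits the same thing, the corrected setting was essentially a single target group (names again made up), so the forwarder load-balances across the peers instead of cloning to two identical groups:
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = peer1:9997, peer2:9997, peer3:9997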