Hello - I am trying to troubleshoot an issue and have not had much success in determining a root cause. I was wondering if anybody else has ever seen this issue. I inherited this cluster about a year ago. It is a distributed deployment hosted mostly in our own cloud environment, with a few parts located on-prem. The indexer tier consists of a CM with 3 peer nodes. The search tier is two SHs (not set up as a cluster): one is the main SH used by the company, and the other runs ES and is also the LM for the deployment. We have a license quota of 300 GB/day, and until very recently I believed we were averaging around 280-320 GB/day during the workweek.
In the past year, we have been told on multiple occasions that we are exceeding our license quota, and it has been a continuous effort to monitor and trim our sourcetypes where possible to manage our daily ingestion volume. Doing some log analysis on our sourcetypes, I have discovered what I believe is a duplication scenario somewhere between the ingest point and the indexer tier. To show my case, I have been taking measurements of our ingest to better understand what is happening. Using the method recommended on the Splunk forums, I started out measuring our ingest with this base search:
index=_internal sourcetype=splunkd source=*license_usage.log type=Usage earliest=-1d@d
| stats sum(b) as bytes by idx
| eval gb=round(bytes/1024/1024/1024,3)
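For context, a rolled-up variant of that search (same license_usage.log fields, just summed per day instead of per index) should give the daily total to compare directly against the 300 GB quota - something along these lines:

index=_internal sourcetype=splunkd source=*license_usage.log type=Usage earliest=-7d@d
| bin _time span=1d
| stats sum(b) as bytes by _time
| eval gb=round(bytes/1024/1024/1024,3)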
That sum comes out to between 280 and 300 GB/day. I then tried to measure the bytes in the logs using the len() eval function, in a search like this:
index=* earliest=-1d@d
| eval EventSize=len(_raw)
| stats sum(EventSize) by index
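In case it helps frame the question, a side-by-side version of the two searches would look roughly like this. The subsearch over index=* is heavy and subject to subsearch limits, so treat it as a rough sketch rather than an exact accounting:

index=_internal sourcetype=splunkd source=*license_usage.log type=Usage earliest=-1d@d
| stats sum(b) as licensed_bytes by idx
| rename idx as index
| append
    [ search index=* earliest=-1d@d
      | eval EventSize=len(_raw)
      | stats sum(EventSize) as raw_bytes by index ]
| stats sum(licensed_bytes) as licensed_bytes sum(raw_bytes) as raw_bytes by index
| eval ratio=round(licensed_bytes/raw_bytes,2)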
This search sums to around 150 GB/day. From my understanding, this is the opposite of what I expected. I did not expect the numbers to be exactly the same, but the license figure being roughly double the raw byte total does not make sense to me. If I am ingesting X bytes and Splunk is counting X*2 against the license, that is a big issue for us. Please let me know if I am incorrect in this assumption.

Another case example. To reproduce this phenomenon, I created a scenario that ingests a small amount of data, to make measuring the difference easier. I wrote a script that makes an API call to another solution in our network to pull down information about its clients. Since this list is mostly static, I figured it would be a nice, small data source to work with. Running the script locally in a bash shell returns 606 JSON events. When redirected to a local file, these 606 events come to 396,647 bytes.
Next I put this script in an app on my search head. I created a sourcetype specifically for the JSON data that the API call returns, and a scripted modular input to execute the API call every 60 seconds. I enabled the modular input, let it execute once, then disabled it. Looking at the logs in search, Splunk shows 1210 events, and summing len(_raw) comes out to 792,034 bytes. This seems to be a very large issue for us, and it appears to be affecting every sourcetype that we have. I did a similar test on our DNS logs and our firewall logs, which are our two largest sourcetypes. A close examination of the local log file shows events that are almost exactly half the size in bytes of the same _raw events examined in Splunk.
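One check I can think of to confirm whether these are literal duplicates would be something like the following, where my_api_clients stands in for the test sourcetype I created (substitute the real index and sourcetype):

index=* sourcetype=my_api_clients earliest=-60m
| stats count dc(host) as hosts dc(source) as sources by _raw
| where count > 1

If every event shows count=2 with a single host and source, that would suggest the doubling is happening somewhere in the forwarding/indexing path rather than the input simply running twice.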
Has anyone ever seen an issue like this before? I have a case open with Splunk, but so far it has not yielded any actionable results, and we are going on two weeks of back and forth. Any ideas or insights would be greatly appreciated. TY