My current Splunk install handles logs for about 80 different development groups, each with their own idea of what logs should be. We're currently at 700+ sourcetypes, and growing every day.
Some of the devs get really carried away with what they're logging, IMHO. I used to have a limit of 10 KB. Who would ever need log events with more than 10 KB of data? "That's insane!" I used to say. Ah, the good old days. Then they wanted 50 KB, and I grudgingly complied. Then they wanted 100 KB, and again I was forced to comply. It's still not enough. I now have log events of 1 MB or larger: giant XML dumps. I dunno if I'd call them logs; more like entire databases. And of course they "need" them. Gotta have it. No matter the limit, they will find ways to push beyond it.
Do you just run with TRUNCATE=0 (and a well-defined LINE_BREAKER)? Do you try to educate devs not to be braindead and/or lazy? I've been trying the latter for years, and it feels like I'm just spinning my wheels.
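For reference, the TRUNCATE=0 route looks something like this in props.conf — a minimal sketch, with a made-up sourcetype name and a timestamp-based breaker that you'd adjust to your actual data:

```ini
# props.conf -- [my_giant_xml] is a hypothetical sourcetype; use your own
[my_giant_xml]
# 0 disables the per-event truncation limit entirely
TRUNCATE = 0
# Break events only where a newline is followed by an ISO-style timestamp,
# so a multi-line XML dump stays together as a single event
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}
SHOULD_LINEMERGE = false
```

The risk with TRUNCATE=0 is that a badly-formed event (or a LINE_BREAKER that never matches) can balloon into one enormous event, which is why the well-defined LINE_BREAKER part matters as much as the TRUNCATE part.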
In the last 24 hours I see 1.1 million truncation warnings in the internal logs, spanning about 75 sourcetypes. Gross.
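For anyone wanting to check their own environment, a search along these lines will surface those warnings — the component and field names below match what I've seen in splunkd.log, but may vary by version:

```
index=_internal sourcetype=splunkd component=LineBreakingProcessor "Truncating line"
| stats count by data_sourcetype
| sort - count
```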
This one is super hard, given it's very specific to your data owners' needs for that data. There might be other ways for them to solve the problem. It's certainly odd to expect a production system to output that much data in each event.
Perhaps getting them more involved, so they understand the implications of their needs, would make them more emotionally invested in the challenge you face.
Alternatively, what about breaking those large events into many sub-events (just ordinary events in Splunk) and giving them knowledge objects that correlate and reattach them at search time as needed?
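The reattach part could be as simple as a saved search or macro like the sketch below — assuming the devs emit a shared correlation field (here a hypothetical txn_id) and a sequence number on each chunk:

```
index=app_logs sourcetype=my_giant_xml_chunk
| sort 0 txn_id chunk_seq
| stats list(_raw) as parts by txn_id
| eval reassembled=mvjoin(parts, "")
```

Each chunk stays under the truncation limit at index time, and the full payload only exists transiently at search time for whoever actually needs it.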
1) Just to be sure I understand you... how, precisely, were you "forced to comply"?
2) Have you calculated what the storage and logging cost is, and made that known to your management?
3) What is the actual ability of your system to process and store log events? If you are getting a million truncation warnings per day, then you are losing data. What is the effect of losing that data?
4) Just for fun, consider the possibility of a policy that reroutes oversize events for each sourcetype to a separate index with much shorter retention. They can have 1 MB if they want, but it will age out ten times faster. Properly compact events get priority.
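A sketch of how that routing might look, using an index-time transform keyed on raw event length — the sourcetype and index names are hypothetical, and note the repetition count is capped at ~64 KB because PCRE limits quantifier bounds to 65535:

```ini
# props.conf -- hypothetical sourcetype; apply per sourcetype as policy dictates
[my_giant_xml]
TRANSFORMS-route_oversize = route_oversize_to_jumbo

# transforms.conf
[route_oversize_to_jumbo]
# Match any event of roughly 64 KB or more and send it to a
# short-retention index instead of the default one
REGEX = [\s\S]{65000,}
DEST_KEY = _MetaData:Index
FORMAT = jumbo_shortlived
```

The jumbo_shortlived index would then get an aggressive frozenTimePeriodInSecs in indexes.conf, so the storage cost of the giant events stays bounded without you having to argue about TRUNCATE at all.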