Hi All,
We are facing an issue we can't pinpoint. We are in the process of wrapping up our Firepower deployment, and in the last couple of weeks we enabled Firepower services at one of our top-talker locations plus three small ones. We expected this to increase our Splunk license usage by about 15-17GB daily.
In reality Splunk's license usage remained pretty much the same before and after the migrations (101-102GB daily).
We already checked the sensors and all of them are reporting correctly to the FMC, but for some still unclear reason it seems that the amount of log data produced daily by the other sensors decreased proportionally to accommodate the new entries, so that the total still does not exceed 101-102GB.
I know that this sounds like a "Stranger Things" episode, but I would appreciate hearing from anyone who has already experienced a similar issue.
Thank you.
Dom
Makes no sense to me either. Cisco TAC can probably provide guidance on how to look at the sensor-based event rates.
There is no 'elastic' mechanism in the Firepower solution that throttles events on one sensor as a function of what's happening on another.
A new version of eNcore for Splunk is in development. We expect much higher event rates will be supported.
We are targeting late April for posting the new version.
We faced a similar issue, both with the old eStreamer Perl Splunk app and the new eStreamer eNcore Python one. We reported it about a year ago.
One CPU running the python script maxed at 100% with either app.
The log files that the app writes have up-to-date file names (the file names include the date), but the events inside the files gradually get more and more delayed until there is a gap in the logs.
Try this search to see if there are gaps in your logs (select visualization tab):
earliest=-1d sourcetype="cisco:estreamer:data"
| timechart count span=15m
Looked like a square wave for us... lots of events for a few hours, then a gap for about an hour of none.
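For anyone who wants to run the same check outside of Splunk, here is a minimal sketch of the idea behind that search: bucket event times into 15-minute spans and list the buckets that have no events. The function name `find_gaps` and the synthetic timestamps are made up for illustration; it assumes you already have the event times as epoch seconds.

```python
from collections import Counter


def find_gaps(event_times, start, end, span=900):
    """Return the start time of every `span`-second bucket in [start, end)
    that contains no events. Times are epoch seconds."""
    # Count events per bucket index, ignoring anything outside the window.
    counts = Counter(
        (int(t) - start) // span for t in event_times if start <= t < end
    )
    n_buckets = (end - start + span - 1) // span
    return [start + i * span for i in range(n_buckets) if counts[i] == 0]


if __name__ == "__main__":
    # Synthetic data: one event per minute for the first hour,
    # nothing for 30 minutes, then events again for the last 30 minutes.
    events = list(range(0, 3600, 60)) + list(range(5400, 7200, 60))
    print(find_gaps(events, start=0, end=7200))
```

A long run of consecutive empty buckets is the "gap" part of the square wave described above.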
I edited the Python script to make it multiprocess and was able to get rid of the issue and get all my logs on time. It is more of a proof of concept than production code, though. I don't have the time to do the code changes properly, but I had to get it working because we don't have the bandwidth to use syslog (it doubles bandwidth usage if you are also sending logs to the FMC). I did provide the proof-of-concept code to Cisco in September 2017.
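To give a rough idea of what "multiprocess" means here, below is a minimal sketch. It is not the proof-of-concept code sent to Cisco and not how eNcore is structured. It only shows the general pattern of spreading the CPU-heavy decode step across several worker processes with Python's standard `multiprocessing.Pool`. The record layout (two big-endian 32-bit integers) and the names `decode_record` and `process_records` are made up for illustration.

```python
import struct
from multiprocessing import Pool

# Hypothetical toy layout, NOT the real eStreamer wire format:
# record type (uint32) followed by event_sec (uint32), big-endian.
RECORD = struct.Struct(">II")


def decode_record(raw):
    """CPU-bound step: turn one raw record into a dict."""
    rec_type, event_sec = RECORD.unpack(raw[:RECORD.size])
    return {"rec_type": rec_type, "event_sec": event_sec}


def process_records(raw_records, workers=4, chunksize=256):
    """Decode records, optionally across several processes.

    With workers <= 1 everything runs inline in the calling process.
    """
    if workers <= 1:
        return [decode_record(r) for r in raw_records]
    with Pool(processes=workers) as pool:
        # imap yields results in input order; chunksize batches the work
        # so that inter-process overhead does not dominate.
        return list(pool.imap(decode_record, raw_records, chunksize=chunksize))
```

When using more than one worker, call `process_records(raw, workers=4)` from under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform. Whether this helps in practice depends on the decode step really being the bottleneck, as it was for us.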
If you can, just use syslog until they get this working. Seems to be what most folks do.
jtalpash, when you speak of syslog are you talking about normal ASA syslog or is there a method of extracting syslog from the FMC?
Within the Access Control Policy on the FMC we can configure Syslog logging. Unfortunately this gets sent from the FirePOWER sensors and not from the FMC.
It's not the same as the ASA syslog, though.
There is an updated version available... we're in the process of testing it.
Is there a reason you're using file creation time rather than the event_sec field for the timestamp? We found this was causing issues on our side any time the Access Control Policy Metadata event was logged, since it doesn't have the event_sec field. Splunk would then stop indexing the files.
We made several changes which seem to have helped:
1. Changed the monitor input to batch and set to sinkhole. I couldn't come up with a reason to wait for the cleanup script to run and would just rather Splunk deletes the files after indexing.
2. Since we found no value in the Access Control Policy Metadata events, and they didn't have the event_sec field, we excluded these events in estreamer.conf by adding "145" to the exclude section of the records handler.
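As a rough illustration of change 2, the sketch below adds a record type to an exclude list in a JSON config loaded as a Python dict. The key path `handler` > `records` > `exclude` is an assumption based on the description above, so check it against your own estreamer.conf before relying on it. The helper name `exclude_record_type` is made up.

```python
import json


def exclude_record_type(config, record_type):
    """Add `record_type` to the records handler's exclude list (in place).

    Assumes the layout config["handler"]["records"]["exclude"]; missing
    keys are created. Adding the same type twice has no effect.
    """
    records = config.setdefault("handler", {}).setdefault("records", {})
    exclude = records.setdefault("exclude", [])
    if record_type not in exclude:
        exclude.append(record_type)
    return config


if __name__ == "__main__":
    # Stand-in for the relevant part of the config file.
    sample = json.loads('{"handler": {"records": {"exclude": []}}}')
    print(json.dumps(exclude_record_type(sample, 145), indent=2))
```

Editing the file by hand works just as well. The point is only that 145 ends up in the exclude list of the records handler.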
After making these two changes our setup has been running fairly smoothly, even with a single CPU core maxed out. We will see how things go once we add additional locations in the coming months.
Exactly, same shape for us as well. Unfortunately syslog is our last resort, so we would like to try the modified script(s) first. Would you mind sharing them? We have no Python knowledge, so we don't know exactly what to change. If it is tested and working for you without running into race conditions, it would be a great option while we wait for the official update.
After troubleshooting with TAC, they pinpointed the issue to the Python script that retrieves the Connection Events from the FMC. Specifically, the single Python process fully utilizes only one CPU, whereas, given the high log volume, it should have been designed differently. I trust you are aware of this issue, and we are looking forward to the new release.
Definitely aware. Met with your colleague at C Live last week.
Doug,
Let me know if you need any help testing/troubleshooting. We have a Splunk lab instance set up and are generating several million connection events daily, with plans to grow significantly this year. Currently our FMC3500s only retain ~8 days' worth of connection events, even with the FMC internal DB setting increased to 350 million events. We've worked with you on this issue in the past and found the bottleneck has been the single-threaded Python script. We'd love to get this data into Splunk in a near-real-time fashion.
Great work you're doing on the add-on. Let us know how to help.