Getting Data In

Sourcefire Encore data ingestion issue

vik_splunk
Communicator

Hi All,

We recently upgraded Splunk from 7.2.6 to 8.1.3, and since then we have been having issues with Sourcefire ingestion from the FMC.

Splunk and eStreamer TA versions - prior to upgrade - 7.2.6 and 3.0.0

Splunk and eStreamer TA versions - post upgrade - 8.1.3 and 4.6.0

TA used - https://splunkbase.splunk.com/app/3662/


What we've attempted so far

  1. TA 3.0.0 with compatibility enabled for Python 2.x - errors out with "Connection reset by peer":

    estreamer.subscriber ERROR    error:
    Traceback (most recent call last):
      File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/subscriber.py", line 198, in start
        self.connection.connect()
      File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/connection.py", line 80, in connect
        self.socket.connect( ( host, port ) )
      File "/opt/splunk/lib/python2.7/ssl.py", line 864, in connect
        self._real_connect(addr, False)
      File "/opt/splunk/lib/python2.7/ssl.py", line 855, in _real_connect
        self.do_handshake()
      File "/opt/splunk/lib/python2.7/ssl.py", line 828, in do_handshake
        self._sslobj.do_handshake()
    error: [Errno 104] Connection reset by peer

  2. Upgraded TA 4.6.0 on Splunk 8.1.3 - the connection succeeds and logs are collected for a while, but then we are met with the error "Invalid JSON in settings file" followed by a "Process subscriberParser is dead" message. We also found this bug reference, which is similar to the error:

    https://quickview.cloudapps.cisco.com/quickview/bug/CSCvy06369

  3. Upgraded TA 4.6.0 with compatibility enabled for Python 2.x - same as above: the connection succeeds but collection eventually stops after a while, erroring out with the same messages as in 2.
  4. Fresh install of 4.6.0 followed by a fresh config - connects fine to the FMC but errors out as below:
    "Error state. Clearing queue"

    In a nutshell, what used to be a stable stream of logs from the FMC is now completely broken/fragmented. In all cases we are able to use the splencore test to establish a successful connection (the kind of checks involved are sketched below), and we have restarted the service, but no luck.

    We have been through all the community articles, as well as all the suggested troubleshooting, but no luck. Any advice on getting this working is much appreciated.
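
For reference, the connectivity checks are roughly of this shape (a sketch only - the FMC hostname and certificate paths are placeholders, and port 8302 is the eStreamer port that appears in the logs further down this thread):

# 1) Verify the TLS handshake to the FMC eStreamer port using the client
#    certificate registered on the FMC (placeholder paths). A reset at this
#    point matches the "[Errno 104] Connection reset by peer" error above.
openssl s_client -connect <fmc-host>:8302 \
    -cert /path/to/encore_client_cert.pem \
    -key /path/to/encore_client_key.pem < /dev/null

# 2) Run the TA's built-in connectivity test.
cd /opt/splunk/etc/apps/TA-eStreamer/bin
./splencore.sh test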

@douglashurd - can you please advise? Thanks!

0 Karma
1 Solution

vik_splunk
Communicator

Posting an update in the hope that it will help someone.

We had a ticket open with Cisco support, and after a few iterations of the TA, the most recent eStreamer version, 4.8.1 (https://splunkbase.splunk.com/app/3662/), fixed the issue for us.

Thanks to @skhademd for delivering the fix!

 

Anyone encountering this issue, please upgrade to version 4.8.1 of the TA, as that seems to have fixed it.

View solution in original post

0 Karma

_joe
Communicator

We had 4.8.1 crash in the lab once so far... Now we are running it with the "alwaysAttemptToContinue": true setting. Has this been stable for more than a week for anyone else ingesting at least a million events per day?

0 Karma

vik_splunk
Communicator

Sorry, I completely missed this earlier, @_joe. We did not have to add the alwaysAttempt... setting, and it was stable for 2-3 weeks. We have quite an intensive volume of logging and it coped well.

Three weeks later, we had a failure with the same Invalid JSON error and just had to restart eNcore. The issue has been reported to Cisco and they are investigating. Meanwhile, our workaround is still the tried-and-tested eNcore service restart. Thanks!

0 Karma

_joe
Communicator

Thanks. It turns out we were hitting another issue in the lab - the eNcore app actually cannot handle low volume without a few tweaks. We had a call with Cisco about it - they mostly just commented that no one would have that issue in production. We are running 4.8.1 in prod now - so far so good.

Honestly, I have actually been restarting the eNcore client each night for a while now. I started doing that when I thought my event volume decreased after FMC updates until the next restart.

I don't think this is an issue other people have, but I found that after restarting each night I was statistically getting more events.
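
In case it helps anyone, the nightly restart is nothing fancy - roughly something like this (a sketch only; the stop/start subcommands are an assumption, so substitute whatever restart mechanism your deployment actually uses):

#!/bin/bash
# restart_encore.sh - nightly eNcore client restart (illustrative sketch)
cd /opt/splunk/etc/apps/TA-eStreamer/bin
./splencore.sh stop      # assumption: your version may use a different command
sleep 10
./splencore.sh start

# scheduled via cron, e.g.:
# 0 2 * * * /opt/scripts/restart_encore.sh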

0 Karma

_joe
Communicator

We were almost perfect for 2 weeks, but today we keep crashing. I will report back what Cisco says.

0 Karma

_joe
Communicator

Hope everyone else is doing well. We have been stable on eNcore 4.8.2 (for a while now), but ONLY with the following changes to estreamer.conf:

Response timeout ("responseTimeout": 30)
Always attempt to continue ("alwaysAttemptContinue": true)
Extended requests ("extended": true under subscription.records)
Batch size ("batchSize": 2)
Number of worker processes ("workerProcesses": 1)

These are changes we made as directed by Cisco. For example, we lowered workerProcesses to avoid issues (they said that because our ingest was low, more worker processes would cause problems), not because we didn't have enough CPUs.

 

# estreamer.conf
{
    "connectTimeout": 10,
    "alwaysAttemptContinue": true,
    "batchSize":2,
    "enabled": true,
    "handler": {
        "output@comment": "If you disable all outputters it behaves as a sink",
        "outputters": [
            {
                "adapter": "splunk",
                "enabled": true,
                "stream": {
                    "options": {
                        "maxLogs": 10000,
                        "rotate": true
                    },
                    "uri": "relfile:///data/splunk/encore.{0}.log"
                }
            }
        ],
        "records": {
            "connections": false,
            "core": true,
            "excl@comment": [
                "These records will be excluded regardless of above (overrides 'include')",
                "e.g. to exclude flow and IPS events use [ 71, 400 ]"
            ],
            "exclude": [],
            "inc@comment": "These records will be included regardless of above",
            "include": [],
            "intrusion": true,
            "metadata": true,
            "packets": true,
            "rna": true,
            "rua": true
        }
    },
    "logging": {
        "filepath": "estreamer.log",
        "format": "%(asctime)s %(name)-12s %(levelname)-8s %(message)s",
        "lev@comment": "Levels include FATAL, ERROR, WARNING, INFO, DEBUG, VERBOSE and TRACE",
        "level": "INFO",
        "stdOut": true
    },
    "monitor": {
        "bookmark": false,
        "handled": true,
        "period": 120,
        "subscribed": true,
        "velocity": false
    },
    "responseTimeout": 30,
    "star@comment": "0 for genesis, 1 for now, 2 for bookmark",
    "start": 2,
    "subscription": {
        "records": {
            "@comment": [
                "Just because we subscribe doesn't mean the server is sending. Nor does it mean",
                "we are writing the records either. See handler.records[]"
            ],
            "archiveTimestamps": true,
            "eventExtraData": true,
            "extended": true,
            "impactEventAlerts": true,
            "intrusion": true,
            "metadata": true,
            "packetData": true
        },
        "servers": [
            {
			<obviously omitted>
            }
        ]
    },
    "workerProcesses": 1
}
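
Given how often the "Invalid JSON in settings file" error comes up in this thread, one habit that helps after hand-editing estreamer.conf is validating the file before restarting the client. Something like this (the path is an example - point it at your actual settings file):

# Validate the edited settings file before restarting eNcore.
python -m json.tool /opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer.conf \
    > /dev/null && echo "estreamer.conf: valid JSON" \
    || echo "estreamer.conf: JSON error - fix before restarting"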

 

 

0 Karma

vik_splunk
Communicator

This is great information, @_joe. Thanks for sharing.

It might actually help us as well.

0 Karma

Caroffino
Loves-to-Learn Lots

@_joe 

Yeah Joe, I would really like a follow-up, if you don't mind, on how things have been running. We are using Splunk Cloud with a deployment server and heavy forwarder, and eStreamer has been a bear to keep running. We have only had this working for a week at this point, and have spent most of that time tinkering to get it restarted every other day.

0 Karma

_joe
Communicator

Hello,

Nothing has changed from my prior note. It is working pretty well, and I no longer have scripted restarts. As a side note, restarting eStreamer doesn't work very well and might even cause issues. Our ingest volume is about 5 million events a day with "workerProcesses": 1. I am not sure what a good workerProcesses-to-ingest-rate ratio would be, but more workerProcesses did cause us issues.

I believe we may have been told by Cisco that we should upgrade to version 5.x, as other people have had special characters crash their ingest, but we haven't had that happen. I'm sure you've noticed, but it looks like they pushed 5.10 a few days ago. We are still on 4.8.

Unfortunately, it really does seem like Cisco is using their customers as their dev department which is frustrating.

0 Karma

douglashurd
Builder

Thanks for the message!  We're looking at it.  Appreciate all the details.

 

Doug

 

0 Karma

_joe
Communicator

Just wanted to comment in solidarity. I upgraded to 4.6.0 on Tuesday. Everything crashed 3 days later with the following error. Everything is WORKING AGAIN with 4.6.2 so far. Thanks!

2021-07-02 11:29:13,211 Connection INFO Connecting to 172.26.12.30:8302
2021-07-02 11:29:13,212 Connection INFO Using TLS v1.2
2021-07-02 11:29:13,213 Monitor INFO Starting Monitor.
2021-07-02 11:29:13,213 Decorator INFO Starting process.
2021-07-02 11:29:13,218 Transformer INFO Starting process.
2021-07-02 11:29:13,219 Writer INFO Starting process.
2021-07-02 11:29:13,219 Monitor INFO Starting. 0 handled; average rate 0 ev/sec;
2021-07-02 11:31:13,353 Controller INFO Process subscriberParser is dead.
2021-07-02 11:31:13,353 Monitor INFO Running. 0 handled; average rate 0 ev/sec;
2021-07-02 11:31:13,875 Controller INFO Stopping...
2021-07-02 11:31:13,875 Controller INFO Process 82246 (Process-1) exit code: 1
2021-07-02 11:31:13,880 Decorator INFO Stop message received
2021-07-02 11:31:13,885 Decorator INFO Error state. Clearing queue
2021-07-02 11:31:13,885 Decorator INFO Exiting
2021-07-02 11:31:13,885 Controller INFO Process 82247 (Process-2) exit code: 0
2021-07-02 11:31:13,885 Transformer INFO Stop message received
2021-07-02 11:31:13,891 Transformer INFO Error state. Clearing queue
2021-07-02 11:31:13,891 Transformer INFO Exiting
2021-07-02 11:31:13,896 Controller INFO Process 82249 (Process-3) exit code: 0
2021-07-02 11:31:13,901 Writer INFO Stop message received
2021-07-02 11:31:13,906 Writer INFO Error state. Clearing queue
2021-07-02 11:31:13,907 Writer INFO Exiting
2021-07-02 11:31:13,907 Controller INFO Process 82251 (Process-4) exit code: 0
2021-07-02 11:31:13,907 Monitor INFO Stopping Monitor.
2021-07-02 11:31:14,102 Controller INFO Goodbye

 

0 Karma

vik_splunk
Communicator

@_joe @elee_splunk @skhademd @douglashurd We have since made 4.6.2 work for us, overcoming the subscriberParser errors. However, 6-8 hours after it was set up, it failed with the same Invalid JSON error. Kindly confirm whether it's still working in your respective environments - we just want to ascertain whether the issue is specific to our environment or is a problem with the TA.

 

2021-07-05 22:54:12,367 Receiver     DEBUG    FMC sent no data
2021-07-05 22:54:12,873 Monitor      INFO     Running. 190600 handled; average rate 12.51 ev/sec;
2021-07-05 22:54:22,875 Receiver     DEBUG    FMC sent no data
2021-07-05 22:54:33,385 Receiver     DEBUG    FMC sent no data
2021-07-05 22:55:01,942 Receiver     DEBUG    FMC sent no data
2021-07-05 22:55:12,452 Receiver     DEBUG    FMC sent no data
2021-07-05 22:55:22,962 Receiver     DEBUG    FMC sent no data
2021-07-05 22:55:33,483 Receiver     DEBUG    FMC sent no data
2021-07-05 22:55:50,507 Receiver     DEBUG    FMC sent no data
2021-07-05 22:56:01,014 Receiver     DEBUG    FMC sent no data
2021-07-05 22:56:11,526 Receiver     DEBUG    FMC sent no data
2021-07-05 22:56:12,040 Monitor      INFO     Running. 190600 handled; average rate 12.41 ev/sec;
2021-07-05 22:56:22,032 Receiver     DEBUG    FMC sent no data
2021-07-05 22:56:32,542 Receiver     DEBUG    FMC sent no data
2021-07-05 22:56:39,564 Receiver     DEBUG    Got null message.
2021-07-05 22:56:59,585 Receiver     DEBUG    FMC sent no data
2021-07-05 22:57:17,618 Receiver     DEBUG    FMC sent no data
2021-07-05 22:57:28,125 Receiver     DEBUG    FMC sent no data
2021-07-05 22:57:38,639 Receiver     DEBUG    FMC sent no data
2021-07-05 22:57:49,150 Receiver     DEBUG    FMC sent no data
2021-07-05 22:57:59,660 Receiver     DEBUG    FMC sent no data
2021-07-05 22:58:00,677 Monitor      INFO     Running. 190600 handled; average rate 12.31 ev/sec;
2021-07-05 22:58:10,677 Receiver     DEBUG    FMC sent no data
2021-07-05 22:58:25,697 Receiver     DEBUG    FMC sent no data
2021-07-05 22:58:36,209 Receiver     DEBUG    FMC sent no data
2021-07-05 22:58:46,718 Receiver     DEBUG    FMC sent no data
2021-07-05 22:59:02,752 Receiver     DEBUG    FMC sent no data
2021-07-05 22:59:21,775 Receiver     DEBUG    FMC sent no data
2021-07-05 22:59:32,286 Receiver     DEBUG    FMC sent no data
2021-07-05 22:59:42,800 Receiver     DEBUG    FMC sent no data
2021-07-05 22:59:53,310 Receiver     DEBUG    FMC sent no data
2021-07-05 23:00:03,819 Receiver     DEBUG    FMC sent no data
2021-07-05 23:00:08,849 Monitor      INFO     Running. 190600 handled; average rate 12.22 ev/sec;
2021-07-05 23:00:24,864 Receiver     DEBUG    FMC sent no data
2021-07-05 23:00:35,374 Receiver     DEBUG    FMC sent no data
2021-07-05 23:00:45,886 Receiver     DEBUG    FMC sent no data
2021-07-05 23:00:56,398 Receiver     DEBUG    FMC sent no data
2021-07-05 23:01:09,215 Service      ERROR    [no message or attrs]: Invalid JSON in settings file
2021-07-06 03:39:08,725 Configure    ERROR    Invalid JSON in file

0 Karma

vik_splunk
Communicator

@_joe  Thanks.

Unfortunately, 4.6.2 leads us to the subscriberParser error that you had mentioned.

Just to compare environments: we run Splunk 8.1.3 on RHEL 6.x, with FMC 6.6.4.

Can you please confirm your setup?

0 Karma

AlexS
Loves-to-Learn

Hi Douglas,

it seems I'm facing the same issue.

Version 4.0.9 suddenly stopped working, and ran into the following error:
Decorator ERROR [no message or attrs]: 'View' object has no attribute '_view__isHex'...

As I did not get it to work again (restart,reboots...), I tried the new 4.6.0 release.

In the beginning it looked fine (besides the missing status and clean options in splencore.sh, which I added back), and logs were written and ingested into Splunk.
After almost 3 hours, the error "Service ERROR [no message or attrs]: Invalid JSON in settings file" appeared in estreamer.log, but logs were still being written and the input to Splunk still worked - except for estreamer.log itself: the monitor messages stopped and nothing else was written.

Then, more than 7 hours after the "Invalid JSON in settings file" error, the FMC logs stopped too.

After a restart of the client, it started working again. The "Invalid JSON..." error reappeared after only 2 hours. This time I did not wait for the FMC logs to stop and restarted the client right away. It has now been 5 hours with no errors, but judging by the other feedback here, it's only a matter of time.

Did you already identify any issues?

Thanks,
Alex

0 Karma

vik_splunk
Communicator

Hi @douglashurd, just wanted to follow up and check whether you were able to find anything of interest / a possible fix?

Meanwhile, we managed to test in a different environment with FMC 6.4.4 and that seems to be stable in comparison.

Versions used - Splunk 8.1.3 , Python 3 , Add on ver. 4.6.0 and FMC 6.4.4

Interestingly, we do see the same error as before, but the key difference is that it does not result in an abrupt stoppage of logs. It seems to offer the same stability we had before the upgrade.

As can be seen below, estreamer.log stopped at 09:39 EDT yesterday with the same error as mentioned in the bug report. However, a subsequent listing of the data folder shows a continuous inflow of logs.

2021-06-16 09:33:32,517 Monitor INFO Running. 697900 handled; average rate 86.8 ev/sec;
2021-06-16 09:35:31,612 Monitor INFO Running. 697900 handled; average rate 85.52 ev/sec;
2021-06-16 09:37:31,434 Monitor INFO Running. 698000 handled; average rate 84.29 ev/sec;
2021-06-16 09:39:31,593 Monitor INFO Running. 698000 handled; average rate 83.09 ev/sec;
2021-06-16 09:39:32,450 Service ERROR [no message or attrs]: Invalid JSON in settings file

 

pwd
/opt/splunk/etc/apps/TA-eStreamer/bin/encore/data/splunk
ls -ltr
Jun 17 04:36 encore.1623902980.log

0 Karma

elee_splunk
Loves-to-Learn Everything

Hi @douglashurd ,

I am also seeing the same issue with the latest TA on Splunk 8.2. It seems like every day or every other day I get the same error and it stops pulling logs.

2021-06-21 08:10:12,573 Monitor INFO Running. 62720000 handled; average rate 1371.82 ev/sec;
2021-06-21 08:12:11,573 Monitor INFO Running. 62860500 handled; average rate 1371.3 ev/sec;
2021-06-21 08:14:11,827 Monitor INFO Running. 63002100 handled; average rate 1370.8 ev/sec;
2021-06-21 08:16:12,215 Monitor INFO Running. 63148100 handled; average rate 1370.4 ev/sec;
2021-06-21 08:16:43,508 Service ERROR [no message or attrs]: Invalid JSON in settings file

0 Karma

vik_splunk
Communicator

The error notwithstanding, it seems to be as stable as before following our upgrade to FMC 6.4.4.

It appears to have been caused by the bug noted earlier in this ticket.

To conclude, the "Invalid JSON..." error is still appearing on the new FMC version as well, but it continues to ingest logs anyway.

 

Marking this as the solution @douglashurd . Unless you see any fix that is required to the TA, my issue is fixed now. Thanks!

0 Karma

skhademd
New Member

Hi,

We believe this is a bug in the splencore.sh script, specifically with the clean command. In a prior update we modified the location of the ingest directory, but this script was not updated to reflect the new location, which we believe is causing the error you are seeing. Please perform the following:

 

1) Disable the clean script

2) Modify splencore.sh to point to the proper ingest directory (see the sketch below)

3) Upgrade the TA to 4.6.1, which includes the modification above

If problems continue, please open a ticket with TAC support so we can collect additional troubleshooting files.
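
For step 2, the change is roughly of this shape (an illustrative sketch only, not the actual 4.6.1 diff; the variable names are made up, and the paths come from the error and directory listing reported elsewhere in this thread):

# The clean command fails because it looks for the ingest directory
# relative to the current working directory:
#   find: `./encore/data': No such file or directory
# Anchoring the path to the script's own location avoids that:
APP_BIN="$( cd "$( dirname "$0" )" && pwd )"      # .../apps/TA-eStreamer/bin
DATA_DIR="$APP_BIN/encore/data/splunk"            # actual ingest location
# ...the clean logic then operates on "$DATA_DIR" instead of "./encore/data",
# e.g. find "$DATA_DIR" -name 'encore.*.log' ...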

 

Thanks

 

Seyed

 

 

0 Karma

vik_splunk
Communicator

Thanks @skhademd 

The new TA is available now. Will validate and confirm.

0 Karma

vik_splunk
Communicator

@skhademd @douglashurd 

We have installed 4.6.1 afresh and configured it from scratch to connect to our FMCs (with DEBUG logging enabled). Collection has resumed in Splunk. However, here are the initial observations.

1) Data collection is always 10-15 minutes behind the current time.

2) Checking the estreamer logs, the events-per-second rate is consistently decreasing. Not sure if it's heading toward a complete stoppage of collection.

2021-06-30 14:19:58,631 Monitor INFO Running. 103400 handled; average rate 25.34 ev/sec;
2021-06-30 14:20:08,633 Receiver DEBUG FMC sent no data
2021-06-30 14:20:23,145 Receiver DEBUG FMC sent no data
2021-06-30 14:20:33,658 Receiver DEBUG FMC sent no data
2021-06-30 14:20:47,175 Receiver DEBUG FMC sent no data
2021-06-30 14:20:57,684 Receiver DEBUG FMC sent no data
2021-06-30 14:21:08,190 Receiver DEBUG FMC sent no data
2021-06-30 14:21:18,699 Receiver DEBUG FMC sent no data
2021-06-30 14:21:29,209 Receiver DEBUG FMC sent no data
2021-06-30 14:21:36,240 Receiver DEBUG Got null message.
2021-06-30 14:21:47,252 Receiver DEBUG FMC sent no data
2021-06-30 14:21:57,764 Receiver DEBUG FMC sent no data
2021-06-30 14:21:58,274 Monitor INFO Running. 103400 handled; average rate 24.62 ev/sec;
2021-06-30 14:22:08,267 Receiver DEBUG FMC sent no data
2021-06-30 14:22:18,778 Receiver DEBUG FMC sent no data
2021-06-30 14:22:29,289 Receiver DEBUG FMC sent no data
2021-06-30 14:22:36,301 Receiver DEBUG Got null message.
2021-06-30 14:22:46,312 Receiver DEBUG FMC sent no data
2021-06-30 14:22:56,822 Receiver DEBUG FMC sent no data
2021-06-30 14:23:14,347 Receiver DEBUG FMC sent no data
2021-06-30 14:23:29,368 Receiver DEBUG FMC sent no data
2021-06-30 14:23:43,386 Receiver DEBUG FMC sent no data
2021-06-30 14:23:47,402 Monitor INFO Running. 103400 handled; average rate 23.93 ev/sec;
2021-06-30 14:23:57,406 Receiver DEBUG FMC sent no data
2021-06-30 14:24:07,915 Receiver DEBUG FMC sent no data
2021-06-30 14:24:22,436 Receiver DEBUG FMC sent no data
2021-06-30 14:24:32,946 Receiver DEBUG FMC sent no data
2021-06-30 14:24:43,454 Receiver DEBUG FMC sent no data
2021-06-30 14:24:53,959 Receiver DEBUG FMC sent no data
2021-06-30 14:25:04,471 Receiver DEBUG FMC sent no data
2021-06-30 14:25:11,492 Receiver DEBUG Got null message.
2021-06-30 14:25:21,498 Receiver DEBUG FMC sent no data
2021-06-30 14:25:32,006 Receiver DEBUG FMC sent no data
2021-06-30 14:25:42,518 Receiver DEBUG FMC sent no data
2021-06-30 14:25:53,028 Receiver DEBUG FMC sent no data
2021-06-30 14:25:53,542 Monitor INFO Running. 103400 handled; average rate 23.29 ev/sec;
2021-06-30 14:26:03,540 Receiver DEBUG FMC sent no data
2021-06-30 14:26:11,564 Receiver DEBUG Got null message.
2021-06-30 14:26:21,563 Receiver DEBUG FMC sent no data
2021-06-30 14:26:32,076 Receiver DEBUG FMC sent no data
2021-06-30 14:26:42,584 Receiver DEBUG FMC sent no data
2021-06-30 14:26:53,089 Receiver DEBUG FMC sent no data
2021-06-30 14:27:03,601 Receiver DEBUG FMC sent no data
2021-06-30 14:27:11,626 Receiver DEBUG Got null message.
2021-06-30 14:27:21,636 Receiver DEBUG FMC sent no data
2021-06-30 14:27:32,149 Receiver DEBUG FMC sent no data
2021-06-30 14:27:49,673 Receiver DEBUG FMC sent no data
2021-06-30 14:28:00,183 Receiver DEBUG FMC sent no data
2021-06-30 14:28:00,697 Monitor INFO Running. 103400 handled; average rate 22.67 ev/sec;
2021-06-30 14:28:10,688 Receiver DEBUG FMC sent no data
2021-06-30 14:28:21,193 Receiver DEBUG FMC sent no data
2021-06-30 14:28:31,701 Receiver DEBUG FMC sent no data

3) The clean-up script has an issue with absolute vs. relative paths, I suspect, as I notice this error in the Splunk internal logs:

06-30-2021 14:17:27.087 -0400 ERROR ExecProcessor - message from "/opt/splunk/etc/apps/TA-eStreamer/bin/splencore.sh clean" find: `./encore/data': No such file or directory

0 Karma