I am looking for a solution for my current environment:
- Data residing on AWS S3. This data comes from various sources, and we collect it into AWS S3 buckets.
- We are planning to install an HF under the same AWS account where the data is available on S3. The data should be ingested from S3 into the Heavy Forwarder (HF), and then forwarded from the HF to the indexer cluster.
- Since we are getting data from various different sources, do we need to install individual Splunk apps or add-ons for these data types (e.g., Cylance, FireEye) on the HF? A couple of these apps require ingesting data directly from the source device, so it seems we cannot use them for our purpose.
My question is: should we ingest data directly from S3 into the HF and then forward it from the HF to the indexer cluster?
Here is a flow to show end to end picture:
AWS S3 (Data from sources) ->> AWS SQS ->> HF (with Splunk App for AWS to pull data from AWS SQS) ->> Indexer cluster
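For reference, the SQS-based S3 input in the Splunk Add-on for AWS is configured in inputs.conf. A minimal sketch follows; the stanza type and setting names reflect my recollection of the add-on's schema (verify against the add-on documentation), and the queue URL, account name, sourcetype, and index are placeholders:

```ini
# inputs.conf on the instance running the Splunk Add-on for AWS
# (aws_sqs_based_s3 stanza per the add-on; all values are placeholders)
[aws_sqs_based_s3://s3_security_logs]
aws_account = my_aws_account
sqs_queue_url = https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-events
sqs_queue_region = us-east-1
sourcetype = aws:s3          ; or a vendor-specific sourcetype per data type
index = security
interval = 300
```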
Thank you for your response. The purpose of putting an HF between AWS S3 and the indexer cluster is that we don't want the indexer cluster to pull data and spend CPU/memory on that task. The HF is there only to pull data from S3 and forward it to the indexer cluster. The indexer cluster is in a different AWS account, and we have access to its endpoints. We want indexing and everything else to happen on that indexer cluster. For this reason, we don't want to install any log-type-specific (e.g., Cylance, AMP) add-ons/apps on the HF.
My other question is: since log-specific apps usually require direct ingestion from the source device into the app, we cannot use those apps. So, can we rely on the default indexing, i.e., the Selected fields and Interesting fields the indexer produces for data ingested into the indexer cluster?
You may want "indexing and all" to happen on the cluster, but it doesn't work that way. Heavy forwarders are indexers that don't store data. They will do all the work of indexing (parsing, typing, etc.) and then send the results to the cluster to be stored. That cannot be changed and is why any apps that assist with parsing must be installed on the HF.
Another option is to replace the HF with a Universal Forwarder (UF). UFs do virtually no parsing themselves so all of the work will be done by the indexer cluster.
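To illustrate the UF option: a forwarder's outputs.conf just points at the cluster's receiving ports, and the UF ships mostly-unparsed data. A minimal sketch, assuming hypothetical host names and the default receiving port:

```ini
# outputs.conf on the universal forwarder (hosts/ports are placeholders)
[tcpout]
defaultGroup = idx_cluster

[tcpout:idx_cluster]
server = idx1.example.com:9997, idx2.example.com:9997
; with an indexer cluster, indexer discovery via the cluster manager
; (indexerDiscovery) is also an option so the server list stays current
useACK = true
```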
Since you already have your data in S3, the only parts of the apps you need are the ones that don't input data (no inputs.conf, nothing in /bin).
You can use default indexing if you wish, but you may be disappointed with the results. Apps are often created precisely to improve on default indexing.
Thanks. Looks like we have to go for UFs as per your suggestion.
Could you please elaborate more on your following comment?:
"Since you already have your data in S3, the only parts of the apps you need are the ones that don't input data (no inputs.conf, nothing in /bin)."
Could you please point us to a document that provides more details on the above? Being new Splunk users, we don't have a complete understanding of it. If I understand correctly, you are saying that we don't need to configure these log-specific apps to pull data directly from the devices, which is what most of these apps require.
Apps and add-ons have three main functions - bringing data in, processing it, and displaying it. Not all apps do all three. It's not necessary to use an app/add-on for all of its capabilities, either.
In fact, it's standard (and necessary) practice to disable parts of an app depending on the instance type on which it is installed. See https://docs.splunk.com/Documentation/AddOns/released/Overview/Distributedinstall
An add-on that uses API calls to ingest data likely will also have props.conf settings that process the data. Your data is already in S3 so you don't need the API calls. But you do need the props.conf file for field extractions and other needs.
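As a sketch of what the "keep only the parsing pieces" advice looks like in practice, here is a parsing-only props.conf for a hypothetical sourcetype. The sourcetype name and patterns are illustrative placeholders, not taken from any real add-on; the setting names themselves are standard props.conf settings:

```ini
# props.conf -- parsing and search-time settings only; no inputs, no scripts
# (sourcetype name and patterns below are illustrative placeholders)
[vendor:example:log]
; index-time parsing, applied on the HF or indexers
TIME_PREFIX = ^timestamp=
TIME_FORMAT = %Y-%m-%dT%H:%M:%S%z
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TRUNCATE = 10000
; search-time field extraction, applied on the search head
EXTRACT-severity = severity=(?<severity>\w+)
```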
One more question: since we can't use heavy forwarders for our use case as discussed above, it seems we can't use universal forwarders either, because we have to pull the data/logs from the AWS S3 buckets and ingest them into our indexer cluster. Could you please suggest another way to satisfy our requirements?
In a nutshell, we want to fetch data from S3, send it to an intermediate forwarder (heavy/universal), and then have that forwarder send the same data on to the indexer cluster. In other words, is there a way to disable indexing at the heavy forwarder level? If the answer is no, what options do we have for this use case?
How is the data being read from S3? The answer will determine which forwarder can be used.
There is no way to prevent a HF from parsing data.
We were planning to use the Splunk AWS app to read data from S3. Now it seems we can't install this app on a universal forwarder and must use a heavy forwarder (HF). But since the HF performs parsing and indexing work, using it causes unnecessary processing before the HF forwards the data to the indexer cluster. We just want to use the HF as a forwarder; all the indexing, parsing, etc., we want to happen at the indexer cluster level.
Have you considered Kinesis Firehose?
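For context on that suggestion: Kinesis Firehose delivers events to Splunk over the HTTP Event Collector (HEC), which runs on the indexers (or a dedicated HEC tier), so no forwarder pulls from S3 at all. A minimal sketch of the Splunk-side HEC configuration, with a placeholder token and stanza name:

```ini
# inputs.conf on the HEC-receiving Splunk instance (token is a placeholder)
[http]
disabled = 0
enableSSL = 1

[http://firehose]
token = 00000000-0000-0000-0000-000000000000
sourcetype = aws:firehose
index = security
disabled = 0
```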