We’re excited to announce a powerful update to Splunk Data Management with added support for Amazon Data Firehose in Edge Processor! This enhancement enables you to use Amazon Data Firehose (formerly Amazon Kinesis Data Firehose) as a data source, offering greater flexibility and efficiency in managing data streams. With integrations across more than 20 AWS services, you can now easily stream data into Splunk from sources like Amazon CloudWatch, SNS, AWS WAF, Network Firewall, IoT, and more.
With this update, Edge Processor can directly ingest logs from Amazon Data Firehose, enabling seamless streaming from various AWS services into Splunk for real-time analysis and visualization. Whether you’re monitoring cloud infrastructure, applications, or security events, this addition broadens your data source options, enhances your ability to gain real-time insights, and simplifies data pipeline management, reducing latency and giving you faster access to critical data.
This release also introduces another crucial feature in Edge Processor: receiver acknowledgement for upstream HTTP Event Collector (HEC) data. This preserves data integrity by ensuring that HEC events sent to the processor are properly received and acknowledged, providing an additional layer of confidence that no information is lost in transit between data inputs and Edge Processors.
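For context, senders that use Splunk’s standard HEC indexer acknowledgment flow send each event on a channel, receive an ackId in the response, and then poll the ack endpoint to confirm delivery. The sketch below illustrates that flow from the sender’s side only; the hostname, port, token, and channel ID are placeholders, and the exact acknowledgement behavior of your Edge Processor should be confirmed against the official documentation.

```python
import uuid
import requests

# Placeholder values; substitute your Edge Processor's HEC endpoint and token.
HEC_URL = "https://my-edge-processor.example.com:8088"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"
CHANNEL = str(uuid.uuid4())  # a channel ID is required for acknowledgements

headers = {
    "Authorization": f"Splunk {HEC_TOKEN}",
    "X-Splunk-Request-Channel": CHANNEL,
}

# Send an event; when acknowledgements are enabled, the response includes an ackId.
resp = requests.post(
    f"{HEC_URL}/services/collector/event",
    headers=headers,
    json={"event": "hello from an acknowledged HEC sender", "sourcetype": "test"},
)
ack_id = resp.json().get("ackId")

# Poll the ack endpoint to confirm the event was durably received.
ack_resp = requests.post(
    f"{HEC_URL}/services/collector/ack",
    headers=headers,
    json={"acks": [ack_id]},
)
print(ack_resp.json())  # e.g. {"acks": {"0": true}} once the event is acknowledged
```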
In the following sections, we’ll guide you through how to integrate Amazon Data Firehose into your existing Splunk setup. Specifically, we’ll focus on setting up a HEC token for your Edge Processor, configuring VPC flow log ingestion into Splunk via Amazon Data Firehose, and achieving network traffic CIM compliance using SPL2 pipelines. An architectural diagram illustrating the high-level components involved in this setup can be seen below.
You can also view this step-by-step guide in Lantern.
Note: The following steps assume you already have access to the following: an Edge Processor tenant with a paired EC stack, an Edge Processor instance running on a machine with an accessible URL, and an AWS account. Furthermore, to ensure proper data ingestion, your Edge Processors’ HEC receivers should accept data over TLS—not mTLS. This can be configured in your tenant’s web UI.
HEC tokens are used by the HTTP Event Collector to authenticate and authorize data sent to Splunk. These tokens securely manage data intake from various sources over HTTP/HTTPS, ensuring that only authorized data is accepted and properly categorized for analysis. Fortunately, the process of generating and setting up a token for use within your Edge Processor is relatively straightforward:
Now that a valid HEC token has been generated, it’s time to apply it to your Edge Processor:
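Once the token has been added to your Edge Processor, a quick way to confirm it is being accepted is to send a test event directly to the instance’s HEC endpoint. The following is a minimal sketch; the URL and token are placeholders for your own environment.

```python
import requests

# Placeholder endpoint and token; replace with your Edge Processor's URL and HEC token.
resp = requests.post(
    "https://my-edge-processor.example.com:8088/services/collector/event",
    headers={"Authorization": "Splunk 00000000-0000-0000-0000-000000000000"},
    json={"event": "HEC token smoke test", "sourcetype": "test"},
)
resp.raise_for_status()
print(resp.json())  # a successful send returns {"text": "Success", "code": 0}
```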
VPC flow logs capture essential information about the IP traffic to and from network interfaces in your Virtual Private Cloud. By streaming these logs through Amazon Data Firehose, you can efficiently route the data to Edge Processor for real-time processing and analysis, enabling deeper insights within your Splunk environment. To set this up, you’ll first need to create a Firehose stream:
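If you’d rather script this step than click through the AWS console, the same stream can be created with the AWS SDK. The boto3 sketch below uses placeholder names, ARNs, and endpoints, and assumes you already have an S3 bucket and IAM role available for backing up failed deliveries; adjust the endpoint type and values to match your setup.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Placeholder names, ARNs, and endpoint; adjust for your environment.
firehose.create_delivery_stream(
    DeliveryStreamName="vpc-flow-logs-to-edge-processor",
    DeliveryStreamType="DirectPut",
    SplunkDestinationConfiguration={
        # Your Edge Processor instance's HEC endpoint and the token created earlier.
        "HECEndpoint": "https://my-edge-processor.example.com:8088",
        "HECEndpointType": "Raw",  # or "Event", depending on how you want records wrapped
        "HECToken": "00000000-0000-0000-0000-000000000000",
        "HECAcknowledgmentTimeoutInSeconds": 180,
        "RetryOptions": {"DurationInSeconds": 300},
        # Failed events are backed up to S3 so nothing is silently dropped.
        "S3BackupMode": "FailedEventsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-backup-role",
            "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",
        },
    },
)
```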
To test whether you’ve configured everything correctly before moving on, navigate to your newly created Firehose stream and expand the panel titled “Test with demo data”. After you click the “Start sending demo data” button, dummy data should be routed from your Firehose stream through your Edge Processor instance. To verify this is working as expected, select the “Edge Processors” tab on the left-hand side of your tenant’s UI and double-click the row containing your Edge Processor. Within a minute or two, the “Data flowing through in the last 30 minutes” metrics in the bottom-right corner of the page should reflect a small amount of inbound data, likely categorized by the default source and sourcetype values specified previously. If this isn’t the case, check your Firehose stream’s destination error logs in Amazon CloudWatch.
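If you want to pull those error logs programmatically, Firehose writes delivery failures to a CloudWatch log group that is typically named after the stream (assuming error logging is enabled on the stream). A small boto3 sketch, with the region and stream name as placeholders:

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Firehose delivery errors typically land in /aws/kinesisfirehose/<stream-name>.
events = logs.filter_log_events(
    logGroupName="/aws/kinesisfirehose/vpc-flow-logs-to-edge-processor",
)
for event in events.get("events", []):
    print(event["message"])
```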
With the Firehose stream now configured to send data to your Edge Processor instance, the final step is to create a VPC flow log and direct it to the Firehose stream:
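As with the Firehose stream, this step can be scripted instead of done through the console. The boto3 sketch below assumes a placeholder VPC ID and the ARN of the Firehose stream created above, and leaves the log format at its default so the record layout matches the pipeline built in the next section.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder VPC ID and Firehose stream ARN; replace with your own.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",  # capture both accepted and rejected traffic
    LogDestinationType="kinesis-data-firehose",
    LogDestination=(
        "arn:aws:firehose:us-east-1:123456789012:"
        "deliverystream/vpc-flow-logs-to-edge-processor"
    ),
    MaxAggregationInterval=60,  # publish flow records every 60 seconds
)
```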
At this point, you should begin to see VPC flow logs populating the destination specified by your Edge Processor. If routing to Splunk Cloud Platform, you can identify these logs by searching for the default source and sourcetype values defined previously. Again, in the event something has gone wrong, checking the Firehose stream’s destination error logs is a great starting point for debugging.
With VPC flow logs now successfully ingested into Edge Processor, the next step is to transform these logs to align with the CIM Network Traffic data model. By leveraging specific SPL2 commands, we can build and apply a pipeline that maps the flow log fields to their CIM equivalents. This will ensure the data is normalized, enabling consistent and effective analysis across Splunk’s search and reporting capabilities. To accomplish this, we must first create an SPL2 pipeline:
Now that a new pipeline has been created, we can use various SPL2 commands to extract information from the flow log and map it to CIM-compliant field names. For AWS flow logs specifically, the default record format (referenced in Step 6 of the previous section) is of the form: ${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status}. According to Splunk’s field mapping documentation, achieving CIM compliance means mapping account-id, interface-id, srcaddr, dstaddr, srcport, dstport, and protocol to the CIM fields vendor_account, dvc, src_ip, dest_ip, src_port, dest_port, and transport respectively, carrying packets and bytes over as-is, and deriving a duration field from the start and end timestamps.
The next step involves implementing these changes in code. Notably, the rex command can be used to parse the raw flow log, extracting only the fields that are essential for compliance. Fields like version, action, and log-status, which are not required, are intentionally excluded from this extraction so that only necessary information is retained. Additionally, the pipeline should calculate the duration of the network session from the provided start and end timestamps in order to align with the data model specified by the CIM. Finally, the fields command removes the start and end fields from the log, as they are no longer needed once duration has been calculated. Here’s an example of what the resulting SPL2 may look like:
$pipeline = | from $source
| rex field=_raw /{"message":"\S+ (?P<vendor_account>\S+) (?P<dvc>\S+) (?P<src_ip>\S+) (?P<dest_ip>\S+) (?P<src_port>\S+) (?P<dest_port>\S+) (?P<transport>\S+) (?P<packets>\S+) (?P<bytes>\S+) (?P<start>\S+) (?P<end>\S+) \S+ \S+"}/
| eval duration = end - start
| fields - start, end
| into $destination;
Now that all the data transformation logic is in place, the only remaining step is to save the pipeline and apply it to your running Edge Processor:
Logs routed to your specified destination should now contain the CIM-compliant fields added by the pipeline above.
With the introduction of Amazon Data Firehose support in Edge Processor, managing and analyzing your AWS data streams has never been easier. This update not only expands your data source options but also enhances the reliability of data transmission with receiver acknowledgement for upstream HEC data. Whether you’re monitoring cloud infrastructure, analyzing security events, or ensuring CIM compliance, these new capabilities provide you with the tools needed to optimize your Splunk environment. We encourage you to explore these features and see how they can enhance your data processing workflows.
To get started with one (or both!) of our Data Management pipeline builders, fill out the following form. For more Edge Processor resources, check out the Data Management Resource Hub. If you’d like to request a feature or provide any other feedback, we strongly encourage you to create a Splunk Idea and/or send an email to edgeprocessor@splunk.com. You can also join the lively discussion in the #edge-processor channel of the splunk-usergroups workspace in Slack. It’s an excellent forum to learn from the community about the latest Edge Processor use cases.
Happy Splunking!