What's the problem?
Many of Splunk’s current customers manage one or more sources producing substantial volumes of ingested logs; however, among this generated content, it’s not uncommon that only a few pieces of information—and therefore a relatively small portion of the overall data—hold the majority of insight relevant to their operational needs. As a result, the goal of this article is to propose, explain, and walk through a solution allowing for the extraction of this targeted information while optimizing resource utilization and cost-efficiency for the customer.
What can be done to help remediate this?
Rather than sending all of their unfiltered logs directly to Splunk—ultimately incurring fees related to unnecessary storage and processing power—customers can instead make use of Edge Processor. More specifically, pipelines can be set up to extract and route the information of interest directly to Splunk while the rest of the original log is directed to S3 for long-term storage. Because S3 is designed for archival storage rather than for running analytics, persisting unused log data there is substantially more cost-effective. See the architectural diagram below for more information.
Step-by-Step Walkthrough
The following guide operates under the assumption that you haven’t yet connected your Edge Processor tenant to your Splunk Cloud Platform (SCP) deployment and do not have a live instance on any machine. If you have already connected your tenant to your SCP deployment, feel free to skip step 1 below. Similarly, if you have an Edge Processor instance installed and running on one of your machines, you can skip step 2 as well.
Setting Up Splunk Destination(s)
Before you can start using Edge Processor to work with your logs, you must first connect the tenant to your SCP deployment. This connection allows communication between the Edge Processor service and SCP, thereby providing indexes for storing the logs and metrics passing through the processors. In order to do this, follow the steps outlined in our first-time setup instructions.
Getting an Edge Processor Instance Up-and-Running
Creating the Instance
Now that you have your Splunk destinations correctly set up and configured, create a new Edge Processor instance by selecting Edge Processors > New Edge Processor in your cloud tenant’s web UI.
Enter both a name and a description for the Edge Processor. If you also want to specify a default destination for unprocessed logs, select To a default destination and choose a destination from the resulting drop-down list.
In order to turn on receivers allowing your Edge Processor to ingest logs from specific inputs, select inputs as necessary from the Receive data from these inputs section.
If you want to use TLS to secure communications between your instance and its corresponding log sources, then do the following:
In the Use TLS with these inputs section, select the log inputs for which you want to use TLS encryption.
Upload PEM files containing the appropriate certificates in the Server private key, Server certificate, and CA certificates fields.
Installing the Instance on a Machine
In your cloud tenant, locate and copy the installation commands. These can be found in Edge Processors > [your processor’s row] > Actions Icon (⋮) > Manage instances > Install/uninstall.
On the machine that will host the instance, open the command-line interface, navigate to the desired target directory, and run the commands you copied previously. This should create a splunk-edge/ directory in your chosen installation location.
To verify the instance was installed successfully, return to your tenant and select Manage instances > Instances. Confirm that a new instance has been created and has a "Healthy" status (may take up to a minute).
Setting Up an Amazon S3 Destination
Within your tenant’s web UI, select Destinations > New Destination > Amazon S3 and provide the information necessary to add the S3 destination dataset. This includes a name and description, the details used to build the object key name that identifies your logs in the S3 bucket, and the AWS region and authentication method that allow the destination to connect to your bucket. Each of these fields is described below.
Name: A unique name for your destination.
Description: (Optional) A description of your destination.
Bucket Name: The name of the bucket you want to send your logs to. Edge Processors use this name as a prefix in the object key name.
Folder Name: (Optional) The name of a folder where you want to store your logs in the bucket. In the object key name, Edge Processors include this folder name after the bucket name and before a set of auto-generated timestamp partitions.
File Prefix: (Optional) The file name that you want to use to identify your logs. In the object key name, Edge Processors include this file prefix after the auto-generated timestamp partitions and before an auto-generated UUID value.
Output Data Format: JSON (Splunk HEC schema). This setting causes your logs to be stored as .json files in the Amazon S3 bucket. The contents of these files are formatted into the event schema that's supported by Splunk’s HEC. See Event metadata in the Splunk Cloud Platform Getting Data In manual.
Region: The AWS region that your bucket is associated with.
Authentication: The method for authenticating the connection between your Edge Processor and your Amazon S3 bucket. If all of your Edge Processor instances are installed on Amazon EC2, then select Authenticate using IAM role for Amazon EC2. Otherwise, select Authenticate using access key ID and secret access key.
AWS Access Key ID: The access key ID for your IAM user. This field is available only when Authentication is set to Authenticate using access key ID and secret access key.
AWS Secret Access Key: The secret access key for your IAM user. This field is available only when Authentication is set to Authenticate using access key ID and secret access key.
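To illustrate how these settings combine, the object key for each stored log file takes roughly the following form. Treat this only as a schematic: the timestamp partitions, exact delimiters, and UUID are generated automatically by Edge Processor.

<Bucket Name>/<Folder Name>/<auto-generated timestamp partitions>/<File Prefix><auto-generated UUID>.json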
Constructing Relevant Pipelines
When working with multiple destinations in Edge Processor, a separate pipeline is needed to route logs to each target (i.e. Splunk and Amazon S3 in this case). Thus, using the Pipelines > New pipeline button in the web UI, create and attach two new pipelines to your existing instance. Depending on the fields present in your ingested logs, you’ll want to define each pipeline’s partition by sourcetype, source, or host. Which of these you choose doesn’t particularly matter; however, it’s crucial that both pipelines partition by the exact same field and value. Furthermore, each pipeline should specify a separate destination, namely those set up in steps 1 and 3 above.
Splunk Destination: The way in which you filter the logs sent to Splunk will, of course, vary depending on the format and content of your data; however, SPL2 offers a few avenues through which you may extract relevant values from the ingested logs. For large JSON structures, json_extract and json_extract_exact can be used to distill the relevant information. For instance, consider the following CloudWatch log, applied pipeline, and associated output.
EVENT DATA ↓
APPLIED PIPELINE: Extracts information related to the event ID, request ID, user account ID, and the various group IDs associated with the request parameters. All other data (i.e. _raw) is dropped. A sketch of such a pipeline is shown below.
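The pipeline itself appears as a screenshot in the original article; the following is a minimal SPL2 sketch of the same idea. The sourcetype used for partitioning and the JSON paths are assumptions (the sample event is not reproduced here), so adjust them to match your actual log structure.

$pipeline = | from $source
    /* Keep only the events this pipeline is responsible for.
       "aws:cloudwatch" is a placeholder; use your own partition field and value. */
    | where sourcetype == "aws:cloudwatch"
    /* Pull the targeted values out of the raw JSON payload.
       These paths are illustrative and must match your event structure. */
    | eval event_id = json_extract(_raw, "id"),
           request_id = json_extract(_raw, "detail.requestId"),
           user_account_id = json_extract(_raw, "account"),
           req_group_ids = json_extract(_raw, "detail.requestParameters.groupSet.items{}.groupId")
    /* Drop the original payload so only the extracted fields are sent to Splunk. */
    | fields - _raw
    | into $destination;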
PIPELINE OUTPUT ↓
event_id: e394a756-ab36-4f7a-a9d9-c2fff8184457
request_id: 3c6deda5-e7bf-45c3-8279-3a78f1c42bea
user_account_id: 987654321955
req_group_ids: ["sg-051ccc60","sg-d81fa120","sg-e48b1fcc"]
Furthermore, for non-JSON logs, regular expressions can also be used to extract information via the rex command. For instance, consider the following snippet taken from a Windows Security event log.
EVENT DATA ↓
APPLIED PIPELINE: Extracts information related to the log’s timestamp, event code, user account name, and corresponding message. All other data (i.e. _raw) is dropped. A sketch of such a pipeline is shown below.
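As before, the article presents this pipeline as a screenshot; below is a minimal SPL2 sketch written under the assumption that the events use the standard multiline Windows Security event text format. The sourcetype and regular expressions are illustrative and should be adapted to your data.

$pipeline = | from $source
    /* Keep only the Windows Security events (placeholder sourcetype). */
    | where sourcetype == "WinEventLog:Security"
    /* Extract the targeted values from the raw text with regular expressions.
       The patterns assume the standard multiline Windows event layout. */
    | rex field=_raw /(?P<time>\d{2}\/\d{2}\/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)/
    | rex field=_raw /EventCode=(?P<event_code>\d+)/
    | rex field=_raw /Message=(?P<message>[^\r\n]+)/
    | rex field=_raw /Account Name:\s+(?P<account_name>\S+)/
    /* Drop the original payload so only the extracted fields are sent to Splunk. */
    | fields - _raw
    | into $destination;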
PIPELINE OUTPUT ↓
time: 12/06/2021 10:01:28 AM
event_code: 4624
message: An account was successfully logged on
account_name: WIN-9A3SFCUS26U$
Once you’ve successfully written a pipeline that extracts the targeted information, be sure to double-check that the specified destination is set to the desired Splunk index. This can be seen on the right-hand side of the pipeline builder UI.
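As a side note, if you prefer to control the target index from within the pipeline itself rather than relying solely on the destination’s default, SPL2 also lets you set the index field on each event. The sketch below is only an assumption-laden example: the sourcetype is a placeholder, and the index name must be replaced with one that actually exists in your SCP deployment.

$pipeline = | from $source
    /* Placeholder partition; use your own field and value. */
    | where sourcetype == "aws:cloudwatch"
    /* Route the events to a specific index; replace the placeholder name as needed. */
    | eval index = "edge_filtered"
    | into $destination;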
Amazon S3 Destination: Assuming you want to route all of the ingested logs directly to S3 for comprehensive storage, the SPL2 contained within the pipeline builder UI need not contain any complex queries. Simply routing all information from source to destination should suffice.
APPLIED PIPELINE: Sends all event data directly to the destination (i.e. no processing necessary). A sketch of such a pass-through pipeline is shown below.
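As a rough sketch (using the same placeholder sourcetype as above; in practice, use the exact partition field and value shared with your Splunk-bound pipeline), the pass-through pipeline can be as simple as:

$pipeline = | from $source
    /* Same partition field and value as the Splunk-bound pipeline (placeholder shown). */
    | where sourcetype == "aws:cloudwatch"
    /* No transformation: forward the complete events to the Amazon S3 destination. */
    | into $destination;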
Again, it’s important to note that the destination here should be set to the Amazon S3 destination you created in step 3 above.
So, what's the takeaway here?
To conclude, we have demonstrated how Edge Processor can be used to efficiently reduce and route customer logs to multiple destinations, optimizing both resource utilization and cost efficiency in the process. Specifically, customers can filter and extract only the relevant pieces of information from their ingested logs via SPL2 pipelines and send just that data to Splunk for indexing and analysis. By setting up another destination pointing to Amazon’s S3 cloud storage, a separate pipeline can be applied in parallel to store the complete logs there for long-term retention as well.