Data Preparation Made Easy: SPL2 for Edge Processor

courtlynwri
Splunk Employee

By now, you may have heard the exciting news that Edge Processor, the easy-to-use Splunk data preparation tool for filtering, transformations, and routing at the edge, is now Generally Available. Edge Processor gives data administrators for Splunk environments the ability to drop unnecessary data, mask sensitive fields, enrich payloads, and conditionally route data to the appropriate destination. Managed via Splunk Cloud Platform but deployed at the customer data edge, Edge Processor helps you control data costs and prepare your data for effective downstream use.

Alongside the announcement of the GA of Edge Processor, we are also excited to announce the General Availability of the SPL2 Profile for Edge Processor! The SPL2 Profile for Edge Processor contains the specific subset of powerful SPL2 commands and functions that can be used to control and transform data behavior within Edge Processor, and represents a portion of the entire SPL2 language surface area.

In Edge Processor, there are two ways you can define your processing pipelines. The first, which is fantastic for quick and easy pipeline authoring, lets data administrators take advantage of the point-and-click features of the Edge Processor pipeline editor. From the same pipeline editor experience, users can also opt to work directly in the SPL2 code editor window for extremely flexible pipeline authoring, letting data administrators author pipelines in Splunk’s SPL2 language in a manner familiar to SPL experts. This is extremely exciting, as it allows SPL syntactical patterns to be used for transformations on data in motion! Let’s learn a bit more.

What is SPL2? 

SPL2 is Splunk’s next-generation data search and preparation language, designed to serve as the single entry point for a wide range of data handling scenarios; in the future it will be available across multiple products. Users can leverage SPL2 to author pipelines that process data in motion and to create and validate data schemas, all while leveraging in-line tooling and documentation. SPL2 seeks to enable a “learn once, use anywhere” language model across all Splunk features, in a manner extremely familiar to SPL users today.

SPL2 takes the great parts of SPL - the syntax, the most used commands, the investigation-friendliness, and the flow-like structure - and makes them available not only against data at rest (e.g., via splunkd), but also in streaming runtimes. This allows data administrators, developers, and others who are familiar with SPL, but unfamiliar with configuring complex rules in props and transforms, to translate their existing SPL knowledge and apply it directly to data in motion via Edge Processor.

A template for an SPL2 pipeline that masks IP addresses from the hostname field of syslog data.
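
As a rough sketch of what a masking pipeline like that template can look like in SPL2 (the hostname field name and the masking pattern here are illustrative, and the shipped template may differ), a single eval with the replace function swaps IP addresses for a placeholder:

$pipeline = from $source
| eval hostname = replace(hostname, "[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}", "x.x.x.x")
| into $destination;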

SPL2 is already used implicitly by multiple Splunk products today under the hood, to handle data preparation, processing, search, and more. Over time, we intend to make SPL2 available across the entire Splunk portfolio to support a truly unified platform. 

Customers familiar with SPL will be very pleased to hear that SPL2 has introduced a range of new functionality to more seamlessly support needs for data preparation in-motion, including:

  • Data does not have to be cast from one type to another. SPL2 is a weakly typed language with the option for users to create type constraints (including custom types) where necessary; by default, SPL2 implicitly converts between unrelated types, meaning that casting is no longer required. This allows data administrators to spend less time worrying about field format and schema for incoming data, and more time concentrating on getting the right data to the right place.
  • Source and destination functions, which were highly bespoke, are replaced with datasets. These datasets can be created, permissioned, and managed independently, and map cleanly to locations where you want to read from and write to. This allows data administrators to more granularly control how data is accessed and written, while also promoting easy reusability across pipelines.
    • Metadata about the destination is captured in the dataset configuration rather than the pipeline definition, so you do not have to pass this metadata in the pipeline itself; this results in clean pipeline definitions that can be easily understood and copied.
  • JSON handling can be done seamlessly with a range of JSON manipulation eval functions, rather than ucast or other complex logic. A short sketch illustrating this follows this list.
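
To illustrate the first and last points, here is a minimal sketch (the field names are hypothetical, and it assumes the json_extract eval function is available in your profile) that pulls a value out of a JSON payload and compares it numerically without any explicit cast:

$pipeline = from $source
| eval status = json_extract(_raw, "status")
| eval is_server_error = if(status >= 500, "yes", "no")
| into $destination;

Because SPL2 converts between types implicitly, the value extracted from the JSON payload can be compared against the number 500 directly.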

What is the SPL2 Profile for Edge Processor? 

SPL2 supports a wide range of operations on data. The SPL2 profile for Edge Processor represents a subset of the SPL2 language that can be used via the Edge Processor offering. For example: at launch, Edge Processor is primarily built to help customers manage data egress, mask sensitive data, enrich fields, and prepare data for use in the right destination. SPL2 commands and eval functions that support these behaviors are supported in the profile for Edge Processor to ensure a seamless user experience. Learn more about SPL2 profiles and view a command compatibility matrix by product for SPL2 commands and eval functions.

How does Edge Processor use SPL2?

Edge Processor pipelines are logical constructs that read in data from a source, conduct a set of operations on that data, and then write that data to a destination. All pipelines are defined entirely in SPL2 (either written directly in the Edge Processor code editor, or created indirectly via the GUI for pipeline authoring). SPL2 pipelines define an entire set of transformations, often related to similar types of data.

All pipelines must follow this syntax:

$pipeline = from $source | <processing logic> | into $destination;

Take the below Edge Processor pipeline, defined in SPL2:

$pipeline = from $source | rex field=_raw /user_id=(?P<user_id>[a-zA-Z0-9]+)/ | into $destination;

The SPL2 pipeline above can be decomposed into multiple components:

  • $pipeline - this represents the definition of the pipeline statement that will be applied on any given Edge Processor node or cluster. As denoted by the dollar sign ($), it is a parameter, meaning that everything on the right hand side of the assignment (=) is assigned to the left.
  • Note: for a very long and complex pipeline, you can decompose it into segments, as in this pseudo-code SPL2:

$pipeline_part_1 = from $source | where … | rex field=_raw /fieldA… fieldB… fieldC…

$pipeline = from $pipeline_part_1 | eval … | into $destination;

  • from $source - indicates that this pipeline should read from a specific dataset, referenced by the dataset variable $source. This variable can be assigned a specific dataset representing your data to be processed via the Edge Processor data configuration panel - in this case, $source is a preconfigured sourcetype you can set up in the Edge Processor management pages.
  • rex field… - a regular expression to extract the user_id field from the _raw field. It is important to note that Edge Processor only supports the RE2 regular expression flavor, not PCRE (see the note just after this list).
  • into $destination - indicates that this pipeline should write into a destination, indicated by this dataset variable called $destination. This variable can be assigned with a specific dataset, such as a Splunk index or S3 bucket, via the Edge Processor data configuration panel.
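
To make the RE2 constraint concrete: RE2 does not support PCRE-only features such as lookaround assertions or backreferences, so patterns that rely on them need to be rewritten with plain capture groups. A hypothetical lookbehind-based extraction like

rex field=_raw /(?<=user_id=)[a-zA-Z0-9]+/

would be rejected by Edge Processor, while the RE2-friendly named capture group used above

rex field=_raw /user_id=(?P<user_id>[a-zA-Z0-9]+)/

works as expected.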

As you can probably tell, there are some differences between the SPL2 here and the SPL you know. The first is that SPL2 allows for not just single expressions, but expression assignments; entire searches can be named, treated as variables and linked together to compose a single dispatchable unit. SPL2 also supports writing into datasets, not just reading from datasets (and with a slightly different syntax). Datasets can be different things - indexes, S3 buckets, forwarders, views, and more. You’ll likely be writing to a Splunk index most of the time. You can find more details about the differences between SPL2 and SPL here.

But what if your pipeline isn’t constrained to a single sourcetype? For these scenarios, you can instead read from a specific dataset called all_data_ready (the consolidation of all Edge Processor ingress data) and apply any sourcetype logic that you’d like:

$pipeline = from $all_data_ready | where sourcetype="WMI:WinEventLog:*" | rex field=_raw /user_id=(?P<user_id>[a-zA-Z0-9]+)/ | into $destination;

  • where sourcetype="WMI:WinEventLog:*" - this is a filter that takes the data that is piped in and only keeps events matching this specific sourcetype. The rest of the pipeline will only operate on this sourcetype.

How does SPL2 make data preparation simpler?

You may have begun to see that SPL2 is not just a set of commands and functions, but also a set of core concepts underneath that enable powerful data processing scenarios. In fact, Edge Processor ships with out-of-the-box SPL2 pipeline templates that address common data preparation use cases.

Beyond these templates, let’s walk through a few examples that highlight how SPL2 makes data preparation simpler.

I want to logically separate components of complex, multi-stage pipelines.

SPL2 allows pipelines to be defined in multiple stages, for ease of organization, debugging, and logical separation. Using these statement assignments as variables later in the SPL2 module allows data admins to modularly compose their data preparation rules.

$capture_and_filter = from $all_data_ready | where sourcetype="WinEventLog:*";

$extract_fields = from $capture_and_filter | rex field=_raw /^(?P<dhcp_id>.*?),(?P<date>.*?),(?P<time>.*?),(?P<description>.*?),(?P<ip>.*?),(?P<nt_host>.*?),(?P<mac>.*?),(?P<msdhcp_user>.*?),(?P<transaction_id>.*?),(?P<qresult>.*?),(?P<probation_time>.*?),(?P<correlation_id>.*?),(?P<dhc_id>.*?),(?P<vendorclass_hex>.*?),(?P<vendorclass_ascii>.*?),(?P<userclass_hex>.*?),(?P<userclass_ascii>.*?),(?P<relay_agent_information>.*?),(?P<dns_reg_error>.*?)/;

$indexed_fields = from $extract_fields | eval dest_ip = ip, raw_mac = mac, signature_id = dhcp_id, user = msdhcp_user;

$quarantine_logic = from $indexed_fields | eval quarantine_info = case(qresult == 0, "NoQuarantine", qresult == 1, "Quarantine", qresult == 2, "Drop Packet", qresult == 3, "Probation", qresult == 6, "No Quarantine Information");

$pipeline = from $quarantine_logic | into $destination;

As you can see above, we’ve defined four processing “stages” of this pipeline: $capture_and_filter, $extract_fields, $indexed_fields, and $quarantine_logic, with each flowing into the next, and of course with $pipeline tying it all together into the destination. When the $pipeline is run, all stages are concatenated behind the scenes, allowing the pipeline to work as expected while maintaining a degree of logical segmentation and readability. 

I have a complex nested JSON event that I want to easily turn into a multivalue field and then extract into multiple events.

If you’ve ever worked with JSON in Splunk, you know that it can be…tricky. It’s a never-ending combination of mvindexes, mvzips, evals, mvexpands, splits, and perhaps even SEDCMD in props.conf.

With SPL2, it’s easier than ever, with the expand() and flatten() commands! Often used together, they first expand a field that contains an array of values to produce a separate result row for each object in the array, then flatten the key-value pairs in each object into separate fields in an event, repeating as many times as necessary.

Let’s take this JSON, passed as a single event, as an example, and assume it is represented by a dataset named $json_data. We want to add the previously missing timestamp at index time and extract each nested stanza into its own event:

{
        "key": "Email",
        "value": "john.doe@bar.com"
      },
      {
        "key": "ProjectCode",
        "value": "ABCD"
      },
      {
        "key": "Owner",
        "value": "John Doe"
      },
      {
        "key": "Email",
        "value": "jane.doe@foo.com"
      },
      {
        "key": "ProjectCode",
        "value": "EFGH"
      },
      {
        "key": "Owner",
        "value": "Jane Doe"
      }
}

By itself and without preparation, we’re being passed a single event with the fields stuck in the JSON body.

But, we can write the following SPL2 to easily flatten this JSON and timestamp it:

$pipeline = from $json_data as json_dataset | eval _time = now()
 | expand json_dataset | flatten json_dataset | into $destination;

This should result in the extraction of this JSON event into multiple events with fields, like so:
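
Roughly speaking, each object in the array becomes its own event carrying key and value fields, plus the _time we just added - along these lines (illustrative output; exact formatting depends on your destination):

_time=<now>  key="Email"        value="john.doe@bar.com"
_time=<now>  key="ProjectCode"  value="ABCD"
_time=<now>  key="Owner"        value="John Doe"
_time=<now>  key="Email"        value="jane.doe@foo.com"
_time=<now>  key="ProjectCode"  value="EFGH"
_time=<now>  key="Owner"        value="Jane Doe"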

I always want SEV HIGH and SEV MEDIUM events from all AWS applications to be routed to my “alerts” Splunk index, and all SEV LOW events to be routed to my “low_sev_s3” AWS S3 bucket. All events without an attached severity level should default to an “audit_s3” AWS S3 bucket.

To achieve this, you can use the same logic from above - stringing multiple statements together into a single pipeline - combined with the branch command to route from the same dataset to different destinations.

$pipeline = from $source … | <rex> | eval…
| branch
[where Severity="HIGH" or Severity="MEDIUM" | into $alerts],
[where Severity="LOW" | into $low_sev_s3],
[where Severity != "HIGH" and Severity != "MEDIUM" and Severity != "LOW" | into $audit_s3];

Using branching in this manner, combined with the custom logic and multiple destinations, allows for this to be seamlessly represented in SPL2!

Getting started with SPL2 in Edge Processor

SPL2 within Edge Processor is extremely powerful, and this blog post only scratches the surface! If you’re interested in learning more about SPL2 or the SPL2 Profile for Edge Processor, join in! Reach out to your account team to get connected, or start a discussion in splunk-usergroups Slack.