How to aggregate multiple JSON events into a single JSON event before sending to Splunk Enterprise? How to configure field extractions for such events?

sharad06 · ‎08-21-2017

Hi Splunk experts,

I have written a script that reads a database of network endpoint data and sends all of the stored information to Splunk through the HTTP Event Collector. Each piece of information about an endpoint is sent as a separate JSON message in the format shown below:

{
    "ip" : "1.1.1.1",
    "timestamp" : "current_system_time",
    "domain" : "domain_name",
    "logged_in_user" : "user_name",
    "message_type" : "read_db",
    "key" : "value",
    "key_type" : "value_type"
}

This sends a lot of redundant data, since every KV pair except the last two ("key" and "key_type", highlighted in the original post) stays the same across all messages. I am thinking of aggregating these messages using JSON arrays and then 'unrolling' the array on Splunk Enterprise. Here is the basic format I have in mind:

{
    "ip" : "1.1.1.1",
    "timestamp" : "current_system_time",
    "domain" : "domain_name",
    "logged_in_user" : "user_name",
    "message_type" : "read_db",
    "properties" : [
        {
            "key1" : "value1",
            "key_type1" : "value_type1"
        },
        {
            "key2" : "value2",
            "key_type2" : "value_type2"
        },
       ...,
       ...
    ]
}
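On the sending side, the aggregation described above could look roughly like the following Python sketch. The script's actual language is not stated in the post, so Python, the function name aggregate, and the COMMON_FIELDS tuple are assumptions made for illustration. The sketch also assumes that all messages for one endpoint share the same timestamp value; anything outside the common fields is treated as the per-property part.

```python
# Fields assumed to be identical across all messages for one endpoint
# (taken from the sample message; adjust to the real schema).
COMMON_FIELDS = ("ip", "timestamp", "domain", "logged_in_user", "message_type")


def aggregate(messages):
    """Merge per-property messages that share the same common fields.

    Returns one event per distinct combination of common fields, with the
    remaining key/value pairs collected under a "properties" array.
    """
    grouped = {}
    for msg in messages:
        common = tuple(msg[field] for field in COMMON_FIELDS)
        # Everything that is not a common field belongs to this property.
        prop = {k: v for k, v in msg.items() if k not in COMMON_FIELDS}
        grouped.setdefault(common, []).append(prop)

    events = []
    for common, props in grouped.items():
        event = dict(zip(COMMON_FIELDS, common))
        event["properties"] = props
        events.append(event)
    return events
```

Grouping on the full set of common fields means two endpoints, or two different read times, never end up in the same event.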

I have the following questions about this approach:

1. Is there a better approach (format) for sending these events?
2. Will I need search-time or index-time extractions to 'unroll' the JSON array and separate out the individual events on Splunk Enterprise?
3. How do I 'unroll' this JSON array? Is it possible to run a script at search time or index time to achieve this, or is props.conf my only friend here?
4. I read a lot of Splunk Answers posts about handling JSON arrays in Splunk Enterprise. The most useful ones mentioned using spath and mvexpand to expand nested JSON arrays. Does this mean that by sending events in this format, I am forcing my users to write searches with spath and mvexpand every time they want to search these events?
5. If the answer to question 4 is yes, should I use index-time extraction to avoid introducing this limitation?
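To pin down what 'unrolling' is expected to produce, here is the transformation written as a small Python sketch. It only illustrates the desired result (one flat event per array element, with the common fields copied into each); it is not how spath or mvexpand are implemented, and the function name unroll and the default field name "properties" are assumptions taken from the sample format.

```python
def unroll(event, array_field="properties"):
    """Expand one aggregated event into one flat event per array element.

    Each returned event carries all of the common fields plus the
    key/value pairs of exactly one element of the array.
    """
    common = {k: v for k, v in event.items() if k != array_field}
    return [{**common, **prop} for prop in event.get(array_field, [])]
```

An event whose array is missing or empty yields no flat events in this sketch, so the sender would need to avoid emitting such events if the common fields alone still matter.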

DalJeanis · ‎08-22-2017

Let's start with your first assumption: that some of the data is redundant. Just because the data returned by two queries is identical does not mean that it is redundant.

SOURCE: Are these DB messages actually redundant with each other? Like, is it an arbitrary architectural decision that the data wasn't all in one bunch in the first place? Or is each message the result of a distinct question asked of the database, and those questions happen to be asked at the same time?

USAGE: Is the data from these events always going to be aggregated together whenever they are used? Or will the data be generally used one key item at a time?

The answers to these questions will help us understand your situation and guide you towards your optimum architecture.

A second point worth mentioning: index-time extractions and search-time extractions can both be automatic. All index-time extractions MUST be automatic, but the vast majority of automatic field extractions in Splunk are search-time.

So, your users do not NECESSARILY have to code search-time extractions themselves.

If the JSONs really do need to be used all together, then another potentially useful approach, fully internal to Splunk, could be to load them into a temporary index, reformat and aggregate them, and then collect the results to a summary index.

Depending on some architectural choices, this can be set up so that it either does or does not consume additional license.

sharad06 · ‎08-22-2017

Thanks for your reply.

For the initial POC I wanted to send JSON messages in the simplest format possible. So while my script reads all the endpoint info from the DB in one go, I loop over all the properties and send them out one by one in the JSON format mentioned in my question. The trouble is that the common keys sent with every JSON message consume extra bandwidth and license, so for the script to scale I need to figure out a way to send more information in each message. If it were only one KV pair per message, I could easily have put multiple KV pairs together as follows, without needing nested JSON objects:

{
    "ip" : "1.1.1.1",
    "timestamp" : "current_system_time",
    "domain" : "domain_name",
    "logged_in_user" : "user_name",
    "message_type" : "read_db",
    "key1" : "value1",
    "key2" : "value2",
    "key3" : "value3"
}

However, every key must also be sent with a key_type field for it to make sense on Splunk Enterprise. Also, as a Splunk beginner, I think these events might have to be separated out on Splunk Enterprise (one key per event) for all my searches to keep working as expected.

USAGE: Right now, all events show up on Splunk containing just one KV pair, and all searches assume this format for incoming events. After the sending side starts aggregating these messages, ideally I would still want every event to contain just one KV pair (so that all pre-configured searches and any user-defined searches continue to work without extra effort from the user).
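One option that keeps the one-KV-pair-per-event shape is to batch several events into a single HTTP Event Collector request, since HEC accepts multiple event objects stacked one after another in one request body. The Python sketch below only builds such a body; the function name build_hec_batch is made up for illustration, and the URL and token needed to actually POST it are deliberately left out. Note the trade-off: this cuts the number of HTTP requests, but every event still carries the common fields, so it should not be expected to reduce indexed volume.

```python
import json


def build_hec_batch(events):
    """Build one HEC request body containing many events.

    Each event is wrapped in the {"event": ...} envelope and the JSON
    objects are concatenated back to back (not wrapped in a JSON array).
    """
    return "".join(json.dumps({"event": event}) for event in events)


# The resulting string would be sent as the body of a single POST to the
# collector's event endpoint, with the usual "Authorization: Splunk <token>"
# header (endpoint URL and token are site-specific and not shown here).
```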
