Hi Splunk experts,
I have written a script that reads a DB storing network endpoint data and sends all the stored info to Splunk using the HTTP Event Collector. Each piece of information about the endpoint is sent as a separate JSON message in the format shown below:
{
    "ip" : "1.1.1.1",
    "timestamp" : "current_system_time",
    "domain" : "domain_name",
    "logged_in_user" : "user_name",
    "message_type" : "read_db",
    "key" : "value",
    "key_type" : "value_type"
}
Now, this leads to sending a lot of redundant data, since all KV pairs other than the highlighted ones ("key" and "key_type") remain the same in all messages. I am thinking of ways to aggregate these messages using JSON arrays and then 'unroll' the array on Splunk Enterprise. Here is the basic format I can think of:
{
    "ip" : "1.1.1.1",
    "timestamp" : "current_system_time",
    "domain" : "domain_name",
    "logged_in_user" : "user_name",
    "message_type" : "read_db",
    "properties" : [
        {
            "key1" : "value1",
            "key_type1" : "value_type1"
        },
        {
            "key2" : "value2",
            "key_type2" : "value_type2"
        },
       ...,
       ...
    ]
}
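For reference, the aggregation on the sending side could look roughly like the Python sketch below. It is only an illustration under assumptions: the `aggregate` function and the `COMMON_FIELDS` tuple are hypothetical names, the list of common fields is taken from the message format above, and every remaining KV pair of a message is assumed to belong in one entry of "properties".

```python
import json

# Assumption: these fields are identical across all messages of one DB read
# (taken from the message format shown above).
COMMON_FIELDS = ("ip", "timestamp", "domain", "logged_in_user", "message_type")


def aggregate(messages):
    """Merge per-property messages that share the same common fields.

    Returns one message per distinct set of common values, with the
    remaining KV pairs of each input message collected under "properties".
    """
    grouped = {}
    for msg in messages:
        # The common values identify which aggregated message this belongs to.
        common = tuple((field, msg[field]) for field in COMMON_FIELDS)
        # Everything that is not a common field is the per-property payload.
        prop = {k: v for k, v in msg.items() if k not in COMMON_FIELDS}
        grouped.setdefault(common, []).append(prop)
    return [dict(common, properties=props) for common, props in grouped.items()]


if __name__ == "__main__":
    base = {
        "ip": "1.1.1.1",
        "timestamp": "current_system_time",
        "domain": "domain_name",
        "logged_in_user": "user_name",
        "message_type": "read_db",
    }
    per_property = [
        dict(base, key="value1", key_type="value_type1"),
        dict(base, key="value2", key_type="value_type2"),
    ]
    print(json.dumps(aggregate(per_property), indent=4))
```

Grouping by the common values (rather than assuming a single endpoint) means the same function still works if one run of the script reads several endpoints.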
I have the following questions about this approach:

Let's start with your first assumption - that some of the data is redundant. Just because the data returned by two queries is identical does not mean that it is redundant.
SOURCE: Are these DB messages actually redundant with each other? Like, is it an arbitrary architectural decision that the data wasn't all in one bunch in the first place? Or is each message the result of a distinct question asked of the database, and those questions happen to be asked at the same time?
USAGE: Is the data from these events always going to be aggregated together whenever they are used? Or will the data be generally used one key item at a time?
The above information will help us to understand and guide you towards your optimum architecture.
A second item worth mentioning is this: index-time extractions and search-time extractions can both be automatic. All index-time extractions MUST be automatic, but the vast majority of automatic field extractions in Splunk are search-time.
So, your users do not NECESSARILY have to code search-time extractions themselves.
If the JSONs really do need to be used all together, then another potentially useful approach, fully internal to Splunk, could be to load them into a temporary index, reformat and aggregate them, and then collect the results to a summary index.
Depending on some architectural choices, this could be set up either so that it uses additional license or so that it doesn't.
Thanks for your reply.
For the initial POC I wanted to send JSON messages in the simplest format possible. So while my script reads all the endpoint info from the DB in one go, it loops over all properties and sends them out one by one in the JSON format mentioned in my question. The trouble is that the common keys sent with every JSON message consume extra bandwidth/license, so for the script to scale I need to figure out a way to send more info in each message. Now, if it were only one KV pair per message, I could easily have put multiple KV pairs together as follows, without the need for nested JSON objects:
{
     "ip" : "1.1.1.1",
     "timestamp" : "current_system_time",
     "domain" : "domain_name",
     "logged_in_user" : "user_name",
     "message_type" : "read_db",
     "key1" : "value1",
     "key2" : "value2",
     "key3" : "value3"
 }
However, every key must also be sent with a key_type field for it to make sense on Splunk Enterprise. Also, being a Splunk beginner, I think these events might have to be separated out on Splunk Enterprise (one key per event) for all my searches to keep working as expected.
USAGE: Right now, every event shows up in Splunk containing just one KV pair, and all searches assume this format for incoming events. After the sending side starts aggregating these messages, I would ideally still want every event to contain just one KV pair (so that all pre-configured searches and any user-defined searches continue to work without extra work for the user).
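To make the target concrete, here is a minimal Python sketch of the 'unroll' step: one aggregated message goes in, and one event per entry of "properties" comes out, each carrying the common fields again. The `unroll` function is a hypothetical name and the code only illustrates the desired result. On Splunk itself this kind of expansion would be done in SPL (for example with the `spath` and `mvexpand` commands) or before indexing, not in Python.

```python
def unroll(event, array_field="properties"):
    """Expand one aggregated event into one flat event per array entry.

    Each resulting event holds all common fields plus the KV pairs of a
    single entry from the array, i.e. the original one-key-per-event shape.
    """
    # Everything except the array itself is shared by all resulting events.
    common = {k: v for k, v in event.items() if k != array_field}
    return [{**common, **prop} for prop in event.get(array_field, [])]


if __name__ == "__main__":
    aggregated = {
        "ip": "1.1.1.1",
        "timestamp": "current_system_time",
        "domain": "domain_name",
        "logged_in_user": "user_name",
        "message_type": "read_db",
        "properties": [
            {"key": "value1", "key_type": "value_type1"},
            {"key": "value2", "key_type": "value_type2"},
        ],
    }
    for flat_event in unroll(aggregated):
        print(flat_event)
```

If the unrolled events come out in exactly the current format, the existing searches would not need to change. Whether that expansion happens at search time or before indexing is the architectural choice discussed above.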
