Getting Data In

How to resolve Host metadata override?

icewolf69
Loves-to-Learn Everything

Hi all, 

I'm trying to do something that seems conceptually simple. I'm ingesting a .txt report into Splunk, and I want to set the host metadata to the system the report was generated on, not the host that Splunk collects the log from.

The problem is that every path I take creates a different issue that I can't (or really don't want to) deal with. I've looked through all the docs, and I'm either missing something, misconfiguring something, or it's not possible.

From what I understand, there are only a few ways to perform a host override (the inputs.conf side is sketched just after this list):

1) Specify host_regex (or host_segment) in the inputs.conf stanza to extract the host from the source path, which could be either a folder or the filename, but is the "source" path nonetheless.

2) Specify a regex in props.conf and transforms.conf that overwrites the "host" metadata based on the hostname inside the log.

3) Force a specific hostname string through the configuration files, but then this would be a static hostname for the source or sourcetype.
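
For reference, the inputs.conf side of options 1 and 3 looks roughly like this (the path and values are made up for illustration):

# inputs.conf (host_regex, host_segment, and host are mutually exclusive)
[monitor://C:\logs\*.txt]
sourcetype = SCC_Report
# option 1: extract the host from the file path, e.g. c:\logs\client1.txt -> client1
host_regex = \\(\w+)\.txt$
# option 1 variant: take the Nth segment of the path instead
# host_segment = 2
# option 3: force a static host for everything from this input
# host = some-static-host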

 

I've gotten all of these solutions to work individually, but each one creates a separate issue which prevents me from using it:

1) Works well, but I end up with hundreds of "source" file paths inside Splunk, which clutters the datasets and confuses end users. I can get around this by declaring "source = <source>" in inputs.conf, but that changes the metadata that Splunk uses to regex-extract the hostname. So instead of getting the hostname from "source::c:\\logs\\client1.txt", it tries to extract the host from "source::<source_name>", where of course it will never find it.
So it seems like for #1, I have to choose between a ton of file paths inside Splunk and a working host regex extraction.

2) This also seems to work, but brings another issue. The reports I'm ingesting are pretty large, so I have set up a custom LINE_BREAKER value. I can successfully extract the hostname from inside the report using props and transforms, but for reasons I can't figure out, the hostname doesn't carry over to the rest of the events as the report is line-broken. The first event from the txt file gets the correct host metadata (the hostname is in the first line of the file), but for every line-broken event after that, the regex fails and the host defaults to the system the log resides on. This baffles me, because the time settings behave differently: if Splunk can't extract a timestamp in a subsequent event, it copies the value from the previous one, so the correct time metadata gets applied to all events. It doesn't do that for the host. Why wouldn't it apply the host metadata to all events as the file gets line-broken, since Splunk should know, as it's ingesting, that they all come from the same file?

3) Works, but is not really an option because the reports come from different hosts, and this would just create erroneous data.

 

Transforms.conf:

[SET_HOST_NAME]
DEST_KEY = MetaData:Host
REGEX = \,HostName\:(.\S[^,]+)
FORMAT = host::$1
DEFAULT_VALUE = bonkers

 

Props.conf:

[SCC_Report]
TRANSFORMS-H1 = SET_HOST_NAME
TIME_PREFIX = SessionID:
TIME_FORMAT = %Y-%m-%d_%H%M%S
LINE_BREAKER = (\s\s)Title.*\:\sV\-
SEDCMD-remove_fluff = s/([\s]+?Weight[\s\Sa-zA-Z0-9~@#$^*()_+=[\]{}|\\,.?:]*?---------------)/\n\n<REDACTED DURING INGESTION>\n/g
SHOULD_LINEMERGE = false
category = Custom
disabled = false

 

Fields.conf:

[H1]
INDEXED=true


Any help is appreciated. I can't tell if I'm trying to get Splunk to do something it can't do, or if I'm just going about it the wrong way. Preferred end-state is:
1) ingest *.txt Report

2) Set both "source" and "sourcetype" to something static (prevent a collection of filenames inside sources and sourcetypes)

3) Set the host metadata for all events created from that single txt report to be the host that is in the first line of the report.


PickleRick
SplunkTrust

Answering your second question - a timestamp is expected to be unique and to differ across your events; "carrying over" the timestamp from the previous event is just a fallback for when the timestamp cannot be parsed from the raw message.

But host or source is expected to be fairly static and is defined at the whole-input level. Your transform only overwrites it on the individual events where its regex actually matches.

In general, apart from the timestamp falling back to the previous event's value, there is no "state" in event ingestion. Each event is processed on its own, with no knowledge of any "previous" ones. That makes sense, since other events from your data stream could have been routed - for example - to other indexers/HFs.

The "get last event's time" functionality is also not meant to be used as a reliable timestamping technique, it's just a (kind of) sane failsafe for a situation for when there is something wrong with the event's timestamp (no timestamp or a timestamp too old or too far into the future)


woodcock
Esteemed Legend

You are very correct about your situation. There are TWO little-known/little-used Splunk configurations that I have used in such situations. You are saying that the host value can be found somewhere inside the file, hopefully on the first line. You are going to combine this:
https://docs.splunk.com/Documentation/SplunkCloud/latest/Data/Assignmetadatatoeventsdynamically
with the "unarchive_cmd" here:
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf

So here is some unarchive code that we used to create a Semaphore event that summarizes the data (so that we can test the data found by search against what the semaphore event says should be there and know FOR SURE whether our search has all the data from the file, or some of it is missing for some reason):
[source::....import]
unarchive_cmd = gawk 'BEGIN { min="999999999999"; max="0"; count="0" } /./ { match ($0, /"time":([0-9\.]+)/, time); min = (min > time[1] && time[1] > 0 ? time[1] : min ); max = (max < time[1] && time[1] > 0 ? time[1] : max); count++; print } END { "date +%s.000000" | getline date; close("date"); print "{\"time\":"date",\"earliest\":"min",\"latest\":"max",\"NumberOfRecords\":"count",\"SplunkIndexingStatusSemaphore\":\"Splunk Indexing Complete\"}" }'
sourcetype = preprocess-yourSourcetypeHere

So what this does: when Splunk sees a file named "*.import", it passes the file to this "gawk" script, which calculates "min(_time), max(_time), count" while echoing out each line of data for the UF to process. Then, at the very end, it emits a final JSON summary event. So we get each original line/event as-is, AND one extra, super-useful event.

Your use case is a bit different. You will need to buffer the events/rows/data until you get to the point where you can discern the host. Then you emit a line like this to "stdout":
***SPLUNK*** host=YourHostValueHere

Then you flush your buffered queue and continue processing the file's rows/events, echoing out lines as-is.
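
As a minimal, untested sketch (assuming the hostname appears as ",HostName:<value>" near the top of the file, per your transforms regex, and that the monitored files end in .txt):

[source::....txt]
sourcetype = SCC_Report
# buffer lines until HostName is seen, emit the dynamic-metadata header,
# then flush the buffer and pass the rest of the file through unchanged
unarchive_cmd = gawk '{ if (!found) { buf[++n] = $0; if (match($0, /,HostName:([^,]+)/, m)) { print "***SPLUNK*** host=" m[1]; for (i = 1; i <= n; i++) print buf[i]; found = 1 } } else print } END { if (!found) for (i = 1; i <= n; i++) print buf[i] }'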


woodcock
Esteemed Legend

BTW, if you have control over the creation of the file at the original source, you can just ensure that this line is inserted by that host (since it clearly knows its own hostname):
***SPLUNK*** host=YourHostValueHere
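
For example, something as simple as this on the generating host (the report filename here is hypothetical):

( echo "***SPLUNK*** host=$(hostname)"; cat scc_report.txt ) > scc_report.import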


mattymo
Splunk Employee

Splunk can definitely do what you are trying to do, maybe in too many ways :). 

I think finding the one that works for you is the key, and I think you are close with your options 1 & 2. 

(I'd ditch #3.)

Option #1

This is probably the simplest way. I'd start here. 

I am not sure I grok why the source field is a problem when using the data, but it can definitely be replaced using props/transforms instead of hardcoding it in the inputs.

That way Splunk can still use the source path to grab your desired hostname with "host_regex" in inputs.conf, and then we overwrite the source in props/transforms as the data flows through the pipeline. I suggest simply using an ingest eval stanza like:

 

## props.conf
[my:sourcetype] 
TRANSFORMS = source_override

## transforms.conf
[source_override]
INGEST_EVAL = source:="my_simple_source"

 


Option #2 

This may provide even more flexibility in your logic for setting the fields you want. I would use "ingest_eval" here as well, rather than the general props/transforms, as it provides some powerful logic that regex alone may not be suited to. 

https://docs.splunk.com/Documentation/Splunk/9.0.4/Data/IngestEval

I believe the issue you are hitting with losing the hostname is that your current approach relies on the hostname being present in the raw data of every event, which is no longer true once the file has been line-broken. 

Instead, with ingest_eval we can write logic that manipulates the metadata to achieve your goal. 

Technically you would just be re-implementing the "host_regex" logic here, though, so it might be overkill, but it is useful for advanced cases and isn't limited to only the source field. 

 

## props.conf
[my:sourcetype] 
TRANSFORMS = extract_host_from_source,source_override

## transforms.conf

[extract_host_from_source]
SOURCE_KEY = MetaData:Source
REGEX = <some_regex_that_extracts_your_host_value>
FORMAT = host::$1
DEST_KEY = MetaData:Host

# once you have the hostname from source, now overwrite it!
[source_override]
INGEST_EVAL = source:="my_simple_source"

 


tldr: ingest_eval is mad powerful and allows you to refer to fields or metadata that exist already and apply powerful eval logic that goes well beyond what I showed here. I've even done conditional field overrides based on fields extracted, etc. (see the sketch below). It's super cool and leads into the ingest actions world eventually. 
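
For example, a conditional override could look like this untested sketch (the stanza name and regex are made up; it keeps the existing host when the HostName marker is absent):

## transforms.conf
[conditional_host_override]
# only rewrite host when the raw event actually contains the HostName marker
INGEST_EVAL = host:=if(match(_raw, ",HostName:"), replace(_raw, "(?s).*,HostName:([^,]+).*", "\1"), host)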

Note: I wrote these pseudocode-style and haven't tested them against data. If you have some sample data, I'm happy to try to tune them if you run into issues. 


References:
https://docs.splunk.com/Documentation/Splunk/9.0.4/Data/Overridedefaulthostassignments
https://docs.splunk.com/Documentation/Splunk/9.0.4/Data/IngestEval
https://docs.splunk.com/Documentation/Splunk/9.0.4/Admin/Transformsconf


 

- MattyMo

manjunathmeti
Champion

hi @icewolf69,

You can set the host value first using host_regex or host_segment in inputs.conf, then override the source value using props and transforms configurations, for example:
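
A minimal sketch of that source override, in the classic props/transforms style (the stanza name and static source value are made up):

# transforms.conf
[set_static_source]
REGEX = .
DEST_KEY = MetaData:Source
FORMAT = source::scc_report

# props.conf
[SCC_Report]
TRANSFORMS-set_source = set_static_source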
