Hey Fellow Splunkers,
I'm having a bit of trouble understanding how this works and whether I'm doing it correctly.
Currently on version 9.0.2
Scenario
Product Logs -> Syslog(Has a HF on it) -> IDX
Syslog writes to one single file that I monitor, and it contains multiple time formats and different line breaking. I basically want to bring all the syslog from this product into one sourcetype, kind of like a staging area, then split the events out based on regex. This is what I've got so far. Most of this is dummy data, so don't worry about scrutinizing it for typos etc.
Configuration
This is all on the HF
Inputs.conf
[monitor://path/to/product/syslogs]
index = syslog
sourcetype = product_staging
Props.conf
[product_staging]
TRANSFORMS-sourcetype = change_sourcetype_one, change_sourcetype_two
[sourcetype_one]
LINE_BREAKER = A line breaking example
TIME_FORMAT = %m-%a-%d %H:%M:%S
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 30
[sourcetype_two]
LINE_BREAKER = A line breaking example
TIME_FORMAT = %C-%b-%a %M:%k:%S
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 30
Transforms.conf
[change_sourcetype_one]
DEST_KEY = MetaData:Sourcetype
REGEX = (DataOne)
FORMAT = sourcetype::sourcetype_one
[change_sourcetype_two]
DEST_KEY = MetaData:Sourcetype
REGEX = (DataTwo)
FORMAT = sourcetype::sourcetype_two
I can get the data to split easily. My issue is that when the data splits off into the different sourcetypes, the index-time settings such as TIME_FORMAT, TIME_PREFIX, and LINE_BREAKER don't take effect on the new sourcetypes created by the split.
Is it simply because the original sourcetype [product_staging] has already touched the data with its own settings, and now the other sourcetypes can't apply their own?
I honestly don't understand what I'm doing wrong.
Any help would be greatly appreciated.
Hi
please check this "MASA" diagram of how Splunk pipelines process events: https://community.splunk.com/t5/Getting-Data-In/Diagrams-of-how-indexing-works-in-the-Splunk-platfor... especially the "Detailed Diagram UF to IDX".
As you can see there, timestamp extraction and your transforms happen in different parts of the pipeline. Timestamp extraction happens in the aggregator processor, while your sourcetype matching happens later, in the typing pipeline. Since an event cannot be sent back to an earlier phase of the indexing pipeline, timestamp extraction is not done again!
Maybe you could try using INGEST_EVAL for that as an extra step in your transforms definitions?
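As a rough, untested sketch of that idea (the stanza name, the 19-character substr() window, and the time format are placeholders taken from the dummy config above):

```ini
# transforms.conf on the HF (hypothetical stanza)
[set_time_sourcetype_one]
# INGEST_EVAL runs in the typing pipeline, so it can still rewrite
# _time even though the aggregator has already done its extraction.
INGEST_EVAL = _time=strptime(substr(_raw, 1, 19), "%m-%a-%d %H:%M:%S")
```

The stanza would then be appended to the TRANSFORMS list on the original staging sourcetype.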
r. Ismo
Hi @isoutamo
Thanks for your reply.
While I understand that diagram and believe it's correct as a baseline, it leaves me more confused.
Take the popular Palo Alto TA that you can get from Splunkbase. Its whole premise is splitting a single sourcetype into many others.
For example, an input uses pan:firewall as the sourcetype, as the official docs say to:
[pan:firewall]
category = Network & Security
description = Syslog from Palo Alto Networks Next-generation Firewall
pulldown_type = true
SHOULD_LINEMERGE = false
TIME_PREFIX = ^(?:[^,]*,){6}
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS-sourcetype = pan_threat, pan_traffic, pan_system, pan_config, pan_hipmatch, pan_correlation, pan_userid, pan_globalprotect, pan_decryption
If we look at pan_threat, it gets renamed to pan:threat within the TA, and the [pan:threat] stanza includes a TIME_FORMAT.
[pan_threat]
rename = pan:threat
[pan:threat]
SHOULD_LINEMERGE = false
EVENT_BREAKER_ENABLE = true
KV_MODE = none
TIME_PREFIX = ^(?:[^,]*,){6}
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %Y/%m/%d %H:%M:%S
This suggests that when the data comes in as [pan:firewall], it makes its way down to the typingQueue and applies the TRANSFORMS (in our case, [pan_threat]), then applies all the configuration in the [pan:threat] stanza, including the TIME_FORMAT, which belongs to the aggQueue. But that would mean revisiting queues that have already been passed.
How is this TA doing it, but I can't?
First, rename for a sourcetype is a search-phase, not an ingest-phase, parameter. Those parameters aren't used when you are ingesting data into Splunk. See Sourcetype configuration in the docs for more.
The only reason those TIME_* settings are under the [pan:threat] stanza is that there is somewhere an input which assigns that sourcetype directly. But as you can check from the props.conf definition
[pan_threat]
rename = pan:threat
the rename alone cannot bring those settings into use at the ingest phase!
You should remember that normally there aren't separate props.conf files for indexers and search heads. Usually they sit in the same TA/package, which quite often also contains the inputs.conf for the UF/HF. This can be quite confusing at times 😉
To see where each parameter is used, you should check https://www.aplura.com/assets/pdf/where_to_put_props.pdf, which explains it a little more clearly than Splunk's own documentation (e.g. https://docs.splunk.com/Documentation/Splunk/9.1.0/Deploy/Datapipeline and https://docs.splunk.com/Documentation/Splunk/9.1.0/Indexer/Indextimeversussearchtime).
I hope this helps you more than it confuses you?
As I said earlier, you can probably set the correct _time with INGEST_EVAL based on those final sourcetypes, but you need to add it under the original sourcetype definition's TRANSFORMS setting.
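A sketch of that wiring, using the dummy stanza names and formats from earlier in the thread (whether the rewritten sourcetype field is readable inside INGEST_EVAL at this point is an assumption worth verifying with test data):

```ini
# props.conf on the HF
[product_staging]
# the fix_time_* transforms run after the sourcetype rewrites
TRANSFORMS-sourcetype = change_sourcetype_one, change_sourcetype_two, fix_time_one, fix_time_two

# transforms.conf on the HF
[fix_time_one]
INGEST_EVAL = _time=if(sourcetype=="sourcetype_one", strptime(substr(_raw, 1, 19), "%m-%a-%d %H:%M:%S"), _time)

[fix_time_two]
INGEST_EVAL = _time=if(sourcetype=="sourcetype_two", strptime(substr(_raw, 1, 19), "%C-%b-%a %M:%k:%S"), _time)
```

Transforms in one class run left to right, so the sourcetype has already been rewritten by the time the fix_time_* stanzas evaluate.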
Hi @gcusello
Appreciate the reply.
The props.conf and transforms.conf are located within a custom app on the HF for the product. I've btool'd to make sure that they are being applied.
The timestamp config and regexes above are dummy data; I don't have the actual ones on me right now. However, I know they are correct. The timestamps have been tested with Splunk's "Add Data" feature on a SH to confirm the settings are right. The regexes work as well: I get the two other sourcetypes, sourcetype_one and sourcetype_two, based on my regex. I have no issues here.
My issue is that none of the other settings work as soon as the events are in sourcetype_one or sourcetype_two.
For example, sourcetype_one has:
TIME_FORMAT = %m-%a-%d %H:%M:%S
while sourcetype_two has:
TIME_FORMAT = %C-%b-%a %M:%k:%S
Neither of them takes effect, and Splunk ends up displaying the wrong time for both sourcetypes.
I've done this in the past but for some reason, it's just not working anymore.
Appreciate the recommendation; however, I like to keep it as a staging area for when new "unannounced" data gets through my regex. It just means I need to either fine-tune the other two sourcetypes or create another.
Hi @konka4,
I suppose that you have logs from different systems, so why don't you try using two different inputs (if possible), one for each sourcetype?
Ciao.
Giuseppe
@gcusello Unfortunately I can't, as the product sends all its logs via syslog to one file per day. That file then contains two or more differently formatted kinds of data and is monitored by one single monitor stanza.
This doesn't seem very uncommon, and I know Splunk can do it; something is just halting it. There's nothing else in between for the data to latch on to, either. It's just:
Syslog logs -> Syslog Server (with HF on it) -> IDX
It's a very simple setup, yet I'm having trouble getting the indexing pipeline on the HF to recognize that the different sourcetypes have their own settings to apply.
Hi @konka4,
as I said, the approach is correct and it works for other data sources (e.g. see the Fortinet Add-on).
The usual issues are the location of the conf files, which must be on the first full Splunk instance, and the regexes.
Could you share a sample of the logs of your two data sources?
Ciao.
Giuseppe
Hi @konka4,
your approach seems to be correct; only one question: where did you put props.conf and transforms.conf?
They must be located on the first full Splunk instance that the data passes through.
In your case, that's the HF.
Are you sure about the timestamp configuration and the regexes used to override the sourcetypes?
If you could share some samples, I could check them.
Last thing: if you have only two sourcetypes, you could give the values and settings of one of them to the default sourcetype, so you only have to override the other one.
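A rough sketch of that last idea, reusing the dummy formats from earlier in the thread: let the staging stanza itself carry the settings of the "default" format, so only the second format needs an override transform:

```ini
# props.conf on the HF
[product_staging]
# timestamp settings for the default format (what was sourcetype_one)
TIME_PREFIX = ^
TIME_FORMAT = %m-%a-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 30
# only the second format still needs a sourcetype override
TRANSFORMS-sourcetype = change_sourcetype_two
```

Keep in mind that timestamp extraction still happens before the rewrite, so events headed for sourcetype_two are parsed with these same settings (or fall back to Splunk's automatic timestamp recognition).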
Ciao.
Giuseppe