How to split data into separate sourcetypes with transforms

tkwaller
Builder

Hello

I have an input that is monitoring a file. The file contains data in multiple formats, including different timestamp formats. It's bad, but I was thinking I could use a transform to set the sourcetype and then use props for each sourcetype to format the data.
So I did this in inputs.conf:

[monitor:///var/log/this_log/*.ec]
index = main
sourcetype=momlog

Then I created a transforms.conf:

[momlog_json_sourcetype]
DEST_KEY = MetaData:Sourcetype
REGEX = \{\"msys\"
FORMAT = sourcetype::momlog:json


[momlog_basic_sourcetype]
DEST_KEY = MetaData:Sourcetype
REGEX = .*
FORMAT = sourcetype::momlog:basic

I also have a props.conf that looks like this:

[momlog:basic]
TIME_FORMAT = %s
TIME_PREFIX = ^
LINE_BREAKER = ([\r\n]+)
TRANSFORMS-basic = momlog_basic_sourcetype

[momlog:json]
TIME_FORMAT = %s
TIME_PREFIX = "timestamp":"
INDEXED_EXTRACTIONS = JSON
TRANSFORMS-json = momlog_json_sourcetype

My question is this:
What would the regex be for the NON-JSON data? Do inputs and props look correct? I'm testing locally, so I can break things all day long.

Thanks for the assistance.

alemarzu
Motivator

Hi there @tkwaller

Try adding this to your props.conf

 [momlog]
 SHOULD_LINEMERGE=false
 NO_BINARY_CHECK=true
 TIME_PREFIX =\"timestamp\":\"
 TRANSFORMS-sourcetype_routing = momlog_basic_sourcetype, momlog_json_sourcetype

 [momlog:basic]
 TIME_FORMAT = %s
 TIME_PREFIX = ^
 LINE_BREAKER = ([\r\n]+)

 [momlog:json]
 TIME_FORMAT = %s
 TIME_PREFIX = \"timestamp\":\"
 INDEXED_EXTRACTIONS = JSON

EDITED: Added a few things on the main sourcetype and fixed TIME_PREFIX regex for momlog:json sourcetype.
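
The order of the two transforms in that TRANSFORMS- list appears to matter: the catch-all runs first and the JSON rule runs second, so the JSON rule can overwrite the sourcetype on lines that contain {"msys". Below is a minimal Python sketch of that "last match wins" idea, run outside Splunk with the two REGEX values from this thread. The names TRANSFORMS and route are hypothetical and only for illustration; they are not Splunk APIs, and the JSON sample is a shortened version of the data posted in the thread.

```python
import re

# (regex, sourcetype) pairs in the same order as the TRANSFORMS- list above.
# Hypothetical stand-in for transforms.conf; not a Splunk API.
TRANSFORMS = [
    (re.compile(r".*"), "momlog:basic"),         # momlog_basic_sourcetype (catch-all)
    (re.compile(r'\{\"msys\"'), "momlog:json"),  # momlog_json_sourcetype
]


def route(event, transforms=TRANSFORMS, default="momlog"):
    """Return the sourcetype after applying every transform in order."""
    sourcetype = default
    for regex, target in transforms:
        if regex.search(event):
            sourcetype = target  # a later match overwrites an earlier one
    return sourcetype


samples = [
    "1503664701@@@@M1",
    "1503676402: Marker 1",
    # Shortened from the JSON sample in the thread.
    '{"msys":{"message_event":{"timestamp":"1503527383","type":"rejection"}}}',
]

for line in samples:
    print(route(line), "<-", line[:40])
```

If the list order were reversed in this sketch, the catch-all would run last and every line would end up as momlog:basic.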

tkwaller
Builder

I added that, but when I did, it broke the formatting: JSON isn't recognized and the sourcetype is still momlog.

alemarzu
Motivator

Please try the above to see if it works now that I've added a few more things.

tkwaller
Builder

Yes, that was exactly it. The sourcetype now splits properly and the formatting is correct as well. Thanks everyone for the help!

alemarzu
Motivator

Glad it worked out, happy splunking!

robertlynch2020
Motivator

Hi guys,
I used this and it worked, thanks.

One small question. The JSON I have has characters before it, so I need to get rid of them before I can get to the pure JSON. I have done the following; however, it is taking in the whole line, not just the JSON. Is there a way to get it to take in only the JSON?

Example - 2018-01-10 15:52:03 [metrics-application-1-thread-1] INFO METRIC:41 - {"v":"1.0","t":"MTR","ts":"2018-01-10T15:52:03.700Z","h":"mx7654vm","pid" ....etc..

Transform
[AMBER_RAW_json_METRIC]
DEST_KEY = MetaData:Sourcetype
REGEX = {"v":"1.0\"
FORMAT = sourcetype::AMBER_RAW:METRIC

Props
[AMBER_RAW:METRIC]
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N
TIME_PREFIX = \"ts\":\"
INDEXED_EXTRACTIONS = JSON

So it takes the full line, not just the JSON.

Thanks in advance :)
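
For checking the pattern outside Splunk, here is a minimal Python sketch that separates the log prefix from the JSON payload. It assumes the payload starts at the first { on the line and runs to the end of the line. The sample line is a shortened, hypothetical version of the example above (the original is truncated), and split_prefix is an illustrative name. The sketch only shows the regex; it does not show how to drop the prefix at index time.

```python
import json
import re

# Everything before the first "{" is treated as the log prefix (assumption).
PREFIX_AND_JSON = re.compile(r"^(?P<prefix>[^{]*)(?P<payload>\{.*\})\s*$")


def split_prefix(line):
    """Return (prefix, parsed JSON dict), or None if no JSON object is found."""
    match = PREFIX_AND_JSON.match(line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group("payload"))
    except json.JSONDecodeError:
        return None
    return match.group("prefix"), payload


# Hypothetical, shortened line: the original example is truncated.
sample = (
    '2018-01-10 15:52:03 [metrics-application-1-thread-1] INFO METRIC:41 - '
    '{"v":"1.0","t":"MTR","ts":"2018-01-10T15:52:03.700Z","h":"mx7654vm"}'
)

result = split_prefix(sample)
if result is not None:
    prefix, payload = result
    print(repr(prefix))
    print(payload["ts"])
```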

DalJeanis
SplunkTrust

Clearly, something is wrong with the TIME_PREFIX in props: it does not have a closing quote.

I would expect that anything that doesn't match the json would therefore be non-json, so you would just use .*

woodcock
Esteemed Legend

I would escape all 3 double-quotes (can't hurt).

tkwaller
Builder

OK, I updated the original post with new testing configs. Everything is working EXCEPT sourcetyping. It's not breaking out the sourcetypes; it just uses the one set in the input, BUT if I remove that, it uses the "too_small" sourcetype. What am I missing? It has to be something simple.

Thanks again!

woodcock
Esteemed Legend

How can we possibly know what REGEX will work if you do not post sample data? In any case, the PaloAlto TA does this so you can download that app and check it all out. It gets stuff from syslog that is supposed to come in as sourcetype=pan:log and then it splits it out into 5 or 6 different sourcetypes based on RegEx patterns, just like what you are doing.

tkwaller
Builder

Well, really, all it has to do is match anything that isn't in JSON format, meaning anything that ISN'T matched by

TIME_PREFIX = "timestamp":"

which is why I didn't add the data samples. I can take a look at the app, but I don't think it should really be that difficult.

But JUST IN CASE, here is some sample data
(this is actually data from several files):

1503626401@N@/tmp/12354@@user@1
1503664701@@@@M1
1503664761@@@@M1
1503664821@@@@M1
1503664881@@@@M1
1503664941@@@@M1
1503665001@@@@M1
1503665061@@@@M1
1503665121@@@@M1
1503665181@@@@M1
1503665241@@@@M1
1503665301@@@@M1
1503665361@@@@M1
1503665421@@@@M1
1503665481@@@@M1
{"msys":{"message_event":{"origination":"unauthorized_attempt","conn_name":"stuff","recv_method":"esmtp","remote_addr":"10.0.0.0:12345","raw_reason":"500 5.5.2 unrecognized command","node_name":"host@domain.com","scope_name":"scriptlet","pathway_group":"default","error_code":"500","msg_proc_state":"awaiting mailfrom","tenant_id":"__unauthorized__","reason":"500 5.5.2 unrecognized command","pathway":"default","local_addr":"10.0.0.0:12345","timestamp":"1503524959","customer_id":"0","event_id":"1234512354","type":"rejection"}}}
{"msys":{"message_event":{"timestamp":"1503527383","customer_id":"1","msg_proc_state":"awaiting mailfrom","pathway_group":"default","remote_addr":"10.0.0.0:12345","raw_reason":"500 5.5.2 unrecognized command","conn_name":"11/22-12345-1D10E111","event_id":"1234512345","reason":"500 5.5.2 unrecognized command","tenant_id":"__unauthorized__","type":"rejection","error_code":"500","local_addr":"10.0.0.0:12345","recv_method":"esmtp","node_name":"host.domain.com","origination":"unauthorized_attempt","pathway":"default","scope_name":"scriptlet"}}}
{"msys":{"track_event":{"rcpt_to":"user@domain.com","type":"open","rcpt_meta":{ "userMessageId": "123456789" },"campaign_id":"test_campaign","node_name":"host.domain.com","ip_address":"10.0.0.0:12345","customer_id":"1","template_id":"template_1234512345","transmission_id":"1234512345","event_id":"12345122345","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.8 (KHTML, like Gecko)","message_id":"000074029e597f538c00","accept_language":"en-us","rcpt_tags":[ "testTag" ],"delv_method":"esmtp","template_version":"0","timestamp":"1503527606"}}}
1503676342@@@@M1
1503676402@@@@M1
1503676462@@@@M1
1503676522@@@@M1
1503676582@@@@M1
1503676642@@@@M1
1503676702@@@@M1
1503676402: Marker 1
1503676462: Marker 1
1503676522: Marker 1
1503676582: Marker 1
1503676642: Marker 1
1503676702: Marker 1
1503676762: Marker 1

tkwaller
Builder

Timestamps are correct. Why would the time prefix need a closing quote? It's the prefix of the epoch timestamp.

I tried the .* to match, but my config must still be incorrect in props or inputs, as I got only ONE of the JSON logs and none of the sourcetyping was correct.

I tried several different variations of inputs and props; it's just not quite right yet. Close, though.

I updated the original post to reflect all changes made.
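
To illustrate the point about the prefix: TIME_PREFIX is a regex for the text that sits just before the timestamp, and TIME_FORMAT = %s then reads epoch seconds from that position. Here is a rough Python sketch of that idea, run outside Splunk, for the two line shapes in the sample data. extract_epoch is a hypothetical helper, the 10-digit assumption only fits these samples, and the JSON line is shortened from the posted data.

```python
import re
from datetime import datetime, timezone

# Prefix patterns taken from the props in this thread.
TIME_PREFIXES = {
    "momlog:basic": r"^",
    "momlog:json": r'\"timestamp\":\"',
}


def extract_epoch(event, sourcetype):
    """Find epoch seconds right after the sourcetype's TIME_PREFIX."""
    # Assumption: timestamps are 10-digit epoch seconds, as in the samples.
    match = re.search(TIME_PREFIXES[sourcetype] + r"(\d{10})", event)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


basic_line = "1503664701@@@@M1"
# Shortened from the JSON sample in the thread.
json_line = '{"msys":{"message_event":{"timestamp":"1503524959","type":"rejection"}}}'

print(extract_epoch(basic_line, "momlog:basic"))
print(extract_epoch(json_line, "momlog:json"))
```

The prefix pattern ends at the opening quote of the value, so the digits that follow it are what gets parsed.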
