I would like to retrieve the data in /var/log as correctly as possible.
Currently I am simply monitoring the entire /var/log folder with no pre-selected source type.
On the List of pretrained source types I see a few callouts for log files such as syslog but the majority of log files are not present in this list. Perhaps some of these types can be used elsewhere though? For example, I see the linux_messages_syslog pretrained type refers to logs in /var/log/messages and since syslog != messages I presume this type may be useful on other files as well?
So I can use the few pretrained source types and then do I need to make my own source types for all the other log files?
Is there any repository of user-created source types? I have to imagine most log file types have had source types created for them by now. Or do people just not apply source types and simply search the unstructured data?
Hi
As @PickleRick said, the best way is probably to start by looking on Splunkbase for apps / TAs that someone has already built.
You could start with this one, for example: https://splunkbase.splunk.com/app/833/
When you think about what a sourcetype actually is, you realise that it just defines the format of an individual log event. Basically nothing else. How you then use sourcetypes to help your queries is another story entirely. But just remember that a sourcetype (ST) is just the lexical format of a log file/stream/something. If and when that format changes, you should change the name of the ST, e.g. by appending an incrementing number (my:own:sourcetype:0 vs. my:own:sourcetype:1). Of course you should use more descriptive names than those. There are several docs where you can find naming standards for sourcetypes.
Naming a sourcetype just means setting its name in inputs.conf, nothing else. Then, when you want to use / tokenise / extract fields from it, you need additional definitions in props.conf and/or transforms.conf.
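As a rough sketch of that split: the name is assigned in inputs.conf, and a search-time field extraction for it lives in props.conf. The log path /var/log/myapp/app.log, the sourcetype name my:app:log and the "user=..." event format are all made up for illustration; none of them come from this thread.

```ini
# inputs.conf (on the forwarder): only assigns the sourcetype name
# NOTE: path and sourcetype name are hypothetical
[monitor:///var/log/myapp/app.log]
disabled = false
index = main
sourcetype = my:app:log

# props.conf (on the search head): search-time field extraction
# Assumes events contain text like "user=alice"; adjust the regex to your format
[my:app:log]
EXTRACT-user = user=(?<user>\S+)
```

With something like this in place, a search on sourcetype="my:app:log" should show a user field, provided the events really match the assumed pattern.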
r. Ismo
Well, it's a tricky subject 😉
In the case of "pre-defined" appliances or similar solutions (like Pi-hole, for example) you usually have an app, and its documentation often specifies what to do on both Splunk's input side and in the solution's logging settings in order to achieve interoperability (although sometimes the app is prepared with logging to files in mind while you want to receive events via syslog, or you might encounter other issues).
Often, the apps define several separate sourcetypes for the various types of logs coming from a single application and dynamically rewrite the sourcetype on input when the ingest pipeline is able to match the event to a particular kind of event (that's a completely different thing from an eventtype at search time!).
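To illustrate what such an index-time sourcetype rewrite can look like, here is a minimal sketch using a props.conf/transforms.conf pair. The sourcetype names myapp:raw and myapp:audit and the "AUDIT:" marker are hypothetical; a real app will ship its own stanzas and regexes.

```ini
# props.conf (on the indexer or heavy forwarder, where parsing happens)
# NOTE: sourcetype names and the regex are hypothetical
[myapp:raw]
TRANSFORMS-set_st = myapp_set_audit_sourcetype

# transforms.conf
[myapp_set_audit_sourcetype]
# Events whose raw text contains "AUDIT:" get a more specific sourcetype
REGEX = AUDIT:
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::myapp:audit
```

Events that do not match the regex keep the sourcetype assigned on input.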
In the case of "general logs", well, it's up to you. As a general rule, from my experience: think about what you need the events for. For example, in one of my production environments I restrict the syslog messages forwarded into Splunk by process name and only get messages from a very strict set of programs.
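One possible way to do this kind of restriction inside Splunk (it can just as well be done in the syslog daemon before the data reaches Splunk, and this is not necessarily how the setup above is built) is the keep-some-discard-the-rest pattern with nullQueue. The sketch below assumes the traditional syslog line layout ("Jan  5 10:00:00 host sshd[123]: ...") and picks sshd and sudo purely as example process names.

```ini
# props.conf (on the indexer or heavy forwarder; a universal forwarder does not apply these)
# Assumes the events carry the linux_messages_syslog sourcetype
[linux_messages_syslog]
# Order matters: first send everything to nullQueue, then route the wanted events back
TRANSFORMS-filter = syslog_drop_all, syslog_keep_selected

# transforms.conf
[syslog_drop_all]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

[syslog_keep_selected]
# Example process names only; replace with the programs you actually need
REGEX = ^\w{3}\s+\d+\s[\d:]+\s\S+\s(?:sshd|sudo)(?:\[\d+\])?:
DEST_KEY = queue
FORMAT = indexQueue
```

Events discarded this way are not indexed, so they do not count against license usage.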
I'd start by looking on Splunkbase for an app for the particular application's output.
I highly recommend not using the built-in Splunk apps anymore, especially for monitoring operating system logs. Most of the Splunk apps have been updated to pull in not just the OS logs but also tons of telemetry data. This can increase the data going into Splunk by orders of magnitude.
The apps Splunk creates serve Splunk's bottom line, not the customer's best interest. If you want an affordable and manageable Splunk installation, I'd suggest avoiding Splunk apps and instead running your data through a different product first to shrink and enhance it before it hits Splunk. Moving away from Splunk apps will save a massive amount of your storage usage if you don't need telemetry data in a system that is way too expensive to store telemetry data in.
_What_ you're ingesting is entirely up to you. Even if you're pulling or receiving extra data because the source serves it, you can always filter it out during the ingestion process. Some cases require more data, some less; some people need the full original events retained for investigation/evidence purposes, some don't. So there is no one-size-fits-all solution, either with Splunk or with any other vendor. Writing here that Splunk maliciously publishes apps to pump up your license usage is simply spreading FUD; please refrain from doing so, especially by digging up an old thread just to do it.
Of course you can use various external tools to manipulate your data before ingesting it into Splunk. You can even mutilate your data to the point that it won't fit any widely-used apps and solutions so you will save some storage but will have to manually do many things for which normally there are ready-made apps. It's your choice.
The problem with _any_ log management/SIEM solution (data analytics maybe less so but also not unheard of) is that people don't know _what_ and _why_ they want ingested and end up pulling everything "just in case".
Thank you both for your assistance! I am working to set up the app now. This seems like a much better solution.
It depends greatly on what the source of the log entries is. In /var/log you can have:
So just because something is in /var/log, it doesn't tell you for sure what it is and what kind of events it contains. You have to know what type of data you're ingesting (and it's usually best to split it by source - meaning specific program - into separate files).
Then it gets easier. You look for a TA, or at least raw parsing rules, for the specific application and create an appropriate input that reads the file containing that sourcetype.
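If no TA exists, the "raw parsing rules" can be a few lines of props.conf. The sketch below is for a hypothetical single-application log; it assumes one event per line, with each line starting with a timestamp like 2024-01-31 12:34:56. The path and the sourcetype name myapp:log are placeholders.

```ini
# inputs.conf (on the forwarder)
# NOTE: path and sourcetype name are hypothetical
[monitor:///var/log/myapp/myapp.log]
disabled = false
index = main
sourcetype = myapp:log

# props.conf (on the indexer or heavy forwarder)
[myapp:log]
# One event per line
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Timestamp sits at the start of the line, e.g. 2024-01-31 12:34:56 (19 characters)
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
```

Setting line breaking and timestamp recognition explicitly, rather than relying on auto-detection, tends to make ingestion more predictable.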
We meet again PickleRick,
Thank you for your response. So here is my current inputs.conf file. I've gone through everything in my /var/log folder and attempted to classify it. As you can see I have a number of logs with an unknown source type (or a guess that I'm unsure if it's correct or not) - how can I figure out what to put there?
## inputs.conf for splunk universal forwarders
## /var/log
# update-alternatives, symbolic links
# Ok to ignore?
[monitor:///var/log/alternatives.log]
disabled = false
index = main
# sourcetype = ???
# auth log (sudo, ssh, etc.)
[monitor:///var/log/auth.log]
disabled = false
index = main
sourcetype = linux_secure
# bootstrap (this may not actually get updated on boot)
# Ok to ignore?
[monitor:///var/log/bootstrap.log]
disabled = false
index = main
# sourcetype = linux_bootlog
# btmp log (failed login attempts)
# Splunk cannot index this data type
# TODO figure out how to get this in splunk
# [monitor:///var/log/btmp]
# disabled = false
# index = main
# sourcetype = ??
# dpkg log (dpkg and apt installs)
[monitor:///var/log/dpkg.log]
disabled = false
index = main
# sourcetype = ??
# faillog (failed user logins)
# Splunk cannot index this data type
# TODO figure out how to get this in splunk
# [monitor:///var/log/faillog]
# disabled = false
# index = main
# sourcetype = ??
# kern log (kernel logs)
[monitor:///var/log/kern.log]
disabled = false
index = main
# sourcetype = linux_messages_syslog
# lastlog (last login by user)
# Splunk cannot index this data type
# TODO figure out how to get this in splunk
# [monitor:///var/log/lastlog]
# disabled = false
# index = main
# sourcetype = ??
# syslog (system logs)
[monitor:///var/log/syslog]
disabled = false
index = main
sourcetype = linux_messages_syslog
# tallylog (count of attempted logins/fails)
# Ok to ignore?
# [monitor:///var/log/tallylog]
# disabled = false
# index = main
# sourcetype = ??
# ufw log (firewall)
[monitor:///var/log/ufw.log]
disabled = false
index = main
# sourcetype = linux_messages_syslog
# wtmp (login records)
# Splunk cannot index this data type
# TODO figure out how to get this in splunk
# [monitor:///var/log/wtmp]
# disabled = false
# index = main
# sourcetype = ??
## /var/log/subdirs
# apache access log
[monitor:///var/log/apache/access.log]
disabled = false
index = main
sourcetype = access_combined
# apache error log
[monitor:///var/log/apache/error.log]
disabled = false
index = main
sourcetype = apache_error
# apache other vhosts access log
[monitor:///var/log/apache/other_vhosts_access.log]
disabled = false
index = main
sourcetype = access_combined
# apt history log
[monitor:///var/log/apt/history.log]
disabled = false
index = main
# sourcetype = ??
# apt term log
[monitor:///var/log/apt/term.log]
disabled = false
index = main
# sourcetype = ??
# Ignoring /chrony (empty)
# Ignoring /installer (tons of files)
# Ignoring /journal (binaries)
# letsencrypt log
[monitor:///var/log/letsencrypt/letsencrypt.log]
disabled = false
index = main
# sourcetype = ??
# mysql error log
[monitor:///var/log/mysql/error.log]
disabled = false
index = main
sourcetype = mysqld_error
# unattended upgrades dpkg log
[monitor:///var/log/unattended-upgrades/unattended-upgrades-dpkg.log]
disabled = false
index = main
# sourcetype = ??
# unattended upgrades shutdown log
[monitor:///var/log/unattended-upgrades/unattended-upgrades-shutdown.log]
disabled = false
index = main
# sourcetype = ??