Monitoring Splunk

How do I Save Raw Log Data for Audit Purposes?

Engager

If I have a regulatory requirement to store raw data for audit purposes, as well as a need to let other tools access the data, how do I do that? The requirement is to retain all of the information present in the original messages.

  • Forwarding data to syslog appears to rewrite the PRI field, which obscures the original priority and source of the message. This is not an option, since the message has been altered.
  • There is an option to forward data to a TCP socket, but what do I need on the other end to write the archive, and what format is the data in? Has the data been modified by Splunk by this stage? Is the TCP socket data sent before or after any modifications made for indexing and analysis purposes?
  • The index directories contain files in compressed rawdata format. How would I read these files externally to Splunk?

The solution I have in mind is to have syslog-ng receive the data and then forward it to Splunk whilst at the same time writing to a disk archive but this seems an overly complex solution to me.
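
For what it's worth, the dual-write approach can be sketched in a few lines of syslog-ng configuration. This is only a sketch: the hostname and archive path are hypothetical, and flags(no-parse) is used so the archived copy keeps the original line, PRI header included.

```
# syslog-ng.conf sketch -- receive on UDP 514, archive verbatim, relay onward.
# Hostname and paths below are hypothetical examples.
source s_net {
    udp(ip(0.0.0.0) port(514) flags(no-parse));  # no-parse keeps the raw line, PRI included
};

destination d_archive {
    file("/var/log/archive/${HOST}-${YEAR}${MONTH}${DAY}.log" template("${MESSAGE}\n"));
};

destination d_splunk {
    tcp("splunk-indexer.example.com" port(514));
};

log { source(s_net); destination(d_archive); destination(d_splunk); };
```

With flags(no-parse), syslog-ng treats the whole datagram as the message body, so the archive file gets a byte-for-byte copy of each packet.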

1 Solution

SplunkTrust

I don't know your regulatory requirements, but the "raw data" requirement sounds, well, odd. How do you prove your raw data has not been tampered with -- both on disk and in transit? Splunk makes both of those easy by using SSL to guard data in transit and blockSignature to sign the data being indexed, for later verification that it has not been altered.

Forwarding to a TCP socket requires the receiving software to read off the socket and write it to disk. The data forwarded over the TCP socket is the raw text of the Splunk events themselves. It is possible, depending on your configuration, that Splunk has changed the data before forwarding it. (Another way to say that is that there is no way to guarantee that it has not.) Various Splunk options like nullQueue and SEDCMD are designed to change what gets indexed to meet your requirements. To prove that Splunk did not change your data, you would have to be able to prove that none of these options are (or ever were) in use in any of your app configurations. Not impossible, but painful.
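
As an illustration of what the "other end" of that TCP socket might look like, here is a minimal Python sketch. The names (archive_stream, ArchiveHandler), the archive path, and the port are my own inventions for illustration, not anything Splunk ships; a production receiver would also need rotation, integrity protection, and error handling.

```python
# Minimal sketch of a TCP receiver that appends whatever Splunk forwards
# (raw event text, with sendCookedData = false on the Splunk side) to a file.
# All names, paths, and the port here are hypothetical examples.
import socket
import socketserver

ARCHIVE_PATH = "raw_events.log"  # hypothetical archive location


def archive_stream(conn: socket.socket, path: str) -> int:
    """Append everything read from conn to the archive file; return byte count."""
    total = 0
    with open(path, "ab") as f:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # peer closed the connection
                break
            f.write(chunk)
            total += len(chunk)
    return total


class ArchiveHandler(socketserver.BaseRequestHandler):
    def handle(self):
        archive_stream(self.request, ARCHIVE_PATH)


if __name__ == "__main__":
    # Point a [tcpout] stanza in outputs.conf at this host:port.
    with socketserver.TCPServer(("0.0.0.0", 9997), ArchiveHandler) as srv:
        srv.serve_forever()
```

Note that this only answers the mechanics of writing the bytes to disk; it does nothing to prove the bytes were not modified before Splunk forwarded them, which is the harder part of the requirement.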

The rawdata stored in the Splunk index buckets is stored in Splunk's proprietary format, which is subject to change. I would not rely on being able to read that data directly long term.

Your syslog solution would work, but would require substantial engineering on your own part to make sure the data was not changed at rest or in transit.

You might be better off getting an opinion from your regulatory authorities as to whether Splunk's native data protection capabilities allow you to meet the intent of their requirements. It doesn't hurt to ask, and could save you a lot of engineering effort.

There are some compelling reasons, though, for sticking rsyslog / syslog-ng in front of Splunk for the purpose of receiving UDP data. One big reason (for my deployment anyway) is that Splunk has to come down now and again for various purposes. Once set up, rsyslog is for the most part "set it and forget it." Having Splunk read the flat files made by rsyslog's UDP receiver gives you a technique for continuing to receive your important log data while Splunk might be down for whatever reason. In my deployment, we do a nightly shutdown of our indexers for a few minutes - long enough to get a cold incremental backup of the hot buckets. That said, we use this data as a staging area only - nothing is done to guarantee the authenticity of this data and it does not contribute to long-term archiving.
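
A sketch of that front end, assuming rsyslog with the imudp module (the file path, template name, and config filename are hypothetical examples):

```
# /etc/rsyslog.d/10-udp-archive.conf -- hypothetical example
module(load="imudp")
input(type="imudp" port="514")

# One flat file per sending host per day
template(name="PerHostFile" type="string"
         string="/var/log/netlog/%HOSTNAME%/%$YEAR%-%$MONTH%-%$DAY%.log")
action(type="omfile" dynaFile="PerHostFile")
```

Splunk then picks the files up with an ordinary monitor input, something like [monitor:///var/log/netlog] in inputs.conf, so nothing is lost while the indexers are down.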

With the most basic TCP forwarding to a third-party system, the syslog facility/severity data will, by default, still be stripped out. There is a no_priority_stripping option in inputs.conf, but I don't know what effect it has on syslog data combined with TCP forwarding.
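
The relevant settings look roughly like this (the destination server name is a hypothetical example, and as noted I have not verified how the two settings interact):

```
# inputs.conf -- ask Splunk to keep the PRI header on the UDP syslog input
[udp://514]
sourcetype = syslog
no_priority_stripping = true
no_appending_timestamp = true

# outputs.conf -- send uncooked (raw) event text to the third-party archive
[tcpout:archive]
server = archive.example.com:9997
sendCookedData = false
```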


Splunk Employee

Starting with version 6.3 (September 2015), Splunk has provided Data Integrity control by hashing 'slices' of indexed raw data in 128KB chunks. Those hashes are stored in the bucket's l2Hash file. The slice size is configurable.
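
Enabling and verifying it looks roughly like this (the index name is a hypothetical example):

```
# indexes.conf -- enabled per index
[my_audit_index]
enableDataIntegrityControl = true
```

Integrity can then be verified later from the CLI with `splunk check-integrity -index my_audit_index` (or `-bucketPath` for a single bucket).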

For further details, see the documentation page:
http://docs.splunk.com/Documentation/Splunk/7.0.0/Security/Dataintegritycontrol

There is also a good blog post on the subject:
https://www.splunk.com/blog/2015/10/28/data-integrity-is-back-baby.html


Communicator

I have a requirement to maintain raw (evt/evtx) logs for several years. I am beginning to address this issue as well. Good question.



Engager

The thing that concerns me is the stripping of the information in the PRI field that is present in the original log messages. Losing the priority of a log message is not an option where audit and SIEM are concerned. I will clarify the question about the data being "unavoidably altered", i.e. whether there is information that will be lost no matter what we do. I'd be interested in your thoughts.


Motivator

hard answer 🙂

Motivator

I think the solution you have in mind is the way to go.

Motivator

easy answer