Monitoring Splunk

How do I Save Raw Log Data for Audit Purposes?

Engager

If I have a regulatory requirement to store raw data for audit purposes, as well as a need to let other tools access the data, how do I do that? The requirement is to retain all of the information present in the original messages.

  • Forwarding data to syslog appears to rewrite the PRI field, which obscures the original priority and source of the message. This is not an option, since the message has been altered.
  • There is an option to forward data to a TCP socket, but what do I need on the other end to write the archive, and what format is the data in? Has the data been modified by Splunk by this stage? Is the TCP socket data sent before or after any modifications made for indexing and analysis purposes?
  • The index directories contain files in compressed rawdata format. How would I read these files externally to Splunk?

The solution I have in mind is to have syslog-ng receive the data and then forward it to Splunk whilst at the same time writing to a disk archive but this seems an overly complex solution to me.
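
For what it's worth, the dual-write approach can be sketched in a few lines of syslog-ng configuration. This is only a sketch: the hostname and archive path are hypothetical, and flags(no-parse) is used so the archived copy keeps the original line, PRI header included.

```
# syslog-ng.conf sketch -- receive on UDP 514, archive verbatim, relay onward.
# Hostname and paths below are hypothetical examples.
source s_net {
    udp(ip(0.0.0.0) port(514) flags(no-parse));  # no-parse keeps the raw line, PRI included
};

destination d_archive {
    file("/var/log/archive/${HOST}-${YEAR}${MONTH}${DAY}.log" template("${MESSAGE}\n"));
};

destination d_splunk {
    tcp("splunk-indexer.example.com" port(514));
};

log { source(s_net); destination(d_archive); destination(d_splunk); };
```

With flags(no-parse), syslog-ng treats the whole datagram as the message body, so the archive file gets a byte-for-byte copy of each packet.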

1 Solution

SplunkTrust

I don't know your regulatory requirements, but the "raw data" requirement sounds, well, odd. How do you prove your raw data has not been tampered with -- both on disk and in transit? Splunk makes both of those easy by using SSL to guard data in transit and blockSignature to sign the data being indexed, for later verification that it has not been altered.

Forwarding to a TCP socket requires the receiving software to read off the socket and write it to disk. The data forwarded over the TCP socket is the raw text of the Splunk events themselves. It is possible, depending on your configuration, that Splunk has changed the data before forwarding it. (Another way to say that is that there is no way to guarantee that it has not.) Various Splunk options like nullQueue and SEDCMD are designed to change what gets indexed to meet your requirements. To prove that Splunk did not change your data, you would have to be able to prove that none of these options are (or ever were) in use in any of your app configurations. Not impossible, but painful.
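
As an illustration of what the "other end" of that TCP socket might look like, here is a minimal Python sketch. The names (archive_stream, ArchiveHandler), the archive path, and the port are my own inventions for illustration, not anything Splunk ships; a production receiver would also need rotation, integrity protection, and error handling.

```python
# Minimal sketch of a TCP receiver that appends whatever Splunk forwards
# (raw event text, with sendCookedData = false on the Splunk side) to a file.
# All names, paths, and the port here are hypothetical examples.
import socket
import socketserver

ARCHIVE_PATH = "raw_events.log"  # hypothetical archive location


def archive_stream(conn: socket.socket, path: str) -> int:
    """Append everything read from conn to the archive file; return byte count."""
    total = 0
    with open(path, "ab") as f:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # peer closed the connection
                break
            f.write(chunk)
            total += len(chunk)
    return total


class ArchiveHandler(socketserver.BaseRequestHandler):
    def handle(self):
        archive_stream(self.request, ARCHIVE_PATH)


if __name__ == "__main__":
    # Point a [tcpout] stanza in outputs.conf at this host:port.
    with socketserver.TCPServer(("0.0.0.0", 9997), ArchiveHandler) as srv:
        srv.serve_forever()
```

Note that this only answers the mechanics of writing the bytes to disk; it does nothing to prove the bytes were not modified before Splunk forwarded them, which is the harder part of the requirement.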

The rawdata stored in the Splunk index buckets is stored in Splunk's proprietary format, which is subject to change. I would not rely on being able to read that data directly long term.

Your syslog solution would work, but would require substantial engineering on your own part to make sure the data was not changed at rest or in transit.

You might be better off getting an opinion from your regulatory authorities as to whether Splunk's native data protection capabilities allow you to meet the intent of their requirements. It doesn't hurt to ask, and could save you a lot of engineering effort.

There are some compelling reasons, though, for sticking rsyslog / syslog-ng in front of Splunk for the purpose of receiving UDP data. One big reason (for my deployment anyway) is that Splunk has to come down now and again for various purposes. Once set up, rsyslog is for the most part "set it and forget it." Having Splunk read the flat files made by rsyslog's UDP receiver gives you a technique for continuing to receive your important log data while Splunk might be down for whatever reason. In my deployment, we do a nightly shutdown of our indexers for a few minutes - long enough to get a cold incremental backup of the hot buckets. That said, we use this data as a staging area only - nothing is done to guarantee the authenticity of this data and it does not contribute to long-term archiving.
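
A sketch of that front end, assuming rsyslog with the imudp module (the file path, template name, and config filename are hypothetical examples):

```
# /etc/rsyslog.d/10-udp-archive.conf -- hypothetical example
module(load="imudp")
input(type="imudp" port="514")

# One flat file per sending host per day
template(name="PerHostFile" type="string"
         string="/var/log/netlog/%HOSTNAME%/%$YEAR%-%$MONTH%-%$DAY%.log")
action(type="omfile" dynaFile="PerHostFile")
```

Splunk then picks the files up with an ordinary monitor input, something like [monitor:///var/log/netlog] in inputs.conf, so nothing is lost while the indexers are down.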

With the most basic TCP forwarding to a third-party system, the syslog facility/severity data will, by default, still be stripped out. There is a no_priority_stripping option in inputs.conf, but I don't know what effect it has on syslog data combined with TCP forwarding.
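
The relevant settings look roughly like this (the destination server name is a hypothetical example, and as noted I have not verified how the two settings interact):

```
# inputs.conf -- ask Splunk to keep the PRI header on the UDP syslog input
[udp://514]
sourcetype = syslog
no_priority_stripping = true
no_appending_timestamp = true

# outputs.conf -- send uncooked (raw) event text to the third-party archive
[tcpout:archive]
server = archive.example.com:9997
sendCookedData = false
```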


Splunk Employee

Starting with version 6.3 (September 2015), Splunk has provided Data Integrity control by hashing 'slices' of indexed raw data in 128KB chunks. Those hashes are stored in the bucket's l2Hash file. The slice size is configurable.
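
Enabling and verifying it looks roughly like this (the index name is a hypothetical example):

```
# indexes.conf -- enabled per index
[my_audit_index]
enableDataIntegrityControl = true
```

Integrity can then be verified later from the CLI with `splunk check-integrity -index my_audit_index` (or `-bucketPath` for a single bucket).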

For further details, see the documentation page:
http://docs.splunk.com/Documentation/Splunk/7.0.0/Security/Dataintegritycontrol

There is also a good blog post on the subject:
https://www.splunk.com/blog/2015/10/28/data-integrity-is-back-baby.html


Communicator

I have a requirement to maintain raw (evt/evtx) logs for several years. I am beginning to address this issue as well. Good question.



Engager

The thing that concerns me is the stripping of the information in the PRI field that is present in the original log messages. Losing the priority of a log message is not an option where audit and SIEM are concerned. I will clarify the question about the data being "unavoidably altered", i.e. whether there is information that will be lost no matter what we do. I'd be interested in your thoughts.


Motivator

hard answer 🙂

Motivator

I think the solution you have in mind is the way to go.

Motivator

easy answer