*nix possible bug in rlog.sh script

Engager

Are the symptoms below a bug in readlog.py (used by rlog.sh), or a Python configuration problem on our system?

When readlog.py finds a large number in /tmp/seekposition and the audit.log file has been rotated and is now small, readlog.py does not correctly rewrite the seek-position file (print statements added to readlog.py):

SEEKOLD : 8950488
LOGEND  : 896092
SEEKPOS : 0
LFTELL  : 896092
[root@iapetus bin]# cat /tmp/seekposition
8960928
[root@iapetus bin]

The sf.write call writes six characters (896092) to the file but does not remove the old seventh character, incorrectly leaving the contents as 8960928.
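The symptom can be reproduced stand-alone, outside of readlog.py. This sketch (file path and values are just the ones from the output above) overwrites a 7-digit seek position with a 6-digit one without truncating:

```python
import os
import tempfile

# Hypothetical stand-alone reproduction of the symptom (not the real
# readlog.py): overwrite a 7-digit seek position with a 6-digit one
# without truncating the file afterward.
path = os.path.join(tempfile.mkdtemp(), "seekposition")

with open(path, "w") as sf:
    sf.write("8950488")        # old 7-character seek position

with open(path, "r+") as sf:   # reopen for update; write starts at offset 0
    sf.write("896092")         # new 6-character position
    # no sf.truncate(): the old 7th character survives

with open(path) as sf:
    print(sf.read())           # prints 8960928, not 896092
```

The write only replaces the first six bytes; the stale trailing byte is exactly the corrupted 8960928 shown above.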

1 Solution

Super Champion

Yeah, this is a bug, and I reported it as such to Splunk support (case # 26091, SPL-20135) on March 11, 2009. I also included a patch (essentially a one-line change) to fix the Python script. The script has been updated since then, but it appears the fundamentally flawed logic is still being shipped.

I've long since replaced this script on my systems, so I haven't been paying much attention to whether they've fixed it in the released version. (I figure I did my part by reporting the issue and even providing the one-line fix.) So for the sake of others feeling this pain, here's the fix I came up with.

Side rant: the other thing that bugs me about this script is that errors are written to stdout instead of stderr, where they should go, so you end up with script errors in your "audit" sourcetype. Sorry, that's off topic, but it's a pet peeve of mine. Back to the issue at hand.

To correct this, I recommend simply changing this section of code:

# save last read position to file
sf.write("%d" % lf.tell())

To this:

# save last read position to file
sf.write("%d" % lf.tell())
sf.truncate()

In my mind, truncating the file (trimming off that seventh digit in your example) should also minimize the risk of any kind of race condition: if the script dies or aborts midway through reading the new input, you end up re-loading the whole file rather than just a smaller chunk, so the odds of anything like that causing real damage are pretty slim. There is still a really small window between sf.write() and sf.truncate(), and if the script crashes between those two calls you still have an issue; but that risk is fairly small.
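If you want to close even that small write/truncate window, the usual trick is a temp-file-and-rename write, since a rename is atomic on POSIX filesystems: readers see either the old complete value or the new one, never a partial or stale-suffixed write. A sketch in modern Python (os.replace() postdates the stock script; the helper name is hypothetical):

```python
import os

def save_seek_position(path, pos):
    """Write the seek position via a temp file and an atomic rename,
    so a crash mid-write can never leave a corrupted position file.
    (Sketch only; not part of the shipped readlog.py.)"""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write("%d" % pos)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes hit disk first
    os.replace(tmp, path)      # atomic rename on POSIX

# usage, in place of sf.write():
#   save_seek_position(seek_file, lf.tell())
```

The write+truncate fix above is simpler and good enough in practice; this is just the belt-and-suspenders variant.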

Of course, I also recommend copying this script into your own custom app, since any in-place modifications will be overwritten during the next upgrade of the "unix" app (i.e., your next Splunk upgrade).

Of course, you still have the highly probable situation that whenever your audit log rotates you will most likely miss the end of it, because "monitor"-ing files with an input script like this is nowhere near as robust as Splunk's default file-monitoring process. This script makes no attempt to read the remaining portion of your previous log file (audit.log.0, or whatever it's called on your system). You probably already picked up on this, but it's worth pointing out. It's also possible to rotate through an entire log file in under the default 60-second polling period, depending on your volume. (I would really like to rewrite this entire script from scratch, because it has a lot of holes in it, but that will have to wait. Generally speaking, it does a "good enough" job.)

Since I'm already complaining, I should also point out that I don't like the default state-file location. Some OSes (like Ubuntu) wipe out /tmp on each restart, so your state file gets deleted on every reboot, leading to the entire audit file being reindexed. Also, /tmp is a public location that other users can access. Instead, I suggest keeping your state file in $SPLUNK_HOME/var/run/, which seems like a much more logical place to store it. For reference, here is the rlog.sh script I use; it calls the Python script and ausearch only if the audit file is newer than your state file (in other words, only if you have new audit events in your log file). Here it is:

#!/bin/sh
SEEK_FILE=$SPLUNK_HOME/var/run/unix_auditd_seekposition
AUDIT_LOG=/var/log/audit/audit.log

if [ \! -f "$SEEK_FILE" -o "$AUDIT_LOG" -nt "$SEEK_FILE" ]
then
    $SPLUNK_HOME/etc/apps/mfps_linux_extras/bin/readlog.py -s "$SEEK_FILE" -f "$AUDIT_LOG" | /sbin/ausearch -i 2>/dev/null | /bin/grep -v -- ---- 2>/dev/null
fi
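For anyone who would rather keep that freshness check inside the Python script instead of the shell wrapper, the `-nt` test above amounts to a simple mtime comparison. A sketch (the function name is hypothetical, not part of readlog.py):

```python
import os

def has_new_events(audit_log, seek_file):
    """Mirror the shell test [ ! -f SEEK -o AUDIT -nt SEEK ]:
    run only if the seek file is missing or the audit log has been
    modified more recently than it."""
    if not os.path.exists(seek_file):
        return True
    return os.path.getmtime(audit_log) > os.path.getmtime(seek_file)
```

Same semantics, just expressed in Python so it could live next to the seek-position logic.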


Based on some conversations with people in the know at Splunk .conf2010, it sounds like the Windows and Unix apps are going to receive a much-needed overhaul in an upcoming release, which is good, because I've run into lots of little no-way-this-was-ever-tested issues. This stuff drives me nuts. I've been assured that the future quality of these apps will be much better.


Path Finder

I'm having the same issue on RHEL 4. Does anyone, or Splunk support, have an alternative solution?


Splunk Employee

It seems that the workaround is working for RHEL5 but not for RHEL4.



Splunk Employee

Thank you for the comments. The "quality" of the UNIX app is a function of how many edge and corner cases we are aware of. So, please keep reporting any issues you find! Be sure to specify your UNIX flavor and version.


Engager

Many thanks for the helpful answer and comments.
