Splunk Search

How to index Oracle audit trails stored in .aud files?

f_luciani
Path Finder

Hi,

I have a request from a client to index the .aud files generated by Oracle. I have been searching Splunk Answers but for all questions I read the answers point out to syslog audit. The client specified the .aud files so I cannot use Oracle Audit Trail app. I have also tried to manually extract the fields to no avail, main reason being the files have a header that messes up with the interpretation of the fields, plus not all field values come in the same format, for instance, LENGTH values come in double quotes and single quotes, which confuses Splunk. I found a partial solution in this question:

http://answers.splunk.com/answers/59094/how-to-customize-the-oracle-audit-trail-app-to-apply-custom-...

Problem is, apart from the header of the file, it seems there are two audit event formats:

  • Format 1 - line breaks after event's values

    Mon Jun 30 17:55:59 2014 -03:00
    LENGTH : '152'
    ACTION :[7] 'CONNECT'
    DATABASE USER:[1] '/'
    PRIVILEGE :[6] 'SYSDBA'
    CLIENT USER:[6] 'oracle'
    CLIENT TERMINAL:[0] ''
    STATUS:[1] '0'
    DBID:[8] '33405459'

  • Format 2 - single line for all events

    Sun Jun 29 09:51:00 2014 -03:00
    LENGTH: "369"
    SESSIONID:[5] "61298" ENTRYID:[1] "1" STATEMENT:[1] "1" USERID:[23] "EDITED TO PRESERVE PRIVACY" USERHOST:[19] "EDITED TO PRESERVE PRIVACY" TERMINAL:[7] "SERVINT" ACTION:[3] "100" RETURNCODE:[1] "0" COMMENT$TEXT:[97] "Authenticated by: DATABASE; Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=X.X.X.X)(PORT=64423))" OS$USERID:[7] "iadabas" DBID:[8] "33405459" PRIV$USED:[1] "5"
    As I am not an regex expert, the format #2 is giving me a headache. How can I get rid of the header on each file? And how can I extract the the key/value pairs from format #2? And finally how should I deal with LENGTH values that are enclosed in single or double quotes without creating one regex for each?

1 Solution

chanfoli
Builder

I see a few problems with the input and will offer suggestions for each of them.

  1. For the headers that you want to get rid of -- I would suggest applying a null-queue transform as described here: http://answers.splunk.com/answers/29396/help-sending-header-and-footer-of-csv-to-nullqueue.html - you will need to come up with a good regex to match the header but that should be pretty easy.
  2. Input has 2 event formats - I would consider using another transform to split the events into 2 sourcetypes where you apply distinct field extractions. The documentation here ( http://docs.splunk.com/Documentation/Splunk/6.1.4/Data/Advancedsourcetypeoverrides ) should get you started on that. It looks to me like the single line events in format 2 would be an easier override regex to write so I would probably start there and let the rest of the lines pass through as the configured sourcetype.
  3. Format 2 is missing timestamps. This may not be an issue to you but it is unfortunate. With these mixed in with events bounded by timestamps it is likely guessing at timestamps for format2 events. What will end up happening with the suggestions above is that without any time reference for format2 events, indexing time will be applied. Again this is unfortunate, but splunk has no way to extract the true time of such events regardless of what you do, unless the application can be configured to include a timestamp.

Hope this helps,
chanfoli

View solution in original post

altink
Builder

The two formats are two different Oracle audits:

Format 1
Administrative SYSDBA audit - always in OS when *Nix, always in Event Viewer when *Win.
When initialization parameter audit_sys_operations=TRUE you will get all SYSDBAs Top-SQL commands, other then normal administrative audits, like instance startup-shutdown and others

Format 2
Standard audit. These can go in OS (and is this case) when initialization parameter audit_trail=OS

0 Karma

chanfoli
Builder

Given the new information about both formats being multi-line events, I might take a different approach of not splitting the events and just writing appropriate regexes to extract the fields. Some of them seem like they could be single regexes since the fieldnames precede the field values with the len in brackets and only quote characters being different, this could be accomplished in the regex extraction with the alternation operator (|).

I did a quick test by pasting a few of your example events together and used the following props (I did not have an example header to drop so I skipped that part):

[db-audit-test]
MAX_TIMESTAMP_LOOKAHEAD=35
NO_BINARY_CHECK=1 TIME_FORMAT=%a %b %d
%H:%M:%S %Y

After building extractions in the interactive field extractor, I eventually used the following search with rex to test the extraction of LENGTH and ACTION:

index=main sourcetype="db-audit-test"
| head 10000 | rex "(?i)LENGTH.?:
('|\")(?P\d+)(?=(\"|'))" |
rex
"(?i)ACTION\s
:\[\d+\]\s*(\"|')(?P\w*)(\"|')"
|table LENGTH ACTION

The actual regex displayed in the IFE for ACTION after I built it to extract for both formats was like this:

(?i)ACTION\s*:[\d+]\s*("|')(?P\w*)("|')

f_luciani
Path Finder

The header for both formats is as follow:

Audit file /path_to_audit_files/oraprod_dm00_11796784_20140626182611606394143795.aud
Oracle Database 11g Release 11.2.0.4.0 - 64bit Production
ORACLE_HOME = /oracle/11.2
System name:    AIX
Node name:  node_name_here
Release:    1
Version:    7
Machine:     aabbccddeeff
Instance name: instance_name_here
Redo thread mounted by this instance: 1
Oracle process number: 40
Unix process pid: 11796784, image: oracle@node_name_here (DM00)

All events (including the header) are set apart by a blank line, in the above example, after the Unix... line, there is a blank line, then the events follow, beginning with a timestamp-lenght-event data, another blank line, another block of events, and so on.

0 Karma

chanfoli
Builder

Looks to me like all of these lines could be sent to the null queue with simple regex considering that all the lines start with words which do not start lines in any events. You could either set up a filter to discard all lines which start with the words Audit, Oracle, ORACLE, System, Node, Release, Version, Machine, Instance, Redo or Unix OR you could discard everything which does not have either 3 colons (:) or any number of single or double quotes.

0 Karma

f_luciani
Path Finder

I've tried, based on the links you posted in your first answer, could not check straight away due to a license problem; today I was able to reset my license and test it, did not seem to work. Here is an extract of my props.conf and transforms.conf:

props.conf

[oracle_audit]
SHOULD_LINEMERGE = False
LINE_BREAKER = ([\r\n]+)\w{3}\s+\w{3}\s+\d+\s+\d+\:\d+\:\d+\s+\d{4}
KV_MODE = None
...
TRANSFORMS-Audit = eliminate-audit
TRANSFORMS-Oracle-Database = eliminate-oracle-database
...

transforms.conf

[eliminate-audit]
REGEX=^Audit;.*$
DEST_KEY=queue
FORMAT=nullQueue

[eliminate-oracle-database]
REGEX=^Oracle Database;.*$
DEST_KEY=queue
FORMAT=nullQueue

I've tried to understand the example in the link, then adapt it to my case. As the header is still showing up in search, it seems I did not understand it...

0 Karma

chanfoli
Builder

What is the intent of the semicolons in your regexes above?

0 Karma

f_luciani
Path Finder

I was not sure, then I adapted the example and left it there as it was. I have been running tests with regex online testers and found out the semicolon has no use there, thus I got rid of it. I also changed the regex to:

REGEX=^Oracle.*$

...meaning "a line beginning with ( ^ ) 'Oracle' plus whatever comes after it (any char any length - .*) till the end of the line ( $ )". But it is still not working. More tweaking needed here.
On the other hand, I found out how to insert the "|' pair in transforms.conf, it should be in the format:

[\'|\"]

This way, both LENGTH and ACTION fields are being correctly extracted.

0 Karma

f_luciani
Path Finder

Hi, chanfoli,

Regarding format 1, I was able to use props.conf and transforms.conf, following some models I gathered from other answers, to extract most of the key:value pairs. Here is an example of what I am using now:

  1. in props.conf, new stanza oracle_audit created

    [oracle_audit]
    SHOULD_LINEMERGE = False
    LINE_BREAKER = ([\r\n]+)\w{3}\s+\w{3}\s+\d+\s+\d+:\d+:\d+\s+\d{4}
    KV_MODE = None
    REPORT-ACTION_for_oracle = ACTION_for_oracle
    ...

  2. in transforms.conf, one of the rex I am using to extract the pairs from format 1

    [ACTION_for_oracle]
    REGEX = ACTION[\s+]?:[\d+]\s+\'(.*?)\'([\r\n]+)(DATABASE\s+USER)
    FORMAT = ACTION::"$1"

I couldn't help but noticing the regex you came by is simpler than mine, thus I'll give yours a try in order to see if it does the same and then replace mine by yours in case the answer is positive.
Nevertheless, the ("|') solution I will use for sure, since it solves my single quote|double quote problem.

0 Karma

chanfoli
Builder

I see a few problems with the input and will offer suggestions for each of them.

  1. For the headers that you want to get rid of -- I would suggest applying a null-queue transform as described here: http://answers.splunk.com/answers/29396/help-sending-header-and-footer-of-csv-to-nullqueue.html - you will need to come up with a good regex to match the header but that should be pretty easy.
  2. Input has 2 event formats - I would consider using another transform to split the events into 2 sourcetypes where you apply distinct field extractions. The documentation here ( http://docs.splunk.com/Documentation/Splunk/6.1.4/Data/Advancedsourcetypeoverrides ) should get you started on that. It looks to me like the single line events in format 2 would be an easier override regex to write so I would probably start there and let the rest of the lines pass through as the configured sourcetype.
  3. Format 2 is missing timestamps. This may not be an issue to you but it is unfortunate. With these mixed in with events bounded by timestamps it is likely guessing at timestamps for format2 events. What will end up happening with the suggestions above is that without any time reference for format2 events, indexing time will be applied. Again this is unfortunate, but splunk has no way to extract the true time of such events regardless of what you do, unless the application can be configured to include a timestamp.

Hope this helps,
chanfoli

f_luciani
Path Finder

I was able to, using the info from the links you sent, extract all necessary key:value pairs, after testing each rex in http://www.regexr.com/. Only thing is, now the client wants NOT to ditch the header but quite de opposite, store it somewhere and associate it to each event in the file it came from, in order to have the events fully identified. The reason is the fact that, since they have many servers/instances/databases, they want to know from which server/instance/database each even came. I do understand the "why", but don't have the "how" yet. More tweaking.

0 Karma

f_luciani
Path Finder

Edited format 2 in question and added the missing timestamp

0 Karma

f_luciani
Path Finder

Hi, chanfoli,

Many thanks for the prompt answers! I'll give it a go and let you know as soon as I make it work. Regarding format 2, my bad, it does have timestamp indeed, I just copy-pasted it wrongly, so no issues there, the timestamp is the same as the one in format 1 😉
Thanks again!

0 Karma
Get Updates on the Splunk Community!

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...

State of Splunk Careers 2024: Maximizing Career Outcomes and the Continued Value of ...

For the past four years, Splunk has partnered with Enterprise Strategy Group to conduct a survey that gauges ...

Data-Driven Success: Splunk & Financial Services

Splunk streamlines the process of extracting insights from large volumes of data. In this fast-paced world, ...