Reducing raw data size in Splunk

DanAlexander
Communicator

Hi Community,

I'm trying to build a regex that reduces the size of the events for one EventCode, in my case 4627.

The idea is to use props and transforms:

props.conf

[XmlWinEventLog:Security]

TRANSFORMS-reduce_raw = reduce_event_raw


transforms.conf

[reduce_event_raw]
REGEX = <Event[^>]*>.*?<System>.*?<Provider\s+Name='(?<ProviderName>[^']*)'\s+Guid='(?<ProviderGuid>[^']*)'.*?<EventID>(?<EventID>\d+)</EventID>.*?<Version>(?<Version>\d+)</Version>.*?<Level>(?<Level>\d+)</Level>.*?<Task>(?<Task>\d+)</Task>.*?<Opcode>(?<Opcode>\d+)</Opcode>.*?<Keywords>(?<Keywords>[^<]*)</Keywords>.*?<TimeCreated\s+SystemTime='(?<SystemTime>[^']*)'.*?<EventRecordID>(?<EventRecordID>\d+)</EventRecordID>.*?<Correlation\s+ActivityID='(?<ActivityID>[^']*)'.*?<Execution\s+ProcessID='(?<ProcessID>\d+)'\s+ThreadID='(?<ThreadID>\d+)'.*?<Channel>(?<Channel>[^<]*)</Channel>.*?<Computer>(?<Computer>[^<]*)</Computer>.*?<EventData>.*?<Data\s+Name='SubjectUserSid'>(?<SubjectUserSid>[^<]*)</Data>.*?<Data\s+Name='SubjectUserName'>(?<SubjectUserName>[^<]*)</Data>.*?<Data\s+Name='SubjectDomainName'>(?<SubjectDomainName>[^<]*)</Data>.*?<Data\s+Name='SubjectLogonId'>(?<SubjectLogonId>[^<]*)</Data>.*?<Data\s+Name='TargetUserSid'>(?<TargetUserSid>[^<]*)</Data>.*?<Data\s+Name='TargetUserName'>(?<TargetUserName>[^<]*)</Data>.*?<Data\s+Name='TargetDomainName'>(?<TargetDomainName>[^<]*)</Data>.*?<Data\s+Name='TargetLogonId'>(?<TargetLogonId>[^<]*)</Data>.*?<Data\s+Name='LogonType'>(?<LogonType>[^<]*)</Data>.*?<Data\s+Name='EventIdx'>(?<EventIdx>[^<]*)</Data>.*?<Data\s+Name='EventCountTotal'>(?<EventCountTotal>[^<]*)</Data>.*?<Data\s+Name='GroupMembership'>(?<GroupMembership>.*?)</Data>.*?</EventData>.*?</Event>

FORMAT = ProviderName::$1 ProviderGuid::$2 EventID::$3 Version::$4 Level::$5 Task::$6 Opcode::$7 Keywords::$8 SystemTime::$9 EventRecordID::$10 ActivityID::$11 ProcessID::$12 ThreadID::$13 Channel::$14 Computer::$15 SubjectUserSid::$16 SubjectUserName::$17 SubjectDomainName::$18 SubjectLogonId::$19 TargetUserSid::$20 TargetUserName::$21 TargetDomainName::$22 TargetLogonId::$23 LogonType::$24 EventIdx::$25 EventCountTotal::$26 GroupMembership::$27

DEST_KEY = _raw

That way I can pick which parts of the raw data get indexed.
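To make the intended effect concrete, here is a rough Python sketch of what a REGEX/FORMAT pair with DEST_KEY = _raw does when it matches: the whole event is replaced by the FORMAT string with each $n filled in from the numbered capture groups. The shortened sample event, the two-field regex, and the helper name apply_format are made up for illustration, and Python's re module stands in for Splunk's PCRE engine.

```python
import re

# Shortened, hypothetical event with the same shape as the 4627 sample.
SAMPLE = (
    "<Event><System><EventID>4627</EventID><Computer>Computer1</Computer></System>"
    "<EventData><Data Name='LogonType'>5</Data></EventData></Event>"
)

# Two-field stand-ins for the REGEX and FORMAT settings in transforms.conf.
REGEX = r"<Computer>([^<]*)</Computer>.*?<Data\s+Name='LogonType'>([^<]*)</Data>"
FORMAT = "Computer::$1 LogonType::$2"


def apply_format(regex, fmt, raw):
    """Return the rewritten event, or the original event if the regex does not match."""
    match = re.search(regex, raw)
    if match is None:
        return raw  # no match: the event is left as it was
    # Replace every $n placeholder with the text of capture group n.
    return re.sub(r"\$(\d+)", lambda p: match.group(int(p.group(1))), fmt)


new_raw = apply_format(REGEX, FORMAT, SAMPLE)
print(new_raw)
```

Note that an event the regex fails to match keeps its full original size, so any mismatch silently defeats the size reduction.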

However, the regex does not seem to pick up the fields correctly.

Here is the raw event:

<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'><System><Provider Name='Microsoft-Windows-Security-Auditing' Guid='{54849625-5478-4994-a5ba-3e3bxxxxxx}'/><EventID>4627</EventID><Version>0</Version><Level>0</Level><Task>12554</Task><Opcode>0</Opcode><Keywords>0x8020000000000000</Keywords><TimeCreated SystemTime='2024-11-27T11:27:45.6695363Z'/><EventRecordID>2177113</EventRecordID><Correlation ActivityID='{01491b93-40a4-0002-6926-4901a440db01}'/><Execution ProcessID='1196' ThreadID='1312'/><Channel>Security</Channel><Computer>Computer1</Computer><Security/></System><EventData><Data Name='SubjectUserSid'>S-1-5-18</Data><Data Name='SubjectUserName'>CXXXXXX</Data><Data Name='SubjectDomainName'>CXXXXXXXX</Data><Data Name='SubjectLogonId'>0x3e7</Data><Data Name='TargetUserSid'>S-1-5-18</Data><Data Name='TargetUserName'>SYSTEM</Data><Data Name='TargetDomainName'>NT AUTHORITY</Data><Data Name='TargetLogonId'>0x3e7</Data><Data Name='LogonType'>5</Data><Data Name='EventIdx'>1</Data><Data Name='EventCountTotal'>1</Data><Data Name='GroupMembership'>
%{S-1-5-32-544}
%{S-1-1-0}
%{S-1-5-11}
%{S-1-16-16384}</Data></EventData></Event

Any help troubleshooting the problem would be much appreciated.

Thank you in advance!

PickleRick
SplunkTrust

Ouch.

1. If you're using numbered capture groups you don't have to name them. (I'm not even sure if index-time extractions support named capture groups).

2. Assuming your regex was right, you'd get key::value pairs in your raw event. Are you sure that's what you want? Also, this will cause "interesting" side effects, since that data would get split into terms at major breakers and would get indexed as indexed fields.

3. Manipulating structured data with regexes is asking for trouble. You have no guarantee that the fields will always be in the same order (and they might not always contain full data). That's why you use structured data format.
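To illustrate point 3, here is a minimal sketch of order-independent extraction with a real XML parser. It uses Python's xml.etree.ElementTree rather than anything Splunk-specific. The shortened sample event has its Data elements deliberately reordered, and the function name extract_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample event in the question.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Shortened, hypothetical event: Data elements are NOT in the usual order.
SAMPLE = (
    "<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>"
    "<System><EventID>4627</EventID><Computer>Computer1</Computer></System>"
    "<EventData>"
    "<Data Name='LogonType'>5</Data>"
    "<Data Name='TargetUserName'>SYSTEM</Data>"
    "<Data Name='SubjectUserName'>CXXXXXX</Data>"
    "</EventData></Event>"
)


def extract_fields(xml_text, wanted):
    """Collect EventID, Computer and the wanted Data values, whatever their order."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name in ("EventID", "Computer"):
        node = root.find(f"e:System/e:{name}", NS)
        if node is not None:
            fields[name] = node.text
    for data in root.findall("e:EventData/e:Data", NS):
        key = data.get("Name")
        if key in wanted:
            # A missing or empty element yields "" instead of breaking the match.
            fields[key] = (data.text or "").strip()
    return fields


fields = extract_fields(SAMPLE, {"SubjectUserName", "TargetUserName", "LogonType"})
print(fields)
```

A single positional regex would fail on this reordered event, while the lookup by Name attribute does not care about element order.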

DanAlexander
Communicator

Hi @PickleRick,

Thank you for your valuable feedback.

  1. Index-Time Extractions: You're right that named capture groups might not be supported at index time. I'll modify my configurations to use numbered capture groups to ensure they function correctly.

  2. Rewriting _raw: I appreciate you highlighting the potential issues with rewriting _raw to contain key-value pairs. My intention was to reduce the size of the events by removing unnecessary data, but I see how this could lead to unintended side effects during indexing. I'll reconsider this approach.

  3. Structured Data Parsing: Your point about the risks of using regex to parse XML is well-taken. Given that XML fields may vary in order and presence, relying on regex could indeed cause problems. Utilizing Splunk's structured data parsing capabilities seems like a better solution.

Next steps:

To achieve my goal of reducing the indexed data volume for EventID=4627 events, I'd like to leverage Splunk's XML parsing features. Specifically, I'm thinking of using INDEXED_EXTRACTIONS = xml and configuring EXCLUDE rules in props.conf to omit the unwanted fields at index time.

Example Configuration BEFORE:

[reduce_event_raw]
REGEX = (?ms)<Event[^>]*>.*?<System>.*?<EventID>4627<\/EventID>.*?<Computer>(?<Computer>[^<]*)<\/Computer>.*?<Data\s+Name='SubjectUserName'>(?<SubjectUserName>[^<]*)<\/Data>.*?<Data\s+Name='TargetUserName'>(?<TargetUserName>[^<]*)<\/Data>.*?<Data\s+Name='LogonType'>(?<LogonType>[^<]*)<\/Data>
FORMAT = Computer::$1 SubjectUserName::$2 TargetUserName::$3 LogonType::$4
DEST_KEY = _raw

Example Configuration AFTER:

[XmlWinEventLog:Security]
INDEXED_EXTRACTIONS = xml
KV_MODE = none
EXCLUDE = (?i)(SubjectUserSid|SubjectDomainName|SubjectLogonId|TargetUserSid|TargetDomainName|TargetLogonId|EventIdx|EventCountTotal|GroupMembership)

Do you think this approach would effectively remove the unnecessary fields before indexing while maintaining reliable field extraction for the essential data? If you have any suggestions or best practices for this method, I'd greatly appreciate your guidance.

Regards,

Dan
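Worth checking against the props.conf spec before relying on the AFTER stanza: as far as the documented values go, INDEXED_EXTRACTIONS accepts CSV, TSV, PSV, W3C, JSON and HEC rather than xml, and EXCLUDE does not appear to be a props.conf setting. One alternative that does exist is SEDCMD-<class> in props.conf, which applies a sed-style substitution to _raw at index time and so can delete unwanted elements while leaving the rest of the XML intact. It would apply to every event of the sourcetype, not only 4627. The Python sketch below only illustrates the substitution idea. The DROP list and the function name strip_data_elements are assumptions, and the pattern would need testing against real events.

```python
import re

# Shortened, hypothetical EventData section with a multi-line GroupMembership value.
SAMPLE = (
    "<EventData><Data Name='TargetUserName'>SYSTEM</Data>"
    "<Data Name='LogonType'>5</Data>"
    "<Data Name='GroupMembership'>\n%{S-1-5-32-544}\n%{S-1-1-0}</Data>"
    "</EventData>"
)

# Assumed list of Data elements to drop; adjust to the fields you do not need.
DROP = ("GroupMembership", "TargetLogonId")


def strip_data_elements(xml_text, names):
    """Delete <Data Name='...'>...</Data> elements whose Name is in names."""
    alternation = "|".join(re.escape(n) for n in names)
    # [^<]* also matches newlines, so multi-line values are removed in one go.
    pattern = rf"<Data\s+Name='(?:{alternation})'>[^<]*</Data>"
    return re.sub(pattern, "", xml_text)


trimmed = strip_data_elements(SAMPLE, DROP)
print(trimmed)
```

Because the remaining text is still valid XML, search-time XML extraction keeps working on the trimmed event.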

ITWhisperer
SplunkTrust

Try something like this:

(?ms)<Event[^>]*>.*?<System>.*?<Provider\s+Name='(?<ProviderName>[^']*)'\s+Guid='(?<ProviderGuid>[^']*)'.*?<EventID>(?<EventID>\d+)<\/EventID>.*?<Version>(?<Version>\d+)<\/Version>.*?<Level>(?<Level>\d+)<\/Level>.*?<Task>(?<Task>\d+)<\/Task>.*?<Opcode>(?<Opcode>\d+)<\/Opcode>.*?<Keywords>(?<Keywords>[^<]*)<\/Keywords>.*?<TimeCreated\s+SystemTime='(?<SystemTime>[^']*)'.*?<EventRecordID>(?<EventRecordID>\d+)<\/EventRecordID>.*?<Correlation\s+ActivityID='(?<ActivityID>[^']*)'.*?<Execution\s+ProcessID='(?<ProcessID>\d+)'\s+ThreadID='(?<ThreadID>\d+)'.*?<Channel>(?<Channel>[^<]*)<\/Channel>.*?<Computer>(?<Computer>[^<]*)<\/Computer>.*?<EventData>.*?<Data\s+Name='SubjectUserSid'>(?<SubjectUserSid>[^<]*)<\/Data>.*?<Data\s+Name='SubjectUserName'>(?<SubjectUserName>[^<]*)<\/Data>.*?<Data\s+Name='SubjectDomainName'>(?<SubjectDomainName>[^<]*)<\/Data>.*?<Data\s+Name='SubjectLogonId'>(?<SubjectLogonId>[^<]*)<\/Data>.*?<Data\s+Name='TargetUserSid'>(?<TargetUserSid>[^<]*)<\/Data>.*?<Data\s+Name='TargetUserName'>(?<TargetUserName>[^<]*)<\/Data>.*?<Data\s+Name='TargetDomainName'>(?<TargetDomainName>[^<]*)<\/Data>.*?<Data\s+Name='TargetLogonId'>(?<TargetLogonId>[^<]*)<\/Data>.*?<Data\s+Name='LogonType'>(?<LogonType>[^<]*)<\/Data>.*?<Data\s+Name='EventIdx'>(?<EventIdx>[^<]*)<\/Data>.*?<Data\s+Name='EventCountTotal'>(?<EventCountTotal>[^<]*)<\/Data>.*?<Data\s+Name='GroupMembership'>(?<GroupMembership>.*?)<\/Data>.*?<\/EventData>.*?<\/Event>

https://regex101.com/r/19eJtB/1
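The change that matters here is most likely the leading (?ms). The GroupMembership value spans several lines, and without the s flag a dot does not match a newline, so the lazy (.*?) can never reach the closing </Data>. The m flag only affects ^ and $, which this pattern does not use. A small Python sketch shows the difference. It uses a shortened, made-up sample and numbered groups, and Python's re module rather than Splunk's PCRE, although the dot/newline behaviour is the same in both.

```python
import re

# Shortened, hypothetical sample: the GroupMembership value spans several lines.
SAMPLE = (
    "<EventData><Data Name='LogonType'>5</Data>"
    "<Data Name='GroupMembership'>\n"
    "%{S-1-5-32-544}\n"
    "%{S-1-1-0}</Data></EventData>"
)

# Two-field excerpt of the long regex, with numbered capture groups only.
PATTERN = (
    r"<Data\s+Name='LogonType'>([^<]*)</Data>.*?"
    r"<Data\s+Name='GroupMembership'>(.*?)</Data>"
)

# Without (?s) the dot stops at the first newline, so the match fails.
without_flag = re.search(PATTERN, SAMPLE)

# With (?s) the dot also matches newlines and the multi-line value is captured.
with_flag = re.search("(?s)" + PATTERN, SAMPLE)

print(without_flag)
print(with_flag.group(1), repr(with_flag.group(2)))
```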

 

DanAlexander
Communicator

Hi @ITWhisperer,

Thank you for your feedback.

The Regex works, but according to @PickleRick I will need to adjust my approach.

Kind regards,

Dan
