I'm trying to index an XML file that has many lines at the beginning that I do not want or need indexed. I've worked out a regex in RegExr (an external online regex testing site) that selects all the unwanted lines, but when I bring the file into Splunk the lines are still indexed. Below are my transforms.conf and props.conf.
props.conf
[sourcetype]
TRANSFORMS-sourcetype_junk = sourcetype_junk
BREAK_ONLY_BEFORE = \<ReportHost
DATETIME_CONFIG = CURRENT
MAX_TIMESTAMP_LOOKAHEAD = 0
SHOULD_LINEMERGE = true
TRUNCATE = 0
transforms.conf
[sourcetype_junk]
LOOKAHEAD = 100000
DEST_KEY = queue
REGEX = ^((.|\n|\r)*)\<\/Policy\>
FORMAT = nullQueue
Any ideas how to accomplish this?
Example: everything from the beginning of the file to the end of Policy is not needed. There are quite a few more lines than what is shown below:
<?xml version="1.0" ?>
<NessusClientData_v2>
<Policy>
<FamilyItem>
<FamilyName>CentOS Local Security Checks</FamilyName>
<Status>enabled</Status>
</FamilyItem>
<FamilyItem>
<FamilyName>AIX Local Security Checks</FamilyName>
<Status>enabled</Status>
</FamilyItem>
<FamilyItem>
<FamilyName>CISCO</FamilyName>
<Status>enabled</Status>
</FamilyItem>
<FamilyItem><FamilyName>Junos Local Security Checks</FamilyName>
<Status>enabled</Status>
</FamilyItem>
</FamilySelection>
<IndividualPluginSelection>
<PluginItem><PluginId>34220</PluginId>
<PluginName>Netstat Portscanner (WMI)</PluginName>
<Family>Port scanners</Family>
<Status>enabled</Status>
</PluginItem><PluginItem><PluginId>14274</PluginId>
<PluginName>Nessus SNMP Scanner</PluginName>
<Family>Port scanners</Family>
<Status>enabled</Status>
</PluginItem><PluginItem><PluginId>14272</PluginId>
<PluginName>netstat portscanner (SSH)</PluginName>
<Family>Port scanners</Family>
<Status>enabled</Status>
</PluginItem><PluginItem><PluginId>10180</PluginId>
<PluginName>Ping the remote host</PluginName>
<Family>Port scanners</Family>
<Status>enabled</Status>
</PluginItem><PluginItem><PluginId>11219</PluginId>
<PluginName>Nessus SYN scanner</PluginName>
<Family>Port scanners</Family>
<Status>enabled</Status>
</PluginItem></IndividualPluginSelection>
</Policy>
<Report name="ScanNumber2" xmlns:cm="http://www.nessus.org/cm">
<ReportHost name="192.168.1.100"><HostProperties>
<tag name="HOST_END">Sat Feb 25 09:31:53 2012</tag>
<tag name="system-type">general-purpose</tag>
<tag name="operating-system">Microsoft Windows Server 2003 Service Pack 2</tag>
<tag name="mac-address">00:0c:29:2e:7c:68</tag>
<tag name="host-ip">192.168.1.100</tag>
<tag name="host-fqdn">system32.localdomain.com</tag>
<tag name="netbios-name">SYSTEM32</tag>
<tag name="HOST_START">Sat Feb 25 09:20:12 2012</tag>
</HostProperties>
Thanks in advance,
Joe
WORKING Configurations
props.conf
MAX_EVENTS = 210000
[sourcetype]
TRANSFORMS-sourcetype_junk = sourcetype_junk
BREAK_ONLY_BEFORE = (?m)\<ReportHost\sname
DATETIME_CONFIG = CURRENT
MAX_TIMESTAMP_LOOKAHEAD = 0
SHOULD_LINEMERGE = true
TRUNCATE = 0
BREAK_ONLY_BEFORE_DATE = false
transforms.conf
[sourcetype_junk]
LOOKAHEAD = 10000
DEST_KEY = queue
REGEX = (?m)(^\<\?\bxml.*)
FORMAT = nullQueue
Because each event contains so many lines, flashtimeline.xml needed an override so that the EventsViewer module displays a larger number of lines.
Another thank you to MarioM for his assistance with the nullQueue problem.
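For anyone wondering why MAX_EVENTS matters here, below is a rough Python sketch, not Splunk itself. It assumes that line merging caps an event at MAX_EVENTS lines (256 by default, per the props.conf spec) and that a nullQueue transform discards only the events its REGEX matches. The 1700-line preamble is made-up filler, and `merge_lines` is a hypothetical helper that ignores BREAK_ONLY_BEFORE, because the preamble contains no ReportHost line.

```python
import re

# Same pattern as REGEX in the working transforms.conf.
JUNK = re.compile(r"(?m)(^\<\?\bxml.*)")

def merge_lines(lines, max_events):
    """Group lines into events of at most max_events lines each."""
    return ["\n".join(lines[i:i + max_events])
            for i in range(0, len(lines), max_events)]

# Hypothetical preamble: the XML declaration plus 1699 filler lines.
preamble = ['<?xml version="1.0" ?>'] + ["<Status>enabled</Status>"] * 1699

small_cap = merge_lines(preamble, 256)      # default cap
large_cap = merge_lines(preamble, 210000)   # cap from the working props.conf

# Events the transform would match and therefore drop.
dropped_small = [e for e in small_cap if JUNK.search(e)]
dropped_large = [e for e in large_cap if JUNK.search(e)]

print(len(small_cap), len(dropped_small))
print(len(large_cap), len(dropped_large))
```

Under these assumptions the small cap splits the preamble into several events and only the one containing the `<?xml` line is dropped, while the large cap keeps the preamble in a single event that is dropped whole.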
Did you try with (?m) in front of your regex?
(?m)^((.|\n|\r)*)\<\/Policy\>
Also, any nullQueue transform requires a Splunk restart before it takes effect.
If it is still not working, it would be useful to paste here the part of your XML that you want to filter.
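As a quick illustration of what (?m) changes, here is a small Python sketch. Python's `re` module is used as a stand-in, on the assumption that Splunk's PCRE engine treats the `^` anchor the same way; the sample string is a shortened, made-up version of the file.

```python
import re

sample = (
    '<?xml version="1.0" ?>\n'
    "<Policy>\n"
    "<Status>enabled</Status>\n"
    "</Policy>\n"
    '<Report name="ScanNumber2">\n'
)

# Without (?m), ^ anchors only at the very start of the text.
plain_hit = re.search(r"^<Policy>", sample) is not None

# With (?m), ^ also anchors at the start of every line.
multi_hit = re.search(r"(?m)^<Policy>", sample) is not None

# The original pattern crosses line breaks through (.|\n|\r)*,
# so it spans lines whether or not (?m) is present.
span_hit = re.search(r"^((.|\n|\r)*)</Policy>", sample)

print(plain_hit, multi_hit, span_hit is not None)
```

So (?m) only affects where `^` and `$` can anchor. Whether the pattern sees all the lines at once still depends on how the events were broken before the transform runs.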
UPDATE
and with this regex:
(?m)((.*(\r*))+?\<\/Policy\>$) - **NOT WORKING**
UPDATE 2:
With the confs below I got it filtered out:
props.conf:
[test_xml]
TRANSFORMS-sourcetype_junk=sourcetype_junk
BREAK_ONLY_BEFORE_DATE=false
BREAK_ONLY_BEFORE=(?m)\<ReportHost\sname
SHOULD_LINEMERGE=true
TRUNCATE=0
transforms.conf:
[sourcetype_junk]
LOOKAHEAD = 10000
DEST_KEY = queue
REGEX = (?m)(^\<\?\bxml.*)
FORMAT = nullQueue
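To show the idea behind these confs, here is a minimal Python sketch of the two steps: break events before each ReportHost line, then drop any event the junk regex matches. It is only an approximation of Splunk's pipeline. `split_events` and `drop_junk` are hypothetical helpers, and the second host (192.168.1.101) is invented for the example.

```python
import re

BREAK_BEFORE = re.compile(r"(?m)\<ReportHost\sname")  # BREAK_ONLY_BEFORE
JUNK = re.compile(r"(?m)(^\<\?\bxml.*)")              # REGEX in transforms.conf

def split_events(text):
    """Start a new event at every line that matches BREAK_BEFORE."""
    events, current = [], []
    for line in text.splitlines():
        if BREAK_BEFORE.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

def drop_junk(events):
    """Discard every event JUNK matches, like routing it to nullQueue."""
    return [e for e in events if not JUNK.search(e)]

sample = """<?xml version="1.0" ?>
<NessusClientData_v2>
<Policy>
<FamilyItem>
</FamilyItem>
</Policy>
<Report name="ScanNumber2" xmlns:cm="http://www.nessus.org/cm">
<ReportHost name="192.168.1.100"><HostProperties>
<tag name="host-ip">192.168.1.100</tag>
</HostProperties>
<ReportHost name="192.168.1.101"><HostProperties>
<tag name="host-ip">192.168.1.101</tag>
</HostProperties>
"""

events = split_events(sample)
kept = drop_junk(events)
print(len(events), len(kept))
```

The regex only has to identify the first event (the one containing the `<?xml` line), because the whole matching event is discarded, including the Policy block and the `<Report ...>` line that end up in it.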
MarioM,
Thank you for your assistance with this. It now indexes the way I wanted: the policy information is gone and the events are split into one event per ReportHost name. I can now continue trying to make this productive. Thanks again. I will update my question with my final props and transforms configurations.
I don't think it will help. For some strange reason it works with my configuration, as per Update 2 in my answer.
MarioM,
If it would help, use the contact me button in my profile and we can work on a screen share so that this can be figured out. There seem to be a few older posts from people looking for the same thing with no solution.
I think it's something to do with your line breaking. I am testing it out...
MarioM,
It looks like I spoke too soon and the file was not indexed when I looked. The section I need excluded is still being indexed.
MarioM,
Thank you very much. You figured it out.
Thanks again!
Yes, I have tried the multiline entry (?m). I will try to sanitize a small sample. Currently with just one entry what needs to be filtered out is over 1700 lines long.