I am trying to eventually get to the point where I can add this to props.conf, but I am trying out the searches in Splunk first to make sure they work. I was following this example but it wasn't working for me, so I backed up a bit and simplified it.
If I run this search, it works and converts all instances of abc to def....
| rex field=query mode=sed "s/abc/def/"
However, when I do this, it doesn't throw an error but doesn't convert anything, all abc's are still present in the fields..
| rex mode=sed "s/abc/def/"
Been driving me nuts trying to figure out why.
What I am trying to do is convert MS DNS Logs to readable text. I understand that there is probably an app for this but want to do it manually
The input data is (3)www(6)google(3)com(0) and I want to change it to www.google.com
I had this working fine -
| rex field=query mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
It takes all the (#) and converts it to a . and then goes through and removes the first and last .'s
So I am trying to convert this to a sed command to do this at index time but can't get it to work. I simplified what I was doing into examples that show the same behavior.
OK, now this makes sense. Your actual regex is not simply s/abc/def/, but something like s/^abc/def/. In regex, "^" and "$" are anchors that match positions rather than actual characters. Whereas "abc" is anchored at the beginning of the field "query", it may not be, and often is not, anchored at the beginning of _raw.
Suppose your raw event is
blah blahsomething query="(3)www(6)google(3)com(0)" morestuff
Splunk will give you
_raw | query |
blah blahsomething query="(3)www(6)google(3)com(0)" morestuff | (3)www(6)google(3)com(0) |
In this case,
| rex field=query mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
will give you
_raw | query |
blah blahsomething query="(3)www(6)google(3)com(0)" morestuff | www.google.com |
but
| rex mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
gives
_raw | query |
blah blahsomething query=".www.google.com." morestuff | (3)www(6)google(3)com(0) |
Does this sound right?
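One way to check this without real data is a synthetic event. This is only a sketch: the event text is the made-up example above, and query is assigned with eval to stand in for whatever search-time extraction populates it in the real sourcetype.

```spl
| makeresults
| eval _raw="blah blahsomething query=\"(3)www(6)google(3)com(0)\" morestuff"
| eval query="(3)www(6)google(3)com(0)" ``` assumption: stands in for the search-time extraction ```
| rex field=query mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
| rex mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
| table _raw query
```

With both commands applied, query should come out as www.google.com while _raw keeps the dots just inside the quotation marks, because ^ and $ only match the very start and end of whichever field is being edited.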
In such cases, you will need to find other ways to anchor your replacements in regex. In the above example, "query" in the raw event is bounded by quotation marks. So, you can use the quotation marks as anchors, i.e.,
| rex mode=sed "s/\(.*?\)/./g s/\"\.+(\s+)?/\"/ s/\.\"/\"/"
Of course, depending on actual raw events, /\(.*?\)/ could be way too broad, and quotation marks could be used in other fields that may legitimately begin or end with a dot. So, this might be a safer choice:
| rex mode=sed "s/\"\(\d+\){1,}(\s+)?/\"/ s/\(\d+\)\"/\"/ s/\(\d+\)/./g"
When I try the two samples provided;
| rex mode=sed "s/\(.*?\)/./g s/\"\.+(\s+)?/\"/ s/\.\"/\"/"
and
| rex mode=sed "s/\"\(\d+\){1,}(\s+)?/\"/ s/\(\d+\)\"/\"/ s/\(\d+\)/./g"
They run without error but don't actually modify the output. Similar to what I was seeing earlier.
I really appreciate your help with this
Can you share more of raw data than just (3)www(6)google(3)com(0)?
Here are a few more examples;
(3)www(6)google(2)ca(0)
(7)outlook(9)office365(3)com(0)
(7)updates(4)asdf(3)com(0)
(4)test(4)test(3)com(0)
This is not what I meant by more details of raw events, because all of these can pass the original regex. I want to see what surrounds the query string in the RAW events, not just the query field. In other words, it is critical to know the boundary before the first "." and after the last ".". Without knowing that, volunteers are just wasting time speculating.
It is impossible that an entire raw event only contains a single string "(7)updates(4)asdf(3)com(0)". (Otherwise your original regex should have succeeded.) Is this correct?
Also, you said somewhere earlier that you want to do this "on indexing". So what's the real issue here?
Ok, I am an idiot and apologize, I am building my experience in Splunk still. I was outputting the results to a table but when I went to look at the raw data I see that the following is actually working!
index=wineventlog eventtype="msad-dns-debuglog"
| rex mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
I am getting .www.google.com in the raw data which is a lot closer than I thought I was. I am unsure why I am still getting that leading dot, but this is something.
You are right, I want to catch this at index time, but I wanted to verify that my sed logic was accurate before I did that.
You are still not illustrating what is in the raw event. This result only suggests that
If there is some guarantee that 1 is always true in eventtype msad-dns-debuglog, it would be fine to anchor your regex against $. But you have to show us what that leading anchor can possibly be. By the way, eliminating \. AFTER substitution, whether leading or trailing, is a very risky strategy because you could easily be altering parts of the raw string you don't want to alter. It is much safer to be explicit about those "(3)", etc.
If you want to be as generic as possible but minimize the risk of undesirable alterations, this is perhaps the best approach:
| rex mode=sed "s/(\W+)\(\d+\)/\1/ s/\(\d+\)$// s/\(\d+\)(\W)/\1/ s/(\w)\(\d+\)(\w)/\1.\2/g"
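A quick way to try this expression is against a synthetic event. The event text below is the hypothetical example from earlier in the thread, not real DNS debug log output, so treat it as a sketch rather than a verified test.

```spl
| makeresults
| eval _raw="blah blahsomething query=\"(3)www(6)google(3)com(0)\" morestuff" ``` hypothetical event, not real log data ```
| rex mode=sed "s/(\W+)\(\d+\)/\1/ s/\(\d+\)$// s/\(\d+\)(\W)/\1/ s/(\w)\(\d+\)(\w)/\1.\2/g"
| table _raw
```

Reading the four expressions in order: the first drops the leading (3) that follows the opening quotation mark, the second finds nothing because this event does not end with a label count, the third drops the (0) in front of the closing quotation mark, and the fourth turns the remaining (6) and (3) between word characters into dots, which should leave query="www.google.com". Only the last expression carries the g flag, so an event containing more than one encoded name would need further testing.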
If you expect the rex command to substitute one string for another in raw event and thus make Splunk extract all the field values from an event modified that way - it won't work. Why should it?
Splunk extracts fields automatically as needed at the beginning of the pipeline. When you modify the _raw field it's just a field - yes, it's a default field for many commands but it's just a field. So you might modify _raw with rex or any other command but it won't change the extracted fields.
By analogy, if you do
index=whatever
| fields *
| eval _raw=""
You should expect to see all your original fields extracted even though at some point you've overwritten the _raw field with empty string.
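The same point can be shown with a synthetic event. In this sketch, query is assigned with eval to stand in for a field that was already extracted, and the values are placeholders.

```spl
| makeresults
| eval _raw="query=abc", query="abc" ``` assumption: query stands in for an already-extracted field ```
| rex mode=sed "s/abc/def/g"
| table _raw query
```

Here _raw should become query=def while the query field still holds abc, since the sed expression only rewrites the _raw field.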
This means that the data that populates the field "query" at search time is absent from _raw events. For example, "query" could come from an automatic lookup. Or it could be a calculated field. And so on.
This test can help you diagnose:
| where match(_raw, "abc")
If this returns any event, and the rex mode=sed command still doesn't take effect, you have discovered a bug.
Another useful test would be
| rex field=query mode=sed "s/abc/def/" ``` you indicate that this successfully changes abc in query to def ```
| where match(_raw, "abc")
This is the same expectation: you should get no event because the prior sed doesn't change _raw field.
Would this count as a calculated field? This is all I see in props.conf currently for this particular field.
FIELDALIAS-query = questionname AS query
That is a field alias, not a calculated field. Based on this information, I assume that questionname is in raw events. Do you see any event with questionname and "abc"? I understand the need to anonymize data. But you need to describe your data characteristics accurately. What is the data format? Key-value pair? JSON? XML? Freehand? Given a snippet of raw event, how is Splunk supposed to know how to populate questionname?
Also, does the test query return any events?
Hi @secphilomath1
Without seeing the original event it is hard to know for certain, but I suspect that you simply need to add the global (g) flag to the sed command. Without it, only the first match will be replaced.
For example...
| makeresults
| eval _raw="dummy event: abc query=abc"
| rex mode=sed "s/abc/def/"
Result: "dummy event: def query=abc"
| makeresults
| eval _raw="dummy event: abc query=abc"
| rex mode=sed "s/abc/def/g"
Result: "dummy event: def query=def"
Ok, using the original data, here is a result that works.....
| makeresults
| eval _raw="(3)www(6)google(2)ca(0)"
| rex mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
I get
www.google.ca
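To run the same expression over several names at once, something like the following could be used. It is a sketch that builds the test values from the sample names posted earlier with split and mvexpand; the semicolon delimiter is just a convenience chosen for this test.

```spl
| makeresults
| eval query=split("(3)www(6)google(2)ca(0);(7)outlook(9)office365(3)com(0);(7)updates(4)asdf(3)com(0);(4)test(4)test(3)com(0)", ";") ``` sample names from this thread; ";" is an arbitrary delimiter ```
| mvexpand query
| rex field=query mode=sed "s/\(.*?\)/./g s/^\.+(\s+)?// s/\.$//"
| table query
```

If the expression behaves as it did in the single-value test, the four rows should read www.google.ca, outlook.office365.com, updates.asdf.com and test.test.com. This only confirms the logic on a bare field value; it says nothing about how the anchors behave against a full raw event.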