Getting Data In

Create field extraction before linebreaks and apply it to broken-out sections after?

willial
Communicator

Let's say I'm doing extractions on a really big file, thousands of lines, that looks like this:

Section1
ID_Number: 12345
lots and lots of text
@@@
Section2
lots and lots of text
@@@
Section3
lots and lots of text

So I've set up props and transforms so that it breaks at the @@@, and each section becomes its own sourcetype -- sourcetype=Section1, sourcetype=Section2, etc.

My question is this: that ID_Number in the first section is important, is there any way to extract it and add it to each section/sourcetype as a field so that it's not stuck only in Section1?
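
For context, the index-time setup described above (breaking at the @@@ and renaming the sourcetype per section) might look roughly like this. This is only a sketch: the stanza name [my_feed] and the transform name are made up, and the exact LINE_BREAKER capture group is untested.

```ini
# props.conf -- [my_feed] stands in for the actual input's stanza;
# the first capture group in LINE_BREAKER is discarded between events
[my_feed]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+@@@[\r\n]+)
TRANSFORMS-section = set_section_sourcetype

# transforms.conf -- rewrite each event's sourcetype to its leading
# "SectionN" header line
[set_section_sourcetype]
REGEX = ^(Section\d+)
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
```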

woodcock
Esteemed Legend

OK, now that I know the full input/forwarding pipeline (TCP was not originally mentioned; only files were), I think I have a solution for you. You can use netcat to set up a quick-and-dirty man-in-the-middle filter. Write a script that reads from STDIN, writes to STDOUT, watches for any "ID_Number" line, and re-emits the last-seen ID line after every "Section" line it sees. I whipped up an awk script for your sample data that you can use as a start. Then redirect your Splunk forwarder to read from a different unused port (say port 1313) and use netcat on the old port like this:

nc -l 540 | awk '{if ( $1 ~ /ID_Number/ ) {ID_Number = $0; print $0} else {if ( $1 ~ /Section/ && $1 !~ /Section1/ ) {print $0 "\n" ID_Number} else {print $0}}}' | nc localhost 1313

Every good Linux man should know about netcat.
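
As a sanity check, the same awk filter can be tried offline by piping the sample file through it, no netcat needed:

```shell
# Sanity-check the awk filter offline: feed the sample sections through
# it and confirm the last-seen ID line is re-emitted after every
# Section header past the first one.
out=$(printf 'Section1\nID_Number: 12345\nlots of text\n@@@\nSection2\nlots of text\n@@@\nSection3\nlots of text\n' |
  awk '{if ($1 ~ /ID_Number/) {ID_Number = $0; print $0}
        else if ($1 ~ /Section/ && $1 !~ /Section1/) {print $0 "\n" ID_Number}
        else {print $0}}')
printf '%s\n' "$out"
```

Section2 and Section3 each come out followed by "ID_Number: 12345", so the ID line appears three times in total.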

woodcock
Esteemed Legend

I do not think this can be done at index time without pre-processing the file yourself and copying the line into every section, but you can do it easily enough at search time like this:

... | eventstats first(ID_Number) AS ID_Number by source | ...

willial
Communicator

I think we broke the comment section. Moving up here.

I'm doing event linebreaking at index time, but I'm not working with a unique source field per original file, because I'm getting data over TCP and UDP. I'm using files too, but not exclusively, and files are the only case in which your solution would really work. I'm not pre-processing outside of Splunk; I'm using props.conf and transforms.conf stanzas to handle the event linebreaking at index time. What I'm really looking for, and what I think would be most useful, is a way to extract a value from the input before event linebreaking and then embed it in each new event afterward.

willial
Communicator

The sections are broken out at index time, so I can't grab the ID from Section1 if I'm searching on Section3. Currently I lose the relationship between sections, which is why I'm trying to put a unique identifier in all sections so they can be tracked after they're broken out.

woodcock
Esteemed Legend

I am pretty sure it is impossible without pre-processing. Did you try my search-time solution? It should work just fine.

willial
Communicator

As I mentioned, I'm dividing the files at index time so if I'm searching sourcetype=Section2 or sourcetype=Section3, the ID number isn't in the current search results to run eventstats on. The files are ~15,000 lines long and contain 20 or so sections, so I can't really manage them at search time in their original format.

woodcock
Esteemed Legend

I don't think you understand my answer, probably because you may not appreciate how eventstats works. Just pretend the field exists and write your search. Then insert my eventstats solution at the very beginning of the command chain, and it will work as you expect. The only thing is that you need to be sure NOT to filter out sourcetype=Section1 until after adding in the solution, even if it means doing a broader search than you need at first. Just give it a try.

willial
Communicator

If this is right, I don't think I understand how eventstats works. This may get complicated.

So I made my sample look very generic. My actual ID looks like (sorry for the regex):

System Type:       \w+-\w+

(I'm also looking to do similar searches filtering out by MAC addresses and some other things)

I'm pretty sure I can't just ask it to eventstats first("System Type") and expect it to pick up what I want (I tried it, so now I'm even more sure). Will I need to build in an automatic field extraction, or is there a better way around this?
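
For what it's worth, an extraction like that can be prototyped outside Splunk first. Here is a rough Python equivalent of a rex for that "System Type" pattern; the sample value and the field name system_type are made up for illustration:

```python
import re

# Prototype of a rex-style extraction for the "System Type" line;
# the named capture group plays the role of the extracted field.
line = "System Type:       ABC-1234x"   # made-up sample value
m = re.search(r"System Type:\s+(?P<system_type>\w+-\w+)", line)
system_type = m.group("system_type") if m else None
# system_type is now "ABC-1234x"
```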

woodcock
Esteemed Legend

Let us assume you have forwarded 2 files as follows:

File1:

Section1
ID_Number: 12345
lots and lots of text
@@@
Section2
lots and lots of text
@@@
Section3
lots and lots of text

File 2:

Section1
ID_Number: 98765
lots and lots of text
@@@
Section2
lots and lots of text
@@@
Section4
lots and lots of text

Then you do a search like this:

index=myindex | rex "Section(?<Section>\d+)" | rex "ID_Number:\s*(?<ID_Number>\d+)" | eventstats first(ID_Number) AS ID_Number by source | table Section, ID_Number, sourcetype, source

Then you will get data like this:

Section,ID_Number,sourcetype,source
1,12345,Section1,File1
2,12345,Section2,File1
3,12345,Section3,File1
1,98765,Section1,File2
2,98765,Section2,File2
4,98765,Section4,File2

So ID_Number has been associated with every Section (sourcetype/event) within each file/source, which is what you said you needed.

But there is nothing that can make this automatic; you just have to do the eventstats on every search.
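
For readers unfamiliar with eventstats, here is a rough Python sketch of what "eventstats first(ID_Number) AS ID_Number by source" does; the event list is made up to mirror the two example files above:

```python
# Simulate eventstats first(ID_Number) by source: take the first
# non-null ID_Number seen within each source group and copy it onto
# every event in that group. Values mirror the example files.
events = [
    {"Section": "1", "ID_Number": "12345", "sourcetype": "Section1", "source": "File1"},
    {"Section": "2", "ID_Number": None,    "sourcetype": "Section2", "source": "File1"},
    {"Section": "3", "ID_Number": None,    "sourcetype": "Section3", "source": "File1"},
    {"Section": "1", "ID_Number": "98765", "sourcetype": "Section1", "source": "File2"},
    {"Section": "2", "ID_Number": None,    "sourcetype": "Section2", "source": "File2"},
]

# First pass: remember the first non-null ID per source value.
first_by_source = {}
for e in events:
    if e["ID_Number"] is not None and e["source"] not in first_by_source:
        first_by_source[e["source"]] = e["ID_Number"]

# Second pass: stamp that ID onto every event in the same group.
for e in events:
    e["ID_Number"] = first_by_source.get(e["source"])
```

Note the grouping key: if every event shares one source value (as with a single TCP input), there is only one group and every event gets the same ID, which is the failure mode discussed further down this thread.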

willial
Communicator

That doesn't work. It just extracts from the very first event that matches the rex for System Type and applies that to every result in the search. Since each section was broken into its own event at index time and there's no overlapping unique information, I don't have any relational linkage between sections.

EDIT: Additionally, the files themselves are too big not to break into smaller events. I still get some truncation issues on subsections, even.

woodcock
Esteemed Legend

You DO have relational linkage: the source field! You are totally mistaken: JUST TRY THE SEARCH! My search does EXACTLY what I say it does for the example files and data that I showed. It most certainly DOES NOT "just extract from the very first event that matches the rex for System Type and apply that to every result in the search." If I had used | eventstats first(ID_Number) AS ID_Number then it would, but I DID NOT; I used | eventstats first(ID_Number) AS ID_Number by source. I have never worked so hard to get a person to just try an answer! If you would just TRY IT, you would see that it works and that you have had a workable solution (the only one, mind you) since the very first comment.

willial
Communicator

I'm sorry if I wasn't clear -- I did try it, and it behaved as I described in my previous comment. The source field is not unique.

woodcock
Esteemed Legend

I know that the source field is not unique (i.e. you have more than 1 file, each of which has a different ID_Number); that is the WHOLE POINT, right? Unless there is something totally crazy that you are not divulging, I stand 100% by this answer. I will not believe that it doesn't work until you show me the output and point out how it is wrong (please do so):

 index=myindex | rex "Section(?<Section>\d+)" | rex "ID_Number:\s*(?<ID_Number>\d+)" | eventstats first(ID_Number) AS ID_Number by source | table Section, ID_Number, sourcetype, source

woodcock
Esteemed Legend

Perhaps the confusion is in your phrase "I'm dividing the files at index time." I took this to mean that you are doing the event linebreaking at index time, but perhaps what you mean is that you are pre-processing the files outside of Splunk, breaking up each big file into many smaller files so that only the first file has Section1 with the ID_Number. Even if that's the case, just name each split-off file like "file1.1", "file1.2", etc., and we can still use the source field (with a bit of a tweak to ignore everything after the last period) as I described. Other than this, I cannot imagine how it is that we are so misunderstanding and confusing one another.

willial
Communicator

No, the source field is not unique: every event ends up with the same source value unless you're literally using individual text files as your inputs. If you're gathering data from a TCP input, source is always going to be something like tcp:500. Which means that when you do eventstats by source and none of your sources are unique, the ID number applied to each event is going to be the same one.

This isn't crazy, and there are limited instances in which your solution would work, but this isn't one of them.
