Getting Data In

what is the best way to index a lot of data?

amir_thales
Path Finder

Hello everybody,

I will set up a platform for a future project and integrate Splunk to analyze all the generated logs (millions of logs from Windows, Linux, and possibly Cisco devices). The platform will contain forwarders that redirect the data to the Splunk server, where I will analyze it later.

Now I want to know the best way to manage my indexes and data in terms of security, backup, and speed. Should I separate the data into three indexes (Linux, Windows, and Cisco), or is there a better approach?

Thank you
Amir

0 Karma
1 Solution

skoelpin
SplunkTrust

Create a new index when you need separate access controls or retention. If you have different retention policies for different sources of data, then yes, you will need a new index for each policy. If you have access controls for specific data sets, then yes, you need a new index. If access and retention are the same across all data sources, you can put everything in the same index.

The only exception: if you put security logs in the same index as access logs, it may slow down searches, even if you explicitly select the source and sourcetype when searching.
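As a rough sketch of what "one index per retention/access policy" could look like, here is a hypothetical indexes.conf fragment (index names and retention values are illustrative, not from this thread; paths use the standard $SPLUNK_DB variable). Access control would then be granted per index via a role's allowed-indexes setting in authorize.conf.

```
# indexes.conf -- hypothetical example: one index per policy
[os_linux]
homePath   = $SPLUNK_DB/os_linux/db
coldPath   = $SPLUNK_DB/os_linux/colddb
thawedPath = $SPLUNK_DB/os_linux/thaweddb
frozenTimePeriodInSecs = 31536000    # roll to frozen after ~1 year

[network_cisco]
homePath   = $SPLUNK_DB/network_cisco/db
coldPath   = $SPLUNK_DB/network_cisco/colddb
thawedPath = $SPLUNK_DB/network_cisco/thaweddb
frozenTimePeriodInSecs = 15552000    # ~180 days, a shorter policy
```

Each stanza gets its own retention clock, which is exactly why data with different policies cannot share an index.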


amir_thales
Path Finder

The problem is that my data are not really different; only the OS or the hardware changes, so distinguishing the data by sourcetype is not very useful.

I think creating separate indexes for each OS and hardware type is the best solution, since my queries will not target all machines at once, but rather machines grouped by OS or hardware (Cisco). Before each search I will filter on the Linux or Windows index; I do not filter by individual machine because I want a global view of the platform. But over time each index (Windows, for example) will keep growing, and a query over an index containing millions of logs will be slow too. Can Splunk be configured to archive data that is more than one year old, for example? Searches would then be faster and we would get rid of data that is no longer relevant.

Amir

0 Karma

skoelpin
SplunkTrust

I guess it depends on how much data you're bringing in. If your datasets are massive and you want to plan ahead, it may be best to create a new index for each source. Yes, you can archive data older than one year; this is known as rolling data to frozen. You set retention policies per index, and you can set them by size or by time. The default retention time is 6 years, I believe.

https://docs.splunk.com/Documentation/Splunk/7.0.1/Indexer/Automatearchiving
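To make the "rolling to frozen" idea concrete, here is a hedged indexes.conf sketch (the index name and archive path are placeholders). By default Splunk deletes frozen buckets; setting coldToFrozenDir archives them instead.

```
# indexes.conf sketch -- archive data older than 1 year instead of deleting it
[os_windows]
frozenTimePeriodInSecs = 31536000              # 1 year in seconds (default is 188697600, ~6 years)
coldToFrozenDir = /archive/splunk/os_windows   # without this, frozen buckets are deleted
```

Archived buckets can later be restored into the thawed path if an old investigation needs them.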

0 Karma

amir_thales
Path Finder

Ok, thank you, this is very important. I just learned that Splunk removes frozen data, so it's important to configure this before setting up the platform.

skoelpin
SplunkTrust

If this answered your question, can you accept it and close it out?

0 Karma

amir_thales
Path Finder

So from what we discussed, there is no optimal way to index data of the same type; I have not had a precise answer on this point, apart from assigning different sourcetypes to filter the incoming data.

0 Karma

skoelpin
SplunkTrust

I thought I made it clear that either way is optimal and will work. You HAVE to explicitly specify your sourcetype when searching if you want fast results. If you do not, then you HAVE to have a new index for each source.
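For example, the difference in search scope looks roughly like this in SPL (index, sourcetype, and field names here are illustrative, not from the thread):

```
# Slow: scans every event in the index, all sourcetypes
index=main "logon failure"

# Fast: the indexer only touches events matching the index and sourcetype
index=os_windows sourcetype=WinEventLog:Security EventCode=4625 earliest=-24h
| stats count by host
```

The second search lets Splunk skip everything that is not Windows security data, which is the point of either separate indexes or explicit sourcetypes.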

You need to revise your question to make it clearer what you're asking, or create a new question.

0 Karma

amir_thales
Path Finder

Sorry to insist, but do you have any pointers to help me understand how to make indexing easier?

Amir

0 Karma

skoelpin
SplunkTrust

You should create a new question for this, since it's out of scope of your existing question. I've spent a considerable amount of time explaining this to you; you should either accept this as the answer to close it out, or modify the question to make it clearer what you're asking.

0 Karma

amir_thales
Path Finder

Thank you for your help, and happy end-of-year holidays.

amir_thales
Path Finder

Hi,

In the prototype I set up, I put the Linux and Windows logs in the same index, but searching is very slow, and I only have about 300,000 logs so far. If I restrict the search to Windows hosts, for example, it is faster; that's why I was thinking of separating the indexes by operating system or hardware. Is this worthwhile?

With respect to security and data backup, which is more relevant: one index that contains everything, or multiple indexes split by operating system or hardware?

Amir

0 Karma

skoelpin
SplunkTrust

Can you share the search you're using? If it's all in the same index and you're not explicitly specifying the sourcetype, it will be very slow, since you have a lot of noise. You can think of sourcetype as defining the shape of your data. Any time you have a different-looking data format, it requires a new sourcetype. You can have many sourcetypes in the same index.

As I said in the exception above, a massive dataset will slow searches down if everything is in the same index. Just know that creating new indexes requires additional storage for the additional tsidx files. It's really your call whether to create new indexes for each source. Some companies I've consulted at created 3 indexes per application, while others had 3-4 indexes company-wide. Search time also depends heavily on your hardware.
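If you do split by source, the routing decision is made on the forwarder. A hedged inputs.conf sketch (index names, monitored paths, and sourcetypes are illustrative) might look like:

```
# inputs.conf on a universal forwarder -- assign index and sourcetype at ingest
[monitor:///var/log/syslog]
index = os_linux
sourcetype = syslog

# Windows forwarders can collect event logs directly
[WinEventLog://Security]
index = os_windows
```

Both indexes would need matching stanzas in indexes.conf on the indexer; events sent to an index that does not exist are dropped or land in the last-chance index, depending on configuration.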

0 Karma