Getting Data In

How to archive specific Splunk data?

daniel333
Builder

All,

I have legal telling me I must keep a certain subset of data available in Splunk for 12-18 months, depending on the case. Basically it's a 24-hour period during which we were DDoS'd, so it's quite a lot of data: 100 GB or so.

Is there a way to tell Splunk "take this index, for this time range and keep the data on this disk and available for the next year"?

thanks
-Daniel

1 Solution

lguinn2
Legend

This is a somewhat unusual request, so I am going to propose a somewhat unusual solution:
Find all the buckets that apply to that 24-hour period and copy them to another location. Then back them up to some archival media and store them.

With this solution, you don't need to worry about exporting the data or setting a retention policy on the index that would affect a lot of other data. But you do need to understand indexes and buckets. You should read this page in the documentation: How the indexer stores indexes. Here are a few key concepts from that page that you will need for my solution:
Indexes consist of a set of directories. The buckets themselves are subdirectories within the index directories. The bucket directory names are based on the age of the data they contain (and other things).

To identify the proper buckets, you need to know a few specific things:
- Which index to look in - and where to find the directory for that index
- How to interpret the bucket name to identify the time range of the bucket
- The epoch time of the start and end of the timerange that you want to save

To find the index, you should be able to look into indexes.conf (or under Settings>>Indexes). Under the main directory for the index, you will find 2 subdirectories: db and colddb. You will need to look in both of these to be sure that you collect all the buckets.
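
If it helps to see the naming convention concretely, here is a minimal sketch (Python, run outside Splunk) of decoding a warm/cold bucket directory name into its time range. It assumes the standard db_<latestEpochTime>_<earliestEpochTime>_<localId> naming described on that docs page; the bucket name in the example is made up.

from datetime import datetime, timezone

def bucket_time_range(bucket_name):
    """Decode a warm/cold bucket directory name, e.g. db_1451606399_1451520000_42,
    into (earliest, latest) datetimes. Assumes the documented
    db_<latestEpochTime>_<earliestEpochTime>_<localId> layout; clustered buckets
    may carry an extra GUID component on the end."""
    parts = bucket_name.split("_")
    latest = datetime.fromtimestamp(int(parts[1]), tz=timezone.utc)    # newest event time
    earliest = datetime.fromtimestamp(int(parts[2]), tz=timezone.utc)  # oldest event time
    return earliest, latest

# Hypothetical bucket name covering roughly 2015-12-31 (UTC)
print(bucket_time_range("db_1451606399_1451520000_42"))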

Within the db and colddb directories, examine the name of every bucket. If the time range of the bucket overlaps the time range that you want to save, copy that bucket to your chosen "save" location. Be sure to copy recursively to get all the subdirectories! Note: one way to identify the buckets that you need is to run a search like this:

| dbinspect index=yourIndexHere 
| eval myEarliest = strptime("12/1/2015","%m/%d/%Y")  | eval myLatest = strptime("1/1/2017","%m/%d/%Y")
| where (myEarliest >= startEpoch AND myEarliest <= endEpoch) OR (myLatest >= startEpoch AND myLatest <= endEpoch) OR
        (myEarliest < startEpoch AND myLatest > endEpoch)
| eval bucketStart = strftime(startEpoch,"%x %X") | eval bucketEnd = strftime(endEpoch,"%x %X")
| table path bucketStart bucketEnd sizeOnDiskMB
| addcoltotals 

Of course, you would need to change 12/1/2015 to your starting date. And you need to change 1/1/2017 to the day after your end date (because strptime will calculate the epoch time as of the start of the day).

If the data that you need is in multiple indexes, repeat the process for each index. If you have multiple indexers, repeat the process for each indexer. If you are using indexer clustering, you may need to look more carefully at the bucket names; if you just follow these steps, you will end up backing up multiple copies of the same data. But the docs page on bucket names explains that.

It would not be that difficult to write a script to make the bucket copies. Be sure to check the list of copied buckets against the results of the search. I hope this suggestion is a reasonable solution!
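
For what it's worth, here is a rough sketch of such a script in Python. The index path, save location, and epoch times are hypothetical placeholders; it assumes the db_<latestEpochTime>_<earliestEpochTime>_... naming above, skips anything it cannot parse (hot buckets, metadata directories) along with clustered rb_ replica copies, and should be cross-checked against the dbinspect results before you rely on it.

import os
import shutil

# Hypothetical values - adjust for your environment
INDEX_PATH = "/opt/splunk/var/lib/splunk/yourIndexHere"  # per indexes.conf
SAVE_PATH = "/backup/legal_hold"
EARLIEST = 1448928000   # epoch for 12/1/2015 00:00:00 UTC (start of range)
LATEST = 1483228800     # epoch for 1/1/2017 00:00:00 UTC (day after the end date)

def bucket_range(name):
    """Return (earliest, latest) epoch times from a db_* bucket name, or None.
    Returns None for hot buckets, metadata, and clustered rb_ replica copies."""
    parts = name.split("_")
    if len(parts) < 4 or parts[0] != "db":
        return None
    try:
        return int(parts[2]), int(parts[1])   # the name stores the latest time first
    except ValueError:
        return None

copied = []
for subdir in ("db", "colddb"):
    base = os.path.join(INDEX_PATH, subdir)
    if not os.path.isdir(base):
        continue
    for name in sorted(os.listdir(base)):
        rng = bucket_range(name)
        if rng is None:
            continue
        bucket_earliest, bucket_latest = rng
        # Keep the bucket if its time range overlaps the save window at all
        if bucket_earliest < LATEST and bucket_latest >= EARLIEST:
            src = os.path.join(base, name)
            dst = os.path.join(SAVE_PATH, subdir, name)
            shutil.copytree(src, dst)         # recursive copy, like cp -r
            copied.append(name)

print("Copied %d buckets" % len(copied))
for name in copied:
    print("  " + name)

Running something like this once per index (and per indexer) covers the multi-index and multi-indexer cases described above.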

daniel333
Builder

Hey Lisa,

Your answer was great, and it was added to my toolbox for sure. Ultimately, here is how I solved this one: I created a new index with maxed-out space and no deletion rules.

index=web sourcetype=akamai host=myserver  | collect index=weblegal sourcetype=akamai

niketn
Legend

You can consider:

1) Summarizing the data, or retaining the entire data set, by moving it from one index to another with the collect command.
2) Scheduling the search at a regular interval (for example, every month for the past month on the main index).
3) Finally you can define a separate retention period for your summary index.

If you summarize the data with stats on the important fields, it will lead to faster searches and a significantly smaller number of events.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

lguinn2
Legend

If you are saving this data for legal reasons, I don't think you should summarize it in any way, or alter it from the original log file format.


daniel333
Builder

Correct, summarization would lose the content we're exploring.


niketn
Legend

By summarization I meant use of the collect command; the summaries were just a suggestion based on your use case. You can push raw data with the collect command as well.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

somesoni2
Revered Legend

Sounds like a retention-policy-related question. Is this "subset of data" in its own index, and does that index contain only the data that needs to be retained for a longer period?


daniel333
Builder

The order is to not delete any log or database record from the time range.
