Index Migration from non-cluster standalone Splunk instances to Multi-site Cluster

hemendralodhi · ‎10-08-2016

Hello Splunkers.

I am facing an issue while implementing the steps to migrate legacy standalone Splunk instances to a multisite cluster environment. I tried to make the changes today but rolled them back after the issue. Can you please advise?

Scenario: Migration of legacy data from non-clustered Splunk instances acting as standalone servers to a multisite cluster (2 nodes in each site)

Procedure followed:

Upgraded the source server to v6.4.1 to match the target Splunk server version.

1) Created the required indexes on the target Splunk Enterprise servers (multisite environment).
2) Rolled the buckets and rsynced the data from the non-clustered Splunk instances to all target indexers, so that each of the 4 target indexers holds one copy of every bucket.
3) Copied the data to the target indexes (source db -> target db and source colddb -> target colddb) and performed bucket scrubbing to prevent bucket ID collisions.
4) Rebuilt the buckets for each index (e.g. ./splunk _internal call /data/indexes/customindexname/rebuild-metadata-and-manifests).
5) Performed a rolling restart from the cluster master, and all indexers went down.
6) After the issue, checked the .bucketManifest file and found that the "origin_site" header was set to default. Tried updating origin_site for each index on every indexer, according to the site the indexer belongs to.
7) Performed a rolling restart from the cluster master again, with the same result.
8) After the restart, the indexers crash with the error "Cannot disable indexes on a clustering slave."
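
For the collision check in step 3, a minimal Python sketch of how overlapping bucket IDs could be detected before copying. It assumes the usual bucket directory naming of db_<newest_time>_<oldest_time>_<local_id> (with an optional GUID suffix on clustered buckets); the directory names in the example are made up, and this is not the scrubbing procedure actually used above.

```python
import re

# Assumed bucket directory layout: db_<newest>_<oldest>_<local_id>[_<guid>]
# (rb_ is the prefix used for replicated copies in a cluster).
BUCKET_NAME = re.compile(r"^(?:db|rb)_(\d+)_(\d+)_(\d+)(?:_[0-9A-Fa-f-]+)?$")


def local_bucket_ids(dir_names):
    """Return the set of local bucket IDs found in a list of directory names."""
    ids = set()
    for name in dir_names:
        match = BUCKET_NAME.match(name)
        if match:  # skip hot buckets, manifests and anything else
            ids.add(int(match.group(3)))
    return ids


def colliding_ids(source_dirs, target_dirs):
    """Local IDs that exist on both sides and would clash after a copy."""
    return sorted(local_bucket_ids(source_dirs) & local_bucket_ids(target_dirs))


# Hypothetical listings of <index>/db on a source and a target indexer.
source = ["db_1475900000_1475800000_12", "db_1475700000_1475600000_13"]
target = ["db_1475950000_1475850000_13", "hot_v1_14", ".bucketManifest"]
clashes = colliding_ids(source, target)
print(clashes)
```

In practice the two lists would come from something like os.listdir() on the db and colddb directories of the same index on each side.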

Below is the snippet of crash log from one of the indexer:

[build debde650d26e] 2016-10-09 09:03:36
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 24368 running under UID 33335.
Crashing thread: SplunkdSpecificInitThread
Registers:
RIP: [0x00007F6DA4628625] gsignal + 53 (/lib64/libc.so.6 + 0x32625)
RDI: [0x0000000000005F30]
RSI: [0x0000000000005F3B]
RBP: [0x00007F6DA7690638]
RSP: [0x00007F6D9F5FE3F8]
RAX: [0x0000000000000000]
RBX: [0x00007F6DA5B08000]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0x00007F6D9BC00000]
R9: [0x00007F6DA0D5F880]
R10: [0x0000000000000008]
R11: [0x0000000000000206]
R12: [0x00007F6DA7690AE8]
R13: [0x00007F6DA7745980]
R14: [0x00007F6D9F04A460]
R15: [0x00007F6D9F5FE8D0]
EFL: [0x0000000000000206]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]

OS: Linux
Arch: x86-64

Backtrace (PIC build):
[0x00007F6DA4628625] gsignal + 53 (/lib64/libc.so.6 + 0x32625)
[0x00007F6DA4629E05] abort + 373 (/lib64/libc.so.6 + 0x33E05)
[0x00007F6DA462174E] ? (/lib64/libc.so.6 + 0x2B74E)
[0x00007F6DA4621810] __assert_perror_fail + 0 (/lib64/libc.so.6 + 0x2B810)
[0x00007F6DA6511C4D] _ZN14IndexerService35disableIndexesAndReinitGlobalConfigERKN9__gnu_cxx17__normal_iteratorIPK3StrSt6vectorIS2_SaIS2_EEEESA_ + 1741 (splunkd + 0x9BAC4D)
[0x00007F6DA6512C27] _ZN14IndexerService18initPerIndexConfigEP9StrVectorb + 455 (splunkd + 0x9BBC27)
[0x00007F6DA65151F1] _ZN14IndexerService12reloadConfigERK14IndexConfigRef + 481 (splunkd + 0x9BE1F1)
[0x00007F6DA6AF06A0] _ZN9EventLoop20internal_runInThreadEP13InThreadActorb + 256 (splunkd + 0xF996A0)
[0x00007F6DA6511128] _ZN14IndexerService16loadLatestConfigEP14IndexConfigRef + 808 (splunkd + 0x9BA128)
[0x00007F6DA651129B] _ZN14IndexerService16loadLatestConfigEv + 43 (splunkd + 0x9BA29B)
[0x00007F6DA65158EB] _ZN14IndexerServiceC2Ev + 859 (splunkd + 0x9BE8EB)
[0x00007F6DA6515D87] _ZN14IndexerService14_new_singletonEv + 55 (splunkd + 0x9BED87)
[0x00007F6DA61B755F] _ZN25SplunkdSpecificInitThread4mainEv + 159 (splunkd + 0x66055F)
[0x00007F6DA6BADC00] _ZN6Thread8callMainEPv + 64 (splunkd + 0x1056C00)
[0x00007F6DA4991AA1] ? (/lib64/libpthread.so.0 + 0x7AA1)
[0x00007F6DA46DE93D] clone + 109 (/lib64/libc.so.6 + 0xE893D)
Linux / vvslm0123.vodafone.com.au / 2.6.32-504.30.3.el6.x86_64 / #1 SMP Thu Jul 9 15:20:47 EDT 2015 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
2016-10-09 08:57:40.610 +1000 splunkd started (build debde650d26e)
splunkd: /home/build/build-src/galaxy/src/pipeline/indexer/IndexerService.cpp:921: void IndexerService::disableIndexesAndReinitGlobalConfig(const const_iterator&, const const_iterator&): Assertion `0 && "Cannot disable indexes on a clustering slave."' failed.
2016-10-09 08:58:49.564 +1000 splunkd started (build debde650d26e)
2016-10-09 09:03:15.382 +1000 Interrupt signal received
2016-10-09 09:03:33.212 +1000 splunkd started (build debde650d26e)
splunkd: /home/build/build-src/galaxy/src/pipeline/indexer/IndexerService.cpp:921: void IndexerService::disableIndexesAndReinitGlobalConfig(const const_iterator&, const const_iterator&): Assertion `0 && "Cannot disable indexes on a clustering slave."' failed.

/etc/redhat-release: Red Hat Enterprise Linux Server release 6.6 (Santiago)
glibc version: 2.12
glibc release: stable
Last errno: 2
Threads running: 21
Runtime: 3.295799s
argv: [splunkd -p 8089 start]
Thread: "SplunkdSpecificInitThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f6da27c6710:
00000000 00 f7 5f 9f 6d 7f 00 00 |.._.m...|
00000008

InThreadActor @0x7f6d9f5fea20: _queuedOn=(nil), ran=N, wantWake=Y, wantFailIfLoopDone=N

Please advise.

Thanks

lguinn2 · ‎10-09-2016

That is not the way to do what you want, and I am not sure whether what you are attempting is feasible. First, buckets that were created as non-clustered buckets cannot become clustered buckets. There is no supported way to convert them. In particular, multisite buckets have special requirements; it is not sufficient to make the bucket IDs unique. See "Migrate non-clustered indexers to a clustered environment". Have you contacted Splunk Professional Services for assistance?

The default behavior is that the non-clustered buckets remain as they were; only new buckets are clustered. Over a period of time, all the non-clustered buckets will age out of the indexes and all the data will be clustered.

Are you migrating from one server to many servers? What is the old configuration vs. the new? What are the search/replication factors in your multi-site cluster?

hemendralodhi · ‎10-09-2016

Thanks for the response, lguinn.

It is fine for us if the legacy buckets do not become clustered buckets.

We are consolidating and migrating data from standalone Splunk instances to a multisite setup with a replication factor of origin 2 and other 1. The search factor is 3.

If we have a single copy of each bucket, let's say on just one target indexer, will it work with the procedure specified? For new data, new buckets would be created and replicated.

lguinn2 · ‎10-10-2016

Yes, although you may not be able to copy ALL of the indexes to a single cluster member. I am not quite sure what you are planning there.

Example 1: you have 6 standalone indexers, and you are migrating to 6 new indexer peers (3 per site). Just copy all of the indexes from each standalone to a different peer. For example, standalone-1's data can be copied to newpeer-1, standalone-2 to newpeer-2 and so forth. There is no risk of "collision" if you keep all the old data separate in this way. And Splunk will handle the new data for you, replicating it across the cluster.
BUT - you will need to have a consolidated indexes.conf, as described below.

Example 2: Let's say that you have 10 standalone indexers, but you will have only 6 indexer peers (3 per site) in your new environment. Then things are more complicated - you can't do a 1-to-1 migration. You will need both the consolidated indexes.conf and a plan for moving the data without collisions, like this:

First - every indexer that might ever have data for an index must have that index defined. Even if an indexer does not start out with any data for index A, it might get a copy of index A buckets during either indexing or recovery. So build an indexes.conf that defines all the possible indexes and push it to every indexer in the cluster (via an app on the cluster master). You will also have to reconcile any differences in the old definitions: what if standalone-1 has a 90-day retention for index A, but standalone-3 has a 60-day retention? All of that must be planned and worked out before the migration can begin.
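
A minimal Python sketch of that reconciliation step, assuming retention is expressed through the indexes.conf setting frozenTimePeriodInSecs. The function name and index B are hypothetical; the 90-day and 60-day values for index A come from the example above. It only reports the conflicts and the full list of indexes to define; deciding which retention wins is still a planning decision.

```python
from collections import defaultdict

DAY = 86400  # seconds per day


def find_retention_conflicts(definitions):
    """definitions maps source indexer -> {index name: frozenTimePeriodInSecs}.

    Returns only the indexes whose retention differs between sources.
    """
    by_index = defaultdict(dict)
    for source, indexes in definitions.items():
        for index, retention in indexes.items():
            by_index[index][source] = retention
    return {
        index: per_source
        for index, per_source in by_index.items()
        if len(set(per_source.values())) > 1
    }


# Hypothetical inventory collected from the old indexes.conf files.
standalone_definitions = {
    "standalone-1": {"A": 90 * DAY, "B": 30 * DAY},
    "standalone-3": {"A": 60 * DAY},
}

# Every index that the consolidated indexes.conf has to define.
all_indexes = sorted({i for d in standalone_definitions.values() for i in d})
conflicts = find_retention_conflicts(standalone_definitions)

print("define on every peer:", all_indexes)
print("needs a decision:", conflicts)
```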

Second, figure out which of the old indexers/indexes will be copied to which of the new indexers. Perhaps the complete indexes from standalone-1 can be copied to newpeer-1. Next, if standalone-2 has entirely different indexes (no overlap) all the indexes from standalone-2 can also be copied to newpeer-1. BUT if two of the standalone indexers have indexes with the same name - those indexes have to be copied to different indexers; they cannot be combined. So you may have some mapping to do, to make sure that you don't have name collisions when you copy.
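
The mapping described above can be sketched in Python as a simple first-fit assignment: each standalone goes to the first new peer that does not already hold an index with the same name. The function name and the sample index names are assumptions for illustration, and first-fit can fail to find a plan even when one exists, so treat the result as a starting point rather than a definitive mapping.

```python
def plan_copy_targets(standalone_indexes, peers):
    """Assign each standalone indexer to one peer without index-name overlap.

    standalone_indexes maps standalone name -> iterable of index names.
    Returns {standalone name: peer name}; raises ValueError if first-fit
    cannot place a standalone on any peer.
    """
    held = {peer: set() for peer in peers}  # index names already on each peer
    plan = {}
    for source, names in standalone_indexes.items():
        names = set(names)
        for peer in peers:
            if not (held[peer] & names):  # no index name collides here
                held[peer] |= names
                plan[source] = peer
                break
        else:
            raise ValueError(f"no collision-free peer left for {source}")
    return plan


# Hypothetical inventory: standalone-2 shares no index names with
# standalone-1, but standalone-3 also has an index called "A".
inventory = {
    "standalone-1": ["A", "B"],
    "standalone-2": ["C"],
    "standalone-3": ["A"],
}
plan = plan_copy_targets(inventory, ["newpeer-1", "newpeer-2"])
print(plan)
```

Here standalone-1 and standalone-2 both land on newpeer-1, while standalone-3 is pushed to newpeer-2 because it also carries index A.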

Now, there are ways to cheat on some of my "rules" here - but I try to take an approach that is conservative, preserves data integrity, and is easy to roll back if necessary.

hemendralodhi · ‎10-11-2016

Thanks, lguinn, for your detailed response. It is really helpful. I will test this with a few indexes and update here.

I do have one question, though: why can't we combine data from 2 standalone indexers on one of the multisite indexers? You said there would be a name collision, but if we take care of the bucket conflicts, will there still be a collision? Also, after running the rebuild command on the new indexes, origin_site shows as default. Do we have to update that to the site name on each indexer, or will Splunk continue to work with default for the legacy data and the specific origin for any new data?

Your responses are highly appreciated.

Thanks
Hemendra

hemendralodhi · ‎10-11-2016

Thanks very much, lguinn. I tested with 1 index and it is working fine with origin_site as default, so there is no need to update the .bucketManifest file.

I will test more and update here.

Thanks
