In my organization we are planning to use distributed search and indexing, where our requirement is 3 GB of data volume indexed per day.
Could you please suggest how many indexer nodes and search nodes would be enough to give optimum output?
To answer either question properly, we need to know more about your environment. The easy answer is "Probably one server, maybe two for redundancy, possibly 3-5 if you have a virtual environment you can steal some cycles from." The harder but better answer is "It depends on so many things."
That said, we can at least offer some very vague ballpark ideas.
So, 3 GB/day is a small amount. This will fit perfectly fine on a single indexer, even virtualized. You may even be small enough to keep your search head on the same box too, but I'd advise against it for reasons mentioned further down.
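To put a rough number on the disk involved, here's a back-of-envelope estimate. The 90-day retention and the 50% on-disk factor (a common rule of thumb: roughly 15% compressed raw data plus 35% index files) are assumptions for illustration, not figures from this thread:

```shell
# Back-of-envelope index storage estimate.
# Assumptions (not from this thread): 90-day retention, and on-disk
# usage of ~50% of raw volume (~15% compressed raw + ~35% index files).
daily_gb=3
retention_days=90
disk_factor_pct=50

disk_gb=$(( daily_gb * retention_days * disk_factor_pct / 100 ))
echo "${disk_gb} GB of index storage"   # prints "135 GB of index storage"
```

Even with generous headroom, that lands comfortably within a single modest server.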
First, talk to your Splunk rep. They can bring in someone to help you decide what direction you ought to take given your environment. Second, talk to your Splunk rep about training; there are a lot of great education classes available on how to manage this environment. Third, talk to your Splunk rep about possibly getting some professional services time to help you stand this up.
Are you getting the idea your Splunk rep may be of help here?
If it were up to me and knowing so little about data retention needs, redundancy and reliability requirements or even search load, I'd say something like the following.
If you need a proof of concept, you should have no problem setting up one decently specced system (physical or virtual) with enough disk to make this work. But be careful you don't let that turn into a production box unless you PLAN for it to be your production environment. A smallish virtual machine, even one just running off my desktop, has sometimes been a good enough PoC.
If you have a virtual environment, I'd build a pair of indexers in a cluster (it makes LOTS of things better down the road, even if it's slightly more complicated to start with), one search head, one Cluster Master, and one "extra" machine (Deployment Server, License Master, and whatever else). You can use Ubuntu or CentOS just fine, so there's no real licensing cost except whatever your hypervisor costs you. Splunk also charges by daily ingestion volume, not by how many Splunk servers you have. So in this case, give the indexers and search head reasonable resources; the rest can be pretty small boxes.
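A minimal sketch of what the clustering setup looks like in `server.conf` stanzas; the hostname `cm.example.com` and the shared secret are placeholders, and replication/search factors of 2 are just an example fitting a two-indexer cluster:

```ini
# Cluster Master: $SPLUNK_HOME/etc/system/local/server.conf
[clustering]
mode = master
replication_factor = 2
search_factor = 2
pass4SymmKey = <your-shared-secret>

# Each indexer (cluster peer): $SPLUNK_HOME/etc/system/local/server.conf
[replication_port://9887]

[clustering]
mode = slave
master_uri = https://cm.example.com:8089
pass4SymmKey = <your-shared-secret>

# Search head: $SPLUNK_HOME/etc/system/local/server.conf
[clustering]
mode = searchhead
master_uri = https://cm.example.com:8089
pass4SymmKey = <your-shared-secret>
```

Restart each instance after editing. With a replication factor of 2 across two indexers, either indexer can keep serving searches if the other dies - that's the "makes LOTS of things better" part.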
In a physical environment, oh, that makes this small-scale stuff hard to do right. Hmm. Honestly, for 3 GB/day you can just stick it all on one box. That has some upsides, but also some downsides. The upside is a single place to manage. The downside is that you'll have a very hard time keeping configs isolated - not a problem as long as you stay on a single machine, but if you grow and start dividing your roles among other machines, you'll find you have a lot of untangling to do. It also means you won't have the resilience of multiple clustered indexers or search heads.
If you wanted to mix and match between physical and virtual, I'd put the indexer(s) on physical first, then the SH; I wouldn't suggest physical for any of the other machines, since they simply don't need that much oomph. But investigate what you buy - the indexers especially have specific IOPS requirements, and will have disk space requirements too. Splunk can help with sizing them, I think.
Anyway, the biggest questions you'll want to find some sort of answers to are:
1) Who is the business owner of the project?
2) What are the goals?
3) What sort of funding model is in place, or are you figuring out how that ought to work right now?
4) What searching do you expect to happen?
5) How long do we need to keep each type of data?
6) How much tolerance do we have for a) losing data or b) losing access to data during a DR event?
7) How much data do I have sitting around that would be easy and useful to get into Splunk? (Because this data is very likely to end up IN Splunk, we should really make a plan for it.)
Let us know how you get along with this!
So our setup will be like: one Splunk machine for the Splunk web UI, where only the web service will be running;
next, two search forwarder machines where we will disable the webapp parameter to deactivate the web service, and the same on two other Splunk indexer machines. Does that look fine?
The setup you are going for is probably adequate for your data, assuming enough IOPS and only a moderate search load (e.g. not 14 users hammering it while it's running 743 saved searches every evening to send reports to folks, etc.).
For years we ran a 5 GB/day environment off a single physical Windows server with an old 2-socket, 2-core Xeon (4 cores total), 16 GB of RAM, and four 15k RPM disks in RAID 10. That really did work OK. It wasn't super fast, but it was fast enough for my couple of users with that little bit of data - way better than trying to search firewall data from the firewall's own interface.
At that time, if I had run even a moderate-speed search head as a separate machine, and given the indexer a bit more CPU and maybe two more disks in that RAID 10 set, then performance-wise it probably would have handled double, triple, or even more of my data.
If you can swing it, run Splunk on *nix. The free distributions are fine and several options are available. If that gives you the heebie-jeebies, well, Windows will work fine too. Just not "as fine."
Don't get hung up on disabling the web UI and all that. I honestly don't think the attack surface is that much bigger, and frankly it's just more work than it's worth, IMO, especially in a small environment. In a non-clustered environment, there is goodness to be had in the DMC (Distributed Management Console) on those devices to help keep track of your data.
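(For completeness, if you do decide to turn off Splunk Web on the indexers anyway, it's a one-line setting; the path assumes a standard install:)

```ini
# $SPLUNK_HOME/etc/system/local/web.conf on each indexer
[settings]
startwebserver = 0
```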
Do configure your SH to search your indexers, though.
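In a non-clustered setup, that's a `distsearch.conf` entry on the search head; the indexer hostnames below are placeholders. Note that you'll also need to establish trust between the search head and each indexer, which the Settings > Distributed search UI (or the `splunk add search-server` CLI) handles for you:

```ini
# Search head: $SPLUNK_HOME/etc/system/local/distsearch.conf
[distributedSearch]
servers = https://idx01.example.com:8089,https://idx02.example.com:8089
```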