Some Notes on Monitoring and Testing Load Balanced Resources
One of the challenges faced when designing end-to-end checks is encountered when you are facing a load balancing arrangement. It is necessary to repeat part of the check more than once so that you can be reasonably sure that you hit all the resources that are being load balanced. Repeat too many times and you push your resources (ah, nothing like stress testing a live production system). Repeat too few times, and node 12 will keep serving that critical error you thought you'd fixed.
Random
Perhaps the most common case is random or semi-random balancing. If the balancing is fair (that is if the probability of a request being assigned to a given resource is constant across all resources), this is a case of the coupon collectors problem. Essentially the problem is the same as someone collecting a full set of coupons (or baseball cards, world cup stickers or pokemon etc.) - the first few are easy to get but it becomes progressively more difficult to get new ones as your collection grows.
In this case, the problem scales as O(N ⋅ log(N)), where N is the number of resources. To calculate the number of times we need to repeat the checks to cover all the resources(C), we first calculate it's expectation:
〈C〉 = N ⋅ HN
= N ⋅ loge N + γN + o(1⁄N2)
≅ N ⋅ loge N + γN as N → ∞
Hn is the Harmonic Number of N, and γ is the Euler-Mascheroni constant. For small N we can find Hn by hand.
Hn = 1/1 + 1/2 + 1/3 + … + 1/n
We can then use the markov inequality to find a bound on the probability of coverage, which should be a satisfactory guide.
P(C ≥ xnHn) ≤ 1⁄x
Round Robin
The other common load balancing algorithm is set up for round robin balancing, when the resources are ordered in a list, and the balancer chooses them in sequence, restarting from the top when necessary.
In this case we will need to know the number of 'organic' requests coming through between one test and the next. We can calculate this if we know the request probability density L(t); which is something you really want to be keeping track of anyway. If we can assume our L(t) is cyclostationary, the expectation of the number of requests between checks that take with period T at time 0 is:
∫0NT L(t)/N dt
= L(t)T as NT → 0.
We can find the probability that no requests come through by using the markov inequality again, if needed.
If you are running the checks in a quiet period (or running them quickly enough) then you get full coverage at C = N with probability 1-PO, where PO is the probability of an an organic request. Unfortunately simply adding an extra check or two will not help; you'll have add another N-1, to get probability 1-PO2. If you do your checks at busy times L(t)T >> N, when it's almost certain that several requests have come through, you can delay or randomise the timing of the checks so that either
T >> σ(L(t))
or
σ(T) >> L(t)N
in which case it reduces to random.
Other Algorithms
If the load balancing is set up to favour 'least loaded' resources in some way, then we have a different kettle of fish. In fact, unless we're load balancing requests that always take the same time to complete this is usually a bad thing to have in place anyway, and even then it's highly questionable whether this added complexity helps.