From: antirez
Date: Tue, 31 Jul 2012 08:14:23 +0000 (+0200)
Subject: Sentinel: abort failover when in wait-start if master is back.
X-Git-Url: https://git.saurik.com/redis.git/commitdiff_plain/3da75e2ca42a12623d80293755eaafa780de8074?hp=e328e41a3a26a5d7da875317a4e053768d6d4c7a

Sentinel: abort failover when in wait-start if master is back.

When we are a Leader Sentinel in wait-start state, starting with this
commit the failover is aborted if the master returns online.

This improves the way we handle a notable case of net split, that is,
the split between Sentinels and Redis servers. This will be a very
common case of split because Sentinels will often be installed in the
client's network, while servers can be in a different arm of the
network.

When Sentinels and Redis servers are isolated from each other, the
master is in ODOWN condition since the Sentinels can agree about this
state. However the failover does not start since there are no good
slaves to promote (in this specific case all the slaves are
unreachable).

When the split is resolved, Sentinels may sense that the slave is back
a moment before they sense that the master is back, so the failover may
start without a good reason (since the master is actually working too).
Now this condition is reversible: the failover is aborted as soon as
the master is detected to be working again, that is, in neither SDOWN
nor ODOWN condition.
---

diff --git a/src/sentinel.c b/src/sentinel.c
index 1048e8c7..d1c6befe 100644
--- a/src/sentinel.c
+++ b/src/sentinel.c
@@ -2400,6 +2400,24 @@ sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
 
 /* ---------------- Failover state machine implementation ------------------- */
 void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
+    /* If we are in "wait start" but the master is no longer in ODOWN nor in
+     * SDOWN condition we abort the failover. This is important as it
+     * prevents a useless failover in a notable case of netsplit, where
+     * the sentinels are split from the redis instances. In this case
+     * the failover will not start while there is the split because no
+     * good slave can be reached. However when the split is resolved, we
+     * can go to waitstart if the slave is back reachable a few milliseconds
+     * before the master is. In that case when the master is back online
+     * we cancel the failover. */
+    if ((ri->flags & (SRI_S_DOWN|SRI_O_DOWN)) == 0) {
+        sentinelEvent(REDIS_WARNING,"-failover-abort-master-is-back",
+            ri,"%@");
+        sentinelAbortFailover(ri);
+        return;
+    }
+
+    /* Start the failover going to the next state if enough time has
+     * elapsed. */
     if (mstime() >= ri->failover_start_time) {
         ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
         ri->failover_state_change_time = mstime();
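
The check added by this patch can be tried in isolation with a small model of the wait-start step. The sketch below is not code from sentinel.c: the `master_model` struct, the `failover_state` enum, the `wait_start_step` function and the numeric flag values are all hypothetical stand-ins, and only the names `SRI_S_DOWN` and `SRI_O_DOWN` are borrowed from the patch. It assumes that aborting simply returns the model to a "no failover" state, whereas the real `sentinelAbortFailover` may do more.

```c
#include <assert.h>

/* Flag names mirror the patch; the bit values are assumptions made
 * for this sketch only. */
enum {
    SRI_S_DOWN = 1 << 0,   /* subjectively down (this Sentinel's view) */
    SRI_O_DOWN = 1 << 1    /* objectively down (agreed among Sentinels) */
};

/* Hypothetical, reduced set of failover states. */
typedef enum {
    FAILOVER_NONE,
    FAILOVER_WAIT_START,
    FAILOVER_SELECT_SLAVE
} failover_state;

/* Hypothetical stand-in for the master's sentinelRedisInstance. */
typedef struct {
    int flags;                      /* SRI_* bits */
    failover_state state;
    long long failover_start_time;  /* milliseconds */
} master_model;

/* One tick of the wait-start state. Returns the state after the step. */
failover_state wait_start_step(master_model *m, long long now_ms) {
    if (m->state != FAILOVER_WAIT_START) return m->state;

    /* Master is neither SDOWN nor ODOWN: it is back, so abort. */
    if ((m->flags & (SRI_S_DOWN | SRI_O_DOWN)) == 0) {
        m->state = FAILOVER_NONE;
        return m->state;
    }

    /* Master still looks down: advance once the start time is reached. */
    if (now_ms >= m->failover_start_time)
        m->state = FAILOVER_SELECT_SLAVE;
    return m->state;
}
```

Calling `wait_start_step` on a model whose flags have both bits cleared moves it out of wait-start even if the start time has already passed, because the abort check runs first. A master that is still flagged SDOWN only does not trigger the abort, which matches the condition in the patch.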