Added more information about slave election in Redis Cluster alternative doc

author antirez <antirez@gmail.com>

Thu, 29 Apr 2010 13:39:11 +0000 (15:39 +0200)

committer antirez <antirez@gmail.com>

Thu, 29 Apr 2010 13:39:11 +0000 (15:39 +0200)
diff --git a/design-documents/REDIS-CLUSTER-2 b/design-documents/REDIS-CLUSTER-2

index 930fea614f88f37f4081f4b696ff407504e3b1db..62fa114b2432b71af52eae5e59bfb826747c4511 100644
--- a/design-documents/REDIS-CLUSTER-2
+++ b/design-documents/REDIS-CLUSTER-2
@@ -278,4 +278,66 @@ to the same hash slot. In order to guarantee this, key tags can be used,
  where when a specific pattern is present in the key name, only that part is
  hashed in order to obtain the hash index.
  
+Random remarks
+==============
+
+- It's still not clear how to perform an atomic election of a slave to master.
+- In normal conditions (all the nodes working) this new design is just
+  K clients talking to N nodes without intermediate layers, no routes:
+  this means it is horizontally scalable with O(1) lookups.
+- The cluster should optionally be able to work with manual failover
+  for environments where it's desirable to do so. For instance it's possible
+  to set up periodic checks on all the nodes, and switch IPs when needed,
+  or other advanced configurations that cannot be the default as they
+  are too environment-dependent.
+
+A few ideas about client-side slave election
+============================================
+
+Detecting failures in a collaborative way
+-----------------------------------------
+
+In order to make node failure detection and slave election a distributed
+effort, without any "control program" that is in some way a single point
+of failure (the cluster will not stop when it stops, but errors are not
+corrected without it running), it's possible to use a few consensus-like
+algorithms.
+
+For instance all the nodes may keep a list of errors detected by clients.
+
+If Client-1 detects some failure accessing Node-3, for instance a connection
+refused error or a timeout, it logs what happened with LPUSH commands against
+all the other nodes. These "error messages" will have the Node id and a
+timestamp. Something like:
+
+    LPUSH __cluster__:errors 3:1272545939
+
+So if the error is reported many times in a small amount of time, at some
+point a client has enough hints that a slave election needs to be
+performed.
+
+Atomic slave election
+---------------------
+
+In order to avoid races when electing a slave to master (that is, to prevent
+some client from still contacting the old master for that node within
+the 10 seconds timeframe), the client performing the election may write
+some hint in the configuration, change the configuration SHA1 accordingly and
+wait for more than 10 seconds, in order to be sure all the clients will
+refresh the configuration before a new access.
+
+The config hint may be something like:
+
+"we are switching to a new master, that is x.y.z.k:port, in a few seconds"
+
+When a client updates the config and finds such a flag set, it starts to
+continuously refresh the config until a change is noticed (this will take
+at most 10-15 seconds).
+
+The client performing the election will wait for that 10 seconds time frame
+and finally will update the config in a definitive way, setting the new
+slave as master. All the clients at this point are guaranteed to have the new
+config, either because they refreshed or because at the next query their config
+has already expired and they'll update the configuration.
+
  EOF
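
The client-side flow added by this commit (report errors with LPUSH, notice
repeated reports, then switch masters in two steps) can be sketched roughly as
follows. This is only an illustration under stated assumptions, not code from
the Redis tree: FakeNode is a hypothetical in-memory stand-in for a real node,
the window and threshold values are invented (the notes only say "many times
in a small amount of time"), and the config is modelled as a plain dict whose
SHA1 is recomputed on every change.

```python
import hashlib
import time

ERRORS_KEY = "__cluster__:errors"  # key name used in the design notes
REFRESH_SECS = 10                  # the 10 seconds config window from the notes


class FakeNode:
    """Hypothetical in-memory stand-in for a Redis node holding lists."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # Like LPUSH: newest element goes to the head of the list.
        self.lists.setdefault(key, []).insert(0, value)

    def lrange_all(self, key):
        return list(self.lists.get(key, []))


def report_error(reachable_nodes, failed_id, now):
    """Log '<node id>:<unix time>' on every node the client can still reach."""
    for node in reachable_nodes:
        node.lpush(ERRORS_KEY, f"{failed_id}:{int(now)}")


def needs_election(node, failed_id, now, window=30, threshold=3):
    """True if enough recent reports exist (window/threshold are assumptions)."""
    recent = 0
    for entry in node.lrange_all(ERRORS_KEY):
        node_id, ts = entry.split(":")
        if int(node_id) == failed_id and now - int(ts) <= window:
            recent += 1
    return recent >= threshold


def config_sha1(config):
    """SHA1 over a canonical rendering of the config dict."""
    return hashlib.sha1(repr(sorted(config.items())).encode()).hexdigest()


def elect(store, new_master, sleep=time.sleep):
    """Two-step switch: publish a hint, wait out the window, then commit."""
    # Step 1: write the hint and change the SHA1 so clients notice it.
    store["config"] = dict(store["config"], hint=new_master)
    store["sha1"] = config_sha1(store["config"])
    # Wait for more than 10 seconds so every client refreshes its config.
    sleep(REFRESH_SECS + 1)
    # Step 2: definitive update, the elected slave becomes the master.
    store["config"] = {"master": new_master}
    store["sha1"] = config_sha1(store["config"])
```

A client would call report_error() on each connection failure, poll
needs_election() on any reachable node, and call elect() once it decides to
act. The sketch leaves open what the notes themselves leave open: how to make
sure only one client performs the election.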