[redis.git] / design-documents / REDIS-CLUSTER

Redis Cluster Design Proposal (work in progress)

28 Nov 2010: Ver 1.0 - initial version
22 APr 2010: Ver 1.1 - more details and rationales

Overview
========

Redis is a fast key-value store supporting complex aggregate data types as
values. For instance keys can be bound to lists with many elements, sets,
sub-dictionaries (hashes) and so forth.

While Redis is very fast, currently it lacks scalability in the form of ability
to transparently run across different nodes. This is desirable mainly for the
following three rasons:

A) Fault tolerance. Some node may go off line without affecting the operations.
B) Holding bigger datasets without using a single box with a lot of RAM.
C) Scaling writes.

Since a single Redis instance supports 140,000 operations per second in a good
Linux box costing less than $1000, the need for Redis Cluster arises more
from "A" and "B". Scaling writes can also be useful in very high load
environments. Scaling reads is already easily accomplished using Redis built-in
replication.

Design goals
============

Designing a DHT in 2010 is hard as there is too much bias towards good designs
that are already well tested in practice, like the Amazon Dynamo design.
Still a Dynamo alike DHT may not be the best fit for Redis.

Redis is very simple and fast at its core, so Redis cluster should try to
follow the same guidelines. The first problem with a Dynamo-alike DHT is that
Redis supports complex data types. Merging complex values like lsits, where
in the case of a netsplit may diverge in very complex ways, is not going to
be easy. The "most recent data" wins is not applicable and all the resolution
business should be in the application.

Even a simple application can end up with complex schema of keys and complex
values. Writing code in order to resolve conflicts is not going to be
programmer friendly.

So the author of this document claims that Redis does not need to resist to
netsplits, but it is enough to resist to M-1 nodes going offline, where
M is the number of nodes storing every key-value pair.

For instance in a three nodes cluster I may configure the cluster in order to
store every key into two instances (M=2). Such a cluster can resist to a single
node going offline without interruption of the service.

When more than M-1 nodes are off line the cluster should detect such a condition
and refusing any further query. The system administrator should check why
M-1 nodes are offline and bring them back again if possible.

Once resisting to big net splits is no longer a requirement as there is no
conflict resolution stage, since at least an original node responsible of
holding every possible key must be online for the cluster to work, there is
also no need for a design where every node can act as an independent entity
receiving queries and forwarding this queries to other nodes as needed.

Instead a more decoupled approach can be used, in the form of a Redis Proxy
node (or multiple Proxy nodes) that is contacted by clients, and
is responsible of forwarding queries and replies back and forth from data nodes.

Data nodes can be just vanilla redis-server instances.

Network layout
==============

 - One ore more Data Nodes. Every node is identified by ip:port.
 - A single Configuration Node.
 - One more more Proxy Nodes (redis-cluster nodes).
 - A single Handling Node.

Data Nodes and the Configuration Node are just vanilla redis-server instances.

Configuration Node
==================

 - Contains information about all the Data nodes in the cluster.
 - Contains information about all the Proxy nodes in the cluster.
 - Contains information about what Data Node holds a given sub-space of keys.

The keyspace is divided into 1024 different "hashing slots".
(1024 is just an example, this value should be configurable)

Given a key perform SHA1(key) and use the last 10 bits of the result to get a 10 bit number representing the "key slot" (from 0 to 1023).

The Configuration node maps every slot of the keyspace to M different Data Nodes (every key is stored into M nodes, configurable).

The Configuration node can be modified by a single client at a time. Locking is performed using SETNX.

The Configuration node should be replicated as there is a single configuration node for the whole network. It is the only single point of failure of the system.
When a Configuration node fails the cluster does not stop operating, but is not
able to recover if there is some exceptional condition to handle, like a Data
Node going off line or the addition of a new Data Node to the cluster.

The Configuration node is a standard Redis server, like every other Data node.

Data Nodes
==========

Data nodes just hold data, and are normal Redis processes. There is no configuration stored on nodes, nor the nodes are "active" in the cluster, they just receive normal Redis commands.

Proxy Nodes
===========

Proxy nodes get requests from clients and route this requests to the right Redis nodes.

Proxy nodes take persistent connections to all the Data Nodes and the
Configuration Node. This connections are keep alive with PING requests from time
to time if there is no traffic. This way Proxy Nodes can understand asap if
there is a problem in some Data Node or in the Configuration Node.

When a Proxy Node is started it needs to know the Configuration node address in order to load the infomration about the Data nodes and the mapping between the key space and the nodes.

On startup a Proxy Node will also register itself in the Configuration node, and will make sure to refresh it's configuration every N seconds (via an EXPIREing key) so that it's possible to detect when a Proxy node fails.

Clients can submit queries to any Proxy Node, so well designed clients may ask
at startup the list of Proxy Nodes querying the Configuration Node. Then if
a query fails against a given Proxy Node it can be retried against the next.

The Proxy Node is also in charge of signaling failing Data nodes to the Configuration node, so that the Handling Node can take appropriate actions.

When a new Data node joins or leaves the cluster, and in general when the cluster configuration changes, all the Proxy nodes will receive a notification and will reload the configuration from the Configuration node.

Proxy Nodes - how queries are submited
======================================

This is how a query is processed:

1) A client sends a query to a Proxy Node, using the Redis protocol like if it was a plain Redis Node.
2) The Proxy Node inspects the command arguments to detect the key. The key is hashed. The Proxy Node has the table mapping a given key to M nodes, and persistent connections to all the nodes.

At this point the process is different in case of read or write queries:

WRITE QUERY:

3a) The Proxy Node forwards the query to M Data Nodes at the same time, waiting for replies.
3b) Once all the replies are received the Proxy Node checks that the replies are consistent. For instance all the M nodes need to reply with OK and so forth. If the query fails in a subset of nodes but succeeds in other nodes, the failing nodes are considered unreliable and are put off line notifying the configuration node.
3c) The reply is transfered back to the client.

READ QUERY:

3d) The Proxy Node forwards the query to a single random client, passing the reply back to the client.

Handling Node
=============

The handling node is a special Redis client with the following role:

 - Handles the cluster configuration stored in the Config node.
 - Is in charge for adding and removing nodes dynamically from the net.
 - Relocates keys on nodes additions / removal.
 - Signal a configuration change to Proxy nodes.

More details on hashing slots
============================

The Configuration node holds 1024 keys in the following form:

    hashingslot:0
    hashingslot:1
    ...
    hashingslot:1023

Every hashing slot is actually a Redis list, containing a single or more ip:port pairs. For instance:

    hashingslot:10 => 192.168.1.19:6379, 192.168.1.200:6379

This mean that keys hashing to slot 10 will be saved in the two Data nodes  192.168.1.19:6379 and 192.168.1.200:6379.

When a client performs a read operation (via a proxy node), the proxy will contact a random Data node among the data nodes in charge for the given slot.

For instance a client can ask for the following operation to a given Proxy node:

    GET mykey

"mykey" hashes to (for instance) slot 10, so the Proxy will forward the request to either Data node 192.168.1.19:6379 or 192.168.1.200:6379, and then forward back the reply to the client.

When a write operation is performed, it is forwarded to both the Data nodes in the example (and in general to all the data nodes).

Adding or removing a node
=========================

When a Data node is added to the cluster, it is added via an LPUSH operation into a Redis list representing a queue of Data nodes that are ready to enter the cluster. This list is hold by the Configuration node of course, and can be added manually or via a configuration utility.

    LPUSH newnodes 192.168.1.55:6379

The Handling node will check from time to time for this new elements in the "newode" list. If there are new nodes pending to enter the cluster, they are processed one after the other in this way:

For instance let's assume there are already two Data nodes in the cluster:

    192.168.1.1:6379
    192.168.1.2:6379

We add a new node 192.168.1.3:6379 via the LPUSH operation.

We can imagine that the 1024 hash slots are assigned equally among the two inital nodes. In order to add the new (third) node what we have to do is to move incrementally 341 slots form the two old servers to the new one.

For now we can think that every hash slot is only stored in a single server, to generalize the idea later.

In order to simplify the implementation every slot can be moved from one Data node to another one in a blocking way, that is, read operations will continue to all the 1024 slots, but a single slot at a time will delay write operations until the moving from one Data node to another is completed.

In order to do so the Handler node, before to move a given node, marks it as "write-locked" in the Configuration server, than asks all the Proxy nodes to refresh the configuration.

Then the slot is moved (1/1024 of all the keys). The Configuration server is modified to reflect the new hashing slots configuration, the slot is unlocked, the Proxy nodes notified.

Implementation details
======================

To run the Handling node and the Configuration node in the same physical computer is probably a good idea.
Commit	Line	Data
9c21a518	1	Redis Cluster Design Proposal (work in progress)
9c21a518	2
34b8a559	3	28 Nov 2010: Ver 1.0 - initial version
	4	22 APr 2010: Ver 1.1 - more details and rationales
	5
	6	Overview
	7	========
	8
	9	Redis is a fast key-value store supporting complex aggregate data types as
ffcc5608	10	values. For instance keys can be bound to lists with many elements, sets,
ffcc5608	11	sub-dictionaries (hashes) and so forth.
34b8a559	12
	13	While Redis is very fast, currently it lacks scalability in the form of ability
	14	to transparently run across different nodes. This is desirable mainly for the
	15	following three rasons:
	16
	17	A) Fault tolerance. Some node may go off line without affecting the operations.
	18	B) Holding bigger datasets without using a single box with a lot of RAM.
	19	C) Scaling writes.
	20
	21	Since a single Redis instance supports 140,000 operations per second in a good
	22	Linux box costing less than $1000, the need for Redis Cluster arises more
	23	from "A" and "B". Scaling writes can also be useful in very high load
	24	environments. Scaling reads is already easily accomplished using Redis built-in
	25	replication.
	26
	27	Design goals
	28	============
	29
	30	Designing a DHT in 2010 is hard as there is too much bias towards good designs
	31	that are already well tested in practice, like the Amazon Dynamo design.
	32	Still a Dynamo alike DHT may not be the best fit for Redis.
	33
	34	Redis is very simple and fast at its core, so Redis cluster should try to
	35	follow the same guidelines. The first problem with a Dynamo-alike DHT is that
	36	Redis supports complex data types. Merging complex values like lsits, where
	37	in the case of a netsplit may diverge in very complex ways, is not going to
	38	be easy. The "most recent data" wins is not applicable and all the resolution
	39	business should be in the application.
	40
	41	Even a simple application can end up with complex schema of keys and complex
	42	values. Writing code in order to resolve conflicts is not going to be
	43	programmer friendly.
	44
	45	So the author of this document claims that Redis does not need to resist to
	46	netsplits, but it is enough to resist to M-1 nodes going offline, where
	47	M is the number of nodes storing every key-value pair.
	48
	49	For instance in a three nodes cluster I may configure the cluster in order to
	50	store every key into two instances (M=2). Such a cluster can resist to a single
	51	node going offline without interruption of the service.
	52
	53	When more than M-1 nodes are off line the cluster should detect such a condition
	54	and refusing any further query. The system administrator should check why
	55	M-1 nodes are offline and bring them back again if possible.
	56
	57	Once resisting to big net splits is no longer a requirement as there is no
7accafbb	58	conflict resolution stage, since at least an original node responsible of
	59	holding every possible key must be online for the cluster to work, there is
	60	also no need for a design where every node can act as an independent entity
	61	receiving queries and forwarding this queries to other nodes as needed.
34b8a559	62
	63	Instead a more decoupled approach can be used, in the form of a Redis Proxy
	64	node (or multiple Proxy nodes) that is contacted by clients, and
	65	is responsible of forwarding queries and replies back and forth from data nodes.
	66
	67	Data nodes can be just vanilla redis-server instances.
	68
9c21a518	69	Network layout
	70	==============
	71
34b8a559	72	- One ore more Data Nodes. Every node is identified by ip:port.
	73	- A single Configuration Node.
	74	- One more more Proxy Nodes (redis-cluster nodes).
	75	- A single Handling Node.
	76
	77	Data Nodes and the Configuration Node are just vanilla redis-server instances.
9c21a518	78
	79	Configuration Node
	80	==================
	81
34b8a559	82	- Contains information about all the Data nodes in the cluster.
	83	- Contains information about all the Proxy nodes in the cluster.
	84	- Contains information about what Data Node holds a given sub-space of keys.
9c21a518	85
9c21a518	86	The keyspace is divided into 1024 different "hashing slots".
34b8a559	87	(1024 is just an example, this value should be configurable)
9c21a518	88
34b8a559	89	Given a key perform SHA1(key) and use the last 10 bits of the result to get a 10 bit number representing the "key slot" (from 0 to 1023).
9c21a518	90
34b8a559	91	The Configuration node maps every slot of the keyspace to M different Data Nodes (every key is stored into M nodes, configurable).
9c21a518	92
	93	The Configuration node can be modified by a single client at a time. Locking is performed using SETNX.
	94
34b8a559	95	The Configuration node should be replicated as there is a single configuration node for the whole network. It is the only single point of failure of the system.
	96	When a Configuration node fails the cluster does not stop operating, but is not
	97	able to recover if there is some exceptional condition to handle, like a Data
	98	Node going off line or the addition of a new Data Node to the cluster.
9c21a518	99
	100	The Configuration node is a standard Redis server, like every other Data node.
	101
	102	Data Nodes
	103	==========
	104
	105	Data nodes just hold data, and are normal Redis processes. There is no configuration stored on nodes, nor the nodes are "active" in the cluster, they just receive normal Redis commands.
	106
	107	Proxy Nodes
	108	===========
	109
	110	Proxy nodes get requests from clients and route this requests to the right Redis nodes.
	111
34b8a559	112	Proxy nodes take persistent connections to all the Data Nodes and the
	113	Configuration Node. This connections are keep alive with PING requests from time
	114	to time if there is no traffic. This way Proxy Nodes can understand asap if
	115	there is a problem in some Data Node or in the Configuration Node.
	116
	117	When a Proxy Node is started it needs to know the Configuration node address in order to load the infomration about the Data nodes and the mapping between the key space and the nodes.
9c21a518	118
34b8a559	119	On startup a Proxy Node will also register itself in the Configuration node, and will make sure to refresh it's configuration every N seconds (via an EXPIREing key) so that it's possible to detect when a Proxy node fails.
9c21a518	120
34b8a559	121	Clients can submit queries to any Proxy Node, so well designed clients may ask
	122	at startup the list of Proxy Nodes querying the Configuration Node. Then if
	123	a query fails against a given Proxy Node it can be retried against the next.
	124
	125	The Proxy Node is also in charge of signaling failing Data nodes to the Configuration node, so that the Handling Node can take appropriate actions.
9c21a518	126
	127	When a new Data node joins or leaves the cluster, and in general when the cluster configuration changes, all the Proxy nodes will receive a notification and will reload the configuration from the Configuration node.
	128
34b8a559	129	Proxy Nodes - how queries are submited
	130	======================================
	131
	132	This is how a query is processed:
	133
	134	1) A client sends a query to a Proxy Node, using the Redis protocol like if it was a plain Redis Node.
	135	2) The Proxy Node inspects the command arguments to detect the key. The key is hashed. The Proxy Node has the table mapping a given key to M nodes, and persistent connections to all the nodes.
	136
	137	At this point the process is different in case of read or write queries:
	138
	139	WRITE QUERY:
	140
	141	3a) The Proxy Node forwards the query to M Data Nodes at the same time, waiting for replies.
	142	3b) Once all the replies are received the Proxy Node checks that the replies are consistent. For instance all the M nodes need to reply with OK and so forth. If the query fails in a subset of nodes but succeeds in other nodes, the failing nodes are considered unreliable and are put off line notifying the configuration node.
	143	3c) The reply is transfered back to the client.
	144
	145	READ QUERY:
	146
	147	3d) The Proxy Node forwards the query to a single random client, passing the reply back to the client.
	148
9c21a518	149	Handling Node
	150	=============
	151
	152	The handling node is a special Redis client with the following role:
	153
34b8a559	154	- Handles the cluster configuration stored in the Config node.
	155	- Is in charge for adding and removing nodes dynamically from the net.
	156	- Relocates keys on nodes additions / removal.
	157	- Signal a configuration change to Proxy nodes.
9c21a518	158
	159	More details on hashing slots
	160	============================
	161
	162	The Configuration node holds 1024 keys in the following form:
	163
	164	hashingslot:0
	165	hashingslot:1
	166	...
	167	hashingslot:1023
	168
	169	Every hashing slot is actually a Redis list, containing a single or more ip:port pairs. For instance:
	170
	171	hashingslot:10 => 192.168.1.19:6379, 192.168.1.200:6379
	172
	173	This mean that keys hashing to slot 10 will be saved in the two Data nodes 192.168.1.19:6379 and 192.168.1.200:6379.
	174
	175	When a client performs a read operation (via a proxy node), the proxy will contact a random Data node among the data nodes in charge for the given slot.
	176
	177	For instance a client can ask for the following operation to a given Proxy node:
	178
	179	GET mykey
	180
	181	"mykey" hashes to (for instance) slot 10, so the Proxy will forward the request to either Data node 192.168.1.19:6379 or 192.168.1.200:6379, and then forward back the reply to the client.
	182
	183	When a write operation is performed, it is forwarded to both the Data nodes in the example (and in general to all the data nodes).
	184
	185	Adding or removing a node
	186	=========================
	187
	188	When a Data node is added to the cluster, it is added via an LPUSH operation into a Redis list representing a queue of Data nodes that are ready to enter the cluster. This list is hold by the Configuration node of course, and can be added manually or via a configuration utility.
	189
	190	LPUSH newnodes 192.168.1.55:6379
	191
	192	The Handling node will check from time to time for this new elements in the "newode" list. If there are new nodes pending to enter the cluster, they are processed one after the other in this way:
	193
	194	For instance let's assume there are already two Data nodes in the cluster:
	195
	196	192.168.1.1:6379
	197	192.168.1.2:6379
	198
	199	We add a new node 192.168.1.3:6379 via the LPUSH operation.
	200
	201	We can imagine that the 1024 hash slots are assigned equally among the two inital nodes. In order to add the new (third) node what we have to do is to move incrementally 341 slots form the two old servers to the new one.
	202
	203	For now we can think that every hash slot is only stored in a single server, to generalize the idea later.
	204
	205	In order to simplify the implementation every slot can be moved from one Data node to another one in a blocking way, that is, read operations will continue to all the 1024 slots, but a single slot at a time will delay write operations until the moving from one Data node to another is completed.
	206
	207	In order to do so the Handler node, before to move a given node, marks it as "write-locked" in the Configuration server, than asks all the Proxy nodes to refresh the configuration.
	208
	209	Then the slot is moved (1/1024 of all the keys). The Configuration server is modified to reflect the new hashing slots configuration, the slot is unlocked, the Proxy nodes notified.
	210
	211	Implementation details
	212	======================
	213
9c21a518	214	To run the Handling node and the Configuration node in the same physical computer is probably a good idea.