Add more clustering documentation 76/50576/2
author Lorand Jakab <lojakab@cisco.com>
Fri, 6 Jan 2017 22:19:25 +0000 (00:19 +0200)
committer Lori Jakab <lorand.jakab@gmail.com>
Tue, 17 Jan 2017 21:25:28 +0000 (21:25 +0000)
Describe how to explore shard information, and how to use that
information to manage voting states and implement a geo-distributed
active/backup setup.

Change-Id: Ia814ea393db1ba05edde2dcb132ec5ea6c52cce3
Signed-off-by: Lorand Jakab <lojakab@cisco.com>
(cherry picked from commit 4981b76b91102a44499ff70e6f20b9e12d516d4c)

docs/getting-started-guide/common-features/clustering.rst

index c4d76b89be36cf47d320dbd83d1c420f2315d50b..ee7f45077e9416a5239890d65f8a99c152eb0a99 100755 (executable)
@@ -364,3 +364,284 @@ Example::
     bin/set_persistence.sh off
 
 The above command will disable the config datastore persistence.
+
+Cluster Monitoring
+------------------
+
+OpenDaylight exposes shard information via MBeans, which can be explored with
+JConsole, VisualVM, or other JMX clients, or exposed via a REST API using
+`Jolokia <https://jolokia.org/features-nb.html>`_, provided by the
+``odl-jolokia`` Karaf feature. The REST API is particularly convenient, given
+OpenDaylight's strong focus on REST. If the feature is not available, Jolokia
+can be loaded by installing the bundle directly::
+
+    bundle:install mvn:org.jolokia/jolokia-osgi/1.3.5
+
+The following URI lists the schema of all available MBeans, but not their
+content::
+
+    GET  /jolokia/list
+
+To read information about the shards local to the queried OpenDaylight
+instance, use the following REST calls. For the config datastore::
+
+    GET  /jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=ShardManager,name=shard-manager-config
+
+For the operational datastore::
+
+    GET  /jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational
+
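+As a concrete example, the same read can be done with ``curl``. This is a
+minimal sketch, assuming the default RESTCONF port 8181 and the default
+``admin``/``admin`` credentials::
+
+    # query the local operational shard manager via Jolokia
+    curl -u admin:admin \
+      "http://localhost:8181/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational"
+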
+The output contains information on shards present on the node::
+
+    {
+      "request": {
+        "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
+        "type": "read"
+      },
+      "value": {
+        "LocalShards": [
+          "member-1-shard-default-operational",
+          "member-1-shard-entity-ownership-operational",
+          "member-1-shard-topology-operational",
+          "member-1-shard-inventory-operational",
+          "member-1-shard-toaster-operational"
+        ],
+        "SyncStatus": true,
+        "MemberName": "member-1"
+      },
+      "timestamp": 1483738005,
+      "status": 200
+    }
+
+The exact names from the ``LocalShards`` list are needed for further
+exploration, as they are used as part of the MBean URI
+(``Category=Shards,name=<shard-name>``) to look up detailed information on a
+particular shard. Example output for ``member-1-shard-default-operational``
+looks like this::
+
+    {
+      "request": {
+        "mbean": "org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore",
+        "type": "read"
+      },
+      "value": {
+        "ReadWriteTransactionCount": 0,
+        "SnapshotIndex": 4,
+        "InMemoryJournalLogSize": 1,
+        "ReplicatedToAllIndex": 4,
+        "Leader": "member-1-shard-default-operational",
+        "LastIndex": 5,
+        "RaftState": "Leader",
+        "LastCommittedTransactionTime": "2017-01-06 13:19:00.135",
+        "LastApplied": 5,
+        "LastLeadershipChangeTime": "2017-01-06 13:18:37.605",
+        "LastLogIndex": 5,
+        "PeerAddresses": "member-3-shard-default-operational: akka.tcp://opendaylight-cluster-data@192.168.16.3:2550/user/shardmanager-operational/member-3-shard-default-operational, member-2-shard-default-operational: akka.tcp://opendaylight-cluster-data@192.168.16.2:2550/user/shardmanager-operational/member-2-shard-default-operational",
+        "WriteOnlyTransactionCount": 0,
+        "FollowerInitialSyncStatus": false,
+        "FollowerInfo": [
+          {
+            "timeSinceLastActivity": "00:00:00.320",
+            "active": true,
+            "matchIndex": 5,
+            "voting": true,
+            "id": "member-3-shard-default-operational",
+            "nextIndex": 6
+          },
+          {
+            "timeSinceLastActivity": "00:00:00.320",
+            "active": true,
+            "matchIndex": 5,
+            "voting": true,
+            "id": "member-2-shard-default-operational",
+            "nextIndex": 6
+          }
+        ],
+        "FailedReadTransactionsCount": 0,
+        "StatRetrievalTime": "810.5 μs",
+        "Voting": true,
+        "CurrentTerm": 1,
+        "LastTerm": 1,
+        "FailedTransactionsCount": 0,
+        "PendingTxCommitQueueSize": 0,
+        "VotedFor": "member-1-shard-default-operational",
+        "SnapshotCaptureInitiated": false,
+        "CommittedTransactionsCount": 6,
+        "TxCohortCacheSize": 0,
+        "PeerVotingStates": "member-3-shard-default-operational: true, member-2-shard-default-operational: true",
+        "LastLogTerm": 1,
+        "StatRetrievalError": null,
+        "CommitIndex": 5,
+        "SnapshotTerm": 1,
+        "AbortTransactionsCount": 0,
+        "ReadOnlyTransactionCount": 0,
+        "ShardName": "member-1-shard-default-operational",
+        "LeadershipChangeCount": 1,
+        "InMemoryJournalDataSize": 450
+      },
+      "timestamp": 1483740350,
+      "status": 200
+    }
+
+The output helps identify the shard's state (leader/follower, voting/non-voting),
+its peers, follower details if the shard is a leader, and other
+statistics/counters.
+
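+For quick checks from the command line, the interesting attributes can be
+extracted from the above output. This is a minimal sketch, assuming ``jq`` is
+installed and using the same default port and credentials as above::
+
+    # print the RAFT role, voting state and current leader of the
+    # default operational shard on member-1
+    curl -s -u admin:admin \
+      "http://localhost:8181/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=Shards,name=member-1-shard-default-operational" \
+      | jq '.value | {RaftState, Voting, Leader}'
+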
+The Integration team maintains a Python-based `tool
+<https://github.com/opendaylight/integration-test/tree/master/tools/clustering/cluster-monitor>`_
+that takes advantage of the above MBeans exposed via Jolokia, and the
+*systemmetrics* project offers a DLUX-based UI to display the same
+information.
+
+Geo-distributed Active/Backup Setup
+-----------------------------------
+
+An OpenDaylight cluster works best when the latency between the nodes is very
+small, which in practice means they should be in the same datacenter. It is,
+however, desirable to be able to fail over to a different datacenter in case
+all nodes become unreachable. To achieve that, the cluster can be expanded
+with nodes in a different datacenter, but in a way that does not affect the
+latency of the primary nodes: shards on the backup nodes must be in the
+"non-voting" state.
+
+The API to manipulate voting states on shards is defined as RPCs in the
+`cluster-admin.yang <https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/sal-cluster-admin-api/src/main/yang/cluster-admin.yang>`_
+file in the *controller* project, which is well documented. A summary is
+provided below.
+
+.. note::
+
+  Unless otherwise indicated, the below POST requests are to be sent to any
+  single cluster node.
+
+To create an active/backup setup with a 6-node cluster (3 active and 3 backup
+nodes in two locations), there is an RPC that sets the voting state of all
+shards on a list of nodes to a given value::
+
+   POST  /restconf/operations/cluster-admin:change-member-voting-states-for-all-shards
+
+This RPC takes the list of nodes and the desired voting state as input. To
+create the backup nodes, this example input can be used::
+
+    {
+      "input": {
+        "member-voting-state": [
+          {
+            "member-name": "member-4",
+            "voting": false
+          },
+          {
+            "member-name": "member-5",
+            "voting": false
+          },
+          {
+            "member-name": "member-6",
+            "voting": false
+          }
+        ]
+      }
+    }
+
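+As an illustration, the RPC can be invoked with ``curl``. This is a sketch
+that assumes the above input is saved in a (hypothetical) file named
+``backup-members.json``, with the default port and credentials::
+
+    curl -u admin:admin -H "Content-Type: application/json" \
+      -d @backup-members.json \
+      "http://localhost:8181/restconf/operations/cluster-admin:change-member-voting-states-for-all-shards"
+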
+When an active/backup deployment already exists, with shards on the backup
+nodes in non-voting state, all that is needed for a fail-over from the active
+"sub-cluster" to the backup "sub-cluster" is to flip the voting state of each
+shard (on each node, active AND backup). That can easily be achieved with the
+following RPC call (no parameters needed)::
+
+    POST  /restconf/operations/cluster-admin:flip-member-voting-states-for-all-shards
+
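+As a sketch, with the default port and credentials, this can be invoked as an
+empty POST::
+
+    curl -u admin:admin -X POST \
+      "http://localhost:8181/restconf/operations/cluster-admin:flip-member-voting-states-for-all-shards"
+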
+In case of an unplanned outage where the primary voting nodes are down, the
+"flip" RPC must be sent to a backup (non-voting) node, because there are no
+shard leaders to carry out the voting changes. A special case applies here:
+if the node that receives the RPC is non-voting, is to be changed to voting,
+and there is no leader, it applies the voting changes locally and attempts to
+become the leader. If successful, it persists the voting changes and
+replicates them to the remaining nodes.
+
+When the primary site is fixed and you want to fail back to it, care must be
+taken when bringing the site back up. Because it was down when the voting
+states were flipped on the secondary, its persisted database won't contain
+those changes. If brought back up in that state, the nodes will think they are
+still voting. If the nodes have connectivity to the secondary site, they
+should follow the leader in the secondary site and sync with it. However, if
+this does not happen, the primary site may elect its own leader, thereby
+partitioning the two sub-clusters, which can lead to undesirable results.
+Therefore it is recommended to either clean the databases (i.e., the
+``journal`` and ``snapshots`` directories) on the primary nodes before
+bringing them back up, or to restore them from a recent backup of the
+secondary site (see section :ref:`cluster_backup_restore` below).
+
+It is also possible to gracefully remove a node from a cluster, with the
+following RPC::
+
+    POST  /restconf/operations/cluster-admin:remove-all-shard-replicas
+
+and example input::
+
+    {
+      "input": {
+        "member-name": "member-1"
+      }
+    }
+
+or to remove just one particular shard replica::
+
+    POST  /restconf/operations/cluster-admin:remove-shard-replica
+
+with example input::
+
+    {
+      "input": {
+        "shard-name": "default",
+        "member-name": "member-2",
+        "data-store-type": "config"
+      }
+    }
+
+Now that the (potentially dead/unrecoverable) node has been removed, another
+one can be added at runtime, without changing the configuration files of the
+healthy nodes (which would require a restart)::
+
+    POST  /restconf/operations/cluster-admin:add-replicas-for-all-shards
+
+No input is required, but this RPC must be sent to the new node, to instruct
+it to replicate all shards from the cluster.
+
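+As a sketch, assuming the new node is reachable at ``192.168.16.7`` (a
+hypothetical address) and uses the default port and credentials, the RPC can
+be sent directly to it::
+
+    curl -u admin:admin -X POST \
+      "http://192.168.16.7:8181/restconf/operations/cluster-admin:add-replicas-for-all-shards"
+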
+.. _cluster_backup_restore:
+
+Backing Up and Restoring the Datastore
+--------------------------------------
+
+The same cluster-admin API that is used above for managing shard voting states
+has an RPC for backing up the datastore on a single node, taking only the
+file name as a parameter::
+
+    POST  /restconf/operations/cluster-admin:backup-datastore
+
+RPC input JSON::
+
+    {
+      "input": {
+        "file-path": "/tmp/datastore_backup"
+      }
+    }
+
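+As a sketch with the default port and credentials, the input can also be
+passed inline::
+
+    curl -u admin:admin -H "Content-Type: application/json" \
+      -d '{"input": {"file-path": "/tmp/datastore_backup"}}' \
+      "http://localhost:8181/restconf/operations/cluster-admin:backup-datastore"
+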
+.. note::
+
+  This backup can only be restored if the YANG models of the backed-up data
+  are identical on the OpenDaylight instance where the backup was taken and
+  on the restore target instance.
+
+To restore the backup on the target node the file needs to be placed into the
+``$KARAF_HOME/clustered-datastore-restore`` directory, and then the node
+restarted. If the directory does not exist (which is quite likely if this is a
+first-time restore) it needs to be created. On startup, ODL checks if the
+``journal`` and ``snapshots`` directories in ``$KARAF_HOME`` are empty, and
+only then tries to read the contents of the ``clustered-datastore-restore``
+directory, if it exists. So for a successful restore, those two directories
+should be empty. The backup file name itself does not matter, and the startup
+process will delete it after a successful restore.
+
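+As a sketch, assuming a backup file copied to ``/tmp/datastore_backup`` on the
+target node, the restore preparation could look like this (run from
+``$KARAF_HOME`` while the node is stopped)::
+
+    # create the restore directory if it does not exist yet
+    mkdir -p clustered-datastore-restore
+    cp /tmp/datastore_backup clustered-datastore-restore/
+    # journal and snapshots must be empty for the restore to be attempted
+    rm -rf journal snapshots
+    # start the node; the backup file is deleted after a successful restore
+    bin/start
+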
+The backup is node-independent, so when restoring a 3-node cluster, it is best
+to restore it on each node for consistency. For example, if restoring on one
+node only, it can happen that the other two empty nodes form a majority and
+the cluster comes up with no data.