docs/getting-started-guide/common-features/clustering.rst

   1 Setting Up Clustering
   2 =====================
   3
   4 Clustering Overview
   5 -------------------
   6
   7 Clustering is a mechanism that enables multiple processes and programs to work
   8 together as one entity.  For example, when you search for something on
   9 google.com, it may seem like your search request is processed by only one web
  10 server. In reality, your search request is processed by may web servers
  11 connected in a cluster. Similarly, you can have multiple instances of
  12 OpenDaylight working together as one entity.
  13
  14 Advantages of clustering are:
  15
  16 * Scaling: If you have multiple instances of OpenDaylight running, you can
  17   potentially do more work and store more data than you could with only one
  18   instance. You can also break up your data into smaller chunks (shards) and
  19   either distribute that data across the cluster or perform certain operations
  20   on certain members of the cluster.
  21 * High Availability: If you have multiple instances of OpenDaylight running and
  22   one of them crashes, you will still have the other instances working and
  23   available.
  24 * Data Persistence: You will not lose any data stored in OpenDaylight after a
  25   manual restart or a crash.
  26
  27 The following sections describe how to set up clustering on both individual and
  28 multiple OpenDaylight instances.
  29
  30 Multiple Node Clustering
  31 ------------------------
  32
  33 The following sections describe how to set up multiple node clusters in OpenDaylight.
  34
  35 Deployment Considerations
  36 ^^^^^^^^^^^^^^^^^^^^^^^^^
  37
  38 To implement clustering, the deployment considerations are as follows:
  39
  40 * To set up a cluster with multiple nodes, we recommend that you use a minimum
  41   of three machines. You can set up a cluster with just two nodes. However, if
  42   one of the two nodes fail, the cluster will not be operational.
  43
  44   .. note:: This is because clustering in OpenDaylight requires a majority of the
  45              nodes to be up and one node cannot be a majority of two nodes.
  46
  47 * Every device that belongs to a cluster needs to have an identifier.
  48   OpenDaylight uses the node's ``role`` for this purpose. After you define the
  49   first node's role as *member-1* in the ``akka.conf`` file, OpenDaylight uses
  50   *member-1* to identify that node.
  51
  52 * Data shards are used to contain all or a certain segment of a OpenDaylight's
  53   MD-SAL datastore. For example, one shard can contain all the inventory data
  54   while another shard contains all of the topology data.
  55
  56   If you do not specify a module in the ``modules.conf`` file and do not specify
  57   a shard in ``module-shards.conf``, then (by default) all the data is placed in
  58   the default shard (which must also be defined in ``module-shards.conf`` file).
  59   Each shard has replicas configured. You can specify the details of where the
  60   replicas reside in ``module-shards.conf`` file.
  61
  62 * If you have a three node cluster and would like to be able to tolerate any
  63   single node crashing, a replica of every defined data shard must be running
  64   on all three cluster nodes.
  65
  66   .. note:: This is because OpenDaylight's clustering implementation requires a
  67             majority of the defined shard replicas to be running in order to
  68             function. If you define data shard replicas on two of the cluster nodes
  69             and one of those nodes goes down, the corresponding data shards will not
  70             function.
  71
  72 * If you have a three node cluster and have defined replicas for a data shard
  73   on each of those nodes, that shard will still function even if only two of
  74   the cluster nodes are running. Note that if one of those remaining two nodes
  75   goes down, the shard will not be operational.
  76
  77 * It is  recommended that you have multiple seed nodes configured. After a
  78   cluster member is started, it sends a message to all of its seed nodes.
  79   The cluster member then sends a join command to the first seed node that
  80   responds. If none of its seed nodes reply, the cluster member repeats this
  81   process until it successfully establishes a connection or it is shut down.
  82
  83 * After a node is unreachable, it remains down for configurable period of time
  84   (10 seconds, by default). Once a node goes down, you need to restart it so
  85   that it can rejoin the cluster. Once a restarted node joins a cluster, it
  86   will synchronize with the lead node automatically.
  87
  88 Setting Up a Multiple Node Cluster
  89 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  90
  91 To run OpenDaylight in a three node cluster, perform the following:
  92
  93 First, determine the three machines that will make up the cluster. After that,
  94 do the following on each machine:
  95
  96 #. Copy the OpenDaylight distribution zip file to the machine.
  97 #. Unzip the distribution.
  98 #. Open the following .conf files:
  99
 100    * configuration/initial/akka.conf
 101    * configuration/initial/module-shards.conf
 102
 103 #. In each configuration file, make the following changes:
 104
 105    Find every instance of the following lines and replace _127.0.0.1_ with the
 106    hostname or IP address of the machine on which this file resides and
 107    OpenDaylight will run::
 108
 109       netty.tcp {
 110         hostname = "127.0.0.1"
 111
 112    .. note:: The value you need to specify will be different for each node in the
 113              cluster.
 114
 115 #. Find the following lines and replace _127.0.0.1_ with the hostname or IP
 116    address of any of the machines that will be part of the cluster::
 117
 118       cluster {
 119         seed-nodes = ["akka.tcp://opendaylight-cluster-data@127.0.0.1:2550",
 120                       <url-to-cluster-member-2>,
 121                       <url-to-cluster-member-3>]
 122
 123 #. Find the following section and specify the role for each member node. Here
 124    we assign the first node with the *member-1* role, the second node with the
 125    *member-2* role, and the third node with the *member-3* role::
 126
 127      roles = [
 128        "member-1"
 129      ]
 130
 131    .. note:: This step should use a different role on each node.
 132
 133 #. Open the configuration/initial/module-shards.conf file and update the
 134    replicas so that each shard is replicated to all three nodes::
 135
 136       replicas = [
 137           "member-1",
 138           "member-2",
 139           "member-3"
 140       ]
 141
 142    For reference, view a sample config files <<_sample_config_files,below>>.
 143
 144 #. Move into the +<karaf-distribution-directory>/bin+ directory.
 145 #. Run the following command::
 146
 147       JAVA_MAX_MEM=4G JAVA_MAX_PERM_MEM=512m ./karaf
 148
 149 #. Enable clustering by running the following command at the Karaf command line::
 150
 151       feature:install odl-mdsal-clustering
 152
 153 OpenDaylight should now be running in a three node cluster. You can use any of
 154 the three member nodes to access the data residing in the datastore.
 155
 156 Sample Config Files
 157 """""""""""""""""""
 158
 159 Sample ``akka.conf`` file::
 160
 161    odl-cluster-data {
 162      bounded-mailbox {
 163        mailbox-type = "org.opendaylight.controller.cluster.common.actor.MeteredBoundedMailbox"
 164        mailbox-capacity = 1000
 165        mailbox-push-timeout-time = 100ms
 166      }
 167
 168      metric-capture-enabled = true
 169
 170      akka {
 171        loglevel = "DEBUG"
 172        loggers = ["akka.event.slf4j.Slf4jLogger"]
 173
 174        actor {
 175
 176          provider = "akka.cluster.ClusterActorRefProvider"
 177          serializers {
 178                    java = "akka.serialization.JavaSerializer"
 179                    proto = "akka.remote.serialization.ProtobufSerializer"
 180                  }
 181
 182                  serialization-bindings {
 183                      "com.google.protobuf.Message" = proto
 184
 185                  }
 186        }
 187        remote {
 188          log-remote-lifecycle-events = off
 189          netty.tcp {
 190            hostname = "10.194.189.96"
 191            port = 2550
 192            maximum-frame-size = 419430400
 193            send-buffer-size = 52428800
 194            receive-buffer-size = 52428800
 195          }
 196        }
 197
 198        cluster {
 199          seed-nodes = ["akka.tcp://opendaylight-cluster-data@10.194.189.96:2550",
 200                        "akka.tcp://opendaylight-cluster-data@10.194.189.98:2550",
 201                        "akka.tcp://opendaylight-cluster-data@10.194.189.101:2550"]
 202
 203          auto-down-unreachable-after = 10s
 204
 205          roles = [
 206            "member-1"
 207          ]
 208
 209        }
 210      }
 211    }
 212
 213    odl-cluster-rpc {
 214      bounded-mailbox {
 215        mailbox-type = "org.opendaylight.controller.cluster.common.actor.MeteredBoundedMailbox"
 216        mailbox-capacity = 1000
 217        mailbox-push-timeout-time = 100ms
 218      }
 219
 220      metric-capture-enabled = true
 221
 222      akka {
 223        loglevel = "INFO"
 224        loggers = ["akka.event.slf4j.Slf4jLogger"]
 225
 226        actor {
 227          provider = "akka.cluster.ClusterActorRefProvider"
 228
 229        }
 230        remote {
 231          log-remote-lifecycle-events = off
 232          netty.tcp {
 233            hostname = "10.194.189.96"
 234            port = 2551
 235          }
 236        }
 237
 238        cluster {
 239          seed-nodes = ["akka.tcp://opendaylight-cluster-rpc@10.194.189.96:2551"]
 240
 241          auto-down-unreachable-after = 10s
 242        }
 243      }
 244    }
 245
 246 Sample ``module-shards.conf`` file::
 247
 248    module-shards = [
 249        {
 250            name = "default"
 251            shards = [
 252                {
 253                    name="default"
 254                    replicas = [
 255                        "member-1",
 256                        "member-2",
 257                        "member-3"
 258                    ]
 259                }
 260            ]
 261        },
 262        {
 263            name = "topology"
 264            shards = [
 265                {
 266                    name="topology"
 267                    replicas = [
 268                        "member-1",
 269                        "member-2",
 270                        "member-3"
 271                    ]
 272                }
 273            ]
 274        },
 275        {
 276            name = "inventory"
 277            shards = [
 278                {
 279                    name="inventory"
 280                    replicas = [
 281                        "member-1",
 282                        "member-2",
 283                        "member-3"
 284                    ]
 285                }
 286            ]
 287        },
 288        {
 289             name = "toaster"
 290             shards = [
 291                 {
 292                     name="toaster"
 293                     replicas = [
 294                        "member-1",
 295                        "member-2",
 296                        "member-3"
 297                     ]
 298                 }
 299             ]
 300        }
 301    ]
 302
 303 Clustering Scripts
 304 ------------------
 305
 306 OpenDaylight includes some scripts to help with the clustering configuration.
 307
 308 .. note::
 309
 310     Scripts are stored in the OpenDaylight distribution/bin folder, and
 311     maintained in the distribution project
 312     `repository <https://git.opendaylight.org/gerrit/p/integration/distribution>`_
 313     in the folder distribution-karaf/src/main/assembly/bin/.
 314
 315 Configure Cluster Script
 316 ^^^^^^^^^^^^^^^^^^^^^^^^
 317
 318 This script is used to configure the cluster parameters (e.g. akka.conf,
 319 module-shards.conf) on a member of the controller cluster. The user should
 320 restart the node to apply the changes.
 321
 322 .. note::
 323
 324     The script can be used at any time, even before the controller is started
 325     for the first time.
 326
 327 Usage::
 328
 329     bin/configure_cluster.sh <index> <seed_nodes_list>
 330
 331 * index: Integer within 1..N, where N is the number of seed nodes. This indicates
 332   which controller node (1..N) is configured by the script.
 333 * seed_nodes_list: List of seed nodes (IP address), separated by comma or space.
 334
 335 The IP address at the provided index should belong to the member executing
 336 the script. When running this script on multiple seed nodes, keep the
 337 seed_node_list the same, and vary the index from 1 through N.
 338
 339 Optionally, shards can be configured in a more granular way by modifying the
 340 file "custom_shard_configs.txt" in the same folder as this tool. Please see
 341 that file for more details.
 342
 343 Example::
 344
 345     bin/configure_cluster.sh 2 192.168.0.1 192.168.0.2 192.168.0.3
 346
 347 The above command will configure the member 2 (IP address 192.168.0.2) of a
 348 cluster made of 192.168.0.1 192.168.0.2 192.168.0.3.
 349
 350 Set Persistence Script
 351 ^^^^^^^^^^^^^^^^^^^^^^
 352
 353 This script is used to enable or disable the config datastore persistence. The
 354 default state is enabled but there are cases where persistence may not be
 355 required or even desired. The user should restart the node to apply the changes.
 356
 357 .. note::
 358
 359   The script can be used at any time, even before the controller is started
 360   for the first time.
 361
 362 Usage::
 363
 364     bin/set_persistence.sh <on/off>
 365
 366 Example::
 367
 368     bin/set_persistence.sh off
 369
 370 The above command will disable the config datastore persistence.
 371
 372 Cluster Monitoring
 373 ------------------
 374
 375 OpenDaylight exposes shard information via MBeans, which can be explored with
 376 JConsole, VisualVM, or other JMX clients, or exposed via a REST API using
 377 `Jolokia <https://jolokia.org/features-nb.html>`_, provided by the
 378 ``odl-jolokia`` Karaf feature. This is convenient, due to a significant focus
 379 on REST in OpenDaylight.
 380
 381 The basic URI that lists a schema of all available MBeans, but not their
 382 content itself is::
 383
 384     GET  /jolokia/list
 385
 386 To read the information about the shards local to the queried OpenDaylight
 387 instance use the following REST calls. For the config datastore::
 388
 389     GET  /jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=ShardManager,name=shard-manager-config
 390
 391 For the operational datastore::
 392
 393     GET  /jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational
 394
 395 The output contains information on shards present on the node::
 396
 397     {
 398       "request": {
 399         "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
 400         "type": "read"
 401       },
 402       "value": {
 403         "LocalShards": [
 404           "member-1-shard-default-operational",
 405           "member-1-shard-entity-ownership-operational",
 406           "member-1-shard-topology-operational",
 407           "member-1-shard-inventory-operational",
 408           "member-1-shard-toaster-operational"
 409         ],
 410         "SyncStatus": true,
 411         "MemberName": "member-1"
 412       },
 413       "timestamp": 1483738005,
 414       "status": 200
 415     }
 416
 417 The exact names from the "LocalShards" lists are needed for further
 418 exploration, as they will be used as part of the URI to look up detailed info
 419 on a particular shard. An example output for the
 420 ``member-1-shard-default-operational`` looks like this::
 421
 422     {
 423       "request": {
 424         "mbean": "org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore",
 425         "type": "read"
 426       },
 427       "value": {
 428         "ReadWriteTransactionCount": 0,
 429         "SnapshotIndex": 4,
 430         "InMemoryJournalLogSize": 1,
 431         "ReplicatedToAllIndex": 4,
 432         "Leader": "member-1-shard-default-operational",
 433         "LastIndex": 5,
 434         "RaftState": "Leader",
 435         "LastCommittedTransactionTime": "2017-01-06 13:19:00.135",
 436         "LastApplied": 5,
 437         "LastLeadershipChangeTime": "2017-01-06 13:18:37.605",
 438         "LastLogIndex": 5,
 439         "PeerAddresses": "member-3-shard-default-operational: akka.tcp://opendaylight-cluster-data@192.168.16.3:2550/user/shardmanager-operational/member-3-shard-default-operational, member-2-shard-default-operational: akka.tcp://opendaylight-cluster-data@192.168.16.2:2550/user/shardmanager-operational/member-2-shard-default-operational",
 440         "WriteOnlyTransactionCount": 0,
 441         "FollowerInitialSyncStatus": false,
 442         "FollowerInfo": [
 443           {
 444             "timeSinceLastActivity": "00:00:00.320",
 445             "active": true,
 446             "matchIndex": 5,
 447             "voting": true,
 448             "id": "member-3-shard-default-operational",
 449             "nextIndex": 6
 450           },
 451           {
 452             "timeSinceLastActivity": "00:00:00.320",
 453             "active": true,
 454             "matchIndex": 5,
 455             "voting": true,
 456             "id": "member-2-shard-default-operational",
 457             "nextIndex": 6
 458           }
 459         ],
 460         "FailedReadTransactionsCount": 0,
 461         "StatRetrievalTime": "810.5 μs",
 462         "Voting": true,
 463         "CurrentTerm": 1,
 464         "LastTerm": 1,
 465         "FailedTransactionsCount": 0,
 466         "PendingTxCommitQueueSize": 0,
 467         "VotedFor": "member-1-shard-default-operational",
 468         "SnapshotCaptureInitiated": false,
 469         "CommittedTransactionsCount": 6,
 470         "TxCohortCacheSize": 0,
 471         "PeerVotingStates": "member-3-shard-default-operational: true, member-2-shard-default-operational: true",
 472         "LastLogTerm": 1,
 473         "StatRetrievalError": null,
 474         "CommitIndex": 5,
 475         "SnapshotTerm": 1,
 476         "AbortTransactionsCount": 0,
 477         "ReadOnlyTransactionCount": 0,
 478         "ShardName": "member-1-shard-default-operational",
 479         "LeadershipChangeCount": 1,
 480         "InMemoryJournalDataSize": 450
 481       },
 482       "timestamp": 1483740350,
 483       "status": 200
 484     }
 485
 486 The output helps identifying shard state (leader/follower, voting/non-voting),
 487 peers, follower details if the shard is a leader, and other
 488 statistics/counters.
 489
 490 The Integration team is maintaining a Python based `tool
 491 <https://github.com/opendaylight/integration-test/tree/master/tools/clustering/cluster-monitor>`_,
 492 that takes advantage of the above MBeans exposed via Jolokia, and the
 493 *systemmetrics* project offers a DLUX based UI to display the same
 494 information.
 495
 496 Geo-distributed Active/Backup Setup
 497 -----------------------------------
 498
 499 An OpenDaylight cluster works best when the latency between the nodes is very
 500 small, which practically means they should be in the same datacenter. It is
 501 however desirable to have the possibility to fail over to a different
 502 datacenter, in case all nodes become unreachable. To achieve that, the cluster
 503 can be expanded with nodes in a different datacenter, but in a way that
 504 doesn't affect latency of the primary nodes. To do that, shards in the backup
 505 nodes must be in "non-voting" state.
 506
 507 The API to manipulate voting states on shards is defined as RPCs in the
 508 `cluster-admin.yang <https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/sal-cluster-admin-api/src/main/yang/cluster-admin.yang>`_
 509 file in the *controller* project, which is well documented. A summary is
 510 provided below.
 511
 512 .. note::
 513
 514   Unless otherwise indicated, the below POST requests are to be sent to any
 515   single cluster node.
 516
 517 To create an active/backup setup with a 6 node cluster (3 active and 3 backup
 518 nodes in two locations) there is an RPC to set voting states of all shards on
 519 a list of nodes to a given state::
 520
 521    POST  /restconf/operations/cluster-admin:change-member-voting-states-for-all-shards
 522
 523 This RPC needs the list of nodes and the desired voting state as input. For
 524 creating the backup nodes, this example input can be used::
 525
 526     {
 527       "input": {
 528         "member-voting-state": [
 529           {
 530             "member-name": "member-4",
 531             "voting": false
 532           },
 533           {
 534             "member-name": "member-5",
 535             "voting": false
 536           },
 537           {
 538             "member-name": "member-6",
 539             "voting": false
 540           }
 541         ]
 542       }
 543     }
 544
 545 When an active/backup deployment already exists, with shards on the backup
 546 nodes in non-voting state, all that is needed for a fail-over from the active
 547 "sub-cluster" to backup "sub-cluster" is to flip the voting state of each
 548 shard (on each node, active AND backup). That can be easily achieved with the
 549 following RPC call (no parameters needed)::
 550
 551     POST  /restconf/operations/cluster-admin:flip-member-voting-states-for-all-shards
 552
 553 If it's an unplanned outage where the primary voting nodes are down, the
 554 "flip" RPC must be sent to a backup non-voting node. In this case there are no
 555 shard leaders to carry out the voting changes. However there is a special case
 556 whereby if the node that receives the RPC is non-voting and is to be changed
 557 to voting and there's no leader, it will apply the voting changes locally and
 558 attempt to become the leader. If successful, it persists the voting changes
 559 and replicates them to the remaining nodes.
 560
 561 When the primary site is fixed and you want to fail back to it, care must be
 562 taken when bringing the site back up. Because it was down when the voting
 563 states were flipped on the secondary, its persisted database won't contain
 564 those changes. If brought back up in that state, the nodes will think they're
 565 still voting. If the nodes have connectivity to the secondary site, they
 566 should follow the leader in the secondary site and sync with it. However if
 567 this does not happen then the primary site may elect its own leader thereby
 568 partitioning the 2 clusters, which can lead to undesirable results. Therefore
 569 it is recommended to either clean the databases (i.e., ``journal`` and
 570 ``snapshots`` directory) on the primary nodes before bringing them back up or
 571 restore them from a recent backup of the secondary site (see section
 572 :ref:`cluster_backup_restore` below).
 573
 574 If is also possible to gracefully remove a node from a cluster, with the
 575 following RPC::
 576
 577     POST  /restconf/operations/cluster-admin:remove-all-shard-replicas
 578
 579 and example input::
 580
 581     {
 582       "input": {
 583         "member-name": "member-1"
 584       }
 585     }
 586
 587 or just one particular shard::
 588
 589     POST  /restconf/operations/cluster-admin:remove-shard-replica
 590
 591 with example input::
 592
 593     {
 594       "input": {
 595         "shard-name": "default",
 596         "member-name": "member-2",
 597         "data-store-type": "config"
 598       }
 599     }
 600
 601 Now that a (potentially dead/unrecoverable) node was removed, another one can
 602 be added at runtime, without changing the configuration files of the healthy
 603 nodes (requiring reboot)::
 604
 605     POST  /restconf/operations/cluster-admin:add-replicas-for-all-shards
 606
 607 No input required, but this RPC needs to be sent to the new node, to instruct
 608 it to replicate all shards from the cluster.
 609
 610 .. note::
 611
 612   While the cluster admin API allows adding and removing shards dynamically,
 613   the ``module-shard.conf`` and ``modules.conf`` files are still used on
 614   startup to define the initial configuration of shards. Modifications from
 615   the use of the API are not stored to those static files, but to the journal.
 616
 617 .. _cluster_backup_restore:
 618
 619 Backing Up and Restoring the Datastore
 620 --------------------------------------
 621
 622 The same cluster-admin API that is used above for managing shard voting states
 623 has an RPC allowing backup of the datastore in a single node, taking only the
 624 file name as a parameter::
 625
 626     POST  /restconf/operations/cluster-admin:backup-datastore
 627
 628 RPC input JSON::
 629
 630     {
 631       "input": {
 632         "file-path": "/tmp/datastore_backup"
 633       }
 634     }
 635
 636 .. note::
 637
 638   This backup can only be restored if the YANG models of the backed-up data
 639   are identical in the backup OpenDaylight instance and restore target
 640   instance.
 641
 642 To restore the backup on the target node the file needs to be placed into the
 643 ``$KARAF_HOME/clustered-datastore-restore`` directory, and then the node
 644 restarted. If the directory does not exist (which is quite likely if this is a
 645 first-time restore) it needs to be created. On startup, ODL checks if the
 646 ``journal`` and ``snapshots`` directories in ``$KARAF_HOME`` are empty, and
 647 only then tries to read the contents of the ``clustered-datastore-restore``
 648 directory, if it exists. So for a successful restore, those two directories
 649 should be empty. The backup file name itself does not matter, and the startup
 650 process will delete it after a successful restore.
 651
 652 The backup is node independent, so when restoring a 3 node cluster, it is best
 653 to restore it on each node for consistency. For example, if restoring on one
 654 node only, it can happen that the other two empty nodes form a majority and
 655 the cluster comes up with no data.