7 Clustering is a mechanism that enables multiple processes and programs to work
8 together as one entity. For example, when you search for something on
9 google.com, it may seem like your search request is processed by only one web
10 server. In reality, your search request is processed by may web servers
11 connected in a cluster. Similarly, you can have multiple instances of
12 OpenDaylight working together as one entity.
14 Advantages of clustering are:
16 * Scaling: If you have multiple instances of OpenDaylight running, you can
17 potentially do more work and store more data than you could with only one
18 instance. You can also break up your data into smaller chunks (shards) and
19 either distribute that data across the cluster or perform certain operations
20 on certain members of the cluster.
21 * High Availability: If you have multiple instances of OpenDaylight running and
one of them crashes, you will still have the other instances working and
available.
24 * Data Persistence: You will not lose any data stored in OpenDaylight after a
25 manual restart or a crash.
27 The following sections describe how to set up clustering on both individual and
28 multiple OpenDaylight instances.
30 Multiple Node Clustering
31 ------------------------
33 The following sections describe how to set up multiple node clusters in OpenDaylight.
35 Deployment Considerations
36 ^^^^^^^^^^^^^^^^^^^^^^^^^
38 To implement clustering, the deployment considerations are as follows:
40 * To set up a cluster with multiple nodes, we recommend that you use a minimum
41 of three machines. You can set up a cluster with just two nodes. However, if
one of the two nodes fails, the cluster will not be operational.
44 .. note:: This is because clustering in OpenDaylight requires a majority of the
45 nodes to be up and one node cannot be a majority of two nodes.
47 * Every device that belongs to a cluster needs to have an identifier.
48 OpenDaylight uses the node's ``role`` for this purpose. After you define the
49 first node's role as *member-1* in the ``akka.conf`` file, OpenDaylight uses
50 *member-1* to identify that node.
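For example, the relevant section of the first node's ``akka.conf`` would
contain (a minimal sketch of the setting described above)::

   roles = [
     "member-1"
   ]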
* Data shards are used to contain all or a certain segment of OpenDaylight's
53 MD-SAL datastore. For example, one shard can contain all the inventory data
54 while another shard contains all of the topology data.
56 If you do not specify a module in the ``modules.conf`` file and do not specify
57 a shard in ``module-shards.conf``, then (by default) all the data is placed in
the default shard (which must also be defined in the ``module-shards.conf``
file). Each shard has replicas configured. You can specify the details of
where the replicas reside in the ``module-shards.conf`` file.
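As an illustration, a custom module and its shard could be declared as
follows (a sketch following the format of the files shipped in
``configuration/initial``; the module name and namespace shown are the ones
OpenDaylight uses for topology data)::

   # modules.conf
   modules = [
       {
           name = "topology"
           namespace = "urn:TBD:params:xml:ns:yang:network-topology"
           shard-strategy = "module"
       }
   ]

   # module-shards.conf
   module-shards = [
       {
           name = "topology"
           shards = [
               {
                   name = "topology"
                   replicas = ["member-1"]
               }
           ]
       }
   ]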
62 * If you have a three node cluster and would like to be able to tolerate any
63 single node crashing, a replica of every defined data shard must be running
64 on all three cluster nodes.
66 .. note:: This is because OpenDaylight's clustering implementation requires a
67 majority of the defined shard replicas to be running in order to
68 function. If you define data shard replicas on two of the cluster nodes
and one of those nodes goes down, the corresponding data shards will not
function.
72 * If you have a three node cluster and have defined replicas for a data shard
73 on each of those nodes, that shard will still function even if only two of
74 the cluster nodes are running. Note that if one of those remaining two nodes
75 goes down, the shard will not be operational.
77 * It is recommended that you have multiple seed nodes configured. After a
78 cluster member is started, it sends a message to all of its seed nodes.
79 The cluster member then sends a join command to the first seed node that
80 responds. If none of its seed nodes reply, the cluster member repeats this
81 process until it successfully establishes a connection or it is shut down.
* After a node is unreachable, it remains down for a configurable period of time
84 (10 seconds, by default). Once a node goes down, you need to restart it so
85 that it can rejoin the cluster. Once a restarted node joins a cluster, it
86 will synchronize with the lead node automatically.
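The 10 second default corresponds to the ``auto-down-unreachable-after``
setting in the cluster section of ``akka.conf``, as shown in the sample file
later in this document::

   auto-down-unreachable-after = 10s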
88 .. _getting-started-clustering-scripts:
Clustering Scripts
------------------

OpenDaylight includes some scripts to help with the clustering configuration.
The scripts are stored in the OpenDaylight ``distribution/bin`` folder and
maintained in the distribution project
`repository <https://git.opendaylight.org/gerrit/p/integration/distribution>`_
in the folder ``distribution-karaf/src/main/assembly/bin/``.
102 Configure Cluster Script
103 ^^^^^^^^^^^^^^^^^^^^^^^^
This script is used to configure the cluster parameters (e.g. ``akka.conf``,
``module-shards.conf``) on a member of the controller cluster. The user should
107 restart the node to apply the changes.
The script can be used at any time, even before the controller is started
for the first time. Usage::

   bin/configure_cluster.sh <index> <seed_nodes_list>
118 * index: Integer within 1..N, where N is the number of seed nodes. This indicates
119 which controller node (1..N) is configured by the script.
* seed_nodes_list: List of seed node IP addresses, separated by comma or space.
122 The IP address at the provided index should belong to the member executing
123 the script. When running this script on multiple seed nodes, keep the
124 seed_node_list the same, and vary the index from 1 through N.
126 Optionally, shards can be configured in a more granular way by modifying the
127 file "custom_shard_configs.txt" in the same folder as this tool. Please see
128 that file for more details.
For example::

   bin/configure_cluster.sh 2 192.168.0.1 192.168.0.2 192.168.0.3
The above command will configure member 2 (IP address 192.168.0.2) of a
cluster made of 192.168.0.1, 192.168.0.2 and 192.168.0.3.
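To configure the remaining members, run the same command on each of the other
machines, keeping the seed node list identical and changing only the index::

   # run on 192.168.0.1
   bin/configure_cluster.sh 1 192.168.0.1 192.168.0.2 192.168.0.3

   # run on 192.168.0.3
   bin/configure_cluster.sh 3 192.168.0.1 192.168.0.2 192.168.0.3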
137 Setting Up a Multiple Node Cluster
138 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
140 To run OpenDaylight in a three node cluster, perform the following:
142 First, determine the three machines that will make up the cluster. After that,
143 do the following on each machine:
145 #. Copy the OpenDaylight distribution zip file to the machine.
146 #. Unzip the distribution.
147 #. Open the following .conf files:
* ``configuration/initial/akka.conf``
* ``configuration/initial/module-shards.conf``
152 #. In each configuration file, make the following changes:
Find every instance of the following lines and replace ``127.0.0.1`` with
the hostname or IP address of the machine on which this file resides and
on which OpenDaylight will run::
159 hostname = "127.0.0.1"
.. note:: The value you need to specify will be different for each node in the cluster.
#. Find the following lines and fill in the hostname or IP address of each of
the machines that will be part of the cluster::
168 seed-nodes = ["akka.tcp://opendaylight-cluster-data@${IP_OF_MEMBER1}:2550",
169 <url-to-cluster-member-2>,
170 <url-to-cluster-member-3>]
172 #. Find the following section and specify the role for each member node. Here
173 we assign the first node with the *member-1* role, the second node with the
*member-2* role, and the third node with the *member-3* role::

   roles = [
     "member-1"
   ]
180 .. note:: This step should use a different role on each node.
#. Open the ``configuration/initial/module-shards.conf`` file and update the
replicas so that each shard is replicated to all three nodes::

   replicas = [
       "member-1",
       "member-2",
       "member-3"
   ]
For reference, see the sample config files below.
#. Move into the ``<karaf-distribution-directory>/bin`` directory.
194 #. Run the following command::
196 JAVA_MAX_MEM=4G JAVA_MAX_PERM_MEM=512m ./karaf
198 #. Enable clustering by running the following command at the Karaf command line::
200 feature:install odl-mdsal-clustering
202 OpenDaylight should now be running in a three node cluster. You can use any of
203 the three member nodes to access the data residing in the datastore.
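For example, assuming the ``odl-restconf`` feature is installed and the
default RESTCONF port and credentials are in use (8181 and ``admin``/``admin``;
both depend on your setup), the same data can be read through any member (the
``network-topology`` model is used here purely as an illustration)::

   curl -u admin:admin http://192.168.0.1:8181/restconf/operational/network-topology:network-topology
   curl -u admin:admin http://192.168.0.2:8181/restconf/operational/network-topology:network-topology

Both calls should return the same content, as the requests are served from the
same replicated datastore.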
Sample Config Files
^^^^^^^^^^^^^^^^^^^

Sample ``akka.conf`` file::
212 mailbox-type = "org.opendaylight.controller.cluster.common.actor.MeteredBoundedMailbox"
213 mailbox-capacity = 1000
214 mailbox-push-timeout-time = 100ms
217 metric-capture-enabled = true
221 loggers = ["akka.event.slf4j.Slf4jLogger"]
225 provider = "akka.cluster.ClusterActorRefProvider"
227 java = "akka.serialization.JavaSerializer"
228 proto = "akka.remote.serialization.ProtobufSerializer"
231 serialization-bindings {
232 "com.google.protobuf.Message" = proto
237 log-remote-lifecycle-events = off
239 hostname = "10.194.189.96"
241 maximum-frame-size = 419430400
242 send-buffer-size = 52428800
243 receive-buffer-size = 52428800
248 seed-nodes = ["akka.tcp://opendaylight-cluster-data@10.194.189.96:2550",
249 "akka.tcp://opendaylight-cluster-data@10.194.189.98:2550",
250 "akka.tcp://opendaylight-cluster-data@10.194.189.101:2550"]
252 auto-down-unreachable-after = 10s
264 mailbox-type = "org.opendaylight.controller.cluster.common.actor.MeteredBoundedMailbox"
265 mailbox-capacity = 1000
266 mailbox-push-timeout-time = 100ms
269 metric-capture-enabled = true
273 loggers = ["akka.event.slf4j.Slf4jLogger"]
276 provider = "akka.cluster.ClusterActorRefProvider"
280 log-remote-lifecycle-events = off
282 hostname = "10.194.189.96"
288 seed-nodes = ["akka.tcp://opendaylight-cluster-rpc@10.194.189.96:2551"]
290 auto-down-unreachable-after = 10s
Sample ``module-shards.conf`` file (abbreviated to the ``default`` and
``topology`` shards)::

   module-shards = [
       {
           name = "default"
           shards = [
               {
                   name = "default"
                   replicas = ["member-1",
                               "member-2",
                               "member-3"]
               }
           ]
       },
       {
           name = "topology"
           shards = [
               {
                   name = "topology"
                   replicas = ["member-1",
                               "member-2",
                               "member-3"]
               }
           ]
       }
   ]

Cluster Monitoring
------------------
355 OpenDaylight exposes shard information via MBeans, which can be explored with
356 JConsole, VisualVM, or other JMX clients, or exposed via a REST API using
357 `Jolokia <https://jolokia.org/features-nb.html>`_, provided by the
358 ``odl-jolokia`` Karaf feature. This is convenient, due to a significant focus
359 on REST in OpenDaylight.
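To expose the MBeans over REST, install the feature at the Karaf console::

   feature:install odl-jolokia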
The basic URI that lists a schema of all available MBeans, but not their
content itself, is::

   GET /jolokia/list
366 To read the information about the shards local to the queried OpenDaylight
instance, use the following REST calls. For the config datastore::
369 GET /jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=ShardManager,name=shard-manager-config
371 For the operational datastore::
373 GET /jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational
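For example, assuming the default port and credentials (8181 and
``admin``/``admin``), the operational datastore call can be issued with curl::

   curl -u admin:admin "http://<controller-ip>:8181/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational"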
The output contains information on shards present on the node::

   {
     "request": {
       "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
       "type": "read"
     },
     "value": {
       "LocalShards": [
         "member-1-shard-default-operational",
         "member-1-shard-entity-ownership-operational",
         "member-1-shard-topology-operational",
         "member-1-shard-inventory-operational",
         "member-1-shard-toaster-operational"
       ],
       "SyncStatus": true,
       "MemberName": "member-1"
     },
     "timestamp": 1483738005,
     "status": 200
   }
The exact names from the "LocalShards" list are needed for further
exploration, as they are used as part of the URI to look up detailed
information on a particular shard. For example, the
``member-1-shard-default-operational`` shard can be queried with::

   GET /jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore

An example output looks like this::
404 "mbean": "org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore",
408 "ReadWriteTransactionCount": 0,
410 "InMemoryJournalLogSize": 1,
411 "ReplicatedToAllIndex": 4,
412 "Leader": "member-1-shard-default-operational",
414 "RaftState": "Leader",
415 "LastCommittedTransactionTime": "2017-01-06 13:19:00.135",
417 "LastLeadershipChangeTime": "2017-01-06 13:18:37.605",
419 "PeerAddresses": "member-3-shard-default-operational: akka.tcp://opendaylight-cluster-data@192.168.16.3:2550/user/shardmanager-operational/member-3-shard-default-operational, member-2-shard-default-operational: akka.tcp://opendaylight-cluster-data@192.168.16.2:2550/user/shardmanager-operational/member-2-shard-default-operational",
420 "WriteOnlyTransactionCount": 0,
421 "FollowerInitialSyncStatus": false,
424 "timeSinceLastActivity": "00:00:00.320",
428 "id": "member-3-shard-default-operational",
432 "timeSinceLastActivity": "00:00:00.320",
436 "id": "member-2-shard-default-operational",
440 "FailedReadTransactionsCount": 0,
441 "StatRetrievalTime": "810.5 μs",
445 "FailedTransactionsCount": 0,
446 "PendingTxCommitQueueSize": 0,
447 "VotedFor": "member-1-shard-default-operational",
448 "SnapshotCaptureInitiated": false,
449 "CommittedTransactionsCount": 6,
450 "TxCohortCacheSize": 0,
451 "PeerVotingStates": "member-3-shard-default-operational: true, member-2-shard-default-operational: true",
453 "StatRetrievalError": null,
456 "AbortTransactionsCount": 0,
457 "ReadOnlyTransactionCount": 0,
458 "ShardName": "member-1-shard-default-operational",
459 "LeadershipChangeCount": 1,
460 "InMemoryJournalDataSize": 450
462 "timestamp": 1483740350,
The output helps identify the shard state (leader/follower, voting/non-voting),
its peers, follower details if the shard is a leader, and other statistics.
The Integration team maintains a Python-based `tool
<https://github.com/opendaylight/integration-test/tree/master/tools/clustering/cluster-monitor>`_
that takes advantage of the above MBeans exposed via Jolokia, and the
*systemmetrics* project offers a DLUX-based UI to display the same
information.
476 .. _cluster_admin_api:
478 Geo-distributed Active/Backup Setup
479 -----------------------------------
481 An OpenDaylight cluster works best when the latency between the nodes is very
small, which in practice means they should be in the same datacenter. It is,
however, desirable to be able to fail over to a different datacenter in case
all primary nodes become unreachable. To achieve that, the cluster can be
expanded with nodes in a different datacenter, but in a way that does not
affect the latency of the primary nodes. To do that, shards on the backup
nodes must be in the "non-voting" state.
489 The API to manipulate voting states on shards is defined as RPCs in the
490 `cluster-admin.yang <https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/sal-cluster-admin-api/src/main/yang/cluster-admin.yang>`_
file in the *controller* project, which is well documented. A summary is
provided below.

.. note:: Unless otherwise indicated, the below POST requests are to be sent
   to any single cluster node.
499 To create an active/backup setup with a 6 node cluster (3 active and 3 backup
nodes in two locations), there is an RPC to set voting states of all shards on
501 a list of nodes to a given state::
503 POST /restconf/operations/cluster-admin:change-member-voting-states-for-all-shards
505 This RPC needs the list of nodes and the desired voting state as input. For
506 creating the backup nodes, this example input can be used::
510 "member-voting-state": [
512 "member-name": "member-4",
516 "member-name": "member-5",
520 "member-name": "member-6",
527 When an active/backup deployment already exists, with shards on the backup
528 nodes in non-voting state, all that is needed for a fail-over from the active
529 "sub-cluster" to backup "sub-cluster" is to flip the voting state of each
530 shard (on each node, active AND backup). That can be easily achieved with the
531 following RPC call (no parameters needed)::
533 POST /restconf/operations/cluster-admin:flip-member-voting-states-for-all-shards
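For example, assuming the default port and credentials, the fail-over can be
triggered from the command line as follows::

   curl -u admin:admin -X POST http://<node-ip>:8181/restconf/operations/cluster-admin:flip-member-voting-states-for-all-shards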
535 If it's an unplanned outage where the primary voting nodes are down, the
536 "flip" RPC must be sent to a backup non-voting node. In this case there are no
537 shard leaders to carry out the voting changes. However there is a special case
538 whereby if the node that receives the RPC is non-voting and is to be changed
539 to voting and there's no leader, it will apply the voting changes locally and
540 attempt to become the leader. If successful, it persists the voting changes
541 and replicates them to the remaining nodes.
543 When the primary site is fixed and you want to fail back to it, care must be
544 taken when bringing the site back up. Because it was down when the voting
545 states were flipped on the secondary, its persisted database won't contain
546 those changes. If brought back up in that state, the nodes will think they're
547 still voting. If the nodes have connectivity to the secondary site, they
548 should follow the leader in the secondary site and sync with it. However if
549 this does not happen then the primary site may elect its own leader thereby
partitioning the two clusters, which can lead to undesirable results. Therefore
551 it is recommended to either clean the databases (i.e., ``journal`` and
552 ``snapshots`` directory) on the primary nodes before bringing them back up or
553 restore them from a recent backup of the secondary site (see section
554 :ref:`cluster_backup_restore`).
It is also possible to gracefully remove a node from a cluster, with the
following RPC::

   POST /restconf/operations/cluster-admin:remove-all-shard-replicas

and example input::

   {
     "input": {
       "member-name": "member-1"
     }
   }
or just one particular shard::

   POST /restconf/operations/cluster-admin:remove-shard-replica

with example input::

   {
     "input": {
       "shard-name": "default",
       "member-name": "member-2",
       "data-store-type": "config"
     }
   }
Now that a (potentially dead/unrecoverable) node has been removed, another one
can be added at runtime, without changing the configuration files of the
healthy nodes (which would require a restart)::
587 POST /restconf/operations/cluster-admin:add-replicas-for-all-shards
No input is required, but this RPC needs to be sent to the new node, to instruct
590 it to replicate all shards from the cluster.
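For example, assuming the default port and credentials, the new node
(``<new-node-ip>``) would be instructed to join as follows::

   curl -u admin:admin -X POST http://<new-node-ip>:8181/restconf/operations/cluster-admin:add-replicas-for-all-shards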
594 While the cluster admin API allows adding and removing shards dynamically,
the ``module-shards.conf`` and ``modules.conf`` files are still used on
596 startup to define the initial configuration of shards. Modifications from
597 the use of the API are not stored to those static files, but to the journal.
599 Extra Configuration Options
600 ---------------------------
602 ============================================== ================= ======= ==============================================================================================================================================================================
603 Name Type Default Description
604 ============================================== ================= ======= ==============================================================================================================================================================================
605 max-shard-data-change-executor-queue-size uint32 (1..max) 1000 The maximum queue size for each shard's data store data change notification executor.
606 max-shard-data-change-executor-pool-size uint32 (1..max) 20 The maximum thread pool size for each shard's data store data change notification executor.
607 max-shard-data-change-listener-queue-size uint32 (1..max) 1000 The maximum queue size for each shard's data store data change listener.
608 max-shard-data-store-executor-queue-size uint32 (1..max) 5000 The maximum queue size for each shard's data store executor.
609 shard-transaction-idle-timeout-in-minutes uint32 (1..max) 10 The maximum amount of time a shard transaction can be idle without receiving any messages before it self-destructs.
610 shard-snapshot-batch-count uint32 (1..max) 20000 The minimum number of entries to be present in the in-memory journal log before a snapshot is to be taken.
shard-snapshot-data-threshold-percentage       uint8 (1..100)    12      The percentage of Runtime.totalMemory() used by the in-memory journal log before a snapshot is to be taken.
shard-heartbeat-interval-in-millis             uint16 (100..max) 500     The interval at which a shard will send a heartbeat message to its remote shard.
613 operation-timeout-in-seconds uint16 (5..max) 5 The maximum amount of time for akka operations (remote or local) to complete before failing.
614 shard-journal-recovery-log-batch-size uint32 (1..max) 5000 The maximum number of journal log entries to batch on recovery for a shard before committing to the data store.
shard-transaction-commit-timeout-in-seconds    uint32 (1..max)   30      The maximum amount of time a shard transaction three-phase commit can be idle without receiving the next messages before it aborts the transaction.
616 shard-transaction-commit-queue-capacity uint32 (1..max) 20000 The maximum allowed capacity for each shard's transaction commit queue.
617 shard-initialization-timeout-in-seconds uint32 (1..max) 300 The maximum amount of time to wait for a shard to initialize from persistence on startup before failing an operation (eg transaction create and change listener registration).
618 shard-leader-election-timeout-in-seconds uint32 (1..max) 30 The maximum amount of time to wait for a shard to elect a leader before failing an operation (eg transaction create).
619 enable-metric-capture boolean false Enable or disable metric capture.
bounded-mailbox-capacity                       uint32 (1..max)   1000    Max queue size that an actor's mailbox can reach.
persistent                                     boolean           true    Enable or disable data persistence.
shard-isolated-leader-check-interval-in-millis uint32 (1..max)   5000    The interval at which the shard leader checks whether a majority of its followers are active; if not, the leader deems itself isolated.
623 ============================================== ================= ======= ==============================================================================================================================================================================
These configuration options are included in the
``etc/org.opendaylight.controller.cluster.datastore.cfg`` configuration file.
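For example, to disable persistence and increase the heartbeat interval, the
file could contain the following lines (a sketch using the option names from
the table above)::

   persistent=false
   shard-heartbeat-interval-in-millis=1000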