docs/getting-started-guide/clustering.rst

   1 Setting Up Clustering
   2 =====================
   3
   4 Clustering Overview
   5 -------------------
   6
   7 Clustering is a mechanism that enables multiple processes and programs to work
   8 together as one entity.  For example, when you search for something on
   9 google.com, it may seem like your search request is processed by only one web
  10 server. In reality, your search request is processed by may web servers
  11 connected in a cluster. Similarly, you can have multiple instances of
  12 OpenDaylight working together as one entity.
  13
  14 Advantages of clustering are:
  15
  16 * Scaling: If you have multiple instances of OpenDaylight running, you can
  17   potentially do more work and store more data than you could with only one
  18   instance. You can also break up your data into smaller chunks (shards) and
  19   either distribute that data across the cluster or perform certain operations
  20   on certain members of the cluster.
  21 * High Availability: If you have multiple instances of OpenDaylight running and
  22   one of them crashes, you will still have the other instances working and
  23   available.
  24 * Data Persistence: You will not lose any data stored in OpenDaylight after a
  25   manual restart or a crash.
  26
  27 The following sections describe how to set up clustering on both individual and
  28 multiple OpenDaylight instances.
  29
  30 Multiple Node Clustering
  31 ------------------------
  32
  33 The following sections describe how to set up multiple node clusters in OpenDaylight.
  34
  35 Deployment Considerations
  36 ^^^^^^^^^^^^^^^^^^^^^^^^^
  37
  38 To implement clustering, the deployment considerations are as follows:
  39
  40 * To set up a cluster with multiple nodes, we recommend that you use a minimum
  41   of three machines. You can set up a cluster with just two nodes. However, if
  42   one of the two nodes fail, the cluster will not be operational.
  43
  44   .. note:: This is because clustering in OpenDaylight requires a majority of the
  45              nodes to be up and one node cannot be a majority of two nodes.
  46
  47 * Every device that belongs to a cluster needs to have an identifier.
  48   OpenDaylight uses the node's ``role`` for this purpose. After you define the
  49   first node's role as *member-1* in the ``akka.conf`` file, OpenDaylight uses
  50   *member-1* to identify that node.
  51
  52 * Data shards are used to contain all or a certain segment of a OpenDaylight's
  53   MD-SAL datastore. For example, one shard can contain all the inventory data
  54   while another shard contains all of the topology data.
  55
  56   If you do not specify a module in the ``modules.conf`` file and do not specify
  57   a shard in ``module-shards.conf``, then (by default) all the data is placed in
  58   the default shard (which must also be defined in ``module-shards.conf`` file).
  59   Each shard has replicas configured. You can specify the details of where the
  60   replicas reside in ``module-shards.conf`` file.
  61
  62 * If you have a three node cluster and would like to be able to tolerate any
  63   single node crashing, a replica of every defined data shard must be running
  64   on all three cluster nodes.
  65
  66   .. note:: This is because OpenDaylight's clustering implementation requires a
  67             majority of the defined shard replicas to be running in order to
  68             function. If you define data shard replicas on two of the cluster nodes
  69             and one of those nodes goes down, the corresponding data shards will not
  70             function.
  71
  72 * If you have a three node cluster and have defined replicas for a data shard
  73   on each of those nodes, that shard will still function even if only two of
  74   the cluster nodes are running. Note that if one of those remaining two nodes
  75   goes down, the shard will not be operational.
  76
  77 * It is  recommended that you have multiple seed nodes configured. After a
  78   cluster member is started, it sends a message to all of its seed nodes.
  79   The cluster member then sends a join command to the first seed node that
  80   responds. If none of its seed nodes reply, the cluster member repeats this
  81   process until it successfully establishes a connection or it is shut down.
  82
  83 * After a node is unreachable, it remains down for configurable period of time
  84   (10 seconds, by default). Once a node goes down, you need to restart it so
  85   that it can rejoin the cluster. Once a restarted node joins a cluster, it
  86   will synchronize with the lead node automatically.
  87
  88 .. _getting-started-clustering-scripts:
  89
  90 Clustering Scripts
  91 ------------------
  92
  93 OpenDaylight includes some scripts to help with the clustering configuration.
  94
  95 .. note::
  96
  97     Scripts are stored in the OpenDaylight ``distribution/bin`` folder, and
  98     maintained in the distribution project
  99     `repository <https://git.opendaylight.org/gerrit/admin/repos/integration/distribution>`_
 100     in the folder ``karaf-scripts/src/main/assembly/bin/``.
 101
 102 Configure Cluster Script
 103 ^^^^^^^^^^^^^^^^^^^^^^^^
 104
 105 This script is used to configure the cluster parameters (e.g. ``akka.conf``,
 106 ``module-shards.conf``) on a member of the controller cluster. The user should
 107 restart the node to apply the changes.
 108
 109 .. note::
 110
 111     The script can be used at any time, even before the controller is started
 112     for the first time.
 113
 114 Usage::
 115
 116     bin/configure_cluster.sh <index> <seed_nodes_list>
 117
 118 * index: Integer within 1..N, where N is the number of seed nodes. This indicates
 119   which controller node (1..N) is configured by the script.
 120 * seed_nodes_list: List of seed nodes (IP address), separated by comma or space.
 121
 122 The IP address at the provided index should belong to the member executing
 123 the script. When running this script on multiple seed nodes, keep the
 124 seed_node_list the same, and vary the index from 1 through N.
 125
 126 Optionally, shards can be configured in a more granular way by modifying the
 127 file ``"custom_shard_configs.txt"`` in the same folder as this tool. Please see
 128 that file for more details.
 129
 130 Example::
 131
 132     bin/configure_cluster.sh 2 192.168.0.1 192.168.0.2 192.168.0.3
 133
 134 The above command will configure the member 2 (IP address 192.168.0.2) of a
 135 cluster made of 192.168.0.1 192.168.0.2 192.168.0.3.
 136
 137 Setting Up a Multiple Node Cluster
 138 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 139
 140 To run OpenDaylight in a three node cluster, perform the following:
 141
 142 First, determine the three machines that will make up the cluster. After that,
 143 do the following on each machine:
 144
 145 #. Copy the OpenDaylight distribution zip file to the machine.
 146 #. Unzip the distribution.
 147 #. Move into the ``<karaf-distribution-directory>/bin`` directory and run::
 148
 149       JAVA_MAX_MEM=4G JAVA_MAX_PERM_MEM=512m ./karaf
 150
 151 #. Enable clustering by running the following command at the Karaf command line::
 152
 153       feature:install odl-mdsal-distributed-datastore
 154
 155    After installation you will be able to see new folder ``configuration/initial/``
 156    with config files
 157
 158 #. Open the following configuration files:
 159
 160    * ``configuration/initial/akka.conf``
 161    * ``configuration/initial/module-shards.conf``
 162
 163 #. In each configuration file, make the following changes:
 164
 165    Find every instance of the following lines and replace _127.0.0.1_ with the
 166    hostname or IP address of the machine on which this file resides and
 167    OpenDaylight will run::
 168
 169       artery {
 170         canonical.hostname = "127.0.0.1"
 171
 172    .. note:: The value you need to specify will be different for each node in the
 173              cluster.
 174
 175 #. Find the following lines and replace _127.0.0.1_ with the hostname or IP
 176    address of any of the machines that will be part of the cluster::
 177
 178       cluster {
 179         seed-nodes = ["akka://opendaylight-cluster-data@${IP_OF_MEMBER1}:2550",
 180                       <url-to-cluster-member-2>,
 181                       <url-to-cluster-member-3>]
 182
 183 #. Find the following section and specify the role for each member node. Here
 184    we assign the first node with the *member-1* role, the second node with the
 185    *member-2* role, and the third node with the *member-3* role::
 186
 187      roles = [
 188        "member-1"
 189      ]
 190
 191    .. note:: This step should use a different role on each node.
 192
 193 #. Open the ``configuration/initial/module-shards.conf`` file and update the
 194    replicas so that each shard is replicated to all three nodes::
 195
 196       replicas = [
 197           "member-1",
 198           "member-2",
 199           "member-3"
 200       ]
 201
 202    For reference, view a sample config files below.
 203
 204 #. Restart bundle via command line::
 205
 206       opendaylight-user@root>restart org.opendaylight.controller.sal-distributed-datastore
 207
 208 OpenDaylight should now be running in a three node cluster. You can use any of
 209 the three member nodes to access the data residing in the datastore.
 210
 211 Sample Config Files
 212 """""""""""""""""""
 213
 214 Sample ``akka.conf`` file::
 215
 216    odl-cluster-data {
 217      akka {
 218        remote {
 219          artery {
 220            enabled = on
 221            transport = tcp
 222            canonical.hostname = "10.0.2.10"
 223            canonical.port = 2550
 224          }
 225        }
 226
 227        cluster {
 228          # Using artery.
 229          seed-nodes = ["akka://opendaylight-cluster-data@10.0.2.10:2550",
 230                        "akka://opendaylight-cluster-data@10.0.2.11:2550",
 231                        "akka://opendaylight-cluster-data@10.0.2.12:2550"]
 232
 233          roles = [
 234            "member-1"
 235          ]
 236
 237          # when under load we might trip a false positive on the failure detector
 238          # failure-detector {
 239            # heartbeat-interval = 4 s
 240            # acceptable-heartbeat-pause = 16s
 241          # }
 242        }
 243
 244        persistence {
 245          # By default the snapshots/journal directories live in KARAF_HOME. You can choose to put it somewhere else by
 246          # modifying the following two properties. The directory location specified may be a relative or absolute path.
 247          # The relative path is always relative to KARAF_HOME.
 248
 249          # snapshot-store.local.dir = "target/snapshots"
 250
 251          # Use lz4 compression for LocalSnapshotStore snapshots
 252          snapshot-store.local.use-lz4-compression = false
 253          # Size of blocks for lz4 compression: 64KB, 256KB, 1MB or 4MB
 254          snapshot-store.local.lz4-blocksize = 256KB
 255        }
 256        disable-default-actor-system-quarantined-event-handling = "false"
 257      }
 258    }
 259
 260 Sample ``module-shards.conf`` file::
 261
 262    module-shards = [
 263        {
 264            name = "default"
 265            shards = [
 266                {
 267                    name="default"
 268                    replicas = [
 269                        "member-1",
 270                        "member-2",
 271                        "member-3"
 272                    ]
 273                }
 274            ]
 275        },
 276        {
 277            name = "topology"
 278            shards = [
 279                {
 280                    name="topology"
 281                    replicas = [
 282                        "member-1",
 283                        "member-2",
 284                        "member-3"
 285                    ]
 286                }
 287            ]
 288        },
 289        {
 290            name = "inventory"
 291            shards = [
 292                {
 293                    name="inventory"
 294                    replicas = [
 295                        "member-1",
 296                        "member-2",
 297                        "member-3"
 298                    ]
 299                }
 300            ]
 301        },
 302        {
 303             name = "toaster"
 304             shards = [
 305                 {
 306                     name="toaster"
 307                     replicas = [
 308                        "member-1",
 309                        "member-2",
 310                        "member-3"
 311                     ]
 312                 }
 313             ]
 314        }
 315    ]
 316
 317 Cluster Monitoring
 318 ------------------
 319
 320 OpenDaylight exposes shard information via ``MBeans``, which can be explored
 321 with ``JConsole``, VisualVM, or other JMX clients, or exposed via a REST API using
 322 `Jolokia <https://jolokia.org/features-nb.html>`_, provided by the
 323 ``odl-jolokia`` Karaf feature. This is convenient, due to a significant focus
 324 on REST in OpenDaylight.
 325
 326 The basic URI that lists a schema of all available ``MBeans``, but not their
 327 content itself is::
 328
 329     GET  /jolokia/list
 330
 331 To read the information about the shards local to the queried OpenDaylight
 332 instance use the following REST calls. For the config datastore::
 333
 334     GET  /jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=ShardManager,name=shard-manager-config
 335
 336 For the operational datastore::
 337
 338     GET  /jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational
 339
 340 The output contains information on shards present on the node::
 341
 342     {
 343       "request": {
 344         "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
 345         "type": "read"
 346       },
 347       "value": {
 348         "LocalShards": [
 349           "member-1-shard-default-operational",
 350           "member-1-shard-entity-ownership-operational",
 351           "member-1-shard-topology-operational",
 352           "member-1-shard-inventory-operational",
 353           "member-1-shard-toaster-operational"
 354         ],
 355         "SyncStatus": true,
 356         "MemberName": "member-1"
 357       },
 358       "timestamp": 1483738005,
 359       "status": 200
 360     }
 361
 362 The exact names from the "LocalShards" lists are needed for further
 363 exploration, as they will be used as part of the URI to look up detailed info
 364 on a particular shard. An example output for the
 365 ``member-1-shard-default-operational`` looks like this::
 366
 367     {
 368       "request": {
 369         "mbean": "org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore",
 370         "type": "read"
 371       },
 372       "value": {
 373         "ReadWriteTransactionCount": 0,
 374         "SnapshotIndex": 4,
 375         "InMemoryJournalLogSize": 1,
 376         "ReplicatedToAllIndex": 4,
 377         "Leader": "member-1-shard-default-operational",
 378         "LastIndex": 5,
 379         "RaftState": "Leader",
 380         "LastCommittedTransactionTime": "2017-01-06 13:19:00.135",
 381         "LastApplied": 5,
 382         "LastLeadershipChangeTime": "2017-01-06 13:18:37.605",
 383         "LastLogIndex": 5,
 384         "PeerAddresses": "member-3-shard-default-operational: akka://opendaylight-cluster-data@192.168.16.3:2550/user/shardmanager-operational/member-3-shard-default-operational, member-2-shard-default-operational: akka://opendaylight-cluster-data@192.168.16.2:2550/user/shardmanager-operational/member-2-shard-default-operational",
 385         "WriteOnlyTransactionCount": 0,
 386         "FollowerInitialSyncStatus": false,
 387         "FollowerInfo": [
 388           {
 389             "timeSinceLastActivity": "00:00:00.320",
 390             "active": true,
 391             "matchIndex": 5,
 392             "voting": true,
 393             "id": "member-3-shard-default-operational",
 394             "nextIndex": 6
 395           },
 396           {
 397             "timeSinceLastActivity": "00:00:00.320",
 398             "active": true,
 399             "matchIndex": 5,
 400             "voting": true,
 401             "id": "member-2-shard-default-operational",
 402             "nextIndex": 6
 403           }
 404         ],
 405         "FailedReadTransactionsCount": 0,
 406         "StatRetrievalTime": "810.5 μs",
 407         "Voting": true,
 408         "CurrentTerm": 1,
 409         "LastTerm": 1,
 410         "FailedTransactionsCount": 0,
 411         "PendingTxCommitQueueSize": 0,
 412         "VotedFor": "member-1-shard-default-operational",
 413         "SnapshotCaptureInitiated": false,
 414         "CommittedTransactionsCount": 6,
 415         "TxCohortCacheSize": 0,
 416         "PeerVotingStates": "member-3-shard-default-operational: true, member-2-shard-default-operational: true",
 417         "LastLogTerm": 1,
 418         "StatRetrievalError": null,
 419         "CommitIndex": 5,
 420         "SnapshotTerm": 1,
 421         "AbortTransactionsCount": 0,
 422         "ReadOnlyTransactionCount": 0,
 423         "ShardName": "member-1-shard-default-operational",
 424         "LeadershipChangeCount": 1,
 425         "InMemoryJournalDataSize": 450
 426       },
 427       "timestamp": 1483740350,
 428       "status": 200
 429     }
 430
 431 The output helps identifying shard state (leader/follower, voting/non-voting),
 432 peers, follower details if the shard is a leader, and other
 433 statistics/counters.
 434
 435 The ODLTools team is maintaining a Python based `tool
 436 <https://github.com/opendaylight/odltools>`_,
 437 that takes advantage of the above ``MBeans`` exposed via ``Jolokia``.
 438
 439 .. _cluster_admin_api:
 440
 441 Failure handling
 442 ----------------
 443
 444 Overview
 445 --------
 446
 447 A fundamental problem in distributed systems is that network
 448 partitions (split brain scenarios) and machine crashes are indistinguishable
 449 for the observer, i.e. a node can observe that there is a problem with another
 450 node, but it cannot tell if it has crashed and will never be available again,
 451 if there is a network issue that might or might not heal again after a while or
 452 if process is unresponsive because of overload, CPU starvation or long garbage
 453 collection pauses.
 454
 455 When there is a crash, we would like to remove the affected node immediately
 456 from the cluster membership. When there is a network partition or unresponsive
 457 process we would like to wait for a while in the hope that it is a transient
 458 problem that will heal again, but at some point, we must give up and continue
 459 with the nodes on one side of the partition and shut down nodes on the other
 460 side. Also, certain features are not fully available during partitions so it
 461 might not matter that the partition is transient or not if it just takes too
 462 long. Those two goals are in conflict with each other and there is a trade-off
 463 between how quickly we can remove a crashed node and premature action on
 464 transient network partitions.
 465
 466 Split Brain Resolver
 467 --------------------
 468
 469 You need to enable the Split Brain Resolver by configuring it as downing
 470 provider in the configuration::
 471
 472     akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
 473
 474 You should also consider different downing strategies, described below.
 475
 476 .. note:: If no downing provider is specified, NoDowning provider is used.
 477
 478 All strategies are inactive until the cluster membership and the information about
 479 unreachable nodes have been stable for a certain time period. Continuously adding
 480 more nodes while there is a network partition does not influence this timeout, since
 481 the status of those nodes will not be changed to Up while there are unreachable nodes.
 482 Joining nodes are not counted in the logic of the strategies.
 483
 484 Setting ``akka.cluster.split-brain-resolver.stable-after`` to a shorter duration for having
 485 quicker removal of crashed nodes can be done at the price of risking a too early action on
 486 transient network partitions that otherwise would have healed. Do not set this to a shorter
 487 duration than the membership dissemination time in the cluster, which depends on the cluster size.
 488 Recommended minimum duration for different cluster sizes:
 489
 490 ============   ============
 491 Cluster size   stable-after
 492 ============   ============
 493 5              7 s
 494 10             10 s
 495 20             13 s
 496 50             17 s
 497 100            20 s
 498 1000           30 s
 499 ============   ============
 500
 501 .. note:: It is important that you use the same configuration on all nodes.
 502
 503 When reachability observations by the failure detector are changed, the SBR
 504 decisions are deferred until there are no changes within the stable-after
 505 duration. If this continues for too long it might be an indication of an
 506 unstable system/network and it could result in delayed or conflicting
 507 decisions on separate sides of a network partition.
 508
 509 As a precaution for that scenario all nodes are downed if no decision is
 510 made within stable-after + down-all-when-unstable from the first unreachability
 511 event. The measurement is reset if all unreachable have been healed, downed or
 512 removed, or if there are no changes within stable-after * 2.
 513
 514 Configuration::
 515
 516     akka.cluster.split-brain-resolver {
 517       # Time margin after which shards or singletons that belonged to a downed/removed
 518       # partition are created in surviving partition. The purpose of this margin is that
 519       # in case of a network partition the persistent actors in the non-surviving partitions
 520       # must be stopped before corresponding persistent actors are started somewhere else.
 521       # This is useful if you implement downing strategies that handle network partitions,
 522       # e.g. by keeping the larger side of the partition and shutting down the smaller side.
 523       # Decision is taken by the strategy when there has been no membership or
 524       # reachability changes for this duration, i.e. the cluster state is stable.
 525       stable-after = 20s
 526
 527       # When reachability observations by the failure detector are changed the SBR decisions
 528       # are deferred until there are no changes within the 'stable-after' duration.
 529       # If this continues for too long it might be an indication of an unstable system/network
 530       # and it could result in delayed or conflicting decisions on separate sides of a network
 531       # partition.
 532       # As a precaution for that scenario all nodes are downed if no decision is made within
 533       # `stable-after + down-all-when-unstable` from the first unreachability event.
 534       # The measurement is reset if all unreachable have been healed, downed or removed, or
 535       # if there are no changes within `stable-after * 2`.
 536       # The value can be on, off, or a duration.
 537       # By default it is 'on' and then it is derived to be 3/4 of stable-after, but not less than
 538       # 4 seconds.
 539       down-all-when-unstable = on
 540     }
 541
 542
 543 Keep majority
 544 ^^^^^^^^^^^^^
 545
 546 This strategy is used by default, because it works well for most systems.
 547 It will down the unreachable nodes if the current node is in the majority part
 548 based on the last known membership information. Otherwise down the reachable
 549 nodes, i.e. the own part. If the parts are of equal size the part containing the
 550 node with the lowest address is kept.
 551
 552 This strategy is a good choice when the number of nodes in the cluster change
 553 dynamically and you can therefore not use static-quorum.
 554
 555 * If there are membership changes at the same time as the network partition
 556   occurs, for example, the status of two members are changed to Up on one side
 557   but that information is not disseminated to the other side before the
 558   connection is broken, it will down all nodes on the side that could be in
 559   minority if the joining nodes were changed to Up on the other side.
 560   Note that if the joining nodes were not changed to Up and becoming a majority
 561   on the other side then each part will shut down itself, terminating the whole
 562   cluster.
 563
 564 * If there are more than two partitions and none is in majority each part will
 565   shut down itself, terminating the whole cluster.
 566
 567 * If more than half of the nodes crash at the same time the other running nodes
 568   will down themselves because they think that they are not in majority, and
 569   thereby the whole cluster is terminated.
 570
 571 The decision can be based on nodes with a configured role instead of all nodes
 572 in the cluster. This can be useful when some types of nodes are more valuable
 573 than others.
 574
 575 Configuration::
 576
 577     akka.cluster.split-brain-resolver.active-strategy=keep-majority
 578
 579 ::
 580
 581     akka.cluster.split-brain-resolver.keep-majority {
 582       # if the 'role' is defined the decision is based only on members with that 'role'
 583       role = ""
 584     }
 585
 586 Static quorum
 587 ^^^^^^^^^^^^^
 588
 589 The strategy named static-quorum will down the unreachable nodes if the number
 590 of remaining nodes are greater than or equal to a configured quorum-size.
 591 Otherwise, it will down the reachable nodes, i.e. it will shut down that side
 592 of the partition.
 593
 594 This strategy is a good choice when you have a fixed number of nodes in the
 595 cluster, or when you can define a fixed number of nodes with a certain role.
 596
 597 * If there are unreachable nodes when starting up the cluster, before reaching
 598   this limit, the cluster may shut itself down immediately.
 599   This is not an issue if you start all nodes at approximately the same time or
 600   use the ``akka.cluster.min-nr-of-members`` to define required number of
 601   members before the leader changes member status of ‘Joining’ members to ‘Up’.
 602   You can tune the timeout after which downing decisions are made using the
 603   stable-after setting.
 604
 605 * You should not add more members to the cluster than quorum-size * 2 - 1.
 606   If the exceeded cluster size remains when a SBR decision is needed it will
 607   down all nodes because otherwise there is a risk that both sides may down each
 608   other and thereby form two separate clusters.
 609
 610 * If the cluster is split into 3 (or more) parts each part that is smaller than
 611   then configured quorum-size will down itself and possibly shutdown the whole
 612   cluster.
 613
 614 * If more nodes than the configured quorum-size crash at the same time the other
 615   running nodes will down themselves because they think that they are not in the
 616   majority, and thereby the whole cluster is terminated.
 617
 618 The decision can be based on nodes with a configured role instead of all nodes
 619 in the cluster. This can be useful when some types of nodes are more valuable
 620 than others.
 621
 622 By defining a role for a few stable nodes in the cluster and using that in the
 623 configuration of static-quorum you will be able to dynamically add and remove
 624 other nodes without this role and still have good decisions of what nodes to
 625 keep running and what nodes to shut down in the case of network partitions.
 626 The advantage of this approach compared to keep-majority is that you do not risk
 627 splitting the cluster into two separate clusters, i.e. a split brain.
 628
 629 Configuration::
 630
 631     akka.cluster.split-brain-resolver.active-strategy=static-quorum
 632
 633 ::
 634
 635     akka.cluster.split-brain-resolver.static-quorum {
 636       # minimum number of nodes that the cluster must have
 637       quorum-size = undefined
 638
 639       # if the 'role' is defined the decision is based only on members with that 'role'
 640       role = ""
 641     }
 642
 643 Keep oldest
 644 ^^^^^^^^^^^
 645
 646 The strategy named keep-oldest will down the part that does not contain the oldest
 647 member. The oldest member is interesting because the active Cluster Singleton
 648 instance is running on the oldest member.
 649
 650 This strategy is good to use if you use Cluster Singleton and do not want to shut
 651 down the node where the singleton instance runs. If the oldest node crashes a new
 652 singleton instance will be started on the next oldest node.
 653
 654 * If down-if-alone is configured to on, then if the oldest node has partitioned
 655   from all other nodes the oldest will down itself and keep all other nodes running.
 656   The strategy will not down the single oldest node when it is the only remaining
 657   node in the cluster.
 658
 659 * If there are membership changes at the same time as the network partition occurs,
 660   for example, the status of the oldest member is changed to Exiting on one side but
 661   that information is not disseminated to the other side before the connection is
 662   broken, it will detect this situation and make the safe decision to down all nodes
 663   on the side that sees the oldest as Leaving. Note that this has the drawback that
 664   if the oldest was Leaving and not changed to Exiting then each part will shut down
 665   itself, terminating the whole cluster.
 666
 667 The decision can be based on nodes with a configured role instead of all nodes
 668 in the cluster.
 669
 670 Configuration::
 671
 672     akka.cluster.split-brain-resolver.active-strategy=keep-oldest
 673
 674
 675 ::
 676
 677     akka.cluster.split-brain-resolver.keep-oldest {
 678       # Enable downing of the oldest node when it is partitioned from all other nodes
 679       down-if-alone = on
 680
 681       # if the 'role' is defined the decision is based only on members with that 'role',
 682       # i.e. using the oldest member (singleton) within the nodes with that role
 683       role = ""
 684     }
 685
 686 Down all
 687 ^^^^^^^^
 688
 689 The strategy named down-all will down all nodes.
 690
 691 This strategy can be a safe alternative if the network environment is highly unstable
 692 with unreachability observations that can’t be fully trusted, and including frequent
 693 occurrences of indirectly connected nodes. Due to the instability there is an increased
 694 risk of different information on different sides of partitions and therefore the other
 695 strategies may result in conflicting decisions. In such environments it can be better
 696 to shutdown all nodes and start up a new fresh cluster.
 697
 698 * This strategy is not recommended for large clusters (> 10 nodes) because any minor
 699   problem will shutdown all nodes, and that is more likely to happen in larger clusters
 700   since there are more nodes that may fail.
 701
 702 Configuration::
 703
 704     akka.cluster.split-brain-resolver.active-strategy=down-all
 705
 706 Lease
 707 ^^^^^
 708
 709 The strategy named lease-majority is using a distributed lease (lock) to decide what
 710 nodes that are allowed to survive. Only one SBR instance can acquire the lease make
 711 the decision to remain up. The other side will not be able to acquire the lease and
 712 will therefore down itself.
 713
 714 This strategy is very safe since coordination is added by an external arbiter.
 715
 716 * In some cases the lease will be unavailable when needed for a decision from all
 717   SBR instances, e.g. because it is on another side of a network partition, and then
 718   all nodes will be downed.
 719
 720 Configuration::
 721
 722     akka {
 723       cluster {
 724         downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
 725         split-brain-resolver {
 726           active-strategy = "lease-majority"
 727           lease-majority {
 728             lease-implementation = "akka.coordination.lease.kubernetes"
 729           }
 730         }
 731       }
 732     }
 733
 734 ::
 735
 736     akka.cluster.split-brain-resolver.lease-majority {
 737       lease-implementation = ""
 738
 739       # This delay is used on the minority side before trying to acquire the lease,
 740       # as an best effort to try to keep the majority side.
 741       acquire-lease-delay-for-minority = 2s
 742
 743       # If the 'role' is defined the majority/minority is based only on members with that 'role'.
 744       role = ""
 745     }
 746
 747 Indirectly connected nodes
 748 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 749
 750 In a malfunctioning network there can be situations where nodes are observed as
 751 unreachable via some network links but they are still indirectly connected via
 752 other nodes, i.e. it’s not a clean network partition (or node crash).
 753
 754 When this situation is detected the Split Brain Resolvers will keep fully
 755 connected nodes and down all the indirectly connected nodes.
 756
 757 If there is a combination of indirectly connected nodes and a clean network
 758 partition it will combine the above decision with the ordinary decision,
 759 e.g. keep majority, after excluding suspicious failure detection observations.
 760
 761 Multi-DC cluster
 762 ----------------
 763
 764 An OpenDaylight cluster has an ability to run on multiple data centers in a way,
 765 that tolerates network partitions among them.
 766
 767 Nodes can be assigned to group of nodes by setting the
 768 ``akka.cluster.multi-data-center.self-data-center`` configuration property.
 769 A node can only belong to one data center and if nothing is specified a node will
 770 belong to the default data center.
 771
 772 The grouping of nodes is not limited to the physical boundaries of data centers,
 773 it could also be used as a logical grouping for other reasons, such as isolation
 774 of certain nodes to improve stability or splitting up a large cluster into smaller
 775 groups of nodes for better scalability.
 776
 777 Failure detection
 778 ^^^^^^^^^^^^^^^^^
 779
 780 Failure detection is performed by sending heartbeat messages to detect if a node
 781 is unreachable. This is done more frequently and with more certainty among the
 782 nodes in the same data center than across data centers.
 783
 784 Two different failure detectors can be configured for these two purposes:
 785
 786 * ``akka.cluster.failure-detector`` for failure detection within own data center
 787
 788 * ``akka.cluster.multi-data-center.failure-detector`` for failure detection across
 789   different data centers
 790
 791 Heartbeat messages for failure detection across data centers are only performed
 792 between a number of the oldest nodes on each side. The number of nodes is configured
 793 with ``akka.cluster.multi-data-center.cross-data-center-connections``.
 794
 795 This influences how rolling updates should be performed. Don’t stop all of the oldest nodes
 796 that are used for gossip at the same time. Stop one or a few at a time so that new
 797 nodes can take over the responsibility. It’s best to leave the oldest nodes until last.
 798
 799 Configuration::
 800
 801     multi-data-center {
 802       # Defines which data center this node belongs to. It is typically used to make islands of the
 803       # cluster that are colocated. This can be used to make the cluster aware that it is running
 804       # across multiple availability zones or regions. It can also be used for other logical
 805       # grouping of nodes.
 806       self-data-center = "default"
 807
 808
 809       # Try to limit the number of connections between data centers. Used for gossip and heartbeating.
 810       # This will not limit connections created for the messaging of the application.
 811       # If the cluster does not span multiple data centers, this value has no effect.
 812       cross-data-center-connections = 5
 813
 814       # The n oldest nodes in a data center will choose to gossip to another data center with
 815       # this probability. Must be a value between 0.0 and 1.0 where 0.0 means never, 1.0 means always.
 816       # When a data center is first started (nodes < 5) a higher probability is used so other data
 817       # centers find out about the new nodes more quickly
 818       cross-data-center-gossip-probability = 0.2
 819
 820       failure-detector {
 821         # FQCN of the failure detector implementation.
 822         # It must implement akka.remote.FailureDetector and have
 823         # a public constructor with a com.typesafe.config.Config and
 824         # akka.actor.EventStream parameter.
 825         implementation-class = "akka.remote.DeadlineFailureDetector"
 826
 827         # Number of potentially lost/delayed heartbeats that will be
 828         # accepted before considering it to be an anomaly.
 829         # This margin is important to be able to survive sudden, occasional,
 830         # pauses in heartbeat arrivals, due to for example garbage collect or
 831         # network drop.
 832         acceptable-heartbeat-pause = 10 s
 833
 834         # How often keep-alive heartbeat messages should be sent to each connection.
 835         heartbeat-interval = 3 s
 836
 837         # After the heartbeat request has been sent the first failure detection
 838         # will start after this period, even though no heartbeat message has
 839         # been received.
 840         expected-response-after = 1 s
 841       }
 842     }
 843
 844 Active/Backup Setup
 845 -------------------
 846
 847 It is desirable to have the possibility to fail over to a different
 848 data center, in case all nodes become unreachable. To achieve that
 849 shards in the backup data center must be in "non-voting" state.
 850
 851 The API to manipulate voting states on shards is defined as RPCs in the
 852 `cluster-admin.yang <https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/sal-cluster-admin-api/src/main/yang/cluster-admin.yang>`_
 853 file in the *controller* project, which is well documented. A summary is
 854 provided below.
 855
 856 .. note::
 857
 858   Unless otherwise indicated, the below POST requests are to be sent to any
 859   single cluster node.
 860
 861 To create an active/backup setup with a 6 node cluster (3 active and 3 backup
 862 nodes in two locations) such configuration is used:
 863
 864 * for member-1, member-2 and member-3 (active data center)::
 865
 866     akka.cluster.multi-data-center {
 867       self-data-center = "main"
 868     }
 869
 870 * for member-4, member-5, member-6 (backup data center)::
 871
 872     akka.cluster.multi-data-center {
 873       self-data-center = "backup"
 874     }
 875
 876 There is an RPC to set voting states of all shards on
 877 a list of nodes to a given state::
 878
 879    POST  /restconf/operations/cluster-admin:change-member-voting-states-for-all-shards
 880
 881    or
 882
 883    POST  /rests/operations/cluster-admin:change-member-voting-states-for-all-shards
 884
 885 This RPC needs the list of nodes and the desired voting state as input. For
 886 creating the backup nodes, this example input can be used::
 887
 888     {
 889       "input": {
 890         "member-voting-state": [
 891           {
 892             "member-name": "member-4",
 893             "voting": false
 894           },
 895           {
 896             "member-name": "member-5",
 897             "voting": false
 898           },
 899           {
 900             "member-name": "member-6",
 901             "voting": false
 902           }
 903         ]
 904       }
 905     }
 906
 907 When an active/backup deployment already exists, with shards on the backup
 908 nodes in non-voting state, all that is needed for a fail-over from the active
 909 data center to backup data center is to flip the voting state of each
 910 shard (on each node, active AND backup). That can be easily achieved with the
 911 following RPC call (no parameters needed)::
 912
 913     POST  /restconf/operations/cluster-admin:flip-member-voting-states-for-all-shards
 914
 915     or
 916
 917     POST /rests/operations/cluster-admin:flip-member-voting-states-for-all-shards
 918
 919 If it's an unplanned outage where the primary voting nodes are down, the
 920 "flip" RPC must be sent to a backup non-voting node. In this case there are no
 921 shard leaders to carry out the voting changes. However there is a special case
 922 whereby if the node that receives the RPC is non-voting and is to be changed
 923 to voting and there's no leader, it will apply the voting changes locally and
 924 attempt to become the leader. If successful, it persists the voting changes
 925 and replicates them to the remaining nodes.
 926
 927 When the primary site is fixed and you want to fail back to it, care must be
 928 taken when bringing the site back up. Because it was down when the voting
 929 states were flipped on the secondary, its persisted database won't contain
 930 those changes. If brought back up in that state, the nodes will think they're
 931 still voting. If the nodes have connectivity to the secondary site, they
 932 should follow the leader in the secondary site and sync with it. However if
 933 this does not happen then the primary site may elect its own leader thereby
 934 partitioning the 2 clusters, which can lead to undesirable results. Therefore
 935 it is recommended to either clean the databases (i.e., ``journal`` and
 936 ``snapshots`` directory) on the primary nodes before bringing them back up or
 937 restore them from a recent backup of the secondary site (see section
 938 :ref:`cluster_backup_restore`).
 939
 940 If is also possible to gracefully remove a node from a cluster, with the
 941 following RPC::
 942
 943     POST  /restconf/operations/cluster-admin:remove-all-shard-replicas
 944
 945     or
 946
 947     POST  /rests/operations/cluster-admin:remove-all-shard-replicas
 948
 949 and example input::
 950
 951     {
 952       "input": {
 953         "member-name": "member-1"
 954       }
 955     }
 956
 957 or just one particular shard::
 958
 959     POST  /restconf/operations/cluster-admin:remove-shard-replica
 960
 961     or
 962
 963     POST  /rests/operations/cluster-admin:remove-shard-replicas
 964
 965 with example input::
 966
 967     {
 968       "input": {
 969         "shard-name": "default",
 970         "member-name": "member-2",
 971         "data-store-type": "config"
 972       }
 973     }
 974
 975 Now that a (potentially dead/unrecoverable) node was removed, another one can
 976 be added at runtime, without changing the configuration files of the healthy
 977 nodes (requiring reboot)::
 978
 979     POST  /restconf/operations/cluster-admin:add-replicas-for-all-shards
 980
 981     or
 982
 983     POST  /rests/operations/cluster-admin:add-replicas-for-all-shards
 984
 985 No input required, but this RPC needs to be sent to the new node, to instruct
 986 it to replicate all shards from the cluster.
 987
 988 .. note::
 989
 990   While the cluster admin API allows adding and removing shards dynamically,
 991   the ``module-shard.conf`` and ``modules.conf`` files are still used on
 992   startup to define the initial configuration of shards. Modifications from
 993   the use of the API are not stored to those static files, but to the journal.
 994
 995 Extra Configuration Options
 996 ---------------------------
 997
 998 ============================================== ================= ======= ==============================================================================================================================================================================
 999 Name                                           Type              Default Description
1000 ============================================== ================= ======= ==============================================================================================================================================================================
1001 max-shard-data-change-executor-queue-size      uint32 (1..max)   1000    The maximum queue size for each shard's data store data change notification executor.
1002 max-shard-data-change-executor-pool-size       uint32 (1..max)   20      The maximum thread pool size for each shard's data store data change notification executor.
1003 max-shard-data-change-listener-queue-size      uint32 (1..max)   1000    The maximum queue size for each shard's data store data change listener.
1004 max-shard-data-store-executor-queue-size       uint32 (1..max)   5000    The maximum queue size for each shard's data store executor.
1005 shard-transaction-idle-timeout-in-minutes      uint32 (1..max)   10      The maximum amount of time a shard transaction can be idle without receiving any messages before it self-destructs.
1006 shard-snapshot-batch-count                     uint32 (1..max)   20000   The minimum number of entries to be present in the in-memory journal log before a snapshot is to be taken.
1007 shard-snapshot-data-threshold-percentage       uint8 (1..100)    12      The percentage of ``Runtime.totalMemory()`` used by the in-memory journal log before a snapshot is to be taken
1008 shard-heartbeat-interval-in-millis             uint16 (100..max) 500     The interval at which a shard will send a heart beat message to its remote shard.
1009 operation-timeout-in-seconds                   uint16 (5..max)   5       The maximum amount of time for akka operations (remote or local) to complete before failing.
1010 shard-journal-recovery-log-batch-size          uint32 (1..max)   5000    The maximum number of journal log entries to batch on recovery for a shard before committing to the data store.
1011 shard-transaction-commit-timeout-in-seconds    uint32 (1..max)   30      The maximum amount of time a shard transaction three-phase commit can be idle without receiving the next messages before it aborts the transaction
1012 shard-transaction-commit-queue-capacity        uint32 (1..max)   20000   The maximum allowed capacity for each shard's transaction commit queue.
1013 shard-initialization-timeout-in-seconds        uint32 (1..max)   300     The maximum amount of time to wait for a shard to initialize from persistence on startup before failing an operation (eg transaction create and change listener registration).
1014 shard-leader-election-timeout-in-seconds       uint32 (1..max)   30      The maximum amount of time to wait for a shard to elect a leader before failing an operation (eg transaction create).
1015 enable-metric-capture                          boolean           false   Enable or disable metric capture.
1016 bounded-mailbox-capacity                       uint32 (1..max)   1000    Max queue size that an actor's mailbox can reach
1017 persistent                                     boolean           true    Enable or disable data persistence
1018 shard-isolated-leader-check-interval-in-millis uint32 (1..max)   5000    the interval at which the leader of the shard will check if its majority followers are active and term itself as isolated
1019 ============================================== ================= ======= ==============================================================================================================================================================================
1020
1021 These configuration options are included in the ``etc/org.opendaylight.controller.cluster.datastore.cfg`` configuration file.