docs/cluster/carbon/scenarios.rst

   1
   2 Description of test scenarios
   3 :::::::::::::::::::::::::::::
   4
   5 This is a test plan written around M1 of the Carbon cycle.
   6
   7 During the cycle, several limitations were found,
   8 which resulted in tests that implement the scenarios
   9 in ways different from what is described here.
  10
  11 For a list of limitations and differences, see `caveats page <caveats.html>`_.
  12 For more detailed descriptions of test cases as implemented, see `test description page <tests.html>`_.
  13
  14 Controller Cluster Service Functional Tests
  15 ===========================================
  16 The purpose of functional tests is to establish a known baseline behavior
  17 for basic services exposed to application plugins when the cluster member nodes encounter problems.
  18
  19 Isolation Mechanics
  20  Three-node scenarios executed in tests below need to be repeated for three distinct modes of isolation:
  21
  22  1) JVM freeze, initiated by 'kill -STOP <pid>' on the JVM process,
  23     followed by a 'kill -CONT <pid>' after three minutes. This simulates
  24     a long-running garbage collection cycle, VM suspension or similar,
  25     after which the JVM recovers without losing state, with scheduled timers going off simultaneously.
  26  2) Network-level isolation via firewalling. Simulates a connectivity issue between member nodes,
  27     while all nodes continue to work as usual. This should be done
  28     by firewalling all traffic to and from the target node.
  29  3) JVM restart. This simulates a hard error, such as a JVM error, VM reboot, or similar.
  30     The JVM loses its state and the scenario tests whether the failed node
  31     is able to resume its operations as a member of the cluster.
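
The JVM freeze mode can be scripted in a few lines. The following is a minimal sketch, assuming a POSIX host and that the harness already knows the JVM process id. The name ``freeze_process`` and its ``kill`` and ``sleep`` parameters are hypothetical; they exist only so the sequence can be dry-run without touching a real process.

```python
import os
import signal
import time


def freeze_process(pid, pause_seconds=180.0, kill=os.kill, sleep=time.sleep):
    """Suspend a process with SIGSTOP, wait, then resume it with SIGCONT.

    `kill` and `sleep` are injectable (an assumption of this sketch) so the
    call sequence can be checked without a real JVM.
    """
    kill(pid, signal.SIGSTOP)      # same effect as 'kill -STOP <pid>'
    try:
        sleep(pause_seconds)       # three minutes by default
    finally:
        kill(pid, signal.SIGCONT)  # same effect as 'kill -CONT <pid>'
```

The ``finally`` clause makes sure the process is resumed even if the wait is interrupted, so an aborted test run does not leave a frozen JVM behind.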
  32
  33 Leader Shutdown
  34  The Shard implementation allows a leader to be shut down at run time,
  35  which is expected to perform a clean hand over to a new leader, elected from the remaining shard members.
  36
  37 DOMDataBroker
  38 ^^^^^^^^^^^^^
  39 Also known as 'the datastore', provides MVCC transactions and data change notifications.
  40
  41 Leader Stability
  42 ----------------
  43 The goal is to ensure that a single-established shard does not flap,
  44 i.e. does not trigger leader movement by causing crashes or timeouts.
  45 This is performed by having the BGP load generator
  46 run injection of 1 million prefixes, followed by their removal.
  47
  48 This test is executed in three scenarios:
  49
  50 + Single node
  51 + Three-node, with shard leader being local
  52 + Three-node, with shard leader being remote
  53
  54 Success criteria are:
  55
  56 + Both injection and removal succeed
  57 + No transaction errors reported to the generator
  58 + No leader movement on the backend
  59
  60 Clean Leader Shutdown
  61 ---------------------
  62 The goal is to ensure that applications do not observe disruption
  63 when a shard leader is shut down cleanly. This is performed by having
  64 a steady-stream producer execute operations against the shard
  65 and then initiate leader shard shutdown, then the producer is shut down cleanly.
  66
  67 This test is executed in two scenarios:
  68
  69 + Three-node, with shard leader being local
  70 + Three-node, with shard leader being remote
  71
  72 Success criteria are:
  73
  74 + No transaction errors occur
  75 + Producer shuts down cleanly (i.e. all transactions complete successfully)
  76
  77 Test tool: *test-transaction-producer*, running at 1K tps
  78
  79 + Steady, configurable producer started with:
  80
  81  + A transaction chain
  82  + Single transactions (note: these cannot overlap)
  83
  84 + Configurable transaction rate (i.e. transactions-per-second)
  85 + Single-operation transactions
  86 + Random mix across 1M entries
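
To make the producer requirements concrete, the following is a rough sketch of a steady-rate loop. It is not the actual *test-transaction-producer*. The function ``produce``, the ``submit`` callback and the put/delete mix are assumptions made for illustration only.

```python
import random
import time


def produce(submit, tps=1000, duration_s=1.0, key_space=1_000_000,
            rng=None, clock=time.monotonic, sleep=time.sleep):
    """Submit single-operation transactions at a steady rate.

    `submit(op, key)` is a hypothetical callback standing in for a
    datastore write. Returns the number of transactions submitted.
    """
    rng = rng or random.Random()
    interval = 1.0 / tps
    total = int(tps * duration_s)
    start = clock()
    for i in range(total):
        # Pace against the overall schedule, not the previous send,
        # so small delays do not accumulate into rate drift.
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        op = rng.choice(("put", "delete"))    # assumed operation mix
        submit(op, rng.randrange(key_space))  # random entry out of 1M
    return total
```

Injecting ``clock`` and ``sleep`` keeps the pacing logic testable without waiting in real time.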
  87
  88 Explicit Leader Movement
  89 ------------------------
  90 The goal is to ensure that applications do not observe disruption
  91 when a shard leader is moved as the result of an explicit application request.
  92 This is performed by having a steady-stream producer execute operations
  93 against the shard and then initiate shard leader shutdown,
  94 then the producer is shut down cleanly.
  95
  96 This test is executed in three scenarios:
  97
  98 + Three-node, with shard leader being local and becoming remote
  99 + Three-node, with shard leader being remote and remaining remote
 100 + Three-node, with shard leader being remote and becoming local
 101
 102 Success criteria are:
 103
 104 + No transaction errors occur
 105 + Producer shuts down cleanly (i.e. all transactions complete successfully)
 106
 107 Test tool: test-transaction-producer, running at 1K tps
 108 Test tool: *test-leader-mover*
 109
 110 + Uses cds-dom-api to request shard movement
 111
 112 Leader Isolation
 113 ----------------
 114 The goal is to ensure the datastore succeeds in a basic isolation/rejoin scenario,
 115 simulating either a network partition, or a prolonged GC pause.
 116
 117 This test is executed in the following two scenarios:
 118
 119 + Three-node, partition heals within TRANSACTION_TIMEOUT
 120 + Three-node, partition heals after 2*TRANSACTION_TIMEOUT
 121
 122 Using the following steps:
 123
 124 1) Start test-transaction-producer, running at 1K tps, non-overlapping, from all nodes to a single shard
 125 2) Isolate leader
 126 3) Wait for followers to initiate election
 127 4) Un-isolate leader
 128 5) Wait for partition to heal
 129 6) Restart failed producer
 130
 131 Success criteria:
 132
 133 + Followers win election in step 3)
 134 + No transaction failures occur if the partition is healed within TRANSACTION_TIMEOUT
 135 + Producer on old leader works normally after step 6)
 136
 137 Test tool: test-transaction-producer
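
The success criteria above depend on how long the partition lasted. A small verdict helper could encode that, as sketched below. The function name and its parameters are hypothetical, and treating "within TRANSACTION_TIMEOUT" as less-than-or-equal is an assumption.

```python
def leader_isolation_passed(heal_after_s, transaction_timeout_s,
                            followers_won_election, transaction_failures,
                            restarted_producer_ok):
    """Evaluate one Leader Isolation run against the stated criteria."""
    # Followers must elect a new leader, and the producer on the old
    # leader must work after it is restarted, in every scenario.
    if not followers_won_election or not restarted_producer_ok:
        return False
    # Transaction failures are only forbidden when the partition
    # heals within the timeout (assumed inclusive).
    if heal_after_s <= transaction_timeout_s:
        return transaction_failures == 0
    return True
```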
 138
 139 Client Isolation
 140 ----------------
 141 The purpose of this test is to ascertain that the failure modes of cds-access-client work as expected.
 142 This is performed by having a steady stream of transactions flowing from the frontend
 143 and isolating the node hosting the frontend from the rest of the cluster.
 144
 145 This test is executed in two scenarios:
 146
 147 + Three-node, test-transaction-producer running on a non-leader
 148 + Three-node, test-transaction-producer running on the leader
 149
 150 Success criteria:
 151
 152 + After TRANSACTION_TIMEOUT, failures occur
 153 + After HARD_TIMEOUT, the client aborts
 154
 155 Test tool: test-transaction-producer
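
The two timeouts define a simple timeline for an isolated client. The sketch below maps elapsed isolation time to the expected behaviour. The state names are invented for illustration, and it assumes that HARD_TIMEOUT is longer than TRANSACTION_TIMEOUT and that no failures are expected before the first timeout.

```python
def expected_client_state(isolated_for_s, transaction_timeout_s, hard_timeout_s):
    """Return the behaviour expected of an isolated frontend client."""
    if hard_timeout_s <= transaction_timeout_s:
        raise ValueError("sketch assumes HARD_TIMEOUT > TRANSACTION_TIMEOUT")
    if isolated_for_s >= hard_timeout_s:
        return "aborted"   # client gives up entirely
    if isolated_for_s >= transaction_timeout_s:
        return "failing"   # transactions start to fail
    return "waiting"       # assumed: nothing reported yet
```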
 156
 157 Listener Isolation
 158 ------------------
 159 The goal is to ensure listeners do not observe disruption when the leader moves.
 160 This is performed by having a steady stream of transactions
 161 being observed by the listeners and having the leader move.
 162
 163 This test is executed in two scenarios:
 164
 165 + Three-node, test-transaction-listener running on the leader
 166 + Three-node, test-transaction-listener running on a non-leader
 167
 168 Using these steps:
 169
 170 + Start the listener on target node
 171 + Start test-transaction-producer on each node, with 1K tps, non-overlapping data
 172 + Trigger shard movement by shutting down shard leader
 173 + Stop producers without erasing data
 174 + Stop listener
 175
 176 Success criteria:
 177
 178 + Listener-internal data tree has to match data stored in the data tree
 179
 180 Test tool: *test-transaction-listener*
 181
 182 + Subscribes a DTCL to multiple subtrees (as specified)
 183 + DTCL applies reported changes to an internal DataTree
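
The listener check boils down to replaying reported changes into a private copy and comparing it with the datastore at the end. A minimal sketch follows, modelling the data tree as a flat ``dict`` keyed by path. The ``(op, path, value)`` change format and the name ``apply_changes`` are assumptions, not the real DTCL API.

```python
def apply_changes(tree, changes):
    """Apply (op, path, value) changes to a dict standing in for a DataTree."""
    for op, path, value in changes:
        if op == "write":
            tree[path] = value
        elif op == "delete":
            tree.pop(path, None)  # deleting a missing path is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return tree
```

After the producers stop, the success criterion corresponds to checking that the listener's tree equals a snapshot read from the datastore.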
 184
 185 DOMRpcBroker
 186 ^^^^^^^^^^^^
 187 Responsible for routing RPC requests to their implementations and routing responses back to the caller.
 188
 189 RPC Provider Precedence
 190 -----------------------
 191 The aim is to establish that remote RPC implementations have lower priority
 192 than local ones, which is to say that any movement of RPCs on remote nodes
 193 does not affect routing as long as a local implementation is available.
 194
 195 Test is executed only in a three-node scenario, using the following steps:
 196
 197 1) Register an RPC implementation on each node
 198 2) Invoke RPC on each node
 199 3) Unregister implementation on one node
 200 4) Invoke RPC on that node
 201 5) Re-register implementation on that node
 202 6) Invoke RPC on that node
 203
 204 Success criteria:
 205
 206 + Invocation in steps 2) and 6) results in a response from local node
 207 + Invocation in step 4) results in a response from one of the other two nodes
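
The precedence rule can be expressed as a tiny routing function. The sketch below is illustrative only: ``route_rpc`` is a hypothetical name, and picking the first remote provider in sorted order is an arbitrary choice, since the plan only requires "one of the other two nodes".

```python
def route_rpc(caller, providers):
    """Return the node expected to answer an RPC invoked on `caller`, or None."""
    if caller in providers:
        return caller            # a local implementation always wins
    remote = sorted(providers)   # arbitrary but deterministic remote pick
    return remote[0] if remote else None
```

Walking through the steps: with all three nodes registered, every call is answered locally; after unregistering on one node, that node's call goes to one of the other two; after re-registering, it is local again.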
 208
 209 RPC Provider Partition and Heal
 210 -------------------------------
 211 This test establishes that the RPC service operates correctly when faced with node failures.
 212
 213 Test is executed only in a three-node scenario, using the following steps:
 214
 215 1) Register an RPC implementation on two nodes
 216 2) Invoke RPC on each node
 217 3) Isolate one of the nodes where RPC is registered
 218 4) Invoke RPC on each node
 219 5) Un-isolate the node
 220 6) Invoke RPC on all nodes
 221
 222 Success criteria:
 223
 224 + Step 2) routes the RPC to the nearest node (local or remote)
 225 + Step 4) works, routing the RPC request to the implementation in the same partition
 226 + Step 6) routes the RPC to the nearest node (local or remote)
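
The expected responder for each invocation can be derived from the partition layout. The following sketch assumes partitions are given as disjoint sets of node names and that every caller belongs to exactly one of them. ``expected_responders`` is a hypothetical helper, not part of any existing test suite.

```python
def expected_responders(caller, providers, partitions):
    """Return the set of nodes allowed to answer an RPC invoked on `caller`.

    `partitions` is a list of disjoint node sets; a node can only reach
    providers inside its own set.
    """
    side = next(p for p in partitions if caller in p)
    reachable = providers & side
    if caller in reachable:
        return {caller}   # nearest node is the local one
    return reachable      # otherwise any reachable remote provider
```

With providers on two of three nodes, isolating one provider leaves the majority side with a single reachable implementation, while the isolated node keeps using its own.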
 227
 228 Action Provider Precedence
 229 --------------------------
 230 The aim is to establish that remote action implementations have lower priority than local ones,
 231 which is to say that any movement of actions on remote nodes does not affect routing
 232 as long as a local implementation is available.
 233
 234 Test is executed only in a three-node scenario, using the following steps:
 235
 236 1) Register an action implementation on each node
 237 2) Invoke action on each node
 238 3) Unregister implementation on one node
 239 4) Invoke action on that node
 240 5) Re-register implementation on that node
 241 6) Invoke action on that node
 242
 243 Success criteria:
 244
 245 + Invocation in steps 2) and 6) results in a response from local node
 246 + Invocation in step 4) results in a response from one of the other two nodes
 247
 248 Action Provider Partition and Heal
 249 ----------------------------------
 250 This test establishes that the RPC service for actions operates correctly when faced with node failures.
 251
 252 Test is executed only in a three-node scenario, using the following steps:
 253
 254 1) Register an action implementation on two nodes
 255 2) Invoke action on each node
 256 3) Isolate one of the nodes where the action is registered
 257 4) Invoke action on each node
 258 5) Un-isolate the node
 259 6) Invoke action on all nodes
 260
 261 Success criteria:
 262
 263 + Step 2) routes the action request to the nearest node (local or remote)
 264 + Step 4) works, routing the action request to the implementation in the same partition
 265 + Step 6) routes the action request to the nearest node (local or remote)
 266
 267 DOMNotificationBroker
 268 ^^^^^^^^^^^^^^^^^^^^^
 269 Provides routing of YANG notifications from publishers to subscribers.
 270
 271 No-loss rate
 272 ------------
 273 The purpose of this test is to determine whether the broker can forward messages without loss.
 274 We do this on a single-node setup by incrementally adding publishers and subscribers.
 275
 276 This test is executed in one scenario:
 277
 278 + Single-node
 279
 280 Steps:
 281
 282 + Start test-notification-subscriber
 283 + Start test-notification-publisher at 5K notifications/sec
 284 + Run for 5 minutes, verify no notifications lost
 285 + Add another publisher/subscriber pair and repeat, until a rate of 60K notifications/sec is reached
 286
 287 Success criteria:
 288
 289 + No notifications lost at rate of 60K notifications/sec
 290
 291 Test tool: *test-notification-publisher*
 292
 293 + Publishes notifications containing instance id and sequence number
 294 + Configurable rate (i.e. notifications-per-second)
 295
 296 Test tool: *test-notification-subscriber*
 297
 298 + Subscribes to specified notifications from publisher
 299 + Verifies notification sequence numbers
 300 + Records total number of notifications received and number of sequence errors
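
The subscriber's bookkeeping can be sketched as a small class that tracks the next expected sequence number per publisher instance. The class and method names are hypothetical, and counting every mismatch as a single sequence error is an assumption about how errors are tallied.

```python
class SequenceChecker:
    """Count received notifications and sequence errors per publisher."""

    def __init__(self):
        self.received = 0
        self.sequence_errors = 0
        self._next = {}  # publisher instance id -> next expected sequence

    def on_notification(self, publisher_id, sequence):
        self.received += 1
        expected = self._next.get(publisher_id)
        # The first notification from a publisher sets the baseline.
        if expected is not None and sequence != expected:
            self.sequence_errors += 1
        self._next[publisher_id] = sequence + 1
```

A run with no lost notifications ends with ``sequence_errors == 0`` and ``received`` equal to the number of notifications published.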
 301
 302 Cluster Singleton
 303 ^^^^^^^^^^^^^^^^^
 304 Cluster Singleton service is designed to ensure that
 305 only one instance of an application is registered globally in the cluster.
 306
 307 Master Stability
 308 ----------------
 309 The goal is to establish that the service operates correctly in the face of application registration changing
 310 without moving the active instance.
 311
 312 The test is performed in a three-node cluster using the following steps:
 313
 314 1) Register candidate on each node
 315 2) Wait for master activation
 316 3) Remove non-master candidate
 317 4) Wait one minute
 318 5) Restore the removed candidate
 319
 320 Success criteria:
 321
 322 + After step 2) there is exactly one master in the cluster
 323 + The master does not move to a different node for the duration of the test
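
Both criteria can be checked from periodic samples of which nodes report an active instance. The sketch below assumes the harness collects such samples as sets of node names starting after step 2); the sampling mechanism and the name ``master_is_stable`` are not part of the plan.

```python
def master_is_stable(samples):
    """Return True if every sample shows exactly one master, always the same node.

    `samples` is a list of sets of node names reporting an active instance.
    """
    owners = set()
    for active in samples:
        if len(active) != 1:  # no master, or more than one
            return False
        owners |= active
    return len(owners) == 1   # also False when there are no samples
```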
 324
 325 Partition and Heal
 326 ------------------
 327 The goal is to establish that the service operates correctly in the face of node failures.
 328
 329 The test is performed in a three-node cluster using the following steps:
 330
 331 1) Register candidate on each node
 332 2) Wait for master activation
 333 3) Isolate master node
 334 4) Wait two minutes
 335 5) Un-isolate (former) master node
 336 6) Wait one minute
 337
 338 Success criteria:
 339
 340 + After step 3), master instance is brought down on isolated node
 341 + During step 4) majority partition elects a new master
 342 + Until 5) occurs, old master remains deactivated
 343 + After 6) old master remains deactivated
 344
 345 Chasing the Leader
 346 ------------------
 347 This test aims to establish the service operates correctly
 348 when faced with rapid application transitions without having a stabilized application.
 349
 350 This test is performed in a three-node setup using the following steps:
 351
 352 1) Register a candidate on each node
 353 2) Wait for master activation
 354 3) Newly activated master unregisters itself
 355 4) Repeat from step 2)
 356
 357 Success criteria:
 358
 359 + No failures occur for 5 minutes
 360 + Transition speed is at least 100 movements per second
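
The transition speed criterion can be computed from the timestamps at which successive masters became active. This is a minimal sketch, assuming the harness records those timestamps in seconds and in increasing order. The function name is hypothetical.

```python
def movements_per_second(activation_times):
    """Return the average number of master movements per second."""
    if len(activation_times) < 2:
        return 0.0  # a single activation is not a movement
    elapsed = activation_times[-1] - activation_times[0]
    if elapsed <= 0:
        raise ValueError("activation times must be increasing")
    # N activations correspond to N - 1 movements between them.
    return (len(activation_times) - 1) / elapsed
```

A run would meet the stated criterion when the returned value is at least 100 over the 5-minute window.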
 361
 362 Controller Cluster Services Longevity Tests
 363 ===========================================
 364
 365 1) Run No-Loss Rate test for 24 hours. No message loss, instability or memory leaks may occur.
 366 2) Repeat Leader Stability test for 24 hours. No transaction failures, instability, leader movement or memory leaks may occur.
 367 3) Repeat Explicit Leader Movement test for 24 hours. No transaction failures, instability, leader movement or memory leaks may occur.
 368 4) Repeat RPC Provider Precedence test for 24 hours. No failures or memory leaks may occur.
 369 5) Repeat RPC Provider Partition and Heal test for 24 hours. No failures or memory leaks may occur.
 370 6) Repeat Chasing the Leader test for 24 hours. No memory leaks or failures may occur.
 371 7) Repeat Partition and Heal test for 24 hours. No memory leaks or failures may occur.
 372
 373 NETCONF System Tests
 374 ====================
 375 NETCONF is an MD-SAL application which listens to config datastore changes
 376 and registers a singleton for every configured device. The instantiated singleton updates device connection data
 377 in the operational datastore, maintains a mount point and handles access to the mounted device.
 378
 379 Basic configuration and mount point access
 380 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 381 No disruptions, ordinary netconf operation with restconf calls to different cluster members.
 382
 383 Test is executed in a three-node scenario, using the following steps:
 384
 385 1) Configure connection to test device on member-1.
 386 2) Create, update and delete data on the device using calls to member-2.
 387 3) Confirm each state change by reading device data on member-3.
 388 4) De-configure the device connection.
 389
 390 Success criteria:
 391
 392 + All reads confirm data operations are applied correctly.
 393
 394 Device owner killed
 395 ^^^^^^^^^^^^^^^^^^^
 396 Killing the current device owner leads to the election of a new owner. Operations are still applied.
 397
 398 The test is performed in a three-node cluster using the following steps:
 399
 400 1) Configure connection to test device on member-1.
 401 2) Create data on the device using a call to member-2.
 402 3) Locate and kill the device owner member.
 403 4) Wait for a new owner to get elected.
 404 5) Update data on the device using a call to one of the surviving members.
 405 6) Restart the killed member.
 406 7) Update the data again using a call to the restarted member.
 407
 408 Success criteria:
 409
 410 + Each operation (including restart) is confirmed by reads on all members currently up.
 411
 412 Rolling restarts
 413 ^^^^^^^^^^^^^^^^
 414 Each member is restarted in succession (each start waits for cluster sync);
 415 this is to guarantee that each Leader is affected.
 416
 417 The test is performed in a three-node cluster using the following steps:
 418
 419 1)  Configure connection to test device on member-1.
 420 2)  Kill member-1.
 421 3)  Create data on the device using a call to member-2.
 422 4)  Start member-1.
 423 5)  Kill member-2.
 424 6)  Update data on the device using a call to member-3.
 425 7)  Start member-2.
 426 8)  Kill member-3.
 427 9)  Delete data on the device using a call to member-1.
 428 10) Start member-3.
 429
 430 Success criteria:
 431
 432 + After every operation, reads on both living members confirm it was applied.
 433 + After every start, a read on the started node confirms it sees the device data from the previous operation.
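
The rolling-restart steps follow a regular pattern: kill one member, perform the next data operation through the following member, then start the killed member again. A short generator makes the pattern explicit. This is an illustrative sketch and ``rolling_restart_plan`` is a hypothetical helper, not an existing test keyword.

```python
def rolling_restart_plan(members, operations=("Create", "Update", "Delete")):
    """Build the ordered list of rolling-restart steps as plain strings."""
    steps = [f"Configure connection to test device on {members[0]}."]
    for i, (victim, op) in enumerate(zip(members, operations)):
        # The operation is sent to the next member, wrapping around at the end.
        caller = members[(i + 1) % len(members)]
        steps.append(f"Kill {victim}.")
        steps.append(f"{op} data on the device using a call to {caller}.")
        steps.append(f"Start {victim}.")
    return steps
```

Called with ``["member-1", "member-2", "member-3"]``, it yields the ten steps listed above in the same order.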