Description of test scenarios
:::::::::::::::::::::::::::::

This is a test plan written around M1 of the Carbon cycle.

During the cycle several limitations were found,
which resulted in tests which implement the scenarios
in ways different from what is described here.

For a list of limitations and differences, see the `caveats page <caveats.html>`_.
For more detailed descriptions of test cases as implemented, see the `test description page <tests.html>`_.
Controller Cluster Service Functional Tests
===========================================
The purpose of functional tests is to establish a known baseline behavior
for basic services exposed to application plugins when the cluster member nodes encounter problems.

The three-node scenarios executed in the tests below need to be repeated for three distinct modes of isolation:

1) JVM freeze, initiated by 'kill -STOP <pid>' on the JVM process,
   followed by 'kill -CONT <pid>' after three minutes. This simulates
   a long-running garbage collection cycle, VM suspension or similar,
   after which the JVM recovers without losing state, with its scheduled timers firing simultaneously.
2) Network-level isolation via firewalling. This simulates a connectivity issue between member nodes,
   while all nodes continue to work as usual. It should be done
   by firewalling all traffic to and from the target node.
3) JVM restart. This simulates a hard error, such as a JVM error, VM reboot, or similar.
   The JVM loses its state and the scenario tests whether the failed node
   is able to resume its operations as a member of the cluster.
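The three isolation modes above can be sketched as the commands a test harness might issue. This is a minimal sketch only; the use of kill(1) and iptables, the PID, and the start script path are illustrative assumptions, not part of the actual test tooling:

```python
# Sketch of the commands a test harness might run for each isolation mode.
# kill(1)/iptables usage and the start script path are assumptions for
# illustration; the real harness may drive these differently.

def jvm_freeze_cmds(pid, pause_minutes=3):
    """Mode 1: freeze the JVM, then resume it after a pause."""
    return [
        ["kill", "-STOP", str(pid)],
        ["sleep", str(pause_minutes * 60)],
        ["kill", "-CONT", str(pid)],
    ]

def network_isolate_cmds(node_ip):
    """Mode 2: firewall all traffic to and from the target node."""
    return [
        ["iptables", "-A", "INPUT", "-s", node_ip, "-j", "DROP"],
        ["iptables", "-A", "OUTPUT", "-d", node_ip, "-j", "DROP"],
    ]

def jvm_restart_cmds(pid, start_script="./bin/start"):
    """Mode 3: hard-kill the JVM and start it again; all state is lost."""
    return [
        ["kill", "-9", str(pid)],
        [start_script],
    ]
```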
The Shard implementation allows a leader to be shut down at run time,
which is expected to perform a clean hand-over to a new leader, elected from the remaining shard members.

Also known as 'the datastore', it provides MVCC transactions and data change notifications.
The goal is to ensure that a shard with an established leader does not flap,
i.e. does not trigger leader movement through crashes or timeouts.
This is performed by having the BGP load generator
inject 1 million prefixes and subsequently remove them.

This test is executed in three scenarios:

+ Three-node, with the shard leader being local
+ Three-node, with the shard leader being remote

Success criteria are:

+ Both injection and removal succeed
+ No transaction errors are reported to the generator
+ No leader movement on the backend
The goal is to ensure that applications do not observe disruption
when a shard leader is shut down cleanly. This is performed by having
a steady-stream producer execute operations against the shard
and initiating a shard leader shutdown, after which the producer is shut down cleanly.

This test is executed in two scenarios:

+ Three-node, with the shard leader being local
+ Three-node, with the shard leader being remote

Success criteria are:

+ No transaction errors occur
+ Producer shuts down cleanly (i.e. all transactions complete successfully)

Test tool: *test-transaction-producer*, running at 1K tps

+ Steady, configurable producer started with:

  + Single transactions (note: these cannot overlap)
  + Configurable transaction rate (i.e. transactions-per-second)
  + Single-operation transactions
  + Random mix across 1M entries
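The producer described above can be sketched as a simple rate-paced loop. This is a minimal illustration, with a plain dict standing in for the shard; the real tool drives the MD-SAL datastore and all names here are assumptions:

```python
import random
import time

def run_producer(store, rate_tps, duration_s, key_space=1_000_000, seed=0):
    """Issue single-operation transactions at a steady rate against `store`.

    `store` is a plain dict standing in for the shard; each transaction
    writes one random key out of `key_space` (the "random mix across 1M
    entries"). Returns the number of transactions completed.
    """
    rng = random.Random(seed)
    interval = 1.0 / rate_tps
    deadline = time.monotonic() + duration_s
    completed = 0
    while time.monotonic() < deadline:
        key = rng.randrange(key_space)
        store[key] = completed          # the single operation per transaction
        completed += 1
        time.sleep(interval)            # steady pacing at rate_tps
    return completed
```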
Explicit Leader Movement
------------------------
The goal is to ensure that applications do not observe disruption
when a shard leader is moved as the result of an explicit application request.
This is performed by having a steady-stream producer execute operations
against the shard and then initiating a shard leader move,
after which the producer is shut down cleanly.

This test is executed in three scenarios:

+ Three-node, with the shard leader being local and becoming remote
+ Three-node, with the shard leader being remote and remaining remote
+ Three-node, with the shard leader being remote and becoming local

Success criteria are:

+ No transaction errors occur
+ Producer shuts down cleanly (i.e. all transactions complete successfully)

Test tool: *test-transaction-producer*, running at 1K tps

Test tool: *test-leader-mover*

+ Uses cds-dom-api to request shard movement
The goal is to ensure the datastore succeeds in a basic isolation/rejoin scenario,
simulating either a network partition or a prolonged GC pause.

This test is executed in the following two scenarios:

+ Three-node, partition heals within TRANSACTION_TIMEOUT
+ Three-node, partition heals after 2*TRANSACTION_TIMEOUT

Using the following steps:

1) Start the test-transaction-producer, running at 1K tps, non-overlapping, from all nodes to a single shard
3) Wait for the followers to initiate an election
5) Wait for the partition to heal
6) Restart the failed producer

Success criteria are:

+ Followers win the election in step 3)
+ No transaction failures occur if the partition is healed within TRANSACTION_TIMEOUT
+ Producer on the old leader works normally after step 6)

Test tool: *test-transaction-producer*
The purpose of this test is to ascertain that the failure modes of cds-access-client work as expected.
This is performed by having a steady stream of transactions flowing from the frontend
and isolating the node hosting the frontend from the rest of the cluster.

This test is executed in two scenarios:

+ Three-node, test-transaction-producer running on a non-leader
+ Three-node, test-transaction-producer running on the leader

Success criteria are:

+ After TRANSACTION_TIMEOUT, transaction failures occur
+ After HARD_TIMEOUT, the client aborts

Test tool: *test-transaction-producer*
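The expected client behaviour can be sketched as a small state function: retrying while isolated, failing individual transactions past TRANSACTION_TIMEOUT, and aborting past HARD_TIMEOUT. The timeout values below are illustrative assumptions, not the actual cds-access-client defaults:

```python
# Sketch of the client-side timeout behaviour the test checks.
# The concrete timeout values are illustrative assumptions.

TRANSACTION_TIMEOUT = 30    # seconds, illustrative
HARD_TIMEOUT = 120          # seconds, illustrative

def client_state(isolated_for):
    """Expected frontend behaviour after being isolated for `isolated_for` seconds."""
    if isolated_for >= HARD_TIMEOUT:
        return "aborted"            # client gives up entirely
    if isolated_for >= TRANSACTION_TIMEOUT:
        return "failing"            # individual transactions time out
    return "retrying"               # still buffering/retrying transparently
```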
The goal is to ensure listeners do not observe disruption when the leader moves.
This is performed by having a steady stream of transactions
being observed by the listeners while the leader moves.

This test is executed in two scenarios:

+ Three-node, test-transaction-listener running on the leader
+ Three-node, test-transaction-listener running on a non-leader

Using the following steps:

+ Start the listener on the target node
+ Start a test-transaction-producer on each node, with 1K tps, non-overlapping data
+ Trigger shard movement by shutting down the shard leader
+ Stop the producers without erasing data

Success criteria are:

+ The listener-internal data tree has to match the data stored in the datastore

Test tool: *test-transaction-listener*

+ Subscribes a DTCL to multiple subtrees (as specified)
+ The DTCL applies reported changes to an internal DataTree
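The listener's bookkeeping can be sketched as follows: mirror every reported change into an internal tree, then compare it against the datastore content at the end. A dict stands in for the DataTree, and the change-event shape is an illustrative assumption:

```python
# Sketch of a listener that mirrors reported data-tree changes into an
# internal dict and is checked against the source store afterwards.
# The event shape (op, key, value) is an illustrative assumption.

class MirroringListener:
    def __init__(self):
        self.tree = {}          # listener-internal "DataTree"
        self.errors = 0

    def on_changes(self, changes):
        """Apply a batch of reported changes, as a DTCL would."""
        for op, key, value in changes:
            if op == "write":
                self.tree[key] = value
            elif op == "delete":
                self.tree.pop(key, None)
            else:
                self.errors += 1

    def matches(self, store):
        """Success criterion: internal tree equals the datastore content."""
        return self.errors == 0 and self.tree == store
```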
Responsible for routing RPC requests to their implementations and routing responses back to the caller.

RPC Provider Precedence
-----------------------
The aim is to establish that remote RPC implementations have lower priority
than local ones, which is to say that any movement of RPCs on remote nodes
does not affect routing as long as a local implementation is available.

The test is executed only in a three-node scenario, using the following steps:

1) Register an RPC implementation on each node
2) Invoke the RPC on each node
3) Unregister the implementation on one node
4) Invoke the RPC on that node
5) Re-register the implementation on that node
6) Invoke the RPC on that node

Success criteria are:

+ Invocation in steps 2) and 6) results in a response from the local node
+ Invocation in step 4) results in a response from one of the other two nodes
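The precedence rule being verified can be sketched as a routing function that always prefers a local implementation and falls back to a remote one only when no local registration exists. The registry layout and the remote tie-break are illustrative assumptions:

```python
# Sketch of the local-first routing rule: a node serves an RPC with its
# own implementation when one is registered, otherwise it falls back to
# a remote provider. Registry layout is an illustrative assumption.

def route_rpc(invoking_node, registry):
    """Return the node whose implementation will serve the call.

    `registry` maps node name -> True if an implementation is registered.
    """
    providers = [node for node, registered in registry.items() if registered]
    if not providers:
        raise LookupError("no implementation available")
    if invoking_node in providers:
        return invoking_node            # local implementation wins
    return sorted(providers)[0]         # any remote provider (illustrative pick)
```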
RPC Provider Partition and Heal
-------------------------------
This test establishes that the RPC service operates correctly when faced with node failures.

The test is executed only in a three-node scenario, using the following steps:

1) Register an RPC implementation on two nodes
2) Invoke the RPC on each node
3) Isolate one of the nodes where the RPC is registered
4) Invoke the RPC on each node
5) Un-isolate the node
6) Invoke the RPC on all nodes

Success criteria are:

+ Step 2) routes the RPC to the nearest node (local or remote)
+ Step 4) works, routing the RPC request to the implementation in the same partition
+ Step 6) routes the RPC to the nearest node (local or remote)
Action Provider Precedence
--------------------------
The aim is to establish that remote action implementations have lower priority than local ones,
which is to say that any movement of actions on remote nodes does not affect routing
as long as a local implementation is available.

The test is executed only in a three-node scenario, using the following steps:

1) Register an action implementation on each node
2) Invoke the action on each node
3) Unregister the implementation on one node
4) Invoke the action on that node
5) Re-register the implementation on that node
6) Invoke the action on that node

Success criteria are:

+ Invocation in steps 2) and 6) results in a response from the local node
+ Invocation in step 4) results in a response from one of the other two nodes
Action Provider Partition and Heal
----------------------------------
This test establishes that the RPC service for actions operates correctly when faced with node failures.

The test is executed only in a three-node scenario, using the following steps:

1) Register an action implementation on two nodes
2) Invoke the action on each node
3) Isolate one of the nodes where the action is registered
4) Invoke the action on each node
5) Un-isolate the node
6) Invoke the action on all nodes

Success criteria are:

+ Step 2) routes the action request to the nearest node (local or remote)
+ Step 4) works, routing the action request to the implementation in the same partition
+ Step 6) routes the action request to the nearest node (local or remote)
DOMNotificationBroker
^^^^^^^^^^^^^^^^^^^^^
Provides routing of YANG notifications from publishers to subscribers.

No-Loss Rate
------------
The purpose of this test is to determine whether the broker can forward messages without loss.
We do this on a single-node setup by incrementally adding publishers and subscribers.

This test is executed in one scenario:

+ Start a test-notification-subscriber
+ Start a test-notification-publisher at 5K notifications/sec
+ Run for 5 minutes, verifying that no notifications are lost
+ Add another publisher/subscriber pair and repeat, up to a rate of 60K notifications/sec

Success criteria are:

+ No notifications are lost at a rate of 60K notifications/sec

Test tool: *test-notification-publisher*

+ Publishes notifications containing an instance id and a sequence number
+ Configurable rate (i.e. notifications-per-second)

Test tool: *test-notification-subscriber*

+ Subscribes to the specified notifications from the publisher
+ Verifies notification sequence numbers
+ Records the total number of notifications received and the number of sequence errors
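The subscriber's loss detection can be sketched as per-publisher sequence tracking: any gap in the sequence numbers counts as an error. The notification shape (instance id, sequence number) follows the publisher description above; everything else is an illustrative assumption:

```python
# Sketch of the subscriber-side bookkeeping: count received notifications
# and flag sequence-number gaps per publisher instance.

class SequenceChecker:
    def __init__(self):
        self.received = 0
        self.sequence_errors = 0
        self._last_seq = {}     # instance_id -> last sequence number seen

    def on_notification(self, instance_id, seq):
        self.received += 1
        last = self._last_seq.get(instance_id)
        if last is not None and seq != last + 1:
            self.sequence_errors += 1   # a gap means lost (or reordered) messages
        self._last_seq[instance_id] = seq
```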
The Cluster Singleton service is designed to ensure that
only one instance of an application is registered globally in the cluster.

The goal is to establish that the service operates correctly in the face of application registrations changing
without moving the active instance.

The test is performed in a three-node cluster using the following steps:

1) Register a candidate on each node
2) Wait for master activation
3) Remove a non-master candidate
5) Restore the removed candidate

Success criteria are:

+ After step 2) there is exactly one master in the cluster
+ The master does not move to a different node for the duration of the test
The goal is to establish that the service operates correctly in the face of node failures.

The test is performed in a three-node cluster using the following steps:

1) Register a candidate on each node
2) Wait for master activation
3) Isolate the master node
5) Un-isolate the (former) master node

Success criteria are:

+ After step 3), the master instance is brought down on the isolated node
+ During step 4), the majority partition elects a new master
+ Until step 5) occurs, the old master remains deactivated
+ After step 6), the old master remains deactivated
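The rule this test verifies can be sketched as majority-based mastership: only a partition holding more than half of the cluster may host the active singleton, and the previous master keeps the role only while it sits in that majority. The partition modelling and tie-break below are illustrative assumptions:

```python
# Sketch of majority-partition singleton election. Node names and the
# deterministic tie-break are illustrative assumptions.

def elect_master(partitions, cluster_size, preferred=None):
    """Return the node acting as master given the current partitions.

    `partitions` is a list of sets of node names. Only a partition with
    more than half of `cluster_size` members may host the master; the
    previous master (`preferred`) keeps the role only if it is inside
    that majority partition.
    """
    for part in partitions:
        if len(part) > cluster_size // 2:
            if preferred in part:
                return preferred
            return sorted(part)[0]      # deterministic illustrative pick
    return None                         # no majority -> no active master
```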
Chasing the Leader
------------------
This test aims to establish that the service operates correctly
when faced with rapid application transitions, without the application ever stabilizing.

The test is performed in a three-node setup using the following steps:

1) Register a candidate on each node
2) Wait for master activation
3) The newly activated master unregisters itself

Success criteria are:

+ No failures occur for 5 minutes
+ The transition speed is at least 100 movements per second
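The "chasing" pattern above can be sketched as follows: whichever candidate wins the election immediately unregisters, forcing the mastership to keep moving. The round-robin hand-off and all names are illustrative assumptions, not the service's actual election order:

```python
# Sketch of the "chasing the leader" pattern, modelled as a simple
# round-robin hand-off of mastership (an illustrative assumption).

def chase(nodes, movements):
    """Simulate `movements` master transitions across `nodes`.

    Each activation is immediately followed by unregistration, so the
    mastership keeps moving. Returns per-node activation counts.
    """
    counts = {n: 0 for n in nodes}
    idx = 0
    for _ in range(movements):
        master = nodes[idx % len(nodes)]   # this candidate wins the election
        counts[master] += 1
        idx += 1                           # master unregisters; next one wins
    return counts
```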
Controller Cluster Services Longevity Tests
===========================================

1) Run the No-Loss Rate test for 24 hours. No message loss, instability or memory leaks may occur.
2) Repeat the Leader Stability test for 24 hours. No transaction failures, instability, leader movement or memory leaks may occur.
3) Repeat the Explicit Leader Movement test for 24 hours. No transaction failures, instability, leader movement or memory leaks may occur.
4) Repeat the RPC Provider Precedence test for 24 hours. No failures or memory leaks may occur.
5) Repeat the RPC Provider Partition and Heal test for 24 hours. No failures or memory leaks may occur.
6) Repeat the Chasing the Leader test for 24 hours. No memory leaks or failures may occur.
7) Repeat the Partition and Heal test for 24 hours. No memory leaks or failures may occur.
Netconf is an MD-SAL application that listens for config datastore changes
and registers a singleton for every configured device. The instantiated singleton updates the device connection data
in the operational datastore, maintains a mount point and handles access to the mounted device.

Basic configuration and mount point access
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No disruptions; ordinary netconf operation with restconf calls directed to different cluster members.

The test is executed in a three-node scenario, using the following steps:

1) Configure a connection to the test device on member-1.
2) Create, update and delete data on the device using calls to member-2.
3) Each state change is confirmed by reading the device data on member-3.
4) De-configure the device connection.

Success criteria are:

+ All reads confirm the data operations are applied correctly.
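The steps above can be sketched as a call plan showing which cluster member each RESTCONF request targets. The URL shapes below are illustrative assumptions modelled on the netconf-topology layout, not the exact paths used by the test suite:

```python
# Sketch of the call plan for the basic configuration test: one
# (method, member, path) tuple per step. Paths are illustrative
# assumptions, not the exact RESTCONF URLs of the test suite.

def call_plan(device="test-device"):
    base = "/restconf/config/network-topology:network-topology/topology/topology-netconf"
    mount = f"{base}/node/{device}/yang-ext:mount"
    return [
        ("PUT",    "member-1", f"{base}/node/{device}"),  # 1) configure connection
        ("POST",   "member-2", mount),                    # 2) create data on device
        ("GET",    "member-3", mount),                    # 3) confirm by reading
        ("DELETE", "member-1", f"{base}/node/{device}"),  # 4) de-configure
    ]
```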
Killing the current device owner leads to the election of a new owner; operations continue to be applied.

The test is performed in a three-node cluster using the following steps:

1) Configure a connection to the test device on member-1.
2) Create data on the device using a call to member-2.
3) Locate and kill the device owner member.
4) Wait for a new owner to be elected.
5) Update data on the device using a call to one of the surviving members.
6) Restart the killed member.
7) Update the data again using a call to the restarted member.

Success criteria are:

+ Each operation (including the restart) is confirmed by reads on all members currently up.
Each member is restarted in succession (the start waits for cluster sync);
this guarantees that each leader is affected.

The test is performed in a three-node cluster using the following steps:

1) Configure a connection to the test device on member-1.
3) Create data on the device using a call to member-2.
6) Update data on the device using a call to member-3.
9) Delete data on the device using a call to member-1.

Success criteria are:

+ After every operation, reads on both living members confirm it was applied.
+ After every start, a read on the started node confirms it sees the device data from the previous operation.