docs/cluster/carbon/caveats.rst

   1 =======
   2 Caveats
   3 =======
   4
   5 This sub-page describes how the test implementation (or its results) differs
   6 from the `original specification <scenarios.html>`_ and what information motivates each difference.
   7
   8 Jenkins job structure
   9 ~~~~~~~~~~~~~~~~~~~~~
  10
  11 + Information
  12
  13 At the start of test implementation, all the Controller 3node test cases were added into an existing Jenkins job.
  14
  15 During test development it became clear that adding all possible tests would make the job run too long.
  16
  17 Dividing the job into several smaller ones is possible, but most likely the history would be lost,
  18 unless Linux Foundation admins figure out a way to create multiple job clones with history copied.
  19
  20 + Testing consequence
  21
  22 Even with the number of test cases reduced (see below), the job duration is around three and a half hours.
  23
  24 + How to fix
  25
  26 After the Carbon SR2 release, the jobs can be split, as there will be enough time
  27 to generate new history before Carbon SR3.
  28
  29 Akka bugs
  30 ~~~~~~~~~
  31
  32 These are bugs which need either a fix in the Akka codebase,
  33 or a workaround which would be too time-consuming to implement in ODL.
  34
  35 Both bugs manifest as an UnreachableMember event (without intentional isolation).
  36
  37 Slow heartbeats
  38 ---------------
  39
  40 + Information
  41
  42 Akka sends periodic heartbeats in order to detect when the other member is unresponsive.
  43
  44 The heartbeats are serialized into the same TCP channel as ordinary data,
  45 which means that if ODL is processing a large amount of data, the heartbeats can spend a long time
  46 in TCP (or other) buffers before being processed. When this time exceeds a specific value
  47 (currently 6 seconds), the peer member is declared unreachable, generally leading to leader movement.
  48
  49 This affects BGP test results on a 3node setup, as ODL is processing BGP data as quickly as possible,
  50 but the current BGP implementation does not handle rib owner movement gracefully (and leader movement
  51 is explicitly checked by the test, as the scenario dictates it should not happen).
  52 This does not affect other data broker tests, as 1000 transactions per second do not generate critical throughput.
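
The delay mechanism described above can be illustrated with a rough model. This is only a sketch: it assumes a heartbeat simply waits until all data queued ahead of it has drained at a constant rate, and the byte counts used in the usage note are hypothetical, not measurements from the test jobs.

```python
# Hypothetical model: a heartbeat enqueued behind bulk data waits until
# the bytes ahead of it have drained from the shared TCP stream.
UNREACHABLE_AFTER_S = 6.0  # the threshold quoted in the text above


def heartbeat_delay_s(queued_bytes: int, drain_bytes_per_s: float) -> float:
    """Time a heartbeat spends waiting behind already queued data."""
    if drain_bytes_per_s <= 0:
        raise ValueError("drain rate must be positive")
    return queued_bytes / drain_bytes_per_s


def peer_declared_unreachable(queued_bytes: int, drain_bytes_per_s: float) -> bool:
    """True when the modelled delay exceeds the unreachability threshold."""
    return heartbeat_delay_s(queued_bytes, drain_bytes_per_s) > UNREACHABLE_AFTER_S
```

Under this model, 700 MB queued ahead of a heartbeat on a stream draining at 100 MB/s gives a 7 second delay, which is enough to cross the 6 second threshold.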
  53
  54 + Testing consequence
  55
  56 Three test cases are failing due to `Bug 8318 <https://bugs.opendaylight.org/show_bug.cgi?id=8318>`__.
  57
  58 + How to fix
  59
  60 Possibly, a different Akka configuration could be applied to separate Akka cluster status messages
  61 into a different TCP stream than the ordinary data stream.
  62
  63 Otherwise, a contribution to the Akka project would be needed.
  64
  65 Reachability gossip
  66 -------------------
  67
  68 + Information
  69
  70 Akka uses a gossip protocol to advertise one member's reachability to other members.
  71 There is logic which allows for faster detection of unreachable members:
  72 a member can declare its peer unreachable if it got information from another peer
  73 which is considered more up-to-date.
  74
  75 Occasionally, this logic results in undesired behavior. This happens when the supposedly up-to-date peer
  76 has been isolated and is now rejoining. Depending on timing, this can introduce additional leader movement,
  77 or a very brief moment when a member "forgets" RPC registrations from another member.
  78
  79 This is causing bugs `8420 <https://bugs.opendaylight.org/show_bug.cgi?id=8420>`__
  80 and `8430 <https://bugs.opendaylight.org/show_bug.cgi?id=8430>`__.
  81
  82 + Testing consequence
  83
  84 This affects "partition and heal" scenarios in singleton testing.
  85 In functional tests, the failures are infrequent enough to consider the test mostly stable overall,
  86 but the corresponding longevity jobs are failing consistently.
  87
  88 The tests for "partition and heal" scenarios in RPC testing have been changed
  89 to tolerate wrong RPC results for 10 seconds to work around this Akka bug.
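
A tolerance window of this kind could be sketched as follows. This is not the Robot keyword used by the suites; the function name ``wait_for_expected``, the polling interval and the injectable ``clock`` and ``sleep`` parameters are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_for_expected(
    call: Callable[[], T],
    expected: T,
    tolerance_s: float = 10.0,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `call` until it returns `expected`.

    Returns the number of wrong results that were tolerated.
    Raises AssertionError if the result is still wrong once
    `tolerance_s` seconds have elapsed.
    """
    deadline = clock() + tolerance_s
    wrong = 0
    while True:
        result = call()
        if result == expected:
            return wrong
        wrong += 1
        if clock() >= deadline:
            raise AssertionError(f"still got {result!r} after {tolerance_s} s")
        sleep(poll_s)
```

Passing the clock and sleep functions as parameters keeps the helper testable without real waiting; a suite would call it with the defaults.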
  90
  91 + How to fix
  92
  93 This does not seem fixable at the ODL level; a contribution to the Akka project is needed.
  94
  95 Missing features
  96 ~~~~~~~~~~~~~~~~
  97
  98 Cluster yang notifications
  99 --------------------------
 100
 101 + Information
 102
 103 Yang notifications are not delivered to peer members.
 104 `Bug 2139 <https://bugs.opendaylight.org/show_bug.cgi?id=2139>`__
 105 is only fixed for data change notifications, not Yang notifications.
 106
 107 `Bug 2140 <https://bugs.opendaylight.org/show_bug.cgi?id=2140>`__ tracks adding this missing functionality.
 108
 109 + Testing consequence
 110
 111 Notification suites are running on a 1-node setup only.
 112
 113 + How to fix
 114
 115 After the functionality is added, it will be straightforward to add 3node tests.
 116
 117 New features
 118 ~~~~~~~~~~~~
 119
 120 Tell-based protocol
 121 -------------------
 122
 123 + Information
 124
 125 Tell-based protocol is an alternative to ask-based protocol from Boron.
 126 Which protocol to use is decided by a line in a configuration file
 127 (org.opendaylight.controller.cluster.datastore.cfg).
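
As an illustration, a test setup step could switch the protocol by rewriting that line. The sketch below assumes the property is named ``use-tell-based-protocol`` and that the file uses plain ``key=value`` lines with ``#`` comments; both should be checked against the file shipped with the release under test. The helper name ``set_property`` is hypothetical.

```python
def set_property(cfg_text: str, key: str, value: str) -> str:
    """Return cfg_text with `key=value` set.

    An existing line for `key` (active or commented out with '#')
    is replaced; otherwise the setting is appended.
    """
    lines = cfg_text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        # Drop leading comment markers and spaces before comparing keys.
        stripped = line.lstrip("# ").strip()
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Assumed property name; verify against the actual configuration file.
sample = "# comment\n#use-tell-based-protocol=false\n"
updated = set_property(sample, "use-tell-based-protocol", "true")
```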
 128
 129 Some scenarios are expected to fail due to known limitations of ask-based protocol.
 130 More specifically, if a shard leader moves while a transaction is open in ask-based protocol,
 131 the transaction will fail (AskTimeoutException).
 132
 133 This affects only data broker tests, not RPC calls.
 134
 135 + Testing consequence
 136
 137 In principle, this doubles the number of configurations to be tested, but see below.
 138
 139 + How to fix
 140
 141 It is planned for tell-based protocol to become the default setting after Carbon SR2.
 142 After that, tests for ask-based protocol can be converted or removed.
 143
 144 Prefix-based shards
 145 -------------------
 146
 147 + Information
 148
 149 Prefix-based shards are an alternative to module-based shards from Boron.
 150 Prefix-based shards can only be created dynamically (as opposed to being read from a configuration file at startup).
 151 It is possible to use both types of shards, but data writes and reads use a different API,
 152 so any MD-SAL application needs to know which API to use.
 153
 154 The implementation of prefix-based shards is hardwired to tell-based protocol
 155 (even if ask-based protocol is configured as the default).
 156
 157 + Testing consequence
 158
 159 This doubles the number of configurations to be tested, for tests related to data broker (RPCs are unaffected).
 160
 161 + How to fix
 162
 163 ODL contains a great many applications which use APIs for module-based shards.
 164 It is expected that multiple releases will still need both types of test cases.
 165 Module-based shards will be deprecated and removed eventually.
 166
 167 Producer options
 168 ----------------
 169
 170 + Information
 171
 172 Data producers for module-based shards can produce either chained transactions or standalone transactions.
 173 Data producers for prefix-based shards can produce either non-isolated transactions (change notifications
 174 can combine several transactions together) or isolated transactions.
 175
 176 + Testing consequence
 177
 178 In principle, this results in multiple Robot test cases for the same documented scenario case, but see below.
 179
 180 + How to fix
 181
 182 All test cases will be needed in the foreseeable future.
 183 Instead, more negative test cases may need to be added to verify that different options lead to different behavior.
 184
 185 Initial leader placement
 186 ~~~~~~~~~~~~~~~~~~~~~~~~
 187
 188 + Information
 189
 190 Some scenarios do not specify initial locations of relevant shard leaders.
 191 Test results can depend on it in the presence of bugs.
 192
 193 This is mostly relevant to the BGP test, which has three relevant members:
 194 rib owner, default operational shard leader and topology operational shard leader.
 195
 196 + Testing consequence
 197
 198 Two test cases are tested. The two shard leaders are always together; the rib owner is either co-located or not.
 199 This is done by the suite moving shard leaders after detecting the rib owner location.
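
The placement choice can be sketched as a small helper. The member names and the function ``leader_placements`` are hypothetical; the real suite detects the rib owner and moves leaders through its own Robot keywords.

```python
from typing import Dict, List


def leader_placements(members: List[str], rib_owner: str) -> Dict[str, str]:
    """Map each tested placement to the member that should host both shard leaders."""
    if rib_owner not in members:
        raise ValueError(f"unknown rib owner: {rib_owner}")
    others = [m for m in members if m != rib_owner]
    if not others:
        raise ValueError("need at least two members for the separated placement")
    # Both shard leaders always move together; only the target member differs.
    return {"co-located": rib_owner, "separated": others[0]}
```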
 200
 201 + How to fix
 202
 203 Even more placements can be tested when job duration stops being the limiting factor.
 204
 205 Reduced BGP scaling
 206 ~~~~~~~~~~~~~~~~~~~
 207
 208 + Information
 209
 210 Rib owner maintains de-duplicated data structures.
 211 Other members get serialized copies and they do not de-duplicate.
 212
 213 Even a single node struggles to fit into a 6GB heap with tell-based protocol,
 214 see `Bug 8649 <https://bugs.opendaylight.org/show_bug.cgi?id=8649>`__.
 215
 216 + Testing consequence
 217
 218 The scale of reported tests was reduced from 1 million prefixes to 300 thousand prefixes.
 219
 220 + How to fix
 221
 222 Other members should be able to perform de-duplication, but developing that takes effort.
 223
 224 In the meantime, Linux Foundation could be convinced to allow for bigger VMs,
 225 which are currently limited by the available infrastructure.
 226
 227 Increased timeouts
 228 ~~~~~~~~~~~~~~~~~~
 229
 230 RequestTimeoutException
 231 -----------------------
 232
 233 + Information
 234
 235 With tell-based protocol, restconf requests might stay open for up to 120 seconds before returning an error.
 236 Even shard state reads using Jolokia can take a long time if the shard actor is busy processing other messages.
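
For orientation, a shard state read over Jolokia might look like the sketch below. The MBean naming pattern, the port 8181 and the ``RaftState`` attribute are assumptions based on common ODL deployments and should be verified against the actual cluster. The helpers only build the URL and parse a response; the request itself would need credentials and an explicit timeout (for example ``urllib.request.urlopen(url, timeout=...)``), since the read can stall while the shard actor is busy.

```python
import json

# Assumed mapping from datastore name to the MBean "type" value.
_STORE_TYPES = {
    "config": "DistributedConfigDatastore",
    "operational": "DistributedOperationalDatastore",
}


def shard_read_url(host: str, member_index: int, shard_name: str,
                   store: str = "config", port: int = 8181) -> str:
    """Build a Jolokia read URL for one shard replica (assumed naming pattern)."""
    mbean = (
        "org.opendaylight.controller:Category=Shards,"
        f"name=member-{member_index}-shard-{shard_name}-{store},"
        f"type={_STORE_TYPES[store]}"
    )
    return f"http://{host}:{port}/jolokia/read/{mbean}"


def raft_state(response_text: str) -> str:
    """Extract the Raft state from a Jolokia JSON response body."""
    return json.loads(response_text)["value"]["RaftState"]
```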
 237
 238 + Testing consequence
 239
 240 This increases duration for tests which need to verify transaction errors do happen
 241 after sufficiently long isolation. Also, duration is increased if a test fails on a read which is otherwise quick.
 242
 243 + How to fix
 244
 245 This involves a trade-off between stability and responsiveness.
 246 As MD-SAL applications rarely tolerate transaction failures, users would prefer stability.
 247 That means relatively longer timeouts are there to stay, which means test case duration
 248 will stay high in negative (or failing positive) tests.
 249
 250 Client abort timeout
 251 --------------------
 252
 253 + Information
 254
 255 Client abort timeout is currently set to 15 minutes. The operational consequence is
 256 just an inability to start another data producer on a member isolated for that long.
 257 This test would take too long compared to its usefulness.
 258
 259 + Testing consequence
 260
 261 This test case has never been implemented.
 262
 263 Instead, a test with isolation shorter than 120 seconds is implemented;
 264 it verifies that the data producer continues its operation without RequestTimeoutException.
 265
 266 + How to fix
 267
 268 It is straightforward to add the missing test cases when job duration stops being a limiting factor.
 269
 270 No shard shutdown
 271 ~~~~~~~~~~~~~~~~~
 272
 273 + Common information.
 274
 275 There are multiple RPCs offering different "severity" of shard shutdown.
 276 For technical details see comments on `change 58580 <https://git.opendaylight.org/gerrit/58580>`__.
 277
 278 If tests perform rigorous teardown, the shard replica should be re-activated,
 279 which is an operation not every RPC supports.
 280
 281 Listener stability suite
 282 ------------------------
 283
 284 + Information
 285
 286 The current implementation of data listeners relies on a shard replica being active on the member
 287 which is to receive the notification. Until that is improved,
 288 `Bug 8629 <https://bugs.opendaylight.org/show_bug.cgi?id=8629>`__ prevents this scenario
 289 from being tested as described.
 290
 291 + Testing consequence
 292
 293 The suite uses become-leader RPC instead. This has the added benefit of the test case being able to pick which member
 294 is to become the new leader (adding one more test case when the old leader was not co-located with the listener).
 295
 296 Also, no teardown step is needed, as the final cluster state is not missing any shard replica.
 297
 298 + How to fix
 299
 300 The original test can be implemented when listener implementation changes.
 301 But the test which uses become-leader might be better overall.
 302
 303 Clean leader shutdown suite
 304 ---------------------------
 305
 306 + Information
 307
 308 Some implementations of shutdown RPCs have a side effect of also shutting down the shard state notifier.
 309 For details see `Bug 8794 <https://bugs.opendaylight.org/show_bug.cgi?id=8794>`__.
 310
 311 The remove-shard-replica RPC does not have this downside, but it changes shard configuration,
 312 which was not intended by the original scenario definition.
 313
 314 + Testing consequence
 315
 316 Test cases for this scenario were switched to use remove-shard-replica.
 317
 318 + How to fix
 319
 320 There is an open debate on whether a "shard shutdown" RPC with fewer operations (compared to remove-shard-replica)
 321 is something users want and should be given access to.
 322
 323 If yes, tests can be switched to such an RPC, assuming the shard notifier issue is also fixed.
 324
 325 Hard reboots between test cases
 326 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 327
 328 + Information
 329
 330 Timing errors in Robot code lead to Robot being unable to restore original state without restarts.
 331
 332 During development, we started without any hard reboots, and that exposed bugs in teardown steps of scenarios.
 333 But test independence was more important at that time, so current tests are less sensitive to teardown failures.
 334
 335 + Testing consequence
 336
 337 Each ODL reboot takes around 115 seconds, and this time is added to the running time of every test case.
 338 Together with increased timeouts, this motivates leaving out some test cases to allow faster change verification.
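
To get a feel for the cost, the reboot overhead can be computed directly. The test case count in the usage note is hypothetical; only the figure of roughly 115 seconds comes from the text above.

```python
REBOOT_S = 115  # approximate cost of one ODL reboot, as quoted above


def reboot_overhead(test_cases: int, reboot_s: int = REBOOT_S) -> str:
    """Total time spent on reboots when every test case triggers one."""
    total = test_cases * reboot_s
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"
```

For a hypothetical job with 60 test cases, ``reboot_overhead(60)`` gives ``1h 55m 00s`` spent on reboots alone.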
 339
 340 + How to fix
 341
 342 Ideally, we would want both jobs with hard resets and jobs without them.
 343 The jobs without resets can be added gradually after splitting the current single job.
 344
 345 Isolation mechanics
 346 ~~~~~~~~~~~~~~~~~~~
 347
 348 + Information
 349
 350 During development, it was found that freeze and kill mechanics affect the co-located Java test driver
 351 without exposing any new bugs.
 352
 353 It turns out AAA functionality attempts to read from the datastore, so an isolated member returns HTTP status code 401.
 354
 355 + Testing consequence
 356
 357 Only iptables filtering is used in order to reduce test job duration.
 358
 359 Isolated members are never queried directly. A leader member is considered isolated
 360 when other members elect a new leader. A member is considered rejoined
 361 when it responds reporting itself as a follower.
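
The two detection rules above can be written down as predicates. This is a sketch under assumptions: the member names, the shape of the ``leader_seen_by`` mapping and the ``"Follower"`` state string are illustrative, and the real suites gather this information through their own keywords.

```python
from typing import Dict, Optional


def leader_isolated(old_leader: str,
                    leader_seen_by: Dict[str, Optional[str]]) -> bool:
    """Decide isolation without querying the isolated member itself.

    `leader_seen_by` maps each queried member to the leader it reports
    (None if it reports no leader). The old leader counts as isolated
    once all other members agree on a different leader.
    """
    others = {m: seen for m, seen in leader_seen_by.items() if m != old_leader}
    if not others:
        return False
    leaders = set(others.values())
    return len(leaders) == 1 and None not in leaders and old_leader not in leaders


def member_rejoined(reported_state: Optional[str]) -> bool:
    """A member counts as rejoined when it responds and reports itself a follower."""
    return reported_state == "Follower"
```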
 362
 363 + How to fix
 364
 365 It is straightforward to add test cases for kill and freeze where appropriate,
 366 but once again this can be done gradually when job duration is not a limiting factor.
 367
 368 Reduced number of combinations
 369 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 370
 371 + Information
 372
 373 Prefix-based shards always use tell-based protocol, so suites which test them
 374 with ask-based protocol configuration can be skipped.
 375
 376 Ask-based protocol is known to fail on AskTimeoutException on leader movement,
 377 so suites which produce transactions constantly can be skipped.
 378
 379 Most test cases are not sensitive to data producer options.
 380
 381 + Testing consequence
 382
 383 BGP tests and singleton tests use module-based shards only, both protocols.
 384 Other suites related to data broker are testing only tell-based protocol, both shard types.
 385 Netconf tests and RPC tests use module-based shards with ask-based protocol only.
 386 Only the client isolation suite tests different producer options.
 387
 388 + How to fix
 389
 390 More tests can be added gradually (see above).
 391
 392 Possibly, not every combination is worth the duration it takes,
 393 but that could be alleviated if Linux Foundation infrastructure grows in size significantly.
 394
 395 Reduced performance
 396 ~~~~~~~~~~~~~~~~~~~
 397
 398 + Information
 399
 400 In order to reduce test job duration, suites wait for minimal functionality (jolokia reporting shards are in sync)
 401 after restarting ODL. That means unrelated karaf features might still be installing
 402 while the test is in progress. This should not affect functional tests, but it can reduce the performance observed.
 403
 404 The only suite observing a strong enough performance impact is `chasing the leader`_.
 405
 406 + Testing consequence
 407
 408 Functional tests for the `chasing the leader`_ suite tolerate frequencies higher than 50 unregistrations per second.
 409 The longevity suite still requires the full 100 unregistrations per second.
 410
 411 + How to fix
 412
 413 The suite can wait for a better symptom of ODL being ready, for example by requiring CPU usage to become less
 414 than a chosen threshold.
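
One possible shape of such a readiness check is sketched below. The threshold, the number of consecutive quiet samples and the ``sample_cpu`` callable are all assumptions; how CPU usage is actually sampled on the ODL VM is left open.

```python
import time
from typing import Callable


def wait_until_idle(
    sample_cpu: Callable[[], float],
    threshold: float = 0.2,
    needed: int = 3,
    max_samples: int = 120,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wait until `needed` consecutive CPU samples fall below `threshold`.

    Returns the number of samples taken. Raises TimeoutError if the
    system never stays quiet within `max_samples` samples.
    """
    streak = 0
    for taken in range(1, max_samples + 1):
        if sample_cpu() < threshold:
            streak += 1
            if streak >= needed:
                return taken
        else:
            streak = 0  # a busy sample resets the quiet streak
        sleep(poll_s)
    raise TimeoutError(f"CPU usage did not settle below {threshold}")
```

Requiring several consecutive quiet samples avoids declaring ODL ready during a short pause between feature installations.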
 415
 416 Missing logs
 417 ~~~~~~~~~~~~
 418
 419 + Information
 420
 421 Robot VM has only 2GB of RAM and longevity jobs tend to produce large output.xml files.
 422
 423 Occasionally, a job can create karaf.log files so large they fail to download,
 424 in extreme cases filling ODL VM disk and causing failures.
 425
 426 This affects mostly longevity jobs (and runs with verbose logging) if they pass.
 427
 428 + Testing consequence
 429
 430 Robot data stored is reduced to avoid this issue, sometimes leading to fewer details being available.
 431 This issue is still not fully resolved, so occasionally the Robot log or karaf log is still missing
 432 if the job in question fails in an unexpected way.
 433
 434 + How to fix
 435
 436 It is possible for Robot test to put additional data into separate files.
 437 Unnecessarily verbose logs could be fixed where needed.
 438
 439 As this limitation only hurts with newly occurring bugs, it is not really possible to avoid it entirely.
 440
 441 Weekend outages
 442 ~~~~~~~~~~~~~~~
 443
 444 + Information
 445
 446 The Linux Foundation infrastructure team occasionally needs to perform changes which affect running jobs.
 447 To reduce this impact, such changes are usually done over the weekend.
 448
 449 Cluster testing currently contains seven longevity jobs which block resources for 23 hours.
 450 As that is a significant portion of available resources, the longevity jobs are only run on weekends,
 451 when the impact on the frequency of other jobs is less critical.
 452
 453 + Testing consequence
 454
 455 Sometimes, the longevity jobs are affected by infrastructure team activities,
 456 leading to lost results or spurious failures.
 457 One such symptom is tracked as `Bug 8959 <https://bugs.opendaylight.org/show_bug.cgi?id=8959>`__.
 458
 459 + How to fix
 460
 461 It might be possible to spread longevity jobs over work days. As distributing jobs manually
 462 is not a scalable option, considerable work would be needed to automate the distribution.
 463
 464 Infrastructure changes are not very frequent, and having jobs run at the same predictable time
 465 is convenient from a reporting point of view, so perhaps it is okay to keep the current setup.
 466
 467 .. _`chasing the leader`: scenarios.html#chasing-the-leader