Bug 6540: EOS - Rework behavior of onPeerDown 29/45129/5
author Tom Pantelis <tpanteli@brocade.com>
Sun, 4 Sep 2016 01:08:41 +0000 (21:08 -0400)
committer Tom Pantelis <tpanteli@brocade.com>
Wed, 7 Sep 2016 13:28:39 +0000 (13:28 +0000)
commit 8119659681a6814d257314178e759a6ef1b49766
tree 2d9de9fef08951280f8531405932208fa61741f3
parent 5636554dc6180c4a6aee6d4423a7f0a1ed30d9e2
Bug 6540: EOS - Rework behavior of onPeerDown

https://git.opendaylight.org/gerrit/#/c/26808/ modified the behavior of
onPeerDown to remove all the down node's candidates. However, this behavior is
problematic when the shard leader is isolated. The majority partition
elects a new leader, which temporarily results in split-brain: two leaders
that independently attempt to remove the other side's candidates. When the partition
is healed, all hell breaks loose as they try to reconcile their differences. This is
compounded with the singleton service because it uses 2 entities that are related
to one another.

To alleviate this, I reverted to the behavior of selecting a new owner for
the entities owned by the down node and leaving the down node as a candidate.
If the down node is the only candidate, it is left as the owner.
This doesn't hurt anything and avoids complications with having to reinstate the
down node as owner when it rejoins if it was actually isolated. The idea here is
to keep its candidacy to minimize disruption until proven otherwise, since we don't
know if the downed node's process is actually still alive. If another node registers
a candidate, it will replace the down node as the owner.
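
The owner selection described above can be sketched roughly as follows. This is a
simplified, hypothetical model and not the EntityOwnershipShard code: the class
PeerDownSketch, its maps and its method signatures are assumptions made for
illustration, and plain strings stand in for entities and member names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the ownership bookkeeping (illustration only).
final class PeerDownSketch {
    // Entity name -> current owner.
    final Map<String, String> owners = new LinkedHashMap<>();
    // Entity name -> registered candidates, in registration order.
    final Map<String, List<String>> candidates = new LinkedHashMap<>();
    // Members currently considered down.
    final Set<String> downMembers = new HashSet<>();

    void registerCandidate(String entity, String member) {
        List<String> list = candidates.computeIfAbsent(entity, k -> new ArrayList<>());
        if (!list.contains(member)) {
            list.add(member);
        }
        // A new candidate replaces a missing owner or an owner that is down.
        String owner = owners.get(entity);
        if (owner == null || downMembers.contains(owner)) {
            owners.put(entity, member);
        }
    }

    void onPeerDown(String downMember) {
        downMembers.add(downMember);
        for (Map.Entry<String, String> entry : owners.entrySet()) {
            if (!downMember.equals(entry.getValue())) {
                continue; // entity is not owned by the down node
            }
            // Pick the first candidate that is still up. The down node stays in
            // the candidate list, and stays owner if nobody else is available.
            for (String candidate : candidates.get(entry.getKey())) {
                if (!downMembers.contains(candidate)) {
                    entry.setValue(candidate);
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        PeerDownSketch shard = new PeerDownSketch();
        shard.registerCandidate("e1", "member-1");
        shard.registerCandidate("e1", "member-2");
        shard.registerCandidate("e2", "member-1");
        shard.onPeerDown("member-1");
        System.out.println(shard.owners); // prints {e1=member-2, e2=member-1}
    }
}
```

In this sketch member-1 keeps ownership of e2 because it is the only candidate, and
a later registerCandidate("e2", "member-3") call would take ownership over.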

To handle the case where the down node actually restarted, the node sends a
RemoveAllCandidates message to the leader when it first hears from the leader after
startup, which removes it as a candidate from all entities. This cleans out stale
candidates should no local client register a candidate in the new incarnation.
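
A minimal sketch of that restart cleanup, under stated assumptions: the nested
classes below are hypothetical stand-ins (only the RemoveAllCandidates name comes
from this change), and a direct method call replaces the actor message send.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the restart cleanup (illustration only).
final class RestartCleanupSketch {
    /** Stand-in for the new message; the real class may carry different fields. */
    static final class RemoveAllCandidates {
        final String memberName;

        RemoveAllCandidates(String memberName) {
            this.memberName = memberName;
        }
    }

    /** Leader side: entity name -> candidate list. */
    static final class Leader {
        final Map<String, List<String>> candidates = new HashMap<>();

        void handle(RemoveAllCandidates message) {
            // Drop the sender from every entity's candidate list.
            for (List<String> list : candidates.values()) {
                list.remove(message.memberName);
            }
        }
    }

    /** Restarted member: sends the message only on first contact with the leader. */
    static final class Member {
        final String name;
        private boolean cleanupPending = true;

        Member(String name) {
            this.name = name;
        }

        void onLeaderKnown(Leader leader) {
            if (cleanupPending) {
                cleanupPending = false;
                leader.handle(new RemoveAllCandidates(name));
            }
        }
    }

    public static void main(String[] args) {
        Leader leader = new Leader();
        leader.candidates.put("e1", new ArrayList<>(Arrays.asList("member-1", "member-2")));
        Member restarted = new Member("member-1");
        restarted.onLeaderKnown(leader);
        System.out.println(leader.candidates); // prints {e1=[member-2]}
    }
}
```

The one-shot flag matters here: candidates registered by local clients after the
first contact are not removed by later leader updates.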

The unit tests revealed an orthogonal issue with the PreLeader state. The PreLeader
switches to Leader when the commit index is up to date but before applying the entries
to the state. However, the EOS may commit modifications immediately, before the
ApplyState message for prior entries is received. This can result in the
"Store tree X and candidate base Y differ" exception. So I modified the PreLeader
behavior to switch to Leader when the last applied index is up to date. This makes
sense because the PreLeader behavior is intended to protect the state from
inconsistencies.
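
The difference between the two transition rules can be illustrated with a small
sketch. It assumes that "up to date" means the index has reached the last index in
the log; the class, enum and method names are hypothetical and are not taken from
PreLeader.java.

```java
// Hypothetical comparison of the old and new PreLeader transition rules.
final class PreLeaderSketch {
    enum State { PRE_LEADER, LEADER }

    /** Old rule: switch as soon as the commit index covers the log. */
    static State oldRule(long commitIndex, long lastLogIndex) {
        return commitIndex >= lastLogIndex ? State.LEADER : State.PRE_LEADER;
    }

    /** New rule: wait until every entry has also been applied to the state. */
    static State newRule(long lastApplied, long lastLogIndex) {
        return lastApplied >= lastLogIndex ? State.LEADER : State.PRE_LEADER;
    }

    public static void main(String[] args) {
        long lastLogIndex = 5;
        long commitIndex = 5;  // everything is committed...
        long lastApplied = 3;  // ...but entries 4 and 5 are not applied yet
        System.out.println(oldRule(commitIndex, lastLogIndex)); // prints LEADER
        System.out.println(newRule(lastApplied, lastLogIndex)); // prints PRE_LEADER
    }
}
```

With the old rule, the window between commit and apply is exactly where a new
modification could be prepared against a state tree that is missing prior entries.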

I also fixed a couple of bugs where downPeerMemberNames was accessed with a String
rather than a MemberName instance. This was a remnant of changing downPeerMemberNames
to store MemberName.
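
This kind of bug compiles cleanly because Collection.contains accepts any Object.
The sketch below shows the effect with a hypothetical stand-in for MemberName (the
nested class here is an assumption for illustration, not the controller's class).

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Illustration of looking up a typed set with a String key.
final class MemberNameLookupSketch {
    /** Hypothetical stand-in for the real MemberName type. */
    static final class MemberName {
        private final String name;

        private MemberName(String name) {
            this.name = Objects.requireNonNull(name);
        }

        static MemberName forName(String name) {
            return new MemberName(name);
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof MemberName && name.equals(((MemberName) obj).name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    static boolean containsByString(Set<MemberName> set, String name) {
        Object key = name; // what the buggy call effectively passed
        return set.contains(key);
    }

    static boolean containsByMemberName(Set<MemberName> set, String name) {
        return set.contains(MemberName.forName(name));
    }

    public static void main(String[] args) {
        Set<MemberName> downPeerMemberNames = new HashSet<>();
        downPeerMemberNames.add(MemberName.forName("member-1"));
        System.out.println(containsByString(downPeerMemberNames, "member-1"));     // prints false
        System.out.println(containsByMemberName(downPeerMemberNames, "member-1")); // prints true
    }
}
```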

Change-Id: I326660c172353539146a2216cc8a70a4b842affe
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
opendaylight/md-sal/sal-akka-raft/src/main/java/org/opendaylight/controller/cluster/raft/RaftActor.java
opendaylight/md-sal/sal-akka-raft/src/main/java/org/opendaylight/controller/cluster/raft/behaviors/Candidate.java
opendaylight/md-sal/sal-akka-raft/src/main/java/org/opendaylight/controller/cluster/raft/behaviors/PreLeader.java
opendaylight/md-sal/sal-akka-raft/src/test/java/org/opendaylight/controller/cluster/raft/behaviors/CandidateTest.java
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/entityownership/EntityOwnershipShard.java
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/entityownership/EntityOwnershipShardCommitCoordinator.java
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/entityownership/messages/RemoveAllCandidates.java [new file with mode: 0644]
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/ShardTest.java
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/entityownership/DistributedEntityOwnershipIntegrationTest.java
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/entityownership/EntityOwnershipShardTest.java