BUG-9145: rework singleton service group state tracking
The beef of the issue here is that there are multiple codepaths
which end up sharing state transitions -- hence we could erroneously
end up waiting for entity release, when if fact we have already
lost it.
Rather than going the explicit FSM route, which would require quite
a few more states, rework state tracking such that each event is
evaluated against current state. This has the nice feature of making
duplicate events idempotent, as the code is all about reconciling
current and intended state.
This also disentangles the loss of ownership codepaths, which means
we can optimize behavior when partitions are involved -- specifically
jeopardy service entity ownership does not result in service shutdown,
because service liveness is actually guaranteed by cleanup entity.
Hence we can keep the services and the cleanup entity intact, leading
to less churn during partitions.
Should the cleanup entity report jeopardy, we need to bring the services
down, but we do not need to unregister the cleanup entity -- if the
partition is healed, we can start services again without the delay
incurred by EOS. If we end up losing partition healing, cleanup entity
ownership will be taken from us, at which point we also unregister.
Change-Id: Id5c8a97722b9a21114135554d92c17898dd0150e
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>