From: Robert Varga
Date: Wed, 13 Apr 2016 13:28:29 +0000 (+0200)
Subject: BUG-5280: add asciidoc outline of CDS implementation
X-Git-Tag: release/boron~216
X-Git-Url: https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=commitdiff_plain;h=69f4d2b0c8f89b715cad50e378a79f940fb11962

BUG-5280: add asciidoc outline of CDS implementation

Adds the basic document outlining the requirements placed on the data store,
its architecture and the basic tradeoffs taken. This is not a final revision,
but rather an initial cut at it. It is expected that this document will evolve
further and will remain updated as the design/implementation changes.

Change-Id: I77de9971beaf8303a7ae0e0d1fc0d4d86ae64e5c
Signed-off-by: Robert Varga
---

diff --git a/opendaylight/md-sal/sal-distributed-datastore/src/site/asciidoc/distributed-data-store.adoc b/opendaylight/md-sal/sal-distributed-datastore/src/site/asciidoc/distributed-data-store.adoc
new file mode 100644
index 0000000000..5120e728ec
--- /dev/null
+++ b/opendaylight/md-sal/sal-distributed-datastore/src/site/asciidoc/distributed-data-store.adoc
@@ -0,0 +1,278 @@

= Distributed Data Store implementation

[NOTE]
.This is an evolving document
====
This document is in no sense complete; it describes the intended implementation.
It is subject to revision as feedback is received and as the implementation evolves.
====

== Overview

The implementation builds on `sal-akka-raft` and adds the business logic needed to
create a strongly-consistent, replicated `DataTree`. It does this by introducing
a finite state machine which interacts with local and remote clients, and replicates
its state using RAFT. The FSM is built around an internal `InMemoryDataTree` instance,
which holds the actual data being replicated and takes care of performing
individual operations (bundled in transactions) on the data.

It exposes the MD-SAL `DOMDataBroker` interface, which individual applications use,
either directly or via the Binding Adaptor, to perform operations.

The implementation is split into two major blocks:

Backend::
  The actual implementation of the replicated data tree. It is pivoted
  around Shard, a subclass of `RaftActor`, and provides a set of messages for the
  *Frontend* to submit work to the Leader, which is then replicated to Followers.

Frontend::
  The MD-SAL layer, which is co-located with the application accessing
  data stored in the data store. It is responsible for implementing MD-SAL
  `DOMDataBroker` and related interfaces and translating them into messages which
  drive the Backend. It is also responsible for dealing with common failures occurring
  in communication with the Backend, such as messages getting lost and the Shard
  Leader moving within the system, to the extent reasonably possible. In the current
  implementation this maps to a ConcurrentDataBroker instance.

Both Backend and Frontend share a common Actor System on the local node. Nodes are
connected together using Akka Clustering. There can be at most one Frontend instance
running in an Actor System. For each individual shard, there can be at most one
instance running in an Actor System.
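For orientation, the sketch below shows how an application might exercise the
Frontend through `DOMDataBroker`. It assumes the Boron-era MD-SAL DOM APIs
(`DOMDataWriteTransaction`, `LogicalDatastoreType`) and merely illustrates the
calling pattern; it is not part of the design itself.

[source,java]
----
// Minimal sketch only: assumes the Boron-era MD-SAL DOM broker APIs.
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.dom.api.DOMDataBroker;
import org.opendaylight.controller.md.sal.dom.api.DOMDataWriteTransaction;
import org.opendaylight.yangtools.yang.data.api.YangInstanceIdentifier;
import org.opendaylight.yangtools.yang.data.api.schema.NormalizedNode;

final class DatastoreClientSketch {
    // The broker is the Frontend entry point; applications normally receive it
    // via dependency injection rather than constructing it themselves.
    private final DOMDataBroker broker;

    DatastoreClientSketch(final DOMDataBroker broker) {
        this.broker = broker;
    }

    void write(final YangInstanceIdentifier path, final NormalizedNode<?, ?> data) {
        final DOMDataWriteTransaction tx = broker.newWriteOnlyTransaction();
        // put() is applied against the data tree; with a co-located Leader the
        // design aims to validate it in the calling thread, without messaging.
        tx.put(LogicalDatastoreType.OPERATIONAL, path, data);
        // submit() hands the transaction to the Backend, which replicates it via
        // RAFT; real code would register a callback on the returned future.
        tx.submit();
    }
}
----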
== Concepts

Shard Name::
  The logical name of a part of the Conceptual Data Tree. It is
  associated with a Shard instance. Each member can run at most one such instance
  for a particular name. In the current implementation it is a String, but that may
  change to a more structured identifier (such as YangInstanceIdentifier) in the future.

Transaction::
  A set of operations on the data tree which are treated as an
  atomic unit. All operations in a transaction share fate, i.e. they either abort or
  complete successfully together.

Member Name::
  The name of a logical node participating in Akka Clustering; it is
  synonymous with an Actor System instance. For deployment purposes it is a simple
  String which uniquely identifies an OpenDaylight instance. This name does not
  change for as long as the instance exists. From a deployment perspective, a member
  name can be changed only by removing that member and re-adding it. Any uncommitted
  transactions from a removed member will be lost.

Global History::
  The aggregated history of all transactions successfully
  committed in the datastore. It is resilient to the concurrent failure of N members
  if the total number of members is (2N + 1).

Local History::
  A generalization of the concept behind a `TransactionChain`. It
  includes all transactions committed in the *Global History* at the point when
  it was created and any transactions submitted in the `TransactionChain`. A Local
  History is always tied to a single Frontend incarnation and shares its fate.

[NOTE]
.When does a Local History get updated with Global History so that it reflects transactions committed from other Local Histories?
====
TransactionChain implementations we have today deal with this by rebasing their
Local History once they have no outstanding transactions (i.e. once they have been
fully merged into Global History). This is problematic under load: a
TransactionChain which is too busy never runs out of outstanding transactions and
keeps retaining the view of Global History as it existed when the TransactionChain
was created -- preventing old state from ever being garbage-collected.
====

== Client/Leader locality

Akka Clustering provides location independence for actor message exchanges, and
the data store could be implemented using a single messaging pattern. This would
(and in fact did) result in an implementation which performs reasonably well.
Unfortunately it also prevents us from using features available in InMemoryDataTree
which allow for a better split of responsibilities between the Frontend and
the Backend, by performing many preparatory tasks in the Frontend thread without
the need for message passing.

Eliminating messaging provides a significant improvement in operation throughput
and also in application developer experience, as invalid operations are detected
immediately in the calling thread rather than being reported only when an attempt
to commit the transaction is made.

With this in mind, the Frontend/Backend interactions follow two distinct patterns:
one used when the Leader is known to be local, and the other used when its
locality is either uncertain or known to be remote.

== Failure modes

As noted in the overview section, all Frontend and Backend instances operate in
a single Akka Cluster, which is logically composed of per-member actor systems.

This setup results in multiple distinct failure modes, which need to be analyzed
separately for the ways in which they are detected, their effect on both Frontend
and Backend state, and the recovery strategies taken.

=== Shard Leader Failure

A Shard failure occurs when a Shard actor instance is stopped while it is in
the RAFT leader state, isolated or not. This failure is detected by the Actor
System, which will restart the actor. Upon restart, the Shard starts in the RAFT
follower state, unable to process Frontend transactions as it could before. Both
Frontend and Backend actors can observe message loss.
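As a rough illustration of the "detected by the Actor System, which will restart
the actor" behaviour, the sketch below shows a parent actor installing a
restart-on-failure supervision strategy over a child. This is a generic Akka
2.4-era Java API example; the actor names are hypothetical and this is not the
actual Shard supervision configuration.

[source,java]
----
// Illustration only: generic Akka 2.4-era supervision, not the real Shard setup.
import akka.actor.ActorRef;
import akka.actor.OneForOneStrategy;
import akka.actor.Props;
import akka.actor.SupervisorStrategy;
import akka.actor.UntypedActor;
import akka.japi.pf.DeciderBuilder;
import scala.concurrent.duration.Duration;

final class ShardManagerSketch extends UntypedActor {
    // Restart a failed child; a restarted Shard would come back as a RAFT
    // follower and have to participate in a new election.
    private static final SupervisorStrategy STRATEGY = new OneForOneStrategy(
        10, Duration.create("1 minute"),
        DeciderBuilder.match(Exception.class, e -> SupervisorStrategy.restart()).build());

    private final ActorRef shard =
        getContext().actorOf(Props.create(ShardSketch.class), "shard");

    @Override
    public SupervisorStrategy supervisorStrategy() {
        return STRATEGY;
    }

    @Override
    public void onReceive(final Object message) {
        // Forward all work to the supervised child.
        shard.forward(message, getContext());
    }

    // Hypothetical stand-in for the real Shard actor.
    static final class ShardSketch extends UntypedActor {
        @Override
        public void onReceive(final Object message) {
            // RAFT and data tree logic would live here.
        }
    }
}
----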
=== Loss of Actor System connectivity

This failure mode occurs when connectivity between nodes is lost. It is primarily
detected by message timeouts, where messages to and from remote actors fail
to arrive. At the logic level this failure cannot be discerned from the remote
actors being too busy to process the request within the deadline. When a timeout
occurs, the party which sent the request does not know whether the request has been
processed, nor whether that request will be executed at some point in the future.

=== Shard Leader component actor failure

The Shard implementation makes use of child actors to allow horizontal scaling,
effectively creating a thread pool. Some of these actors track Frontend state
without persisting it, hence the loss of such an actor means that we have lost
some state the Frontend expects us to have. The failure is detected by the Shard
actor.

== Dealing with failure modes

=== Backend

The Backend deals with these by running RAFT, which uses heartbeats and reacts
accordingly by electing a new leader if necessary. A leader which cannot communicate
with its followers is prevented from making forward progress. State is reconciled
when the partition heals. This is possible because each state transition in
RAFT has a unique identifier (the journal index).

Shard Leader component actor failures are dealt with in a fail-fast manner: a
tombstone is created for the Frontend component corresponding to the failed actor.
The failure cause is recorded in the tombstone, and all subsequent Frontend messages
towards that actor are immediately rejected, stating the failure cause.

=== Frontend

The Frontend needs to deal with this mode by eliminating unknowns to the extent
reasonably possible, or by propagating any unresolved unknowns to the user
application.

In order to achieve this, all state transitions need to have a globally-unique
identifier and the Backend needs to track which transitions have been acted on.
This way all state transitions can be made idempotent, i.e. the Backend will ignore
any state transition if its internal state indicates the transition has already been
performed.

Idempotent transitions allow the Frontend to re-send them until it receives
an authoritative reply from the Backend -- either confirming the state transition
or rejecting it -- hence eliminating the unknown and allowing the Frontend to
continue operating on well-defined state.
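A minimal sketch of the idempotency rule just described: the Backend remembers
which transition identifiers it has already applied and replies with the recorded
outcome when a retransmission arrives. The class and method names here are
hypothetical and only illustrate the rule.

[source,java]
----
// Sketch only: hypothetical tracker illustrating idempotent state transitions.
// I is the transition identifier type, O the authoritative outcome type.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class IdempotentTransitionTracker<I, O> {
    // Remembers the authoritative outcome of every transition already applied.
    private final Map<I, O> applied = new HashMap<>();

    O applyOnce(final I transitionId, final Function<I, O> transition) {
        // computeIfAbsent() executes the transition only the first time this
        // identifier is seen; a retransmission just receives the recorded
        // outcome, so the Frontend can safely re-send until it gets a reply.
        return applied.computeIfAbsent(transitionId, transition);
    }
}
----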
== Identifying state transitions

Transition identifiers form a namespace from which each transition needs to get
a unique identifier. This makes allocation a performance-critical part of
operation, and it needs to scale.

Transitions are always initiated from the Frontend, hence the responsibility
for allocating and retiring each transition identifier lies ultimately with
the Frontend. This matter is slightly complicated by the fact that the Backend
needs to track transitions efficiently, hence we will define additional rules
for when particular transition identifiers are implicitly retired based on events
observed by the Backend.

Due to the scaling requirement, the Transition Identifier namespace is hierarchical,
where each level in the hierarchy is treated as a separate namespace, lending
itself to delegation.

The first level in the hierarchy is the Member Name where the originating Frontend
resides. Since Member Names are unique within a cluster, this allows allocation
and retirement to be tied to member lifecycle. Allocation and retirement do not
need to be communicated across members. If a member is removed from a cluster,
all its Transition Identifiers are immediately retired. A member may not reuse
a transition identifier it has used for communication until it has confirmed
that the Backend has acknowledged its implicit or explicit retirement.

The second level in the hierarchy is the Frontend Generation number. It is used
to distinguish Frontend instances on the same member across Frontend restarts.
While we could use Akka Persistence or similar to persist Frontend state, all of
that state is inherently transient and high-churn, so persisting it would lead to
degraded performance. Every time a Frontend instance is started it is assigned a
unique generation number. There are multiple mechanisms by which such a number
could be obtained, such as UUID generation, but it is advantageous to have this
number increase monotonically. For this reason we will use a locally-persisted
number, which is incremented (and persisted once) every time a Frontend instance
is started. We will use an unsigned 64-bit number, stored as a simple long, but
treated as unsigned by using the Long.*Unsigned*() methods to interact with it.

These two levels are combined into a FrontendIdentifier and embedded in every
message the Frontend sends to the Backend.

The third level in the hierarchy is the Local History identifier, which identifies
a particular Local History managed by the Frontend. This is a 64-bit unsigned
integer, which is unique for a particular Frontend Generation. It is monotonically
increasing from 1. Zero is treated as a 'no Local History' marker and is made
special only to support free-standing transactions without the need to define
a set of dedicated messages.

The fourth level is the transaction number. This 64-bit number uniquely identifies
a particular transaction within a Local History. It is also monotonically incremented
for each transaction created in the Local History.

These two levels are combined into a LocalTransactionIdentifier, which uniquely
identifies any transaction originating from a Frontend instance. When combined
with a FrontendIdentifier, it also uniquely identifies any transaction that has
ever executed in the cluster, forming a GlobalTransactionIdentifier.
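The sketch below shows one possible shape of this identifier hierarchy, including
the treatment of generation and counter values as unsigned 64-bit numbers via
`Long.toUnsignedString()` and `Long.compareUnsigned()`. The field layout and class
bodies are illustrative only; the actual identifier classes may differ, and a
GlobalTransactionIdentifier would simply pair the two classes shown.

[source,java]
----
// Illustrative sketch of the identifier hierarchy; not the real classes.
// Unsigned 64-bit semantics come from the Long.*Unsigned*() methods.
import java.io.Serializable;

final class FrontendIdentifier implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String memberName;   // first level: unique within the cluster
    private final long generation;     // second level: unsigned, bumped on every restart

    FrontendIdentifier(final String memberName, final long generation) {
        this.memberName = memberName;
        this.generation = generation;
    }

    @Override
    public String toString() {
        return memberName + "-gen-" + Long.toUnsignedString(generation);
    }
}

final class LocalTransactionIdentifier implements Serializable {
    private static final long serialVersionUID = 1L;
    private final long historyId;      // third level: 0 marks a free-standing transaction
    private final long transactionId;  // fourth level: monotonic within the history

    LocalTransactionIdentifier(final long historyId, final long transactionId) {
        this.historyId = historyId;
        this.transactionId = transactionId;
    }

    boolean isFreeStanding() {
        return historyId == 0;
    }

    int compareTransaction(final LocalTransactionIdentifier other) {
        // Unsigned comparison keeps ordering correct over the full 64-bit range.
        return Long.compareUnsigned(transactionId, other.transactionId);
    }
}
----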
== Frontend persistence

As noted above, the implementation of the Frontend Generation requires each member
to persistently store a single 64-bit counter. To keep the persistence interfaces
within a member consistent, the Frontend uses Akka Persistence to maintain this
counter across restarts.

State for running transactions is kept in the Frontend until the Backend confirms
it as completed. If an inconsistency occurs (due to messages getting lost or the
Shard Leader failing), the Frontend will retransmit this state to the Backend.

== Backend persistence

The Backend needs to persist and replicate enough state to ensure that Shard
Leader movement does not render Local Histories inoperable, as that has
a direct impact on applications, which would then need a state reconciliation
strategy to deal with this situation. Development of complex strategies cannot
be expected from most applications; their developers will implement
a simple strategy of performing a full state dump into the data store, for
example by flapping a BGP session. If the Shard Leader movement was caused
by the leader experiencing performance problems, such a dump could result
in the new Shard Leader failing for the same reason, which would prevent
the application from making forward progress.

In order to achieve this, the Backend needs to persist both the data stored
by the application and enough metadata about Local Histories and Transactions
to make state recovery across shard leadership movement possible.

=== User Data persistence

The Snapshot needs to identify all transactions which have been integrated
into it. Enumerating all of them is clearly not practical, hence we exploit
the monotonic nature of the identifiers' relationship to Global History. For each
valid Local History we only need to store the last transaction number, as it
implies that all previous transactions from that Local History have been accepted
into Global History. This does not include any Local Histories which have
been retired by the Frontend, as any attempt to access their state would
mean a use-after-free on the Frontend's part. The same holds true of any metadata
about a Frontend generation -- as soon as the Shard Leader receives a message
from a newer generation, it can safely discard any metadata mentioning the
prior generation and consider all its identifiers as retired.

Each DataTreeCandidatePayload entry needs to contain
the GlobalTransactionIdentifier of the transaction which has caused the
delta carried in that entry. This information is used to summarize transaction
metadata when this entry is snapshotted.

=== Internal state persistence

The Backend creates a hierarchy of actors to track interactions with the Frontend.
Information necessary to recreate these actors needs to be persisted within
the journal. This boils down to the existence of Local Histories and the last
transaction committed for a particular history.

Whenever a Local History is created on the Frontend, a corresponding actor is
created on the Shard Leader. Before the leader acknowledges its creation, it
needs to persist a state transition identifying the Local History. Whenever
a Local History is closed down, a corresponding tombstone is stored in the journal.
The last transaction identifier for a local history is tracked via snapshots
and DataTreeCandidatePayloads.

In order to minimize overhead on Followers, the actor hierarchy is not created
until the actor transitions to the Shard Leader state. It tracks only the
information required.
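To illustrate the metadata rules above, the following sketch keeps the last
transaction number per Local History and discards all metadata belonging to a
Frontend generation once a newer generation is observed. The class and method
names are hypothetical bookkeeping stand-ins, not the actual Shard code.

[source,java]
----
// Hypothetical bookkeeping sketch; not the actual Shard classes. It shows the
// rules only: last transaction number per Local History, wholesale retirement
// of metadata belonging to an older Frontend generation.
import java.util.HashMap;
import java.util.Map;

final class FrontendMetadataSketch {
    private long currentGeneration;                           // unsigned, last generation seen
    private Map<Long, Long> lastTransactionPerHistory = new HashMap<>();

    /** Record a committed transaction carried by a DataTreeCandidatePayload-style entry. */
    void onTransactionCommitted(final long generation, final long historyId, final long transactionId) {
        if (Long.compareUnsigned(generation, currentGeneration) > 0) {
            // A newer Frontend generation implicitly retires all identifiers of
            // the previous one, so its metadata can be discarded wholesale.
            currentGeneration = generation;
            lastTransactionPerHistory = new HashMap<>();
        } else if (Long.compareUnsigned(generation, currentGeneration) < 0) {
            // Identifiers from an older generation are already retired; ignore.
            return;
        }
        // Storing only the last transaction number is sufficient because
        // transaction numbers are monotonic within a Local History.
        lastTransactionPerHistory.put(historyId, transactionId);
    }

    /** Drop a Local History once the Frontend has explicitly retired it. */
    void onHistoryClosed(final long historyId) {
        lastTransactionPerHistory.remove(historyId);
    }

    /** Snapshot view: last transaction number for each live Local History. */
    Map<Long, Long> toSnapshotMetadata() {
        return new HashMap<>(lastTransactionPerHistory);
    }
}
----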