Fix RemoteTransactionContext limiter accounting 57/68757/6
authorRobert Varga <robert.varga@pantheon.tech>
Fri, 9 Feb 2018 15:55:27 +0000 (16:55 +0100)
committerTom Pantelis <tompantelis@gmail.com>
Wed, 28 Feb 2018 16:09:16 +0000 (16:09 +0000)
commit7925d904ffd56c13ddde53e0e7bf6b08b437757d
treeec0eb3c980c5a39bb23f0e3a7d58e08ac05265f9
parent61280eab802fa28338f8f264f958f97c1e16e3ca
Fix RemoteTransactionContext limiter accounting

In case we lose connectivity between the frontend and backend
at the early stages of a big transaction, e.g. after the transaction
is created at the backend and before it is submitted, we can run into
OperationLimiter preventing recovery.

The reason for this is that OperationLimiter itself does not know
how many permits a BatchedModification request contained, hence
on AskTimeoutException it would only decrement permits by one
and the operations would remain throttled. With large transactions
this means the application will suddenly become bogged down
by the OperationLimiter, preventing it from submitting the transaction
or otherwise recovering.

Once any BatchedModifications request fails, the transaction is
doomed anyway, as the message counts on frontend and backend will not
match. Furthermore we must not issue any reads -- the backed does
not have all the modifications, hence it could return an incorrect
result.

Move permit tracking to RemoteTransactionContext, where we can capture
the number of permits in the OnComplete that gets invoked, properly
returning permits which correspond to the BatchedModifications message.
If we have failed to acquire a permit, we also note that and do not
underflow the semaphore.

In case a BatchedModifications message fails, we mark that fact and
turn into a bypass mode: we fail any subsequent reads and do not send
any further BatchedModifications until we see ready being set -- at
which point we coordinate with backend to shoot down the transaction.

An alternative strategy would be to continue transmitting
BatchedModifications, but that would incur an AskTimeout during split,
slowing down the time it takes us to kill flush the doomed transaction
out of the system.

JIRA: CONTROLLER-1814
Change-Id: I919bae0e7173910665e8ec2342d076a710c1c7bf
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/OperationLimiter.java
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/RemoteTransactionContext.java
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/OperationLimiterTest.java [deleted file]
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/RemoteTransactionContextTest.java [new file with mode: 0644]