Fix RemoteTransactionContext limiter accounting 00/68900/2
author Robert Varga <robert.varga@pantheon.tech>
Fri, 9 Feb 2018 15:55:27 +0000 (16:55 +0100)
committer Robert Varga <nite@hq.sk>
Wed, 28 Feb 2018 18:18:30 +0000 (18:18 +0000)
commit 52730e59e2aec76026699911a9799f039a18bba1
tree 6af93cf16cdd128cb406c2150530814168f2ebd0
parent 7ad09c74ae5716c660513554ee6789833d2971ad
Fix RemoteTransactionContext limiter accounting

In case we lose connectivity between the frontend and backend
at the early stages of a big transaction, e.g. after the transaction
is created at the backend and before it is submitted, we can run into
OperationLimiter preventing recovery.

The reason for this is that OperationLimiter itself does not know
how many permits a BatchedModification request contained, hence
on AskTimeoutException it would release only a single permit
and the operations would remain throttled. With large transactions
this means the application will suddenly become bogged down
by the OperationLimiter, preventing it from submitting the transaction
or otherwise recovering.
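
The leak described above can be sketched with a plain
java.util.concurrent.Semaphore. This is an illustration only, not the
actual OperationLimiter code: the class name PermitLeakSketch, the method
leakedPermits and the numbers used are assumptions made for the example.

```java
import java.util.concurrent.Semaphore;

// Hypothetical illustration of the accounting problem; not the real OperationLimiter.
final class PermitLeakSketch {
    // Returns how many permits are still missing from the limiter after a
    // batch of 'batchSize' operations fails and 'releasedOnFailure' permits
    // are handed back.
    static int leakedPermits(int maxPermits, int batchSize, int releasedOnFailure) {
        Semaphore limiter = new Semaphore(maxPermits);

        // Each modification in the batch takes one permit.
        for (int i = 0; i < batchSize; i++) {
            if (!limiter.tryAcquire()) {
                throw new IllegalStateException("limiter exhausted");
            }
        }

        // The failure path returns only what it knows about.
        limiter.release(releasedOnFailure);

        return maxPermits - limiter.availablePermits();
    }
}
```

With 1000 permits and a failed batch of 500 modifications, releasing one
permit leaves 499 permits unaccounted for, while releasing the batch size
leaves none.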

Once any BatchedModifications request fails, the transaction is
doomed anyway, as the message counts on frontend and backend will not
match. Furthermore we must not issue any reads -- the backend does
not have all the modifications, hence it could return an incorrect
result.

Move permit tracking to RemoteTransactionContext, where we can capture
the number of permits in the OnComplete that gets invoked, properly
returning permits which correspond to the BatchedModifications message.
If we have failed to acquire a permit, we also note that and do not
underflow the semaphore.
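
A minimal sketch of this idea follows, assuming a simplified model: the
class PermitTrackingSketch and its methods are hypothetical, a Runnable
stands in for the OnComplete callback, and a non-blocking tryAcquire()
stands in for whatever acquisition strategy the real code uses.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-message permit tracking; not the real RemoteTransactionContext.
final class PermitTrackingSketch {
    private final Semaphore limiter;
    // Permits actually held by operations that are not yet part of a sent batch.
    private int pendingPermits;

    PermitTrackingSketch(int maxPermits) {
        this.limiter = new Semaphore(maxPermits);
    }

    // Called for every operation added to the current batch.
    void recordOperation() {
        if (limiter.tryAcquire()) {
            pendingPermits++;
        }
        // If no permit was obtained we proceed anyway, but do not count it,
        // so the completion callback never releases more than was acquired.
    }

    // Called when the batch is sent. The returned callback captures the
    // number of permits belonging to this message and returns exactly that
    // many, whether the request succeeds or fails.
    Runnable sendBatch() {
        final int permitsInBatch = pendingPermits;
        pendingPermits = 0;
        return () -> limiter.release(permitsInBatch);
    }

    int availablePermits() {
        return limiter.availablePermits();
    }
}
```

Because the count is captured when the message is sent, a later failure
returns the permits for the whole message rather than a single one.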

In case a BatchedModifications message fails, we mark that fact and
turn into a bypass mode: we fail any subsequent reads and do not send
any further BatchedModifications until we see ready being set -- at
which point we coordinate with backend to shoot down the transaction.
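
The bypass behaviour can be pictured as a small state machine. The sketch
below is an assumption-laden illustration: BypassModeSketch, its method
names and the String-based read are invented for the example and do not
reflect the actual RemoteTransactionContext API.

```java
// Hypothetical sketch of the bypass mode; not the real RemoteTransactionContext.
final class BypassModeSketch {
    private boolean failedModification;
    private int batchesSent;

    // Invoked from the completion callback of a failed batch.
    void onBatchFailed() {
        failedModification = true;
    }

    // Returns true if the batch was actually transmitted. After a failure,
    // only the batch carrying the ready flag goes out, so the backend can
    // be told about the doomed transaction.
    boolean sendBatch(boolean ready) {
        if (failedModification && !ready) {
            return false;
        }
        batchesSent++;
        return true;
    }

    // Reads fail fast once a batch has been lost, since the backend may
    // not have all modifications and could return an incorrect result.
    String read(String path) {
        if (failedModification) {
            throw new IllegalStateException("a previous modification batch failed");
        }
        return "data@" + path;
    }

    int batchesSent() {
        return batchesSent;
    }
}
```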

An alternative strategy would be to continue transmitting
BatchedModifications, but that would incur an AskTimeout during split,
increasing the time it takes us to flush the doomed transaction
out of the system.

JIRA: CONTROLLER-1814
Change-Id: I919bae0e7173910665e8ec2342d076a710c1c7bf
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 7925d904ffd56c13ddde53e0e7bf6b08b437757d)
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/OperationLimiter.java
opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/RemoteTransactionContext.java
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/OperationLimiterTest.java [deleted file]
opendaylight/md-sal/sal-distributed-datastore/src/test/java/org/opendaylight/controller/cluster/datastore/RemoteTransactionContextTest.java [new file with mode: 0644]