controller.git
6 years ago Fix RemoteTransactionContext limiter accounting 01/68901/3
Robert Varga [Fri, 9 Feb 2018 15:55:27 +0000 (16:55 +0100)]
Fix RemoteTransactionContext limiter accounting

In case we lose connectivity between the frontend and backend
at the early stages of a big transaction, e.g. after the transaction
is created at the backend and before it is submitted, we can run into
OperationLimiter preventing recovery.

The reason for this is that OperationLimiter itself does not know
how many permits a BatchedModification request contained, hence
on AskTimeoutException it would only decrement permits by one
and the operations would remain throttled. With large transactions
this means the application will suddenly become bogged down
by the OperationLimiter, preventing it from submitting the transaction
or otherwise recovering.

Once any BatchedModifications request fails, the transaction is
doomed anyway, as the message counts on frontend and backend will not
match. Furthermore we must not issue any reads -- the backend does
not have all the modifications, hence it could return an incorrect
result.

Move permit tracking to RemoteTransactionContext, where we can capture
the number of permits in the OnComplete that gets invoked, properly
returning permits which correspond to the BatchedModifications message.
If we have failed to acquire a permit, we also note that and do not
underflow the semaphore.

In case a BatchedModifications message fails, we mark that fact and
turn into a bypass mode: we fail any subsequent reads and do not send
any further BatchedModifications until we see ready being set -- at
which point we coordinate with backend to shoot down the transaction.

An alternative strategy would be to continue transmitting
BatchedModifications, but that would incur an AskTimeout during split,
increasing the time it takes us to flush the doomed transaction
out of the system.
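The permit accounting described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: PermitLimiter, BatchPermitTracker and all method names are hypothetical, and the sketch uses a non-blocking tryAcquire() purely for brevity.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

/** Hypothetical stand-in for the shared limiter; not the real OperationLimiter. */
final class PermitLimiter {
    private final Semaphore semaphore;

    PermitLimiter(int maxPermits) {
        semaphore = new Semaphore(maxPermits);
    }

    boolean tryAcquire() {
        return semaphore.tryAcquire();
    }

    void release(int permits) {
        semaphore.release(permits);
    }

    int available() {
        return semaphore.availablePermits();
    }
}

/** Hypothetical per-transaction tracker that knows how many permits each batch holds. */
final class BatchPermitTracker {
    private final PermitLimiter limiter;
    private int heldPermits;   // permits actually acquired since the last batch
    private boolean failed;    // set once any batch fails; enables bypass mode

    BatchPermitTracker(PermitLimiter limiter) {
        this.limiter = limiter;
    }

    /** Records one modification; a permit is counted only if it was really acquired. */
    void onModification() {
        if (!failed && limiter.tryAcquire()) {
            heldPermits++;
        }
    }

    /** Captures the batch's permit count and returns its completion callback. */
    Consumer<Throwable> batchSent() {
        final int permitsInBatch = heldPermits;
        heldPermits = 0;
        return failure -> {
            if (failure != null) {
                failed = true;
            }
            // Return exactly what this batch held, on success and on failure alike.
            limiter.release(permitsInBatch);
        };
    }

    /** Reads must be refused after a failure: the backend lacks some modifications. */
    boolean mayRead() {
        return !failed;
    }
}
```

Because the callback closes over the permit count of its own batch, a timeout returns all of that batch's permits rather than just one, and a failed acquire never inflates the count.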

JIRA: CONTROLLER-1814
Change-Id: I919bae0e7173910665e8ec2342d076a710c1c7bf
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 7925d904ffd56c13ddde53e0e7bf6b08b437757d)

6 years ago Fixup test referring to description statement 02/68902/1
Robert Varga [Tue, 6 Feb 2018 18:25:32 +0000 (19:25 +0100)]
Fixup test referring to description statement

This was broken by yangtools properly processing whitespace,
which retains the newline.

Change-Id: I5796275a80bc1989061fd745270d51a4a37f97bd
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

6 years ago Fix intermittent RemoteRpcRegistryMXBeanImplTest failures 52/67652/3
Tom Pantelis [Fri, 26 Jan 2018 05:48:41 +0000 (00:48 -0500)]
Fix intermittent RemoteRpcRegistryMXBeanImplTest failures

testFindRpcByRoute(org.opendaylight.controller.remote.rpc.registry.mbeans.RemoteRpcRegistryMXBeanImplTest)  Time elapsed: 0.98 sec  <<< ERROR!
java.lang.IllegalStateException: Attempted to access local bucket before recovery completed
at com.google.common.base.Preconditions.checkState(Preconditions.java:501)
at org.opendaylight.controller.remote.rpc.registry.gossip.BucketStoreActor.getLocalBucket(BucketStoreActor.java:384)
at org.opendaylight.controller.remote.rpc.registry.gossip.BucketStoreActor.getLocalData(BucketStoreActor.java:110)
at org.opendaylight.controller.remote.rpc.registry.mbeans.RemoteRpcRegistryMXBeanImpl.findRpcByRoute(RemoteRpcRegistryMXBeanImpl.java:91)
at org.opendaylight.controller.remote.rpc.registry.mbeans.RemoteRpcRegistryMXBeanImplTest.testFindRpcByRoute(RemoteRpcRegistryMXBeanImplTest.java:142)

The problem is that RemoteRpcRegistryMXBeanImpl accesses the enclosing
RpcRegistry Actor instance directly and violates actor encapsulation.
RemoteRpcRegistryMXBeanImpl should access the RpcRegistry via messages
sent to its ActorRef.
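The difference between reaching into an actor and messaging it can be illustrated without Akka. The sketch below is an assumption-laden stand-in: RegistryActor and RegistryBean are hypothetical names, and a single-threaded executor plays the role of the actor mailbox.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical registry whose state is touched only by its own mailbox thread. */
final class RegistryActor {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "registry-mailbox");
        t.setDaemon(true);
        return t;
    });
    private final Set<String> routes = new TreeSet<>();

    /** "Tell": enqueue a state change. */
    void addRoute(String route) {
        mailbox.execute(() -> routes.add(route));
    }

    /** "Ask": enqueue a query; the reply is a private copy, never the live state. */
    CompletableFuture<Set<String>> findRoutes(String substring) {
        CompletableFuture<Set<String>> reply = new CompletableFuture<>();
        mailbox.execute(() -> {
            Set<String> matches = new TreeSet<>();
            for (String route : routes) {
                if (route.contains(substring)) {
                    matches.add(route);
                }
            }
            reply.complete(matches);
        });
        return reply;
    }
}

/** Hypothetical JMX-facing bean: it holds a reference for messaging, not the state. */
final class RegistryBean {
    private final RegistryActor registry;

    RegistryBean(RegistryActor registry) {
        this.registry = registry;
    }

    Set<String> findRpcByRoute(String substring) {
        // Bounded wait, so a stuck registry cannot hang the JMX caller forever.
        return registry.findRoutes(substring).orTimeout(5, TimeUnit.SECONDS).join();
    }
}
```

Since the query is processed in mailbox order, it is answered only after earlier messages (including any recovery step) have been handled.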

Change-Id: Icfd67c38e5d1bc3de283949207009d7aa34ab855
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
(cherry picked from commit 10427641dc0a75ec62e78ecfc4a7a0a7d438d462)
(cherry picked from commit c9aab0231686bc6aa06dcfef51fca8c272fb9382)

6 years ago register RemoteRpcRegistryMXBean 51/67651/3
wusandi [Wed, 20 Dec 2017 07:09:12 +0000 (15:09 +0800)]
register RemoteRpcRegistryMXBean

The RemoteRpcRegistryMXBean was no longer registered after a code refactoring.
Its registration was apparently removed in one of the earlier commits;
it should be added back so that RPCs can be queried via JMX
when one of them breaks down.

Change-Id: I506ed5c25c7615b8bb7ac9c0102bf671ff40bb78
Signed-off-by: wusandi <wusandi@163.com>
(cherry picked from commit 105587c7c4068aa8a0721669cff6aae7f28f6492)
(cherry picked from commit fafb6cc1d45f0ef060cec7fe5bad2f6acbca94c4)

6 years ago Fix ModificationType.APPEARED mapping 65/67665/3
Robert Varga [Mon, 29 Jan 2018 11:02:41 +0000 (12:02 +0100)]
Fix ModificationType.APPEARED mapping

When a node appears, it is an event equivalent to a WRITE,
not SUBTREE_MODIFIED, otherwise we are logically crossing
a non-existent node.
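A sketch of such a mapping, using local enums as stand-ins for the real types (the enum and class names here are hypothetical, and the DISAPPEARED case is an assumption by symmetry, not something this commit states):

```java
/** Local stand-in for the DOM-level modification kinds. */
enum DomModification { SUBTREE_MODIFIED, WRITE, DELETE, APPEARED, DISAPPEARED }

/** Local stand-in for the coarser kinds reported to listeners. */
enum ReportedModification { SUBTREE_MODIFIED, WRITE, DELETE }

final class ModificationMapper {
    private ModificationMapper() {
    }

    static ReportedModification toReported(DomModification type) {
        switch (type) {
            case WRITE:
            case APPEARED:
                // A node that appears did not exist before, so it reads as a write.
                return ReportedModification.WRITE;
            case DELETE:
            case DISAPPEARED:
                // Assumption: the mirror-image case maps to a delete.
                return ReportedModification.DELETE;
            case SUBTREE_MODIFIED:
                return ReportedModification.SUBTREE_MODIFIED;
            default:
                throw new IllegalStateException("Unhandled type " + type);
        }
    }
}
```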

Change-Id: I0876a18ec4af799db30c384fe4a7e38b9b2833c7
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

6 years ago Fix ReadyLocalTransactionSerializer 13/68413/2
Robert Varga [Mon, 8 Jan 2018 12:22:08 +0000 (13:22 +0100)]
Fix ReadyLocalTransactionSerializer

The following exception was seen in the field:

2017-12-20 19:37:05,507 | ERROR | ult-dispatcher-2 | Remoting                         | 174 - com.typesafe.akka.slf4j - 2.4.7 | java.lang.ClassNotFoundException: org.opendaylight.controller.cluster.datastore.messages.BatchedModifications
org.apache.commons.lang3.SerializationException: java.lang.ClassNotFoundException: org.opendaylight.controller.cluster.datastore.messages.BatchedModifications
        at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:229)
        at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:267)
        at org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransactionSerializer.fromBinaryJava(ReadyLocalTransactionSerializer.java:49)
        at akka.serialization.JSerializer.fromBinary(Serializer.scala:177)
        at akka.serialization.Serialization$$anonfun$deserialize$2.apply(Serialization.scala:124)
        at scala.util.Try$.apply(Try.scala:192)
        at akka.serialization.Serialization.deserialize(Serialization.scala:114)
        at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:80)
        at akka.serialization.Serialization$$anonfun$deserialize$2.apply(Serialization.scala:124)
        at scala.util.Try$.apply(Try.scala:192)
        at akka.serialization.Serialization.deserialize(Serialization.scala:114)
        at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:24)
        at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:60)
        at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:60)
        at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:78)
        at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:978)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:484)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:447)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassNotFoundException: org.opendaylight.controller.cluster.datastore.messages.BatchedModifications
        at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:501)
        at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:421)
        at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:412)
        at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:683)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1863)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2037)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
        at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:223)
        ... 26 more

As it turns out, ReadyLocalTransactionSerializer is not following JSerializer
documentation recommendations of loading classes via ExtendedActorSystem.
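The general technique, resolving classes through an explicitly supplied ClassLoader during Java deserialization, can be sketched with the JDK alone. This is not the actual fix (which goes through ExtendedActorSystem); LoaderAwareObjectInput and LoaderAwareSerialization are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

/** ObjectInputStream that resolves classes via a caller-supplied ClassLoader. */
final class LoaderAwareObjectInput extends ObjectInputStream {
    private final ClassLoader loader;

    LoaderAwareObjectInput(ClassLoader loader, InputStream in) throws IOException {
        super(in);
        this.loader = loader;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        try {
            return Class.forName(desc.getName(), false, loader);
        } catch (ClassNotFoundException e) {
            // Fall back to the default lookup, which also handles primitive types.
            return super.resolveClass(desc);
        }
    }
}

final class LoaderAwareSerialization {
    private LoaderAwareSerialization() {
    }

    static byte[] toBinary(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
        return bytes.toByteArray();
    }

    static Object fromBinary(byte[] data, ClassLoader loader) {
        try (ObjectInputStream in =
                new LoaderAwareObjectInput(loader, new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Deserialization failed", e);
        }
    }
}
```

In an OSGi container the default lookup uses the class loader of whichever bundle happens to be on the call stack, which is why passing the right loader explicitly matters.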

Change-Id: Idef62f8c7a50d607ef152083693fac63c7e92447
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit c64b1e26da272928abe57648757d578c2ac33869)

6 years ago Guards iteration against concurrent modification 83/68083/2
Jie Han [Mon, 13 Nov 2017 03:26:39 +0000 (11:26 +0800)]
Guards iteration against concurrent modification

Guard the iteration, as it would otherwise throw a ConcurrentModificationException.

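The commit message does not say which guard was used. One common approach, shown here purely as an illustration with a hypothetical ListenerRegistry class, is to iterate over a snapshot so that callbacks may modify the live list:

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerRegistry {
    private final List<Runnable> listeners = new ArrayList<>();

    synchronized void add(Runnable listener) {
        listeners.add(listener);
    }

    synchronized void remove(Runnable listener) {
        listeners.remove(listener);
    }

    /** Invokes every listener registered at the time of the call. */
    void fire() {
        final List<Runnable> snapshot;
        synchronized (this) {
            // Copy under the lock; iterate outside it so listeners may add/remove.
            snapshot = new ArrayList<>(listeners);
        }
        for (Runnable listener : snapshot) {
            listener.run();
        }
    }
}
```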
Change-Id: I39664b2238ef62d2add96cb76ac8c2113cfc2265
Signed-off-by: Jie Han <han.jie@zte.com.cn>
(cherry picked from commit a373371d34293ce0e436700ac328a58e9ea37f2e)

6 years ago ConcurrentDOMDataBroker LOG debug instead of error 81/68081/2
Michael Vorburger [Mon, 11 Dec 2017 22:24:39 +0000 (23:24 +0100)]
ConcurrentDOMDataBroker LOG debug instead of error

for "Tx: {} Error during phase {}, starting Abort"

see CONTROLLER-1802 and NETVIRT-916 for the full background story to
why this will help reduce confusion when we analyze logs.

Change-Id: I7f791bc92d3c22d96462381d6b966755134647d4
Signed-off-by: Michael Vorburger <vorburger@redhat.com>
(cherry picked from commit 1c717bbf117d3486196a0fdd73ac650721f9c557)

6 years ago Fix infinite loop on cancel transaction 78/68078/2
Jaime Caamaño Ruiz [Thu, 1 Feb 2018 09:02:46 +0000 (10:02 +0100)]
Fix infinite loop on cancel transaction

This patch fixes a problem where you would run into an infinite loop
after cancelling DOMForwardedWriteTransaction following an exception
thrown by the backing transaction's ready or submit methods.

Change-Id: I24ce3706dcc52e35890246b4796090cd6b1c99b9
JIRA: CONTROLLER-1812
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@suse.com>
(cherry picked from commit 7ab6f974861e01daa16ff56658eeb1be163cbfec)

6 years ago Bump versions by x.y.(z+1) 05/68605/1
jenkins-releng [Fri, 23 Feb 2018 14:26:50 +0000 (14:26 +0000)]
Bump versions by x.y.(z+1)

Change-Id: I304b1b5dba9f883d4a4a2f6238c865dfebd13cec
Signed-off-by: jenkins-releng <jenkins-releng@opendaylight.org>

6 years ago Bug 9060: Minor mdsaltrace_config.xml /this/will/never/exist 22/62622/5
Michael Vorburger [Mon, 4 Sep 2017 13:24:42 +0000 (15:24 +0200)]
Bug 9060: Minor mdsaltrace_config.xml /this/will/never/exist

Found during use, and avoids having to do this workaround:

log:set ERROR org.opendaylight.controller.md.sal.trace.dom.impl

Change-Id: Iff5fb5eee8d938f1ec6dcb33d5d8a6ec58f2a2b9
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

6 years ago Correct logging in FrontendClientMetadataBuilder 58/66058/1
Martin Dindoffer [Tue, 21 Nov 2017 17:43:01 +0000 (18:43 +0100)]
Correct logging in FrontendClientMetadataBuilder

Change-Id: I7851d72119607dbb354e213e82091354792b063b
Signed-off-by: Martin Dindoffer <martin.dindoffer@pantheon.tech>
(cherry picked from commit e5d4949c74b1d6cf50c16eaabf5600d255a743f4)

6 years ago Tracing Transaction wrappers delegate equals/hashCode/equals 34/65834/1
Michael Vorburger [Tue, 21 Nov 2017 17:13:06 +0000 (18:13 +0100)]
Tracing Transaction wrappers delegate equals/hashCode/equals

to fix IllegalStateException due to DOMBrokerTransactionChain !equals
TracingTransactionChain (only affected odl-mdsal-trace for
trace:transaction anyway)

see https://jira.opendaylight.org/browse/CONTROLLER-1792

Change-Id: I079ff9e99edfd55bec2acbe1984a5c2b7667c2de
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

6 years ago Toaster is shardless 56/65756/1
Robert Varga [Thu, 24 Aug 2017 23:33:54 +0000 (01:33 +0200)]
Toaster is shardless

It's not like we broke it into shards. Nothing like that, our toaster
is fully working. Nevertheless it is a sample and has no place
in production code nor its configuration.

Change-Id: Ie14c698c1ea45a5fe201d1b6227eeb4f2d9790a5
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 9b5705d5d4b396e4076d2bc46d554644ac3f150b)

6 years ago ForwardingRead[Only]/WriteTransaction implementations 02/65602/2
Michael Vorburger [Fri, 22 Sep 2017 22:27:29 +0000 (00:27 +0200)]
ForwardingRead[Only]/WriteTransaction implementations

this came up during https://git.opendaylight.org/gerrit/#/c/63372/

a bit similar (but otherwise unrelated) to my earlier
https://git.opendaylight.org/gerrit/#/c/63135/

Change-Id: If669f94b47bf41f49e54c66c6024aeff9805f8d6
Signed-off-by: Michael Vorburger <vorburger@redhat.com>
(cherry picked from commit 702a44e462672d6e4f7e5bcd94def61b70be8009)

6 years ago ForwardingDataBroker 04/65604/1
Michael Vorburger [Thu, 14 Sep 2017 12:25:57 +0000 (14:25 +0200)]
ForwardingDataBroker

as discussed on https://git.opendaylight.org/gerrit/#/c/63120/

Change-Id: I374d2a9f6cbad89f33e453eca270e5cc5f75a224
Signed-off-by: Michael Vorburger <vorburger@redhat.com>
(cherry picked from commit bc13f52e618510599b0dfcd5cd439f1a9b9bf1b5)

6 years ago TracingBroker: collapse ellipses 59/65559/1
Stephen Kitt [Thu, 7 Sep 2017 11:38:14 +0000 (13:38 +0200)]
TracingBroker: collapse ellipses

This avoids printing multiple "(...)" lines in succession.

(The test is made a little more change-resistant by removing the
expected line number.)
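Collapsing consecutive "(...)" lines can be sketched as below. EllipsisCollapser is a hypothetical helper, not the TracingBroker code itself:

```java
import java.util.ArrayList;
import java.util.List;

final class EllipsisCollapser {
    static final String ELLIPSIS = "(...)";

    private EllipsisCollapser() {
    }

    /** Keeps the first "(...)" of each consecutive run and drops the rest. */
    static List<String> collapse(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean previousWasEllipsis = false;
        for (String line : lines) {
            boolean isEllipsis = ELLIPSIS.equals(line.trim());
            if (!(isEllipsis && previousWasEllipsis)) {
                out.add(line);
            }
            previousWasEllipsis = isEllipsis;
        }
        return out;
    }
}
```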

Change-Id: I9ede5d0d15afecb06c61cbe2b2c5a70967616280
Signed-off-by: Stephen Kitt <skitt@redhat.com>
(cherry picked from commit d3d5d329914eddb066680f7e22ce5dd7c09616e7)

7 years ago Bump versions by x.y.(z+1) 34/64334/1
jenkins-releng [Tue, 17 Oct 2017 01:39:41 +0000 (01:39 +0000)]
Bump versions by x.y.(z+1)

Change-Id: I61b0906932814c38f9b99aa9e04f2e5a75c2fa46
Signed-off-by: jenkins-releng <jenkins-releng@opendaylight.org>

7 years ago Lower verbosity in SimpletxDomRead 04/63904/1
Vratko Polak [Mon, 14 Aug 2017 15:45:22 +0000 (17:45 +0200)]
Lower verbosity in SimpletxDomRead

Info on such big objects really makes the log hard to navigate.

Change-Id: I794e06766377ddd6e09e3f2f4142719d6049ac84
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
(cherry picked from commit db81f42c68508645f3a7d81f8041693b208d9a03)

7 years ago Bug 9165: Log config subsystem readiness as INFO 38/63138/2
Vratko Polak [Thu, 14 Sep 2017 13:10:11 +0000 (15:10 +0200)]
Bug 9165: Log config subsystem readiness as INFO

Change-Id: I487760e19ac317f7246ac9b9b47f2a65df100e6b
Signed-off-by: Vratko Polak <vrpolak@cisco.com>

7 years ago Bug 8829: Ignore error when initializing dsbenchmark 41/60141/2
Vratko Polak [Mon, 27 Feb 2017 14:49:52 +0000 (15:49 +0100)]
Bug 8829: Ignore error when initializing dsbenchmark

+ More capabilities listed in the config file,
  the list is probably still not complete.
+ Also fix ParenPad violations.

Change-Id: I6f4902bb8236fc1560e1e38554465aefadb775ee
Signed-off-by: Vratko Polak <vrpolak@cisco.com>

7 years ago Bug 9060: Filter TracingBroker stack trace elements 35/62635/2
Michael Vorburger [Mon, 4 Sep 2017 15:24:49 +0000 (17:24 +0200)]
Bug 9060: Filter TracingBroker stack trace elements

Just to make them a lot easier to read, because what is really
interesting in them is the "middle part" (before the trace close
tracking infra classes and after the lower level e.g. BP set up class
stack frames).
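Filtering of this kind can be sketched as dropping frames whose class name starts with an uninteresting prefix. StackTraceFilter and the prefixes used with it are hypothetical; the real filter rules are not given in this message.

```java
import java.util.ArrayList;
import java.util.List;

final class StackTraceFilter {
    private StackTraceFilter() {
    }

    /** Returns the frames whose class name matches none of the hidden prefixes. */
    static List<StackTraceElement> filter(StackTraceElement[] trace, List<String> hiddenPrefixes) {
        List<StackTraceElement> kept = new ArrayList<>();
        for (StackTraceElement frame : trace) {
            if (!startsWithAny(frame.getClassName(), hiddenPrefixes)) {
                kept.add(frame);
            }
        }
        return kept;
    }

    private static boolean startsWithAny(String className, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```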

Change-Id: I5f90b69a10ec0ea3f3e3407279c523751813418d
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago BUG-8639: always invalidate primary info cache 73/62973/2
Robert Varga [Mon, 11 Sep 2017 14:03:58 +0000 (16:03 +0200)]
BUG-8639: always invalidate primary info cache

When we remove local shard, make sure we invalidate the associated
cache entry.

Change-Id: I83d6320e7308fe9bdf9c66c928fa91198674eae1
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago BUG-9054: add a ClusterSingletonService integration test 49/62449/5
Robert Varga [Wed, 30 Aug 2017 14:02:55 +0000 (16:02 +0200)]
BUG-9054: add a ClusterSingletonService integration test

This adds the equivalent of the chasing-the-leader in unit test
format, so we can check if the integration works as expected.

Change-Id: I53a89172e8fd750532ee8d13c62ee6dbb94ffb59
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago BUG-8858: remove sleeps from test driver 40/62140/3
Robert Varga [Tue, 22 Aug 2017 08:51:58 +0000 (10:51 +0200)]
BUG-8858: remove sleeps from test driver

This is a follow-up patch to speed up test driver, properly
chasing leader, without any sleeps incurred.

Change-Id: I55ed680ad3f45813b3ee3d8b948046c4ae34e273
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago Bug 9008: Fix the error of the persisted journal data format 29/62529/1
HeYunBo [Fri, 18 Aug 2017 12:31:39 +0000 (20:31 +0800)]
Bug 9008: Fix the error of the persisted journal data format

We have to clear the lastLeafSetQName while processing the end event for a node
in NormalizedNodeInputStreamReader and AbstractNormalizedNodeDataOutput.

Otherwise, while processing a leaf-list node, the leaf-list entry node
may incorrectly use another LeafSetQName as its node identifier.
The DataTree reconstructed from the persisted journal after a controller
restart will then, under certain circumstances, not be equal to the
DataTree before the restart.

Change-Id: I4ee823f59fe477d08f982ae73e3850433dfea8ee
Signed-off-by: HeYunBo <he.yunbo@zte.com.cn>
(cherry picked from commit 0077859d16ed922af1449f075033069f4d9dbffe)

7 years ago Fix intermittent testFollowerResyncWith*LeaderRestart failure 09/62509/2
Michael Vorburger [Thu, 31 Aug 2017 18:42:44 +0000 (20:42 +0200)]
Fix intermittent testFollowerResyncWith*LeaderRestart failure

This is a back-port of 88e2974b8d391d6e91a6338b0a1b8dbf966a8a71 from
master to stable/carbon.  It was done by manually copy/pasting, not a
real cherry-pick, as line numbers were too different.

Change-Id: Ic3815a694a8531d9f7f42f19ad8978d52fc902b3
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Fix intermittent testOwnerChangesOnPeerAvailabilityChanges failure 10/62510/2
Tom Pantelis [Tue, 29 Aug 2017 19:42:12 +0000 (15:42 -0400)]
Fix intermittent testOwnerChangesOnPeerAvailabilityChanges failure

EntityOwnershipShardTest.testOwnerChangesOnPeerAvailabilityChanges:647->AbstractEntityOwnershipTest.verifyRaftState:280->lambda$testOwnerChangesOnPeerAvailabilityChanges$2:648 getRaftState expected:<[]Leader> but was:<[Pre]Leader>

It seems this was indirectly introduced by the addition of the
PurgeTransactionPayload - changes the timing of things a bit. I added
code to ensure peer2's lastAppliedIndex is up-to-date with the leader's
prior to stopping the leader to make it deterministic (ie peer2 should
be able to go straight to Leader).

Change-Id: I9abb950c7dc67b2d481d07b9b421ae46421b6510
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
(cherry picked from commit 1b0f84c4957e464bad6f7cb7350a8171c3d1621b)

7 years ago BUG-9054: do not use BatchedModifications needlessly 53/62453/2
Robert Varga [Wed, 30 Aug 2017 14:28:28 +0000 (16:28 +0200)]
BUG-9054: do not use BatchedModifications needlessly

The transaction identifier, which is a required parameter for
BatchedModifications, is a resource tracked on the backend and is
assumed to be allocated contiguously. Using BatchedModifications
to transport only a list of modifications means we are allocating
transaction IDs which we then never use.

This patch reworks the logic so it tracks modifications in a list
and allocates BatchedModifications only when we are ready to actually
commit something.
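The idea of deferring identifier allocation until commit time can be sketched as follows. TxIdAllocator, Batch and ModificationBuffer are hypothetical names, with plain strings standing in for modifications:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical allocator of contiguous transaction identifiers. */
final class TxIdAllocator {
    private long next;

    long allocate() {
        return next++;
    }

    long allocatedSoFar() {
        return next;
    }
}

/** Hypothetical message carrying an identifier plus its modifications. */
final class Batch {
    final long txId;
    final List<String> modifications;

    Batch(long txId, List<String> modifications) {
        this.txId = txId;
        this.modifications = modifications;
    }
}

/** Collects modifications in a plain list; no identifier is used until commit(). */
final class ModificationBuffer {
    private final TxIdAllocator ids;
    private final List<String> pending = new ArrayList<>();

    ModificationBuffer(TxIdAllocator ids) {
        this.ids = ids;
    }

    void add(String modification) {
        pending.add(modification);
    }

    /** Dropping buffered work consumes no identifier. */
    void discard() {
        pending.clear();
    }

    /** Allocates an identifier only now that there is something to commit. */
    Batch commit() {
        Batch batch = new Batch(ids.allocate(), new ArrayList<>(pending));
        pending.clear();
        return batch;
    }
}
```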

Change-Id: I3f71511cfd68e96e80790e69d28d083f195e5e12
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago Bug 9060: Karaf CLI command to print open transactions 04/62504/2
Michael Vorburger [Tue, 29 Aug 2017 12:22:59 +0000 (14:22 +0200)]
Bug 9060: Karaf CLI command to print open transactions

This is not a 1:1 cherry-pick from master, but includes manual work to:
  1. fix versions and more in mdsal-trace/cli/pom.xml
  2. rework PrintOpenTransactionsCommand.java from Karaf 4 API to v3
  3. incl. for ^^^ a new BP commands.xml (not needed anymore on Karaf 4)
  4. features-mdsal-trace POM and features.xml include new cli bundle

including some minor changes to make output more pretty / readable.

This is, for now, the last in a series of commits which is part of a
solution I'm proposing in order to be able to detect OOM issues such as
Bug 9034, based on using the mdsal-trace DataBroker.

Change-Id: I83af00a0713be4e8fab3085942b7b57d7183a20c
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 9060: TracingBroker printOpenTransactions 03/62503/1
Michael Vorburger [Mon, 28 Aug 2017 16:38:50 +0000 (18:38 +0200)]
Bug 9060: TracingBroker printOpenTransactions

This method is intended to be used from a Karaf CLI command in the next
change (and maybe JMX or something else like that later), which can be
invoked during future automated testing to detect Tx leaks during CSIT.

This is one of a series of commits which is part of a solution I'm
proposing in order to be able to detect OOM issues such as Bug 9034,
based on using the mdsal-trace DataBroker.

Change-Id: I682700bef9644834e8b4ca36b21729f021a76bf0
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 9060: Remove un-used Instant getObjectCreated() from CloseTracked 02/62502/1
Michael Vorburger [Mon, 28 Aug 2017 15:27:17 +0000 (17:27 +0200)]
Bug 9060: Remove un-used Instant getObjectCreated() from CloseTracked

I initially thought that it would be "interesting" to be able to sort
output by object age in the CLI I'm planning to propose next, but
ultimately realized that keeping an extra Instant field in EACH
CloseTracked (e.g. Tx) is just overhead and does not really add much
value (because the NUMBER of non-closed objects is MUCH more interesting
than this timestamp), thus removing this again after all.

This is one of a series of commits which is part of a solution I'm
proposing in order to be able to detect OOM issues such as Bug 9034,
based on using the mdsal-trace DataBroker.

Change-Id: Ie40fe23ce2af670902ff8e44a6757ebdf9ef915e
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 9060: mdsal-trace tooling with getAllUnique() to find Tx leaks 01/62501/1
Michael Vorburger [Mon, 28 Aug 2017 11:22:43 +0000 (13:22 +0200)]
Bug 9060: mdsal-trace tooling with getAllUnique() to find Tx leaks

This is one of a series of commits which is part of a solution I'm
proposing in order to be able to detect OOM issues such as Bug 9034,
based on using the mdsal-trace DataBroker.

Change-Id: I9cf4d8d9965468d77a0d82455655b9445535f0b0
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 9060: TracingBroker with transaction-debug-context-enabled 00/62500/1
Michael Vorburger [Thu, 24 Aug 2017 20:37:51 +0000 (22:37 +0200)]
Bug 9060: TracingBroker with transaction-debug-context-enabled

This is one of a series of commits which is part of a solution I'm
proposing in order to be able to detect OOM issues such as Bug 9034,
based on using the mdsal-trace DataBroker.

Change-Id: If62b7f76ea03d8cabe0c5a2088983275cfe50e44
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 9034: TracingBroker with TracingReadOnlyTransaction 99/62499/1
Michael Vorburger [Thu, 24 Aug 2017 17:31:23 +0000 (19:31 +0200)]
Bug 9034: TracingBroker with TracingReadOnlyTransaction

The new TracingReadOnlyTransaction wrapper doesn't do anything
interesting yet - but it will, in the related upcoming next change.

This is one of a series of (small, easy to review) commits which is
part of a solution I'm proposing in order to be able to detect OOM
issues such as Bug 9034, based on using the mdsal-trace DataBroker.

Change-Id: Ifa82c50d9c9eac76af99bf6a58e5e1955ee7429c
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 9034: TracingBroker with TracingTransactionChain 98/62498/1
Michael Vorburger [Thu, 24 Aug 2017 17:24:59 +0000 (19:24 +0200)]
Bug 9034: TracingBroker with TracingTransactionChain

This is one of a series of (small, easy to review) commits which is
part of a solution I'm proposing in order to be able to detect OOM
issues such as Bug 9034, based on using the mdsal-trace DataBroker.

Change-Id: I098c48a1fce1da2fdd0aafdc82fd3bef5626988a
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

7 years ago Bug 8885: Fix DistributedShardedDOMDataTree initialization 32/61132/2
Tom Pantelis [Wed, 26 Jul 2017 16:16:48 +0000 (12:16 -0400)]
Bug 8885: Fix DistributedShardedDOMDataTree initialization

DistributedShardedDOMDataTree initialization expects the prefix
configuration shard to be present and ready with a leader. However,
this isn't the case when the static module-shards is
bootstrapped without the local member so it can be dynamically
joined into an existing cluster. So I modified the ConfigShardLookupTask
to elide the ConfigShardReadinessTask.

Once past that, creation of the prefix-based default shard is attempted,
as there isn't a local module-based shard; however, this fails because the
local prefix configuration shard is not connected to a leader. To alleviate
this I just commented out the code to create the shard. Since the default
shard configuration is present in the out-of-box modules.conf and is
expected to be present, we can assume at this point that the local member
isn't in the replica list with the intention of dynamically joining it to
an existing cluster, at which time the shard will be created.

These changes at least fix the regression with the bootstrapping scenario.
We can revisit this initialization later w.r.t. prefix-based shards.

Change-Id: I1faf531f4c79914d45203ee132dd4e65ad2f18ba
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
(cherry picked from commit fa60e0fbe54f1604d3b68dcd5df14ba3aed7183f)

7 years ago BUG-9028: make NonPersistentDataProvider schedule invocation 69/62169/3
Robert Varga [Tue, 22 Aug 2017 17:59:21 +0000 (19:59 +0200)]
BUG-9028: make NonPersistentDataProvider schedule invocation

We need to make NonPersistentDataProvider behave in a fashion
similar to what PersistentDataProvider does for asynchronous
persistence calls, which is schedule execution of the provided
procedure rather than direct execution (which is fair for synchronous
execution).

In order to make that work we introduce ExecuteInSelfActor, which
has an executeInSelf() method, which uses internal mechanics to
schedule the call at a later point.

Change-Id: I116708d98154c8244ea80b4a1a1aa615abc3075d
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago Add debug to pinpoint lastApplied movement 96/62096/1
Robert Varga [Mon, 21 Aug 2017 16:35:20 +0000 (18:35 +0200)]
Add debug to pinpoint lastApplied movement

This method is called from multiple call sites, only one of which
is actually logging the change. Make sure we catch all transitions
by adding a LOG.debug() into the setter.

Change-Id: Ie777f8047a0893f9450fb132faa8adea235fbc5f
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago Make testTransactionForwardedToLeaderAfterRetry purge-aware 20/61720/1
Robert Varga [Mon, 14 Aug 2017 20:53:15 +0000 (22:53 +0200)]
Make testTransactionForwardedToLeaderAfterRetry purge-aware

At the point where we are waiting for transaction replication
to fully propagate, we need to account for the purge request,
as otherwise the configuration could interfere with index
sequencing.

Change-Id: I13f93e306e5b77304916e4c05f39dc28fb9cc049
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago Make ShardTest.testCommitWhenTransactionHasModifications() wait a bit 26/61626/1
Robert Varga [Mon, 14 Aug 2017 16:03:51 +0000 (18:03 +0200)]
Make ShardTest.testCommitWhenTransactionHasModifications() wait a bit

Committed transactions involve also a purge payload, which is persisted
asynchronously, hence it may or may not be visible in the journal just
after the transaction is reported as committed. Wait for two heartbeat
intervals before looking at the stats.

Change-Id: Ibe699edced12d006bf5ea8cd99aa821ab56d115d
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago BUG-8941: enqueue purges once ask-based transactions resolve 33/61433/1
Robert Varga [Mon, 7 Aug 2017 16:16:23 +0000 (18:16 +0200)]
BUG-8941: enqueue purges once ask-based transactions resolve

Backend state tracking relies on the transaction log to propagate
transaction state from the leader to followers. This includes purging
of transactions, i.e. the information that the frontend will not need
the state (and the final resolution of the transaction).

Tell-based protocol handles this on the frontend, ask-based needs to
do this on the backend (as it has no notion of transaction continuation).

Change-Id: I49e787b38998ef67b4a9ef504a70822263e1a340
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago BUG-8733: eliminate ProxyRegistration 46/61046/2
Robert Varga [Wed, 2 Aug 2017 14:53:09 +0000 (16:53 +0200)]
BUG-8733: eliminate ProxyRegistration

This class does not serve any real purpose and just clutters the code.
Get rid of it.

Change-Id: I43b88bc8eb777199a43283c3b232a299436cd74d
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago BUG-8733: refactor IdInts listeners 70/60270/26
Robert Varga [Thu, 13 Jul 2017 00:40:34 +0000 (02:40 +0200)]
BUG-8733: refactor IdInts listeners

Before doing any heavy work, this patch removes code duplication
between the two classes.

Change-Id: Ia17bf9fa31247f881a112dbb71c536e4ec7513ba
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago BUG-8898: prioritize InternalCommand 35/60935/2
Robert Varga [Mon, 31 Jul 2017 14:08:22 +0000 (16:08 +0200)]
BUG-8898: prioritize InternalCommand

InternalCommand requests should be processed as soon as possible,
and since we are already using ControlAwareMailbox, this is as simple
as marking InternalCommand as a ControlMessage.
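The effect of a control-aware mailbox can be pictured as two queues, with the control queue always drained first. The sketch below is plain Java, not Akka's implementation; Control, InternalCommandSketch and ControlAwareQueue are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Marker for messages that jump the queue, in the spirit of Akka's ControlMessage. */
interface Control {
}

/** Hypothetical internal command; implementing the marker is the only change needed. */
final class InternalCommandSketch implements Control {
    final String name;

    InternalCommandSketch(String name) {
        this.name = name;
    }
}

final class ControlAwareQueue {
    private final Queue<Object> control = new ArrayDeque<>();
    private final Queue<Object> normal = new ArrayDeque<>();

    void enqueue(Object message) {
        (message instanceof Control ? control : normal).add(message);
    }

    /** Returns the next control message if any, else the next normal one, else null. */
    Object dequeue() {
        Object message = control.poll();
        return message != null ? message : normal.poll();
    }
}
```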

Change-Id: Ic6025f4254da47801676c0c474d03e18abbf8f50
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

7 years ago Switch from config-parent to bundle-parent in mdsal-trace 24/60624/5
xygeng [Fri, 21 Jul 2017 06:28:45 +0000 (14:28 +0800)]
Switch from config-parent to bundle-parent in mdsal-trace

Change-Id: I5f2d1d7345845c00d0ac595606b96037530216ae
Signed-off-by: Geng Xingyuan <geng.xingyuan@zte.com.cn>

7 years ago BUG-8898: do not invoke timeouts directly 34/60934/1
Robert Varga [Mon, 31 Jul 2017 13:54:12 +0000 (15:54 +0200)]
BUG-8898: do not invoke timeouts directly

Request timeouts are occurring with the connection lock held,
at which point the connection can be at the tail of a successor
chain:

oldestConnection -> olderConnection -> connection

If the callback being invoked attempts to transmit an entry,
we will end up attempting to lock the entire chain. This would not
be a problem except that if there is a concurrent attempt to lock
the entire chain it ends up holding the lock of oldestConnection
and it is waiting for the lock on connection -- which will only be
released once the callback finishes executing, but that in turn
waits for oldestConnection to be unlocked -- a classic AB/BA deadlock.

This patch alleviates the problem by deferring callback execution
via executeInActor, i.e. the timeout will be delivered as part
of normal message processing.
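
The deferral idea can be sketched in plain Java as follows. This is a simplified model under stated assumptions: the class, its queue and its method names are hypothetical, and the queue merely stands in for the actor's message processing. The point it shows is that the callback is only enqueued while the lock is held and runs later with the lock released.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a connection; not the controller's actual class.
final class DeferredTimeoutSketch {
    private final ReentrantLock lock = new ReentrantLock();
    // Stand-in for the actor mailbox: tasks queued here run later, outside the lock.
    private final Queue<Runnable> mailbox = new ArrayDeque<>();

    // Timer path: the timeout is detected under the lock, but the callback is only enqueued.
    void runTimer(Runnable timeoutCallback) {
        lock.lock();
        try {
            mailbox.add(timeoutCallback);
        } finally {
            lock.unlock();
        }
    }

    // Normal message processing: each callback runs with the connection lock released.
    int processMailbox() {
        int processed = 0;
        while (true) {
            Runnable task;
            lock.lock();
            try {
                task = mailbox.poll();
            } finally {
                lock.unlock();
            }
            if (task == null) {
                return processed;
            }
            task.run();
            processed++;
        }
    }

    boolean lockHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }
}
```

Because the callback never runs under the connection lock, it can take the locks of the whole chain in the usual order without inverting against a concurrent chain-wide lock attempt.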

Change-Id: I237908cf214bcdfd477fe0212d09b207a0c2cdbf
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoRevert "Revert "BUG-7464: use yangtools.triemap"" 19/60919/1
Robert Varga [Sun, 30 Jul 2017 17:04:55 +0000 (17:04 +0000)]
Revert "Revert "BUG-7464: use yangtools.triemap""

This reverts commit 4bc5f74ae256566cfdb3cdb577d773edde99bd0b.

Change-Id: Ia650489e9620d615b77b39704edfc4f23a0ae686
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoRevert "BUG-7464: use yangtools.triemap" 17/60917/1
Robert Varga [Sun, 30 Jul 2017 13:04:55 +0000 (15:04 +0200)]
Revert "BUG-7464: use yangtools.triemap"

This reverts commit 5e00c9fdb216f5d7c1c0dc432e32bb15fd8ad337.
Partial revert to allow upstream API change to pass through
distcheck.

Change-Id: Ie44afd5b89a1e1e0aafb4ac6299043404ccd6669
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBug 8494: Separate writing and completion threads 06/60606/11
Vratko Polak [Fri, 21 Jul 2017 10:24:49 +0000 (12:24 +0200)]
Bug 8494: Separate writing and completion threads

If AbstractTransactionHandler uses only one executor thread,
future completion callbacks are delayed by throttling on writes.
CSIT aims to detect RequestTimeoutException within a narrow window,
so a separate executor for callbacks is used now.

The delay would not be that critical, but the problem is the timing
between a scheduled execution which exceeds scheduling gaps. These
seem to hold up normally-submitted tasks, leading to futures never
completing.

Therefore we use two Executors and synchronize state modification
call sites. Hence the two tasks (throttled producer) and future
completions can run concurrently (aside from state synchronization).

Change-Id: I642c5295ab6188b2d7e1b5feae62ab7ef52d41eb
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoExplicitly load the real DataBroker with component-name 80/59980/4
Josh [Tue, 4 Jul 2017 10:02:32 +0000 (13:02 +0300)]
Explicitly load the real DataBroker with component-name

It seems that karaf4 has "better" wiring so the
TracingBroker was being wired to itself, resulting
in stack overflows.

Change-Id: Iedb2e9dcfd53acf384ed3130cfcd78f313d76e1e
Signed-off-by: Josh <jhershbe@redhat.com>
7 years agofix config file for mdsal-trace and filtering mechanism 79/59979/4
Josh [Thu, 11 May 2017 13:43:07 +0000 (16:43 +0300)]
fix config file for mdsal-trace and filtering mechanism

Initially missed this but this patch fixes the initial config
xml file. Also, this code contains a fix for an issue with
filtering paths that CODEC could not handle and were reconstructed.

Change-Id: I34da4ce9e78c075439b0047407c75aa0b86feb16
Signed-off-by: Josh <jhershbe@redhat.com>
7 years agoBUG-8733: use DataTreeCandidateNodes.empty() 82/60682/1
Robert Varga [Fri, 21 Jul 2017 08:52:37 +0000 (10:52 +0200)]
BUG-8733: use DataTreeCandidateNodes.empty()

Removes code duplication, making DistributedShardChangePublisher
a bit smaller.

Change-Id: I67ab71c4344c1a61ebda929d6fceb1ebb3fbb376
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8619: do not touch forward path during purge enqueue 14/60614/2
Robert Varga [Thu, 20 Jul 2017 15:51:52 +0000 (17:51 +0200)]
BUG-8619: do not touch forward path during purge enqueue

In case of a purge request, the request is sent from the head
of a connection chain (i.e. the original connection which created
the transaction) and propagated via forwarders. This path needs
to make sure it does not go via throttling, as it is an internal
detail.

Separate the transmit paths a bit more, so that TransmitQueue
can push messages to forwarders' replay path.

Change-Id: I5e146b8d11e8654b4beae3959207efb9c2f18315
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-7464: use yangtools.triemap 91/60591/1
Robert Varga [Wed, 19 Jul 2017 12:08:04 +0000 (14:08 +0200)]
BUG-7464: use yangtools.triemap

Yangtools is moving away from using upstream Triemap to its
internal fork of that codebase. Switch this code, too.

Change-Id: I0d60ccc8927505a83a35631333203817484da9e0
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 354ad30e58618b4bdec256d7a78bd80284cccc77)

7 years agoBUG-8618: refresh transaction access when isolated 92/60492/3
Robert Varga [Mon, 17 Jul 2017 15:11:48 +0000 (17:11 +0200)]
BUG-8618: refresh transaction access when isolated

When we are isolated leader we stop accepting messages from
the frontend. If we remain in this state for more than 15 seconds
this can result in a timeout -- which is obvious, but it really
is our fault.

Since we cannot make forward progress anyway, there is no point
in purging the transaction. Update its access time with whatever
the last mark for that frontend was.

Change-Id: I9ff56c91e4fda4b68cd34c05609dc88d6d65fd32
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8792: allow transactions to not time out after reconnect 90/60490/3
Robert Varga [Mon, 17 Jul 2017 12:41:47 +0000 (14:41 +0200)]
BUG-8792: allow transactions to not time out after reconnect

During reconnect churn, the frontend may be catching up with previous
transactions, hence we should hold off timing it out until it does.

When we arrive at a timed out transaction, we allow the access time to
be updated to connect time -- effectively saying the transaction was
touched at the time of reconnect.
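
One way to picture the rule is the small sketch below. It is an assumption-laden illustration, not the backend's code: the method and parameter names are hypothetical, and times are plain tick counts.

```java
// Hypothetical helper illustrating "treat reconnect as a touch"; not the actual backend logic.
final class AccessTimeSketch {
    private AccessTimeSketch() { }

    // A transaction counts as touched at the later of its last access and the last reconnect.
    static long effectiveAccessTicks(long lastAccessTicks, long lastConnectTicks) {
        return Math.max(lastAccessTicks, lastConnectTicks);
    }

    // The transaction is timed out only if even the effective access time is too old.
    static boolean isTimedOut(long nowTicks, long lastAccessTicks, long lastConnectTicks,
            long timeoutTicks) {
        return nowTicks - effectiveAccessTicks(lastAccessTicks, lastConnectTicks) > timeoutTicks;
    }
}
```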

Change-Id: I3930b5782579f50931b204d8579c2aee51e2bc55
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: record LeaderFrontendState time 31/60431/3
Robert Varga [Sat, 15 Jul 2017 21:33:25 +0000 (23:33 +0200)]
BUG-8618: record LeaderFrontendState time

In order to deal with IsolatedLeader state and transaction timeouts,
we need to maintain an accurate view of when we have seen the frontend
even if we are not accepting messages from it.

Add corresponding field and maintain it whenever we interact with
LeaderFrontend state. Also record last connect ticks for the same use.

Change-Id: I8e49037507fcd01470a03be8c0d611efca55dabf
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBug 8619: Introduce inheritance of progress trackers 06/59506/35
Vratko Polak [Fri, 14 Jul 2017 15:50:32 +0000 (17:50 +0200)]
Bug 8619: Introduce inheritance of progress trackers

+ Introduce cancelDebt method.
+ Use the newly introduced functionality in client code.
+ Delete unused copy constructors (including unit test).

Change-Id: Ib976343ed5f50c649ea08206c897cb70dead8b86
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoProgressTracker: Decrease delay due nearestAllowed 71/60071/6
Vratko Polak [Fri, 7 Jul 2017 11:14:57 +0000 (13:14 +0200)]
ProgressTracker: Decrease delay due nearestAllowed

If nearestAllowed is in past, that means we have
a temporary interval of relatively small demand for tasks.
We can reduce delay, as if the time since nearestAllowed
was a "delay in advance".

This way the queue stays closer to the intended capacity.

Change-Id: I40f95ea9cb25ea62d8c65ee78cafc79e9b56cc11
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
7 years agoBUG-8618: fix test driver 37/60137/8
Robert Varga [Mon, 10 Jul 2017 14:08:57 +0000 (16:08 +0200)]
BUG-8618: fix test driver

Since the test can produce bursts of completions, which in turn can
get slowed down by writeout of new messages, offload future completion
to the executor we have internally. This in turn simplifies things,
as we can rely on state being manipulated (mostly) from a single thread.

Also change ArrayDeque to a HashSet to ensure removal of tasks completes
quickly even in face of misordered responses.

Change-Id: Ia5341633af2dbe3e26e7208436405daf7632a876
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: add pause/unpause mechanics for tell-based protocol 33/60033/9
Robert Varga [Thu, 6 Jul 2017 16:26:44 +0000 (18:26 +0200)]
BUG-8618: add pause/unpause mechanics for tell-based protocol

When we are transitioning to/from paused state, we need to remove
all frontend-related state, including pending transactions, to ensure
ShardDataTree does not track them.

When we change to unpaused leader, we can reconstruct the state
from the journal -- the rest will be forwarded from the frontend anyway.

Change-Id: I28d486d1a6695e21dd7e6518609680d54e5a15eb
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoFix .1 version references 89/60389/1
Robert Varga [Fri, 14 Jul 2017 17:21:49 +0000 (19:21 +0200)]
Fix .1 version references

Post version-bump, need to fix up versions.

Change-Id: I7cb982c7d8744f70bf15d9c3c0736a34cdb6da69
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agomdsaltrace utility for debugging 78/59978/2
Josh [Sun, 11 Dec 2016 08:47:46 +0000 (10:47 +0200)]
mdsaltrace utility for debugging

Moved back to controller per decision from kernel meeting
a few weeks ago.

TracingBroker logs 'write' operations
and listener registrations to the md-sal. It logs the instance
identifier path, the objects themselves, as well as the stack trace of
the call invoking the registration or write operation. It works by
operating as a "bump on the stack" between the application and actual
DataBroker, intercepting write and registration calls and writing to the
log.

+ karaf4

Change-Id: Ie7d27901429f6e7bcac7ff62e49e4e3115f5915f
Signed-off-by: Josh <jhershbe@redhat.com>
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: introduce RaftActor.unpauseLeader() 32/60032/4
Robert Varga [Thu, 6 Jul 2017 15:15:21 +0000 (17:15 +0200)]
BUG-8618: introduce RaftActor.unpauseLeader()

This is a preparatory patch, which notifies RaftActor when
the operation hooked to pauseLeader() fails to complete and the
leader should resume its normal operation.

This is needed to correctly resume operations of tell-based protocol
after a pauseLeader() completes without actually changing the leader.

Change-Id: Ia00e52ebb327575a484af62bf0c31131a33303b3
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: eliminate SimpleShardDataTreeCohort subclasses 98/59898/5
Robert Varga [Mon, 3 Jul 2017 17:57:38 +0000 (19:57 +0200)]
BUG-8618: eliminate SimpleShardDataTreeCohort subclasses

Now that we handle pre-cancommit failures using reportFailure(),
there is no need to have specialized subclasses for cohorts, as
the initial failure can cleanly be handled via nextFailure.

This also places a guard in reportFailure() so we do not override
a failure once it is set -- which should only happen in the case
of a dead-on-arrival transaction and it timing out in READY state.

Change-Id: I057c5b36006843f51d60034d30af83bac4e02cd7
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: rework AbstractProxyTransaction.flushState() 98/59998/4
Robert Varga [Thu, 6 Jul 2017 07:04:10 +0000 (09:04 +0200)]
BUG-8618: rework AbstractProxyTransaction.flushState()

Instead of directly forwarding state use ModifyTransactionRequest
to encapsulate state and forward it separately to the successor.

This eliminates sendRequest() from replay path, ensuring the replay
thread is not blocked.

Change-Id: Ice86791d417b7487b9d3b1df06341dd028cde7f8
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: reconnect connections more aggressively 96/59896/2
Robert Varga [Mon, 3 Jul 2017 17:10:22 +0000 (19:10 +0200)]
BUG-8618: reconnect connections more aggressively

Given that the timeout period on backend for an existing transaction
is 15 seconds, sleeping for 5 seconds between reconnect attempts seems
excessive. Lower the timer to 1 second, which should give us a slightly
better chance to avoid timeouts.

Change-Id: Ib74480f5630865cb7a11ca7027e0495443d1d14e
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: turn timeouts in READY state into canCommit failures 95/59895/2
Robert Varga [Mon, 3 Jul 2017 16:55:18 +0000 (18:55 +0200)]
BUG-8618: turn timeouts in READY state into canCommit failures

This patch adds more details to the TimeoutException reported when
we prune a transaction while it is in the queue. It also peels the
READY case from the defaults and makes sure we send an authoritative
reply back to the frontend when it requests the transaction to be
committed.

Change-Id: I21364ff7e7103af8be6988b8483adc112c3c1d25
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: improve logging 90/59890/3
Robert Varga [Mon, 3 Jul 2017 15:30:43 +0000 (17:30 +0200)]
BUG-8618: improve logging

While target sequence is important, we also need to log transmit
sequence, too.

Since this issue involves a state mismatch on the backend, improve
ShardDataTreeCohort logging to include transaction identifier
and state.

Change-Id: I21735870a9ae7983dc14a8f8f4d7464d3448ca60
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoFix Verify/Preconditions string format 39/59839/3
Robert Varga [Mon, 3 Jul 2017 08:17:54 +0000 (10:17 +0200)]
Fix Verify/Preconditions string format

These methods take a String.format() string, not a logging one, hence
we are not getting the information we want.
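
The difference is easy to reproduce with plain String.format(), used here as a stand-in for the Verify/Preconditions methods, which the message above describes as taking a String.format() string. The class and message text are made up for illustration.

```java
// Illustrative only: shows how a logging-style "{}" placeholder behaves under String.format().
final class FormatPlaceholderSketch {
    private FormatPlaceholderSketch() { }

    // Logging-style placeholder: "{}" is not a format specifier, so the argument is ignored.
    static String loggingStyle(Object state) {
        return String.format("Unexpected state {}", state);
    }

    // String.format()-style placeholder: "%s" is replaced by the argument.
    static String formatStyle(Object state) {
        return String.format("Unexpected state %s", state);
    }
}
```

The first variant yields the literal text `Unexpected state {}`, which is why the diagnostic information goes missing.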

Change-Id: I46de0d64c85594e3d7b8be97951f1cf5249bca8f
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBump versions by x.y.(z+1) 41/60341/1
jenkins-releng [Fri, 14 Jul 2017 12:49:19 +0000 (12:49 +0000)]
Bump versions by x.y.(z+1)

Change-Id: I2d9547cc35fadb4f0a2cca4f55164006fd4378b5
Signed-off-by: jenkins-releng <jenkins-releng@opendaylight.org>
7 years agoBUG-8704: rework seal mechanics to not wait during replay 54/59654/7
Robert Varga [Wed, 28 Jun 2017 14:47:13 +0000 (16:47 +0200)]
BUG-8704: rework seal mechanics to not wait during replay

AbstractProxyTransaction.seal() and most notably internalSeal()
can end up pushing messages down the connection hence they
can end up slowing down the replay process.

The replay paths end up enqueuing subsequent requests anyway, so
rework the structure to split the 'seal only' and 'seal and flush'
codepaths.

Change-Id: Ie75c1ef8aa0d3d5d7ca482d383fd516077ca50b4
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBug 8768: Close itemProducer for every code path 49/59649/5
Vratko Polak [Thu, 29 Jun 2017 11:50:07 +0000 (13:50 +0200)]
Bug 8768: Close itemProducer for every code path

Change-Id: Ib87de13e2a0e6f128f74a05b80ffb4331e345d2c
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
7 years agoBUG-8494: rework AbstractTransactionHandler 04/59604/1
Robert Varga [Wed, 28 Jun 2017 09:34:34 +0000 (11:34 +0200)]
BUG-8494: rework AbstractTransactionHandler

If we have a transaction failure while we are producing transactions,
we could end up adding a delay until the failure is detected as we
would continue jamming in transactions.

Rework internal logic to halt processing as soon as a failure is seen,
speeding up detection and simplifying code.

Change-Id: I19d13c78d94bb39481abde477ec4e3df03a6aa57
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoImprove ShardBackendInfo.toString() 01/59601/1
Robert Varga [Wed, 28 Jun 2017 08:24:17 +0000 (10:24 +0200)]
Improve ShardBackendInfo.toString()

Slight update to eliminate a space from the property name and
an explicit present/absent string.

Change-Id: I9cb3a57049737c8ea25d22263140ff9974e23502
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8445: ignore responses from mismatched sessions 99/59599/1
Robert Varga [Wed, 28 Jun 2017 08:11:47 +0000 (10:11 +0200)]
BUG-8445: ignore responses from mismatched sessions

We have to check the session ID of the response in order not to
wreck transmit consistency in the face of leader changes and reconnects.

If we reconnect the connection to the new leader before we saw all
responses from the old leader, we end up in a situation where the
old leader completes some of the replayed messages before we either
send them to the new leader or receive (the correct) reply.

Guard against this by checking the session ID before attempting to
pair a response to a request.
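
A minimal sketch of such a guard, assuming responses carry the session ID they were produced under. The class, its fields and the string payloads are hypothetical and far simpler than the real connection state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of a connection which ignores responses from a stale session.
final class SessionGuardSketch {
    // In-flight requests keyed by transmit sequence.
    private final Map<Long, String> inflight = new HashMap<>();
    private long currentSessionId;

    SessionGuardSketch(long sessionId) {
        this.currentSessionId = sessionId;
    }

    // Reconnecting to a new leader starts a new session.
    void reconnect(long newSessionId) {
        currentSessionId = newSessionId;
    }

    void transmit(long sequence, String request) {
        inflight.put(sequence, request);
    }

    // Pairs a response with its request only if the session matches; stale responses are dropped.
    Optional<String> complete(long responseSessionId, long sequence) {
        if (responseSessionId != currentSessionId) {
            return Optional.empty();
        }
        return Optional.ofNullable(inflight.remove(sequence));
    }
}
```

A late response from the old leader is ignored and the request stays in flight, so it can still be completed by the reply from the new leader.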

Change-Id: I28fa98b89c679715c3a0c546962d00533e76aa5d
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8494: fix failure path thinko 76/59576/1
Robert Varga [Tue, 27 Jun 2017 14:53:09 +0000 (16:53 +0200)]
BUG-8494: fix failure path thinko

The check should be to see if the failure has *not* been set,
hence invert the check.

Change-Id: I2c3893924f1c985687beedbfae0889388fad15c7
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8445: check sessionId before propagating failures 27/59527/1
Robert Varga [Mon, 26 Jun 2017 14:31:30 +0000 (16:31 +0200)]
BUG-8445: check sessionId before propagating failures

When we have leader movement occurring, based on timing details we
can re-establish a connection to the new leader and then start
receiving responses from the old leader telling us it no longer
is the leader.

To stop this from happening we need to check connection session ID
against the incoming failure.

Change-Id: If9a891016c7f213f2552283e3ec13485e598f5a4
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8494: Cleanup clustering-it-provider 65/59465/6
Robert Varga [Fri, 23 Jun 2017 11:02:40 +0000 (13:02 +0200)]
BUG-8494: Cleanup clustering-it-provider

Fixes various warnings and refactors MdsalLowLevelTestProvider
to be slightly cleaner in terms of number of classes.

It also eliminates synchronous thread blocking on future collection
and instead schedules a task which performs the cleanup if the system
gets stuck.

Change-Id: I657f3df60c620284538bdf39ab1536eac8448801
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG 8629: Try to allow notification processing to finish in unsubscribe of listeners. 52/58952/5
Tomas Cere [Wed, 14 Jun 2017 13:42:07 +0000 (15:42 +0200)]
BUG 8629: Try to allow notification processing to finish in unsubscribe of listeners.

Change-Id: I8638c6066b86b101484d3d80cd0fed146a478778
Signed-off-by: Tomas Cere <tcere@cisco.com>
7 years agoBug 8621 - Add shutdown-prefix-shard-replica rpc to MdsalLowLevelTestProvider 80/58580/6
Jakub Morvay [Fri, 9 Jun 2017 07:45:14 +0000 (09:45 +0200)]
Bug 8621 - Add shutdown-prefix-shard-replica rpc to MdsalLowLevelTestProvider

csit testing scenarios require clean shutdown of shard's local replica
functionality. This introduces shutdown-prefix-shard-replica rpc to
MdsalLowLevelTestProvider. Upon invoking this rpc, local replica of
specified prefix-based shard is gracefully stopped.

Change-Id: I620b7ae2dbc9978dd155c64f703d421d46108e3d
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>
7 years agoBug 8621 - Add shutdown-shard-replica rpc to MdsalLowLevelTestProvider 79/58579/7
Jakub Morvay [Fri, 9 Jun 2017 07:12:20 +0000 (09:12 +0200)]
Bug 8621 - Add shutdown-shard-replica rpc to MdsalLowLevelTestProvider

csit testing scenarios require clean shutdown of shard's local replica
functionality. This introduces shutdown-shard-replica rpc to
MdsalLowLevelTestProvider. Upon invoking this rpc, local replica of
specified module-based shard is gracefully stopped.

Change-Id: Ia8e0be65ecc99f9e208ff4ffd737b210437a9f51
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>
7 years agoBUG-8494: propagate submit failure immediately 39/59239/2
Robert Varga [Tue, 20 Jun 2017 14:06:01 +0000 (16:06 +0200)]
BUG-8494: propagate submit failure immediately

Rather than waiting for abort to complete, which cannot happen
during isolation for example, propagate timeout immediately.

Change-Id: I90333938cb951f3b478320c682c65be219660fdf
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoFix format string mismatch 25/59225/1
Robert Varga [Sun, 4 Jun 2017 18:37:59 +0000 (20:37 +0200)]
Fix format string mismatch

String expects two objects, not one.

Change-Id: I5cc37336236e88c13d569c656910d7fd969bb655
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 45c422e0af781c2fa6978103afb14de9fbdc1c55)

7 years agoBUG-8445: Guard against NPE 60/59160/2
Robert Varga [Mon, 19 Jun 2017 12:15:10 +0000 (14:15 +0200)]
BUG-8445: Guard against NPE

We have observed this NPE:

[...]
Caused by: java.lang.NullPointerException
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:810)
        at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:105)
        at org.opendaylight.controller.cluster.datastore.ChainedCommitCohort.canCommit(ChainedCommitCohort.java:58)
        at org.opendaylight.controller.cluster.datastore.FrontendReadWriteTransaction.directCommit(FrontendReadWriteTransaction.java:384)
        at org.opendaylight.controller.cluster.datastore.FrontendReadWriteTransaction.handleModifyTransaction(FrontendReadWriteTransaction.java:527)
        at org.opendaylight.controller.cluster.datastore.FrontendReadWriteTransaction.doHandleRequest(FrontendReadWriteTransaction.java:174)
        at org.opendaylight.controller.cluster.datastore.FrontendTransaction.handleRequest(FrontendTransaction.java:141)

Which is quite weird, as the FrontendReadWriteTransaction state seems
to indicate the transaction is ready to be committed, yet ShardDataTree
does not seem to have a record of it.

While we are investigating the root cause, this patch adds an explicit
warning when this happens.

Change-Id: I2ddff76357c33d7df2b3f25a2703c69715fbd871
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoLower UnboundedDequeBasedControlAwareMailbox logging 59/59159/2
Robert Varga [Mon, 19 Jun 2017 11:53:26 +0000 (13:53 +0200)]
Lower UnboundedDequeBasedControlAwareMailbox logging

Using debug logging seems excessive, leading to a lot of messages
at debug level. I think we can downgrade to trace instead.

Change-Id: I2a7f87760a1eefe9794eac3b4025b6a3891c30a3
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoCleanup ProduceTransactionsHandler 04/59004/6
Robert Varga [Thu, 15 Jun 2017 08:39:55 +0000 (10:39 +0200)]
Cleanup ProduceTransactionsHandler

Shuffle invariants around to reduce overheads. Also adds better debugs
around futures completing.

Change-Id: I01f940de08e9e0b7fc0e95b48b2d5fecdfd78f86
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoOptimize Follower.isOutOfSync() 38/59038/2
Robert Varga [Thu, 15 Jun 2017 16:10:22 +0000 (18:10 +0200)]
Optimize Follower.isOutOfSync()

This is a fast-path method which does a few duplicate checks
and calculations that may end up being unnecessary.

Restructure it so we check each partial condition just once
and compute required inputs only when we are going to need them.

Change-Id: I67a0089693a2ba1cd8c06c43504266534090545b
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: update sync status only after processing 37/59037/2
Robert Varga [Thu, 15 Jun 2017 15:45:50 +0000 (17:45 +0200)]
BUG-8618: update sync status only after processing

Since the commitIndex may move in chunks we really want to update
our sync status after we have gone through the AppendEntries message
so our commitIndex reflects the state after processing.

Change-Id: I49c72a21f8d9c3efb7ae9cc1b64276220057f2e2
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: make sync threshold tuneable 91/58991/6
Robert Varga [Thu, 15 Jun 2017 01:13:47 +0000 (03:13 +0200)]
BUG-8618: make sync threshold tuneable

We are observing quite a few of these transitions, which may be coming
from batching scenarios. Introduce sync-index-threshold config knob
to expose control over it.

Change-Id: Ief4c89c2fe5b95cebaf3fb83cbcdda37cac126b6
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: improve debug logs 90/58990/2
Robert Varga [Thu, 15 Jun 2017 01:12:56 +0000 (03:12 +0200)]
BUG-8618: improve debug logs

We can have a reasonable ID prepended, add that. Also improve range
of threshold parameter, as we are addressing journal entries here.

Change-Id: I86aac1be04df8b72bfa6ffaa2b7a7e3b4cbfad6e
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: refactor SyncStatusTracker state 86/58986/2
Robert Varga [Wed, 14 Jun 2017 23:29:01 +0000 (01:29 +0200)]
BUG-8618: refactor SyncStatusTracker state

Introducing a leader target encapsulation allows us to
enforce state transitions (i.e. state is guaranteed to be
non-null when we need its bits).

This enables us to eliminate the need for a magic constant.

Change-Id: Iab7178694edc3c62032e32c4386c371630f67b6f
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: make sure we refresh backend info 36/58936/3
Robert Varga [Wed, 14 Jun 2017 10:38:20 +0000 (12:38 +0200)]
BUG-8618: make sure we refresh backend info

When we are performing a reconnection attempt we must never use
previous backend info, but rather have to refresh it.

Fix this by removing state when resolution fails.

Change-Id: I65592f2101547a606a15d9c8030c7d8c58afe8a5
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8618: add threshold crossing debugs 85/58985/1
Robert Varga [Wed, 14 Jun 2017 23:00:43 +0000 (01:00 +0200)]
BUG-8618: add threshold crossing debugs

We are observing messages about sync status changing on the order
of 10s a second (14ms between messages). This looks awfully like
inter-node latency, hence it needs to be tuneable.

We do not have an understanding of what sort of jumps we are talking
about, so add logging to the source of these events at debug, so they
can be diagnosed.

Change-Id: I9e2d78629f8808914cdb664cb28afcd47a55ee80
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoLog data after in IdIntsDOMDataTreeLIstener 97/58797/2
Vratko Polak [Tue, 13 Jun 2017 09:49:54 +0000 (11:49 +0200)]
Log data after in IdIntsDOMDataTreeLIstener

At TRACE level.

Change-Id: Ic71aeec4c121d5cfb53a09762c9845e3e94f4f04
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
7 years agoImprove timeout message 14/58314/4
Robert Varga [Tue, 6 Jun 2017 09:19:07 +0000 (11:19 +0200)]
Improve timeout message

Rather than reporting nanoseconds, convert them to fractional seconds.
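
The conversion itself is small; a sketch of it follows. The class name and message wording are hypothetical and not the text used by the actual patch.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Illustrative helper: renders an elapsed time in nanoseconds as seconds with millisecond precision.
final class TimeoutMessageSketch {
    private TimeoutMessageSketch() { }

    static String describe(long elapsedNanos) {
        // Divide by the number of nanoseconds in one second to get fractional seconds.
        double seconds = elapsedNanos / (double) TimeUnit.SECONDS.toNanos(1);
        // Locale.ROOT keeps the decimal separator stable regardless of the JVM default locale.
        return String.format(Locale.ROOT, "Timed out after %.3f seconds", seconds);
    }
}
```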

Change-Id: I9052462990f8c6b99349ed123f682ce3f0e23461
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBUG-8665: fix memory leak around RangeSets 00/58800/1
Robert Varga [Tue, 13 Jun 2017 10:13:58 +0000 (12:13 +0200)]
BUG-8665: fix memory leak around RangeSets

This is a thinko on my part, where I was thinking in terms of a
discrete set (UnsignedLong) and assumed RangeSets will coalesce
individual items.

Unfortunately TreeRangeSet has no way of knowing that the
domain it operates on is discrete and hence will not merge individual
ranges.

This patch fixes the problem by using [N,N+1) ranges to address
the problem. A follow-up patch should address this in a more
efficient manner.
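
To see why half-open ranges help, consider the sketch below. It is a tiny stdlib-only range set written for illustration, not Guava's TreeRangeSet and not the code from this patch. Each item N is stored as [N, N+1), so consecutive items touch at their boundary and collapse into one range instead of accumulating one entry per item.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical coalescing set of half-open ranges over long values (overflow is not handled).
final class HalfOpenRangeSetSketch {
    // Maps range start (inclusive) to range end (exclusive).
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    // Records a single item N as the half-open range [N, N+1), merging with neighbours.
    void add(long item) {
        long lo = item;
        long hi = item + 1;

        // Merge with a predecessor whose end touches or overlaps the new range.
        Map.Entry<Long, Long> before = ranges.floorEntry(lo);
        if (before != null && before.getValue() >= lo) {
            lo = before.getKey();
            hi = Math.max(hi, before.getValue());
            ranges.remove(before.getKey());
        }

        // Merge with every successor which starts at or before the current end.
        Map.Entry<Long, Long> after = ranges.ceilingEntry(lo);
        while (after != null && after.getKey() <= hi) {
            hi = Math.max(hi, after.getValue());
            ranges.remove(after.getKey());
            after = ranges.ceilingEntry(lo);
        }

        ranges.put(lo, hi);
    }

    boolean contains(long item) {
        Map.Entry<Long, Long> entry = ranges.floorEntry(item);
        return entry != null && item < entry.getValue();
    }

    // Number of disjoint ranges currently stored.
    int rangeCount() {
        return ranges.size();
    }
}
```

Adding 1, 2 and 3 leaves a single stored range [1, 4), so memory use follows the number of gaps rather than the number of items.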

Change-Id: Iecc313e09ae0cdd51a42f7d39281f7634f0358a7
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
7 years agoBug 8606: Continue leadership transfer on pauseLeader timeout 94/58794/2
Tom Pantelis [Mon, 12 Jun 2017 13:42:38 +0000 (09:42 -0400)]
Bug 8606: Continue leadership transfer on pauseLeader timeout

Modified it to continue with leadership transfer if pauseLeader times out
instead of aborting. The shard may have a lot of transactions queued up
which it can't finish in time but there may still be a follower that is
caught up (ie whose matchIndex equals the leader's lastIndex) or would be
caught up if leadership transfer continued. Worst case is no follower is
available and the "catch up" phase of leadership transfer also times out
which would lengthen shut down time but that should be fine.

Change-Id: I1ec1ef43bb556e50416bb7239ce3c267265db9b3
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
(cherry picked from commit dac16f0d464eff3325b3800a803e81b303964e4b)

7 years agoFix intermittent PreLeaderScenarioTest failure 91/58791/1
Tom Pantelis [Tue, 13 Jun 2017 00:28:06 +0000 (20:28 -0400)]
Fix intermittent PreLeaderScenarioTest failure

java.lang.AssertionError: AppendEntries - # entries expected:<1> but was:<0>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.junit.Assert.assertEquals(Assert.java:555)
  at org.opendaylight.controller.cluster.raft.PreLeaderScenarioTest.testUnComittedEntryOnLeaderChange(PreLeaderScenarioTest.java:57)

AppendEntries appendEntries = expectFirstMatching(follower1CollectorActor,
                AppendEntries.class);
assertEquals("AppendEntries - # entries", 1, appendEntries.getEntries().size());

After the payload is sent to the leader, it expects an AppendEntries sent to follower1
with a single ReplicatedLogEntry. From the test output this did occur correctly but
the MessageCollectorActor still had the initial empty AppendEntries sent on leader
startup. The test setup waits for the initial AppendEntriesReply's from both followers
prior to clearing messages in each MessageCollectorActor however the AppendEntries may
not have been delivered to follower1's MessageCollectorActor yet and thus doesn't get
cleared. We need to specifically wait for the AppendEntries in follower1's
MessageCollectorActor.

Change-Id: I638a21e75ea135c1fe24970135f564da4fc5738e
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>