git.opendaylight Code Review - controller.git/log

BUG-8392 RpcRegistry has its buckets populated by unreachable nodes

In a situation where a member (e.g. member-2) is isolated and its rpc registrations
are removed from a node (member-1), we can still have our bucket store populated
by buckets from the remaining node (member-3), which might not have received
the memberUnreachable message yet, leading to stale routing of an rpc to
member-2.
This patch adds bucket filtering based on the currently present peers
so that we only accept Buckets that we can see.
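
A minimal sketch of the filtering idea described above, not the actual RpcRegistry code. The class and method names are hypothetical, and buckets are assumed to be keyed by the name of the member that owns them:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names are illustrative, not the controller's classes.
final class BucketFilterSketch {
    private BucketFilterSketch() {
    }

    /** Keeps only the buckets whose owning member is among the currently visible peers. */
    static <B> Map<String, B> acceptVisible(Map<String, B> received, Set<String> visiblePeers) {
        Map<String, B> accepted = new HashMap<>();
        for (Map.Entry<String, B> entry : received.entrySet()) {
            if (visiblePeers.contains(entry.getKey())) {
                accepted.put(entry.getKey(), entry.getValue());
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // member-3 still gossips a bucket for member-2, which we can no longer see.
        Map<String, String> fromMember3 = new HashMap<>();
        fromMember3.put("member-2", "rpcs-of-member-2");
        fromMember3.put("member-3", "rpcs-of-member-3");
        System.out.println(acceptVisible(fromMember3, Set.of("member-3")).keySet());
    }
}
```

With this shape, a stale bucket for an isolated member is dropped even when the sender has not yet learned that the member is unreachable.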

Change-Id: I92c1e063f4754aca829bd73df4518f859e1d8497
Signed-off-by: Tomas Cere <tcere@cisco.com>

Fix timing issue in PartitionedCandidateOnStartupElection*Test

If the initial AppendEntries sent by the leader (member 1) to member 3
is delayed enough such that the behavior field in MemberActor is already
set by the test code, the AppendEntries message will be forwarded to the
Candidate behavior and yield incorrect results for the test. To prevent this,
we really shouldn't set and access the behavior field directly but instead
do so via messages to maintain actor encapsulation.

Change-Id: If497583ce648e62e3279e5abff19cb8702943c17
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Bug 8385: Fix testMultipleRegistrationsAtOnePrefix failure

The previous patch added a callback on the Future returned by
gracefulStop on shard removal. The timeout was set to 3 * election timeout
which is 30 s in production by default. For the tests the election
timeout is 500 ms so the timeout is 1500 ms. However, if the timing is right,
the leader may not be able to transfer leadership on shutdown if the other
member was already shut down. On shutdown there's a 2 sec wait to hear from
a new leader - this is greater than the 1500 ms shutdown timeout which
leads to test failure. To alleviate this, I made 10 s the minimum for the
shutdown timeout.
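
The timeout arithmetic above can be sketched as follows. This is an illustration only: the class, constant and method names are hypothetical, and the values are the ones quoted in this message (3 * election timeout, 10 s minimum):

```java
import java.time.Duration;

// Hypothetical sketch of the shutdown timeout computation described above.
final class ShutdownTimeoutSketch {
    static final Duration MIN_SHUTDOWN_TIMEOUT = Duration.ofSeconds(10);

    private ShutdownTimeoutSketch() {
    }

    /** Returns 3 * election timeout, but never less than the 10 s minimum. */
    static Duration shutdownTimeout(Duration electionTimeout) {
        Duration scaled = electionTimeout.multipliedBy(3);
        return scaled.compareTo(MIN_SHUTDOWN_TIMEOUT) < 0 ? MIN_SHUTDOWN_TIMEOUT : scaled;
    }

    public static void main(String[] args) {
        // Test setting: 500 ms election timeout gives 1500 ms, which is raised to the minimum.
        System.out.println(shutdownTimeout(Duration.ofMillis(500)));
        // Production default: 10 s election timeout gives 30 s, which is left as is.
        System.out.println(shutdownTimeout(Duration.ofSeconds(10)));
    }
}
```

The minimum keeps the shutdown timeout above the 2 sec wait for a new leader even when tests use a short election timeout.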

Another problem was that, after the stop future failed, the OnComplete
callback for PrefixShardCreated was repeated many times before the
OnComplete callback queued the message to remove the Future from the map.
To alleviate this, I added a CompositeOnComplete containing a list of
deferred OnComplete tasks. This allows the control to remove the entry
from the map before the deferred tasks run.

Change-Id: I899518e6d7e92533d2c4008a978ac772b02863cf
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix testTransactionForwardedToLeaderAfterRetry failure

java.util.concurrent.ExecutionException: ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test:cars?revision=2014-03-13)cars/car, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test:cars?revision=2014-03-13)cars/car, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-1-shard-cars-testTransactionForwardedToLeaderAfterRetry currently has no leader. Try again later.]]}

The test submits transactions and deposes the current leader so it forwards the
pending transactions to the other member, member-2, which assumes leadership. However it calls

Cluster.get(followerSystem).leave(MEMBER_1_ADDRESS);

which may result in an untimely MemberExited message sent to the ShardManager that
clears the peer address, causing the FindPrimary message to fail to find the leader.
I'm not clear why this call was put in but it's unnecessary and may cause a
failure if the timing is right.

I also saw a failure due to a timeout when forwarding a pending transaction. This is
b/c it takes some time for member-2 to switch to candidate and become leader due to
the checking of current leader availability via the akka cluster on ElectionTimeout.
If it takes too long the pending transaction forwarding may time out. To alleviate
this, I forced the switch to candidate by sending an immediate TimeoutNow message.

Change-Id: I2dd228964779e2b755b1740a518e2c400b5cb88d
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Bug 8391 - Close producer in become-prefix-leader rpc implementation

MdsalLowLevelTestProvider's become-prefix-leader rpc implementation
creates CDSDataTreeProducer to try to move shard leadership. However,
the producer is not closed after the leadership change request. This
prevents any subsequent invocation of the become-prefix-leader rpc with
the same prefix parameter from succeeding: the subtree specified by the prefix
is still attached to the open producer, so creation of any new producer
for this subtree fails. Close the producer once we don't need it.

Change-Id: I3827e425082c35a43ec18dac1ef0f2dbd19b291f
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

BUG-8372: fix abort message confusion

Immediate transaction aborts need to use the appropriate message,
not 3PC's TransactionAbortRequest.

Change-Id: I9e25e3f20ed62fc520853685af17accef35c1bb4
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 4036805b31f73c7e7e2b06e84c8da975b2e45263)

Bug 8342: Add info logging to ConfigManagerActivator

Change-Id: I7b01961910dd2ba7ed9a421ee52e0aec29c68ade
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
(cherry picked from commit 1635f00212f8a583c6160cc3ff153e2c62af5092)

Bug 8385 - @Ignore testMultipleRegistrationsAtOnePrefix

DistributedShardedDOMDataTreeRemotingTest.testMultipleRegistrationsAtOnePrefix
is failing intermittently - set it to ignore for now.

Change-Id: I3e8aec2bfbe97559525051805170203574472aab
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

BUG-8372: improve forward/replay naming

There is a bit of confusion between 'replay' and 'forward' methods.
They serve two distinct purposes:
- 'replay' happens during reconnect, i.e. for requests that have
           already entered the connection queue and have paid
           the delay cost, so they should not pay it again.
- 'forward' happens after reconnect for requests that have raced
            with the reconnect process, i.e. they need to hop from
            the old connection to the new one. These need to enter
            the queue and pay the delay cost.

This patch cleans the codepaths up to use consistent naming, making
it clearer that the problem we are seeing is in the 'replay' path.

Change-Id: Id854e09a0308f8d0a9144d59f41e31950cd58665
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit cc21df8ade11f41843dc558e8fc93d5be92ed151)

Make the last submit timeout after 30 seconds

The low level test was waiting indefinitely for submits
to finish; change this to block and time out after one minute
in case there's an unrecoverable failure on the backend which
doesn't propagate to the frontend.
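
A minimal sketch of a bounded wait on a submit future, assuming the submit is exposed as a java.util.concurrent.Future. The class and method names are hypothetical, and the timeout is left as a parameter rather than fixed to a value:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: wait for a submit, but give up after a bounded time.
final class BoundedSubmitWaitSketch {
    private BoundedSubmitWaitSketch() {
    }

    /** Returns true if the submit completed in time, false if the wait timed out. */
    static boolean awaitSubmit(Future<?> submit, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        try {
            submit.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // The backend never answered; report it instead of blocking forever.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> neverCompletes = new CompletableFuture<>();
        System.out.println(awaitSubmit(neverCompletes, 100, TimeUnit.MILLISECONDS));
        System.out.println(awaitSubmit(CompletableFuture.completedFuture("done"), 100, TimeUnit.MILLISECONDS));
    }
}
```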

Change-Id: I3df2465b56c701c88341ab6cc7fa37a015f1c893
Signed-off-by: Tomas Cere <tcere@cisco.com>

Move initial list creation to create-prefix-shard.

This moves the initial list population of produce-transactions
to the create-prefix-shard rpc with 3 hardcoded prefixes (prefix-1, prefix-2, prefix-3)
so that csit suites can populate the id-int list just once when the shard is created
and produce-transactions can now run in parallel on multiple entries from
multiple nodes.

Change-Id: If70990c0e217cd68027ae960a7545c69acf52cdb
Signed-off-by: Tomas Cere <tcere@cisco.com>
(cherry picked from commit 74175c48bb2b3ee786108bdda8e665484080b7f5)

Bug 8380: Fix unhandled messages in ShardManager

Added trace logging for RegisterRoleChangeListenerReply and other
MemberEvent (ie MemberJoined, MemberLeft).

The DeleteMessagesSuccess message was the result of deleting legacy
journal messages (SchemaContextModules) that were kept for backwards
compatibility with Helium. I removed the associated code and the
deprecated ShardManager class that was kept for the SchemaContextModules
inner class.

Also added logging for DeleteSnapshotsFailure and DeleteSnapshotsSuccess.

Change-Id: I145ea815b191f1e167e73029df348c7d15732c4f
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Bug 8385: Fix testMultipleRegistrationsAtOnePrefix failures

The test quickly creates/removes the prefix shard in iterations which
can result in an InvalidActorNameException if the shard actor from the prior
iteration hadn't been destroyed yet. To alleviate this I modified the
removal in the ShardManager to utilize Patterns.gracefulStop to store the
Future and block a subsequent create until the Future completes.
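
The idea of holding back a create until the prior stop has finished can be sketched as below. This is not the ShardManager code: it uses CompletableFuture in place of the Future returned by Patterns.gracefulStop, all names are hypothetical, and it assumes every call and every future completion happens on one thread, as inside an actor:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: defer creation of a shard until its previous instance has stopped.
final class DeferredCreateSketch {
    private final Map<String, CompletableFuture<Void>> pendingStops = new HashMap<>();

    /** Remembers the stop future of a shard that is being removed. */
    void onRemove(String shardName, CompletableFuture<Void> stopFuture) {
        pendingStops.put(shardName, stopFuture);
        // Forget the entry once the old instance is gone (or the stop failed).
        stopFuture.whenComplete((ignored, failure) -> pendingStops.remove(shardName, stopFuture));
    }

    /** Runs the create action immediately, or after the pending stop completes. */
    void onCreate(String shardName, Runnable create) {
        CompletableFuture<Void> stop = pendingStops.get(shardName);
        if (stop == null) {
            create.run();
        } else {
            stop.whenComplete((ignored, failure) -> create.run());
        }
    }

    public static void main(String[] args) {
        DeferredCreateSketch manager = new DeferredCreateSketch();
        CompletableFuture<Void> stop = new CompletableFuture<>();
        manager.onRemove("prefix-shard", stop);
        manager.onCreate("prefix-shard", () -> System.out.println("created after stop"));
        stop.complete(null);
    }
}
```

Because the new instance is only created after the stop future completes, the old actor name is free again by the time it is reused.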

Change-Id: Ica98de3cc17c2d87195840bdf052d81ed3b9dd10
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Pre-construct YangModuleInfo service name

This cleans up ModuleInfoBundleTracker instantiation and makes
service lookups a tiny bit faster.

Change-Id: I2bdce2fdca9cefd56192b04f74ed7c594187d425
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Fix DistributedShardedDOMDataTree.ProxyProducer's getShardAccess

DistributedShardedDOMDataTree.ProxyProducer's getShardAccess works only
for subtrees that are rooted at some registered prefix based shard.
Moreover, the subtree has to be one of the subtrees specified in
DistributedShardedDOMDataTree's createProducer method.

This is way more strict than what is required by CDSDataTreeProducer's
API. Pass ProxyProducer's implementation current shard layout, so
producer can lookup corresponding shard for specified subtree in
getShardAccess method. One-to-one mapping between shards and subtrees
is no longer required.

Change-Id: I765567d34c803a85b4be8a6e10fd81b6f64a1610
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

Fix logger formatting strings

Fix %s/{} mixups.

Change-Id: I916996e17839a61802a83ddff31d162ac662f934
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Bug 8328 - Create prefix shards with correct peers

Change-Id: I068b38bb275d23d27559aec3f336a6b9081fb732
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

Bug 8336 - Fix NPE in DistributedShardedDOMDataTree's ProxyProducer

Change-Id: If0060e6e2696674bc5418d2f2a80ad0d01327e29
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

Fix Eclipse warnings in config-manager

Change-Id: I0ed9bc52d4cf4e5ee7a4da8bd53355191326cba6
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

BUG 8301: Convert queue to a local variable

There's a possibility that this might race and an
actor can have its queue overwritten by another thread, so convert
this to a local variable.

Change-Id: Ic84922c6d109d8361a48debbf971fddd9cee1d3e
Signed-off-by: Tomas Cere <tcere@cisco.com>

BUG-8342: force config-manager startup

config-manager needs to be pretty much the first thing that comes
up due to historic reasons. Assign it a low start level so it
activates before the blueprint extension.

Change-Id: I2d0a3706843409e8a22f9064f27e47cc0df46c95
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 21870dc0696930b3be36c689bc7ae0929c5b9878)

Downgrade most info messages in benchmarks

They create spam during CSIT,
making real errors less noticeable in the log.

Change-Id: Icf00389526919751e88189ffef1be70e16e806e8
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
(cherry picked from commit 3f26179cf9ed4ebd4b680805d2d93f904ea60806)

Refactor Register*ListenerReply classes

The listener-type specific RegisterDataTreeChangeListenerReply and
RegisterChangeListenerReply classes are identical. Simplify by
replacing them with a general RegisterDataTreeNotificationListenerReply.
This also simplifies AbstractDataListenerSupport by eliminating the
abstract newRegistrationReplyMessage method.

Change-Id: I97f6cf366ae6ff858ff258ebb8479468b144c193
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

BUG-8327: deprecate sal.core.api.model.SchemaService

This interface is deprecated in favor of the DOMSchemaService
for the MD-SAL project.

Change-Id: Icff2cced791bc9fbf5bfadbe2f1cf2b949ff2d58
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

BUG-8327: GlobalBundleScanningSchemaServiceImpl should be a proxy

We are currently running two separate services which assemble
the GlobalSchemaContext, which hurts our startup performance and
leads to wasted memory. This is an artefact of the mdsal split,
hence we should be getting the service from the MD-SAL and
just proxy to old interfaces.

This lowers the startup time for

feature:install odl-restconf odl-bgpcep-bgp
odl-bgpcep-data-change-counter odl-netconf-topology

from 86s down to 67s (22%). Final retained heap size is also
lowered from 217MiB to 181MiB (16%)

Change-Id: I549e9512538bd83d86cfd2164d03e34bc9130c1e
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Improve test logging in DistributedEntityOwnershipIntegrationTest

Some of the tests in DistributedEntityOwnershipIntegrationTest set the
datastore type to "test" which isn't helpful in identifying the output
in jenkins log archives. Use the name of the test method instead as is
done with other tests.

Change-Id: I25e40df5139a4d9f8c46d03c0f2c9c8a52fd15ee
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Bug 8303: BP odl:clustered-app-config initial/*-config.xml testability

DataStoreAppConfigDefaultXMLReaderTest illustrates usage.

Change-Id: I342fca4583c90802238e63262871e33b4b713438
Signed-off-by: Michael Vorburger <vorburger@redhat.com>

BUG-8309: Add message identity information

We have encountered an attempt to serialize a local request across
a remote connection. Since this is hit by the akka serializer, we
have lost the identity of the call site and of the message, because
all akka is seeing is the Envelope and the exception's stack trace,
which only indicates class hierarchy up to and including
AbstractLocalTransactionRequest.

This patch enriches the exception message so we know what the actual
request was, hopefully pinpointing the offending call site. Since
the problem revolves around the reconnect process, bump critical
transitions to info instead of debug.

Change-Id: I6d6d6e702d4b5baff7b707242583e923708e7637
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 4ec6c11c03ae92e11a7cea29795e4b7f5547f64e)

Bug 8337: Ignore testMultipleShardLevels

DistributedShardedDOMDataTreeTest.testMultipleShardLevels is
failing intermittently - set it to ignore for now.

Change-Id: Ib7f86166fd85cd54e6ec8cac106c993e9407ffea
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
(cherry picked from commit 8b4e6a8e938513006ad5f1be786767b103210e78)

Add more debug logging for DTCL registration/notification code paths

Added logging so the listener instance and actor can be traced
end-to-end from FE registration to the BE publisher actor.

Also added log context to some classes to identify which shard it
belongs to.

Change-Id: I3e6dd92e7632139372407abf94a160096aa7750e
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

BUG-7927: stop scanning bundles on framework stop

Monitor framework bundle for STOPPING event and when it triggers
flag us as stopping: all bundles are about to shut down, so there
is no point in trying to update the schema context anymore.

Change-Id: I1a55169fce1705c19a139063cf632674fc256701
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Improve logging around transaction lifecycle

Testing has shown that we have a gap in request handling and we
have a lot of unclosed transactions. Add logging of code paths
which trigger unsupported requests.

Change-Id: I013ba8a141d5a1a9e311a8bca7842ac77064d277
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 63d615831603b7a0a11b173f9d9316641e880844)

Bug 8301: Fix some issues with testProducerRegistrations

The LogicalDatastoreType.CONFIGURATION type was being used for both data
stores - modified the IntegrationTestKit to set the logicalStoreType
appropriately.

Fixed a synchronization issue in DistributedShardedDOMDataTree#lookupShardFrontend
where it accessed shards unprotected.

Change-Id: I628add86667e4a812f8e7516bac59f9b66fe4033
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix intermittent failure in testLeadershipTransferOnShutdown

10:03:06 java.util.concurrent.ExecutionException: ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test:cars?revision=2014-03-13)cars/car, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test:cars?revision=2014-03-13)cars/car, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-cars-testLeadershipTransferOnShutdown currently has no leader. Try again later.]]}
10:03:06 at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:64)
10:03:06 at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.get(MappingCheckedFuture.java:92)
10:03:06 at org.opendaylight.controller.cluster.datastore.DistributedDataStoreRemotingIntegrationTest.verifyCars(DistributedDataStoreRemotingIntegrationTest.java:215)
10:03:06 at org.opendaylight.controller.cluster.datastore.DistributedDataStoreRemotingIntegrationTest.testLeadershipTransferOnShutdown(DistributedDataStoreRemotingIntegrationTest.java:928)

From the logs it seems member-2 hadn't gotten MemberUp for member-3 after the
leader transfer by the time it tried to read. I added calls to wait for members
to be up. After the change it ran 333 times w/o failure.

Change-Id: Ifbbf304230292f69429d3086867679effb8db01c
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Remove artifacts entries for long-gone RESTCONF

RESTCONF has been moved to its own project, hence these
artifacts entries are duds. Remove them.

Change-Id: I72d918567a04841784b0a8061ec655fe79af6ae4
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 09f44218611fb5a1439d1b2c7ffef401449c354b)

BUG-5280: handle NotLeaderException

NotLeaderException is indicative of leader movement, in which
case we need to tear down the connection and resolve the new
leader.

Change-Id: I068e97f9a7feb75cc30afb5f5449f0adf00aa217
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 8265c26f7692086677fa943976824966f32eecf6)

Bug 8301: Disable DistributedShardedDOMDataTreeRemotingTest for now

Change-Id: I24068c5ee92533cdc23174d17cc1805328df7c4d
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

BUG-7390: fix dsbenchmark

Benchmark tests were not consistent as to what data store they were
using, leading to flooded logs in read case because of this and
irrelevant results in the delete case.

This patch corrects the mistakes, adding at least some consistency
and hope for relevant results.

Change-Id: I0528eb42cb38eacd5e0525c0a78ada111b1edb55
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Nest id-ints list inside a container

Needs to be nested to be able to refer to the whole list via restconf
and instance-identifier yang element, so update the model and the handlers
to account for this change.

Change-Id: Idf50de5e6faa9757f45ec68e9b796ae0742f6aa9
Signed-off-by: Tomas Cere <tcere@cisco.com>

Lower AbstractNormalizedNodeDataOutput debugs to trace

Setting debug to org.opendaylight.controller.cluster.datastore
also catches the clustering-commons, leading to a lot of logs
from serialization. Lower its logging to trace.

Change-Id: Ic0e9f9c60020675c45e79c7638dcb500d6de5091
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Bug 8116 - Make DistributedShardChangePublisher agnostic to data tree change events ordering

DistributedShardChangePublisher allows for registering DTCLs on
DistributedShardFrontend. Internally, DistributedShardChangePublisher
sets up DataTreeChangeListenerProxies on respective backend shard and
also on all of its backend subshards. Upon receiving data tree change
events from backend shards, DistributedShardChangePublisher updates
its own data tree. With the help of this tree, it finally constructs
data tree change events for registered DTCLs.

DistributedShardChangePublisher relies on specific ordering of backend
shards data tree change events. If it receives subshard's data tree
change event prior to current shard data tree change event, updating
internal data tree can fail. Subshard's data tree change event can
expect some changes from its parent shard.

Clearly, we don't have control over the ordering of these events. Do not rely
on this. If we cannot apply subshard's change to data tree, cache it
and try to apply it once we have also its parent's change.
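
A minimal sketch of the cache-and-retry approach described above. It is a toy model, not the DistributedShardChangePublisher code: the data tree is reduced to a set of paths, a change is "create this path", and a change applies only when its parent path already exists. All names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model: changes that cannot be applied yet are cached and retried later.
final class OutOfOrderChangeSketch {
    private final Set<String> tree = new HashSet<>();
    private final List<String> deferred = new ArrayList<>();

    OutOfOrderChangeSketch() {
        tree.add("/"); // the root always exists
    }

    /** Applies the change if possible, otherwise caches it until its parent shows up. */
    void onChange(String path) {
        if (!tryApply(path)) {
            deferred.add(path);
            return;
        }
        // A successful apply may unblock cached changes; retry until nothing more applies.
        boolean progress = true;
        while (progress) {
            progress = deferred.removeIf(this::tryApply);
        }
    }

    private boolean tryApply(String path) {
        if (!tree.contains(parentOf(path))) {
            return false;
        }
        tree.add(path);
        return true;
    }

    static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        return idx <= 0 ? "/" : path.substring(0, idx);
    }

    boolean contains(String path) {
        return tree.contains(path);
    }

    int deferredCount() {
        return deferred.size();
    }
}
```

For example, a subshard change for /parent/child that arrives first stays cached, and it is applied right after the parent's change for /parent arrives.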

Change-Id: I3bd9b2d217d01974bce02465529c6cdbf8c3d633
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

Improve orphan transaction logging

This patch improves logging when we perform last-resort cleanup
from garbage collector, so that the type of client handle is also
logged. This allows us to discern snapshots and transactions.

Also lower the logging level to INFO, as this is something that
should be fixed by whoever is causing it, but it does not pose
serious threat to stability.

Change-Id: Iad55c49de87ca73f9671f04f569be7eae0e4f885
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

BUG-8219: Cleanup CompositeDataTreeCohort

This patch reworks the logic so we can track which cohort times
out in case that happens. We also instantiate shortcuts so we do
not go through asynchronous processing if there are no cohorts
at all.

Change-Id: I9493b768c86e8d6b2d0f4f1d13f53b13ff98fe7b
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 9155298250e0fbfc0534ab5553fc562289be268b)

Bug 8274: add missing configfile dependency

odl-jolokia's configfile was missing its corresponding dependency in
the POM; this patch adds it.

Change-Id: I4e5420978020b19de58b65d06c4b2482f55351d0
Signed-off-by: Stephen Kitt <skitt@redhat.com>

Fix checkstyle problems not detected by the current version

This change is required for overall move to new Checkstyle version, see
https://git.opendaylight.org/gerrit/#/q/topic:bumpCheckstyle

Most of the changes are redundant "final" modifiers.

Change-Id: I637dd46617ca144f0ed33bd705c6357493b887fe
Signed-off-by: David <david.suarez.fuentes@ericsson.com>

Simplify DelayedListenerRegistration functionality

The DelayedListenerRegistration class is abstract with
parameterized sub-classes for the 2 listener types which
don't provide any additional functionality. Consequently
AbstractDataListenerSupport is parameterized with the
DelayedListenerRegistration type with an abstract method
to instantiate the appropriate type.

We can simplify AbstractDataListenerSupport by removing the
type parameter and the abstract method and consequently remove
the 2 DelayedListenerRegistration sub-classes.

Change-Id: I04933753b59748a09c31e0ec5ed4de9666fea364
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

BUG-8159: fix local transaction history tracking

ShardCommitCoordinator needs to make sure ShardDataTree tracks
the histories involved with local transaction being submitted
via ReadyLocalTransaction. This is consistent with what we are
doing for the BatchedModifications message.

Change-Id: I02cc61476b5e02fb45f1482c4a9693bc77335793
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 5f0f7152dedbed437bdfb055e9bc80c0a1edaa15)

BUG-5280: unwrap RuntimeRequestExceptions

This patch adds the primitive to unwrap RuntimeRequestExceptions,
so the underlying cause is propagated.

Change-Id: I77771867a48eb5f63d35a6402aca6ad0bc5b12e3
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

RpcRegistrar unit test

Change-Id: I90403cb3c5fb98854c9e7dcd80ba0ce6e5f944f4
Signed-off-by: matus.kubica <matus.kubica@pantheon.tech>
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

Bug 7747: Reply to the leader before applying previous state

Applying state to the data tree can be expensive so the follower
should reply to the leader before applying any previous state so
as not to hold up leader consensus.

Change-Id: Ic92ae2ac30d72d6a401bdc36fda900a0a7fb21d3
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix intermittent testAddShardReplicaWithAddServerReplyFailure failure

ShardManagerTest#testAddShardReplicaWithAddServerReplyFailure failed:

java.lang.AssertionError: assertion failed: timeout (3 seconds) during expectMsgClass waiting for class org.opendaylight.controller.cluster.raft.messages.AddServer
20:14:24 at scala.Predef$.assert(Predef.scala:170)
20:14:24 at akka.testkit.TestKitBase$class.expectMsgClass_internal(TestKit.scala:472)
20:14:24 at akka.testkit.TestKitBase$class.expectMsgClass(TestKit.scala:459)
20:14:24 at akka.testkit.TestKit.expectMsgClass(TestKit.scala:814)
20:14:24 at akka.testkit.JavaTestKit.expectMsgClass(JavaTestKit.java:415)
20:14:24 at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManagerTest$33.<init>(ShardManagerTest.java:1637)

The log shows:

08:14:06,302 PM [main] [INFO] ShardManagerTest - testAddShardReplicaWithAddServerReplyFailure starting
08:14:06,325 PM [main] [INFO] ShardManager - Starting ShardManager shard-manager-config22
08:14:06,329 PM [test-akka.actor.default-dispatcher-7] [INFO] ShardManager - Recovery complete : shard-manager-config22
08:14:09,339 PM [main] [INFO] TestActorFactory - Killing actor TestActor[akka://test/user/member-1-shard-astronauts-config]
08:14:09,340 PM [main] [INFO] TestActorFactory - Killing actor TestActor[akka://test/user/shardmanager-config22]
08:14:09,340 PM [main] [DEBUG] ShardManager - Got updated SchemaContext: # of modules 1
08:14:09,340 PM [main] [DEBUG] ShardManager - shard-manager-config22: onAddShardReplica: AddShardReplica[ShardName=astronauts]
08:14:09,340 PM [main] [INFO] ShardManager - Stopping ShardManager shard-manager-config22

So the ShardManager got the onAddShardReplica message, but only after the test had timed out
after 3 seconds. The problem is that the test uses the default dispatcher for
TestActor, which is the calling thread dispatcher, and that is problematic for persistent
actors. Either avoid TestActor where we don't need access to the underlying actor
instance or use the system default dispatcher, which is async.

Change-Id: Ib6521c345bd0db9502d0078928f8d0e5dcd7f747
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix a typo

transacion -> transaction

Change-Id: I30b5b387dc9d21774798286984f67e46a2471e95
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

BUG-5280: fix snapshot accounting

The following warning is emitted under testing:

2017-04-19 08:49:34,707 | WARN | ... | AbstractClientHistory | ... | Could not find aborting transaction member-2-datastore-operational-fe-0-txn-19-0

Which is indicating that we cannot find the open transaction
inside AbstractClientHistory.

The problem is mis-routed invocation when we are taking a snapshot:
instead of going directly to subclass doCreateSnapshot() which only
allocates the transaction, invoke takeSnapshot(), which actually does
the appropriate book-keeping.
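
The difference between the two entry points can be sketched as below. Only the method names takeSnapshot() and doCreateSnapshot() come from this message; the classes, signatures and the map of open transactions are hypothetical stand-ins for AbstractClientHistory:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the book-keeping split described above.
abstract class HistorySketch {
    private final Map<String, Object> openTransactions = new HashMap<>();

    /** Entry point callers should use: allocates the snapshot and records it as open. */
    final Object takeSnapshot(String txId) {
        Object snapshot = doCreateSnapshot(txId);
        openTransactions.put(txId, snapshot);
        return snapshot;
    }

    /** Subclass hook: only allocates, does no book-keeping. */
    abstract Object doCreateSnapshot(String txId);

    /** Returns false when the transaction was never recorded, the case behind the warning. */
    final boolean abort(String txId) {
        return openTransactions.remove(txId) != null;
    }
}

final class SimpleHistorySketch extends HistorySketch {
    @Override
    Object doCreateSnapshot(String txId) {
        return new Object();
    }
}
```

Calling doCreateSnapshot() directly leaves nothing for abort() to find, whereas going through takeSnapshot() does.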

Change-Id: I07473f381d3147a7fc7d355afede254a781a3094
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Bug 5280: Enable tests for ClientBackedDatastore

Add new test parameter - commitTimeout. New implementation
needs more time to commit transaction in some cases
e.g. leader down.

Change-Id: I33d6312c9b18493e519b8607307c21c1b3a9bc75
Signed-off-by: Andrej Mak <andrej.mak@pantheon.tech>
(cherry picked from commit 9dde5085f50832148a6f3766e1bc988be0327401)

Bug 8231: Fix testChangeListenerRegistration failure

As described in Bug 8231, the sharing of the ListenerTree between the
ShardDataTree and the ShardDataTreeNotificationPublisherActor is
problematic. Therefore the ListenerTree (wrapped by the
DefaultShardDataTreeChangeListenerPublisher) is now owned by the
ShardDataTreeNotificationPublisherActor. On registration, a RegisterListener
message is sent to the ShardDataTreeNotificationPublisherActor to perform
the on-boarding of the new listener, ie it atomically generates and sends
the initial notification and then adds the listener to the ListenerTree.

This change necessitated some refactoring of the DataChangeListenerSupport
class et al with respect to how the ListenerRegistration is handled. Previously the
ListenerRegistration was passed on creation of the registration actor. This
is now done indirectly by sending a SetRegistration message to the
registration actor via a Consumer callback passed in the RegisterListener
message. When the ListenerRegistration is obtained by the
ShardDataChangePublisherActor, it invokes the Consumer callback.

When a registration is initially delayed due to no leader, the
DelayedListenerRegistration is sent to the registration actor. When the
leader is elected later on, the actual ListenerRegistration is sent and
replaces the DelayedListenerRegistration.

The DOMDataTreeChangeListener registration classes were changed/refactored
similarly.

In addition, the 2 specific registration actor classes were replaced by a
generic reusable DataTreeNotificationListenerRegistrationActor that handles
both listener types. Also the 2 CloseData*ListenerRegistration and
CloseData*ListenerRegistrationReply messages were consolidated.

Change-Id: I79ac76b8044609351e5dd8367b691b589ea35075
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Remove deprecated persisted raft payloads

Removed the payload classes that were deprecated, as Carbon
will trigger a snapshot when it encounters any of them on recovery:

ServerConfigurationPayload
ApplyJournalEntries
DeleteEntries
UpdateElectionTerm
ReplicatedLogImplEntry

Also removed the implemented MigratedSerializable interface from the
current classes.

Change-Id: I942584022ece0783c73b2596e9ad928a28dfdda2
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix incorrect last history update

This is a thinko -- the codepath will never trigger, even though
it should normally trigger all the time.

Change-Id: I29b24a3823c08c64c8c8a74e7be3b96e07672313
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Change DistributedShardedDOMDataTree's ctor signature

We should inject DistributedShardedDOMDataTree with AbstractDataStore
instead of DistributedDataStore, so we can allow different
implementations of distributed DOM store

Change-Id: I11d1b49e1413dcc233350a3c853b283df176bffa
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>
(cherry picked from commit be3bc49185f935ad2672b08e031f602eca594d1e)

Fix warnings in tests

This fixes up initialization failures and use of raw classes where
possible.

Change-Id: Icfa9bd0a08a6dd838d794c509612f711099ea0fe
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

BUG-5280: fix invalid local transaction replay

When we transition from a connecting to connected local connection,
we may encounter operations which are invalid and these violations
are detected during transaction replay.

If such replay fails, we need to suppress reporting the error until
the user initiates canCommit or directCommit, at which point we need
to report the delayed failure.

For reasons of consistency, we perform this suppression even under
normal connected circumstances.
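
A minimal sketch of the suppress-then-report pattern described above, not the actual client code. The class name and the apply() method are hypothetical; canCommit is the name used in this message. Ignoring operations after the first failure is an assumption of the sketch:

```java
// Hypothetical sketch: remember a failure during replay and surface it at commit time.
final class DelayedFailureSketch {
    private RuntimeException recordedFailure;

    /** Applies an operation; a failure is recorded instead of being thrown. */
    void apply(Runnable operation) {
        if (recordedFailure != null) {
            return; // assumption: once failed, later operations are ignored
        }
        try {
            operation.run();
        } catch (RuntimeException e) {
            recordedFailure = e;
        }
    }

    /** Commit-time check: this is where the suppressed failure is reported. */
    void canCommit() {
        if (recordedFailure != null) {
            throw new IllegalStateException("transaction failed earlier", recordedFailure);
        }
    }

    public static void main(String[] args) {
        DelayedFailureSketch tx = new DelayedFailureSketch();
        tx.apply(() -> {
            throw new IllegalArgumentException("invalid operation");
        });
        try {
            tx.canCommit();
        } catch (IllegalStateException e) {
            System.out.println("reported at commit: " + e.getCause().getMessage());
        }
    }
}
```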

Change-Id: I2018498afff0e463dbdceaec5c50e8ebf088001b
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

RemoteRpcProviderFactory and RpcErrorsException unit tests

Change-Id: Ife8c638d43810baede654cccac22fa8efccae1d0
Signed-off-by: matus.kubica <matus.kubica@pantheon.tech>
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

Fix intermittent failure in ClusterAdminRpcServiceTest.testModuleShardLeaderMovement

java.lang.AssertionError: Rpc failed with error: RpcError [message=leadership transfer failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.raft.LeadershipTransferFailedException: Failed to transfer leadership to member-2-shard-cars-config_testModuleShardLeaderMovement. Follower is not ready to become leader]
  at org.opendaylight.controller.cluster.datastore.admin.ClusterAdminRpcServiceTest.verifySuccessfulRpcResult(ClusterAdminRpcServiceTest.java:461)
  at org.opendaylight.controller.cluster.datastore.admin.ClusterAdminRpcServiceTest.doMakeShardLeaderLocal(ClusterAdminRpcServiceTest.java:450)
  at org.opendaylight.controller.cluster.datastore.admin.ClusterAdminRpcServiceTest.testModuleShardLeaderMovement(ClusterAdminRpcServiceTest.java:263)

It failed when trying to make member-2 the leader for a couple of reasons. One is that
member-2 hadn't yet received the MemberUp event for member-3 from akka clustering and
thus didn't have its address when it started the election and tried to send
RequestVote.

The second problem is a result of the first - since member-2 couldn't get a vote
from member-3, it needed the vote from member-1, which was in the process of stepping
down as leader. When member-1 received the RequestVote with the higher term, it
switched to Follower. Therefore member-2 didn't receive any votes for that election
term. The request to transfer leadership, which was issued on member-1, then timed out
and failed.

The wait period for the new leader to be elected is 2 sec. This was chosen b/c
originally leadership transfer was only used on shutdown and we don't want to
block shutdown for too long. However, when requesting leadership outside of shutdown,
we should wait at least one election timeout period (plus some cushion to take into
account the variance).
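
As a rough sketch of the timeout choice described above (the helper
name, its parameters and the way the cushion is passed in are all
assumptions, not the actual raft code):

```java
import java.time.Duration;

// Hypothetical helper; names and the shape of the cushion are assumptions.
final class LeadershipTransferTimeoutSketch {
    // The 2 sec wait mentioned above, kept for the shutdown path.
    static final Duration SHUTDOWN_WAIT = Duration.ofSeconds(2);

    /** Picks how long to wait for a new leader to be elected. */
    static Duration newLeaderWait(boolean onShutdown, Duration electionTimeout, Duration cushion) {
        if (onShutdown) {
            // Don't block shutdown for too long.
            return SHUTDOWN_WAIT;
        }
        // Outside of shutdown, allow a full election timeout plus some variance.
        return electionTimeout.plus(cushion);
    }
}
```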

This alleviates the timeout but it still failed sometimes if member-1 timed out
in the Follower state and started a new election before member-2 timed out in
Candidate state. member-1 would then win the election and grab leadership back.
To alleviate this, it would be ideal if member-1 replied to the RequestVote from
member-2 prior to switching to Follower. Normally when it receives a RaftRPC with
a higher term, the Leader is supposed to immediately switch to Follower and not
process and reply to the RaftRPC, as per raft. However if it's in the process of
transferring leadership it makes sense to process the RequestVote and make every
effort to get the requesting node elected.

I also fixed a couple of issues in the test code, mainly adding waitForMembersUp.

Change-Id: Ibb1b00f03065680fe1fd338c3d26161ec6336d5a
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Remove DataTreeCandidatePayload

Deprecated since Boron, this payload will cause a snapshot in Carbon,
hence we can remove it in Nitrogen.

Change-Id: Ic2b5f54837ab130b56f9121c560e2616ae66dbda
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Unit test for RemoteRpcRegistryMXBeanImpl class

Change-Id: Ic00c607f3f66b327336b49f92afe6eb29c144a92
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

BUG-8159: add payload debugs

This patch adds debugging of metadata snapshot application
and recovery operations.

Change-Id: I9498f53af6ddc8fecf42eb239c7da7da08d3f0c6
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 2fd4150b338a3cbd422a3daff895bb5c9afcd7a6)

BUG-5280: Correct reconnect retry logic

Our reconnect logic failed to account for various timers
during resolution. This patch makes the BackendInfoResolver
explicit about the type of failures it can report and fixes
AbstractShardBackendResolver to conform to them.

Change-Id: I610ddb6e062e223557d46e2950a552de6e7d3843
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 63bca3841f0187b5127f62fd04e4edcdce3a63c1)

BUG-5222: remove sal-dom-xsql

XSQL has been deprecated and de-activated in Carbon due to bugs
and not being supported. This patch removes it completely.

Change-Id: I9faeb7200faa665484d6a3315cb4b8820b53c976
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Improved unit tests for AveragingProgressTracker class

Change-Id: I079b45304d82bfc9022321a1648fbdba13409c90
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

Bug 8206: Fix IOException from initiateCaptureSnapshot

Modified the install snapshot chunking to be idempotent to avoid attempts
to send the same chunk twice. This fixes the error:

java.io.IOException: The # of bytes read from the imput stream, -1, does not match the expected # 3075

Change-Id: I5336c88125f226d0976f0d7fe17d03c0d181e12d
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

BUG-7783: increase precision of execution times

Document the time units we are using for measuring execution
and make sure they can hold any long.

Change-Id: I859349e27604c75d426ad7c4eec9d6870b081291
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit 20cffa6b2251167e428a641d18a49958044fe598)

BUG-8205: use updated DatastoreContext

DatastoreContext is updated by the config admin overlay, which means
we cannot refer to the initial one passed in when we are deciding
which data store to instantiate.

This fixes up the protocol propagation and adds an initial info message about
which protocol is in use.

Change-Id: I3c2f1a5eec1c7346fff3aca2d85609f47990723a
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
(cherry picked from commit e325f9ec2d3fc6b059823d88658596be544a1828)

Make AbstractIdentifiablePayloadTest public

For some reason, tests derived from AbstractIdentifiablePayloadTest
fail b/c AbstractIdentifiablePayloadTest isn't public when running from
eclipse - runs fine from command line.

Change-Id: Ie6ed1d6e0e130a1ffc5ad04db93e037ea6a79549
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Bug 8206: Prevent decr follower next index beyond -1

If a follower's next index is already -1, we shouldn't decrement it
further, i.e. -1 is the lowest allowed value. Otherwise AbstractLeader can
end up continuously decrementing it and logging an info message while in
the process of sending an install snapshot:

member-3-shard-default-config (Leader): follower member-1-shard-default-config last log term 2 conflicts with the leader's 3 - dec next index to -2

Modified decrNextIndex to return a boolean if next index was decremented
which is checked by AbstractLeader.
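
A minimal sketch of the floor check, assuming a simple holder class (the
class name is hypothetical; only decrNextIndex and the -1 floor come
from the description above):

```java
// Hypothetical stand-in for the follower bookkeeping; not the actual ODL class.
final class FollowerIndexSketch {
    private long nextIndex;

    FollowerIndexSketch(long initialNextIndex) {
        this.nextIndex = initialNextIndex;
    }

    /**
     * Decrements the next index unless it is already at the -1 floor.
     *
     * @return true if the index was decremented, false otherwise
     */
    boolean decrNextIndex() {
        if (nextIndex < 0) {
            return false;
        }
        nextIndex--;
        return true;
    }

    long getNextIndex() {
        return nextIndex;
    }
}
```

A caller can then skip the log message and any retry when the method
returns false.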

Change-Id: I29454d4e71a7f9128b3b47f6a4e3403615c2c8d2
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Remove akka-distributed-data-experimental

This module was used during development, but we stopped using it,
remove it from dependencies.

Change-Id: I415347f4e8a264a0daf604375815728f3a77837a
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Fix intermittent failure in testWriteTransactionWithSingleShard

DistributedDataStoreRemotingIntegrationTest.testWriteTransactionWithSingleShard
fails intermittently with tell-based protocol in verifyCars after it has
reinstated the follower where it's expecting just car1 but the
data tree contains car1 and car2. This is b/c the delete transaction for car2
prior to reinstatement wasn't applied on recovery due to the corresponding
ApplyJournalEntries message missing from the persisted journal. The test
expects 2 ApplyJournalEntries messages to be persisted corresponding to the
2 transactions but tell-based persists other payloads as well so there may
be 3 ApplyJournalEntries messages. I changed the code to handle this case.

Also the assertion failure in verifyCars caused it to bypass shutting down
the ActorSystem which resulted in several other failures in tests that try to
use the same port configuration due to the port already being in use. So I made
changes to ensure ActorSystems are shut down properly.

Change-Id: Id6316d71fcd9eb3e768c6b1f676fa0e9be1287a2
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix intermittent failures in FollowerTest

FollowerTest.testCaptureSnapshotOnLastEntryInAppendEntries:1152 Persisted journal entries size: [] expected:<1> but was:<0>

The test waits for the deletion of journal entries, which occurs after the
snapshot is saved, and then checks the persistent journal for the remaining
ApplyJournalEntries. But occasionally the persisting of the ApplyJournalEntries
message occurs after the deletion so the assertion fails b/c the
ApplyJournalEntries wasn't persisted yet. This is a little odd b/c the
sequencing in the raft code is that the ApplyJournalEntries write is done
before the delete so it should also be observed the same way in the
InMemoryJournal, even though it doesn't really matter either way.

To alleviate the problem I added a wait for the ApplyJournalEntries
message in the journal in the 3 similar tests.

I also made a couple other minor changes that I observed while running the
tests.

Change-Id: I67cbb8fd79c91cd1cc23c363b78e7f5e9b9f2bbe
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Fix FindBugs error in DelayedListenerRegistration#getInstance

The ObjectRegistration interface was recently changed to annotate
getInstance with @Nonnull to promise it will not return a null.
However DelayedListenerRegistration could return null if the delegate
is not set yet. In reality, we do not and should not ever call this
method on DelayedListenerRegistration instances so I changed it to
throw UnsupportedOperationException to make it explicit and to avoid
the FindBugs error.
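
Roughly, the change amounts to something like the sketch below. The
interface and class here are simplified, hypothetical stand-ins for
ObjectRegistration and DelayedListenerRegistration, and the handling of
a close before the delegate is set is an assumption.

```java
// Hypothetical minimal interface; the real ObjectRegistration lives in ODL.
interface RegistrationSketch<T> {
    T getInstance();   // contract: never returns null

    void close();
}

final class DelayedRegistrationSketch<T> implements RegistrationSketch<T> {
    private RegistrationSketch<T> delegate;   // may not be set yet
    private boolean closed;

    synchronized void setDelegate(RegistrationSketch<T> registration) {
        if (closed) {
            // Closed before the real registration arrived: close it right away.
            registration.close();
        } else {
            delegate = registration;
        }
    }

    @Override
    public T getInstance() {
        // Never return null; make the unsupported call explicit instead.
        throw new UnsupportedOperationException(
            "getInstance is not supported on a delayed registration");
    }

    @Override
    public synchronized void close() {
        closed = true;
        if (delegate != null) {
            delegate.close();
            delegate = null;
        }
    }
}
```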

Change-Id: I9fe374b23336d8ade65b2f1b697d93f50a090df9
Signed-off-by: Tom Pantelis <tompantelis@gmail.com>

Bump versions by x.(y+1).z for next dev cycle

Change-Id: Ife090ef2c9bb25e515b2fb06e2766c56d8174d76
Signed-off-by: Anil Belur <abelur@linuxfoundation.org>

Handle odl-mdsal-common with Karaf 4.0.9

Karaf 4.0.9 simplifies feature dependencies (so that dependencies
specified in feature.xml can be completed from the POM), but that
causes issues with odl-mdsal-common since features in Karaf are
identified by their name only:

* if both the controller odl-mdsal-common and the mdsal
  odl-mdsal-common are encountered in the dependency tree, whichever
  one came first (as dependencies are resolved) is the one that ends
  up being kept;
* in some circumstances, the mdsal repository replaces controller's
  even when the controller dependency is retained (this is a Karaf bug
  which I'll submit a patch for, but we can work around it).

Change-Id: I5400a829560ae96cb2f264e103020cccd1d225c3
Signed-off-by: Stephen Kitt <skitt@redhat.com>

BUG-5280: add the concept of a recorded failure

This patch reworks LocalReadWriteProxyTransaction to be defensive
of its internal modification and introduces the concept of a delayed
recorded error (currently unused).

The defensiveness checks allow us to get rid of FailedDataTreeModification,
as we do not give out our modification at all in the codepaths which
would leak this implementation.

Change-Id: I5f91218ac308f7450a3b59252d44f953be54626c
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Bug 7805: Add make-leader-local rpc for module based shard.

CSIT testing scenarios require movement of the shard leader for module
based shards as well, so add this to ClusterAdminRpcService.

Change-Id: Ib8a310cdba728c0a42d8850703740bf4698adbe0
Signed-off-by: Tomas Cere <tcere@cisco.com>

Bug 7806 - Implement agent RPCs for shard replica manipulation testing

These can be implemented as a part of ClusterAdminRpcService instead
of creating new rpcs that would be part of the lowlevel suite.

Change-Id: I891f9d3703a9357e829159691cbf18f95523d529
Signed-off-by: Tomas Cere <tcere@cisco.com>

BUG-5280: log a message when tell-based protocol is active

Discerning the two access modes is critical to understanding
when failures occur. Add an explicit note when the tell-based
protocol is enabled on a data store.

Change-Id: I3e2b1d2f84a73ce1a3759d419176c47a6dd0ad12
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Pass no op callback instead of null during replay

Change-Id: Ife964481dc225bbc1d5b312035384f8bd597d740
Signed-off-by: Andrej Mak <andrej.mak@pantheon.tech>

Add AbstractIdentifiablePayload unit tests

Change-Id: I884f3c35d1767ed02accabc3b9a775ef9c667716
Signed-off-by: Andrej Mak <andrej.mak@pantheon.tech>

Unit test for ClientBackedTransaction derived classes

Change-Id: I2967a0e224fc783ffac73a994def666e86a423a6
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

Unit tests for ClientBackedTransactionChain class

Change-Id: I97953cfdc32619c31295cfed2584b7466d48aa5d
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

AbstractClientHistory derived classes tests

Change-Id: I1261eb764c730bbce6eb833644db99e4bbf0605c
Signed-off-by: matus.kubica <matus.kubica@pantheon.tech>
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

AbstractDOMDataBroker fix annotations

Change-Id: I7938965d805ba3e4228e9fc5a36c75c52f6ec881
Signed-off-by: Jie Han <han.jie@zte.com.cn>

Bug 7805 - Implement agent RPCs for shard leader movement testing

Change-Id: Ic19d1867f3c54ec22d600e9b80c6490d5a4b99bb
Signed-off-by: Tomas Cere <tcere@cisco.com>

BUG-5280: Close client history after all histories are closed

Make sure we record the history state as closed once we are done with
it.

Change-Id: Icbdf947ad166b082e06df896741e618e801ecf2e
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

BUG-5280: switch tests to ClientBackedDataStore

Enable integration tests to run
on the new frontend code with parameterized JUnit.

Tests that do not yet work with the new code are ignored.
For the old code all tests run and pass.

Change-Id: Ib5656ecd2333a56d5c466e633fbdd477accc4095
Signed-off-by: Robert Varga <rovarga@cisco.com>
Signed-off-by: Ivan Hrasko <ivan.hrasko@pantheon.tech>

BUG 7801: prevent OptimisticLockFailedExceptions in write-transactions.

When multiple instances of this rpc are running concurrently,
we would run into an OptimisticLockFailedException since every instance
tries to write the topmost parent list first.
When this happens, treat the failure as expected and resume with the
next stage of the rpc.
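
The idea can be sketched as follows. The exception class here is a
hypothetical stand-in for the data broker's optimistic lock failure, and
the helper name is an assumption; the real rpc code may be structured
differently.

```java
import java.util.concurrent.Callable;

final class OptimisticLockSketch {
    /** Hypothetical stand-in for the data broker's optimistic-lock exception. */
    static final class OptimisticLockFailed extends Exception {
        OptimisticLockFailed(String message) {
            super(message);
        }
    }

    /**
     * Runs the parent-list write. Returns true if this caller's write won,
     * false if a concurrent writer got there first (treated as expected).
     * Any other exception is propagated to the caller.
     */
    static boolean ensureParentWritten(Callable<Void> writeParent) throws Exception {
        try {
            writeParent.call();
            return true;
        } catch (OptimisticLockFailed e) {
            // Another instance already wrote the parent list; carry on.
            return false;
        }
    }
}
```

Either return value lets the caller resume with the next stage.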

Change-Id: I43efaea3315b04272113eb86733e68609e434984
Signed-off-by: Tomas Cere <tcere@cisco.com>

Bug 7803: Implement agent RPCs for data tree change listener testing

Change-Id: Id2d53d3765fb9d518d4b052792d716d2b2b4c976
Signed-off-by: Tomas Cere <tcere@cisco.com>

Bug 7804: Implement agent RPCs for DOMDataTreeListener testing

Change-Id: I9e57e169fc3151a12914b2f370e0c97f41395992
Signed-off-by: Tomas Cere <tcere@cisco.com>

BUG 7802: split out shard creation from produce transactions

Change-Id: I33fa46791a6c80477f57badf3bd44c3d6c5a2f9e
Signed-off-by: Tomas Cere <tcere@cisco.com>

Bug 7802 : Implement agent RPCs for transaction producer testing

Change-Id: I56d89093bd292032f92cdc98f25056822d93e628
Signed-off-by: Tomas Cere <tcere@cisco.com>

Bug 7407 - CDS: allow applications to request Leader movement

This patch provides the routing from cds-dom-api CDSShardAccess
to the backend RaftActor.

Change-Id: I9fa315034d95a1896393a6152147a7bc50829b2a
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>

Bug 7407 - Add request leadership functionality to shards

This adds a new MakeLeaderLocal message to Shard class API.
MakeLeaderLocal message is sent to a local shard replica to request
the shard leader to be moved to the local node. The local shard will
contact the current leader with a RequestLeadership message to initiate
leadership transfer to itself. The original sender of the MakeLeaderLocal
message will be notified about the result of this operation.

Change-Id: I2b0ee7caf772457e31250d1bdddd5fc77b16fc53
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>