git.opendaylight Code Review - controller.git/log

Add option to enable/disable basic DCL and/or DTCL

The cars stress test is a very appropriate place to measure the effects
of DCL and DTCL on a very long list. This change adds a few RPC
implementations in order to do the following:

1) enable DCL
2) disable DCL
3) enable DTCL
4) disable DTCL

This change includes very basic DCL/DTCL implementations, which just log
a message at trace level (off by default, but there to ensure the
onData*Changed(...) method is actually called).

The existing clustering-test-app behavior doesn't change at all; these
new RPC(s) do not need to be used, and the added Listener implementations
are not registered listeners by default.

Change-Id: I6fcec6cd8c0a082e815561e88b325a55022ad2af
Signed-off-by: Ryan Goulding <ryandgoulding@gmail.com>

Fix intermittent unit test failures

Cherry picked from master.

Change-Id: I2ef68b48de8da4cc7d82a91263976295458d011a
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Bug 6106: Prevent flood of quarantine messages

Added a "quarantined" flag to the QuarantinedMonitorActor so it only
prints the warning and attempts to restart the karaf container once
(which is invoked indirectly via the caller's Effect callback).
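
A minimal sketch of the one-shot behavior described above. Only the "quarantined" flag comes from this commit; the class name, method name and callback type are illustrative assumptions, not the actual QuarantinedMonitorActor code.

```java
// Hypothetical stand-in for the actor's state; not the real QuarantinedMonitorActor.
final class QuarantineGuard {
    private boolean quarantined;      // set when the first quarantine event is seen
    private final Runnable callback;  // assumed to log the warning and trigger the restart

    QuarantineGuard(Runnable callback) {
        this.callback = callback;
    }

    // Called for every quarantine event; only the first one invokes the callback.
    void onQuarantinedEvent() {
        if (quarantined) {
            return; // already handled, so the repeated events are ignored
        }
        quarantined = true;
        callback.run();
    }
}
```

Because an actor processes one message at a time, a plain boolean field is enough in this sketch.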

Change-Id: I0a57af729280abded93d1b1a575df1672e52032e
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Fix intermittent test failures in ClusterAdminRpcServiceTest

Failed tests:
ClusterAdminRpcServiceTest.testChangeMemberVotingStatesForShard:555->verifySuccessfulRpcResult:296
Rpc failed with error: RpcError [message=Failed to change member voting
states for shard cars: Shard
member-3-shard-cars-config_testChangeMemberVotingStatusForShard
currently has no leader. Try again later., severity=ERROR,
errorType=RPC, tag=operation-failed, applicationTag=null, info=null,
cause=null]

Needs to ensure node3's datastore shards are ready with leaders.

Change-Id: Iae6179e6f577b98f267c1afd3a901a14eed81e7f
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Fix intermittent test failures in PartitionedLeadersElectionScenarioTest

Seeing intermittent failures on jenkins, eg

Failed tests:
PartitionedLeadersElectionScenarioTest.runTest1:37->setupInitialMemberBehaviors:313->AbstractLeaderElectionScenarioTest.initializeLeaderBehavior:207
Missing messages of type class
org.opendaylight.controller.cluster.raft.messages.AppendEntriesReply

Sometimes the initial AppendEntries messages go to dead letters,
probably b/c the follower actors haven't been fully created/initialized by akka.
So added retries as a workaround.

Change-Id: I5c838950f8ed2af3d5bc8ee3bd29602d8a8e8a9f
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Add voting state to shard mbean FollowerInfo

The shard mbean displays the peer voting states map but it's also useful
to see the voting state in the leader's FollowerInfo.

Also fixed an NPE when JMX accesses the peerAddresses when a peer's
address is null. We use guava's Joiner.MapJoiner to output the map but it
throws an NPE for a null entry value. I changed RaftActor to put "" in
the map if null.

Change-Id: I1eb963808fd7878dfe1e4935f3ac06a579a3504e
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Implement cluster admin RPCs to change member voting states

Backported from master: https://git.opendaylight.org/gerrit/#/c/38086/

Added 3 new RPCs for changing voting states:
  change-member-voting-states-for-shard
  change-member-voting-states-for-all-shards
  flip-member-voting-states-for-all-shards

These replace the original ones added in Be that weren't implemented.
They were added as placeholders based on how it was thought it would
work at that time.

New related ShardManager messages were added that are sent by the
ClusterAdminRpcService.

The flip-member-voting-states-for-all-shards RPC is a shortcut that
obtains the current voting states via the GetOnDemandRaftState message
to the RaftActor and inverts them. New fields were added to the
OnDemandRaftState response to return the voting states.
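
As a rough illustration of the inversion step only: the sketch below assumes the current voting states have already been obtained as a member-name-to-boolean map. The class and method names are hypothetical, and the message exchange with the RaftActor is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the map stands in for the voting states returned on demand.
final class VotingStates {
    private VotingStates() {
    }

    // Returns a new map in which every member's voting flag is inverted.
    static Map<String, Boolean> flip(Map<String, Boolean> current) {
        Map<String, Boolean> flipped = new LinkedHashMap<>();
        current.forEach((member, voting) -> flipped.put(member, !voting));
        return flipped;
    }
}
```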

Modified the ShardStats JMX bean to report the new OnDemandRaftState
fields.

Added a check in RaftActorServerConfigurationSupport to ensure that
there's at least 1 voting member otherwise one can end up with an
unusable shard with no ability to elect a leader.
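
The idea behind that check can be sketched as follows, assuming the voting states are held as a member-name-to-boolean map and the request carries only the members to change. The names are hypothetical and this is not the RaftActorServerConfigurationSupport code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical validator; illustrates the "at least 1 voting member" rule only.
final class VotingConfigValidator {
    private VotingConfigValidator() {
    }

    // Applies the requested changes to a copy of the current states and
    // reports whether at least one member would still be voting afterwards.
    static boolean leavesVotingMember(Map<String, Boolean> current, Map<String, Boolean> changes) {
        Map<String, Boolean> result = new HashMap<>(current);
        result.putAll(changes);
        return result.containsValue(Boolean.TRUE);
    }
}
```

A request for which this returns false would be rejected, since no member could then be elected leader.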

Fixed a couple bugs in Leader and AbstractLeader that were found during
testing. AbstractLeader needs to take into account the follower's voting
state when determining if the leader is isolated.

Change-Id: I58686e3ce94d58de7cf289e55bb717ba46bc1de5
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Bug 5913: Fix ISE in DefaultShardDataChangeListenerPublisher

The publishChanges method is only called from the
ShardDataTreeNotificationPublisherActor which is single-threaded so
publishChanges can't be called concurrently. However the
DefaultShardDataChangeListenerPublisher instance is passed via
the PublishNotifications message so the Stopwatch isn't thread safe
wrt thread visibility of its internal state. Therefore it's possible
the change in state done on thread 1 isn't immediately visible to
a subsequent thread. To alleviate this, I moved the Stopwatch and the
elapsed time check to the ShardDataTreeNotificationPublisherActor.

Change-Id: I046e7e92aa96eec01d5a355c8431ef797c534ead
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Fix test failures in RaftActorServerConfigurationSupportTest

Fixed test failures due to order of recent cherry picks that are failing
jenkins builds.

Change-Id: I140d2b9e69c16ef10ccb5e183eb77b0bb56e9ab9
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Implement change to voting with no leader

Backported from master.

Implemented a special case where on a voting state change from
non-voting to voting, if there's no leader, it will try to elect a
leader in order to apply the change and progress.

This is to handle a use case where one has 2 geographically-separated
3-node clusters, one a primary and the other a backup such that if the
primary cluster is lost, the backup can take over. In this scenario,
there's a logical 6-node cluster where the primary sub-cluster is
configured as voting and the backup sub-cluster as non-voting such
that the primary cluster can make progress without consensus from
the backup cluster while still replicating to the backup. On fail-over
to the backup, a request would be sent to a member of the backup
cluster to flip the voting states, ie make the backup sub-cluster
voting and the lost primary non-voting. However since the primary
majority cluster is lost, there would be no leader to apply, persist and
replicate the server config change.

Therefore, if the server processing the request is currently non-voting
and is to be changed to voting and there is no current leader, it will
try to elect itself the leader by applying the new server config change in
the RaftActorContext and sending an ElectionTimeout. If it's elected
leader, it persists and replicates the new server config. If no leader
change occurs within the election timeout period, it reverts the server
config change and tries to forward the change request to another server
with the same voting state change. In this manner, the intent is to elect
the newly voting server that has the most up to date log.

Change-Id: I67b5b2d3a97745dbe9a8215f9a28f3a840f2a0db
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Implement ChangeServersVotingStatus message in RaftActor

Backported from master.

Added a new ChangeServersVotingStatus message to change servers to/from
voting members. The leader updates its local peer info and persists and
replicates a new ServerConfigurationPayload with the appropriate voting
states. If the leader changes to non-voting it steps down as leader by
initiating a leadership transfer.

Change-Id: If073e4665cb1a270aae6e3dce36a6b3e900d0282
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Add a few toString() methods

These help when converting to DataTree ;-).

Change-Id: I9b0fdb428ebe0265cb4321bd6ee31dedb4811950
Signed-off-by: Stephen Kitt <skitt@redhat.com>
(cherry picked from commit 86bc3639095c1d6cc3c764ba8e8721257b87c5c6)

Bug 5504: Fix IllegalStateException handling from commit

https://git.opendaylight.org/gerrit/#/c/36172 attempted to
handle/workaround IllegalStateException thrown from commit to re-apply
the transaction. However the change wasn't correct - the commit call
actually throws an ExecutionException with the IllegalStateException as
the cause. So we need to catch ExecutionException and check if the cause
is IllegalStateException.
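
A minimal sketch of that pattern using a plain java.util.concurrent.Future. The class and method names are hypothetical, and the re-apply step itself is not shown.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical helper; shows only how the cause is inspected.
final class CommitFailureCheck {
    private CommitFailureCheck() {
    }

    // Returns true if waiting on the commit failed with an ExecutionException
    // whose cause is an IllegalStateException.
    static boolean failedWithIllegalState(Future<?> commitFuture) {
        try {
            commitFuture.get();
            return false; // commit succeeded
        } catch (ExecutionException e) {
            // The IllegalStateException is never thrown directly; it arrives wrapped.
            return e.getCause() instanceof IllegalStateException;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

Catching IllegalStateException directly around get() would never match, which is the mistake described above.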

Change-Id: I65b2d646a60a700d070dea822d20b0e649290643
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Debug logging in AbstractLeader is too chatty

The additional debug logging added with
https://git.opendaylight.org/gerrit/#/c/39796/ makes it too chatty with
heartbeats when nothing changed which will roll-over log files much more
quickly. Changed a debug to trace.

Change-Id: I4c204c6d0734d6ac8655380adcc2df09cb2890ae
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Remove snapshot after startup and fix related bug

Fixed an issue in the follower out-of-sync integrity checking where it
needs to take into account that the previous index may be in the
snapshot. A similar issue was seen with other integrity checks.

These issues were indirectly related to the snapshot after startup that
was introduced in Be. I think this snapshot is unsafe b/c the
replicatedToAllIndex hasn't been determined yet which I think may cause
other issues with the trimming after snapshot completion, as the logic
takes replicatedToAllIndex into account. And there may be other lurking
bugs. I think it's safer to let the normal snapshot logic handle it.

The reason for the snapshot after startup was to avoid having to recover
the same journal entries again on restart that were just recovered. However
in reality, in production, servers aren't commonly restarted and
typically go weeks/months in between restarts. By the time of the next
restart there would likely have been another snapshot and an arbitrary amount
of new journal entries to recover so it really doesn't add much value.

Change-Id: Ie14148e5dbde3e93deafc5943278aea8c9bb3e75
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Guard against duplicate log indexes

We saw an issue where a duplicate log index was added to the journal.
The duplicates were contiguous. It is unclear at this point how it
happened but we should guard against it so I added a check to ensure the
new index > the last index.
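
The guard amounts to something like the sketch below. The class is a hypothetical stand-in for the journal, and returning false on a rejected entry is an assumption; this message does not say how the real code reacts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory log; only the index check mirrors the commit.
final class GuardedLog {
    private final List<Long> indexes = new ArrayList<>();

    // Index of the last entry, or -1 if the log is empty.
    long lastIndex() {
        return indexes.isEmpty() ? -1 : indexes.get(indexes.size() - 1);
    }

    // Appends only if the new index is strictly greater than the last index.
    boolean append(long index) {
        if (index <= lastIndex()) {
            return false; // duplicate or out-of-order index
        }
        indexes.add(index);
        return true;
    }
}
```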

Change-Id: Iacb7e5c83870eb79550bb4314d7f24c4530fc113
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Add more debug output in AbstractLeader and Follower

Adding more debug to help troubleshoot an issue.

Change-Id: Iff3e78157415de2841bb32f3dd588705d518b015
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Update version enforcement to Java 7

OpenDaylight was never able to run with 1.6, hence we should
enforce Java 7 at least.

Change-Id: I3e6a5d21d4af6c3916528178b0465e86190c0dc6
Signed-off-by: Robert Varga <rovarga@cisco.com>

BUG-5414 introduce EOS inJeopardy flag

The inJeopardy flag is used to indicate that the leader has lost quorum,
e.g. if it cannot reach a majority of followers or the follower has lost its
connection to the leader (and has initiated new elections).

While EOS is in jeopardy, any reported entity state may not reflect cluster-wide
consensus, but rather represents the latest intended state as seen by this node.

Change-Id: I18df5a11ebbef6607fb0a0754ba0f09bc52f19ba
Signed-off-by: Robert Varga <rovarga@cisco.com>
(cherry picked from commit d4fa6758d6b94aad894854c0fe6fcd82e7bbefd6)

Make Karaf dump heap on OOM by default

See mails in this thread:
https://lists.opendaylight.org/pipermail/release/2016-March/006098.html
This changes DEFAULT_JAVA_OPTS,
so if user sets JAVA_OPTS it would override this.

Change-Id: I54fad73c5f50a6bf251bd3b255293ff3ef4ed877
Signed-off-by: Vratko Polak <vrpolak@cisco.com>

Bumping versions by 0.0.1 for next dev cycle

Change-Id: I93804f91f274da742ad4276e45737948c3ad576e
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

Release Beryllium-SR2

Change-Id: Ia633f2d63c086ca8a2eecbda23931c9806e6e117
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

BUG 5690 : No owner present even when entity has a candidate

If a candidate for an entity is removed and another added in quick
succession it can leave the owner of the entity blank. This happens
because the BatchedModifications for candidate removal happen one
after another which results in the commit of those modifications.
The BatchedModification which writes an owner on removal is committed
only after the addition of the new candidate. In this scenario when
the new candidate is added it finds that there is still an owner
for that entity and so it does not assign a new owner for that entity.

To fix this problem in onCandidateAdded we check if the currentOwner
is present in the current candidate list and if it is not then we
choose a new owner.
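
A simplified sketch of that check. Picking the first candidate is an assumption made for illustration (the real selection is done elsewhere), a blank owner is modelled as an empty string, and the names are hypothetical.

```java
import java.util.List;

// Hypothetical helper; not the EntityOwnershipShard implementation.
final class OwnerCheck {
    private OwnerCheck() {
    }

    // Keeps the current owner only if it is still a candidate; otherwise
    // chooses a new owner from the current candidate list.
    static String ownerAfterCandidateAdded(String currentOwner, List<String> candidates) {
        if (!currentOwner.isEmpty() && candidates.contains(currentOwner)) {
            return currentOwner;
        }
        // Assumption: take the first candidate as the new owner.
        return candidates.isEmpty() ? "" : candidates.get(0);
    }
}
```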

Change-Id: I47f90314e018e25f2c1dac82342b931c4e2d882d
Signed-off-by: Moiz Raja <moraja@cisco.com>

Fix ApplyState elapsed time check

On ApplyState, there's a check if the elapsed time exceeds a 50ms
threshold and it logs a warning. However the start time is captured when
the message is created prior to queueing. So if there are many ApplyState
or other messages already queued, the elapsed time also includes the time spent
in the queue, ie as a side effect includes the cumulative processing time
of each prior message in the queue. When a follower starts up, there can
be hundreds to thousands of catchup ApplyState messages and, eventually,
the cumulative processing times can add up to more than 50 ms, in which
case every subsequent ApplyState message trips the threshold with
increasing elapsed times, even though none of them actually took 50 ms
to process. Seeing hundreds to thousands of warnings with misleading
elapsed times looks ominous and leads users to think something is wrong.

Therefore I changed it to capture the start time just prior to calling
applyState so it captures just the processing time for that message. I
also removed the startTime field from ApplyState. This class is
Serializable but it is only ever sent locally to self and is never
serialized so there are no backwards compatibility concerns.
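
To illustrate the difference, the sketch below takes the start time immediately before the work runs, so time spent waiting in the queue is excluded. The class name and the injected clock are assumptions made to keep the example testable; only the 50 ms threshold comes from this message.

```java
import java.util.function.LongSupplier;

// Hypothetical timer; not the actual RaftActor code.
final class ApplyTimer {
    static final long THRESHOLD_NANOS = 50_000_000L; // 50 ms

    private final LongSupplier nanoClock; // e.g. System::nanoTime

    ApplyTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    // Runs the work and returns true if the processing alone exceeded the threshold.
    boolean applyAndCheck(Runnable applyState) {
        long start = nanoClock.getAsLong(); // captured here, not at message creation
        applyState.run();
        long elapsed = nanoClock.getAsLong() - start;
        return elapsed > THRESHOLD_NANOS;
    }
}
```

With the start time taken at message creation instead, a message that waited behind many others would trip the threshold even if its own processing was fast.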

Change-Id: I9493734b5307d6dd5d723e5fe416ba97915dfc63
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 57d7e4788a488d992b9868d44ebc392b06e317c5)

BUG 5656 : Entity ownership candidates not removed consistently on leadership change

This patch removes candidates for all downed members when entity ownership shard
leadership changes. This fixes a corner case where a leader/follower are both killed
simultaneously in a cluster which has greater than 3 nodes. In this case the old leader
does not have a chance to remove the killed follower. The new leader does know that
the follower is down so it can remove the candidates for all downed followers on
shard leadership change.

Change-Id: If28f5656e0daee40fb96a937dbd0a868b7d3645a
Signed-off-by: Moiz Raja <moraja@cisco.com>

Default shard-journal-recovery-log-batch-size to 1

In Helium there was an issue with batching journal log entries in a
single transaction on recovery which could cause validation exceptions
and/or missing data. Setting the batch size to 1 alleviated the issue and
thus it was defaulted to 1.

It was thought this issue wasn't present in Lithium but it is as I have
a Helium journal which exhibits the problem. I have tried this journal
with the current code base and didn't see an issue (it looked like all
data was recovered from what I could tell) but I'm not confident an issue
isn't still lurking with the right combination of modifications across
many journal transactions. It is safest to recover the transactions in the
same manner as they were originally committed, ie one by one.

Therefore I have defaulted the batch size to 1. In my testing, the prior
setting of 1000 doesn't add any value anyway as the recovery time is
virtually the same with batch size 1000 and 1. Setting it to 1
eliminates the potential risk of data loss.

Change-Id: Icd7fd3c60bdd6cf1b677ccae38be810e779d2bd3
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 28313ad901a88b4a5e5e9f54da0368c7171ca080)

Bug 5613 - unregister candidates on node restart

When EntityOwnershipShard receives CandidateAdded for a local candidate
before any local registration happens, it means a restart of the local node
must have happened and the candidates are not registered yet.

So this change removes the candidate in such a case.

The corresponding test reproduces the issue if the change is not applied.

Fixed other test failures.

Change-Id: I0e8e675530c93dca172ca661fa4c5e1250f40150
Signed-off-by: Amit Mandke <ammandke@cisco.com>

Bug 5625: Fix OutOfMemoryError in YangStoreSnapshot

Close the InputStream returned via yangTextSchemaSource.openStream().

Change-Id: I3ecd2e1a3f52f91203a3a00c2f982b061cc62c42
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 990c36b8a92ffb36b0b386855f6a7ea79e5ea226)

Add **/yang-gen-config/** to checkstyle ignore path

Change-Id: I4080cd5a5c6d2ccd9374af9979ff2fca76e607ab
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
(cherry picked from commit 17d19bca1350102fd5ca1a1b7162cc5fc2ac9f79)

Add yang-jmx-generator dependency

When building in parallel sal-dom-xsql fails due to
yang-jmx-generator missing. This implies that yang-jmx-generator is
actually a dependency.

Change-Id: I624d4026d8182c12a147830ded0391eca25b0f62
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
(cherry picked from commit 7cda9d7daa76fa84caead38f5979314ff35cf9db)

Enlarge critical section to cover processNextTransaction()

As it turns out the critical section is not sufficient to cover the case
when the user thread performs a submit/allocate/submit in the time window
between us releasing the in-flight transaction and taking the lock: we would
have to re-check inflightTx after taking the lock.

Since we are going to take the lock anyway, reverse the order of operations
by making processNextTransaction() synchronized, which means the user
thread will not be able to submit the transaction even when it observes
inflightTx as null outside the lock.

Change-Id: I688ceb5e8aae28f5e582b64e6bbaa64c9699c7f5
Signed-off-by: Robert Varga <rovarga@cisco.com>
(cherry picked from commit 30d98c1da2a32f719302668f8deb6ef4f371749c)

Bug 5485: Improve DataTreeModification pruning on recovery

Modified the PruningDataTreeModification and NormalizedNodePruner to
validate path and node QNames via the SchemaContext instead of just
namespaces. This allows migration support for any element to be removed
from a yang hierarchy.

Also handled SchemaValidationFailedException on ready which can happen
with writes which don't immediately validate the structure as merge
does. The modification tree is re-applied with pruning.

Change-Id: I986d1116d2e25115f406abc21b1f816525387125
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Bug 5460: Fix snapshots on follower

Added a callback to the appendAndPersist call in Follower to call
captureSnapshotIfReady.

Added checks in ReplicationAndSnapshotsIntegrationTest to verify the
followers' snapshots along with the leader's.

Change-Id: Ie71f1b16152541d069f9d005ba669cb1e5771dd1
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Stop logging complete data tree on prepare/commit failure

Sometimes the data tree modification is so large that just trying to create
the buffer to hold the message can make the controller run out of memory. Plus
it's rarely useful to have a log filled with data which obfuscates other
important log messages. This patch still logs the data tree modification at
trace level.

Change-Id: I76bff9f7e836ee5eff347b0b77e2817f441ab953
Signed-off-by: Moiz Raja <moraja@cisco.com>
(cherry picked from commit 2cf157241dc0ce5045c26e2ad07d053a60b37822)

Bumping versions by 0.0.1 for next dev cycle

Change-Id: I045cfbec3f810bd58885a726ff31612d30dae343
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

Release Beryllium-SR1

Change-Id: I7acebbdf1c8b0c6172477620c3e468c334768e43
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

Bug 5504: Handle IllegalStateException from commit

Tries to re-apply the transaction if the "store tree and candidate base
differ" IllegalStateException occurs.

Change-Id: If2ef81d88fbd756edd54842d1afb7cd62043de05
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Fix intermittent RaftActorLeadershipTransferCohortTest failure

Change-Id: I4c58f6545d7ef7667c7fcf42f5dda82345ab1167
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 68eb628b5aca1fc4de4e29bcacf46dcb7b3a19c8)

Bug 4823: Offload generation of DCNs from Shard

Generation of data change notifications can be expensive with large
lists which can block the Shard actor for many seconds. This processing
was offloaded to other actors to free up the Shard, one for DCLs and the
other for DTCLs. I separated the 2 types of listeners b/c DCN generation
is much more expensive than DTCs so at least DTCLs aren't held up by
DCLs.

Change-Id: I1bfb5d572c793f8eb703ebf0a7fd9bf628747168
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit a46305fbc6bb7ec6883c21298d356a5e4fbbb015)

Fix issues with LeastLoadedCandidateSelectionStrategy

LLCSS degenerates into a round robin owner allocator when
ownership changes. This patch fixes that issue as follows:

- Consider the statistics that are collected using the DTL
  only as initialStatistics which are passed to the Strategy
  when it is created
- When Leadership changes clear all the strategies so that
  they get freshly created with the right initial statistic
- Modify the newOwner method on Strategy to
    - pass the currentOwner for the entity, for the current
      owner we decrease the ownership statistic
    - remove the statistics passed to it as it would no longer
      be required. Due to this removal we also get rid of all
      the cruft which we had added to check if the passed in
      stats were actually greater than the local stats which
      anyway did not work.

Change-Id: I754f0459051687a95056857044777ca6eebbcd93
Signed-off-by: Moiz Raja <moraja@cisco.com>

Fix broken downstream features

factoryakkaconf needs to be spelled out in the dependency of
features-mdsal.

Change-Id: I71e7cff1076fc63c08f6debefc72107046f8337f
Signed-off-by: Robert Varga <rovarga@cisco.com>
(cherry picked from commit cfdb7ed1fdd440feea75adfe1b0289b76ffc9e50)

Bug 5329: Add factory akka.conf

Added a factory akka.conf file that is shipped to
configuration/factory/akka.conf. This file contains all the necessary
akka settings. Modified the FileAkkaConfigurationReader to load the
existing configuration/initial/akka.conf file with the factory file as
the fallback. In this manner akka will overlay/merge the initial file
with the factory file. I pared down the initial file to only contain the
settings that users would normally set or configure to setup a cluster,
ie hostname, port, seed-nodes, roles.

In the features.xml, the factory file is configured to always overwrite
so changes are picked up on upgrade. We still preserve the initial file.

Change-Id: I8e80161e21d0ad0e26f1efa1023c670b3a5ef6bc
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 6e76bb44514ea79524f46ff283ea0d0d4ad8c7f8)

Choose owner when all candidate registrations received.

In the Delayed Owner Selection Strategy we should not wait for
the timeout to occur when we have received the candidate
registrations for all the candidates possible in the system.

Change-Id: Ifcd1f376b050baf2e422e00bd4a93a4d9d3d6c45
Signed-off-by: Moiz Raja <moraja@cisco.com>

Add notification-dispatcher configuration for default akka.conf.

Change-Id: I9d4983b9d435f527738a84aa03904f23ec2237c1
Signed-off-by: Moiz Raja <moraja@cisco.com>
(cherry picked from commit 2b7d2365c64087cfce66196bf0bf5857c0a4c315)

When no candidates are present for an entity do not return EntityOwnershipState

Change-Id: I22c0100755a1fca50c638ff4b435e04bdd0f76ff
Signed-off-by: Moiz Raja <moraja@cisco.com>
(cherry picked from commit 4f2123238f32ad97019ad0ce0a7b588ea33397ed)

Fix reading of EntityOwnerSelectionStrategy

1. The pid used for reading a config admin file should not have hyphens
   so replaced them with dots
2. The config admin returns properties that are not from the file
   so we need a way to ignore them. I specifically look for
   a prefix of "entity.type." and ignore the other properties

Change-Id: I26a66176583ec39cbdb78fec749022429218e005
Signed-off-by: Moiz Raja <moraja@cisco.com>
(cherry picked from commit 7030ae1a3c8fcc19e2b88d874a18faf73496682e)

Bug 4823: Use tx commit timeout for BatchedModifications

When sending BatchedModifications messages to the shard we use the
general operation timeout which is 5 sec. We should instead use the
transaction commit timeout to be consistent with the other transaction
messages (ReadyLocalTransaction, CanCommitTransaction etc).

Change-Id: If69704c3e9bde7f2cbed344912166137d43c039b
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Fix ConcurrentModificationEx in RpcRegistry.onBucketsUpdated

This was introduced by a recent patch. onBucketsUpdated iterates the
routesUpdateCallbacks however one of the callbacks in receiveGetRouter
removes itself from the list causing the ConcurrentModificationEx.

I changed onBucketsUpdated to first copy the list to an array to prevent
this.
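
The fix pattern looks roughly like the sketch below. The class is a hypothetical stand-in for the registry, with callbacks reduced to Runnable; it is not the RpcRegistry code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback holder; illustrates copy-before-iterate only.
final class RouteCallbacks {
    private final List<Runnable> callbacks = new ArrayList<>();

    void add(Runnable callback) {
        callbacks.add(callback);
    }

    void remove(Runnable callback) {
        callbacks.remove(callback);
    }

    int size() {
        return callbacks.size();
    }

    // Iterates over a snapshot so a callback may remove itself from the
    // underlying list without disturbing the iteration.
    void onBucketsUpdated() {
        for (Runnable callback : callbacks.toArray(new Runnable[0])) {
            callback.run();
        }
    }
}
```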

Change-Id: I44c9a89b4b433f711cf4f90bf28e6955d8784f5f
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit f09d37ec4cce2411eaae11dde18f9ce2d2f14118)

Fix missing bundle

Change-Id: I9b8c0ca660e0101a2459f92dd16e36727f8ab9c3
Signed-off-by: Robert Varga <robert.varga@pantheon.sk>
(cherry picked from commit aeb60bc8ab1b62e18dc090f946c2bdf12b3e9a6c)

Bug 4866: Add wait/retries for routed RPCs

If a routed RPC is registered on one node it takes a little time for the
route to propagate via gossip to other nodes. If another node tries to
invoke the RPC prior to propagation it fails. To alleviate this timing
issue, I added wait/retries via a timer in the RpcRegistry for the
FindRouters message. As routes are updated via gossip, it retries the
FindRouters request. If the timer triggers, it sends back an empty list.
The timer period is 10 times the gossip tick interval (500ms * 10 = 5s).

Change-Id: Iaafcfb4c93cde44f62f6645c8b8684102ac0d0db
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 92ce52ab3df561a2a07bf56c7115123b0825449e)

BUG-2912: Better document DataChangeScope.ONE

Provides information about the way this scope interacts with lists in the
binding independent data tree that might be counterintuitive when compared with
a binding aware view of the same data.

Change-Id: If966331b4daa5a88be61fb2efea65a4b69495b0b
Signed-off-by: Colin Dixon <colin@colindixon.com>

Fix sporadic ShardManagerTest failures

Some of the tests fail sporadically. Most were alleviated by:

  - using tell on an actor rather than calling receiveCommand directly
  - using the normal fork/join dispatcher for creating TestActors instead
    of the default CallingThread dispatcher.

After the changes the tests ran over 200 times successfully.

Change-Id: Ib2c7c3b6dace9e89dff54eccc58a2b8aabad75de
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 2f6d96da89035d5aec78c90bce5065d2f202a515)

Bug 4627: Fix premature RO tx cleanup

For the RO tx PhantomReference cleanup mechanism, modified
RemoteTransactionContextSupport to pass the front-end client
TransactionProxy instance as the referent to the
FinalizablePhantomReference. Previously we were passing the
RemoteTransactionContextSupport instance which is only reachable via
a hard reference until the primary shard actor is obtained and thus may
be eligible for GC while the TransactionProxy is still in use.

Change-Id: Ib2808b4ba8113a5722f9ee422434a89adaf775fe
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 89274c9c31212a1d0f13aeb90384442e72221029)

Reduce output from DeadlockMonitor

If a module doesn't finish after 5 sec, the DeadlockMonitor starts
logging warning messages. However it does this every second. CDS will
wait up to 90 sec for all shards to elect a leader so the
DeadlockMonitor produces a lot of output during this period. To reduce
the noise I changed the sleep to use WARN_AFTER_MILLIS so the message is
logged every 5 sec.

Change-Id: I63842075dee1fc6a4fc4e4200cc089e33a110e78
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 384479f0181763f3202f2f7ad681e90182bcc820)

Binding Codecs support of APPEARED,DISAPPEARED.

In BE two new modification types were introduced for
structural containers, but binding codecs were not
updated accordingly.

Frontend mapping is simple:
APPEARED -> SUBTREE_MODIFIED
DISAPPEARED -> DELETE
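
The mapping above can be written as a simple switch. The enums below are local stand-ins that contain only the values named in this message; they are not the actual yangtools or binding types.

```java
// Hypothetical enums for illustration only.
enum TreeModification { APPEARED, DISAPPEARED }

enum BindingModification { SUBTREE_MODIFIED, DELETE }

final class ModificationMapper {
    private ModificationMapper() {
    }

    // Translates the two new structural-container types to their binding equivalents.
    static BindingModification toBinding(TreeModification type) {
        switch (type) {
            case APPEARED:
                return BindingModification.SUBTREE_MODIFIED;
            case DISAPPEARED:
                return BindingModification.DELETE;
            default:
                throw new IllegalArgumentException("Unsupported modification type: " + type);
        }
    }
}
```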

Change-Id: I62810c501234a62343150c328c6f2802402669c5
Signed-off-by: Tony Tkacik <ttkacik@cisco.com>

BUG-5247: notify listeners for entities which are not owned

Rather than broadcasting just the 'up' state, notify listeners about all
state we know of.

Change-Id: Iaae6db925a321aad420fa0ee8bdf8b56b5d2a29e
Signed-off-by: Robert Varga <rovarga@cisco.com>
(cherry picked from commit e86a9107fc3ae4451b5a7eb54a03f9ad6776fe72)

Fix intermittent ShardTest failures

Some tests fail intermittently due to modifying Shard state directly
instead of through messages.

Change-Id: I704d6d23c1b2a47e78b3d8823a3136e921e9113b
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 892a6ca966046fd790bdf8a64dccb456a3ece8b4)

Bumping versions by 0.0.1 for next dev cycle

Change-Id: Ib8013410eca860b8cbd3cdd246c4506610b53a6b
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

Release Beryllium

Change-Id: I676190af22ffe729663af4023b95548b9fd1feac
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

Fix intermittent LeaderTest/CandidateTest failures

The test cases in LeaderTest and CandidateTest have been failing
intermittently. A particular test in CandidateTest has recently started
failing fairly regularly on jenkins for some reason.

The common denominator is that an initial message to an actor isn't
received and goes to dead letters instead, even though the actor was
just created. This seems related to the use of ActorSelection in the raft
behavior classes; I suspect a timing issue where the underlying actor
isn't actually created/available yet via actorSelection. I had seen this
in the past and attempted to alleviate it by adding a verifyActorReady to
TestActorFactory to verify with retries that the actor can be obtained via
actorSelection.resolveOne. However it doesn't appear resolveOne works as
advertised or maybe a successful call doesn't mean a message will
succeed.

I changed verifyActorReady to send an Identify message to the
actorSelection and verify successful response. On my system LeaderTest
would usually fail within 30 test runs. After the change it ran
successfully 400 times.

Change-Id: I2da7d4a4d14c68810e87fc64b711b5c80608f5d7
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 5e8721fd675825ec5c9f826aed61c97e22188960)

Bug 5153: Add timestamp to TransactionIdentifier

TransactionIdentifiers are created locally but sent to the remote leader
so it's possible, after a restart, for the remote leader to see the same id
for 2 different txns since the local counter starts at 1. To alleviate
this I added a timestamp to TransactionIdentifier. I could've just used
a UUID but the counter is useful for debugging and a full UUID would
make the string version pretty long for logging. I think an additional
millisec timestamp is sufficient.
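
A sketch of the idea, assuming an identifier made of a member name, a counter and a millisecond timestamp. The field names and string format are illustrative, not the actual TransactionIdentifier layout.

```java
import java.util.Objects;

// Hypothetical identifier; shows how the timestamp disambiguates reused counters.
final class TxId {
    private final String memberName;
    private final long counter;   // restarts at 1 after a restart
    private final long timestamp; // millisecond timestamp added alongside the counter

    TxId(String memberName, long counter, long timestamp) {
        this.memberName = memberName;
        this.counter = counter;
        this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof TxId)) {
            return false;
        }
        TxId other = (TxId) obj;
        return counter == other.counter && timestamp == other.timestamp
                && memberName.equals(other.memberName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(memberName, counter, timestamp);
    }

    // Keeps the counter visible for debugging while staying shorter than a UUID.
    @Override
    public String toString() {
        return memberName + "-txn-" + counter + "-" + timestamp;
    }
}
```

Two identifiers created before and after a restart with the same counter value then differ by timestamp.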

Change-Id: Iaabd3d25eb64dd14053f96336c48de90d4364678
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 74fc38503a3565bed6218f65ab8f4425c61460a3)

Public constants need to be final

Make constants really constant.

Change-Id: Iacca77288d0b53f578da367fd490ff69b4484a2a
Signed-off-by: Robert Varga <rovarga@cisco.com>

BUG 5115: Fix missing artifact exception in log

Add new dependency "org.apache.karaf.region.persist"
in karaf-parent pom to fix missing artifact exception
for region feature.

Change-Id: I1d08b69e4afee4e4911d9fc5be9cfd5250868b3f
Signed-off-by: oshvartz <oshvartz@redhat.com>
(cherry picked from commit 2d0262de6e6371cd2d4875c598cd30fe891a76dc)

BUG-4869: use odl-lmax feature

Removes direct declaration of lmax version, pulling in odl-feature from
odlparent instead.

Change-Id: I52ca9433e25efc42159ee8929837f1b0d6f7292b
Signed-off-by: Robert Varga <rovarga@cisco.com>

Bug 5109: Handle stand alone leaf nodes in CDS streaming

Modified AbstractNormalizedNodeDataOutput to output the leaf set QName
that is now passed to leafSetEntryNode if no parent LeafSetNode QName is
present. Modified NormalizedNodeInputStreamReader accordingly.

I also found that OrderedLeafSetNode was not handled correctly.
AbstractNormalizedNodeDataOutput#startOrderedLeafSet needs to set
lastLeafSetQName.

The NormalizedNodePruner assumed a leaf set entry node must have a
parent and threw an exception if not, similarly with leaf node and anyXML
node. But all 3 can be standalone so I modified NormalizedNodePruner to
handle it.

Change-Id: I02a71d9280dac0eb466ff401699a40d3d8826220
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 7a0fb19fe86fbf7c7bd78f7e522884b6e477b067)

Bug 4992: Removed old leader's candidates on leader change

Modified onLeaderChanged to call removeCandidateFromEntities same as
onPeerDown.

Change-Id: I9b56e64254485fa0de4fdc1b7f4f6ddf100338af
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 207129172cb981630f955170cb67efceba02df85)

Reduce logging in QuarantinedMonitorActor

The QuarantinedMonitorActor logs every AssociationErrorEvent as warn
which causes a lot of output when a peer node is down as akka raises a
connection-refused event every 5 sec until it re-connects. Since we're
only interested in the specific quarantined event, which is logged at
warn, other events should log to debug to avoid the noise.

Change-Id: I26ab7db9a71d137ae3227409d6dcbf39675c6ec9
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Clear out router event on completion

Rather than keeping references on heap, clear references once the future
has been notified. Also some logging to enable debugging.

Change-Id: I2ab352db51134b30fb352a4adabc07eda0945841
Signed-off-by: Robert Varga <robert.varga@pantheon.sk>

Remove the leader's FollowerLogInformation on RemoveServer

On RemoveServer, if removing a follower, we need to also remove the
FollowerLogInformation entry from the followerToLog map in
AbstractLeader. Also, if a snapshot was being installed, we should
clean up the mapFollowerToSnapshot.

Change-Id: I37df57a82a1c79ce375e48127bafd661a2dfe2c6
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Bug in AbstractLeader replication consensus

In determining whether to advance the commit index, only the voting
members should be counted in the replicatedCount. There was a logic
error that instead caused it to be incorrectly based on whether the
AppendEntriesReply message was sent by a voting member.

This patch fixes the issue.
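
A minimal sketch of the intended counting rule, under stated assumptions:
the Follower type and its fields are hypothetical stand-ins for the
per-follower state, and the leader is assumed to be a voting member that
counts itself.

```java
import java.util.List;

final class ReplicatedCountSketch {
    // Hypothetical per-follower state; not the real FollowerLogInformation.
    static final class Follower {
        final boolean voting;
        final long matchIndex;

        Follower(boolean voting, long matchIndex) {
            this.voting = voting;
            this.matchIndex = matchIndex;
        }
    }

    // Counts how many voting members hold the entry at logIndex.
    static int replicatedCount(long logIndex, List<Follower> followers) {
        int count = 1; // the leader itself (assumed to be voting)
        for (Follower follower : followers) {
            // The filter is on each member's own voting state, not on which
            // member happened to send the reply being processed.
            if (follower.voting && follower.matchIndex >= logIndex) {
                count++;
            }
        }
        return count;
    }
}
```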

Change-Id: I6efb9574c39db608351297fc2552689d1ff77979
Signed-off-by: Gary Wu <gary.wu1@huawei.com>

Bug in AbstractLeader replication consensus

I ran into an issue where the leader's commit index wasn't advancing
for new log entries even though consensus was reached. This scenario can
occur if the leader previously didn't get consensus and thus didn't commit
and apply a log entry and later regains leadership with a higher term.

The code in handleAppendEntriesReply doesn't update the commit index
if an entry's term doesn't match the current term. This behavior is correct
as per the raft paper - §5.4.1: "Raft never commits log entries from
previous terms by counting replicas". However the code also breaks out
of the loop and thus can never make progress on new entries in the current
term that reach consensus. This part is incorrect - as per raft "once an
entry from the current term is committed by counting replicas, then all
prior entries are committed indirectly". Therefore we need to continue
processing subsequent log entries in order to eventually make progress.
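
The loop behavior described above can be sketched as follows. This is an
illustration only, not the handleAppendEntriesReply code: the array-based
log, the precomputed replica counts and all names are assumptions.

```java
final class CommitIndexSketch {
    /**
     * Returns the new commit index.
     *
     * entryTerms[i] is the term of the log entry at index i, and
     * replicatedCounts[i] is the number of voting members holding it.
     * A commitIndex of -1 means nothing is committed yet.
     */
    static int advanceCommitIndex(int commitIndex, int currentTerm,
            int[] entryTerms, int[] replicatedCounts, int majority) {
        int newCommitIndex = commitIndex;
        for (int i = commitIndex + 1; i < entryTerms.length; i++) {
            if (replicatedCounts[i] < majority) {
                // No consensus on this entry, so later entries cannot have it either.
                break;
            }
            if (entryTerms[i] == currentTerm) {
                // Committing a current-term entry commits all prior entries indirectly.
                newCommitIndex = i;
            }
            // An entry from a previous term is never committed by counting
            // replicas, but we keep scanning instead of breaking out.
        }
        return newCommitIndex;
    }
}
```

For example, with entry terms {1, 1, 2}, all entries replicated to a majority
and a current term of 2, the commit index advances to 2, whereas breaking at
the first previous-term entry would leave it stuck at -1.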

Change-Id: I2d093848c3a846e1f6420ac695b4ff652a65bf6b
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

BUG-4963: Bump scala to 2.11.7

With 2.10 I was experiencing random freezes when installing features
that used it in karaf.

Bumping to 2.11 doesn't seem to break anything nor has any downsides.

We have scala.micro.version defined, so use it instead of the
hardcoded micro version.

Change-Id: I2a445790980d0da3152db3664294fd789f8272c7
Signed-off-by: Tomas Cere <tcere@cisco.com>
Signed-off-by: Robert Varga <rovarga@cisco.com>

Bug 4455 - Inconsistent COMMIT operation handling when no transactions are present

Return positive response for commit operation in config subsystem
netconf northbound even if no candidate transaction is open for session.

Need to be merged after https://git.opendaylight.org/gerrit/#/c/32598/

Change-Id: Ia6ce2aa6ffdfafc47f69ae7315669f64b653c514
Signed-off-by: Jakub Morvay <jmorvay@cisco.com>

BUG 4017: Notification publish service is not available from provider context

Change-Id: I2cb2dd4e6e3c22b8db1d368bde2c914d53100661
Signed-off-by: Tomas Cere <tcere@cisco.com>

Disallow remove leader in single node

We don't want to allow removal of the leader in a single node cluster,
i.e. when there are no followers.

Change-Id: I3bedd1727736c7dfec55ba696f5ef1197a68c89d
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 421b5a27bd36cdaa04159d5f7ceb9f8e3affb2fa)

Set to non-voting if not in server configuration

On recovery, if a RaftActor is not in its own recovered
ServerConfigurationPayload list, then set itself to a non-voting member
so it stays at Follower and doesn't try to start an election.

This scenario is an edge case for Shards as, normally, when a server is
removed, it self-destructs and is removed from the ShardManager. However
there is a small window where disconnect or shutdown could prevent
ShardManager removal from occurring. This patch protects against a server
restart causing disruption after removal.

Change-Id: I64ecd89cddec7a4e1711e0d8d17c7ea6b36e29a0
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 8dabbaa07e7034a2f385f9b553eaf2dbde91525b)

Revert "Add mockito-configuration to tests"

This reverts commit dcc92fc8fdf056d5ada94931f2d24523070fd9a7.

Change-Id: Ia89b88f9b933d31d369e5ad75ebf8c762c9dfde0
Signed-off-by: Robert Varga <robert.varga@pantheon.sk>

Update .gitreview for stable/beryllium

Change-Id: Ie99a1d430deaba902a182cb986d721ee5ec0e557
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>

Remove ModificationPayload class

The ModificationPayload class was introduced early in Lithium but was
replaced later in Lithium by DataTreeCandidatePayload. Since ModificationPayload
was never contained in a release it can be removed.

Change-Id: Ia4da96695fb9c0356d16f048451b4dab7e0bcf70
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Removed unused actorPath from ShardManager.

Change-Id: I31f52e59ff59d5acc86feed32118a392cf1132bc
Signed-off-by: Tony Tkacik <ttkacik@cisco.com>

BUG 4930 & BUG 4017: Allow multiple refine statements in MXBean generation

Stops enforcing a single refine statement when generating MXBeans.

Change-Id: I2f07fc23b355b1871170a00baf52db34f5e6eb66
Signed-off-by: Tomas Cere <tcere@cisco.com>

Remove deprecated getDataStoreType methods

getDataStoreName methods were recently added to DatastoreContext and ActorContext
to replace the getDataStoreType methods. The latter were marked as
deprecated but we can remove them since they aren't public APIs outside
of the context of sal-distributed-datastore. The remaining callers were
migrated to the getDataStoreName methods.

Change-Id: I7dab731d96b3b8c249a59824de4d78ea72500e05
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Bug 3871: Deprecate opendaylight-inventory model.

Change-Id: I526496120b79158df41aec315d67303e4604074b
Signed-off-by: Tony Tkacik <ttkacik@cisco.com>
Signed-off-by: Robert Varga <rovarga@cisco.com>

Update features archetypes

* Use {{VERSION}} in the documentation comments.
* Use {{VERSION}} in the generated features.xml.
* Remove an invalid space in the schema locations.
* Hook up opendaylight-karaf-features to features-parent.
* Import yangtools-artifacts in the generated features pom.xml.
* Remove redundant versions in the generated features pom.xml.

Change-Id: I60e7d49d0d29a1d9040501e7a8fa0a61ef6fc1bc
Signed-off-by: Stephen Kitt <skitt@redhat.com>

Add mockito-configuration to tests

Yangtools' mockito-configuration ensures that all methods touched in
mocked objects have to be mocked, preventing failures which are hard to
track down.

The reason for this is that by default unmocked methods do nothing and
return null -- injecting nulls into contexts which do not expect them.

Change-Id: If7b9afac01128be6f1b2a90b1e8c068cb4a39b65
Signed-off-by: Robert Varga <robert.varga@pantheon.sk>

InternalJMXRegistration should be an ObjectRegistration

This way it follows AutoCloseable#close() contract, e.g. allows multiple
invocations.
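
A minimal sketch of a registration whose close() tolerates repeated calls.
The class name and the Runnable-based unregister hook are assumptions for
illustration and do not reflect the actual InternalJMXRegistration code.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical registration; only the first close() performs the cleanup.
final class IdempotentRegistrationSketch implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean();
    private final Runnable unregister;

    IdempotentRegistrationSketch(Runnable unregister) {
        this.unregister = Objects.requireNonNull(unregister);
    }

    @Override
    public void close() {
        // compareAndSet guarantees the cleanup runs at most once,
        // even if close() is called again or from another thread.
        if (closed.compareAndSet(false, true)) {
            unregister.run();
        }
    }
}
```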

Change-Id: Ied93bbdd388189a928cf06cbbc913fe124a284dd
Signed-off-by: Robert Varga <rovarga@cisco.com>

Introduce lifecycle to runtime beans registrator in config-manager

Root runtime bean registrator was not properly closed or reused
during reconfiguration.

Change-Id: I537f7af5957496001f51663ded206a4ab04e5401
Signed-off-by: Maros Marsalek <mmarsale@cisco.com>

Use local param in persist callback

The message being persisted is echoed as the parameter, so there is no
need to reference it via a global argument. Also add preconditions to
guard against null context/behavior.

Change-Id: Ia93fcf6d331492081a1a3c69899c86e1b55d1e71
Signed-off-by: Robert Varga <robert.varga@pantheon.sk>

Remove deprecated DataExistsReply constructor

The deprecated DataExistsReply constructor is not in use, so remove it.

Change-Id: Ib1c184901be8070f70c14f4125fdfbefc59b541d
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Bug 2062 - StreamWriter APIs loses information about leaf-set ordering

* modified clustering NormalizedNodeStreamWriter implementation to use
OrderedLeafSet

Change-Id: I663f6b6d894b8366b7a54a3c56be05f20fef43c2
Signed-off-by: Jan Hajnar <jhajnar@cisco.com>
Signed-off-by: Robert Varga <rovarga@cisco.com>

Implement remove-all-shard-replicas RPC

Change-Id: Idc1481c0f6903554fd6659c32c9639af5aa47e92
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Send Shutdown message to Shard on ServerRemoved

Modified the ShardManager to send the Shutdown message to the Shard on
ServerRemoved. If the shard was the leader, this will trigger leadership
transfer.

I also made changes to propagate the appropriate error to the caller on
RemoveServerReply instead of always replying with success.

Added a test case in ClusterAdminRpcServiceTest for removing the leader.

Change-Id: I30d2a22f07c1003fad2aba68e4f2d1d2c9fe7eb3
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Provide new runtime bean registrator for recreated module

After recent changes in config-manager, runtime bean registrations are not
preserved after reconfiguration of a module. This is because internal
jmx registrators are now properly closed. Runtime bean registrations relied
on previous, incorrect behavior.

This commit fixes the issue by recreating runtime bean registrator for each module.
However there's a possible leak if clients do not close runtime bean registrations
in their createInstance methods.

Change-Id: I37f0956effdd2a183390252615c23a90e23ebe8e
Signed-off-by: Maros Marsalek <mmarsale@cisco.com>

Implement RemoveServer for leader

Implemented RemoveServer for leader which previously was coded to fail
with the NOT_SUPPORTED error until leadership transfer was implemented.
Leadership transfer will be triggered via the Shutdown message in the
ShardManager via ServerRemoved message. This will be done in a subsequent
patch.

Change-Id: Iae7895a3801986e482073ccf8ea24e5b720b7618
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Drop karaf-maven-plugin from it-base/parent

This is no longer used for building; removing it greatly simplifies
the dependency tree of projects using the it-base stuff. In
particular, it drops commons-collections 3.2.1 which otherwise gets
flagged by various security tools (because of CVE-2015-7501).

Change-Id: I5960e6436833ecf623228685f254a0a55ebf292c
Signed-off-by: Stephen Kitt <skitt@redhat.com>

Expose yang source provider into config subsystem

Change-Id: Ie5ae543e2be0f1ef129c97cd3e5e945ba6467d89
Signed-off-by: Maros Marsalek <mmarsale@cisco.com>

BUG-4514: use SimpleImmutableEntry

Instead of brewing our own class, reuse an Entry implementation from
JRE.
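
For reference, a small sketch of using the JRE class in place of a
hand-written pair type. The EntrySketch helper and its key/value types are
assumptions for illustration; the replaced class is not shown here.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

final class EntrySketch {
    // Builds an immutable key/value pair using the JRE-provided Entry implementation.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleImmutableEntry<>(key, value);
    }
}
```

The returned entry supports getKey() and getValue(), while setValue() throws
UnsupportedOperationException.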

Change-Id: I94972985050921838f0b217a0957a413d7971427
Signed-off-by: Robert Varga <rovarga@cisco.com>

BUG-4514: clean children in InternalJMXRegistrator

Retaining children once they have been closed can lead to a leak; make
sure we call back to parent to remove ourselves from its list.
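
The parent/child cleanup can be sketched roughly as below. This is a
single-threaded illustration with assumed names; it leaves out the
synchronization and the Root/Nested split that the actual change introduces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical registrator; not thread-safe, for illustration only.
final class RegistratorSketch implements AutoCloseable {
    private final RegistratorSketch parent; // null for the root
    private final List<RegistratorSketch> children = new ArrayList<>();
    private boolean closed;

    RegistratorSketch() {
        this(null);
    }

    private RegistratorSketch(RegistratorSketch parent) {
        this.parent = parent;
    }

    RegistratorSketch createChild() {
        RegistratorSketch child = new RegistratorSketch(this);
        children.add(child);
        return child;
    }

    int childCount() {
        return children.size();
    }

    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        // Iterate over a copy because each child removes itself from our list.
        for (RegistratorSketch child : new ArrayList<>(children)) {
            child.close();
        }
        if (parent != null) {
            // Call back to the parent so it does not retain a closed child.
            parent.children.remove(this);
        }
    }
}
```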

This includes a refactor to hide InternalJMXRegistrator, which becomes
abstract and has two subclasses: Root and Nested. This allows us to use
proper synchronization between closing of the child and parent
registrators. Also saves a bit of memory.

Also clean up {Module,Transaction}JMXRegistrator constructors to not do
a createChild(), as that fails to cleanup immediate children of the
Root, leading to empty InternalJMXRegistrators being collected in Root's
child list.

Change-Id: I9a4708b67777ca6033e5a83c586b3f78692dff2a
Signed-off-by: Robert Varga <rovarga@cisco.com>

Don't transfer leadership to a non-voting follower

Change-Id: I5ee97f2cef50b100f21627f26ba6c339972cd677
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>

Abort pending txn's in Shard on leader transition and shutdown

Added code in the ShardCommitCoordinator to abort pending txn's with an
appropriate failure message when leadership is lost and on shutdown.

Also moved the handleTransactionCommitTimeoutCheck logic from the Shard
to the ShardCommitCoordinator for consistency.

Change-Id: I4af1262aba76909536348a07a368f1559714f90d
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>