Add Payload.serializedSize()

There is a rather big difference in payload sizing when we consider how big an entry is: the in-memory size can easily be "zero", which can translate to a serialized size of hundreds of bytes. This difference is problematic, as we use the former to estimate how many payloads we can squeeze into an AppendEntries message and compare that to the configured payload limit. Even with some cushion (32KiB by default), we can end up blowing past the frame size.

Add Payload.serializedSize(), which provides a semi-conservative estimate of the serialized size, and use that to select the cut-off. Also improve SimpleReplicatedLogEntry's estimate by performing a quick serialization operation -- which reduces potential waste per entry by 294 bytes, as our hard-coded estimate of 400 bytes was far too conservative.

JIRA: CONTROLLER-2037
Change-Id: I5abe7d00db9e10f1c66e6db0f7c82854f9aa352d
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
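As an illustration of the kind of estimate involved, here is a minimal sketch (not the actual Payload API) that derives a size by serializing the object once with plain Java serialization:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Hypothetical helper: measure an object's Java-serialized size once,
    // instead of relying on a hard-coded per-entry guess.
    final class SerializedSizeEstimator {
        private SerializedSizeEstimator() {
            // utility class
        }

        static int serializedSize(final Serializable obj) {
            final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(obj);
            } catch (IOException e) {
                throw new IllegalStateException("Failed to serialize " + obj, e);
            }
            return bytes.size();
        }
    }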
Clean up AbstractReplicatedLogImplTest

Use static imports for Assert methods, add final qualifiers.

Change-Id: I35f589b14e1188fa1fb94524065b562023687f27
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
Do not allow persistence callbacks to throw Exception

Since we are indirecting through an executor, we are forced to wrap any exception -- just do not bother: the callbacks are executed in the context of an actor anyway (which deals with RuntimeExceptions) and users are not throwing checked exceptions.

Change-Id: I6cea19ab7192fa42ad3c346d554411cd0d558a64
Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
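A minimal sketch of the resulting callback shape, assuming a simple functional interface (the real signature lives in the controller code base):

    // Hypothetical post-change callback: no 'throws Exception' clause, since
    // callbacks run inside an actor which already handles RuntimeExceptions.
    @FunctionalInterface
    interface PersistCallback<T> {
        void onComplete(T persisted);
    }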
Bug 7362: Notify applyState synchronously

As per the comments in bug 7362, it is problematic for snapshots that we set the lastApplied index but queue the ApplyState message. This can leave the derived actor's captured state out-of-sync with the lastApplied index if a snapshot occurs in between.

We need to set lastApplied atomically with applying the state, so I added a callback that RaftActor sets in the RaftActorContext which is called by the behavior's applyLogToStateMachine method. The callback immediately calls applyState. The ApplyState message is still queued for other components that trigger on it (and also for unit tests).

I considered setting lastApplied when the ApplyState message is processed, and not in applyLogToStateMachine, but this would be problematic if another AppendEntriesReply was queued before the ApplyState message, as it would try to apply it again.

Change-Id: Ie062af8440bc251eec9c9ef58f450dee23abd04c
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
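A simplified sketch of the callback wiring, with illustrative names (the real code uses RaftActorContext and the behavior's applyLogToStateMachine):

    import java.util.function.Consumer;

    // Illustrative only: the behavior advances lastApplied and invokes the
    // registered callback in the same step, so a snapshot can never observe
    // lastApplied ahead of the actually-applied state.
    final class ApplyStateWiring {
        private Consumer<Long> applyStateCallback;
        private long lastApplied = -1;

        void setApplyStateConsumer(final Consumer<Long> callback) {
            applyStateCallback = callback;
        }

        void applyLogToStateMachine(final long index) {
            lastApplied = index;               // advance the index ...
            applyStateCallback.accept(index);  // ... and apply state immediately
        }
    }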
Remove MockReplicatedLogEntry

This class now just inherits from SimpleReplicatedLogEntry, so it is no longer needed. Changed all users to SimpleReplicatedLogEntry. The only difference is that the term and index params are flipped in the ctors.

Change-Id: I0b32078eeb9ea45001dcd8e8aa30bfe548256e96
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
Bug 5419: Persist log entries asynchronously

Modified RaftActor#persistData to persist the ReplicatedLogEntry using Akka's persistAsync method. This is similar to the persist method except that subsequent messages are delivered prior to persistence completion. This avoids blocking the RaftActor, so it can process AppendEntriesReply and other messages while persistence is in progress.

In addition, AbstractLeader was modified to only count itself for consensus when persistence is complete. This required communicating the persistence-complete state to the AbstractLeader. A transient persistencePending flag was added to ReplicatedLogImplEntry that is set by RaftActor#persistData prior to the persist call and is cleared when the persist callback executes. AbstractLeader checks the flag when counting consensus.

It is possible that the persistence-complete event arrives after AppendEntriesReply messages from replicated followers, so a new message, CheckConsensusReached, is sent by the RaftActor on persistence complete to check whether consensus is reached.

Change-Id: If34a5f395d52e17b2737464a2e2403f56a520c43
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
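A simplified sketch of the persistencePending flag, under the assumption of an Akka-style persistAsync callback (the real ReplicatedLogImplEntry and RaftActor code differs in detail):

    import java.util.function.Consumer;

    // Illustrative model: the flag is set before persistAsync is invoked and
    // cleared in its callback; the leader does not count itself toward
    // consensus while an entry is still pending.
    final class PendingPersistEntry {
        // Stand-in for Akka's persistAsync(event, handler); not the real API.
        interface AsyncPersistence {
            <T> void persistAsync(T event, Consumer<T> handler);
        }

        private boolean persistencePending;

        void persist(final AsyncPersistence persistence) {
            persistencePending = true;
            persistence.persistAsync(this, persisted -> {
                persistencePending = false;
                // the real actor then sends itself CheckConsensusReached
            });
        }

        boolean isPersistencePending() {
            return persistencePending;
        }
    }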
Fix warnings in sal-akka-raft test classes

Fixed checkstyle warnings in the test classes. Most of the warnings/changes were for:

- white space before if/for/while/catch
- white space before beginning brace
- line too long
- illegal catching of Exception (suppressed)
- variable name too short
- indentation
- removed use of JavaTestKit with embedded code (avoids having to indent code another 4 spaces); in most cases, JavaTestKit wasn't even used
- local vars/params hiding a field
- putting overloaded methods close to one another
- remove unused vars
- convert functional interfaces to lambdas (eclipse save action)
- empty catch block - added comment or Throwables.propagate as appropriate
- missing period after first sentence in javadoc
- missing first sentence in javadoc
- adding final for locals declared too far from first usage

Change-Id: I5c522e5b0383b2c5e9b0b036dc444c51f788b650
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
Force install snapshot when follower log is ahead

It's possible for a follower's log to actually be ahead of the leader's log. Normally this doesn't happen in raft, as a node cannot become leader if its log is behind another's. However, the non-voting semantics deviate a bit from raft. Only voting members participate in elections and can become leader, so it's possible for a non-voting follower to be ahead of the leader. This can happen if persistence is disabled and all voting members are restarted. In this case, the voting leader will start out with an empty log, however the non-voting followers still retain the previous data in memory.

On the first AppendEntries, the non-voting follower returns a successful reply b/c the prevLogIndex sent by the leader is -1 and thus the integrity checks pass. However the follower's returned lastLogIndex may be higher, in which case we want to reset the follower by installing a snapshot. Therefore I added a check in AbstractLeader#handleAppendEntriesReply for whether the reply's lastLogIndex > the leader's last index.

Since the initial AppendEntries is sent immediately by the leader, normally the follower will reply and this change works. However, if a follower happens to be disconnected and doesn't reply for some time, the leader can still progress with new commits. If the leader has enough commits such that its lastIndex matches or exceeds the lagging non-voting follower's, this check doesn't work. In this case, the follower's integrity checks will fail since the leader's prevLogTerm will differ. On reply, the leader will start decrementing the follower's nextIndex in an attempt to find where the logs match. During this process the leader may trim its log via replicatedToAllIndex, in which case the follower's nextIndex may no longer be in the leader's log and the leader will install a snapshot. However, if other nodes are down and prevent the log trimming, the follower's nextIndex may remain in the log until it eventually decrements to 0. The follower's integrity checks will pass in this case since the leader's prevLogIndex will be -1.

The follower will then attempt to add the leader's log entries to its log. It first loops the log entries in the AppendEntries with the intent of skipping matching entries in its log (ie index and term the same) and stopping when it finds an entry that doesn't exist or finds one whose term doesn't match, in which case it removes the entries beginning at this index. However I found some issues in this code. First, it was calling get on the getReplicatedLog, which doesn't take into account that the index may be part of the prior snapshot and not actually in the log. I changed this check to isLogEntryPresent, which takes the snapshot into account. Second, if it hits a conflicting entry it tries to remove it from the log. However, as before, it may be in the snapshot and not in the log, in which case nothing gets removed. To alleviate this, I modified removeFromAndPersist to return a boolean - false meaning it didn't find the index. In this case I changed it to send back a reply to force a snapshot.

I added several tests in a new class NonVotingFollowerIntegrationTest that runs thru various scenarios to cover the cases described above.

While testing I ran into some orthogonal issues that I also fixed:

- If a leader has only non-voting followers, on replicate it should immediately commit and apply to state, as it does when there are no followers, since it doesn't need consensus from non-voting followers. So I added a method anyVotingPeers to RaftActorContext to handle this case.
- When calculating the prevLogIndex and prevLogTerm for the AppendEntries message, it calls get on the getReplicatedLog, which doesn't take into account that the index may be the snapshot index/term. The follower checks prevLogIndex/prevLogTerm against the snapshot, so the leader should as well.

Change-Id: I3f92fc0b92ddc6d02dc6cb0e56b444a7c61035d7
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
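A condensed sketch of the two checks described above, with illustrative names (the real logic lives in AbstractLeader and the follower's log handling):

    // Illustrative only; not the actual AbstractLeader/Follower code.
    final class SnapshotResetSketch {
        long leaderLastIndex;
        long snapshotIndex;

        // Leader side: a follower whose reported lastLogIndex is beyond our
        // own last index is reset with an install snapshot.
        boolean followerAheadOfLeader(final long replyLastLogIndex) {
            return replyLastLogIndex > leaderLastIndex;
        }

        // Follower side: removeFromAndPersist now reports whether the
        // conflicting index was actually found in the in-memory log; false
        // means it is buried in a prior snapshot, so the follower replies in
        // a way that forces the leader to install a snapshot.
        boolean removeFromAndPersist(final long index) {
            return index > snapshotIndex;
        }
    }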
Guard against duplicate log indexes

We saw an issue where a duplicate log index was added to the journal. The duplicates were contiguous. It is unclear at this point how it happened, but we should guard against it, so I added a check to ensure the new index > the last index.

Change-Id: I5acc0fa5b4fe7f4352fc7935e7262834894878f3
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
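The guard amounts to a monotonicity check on append, roughly like this sketch (names illustrative):

    // Illustrative append-time guard: reject any entry whose index does not
    // strictly follow the last in-memory index.
    final class MonotonicIndexGuard {
        private long lastIndex = -1;

        boolean append(final long entryIndex) {
            if (entryIndex <= lastIndex) {
                // duplicate or out-of-order index: refuse instead of
                // silently corrupting the journal
                return false;
            }
            lastIndex = entryIndex;
            return true;
        }
    }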
Bug 2787: Batch AppendEntries to speed up follower sync

AbstractLeader#sendUpdatesToFollower now attempts to send all entries to the follower to catch it up with one AppendEntries message. However, we don't want to send too large a message and risk exceeding Akka's message size limit, so I added another param, maxDataSize, to ReplicatedLog#getFrom. It will attempt to add all entries up to maxEntries but stops if the accumulated data size exceeds maxDataSize.

For maxDataSize, I reused the existing snapshotChunkSize (default 2M) defined in ConfigParams. This is currently hard-coded - we may want to make it configurable.

Change-Id: Ib2c1165c4140a36f4eada8f81b4261260615dd18
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
(cherry picked from commit 6065ba82c90e366919a1b78105507b935b91af8e)
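A sketch of the batched getFrom contract, assuming each entry can report an approximate size; the batch stops at whichever limit is hit first, but always makes progress with at least one entry:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only; the real method is ReplicatedLog#getFrom.
    final class BatchedGetFrom {
        interface Entry {
            int size();
        }

        static List<Entry> getFrom(final List<Entry> log, final int fromIndex,
                final int maxEntries, final long maxDataSize) {
            final List<Entry> batch = new ArrayList<>();
            long accumulatedSize = 0;
            for (int i = fromIndex; i < log.size() && batch.size() < maxEntries; i++) {
                final Entry entry = log.get(i);
                accumulatedSize += entry.size();
                // always include the first entry so replication makes progress
                if (!batch.isEmpty() && accumulatedSize > maxDataSize) {
                    break;
                }
                batch.add(entry);
            }
            return batch;
        }
    }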
Bug 3260: Recovery misses flows installed on single node

There are two bugs which were encountered:

1. When Akka replays the journal, it replays it from the persistent journal's sequence number present at the time of snapshot success.
2. The log entry on which a snapshot capture is triggered does not make it into that snapshot and gets removed from the persistent journal as well. So on recovery, that log entry/data is missing.

To fix the first, the snapshotSequenceNr() method of UntypedPersistentActor is overridden, so that Akka uses the cached last-sequence-number rather than its own.

To fix the second issue, the capture of the snapshot for a single node is done after applyState. This ensures that the persistent journal and snapshot are in sync.

Also, the in-memory journal was replaying all its messages and not honoring fromSequenceNr, like Akka does. So fixed it. The tests needed to be fixed primarily due to the in-memory journal change. A new test is added to test out the recovery of a single node.

Change-Id: I779d1d6ce9880b19322d831ef5c8696b4c751e3d
Signed-off-by: Kamal Rameshan <kramesha@cisco.com>
(cherry picked from commit 1c4abd5ad29d3200929275cf0030c0e4f35c3886)
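The first fix boils down to overriding the sequence-number hook so recovery starts from the cached value; a rough sketch, assuming an Akka-style snapshotSequenceNr() hook:

    // Illustrative shape of the override; in the real code this sits on the
    // persistent actor and returns the sequence number cached at the time of
    // the last successful snapshot.
    abstract class CachedSequenceNrActor {
        private long cachedLastSequenceNr;

        void onSaveSnapshotSuccess(final long sequenceNr) {
            cachedLastSequenceNr = sequenceNr;
        }

        // Overrides the Akka-provided hook so replay starts from our cached
        // value rather than Akka's own bookkeeping.
        public long snapshotSequenceNr() {
            return cachedLastSequenceNr;
        }
    }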
Add unit tests for ReplicatedLogImpl

Some test cases in RaftActorTest directly test the ReplicatedLogImpl, so they were moved to a new specific test class, ReplicatedLogImplTest, as they don't need the overhead of creating a RaftActor. Also added more tests and increased coverage in AbstractReplicatedLogImplTest.

Change-Id: I14e65620d1c8bc89bab055488aed89837ff3f4e7
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
Calculate replicated log data size on recovery

We maintain the replicated log data size at runtime, but we should also calculate the data size for recovered entries on startup. I changed the AbstractReplicatedLogImpl#append method to also add to the dataSize. Previously dataSize was adjusted in ReplicatedLogImpl#appendAndPersist after the entry was persisted. I'm not sure why it was done this way, but if persistence failed then the entry would've been added to the in-memory log without increasing the dataSize. This seems inconsistent - if we add to the log we should always increase the dataSize.

The same goes for removing entries from the log - ReplicatedLogImpl re-calculated the dataSize after the removal was persisted. So I changed removeFrom to adjust dataSize and changed ReplicatedLogImpl#removeFromAndPersist to call AbstractReplicatedLogImpl#removeFrom (the code was duplicated).

To avoid out-of-band changes to dataSize I made it private. Same with journal. I think this is safer - these should be owned by AbstractReplicatedLogImpl and derived classes shouldn't modify them directly.

Change-Id: I114cbac1d6a450bc0a1c8c6ee60042ad28a89bf4
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
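The invariant after this change can be modeled roughly as follows (illustrative names mirroring AbstractReplicatedLogImpl): dataSize is adjusted only inside append and removeFrom, so the in-memory journal and its size can never disagree:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative model only, not the actual AbstractReplicatedLogImpl.
    final class SizedLog {
        interface Entry {
            int size();
        }

        private final List<Entry> journal = new ArrayList<>();
        private long dataSize;

        void append(final Entry entry) {
            journal.add(entry);
            dataSize += entry.size();  // adjusted here, not after persistence
        }

        void removeFrom(final int index) {
            for (int i = journal.size() - 1; i >= index; i--) {
                dataSize -= journal.remove(i).size();
            }
        }

        long dataSize() {
            return dataSize;
        }
    }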
Refactor ReplicatedLogImpl to separate class

To reduce the size of the RaftActor class for improved readability.

Change-Id: I845cfeb4bc48f0c5eb96ba2cc0a925d9ce5fa416
Signed-off-by: Tom Pantelis <tpanteli@brocade.com>
Bug 2692: Real snapshots should use the replicatedToAllIndex for clearing the in-memory log

Similar to fake snapshots, real snapshots should also be using the replicatedToAllIndex. replicatedToAllIndex is part of the RaftActorBehavior. A performSnapshotWithoutCapture helper method is part of the AbstractRaftActorBehavior. On CaptureSnapshotReply, we clear the log based on replicatedToAllIndex and the corresponding term. Many of the existing tests already exercise these changes, and some were added and enhanced.

Bug 2715: Fix in-memory log cleanup for an inactive follower

If one of the followers is down, rely on the memory usage to clear the log. This is done in CaptureSnapshotReply.

Change-Id: Ifb3ca19691f20735e3f84b968a7adf01398c20e0
Signed-off-by: Kamal Rameshan <kramesha@cisco.com>
Bug 2590: Clustering: Minimize usage of the in-memory journal

In order to minimize the memory usage of the in-memory journal, we can remove the entries from the Leader's journal once they have been successfully replicated to ALL its followers. This does not interfere with snapshots, as we capture snapshots on demand. The followers follow the leader in cleaning the in-memory journal, thereby ensuring that all the journals have more or less the same entries. This is done by the leader passing its replicatedToAllIndex as part of the AppendEntries.

Change-Id: I579a1f90d3c4e5d6be4ce699072688788b07bd48
Signed-off-by: Kamal Rameshan <kramesha@cisco.com>
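Conceptually, replicatedToAllIndex is just the lowest matchIndex across followers, as in this sketch (illustrative; the real computation lives in the leader behavior):

    import java.util.Collection;

    // Illustrative: everything at or below the minimum matchIndex has been
    // replicated to every follower and is safe to drop from the in-memory log.
    final class ReplicatedToAll {
        static long replicatedToAllIndex(final Collection<Long> followerMatchIndexes) {
            if (followerMatchIndexes.isEmpty()) {
                return -1;
            }
            long min = Long.MAX_VALUE;
            for (long matchIndex : followerMatchIndexes) {
                min = Math.min(min, matchIndex);
            }
            return min;
        }
    }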
Hide AbstractReplicatedLogImpl index fields

The fields have proper getter/setter methods; there is no need to expose them anywhere.

Change-Id: I7971753310dfe8c8654caeef04cfc38dfddeef86
Signed-off-by: Robert Varga <rovarga@cisco.com>
Bug 2437: Enable snapshotting based on size of data in the in-memory journal

- Changed RaftActor to snapshot based on the transaction count or the size of the data in the in-memory journal, whichever comes earlier.
- The size of data used is the rough (not accurate) size of the Payload in the replicated log entry.
- In ShardStats, exposed another property which is the data size of the in-memory journal.
- The snapshot data threshold percentage is configurable using the config sub-system and is set to a default of 12%. The reason for setting it at 12% by default is that we have a total of 8 default shards out of the box. I could have set this to 16%, as toaster is not a "real" data shard.
- The snapshot data threshold is calculated as a percentage of Runtime.totalMemory(), which is the total memory the JVM considers available for object allocation. From testing it appears that the total memory is what would appear in jconsole as the committed memory.

I have not added any unit testing for this - but tested it using the scenario described in bug 2437 and it seems to work pretty well. The deployment used only 2G of memory and worked fine for a 7-switch topology, and I observed that it had not run out of memory after more than 2 hours.

Change-Id: I09ec0827c0411c42a9224bb6d159d5590c22e20b
Signed-off-by: Moiz Raja <moraja@cisco.com>
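The size-based trigger reduces to a comparison against a percentage of Runtime.totalMemory(), roughly as sketched here (names illustrative):

    // Illustrative check: capture a snapshot once the journal's rough data
    // size exceeds the configured percentage (12% by default) of the total
    // memory the JVM considers available for object allocation.
    final class SnapshotTrigger {
        static boolean dataSizeThresholdExceeded(final long journalDataSize,
                final int thresholdPercentage) {
            final long totalMemory = Runtime.getRuntime().totalMemory();
            return journalDataSize > totalMemory * thresholdPercentage / 100;
        }
    }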
Fix warnings in sal-akka-raft

Make sure to add generic arguments to instantiation and do not use deprecated junit classes.

Change-Id: I0a84c3c20125a00bb262bf4dfce30c77fcaa765d
Signed-off-by: Robert Varga <rovarga@cisco.com>
Tune replication and stabilize tests

Made the following changes for replication:

- Increased the heartbeat timeout to 500 milliseconds.
- Send only one entry from the replicated log to the follower in append entries.

Both of these tweaks have been made to prevent election timeouts and frequent switching of leaders.

Changes to tests:

- Added a duration when constructing an ExpectMsg. This prevents ExpectMsg from waiting forever when an expected event does not occur.
- Removed all Thread.sleep calls from the tests and replaced them with waits for a specific LogEvent, which is more deterministic.

Change-Id: Ie9ce0c9c73bf1b170a78879b1e2dab76f1de64df
Signed-off-by: Moiz Raja <moraja@cisco.com>