This is the second in the series of assignments in which you'll build a fault-tolerant key/value storage system. You've started off in Assignment 3 assignment by implementing the leader election features of Raft. In this assignment, you will implement Raft's core features: log consensus agreement. From here, Assignment 4 will be a key/value service that uses this Raft implementation as a foundational module.
While being able to elect a leader is useful, we want to use Raft to keep a consistent, replicated log of operations. To do so, we need to have the servers accept client operations through Start(), and insert them into the log. In Raft, only the leader is allowed to append to the log, and should disseminate new entries to other servers by including them in its outgoing AppendEntries RPCs.
If this sounds only vaguely familiar (or even if it's crystal clear), you are highly encouraged to go back to reread the extended Raft paper, the Raft lecture notes, and the illustrated Raft guide. You should, of course, also review your work from Assignment 3, as this assignment directly builds off that.
You will continue to use the same cos418 code bundle from the previous assignments. For this assignment, we will focus primarily on the code and tests for the Raft implementation in src/raft and the simple RPC-like system in src/labrpc. It is worth your while to read and digest the code in these packages again, including your implementation from Assignment 3.
In this lab you'll implement most of the Raft design described in the extended paper, including saving persistent state and reading it after a node fails and then restarts. You will not implement cluster membership changes (Section 6) or log compaction / snapshotting (Section 7).
A set of Raft instances talk to each other with RPC to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. The entries are numbered with index numbers. The log entry with a given index will eventually be committed. At that point, your Raft should send the log entry to the larger service for it to execute.
Your first major task is to implement the leader and follower code to append new log entries. This will involve implementing Start(), completing the AppendEntries RPC structs, sending them, and fleshing out the AppendEntry RPC handler. Your goal should first be to pass the TestBasicAgree() test (in test_test.go). Once you have that working, you should try to get all the tests before the "basic persistence" test to pass before moving on.
Only RPC may be used for interaction between different Raft instances. For example, different instances of your Raft implementation are not allowed to share Go variables. Your implementation should not use files at all.
The next major task is to handle the fault tolerant aspects of the Raft protocol, making your implementation robust against various kinds of failures. These failures could include servers not receiving some RPCs and servers that crash and restart.
A Raft-based server must be able to pick up where it left off, and continue if the computer it is running on reboots. This requires that Raft keep persistent state that survives a reboot (the paper's Figure 2 mentions which state should be persistent).
A “real” implementation would do this by writing Raft's persistent state to disk each time it changes, and reading the latest saved state from disk when restarting after a reboot. Your implementation won't use the disk; instead, it will save and restore persistent state from a Persister object (see persister.go). Whoever calls Make() supplies a Persister that initially holds Raft's most recently persisted state (if any). Raft should initialize its state from that Persister, and should use it to save its persistent state each time the state changes. You can use the ReadRaftState() and SaveRaftState() methods for this respectively.
Implement persistence by first adding code to serialize any state that needs persisting in persist(), and to unserialize that same state in readPersist(). You now need to determine at what points in the Raft protocol your servers are required to persist their state, and insert calls to persist() in those places. Once this code is complete, you should pass the remaining tests. You may want to first try and pass the "basic persistence" test (go test -run 'TestPersist1$'), and then tackle the remaining ones.
You will need to encode the state as an array of bytes in order to pass it to the Persister; raft.go contains some example code for this in persist() and readPersist().
In order to pass some of the challenging tests towards the end, such as those marked "unreliable", you will need to implement the optimization to allow a follower to back up the leader's nextIndex by more than one entry at a time. See the description in the extended Raft paper starting at the bottom of page 7 and top of page 8 (marked by a gray line).
Test | Points |
---|---|
BasicAgree | 6 |
FailAgree | 6 |
FailNoAgree | 6 |
ConcurrentStarts | 6 |
Rejoin | 6 |
Backup | 7 |
Count | 7 |
Persist1 | 7 |
Persist2 | 7 |
Persist3 | 7 |
Figure8$ | 7 |
UnreliableAgree | 7 |
Figure8Unreliable | 7 |
ReliableChurn | 7 |
UnreliableChurn | 7 |
You hand in your assignment as before.
$ git commit -am "[you fill me in]"
$ git tag -a -m "i finished assignment 4" a4-handin
$ git push origin master a4-handin
Recall, in order to overwrite a tag use the force flag as follows.
$ git tag -fam "i finished assignment 4" a4-handin
$ git push -f --tags
You should verify that you are able to see your final commit and tags on the Github page of your repository for this assignment.
You will receive full credit for Part I if your software passes the tests mentioned for that section on the CS servers. You will receive full credit for Part II if your software passes the tests mentioned for that section on the CS servers.
This assignment is adapted from MIT's 6.824 course. Thanks to Frans Kaashoek, Robert Morris, and Nickolai Zeldovich for their support.