fix: add replication-id to implement master/slave Replication #1951
src/pika_repl_client_conn.cc
Outdated
if (meta_sync.run_id() == "" || g_pika_server->master_run_id() != meta_sync.run_id()) {
LOG(INFO) << "Run id is not equal, need to do full sync, remote master run id: " << meta_sync.run_id()
<< ", local run id: " << g_pika_server->master_run_id();
if (meta_sync.run_id() == "" || g_pika_conf->replication_id() != meta_sync.replication_id()) {
1
done
conf/pika.conf
Outdated
max-rsync-parallel-num : 4

# replicaiton-id
Please add an English comment here explaining what the ReplicationID is.
done
include/pika_admin.h
Outdated
@@ -462,6 +463,18 @@ class DiskRecoveryCmd : public Cmd {
std::map<std::string, uint64_t> background_errors_;
};

class ClearReplicateIDCmd : public Cmd {
ClearReplicateIDCmd ==> ClearReplicationIDCmd
done
include/pika_admin.h
Outdated
@@ -288,6 +288,7 @@ class ConfigCmd : public Cmd {
void ConfigSet(std::string& ret);
void ConfigRewrite(std::string& ret);
void ConfigResetstat(std::string& ret);
void ConfigRewriteReplicateID(std::string& ret);
ConfigRewriteReplicateID => ConfigRewriteReplicationID
done
include/pika_admin.h
Outdated
@@ -462,6 +463,18 @@ class DiskRecoveryCmd : public Cmd {
std::map<std::string, uint64_t> background_errors_;
};

class ClearReplicateIDCmd : public Cmd {
public:
ClearReplicateIDCmd(const std::string& name, int arity, uint16_t flag) : Cmd(name, arity, flag) {}
ClearReplicateIDCmd ==> ClearReplicationIDCmd
done
include/pika_command.h
Outdated
@@ -51,6 +51,7 @@ const std::string kCmdNameQuit = "quit";
const std::string kCmdNameHello = "hello";
const std::string kCmdNameCommand = "command";
const std::string kCmdNameDiskRecovery = "diskrecovery";
const std::string kCmdNameClearReplicateID = "clearreplicateid";
kCmdNameClearReplicateID => kCmdNameClearReplicationID
done
include/pika_conf.h
Outdated
@@ -577,6 +587,7 @@ class PikaConf : public pstd::BaseConf {

int Load();
int ConfigRewrite();
int ConfigRewriteReplicateID();
ConfigRewriteReplicateID => ConfigRewriteReplicationID
done
src/pika_admin.cc
Outdated
@@ -1324,6 +1324,8 @@ void ConfigCmd::Do(std::shared_ptr<Slot> slot) {
ConfigRewrite(config_ret);
} else if (strcasecmp(config_args_v_[0].data(), "resetstat") == 0) {
ConfigResetstat(config_ret);
} else if (strcasecmp(config_args_v_[0].data(), "rewritereplicateid") == 0) {
rewritereplicateid => rewritereplicationid
done
src/pika_repl_client_conn.cc
Outdated
if (meta_sync.run_id() == "" || g_pika_server->master_run_id() != meta_sync.run_id()) {
LOG(INFO) << "Run id is not equal, need to do full sync, remote master run id: " << meta_sync.run_id()
<< ", local run id: " << g_pika_server->master_run_id();
if (meta_sync.replication_id() == "" || g_pika_conf->replication_id() != meta_sync.replication_id()) {
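If replication_id is used as the unique identifier of a group, then an empty meta_sync.replication_id indicates an abnormal condition. In that case I don't think the state should advance normally; we could keep sending metasync requests, or move the state to kerror.
The condition g_pika_conf->replication_id() != meta_sync.replication_id() needs to be split. If the two differ because the local replication_id is empty, this is a newly added empty node, and a full sync is indeed required. If the local replication_id is non-empty but differs, the node previously belonged to another group. As I understand it, that case needs manual intervention to determine whether it was an operator mistake or the environment was not cleaned up.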
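The split suggested in this comment could be sketched roughly as below. This is only an illustration of the reviewer's proposal, not the code that was merged: `MetaSyncAction` and `ClassifyMetaSync` are hypothetical names, and the real handler would act on Pika's own replication state machine rather than return an enum.

```cpp
#include <cassert>
#include <string>

// Hypothetical outcomes for the slave side of a MetaSync response.
enum class MetaSyncAction {
  kRetryMetaSync,       // master sent no replication id: abnormal, do not advance
  kFullSync,            // local id is empty: a fresh node joining the group
  kManualIntervention,  // local id is set but differs: node came from another group
  kContinue             // ids match: go on to the usual offset comparison
};

// Classifies the response by checking the remote id first, then the local id.
MetaSyncAction ClassifyMetaSync(const std::string& local_id,
                                const std::string& remote_id) {
  if (remote_id.empty()) {
    return MetaSyncAction::kRetryMetaSync;
  }
  if (local_id.empty()) {
    return MetaSyncAction::kFullSync;
  }
  if (local_id != remote_id) {
    return MetaSyncAction::kManualIntervention;
  }
  return MetaSyncAction::kContinue;
}
```

Checking the remote id before the local one means a fresh node never starts a full sync against a master that has not yet reported an id.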
src/pika_repl_client_conn.cc
Outdated
LOG(INFO) << "Run id is not equal, need to do full sync, remote master run id: " << meta_sync.run_id()
<< ", local run id: " << g_pika_server->master_run_id();
if (meta_sync.replication_id() == "" || g_pika_conf->replication_id() != meta_sync.replication_id()) {
LOG(INFO) << "Replication id is not equal, need to do full sync, remote replication id: " << meta_sync.replication_id()
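When this feature goes live, we need to consider how to upgrade smoothly from 3.5 to 3.5.1. It would be best to try the upgrade process on a 3.5 test cluster first.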
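When this feature goes live, we need to consider how to upgrade smoothly from 3.5 to 3.5.1. It would be best to try the upgrade process on a 3.5 test cluster first.
The standalone version of 3.5.0 has not gone live yet, so this can be set aside for now. Merge this code in and treat it as the 3.5.0 binary.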
src/pika_inner_message.proto
Outdated
@@ -117,6 +117,7 @@ message InnerResponse {
required bool classic_mode = 1;
repeated DBInfo dbs_info = 2;
required string run_id = 3;
required string replication_id = 4;
This probably shouldn't be required; otherwise there will be compatibility problems.
src/pika_conf.cc
Outdated
@@ -687,6 +689,30 @@ int PikaConf::ConfigRewrite() {
return static_cast<int>(WriteBack());
}

int PikaConf::ConfigRewriteReplicationID() {
std::string userblacklist = suser_blacklist();
This variable doesn't seem to be used?
done
src/pika_repl_client_conn.cc
Outdated
// First synchronization between the master and slave
if (g_pika_conf->replication_id() != meta_sync.replication_id()) {
LOG(INFO) << "Replication id is not equal, need to do full sync, remote replication id: " << meta_sync.replication_id()
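The log message here may need to change. Judging from the log alone, it reads much the same as the message at line 133.
As I understand it, the logic reaches this point because the current node is joining the replication_group as a new, empty node, so the message should be distinguishable from the one at line 133.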
done
4754a92
conf/pika.conf
Outdated
max-rsync-parallel-num : 4

# The synchronization mode of Pika primary/secondary replication is determined by ReplicationID. ReplicationID in one cluster are the same
Change this to replication_cluster; otherwise it can be confused with the cluster in cluster mode.
…omFoundation#1951) * Add replication-id to implement master/slave Replication --------- Co-authored-by: wuxianrong <[email protected]>
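Background
Currently, when Pika performs a master switchover, the other slave nodes do a full sync with the new master. Before the switchover, however, the data on the original slaves and on the new master is essentially the same (while the old master was running normally, all slaves held essentially consistent data). We therefore think that during a master switchover the slaves should do an incremental sync with the new master instead.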
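Proposed improvement
When the first slave instance runs slaveof ip port against the master, the master generates a ReplicationID, and the generated ReplicationID is persisted to the slave. When a Pika instance starts, its ReplicationID:Offset defaults to 0:-1. We also provide an interface for operations staff, ClearReplicateID. Operators run this command on a node that has left the cluster, and it resets that node's ReplicationID to the default 0.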
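Deciding between full and incremental replication
When a slave instance connects to a master, it sends the slaveof command to start master/slave synchronization. The slave sends the ReplicationID it holds to the master, where it is compared against the master's ReplicationID and Offset so that the missing incremental data can be fetched. However, if the Offset gap relative to the master's buffer is too large, or the ReplicationID held by the slave cannot be matched on the current master, the slave performs a full sync.
When a slave instance needs to restart and should only do an incremental sync afterward, it only needs to send its ReplicationID and Offset to the master.
Cases for incremental and full replication:
The slave's ReplicationID cannot be matched on the master, Offset = -1: full replication
The slave's ReplicationID can be matched on the master, but the Offset gap is too large: full replication
The slave's ReplicationID can be matched on the master, and the Offset gap is within range: incremental replication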
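The three cases above can be sketched as a single decision function. This is a minimal illustration rather than Pika's actual implementation: `SlaveReplInfo`, `ChooseSyncMode`, and `max_lag` are hypothetical names, and treating a slave whose offset is ahead of the master as needing a full sync is an assumption the description does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class SyncMode { kFull, kIncremental };

// Hypothetical view of what a slave reports to the master.
struct SlaveReplInfo {
  std::string replication_id;  // empty or unknown id means "never synced"
  int64_t offset;              // -1 means "no data yet"
};

// master_offset: the master's current offset.
// max_lag: the largest gap the master can still serve incrementally (assumed).
SyncMode ChooseSyncMode(const SlaveReplInfo& slave, const std::string& master_id,
                        int64_t master_offset, int64_t max_lag) {
  // Case 1: the id cannot be matched, or the slave has no offset at all.
  if (slave.replication_id != master_id || slave.offset < 0) {
    return SyncMode::kFull;
  }
  // Case 2: the id matches but the gap is out of range.
  // Assumption: a slave that is ahead of the master also needs a full sync.
  const int64_t lag = master_offset - slave.offset;
  if (lag < 0 || lag > max_lag) {
    return SyncMode::kFull;
  }
  // Case 3: the id matches and the gap is within range.
  return SyncMode::kIncremental;
}
```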
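Questions
ClearReplicateID calls configset under the hood to change the ReplicationID value, and that value lives only in memory. If the instance crashes and restarts at that point, will the ReplicationID be lost?
Our plan is for ClearReplicateID to call configset to modify the ReplicationID, then call an overloaded rewrite method that persists the ReplicationID value to disk.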
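The "update in memory, then persist" idea can be sketched as follows. This is a standalone illustration, not Pika's `PikaConf` code: the `Config` map, `RewriteConfig`, and `ClearReplicationID` are hypothetical, the `key : value` line format is modeled on the `conf/pika.conf` hunk above, and resetting to an empty string rather than `0` is an assumption. Writing to a temporary file and renaming it relies on POSIX `rename` semantics, where an existing target is replaced.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <map>
#include <string>

// Hypothetical in-memory config: key -> value.
using Config = std::map<std::string, std::string>;

// Writes every entry as "key : value" to a temp file, then renames it over
// `path`, so a crash mid-write does not leave a truncated config behind.
bool RewriteConfig(const Config& cfg, const std::string& path) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::trunc);
    if (!out) {
      return false;
    }
    for (const auto& [key, value] : cfg) {
      out << key << " : " << value << "\n";
    }
    out.flush();
    if (!out) {
      return false;
    }
  }  // the stream is closed here, before the rename
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Resets the id in memory first, then persists the whole config.
bool ClearReplicationID(Config& cfg, const std::string& path) {
  cfg["replication-id"] = "";  // assumption: empty string is the default
  return RewriteConfig(cfg, path);
}
```

Because the in-memory change and the rewrite happen in one call, a restart right after the command returns reads the cleared value from disk.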
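Is the ReplicationID within a cluster guaranteed to be unique?
Yes. A cluster maintains only one ReplicationID. It is generated when the first slave runs slaveof against the master: the master generates it and persists it to its own config file, and the ReplicationID is also persisted to the slave's config file.
After a slave goes offline and reconnects, can it still obtain the ReplicationID?
Yes. Because the ReplicationID is persisted in each instance's config file, it can be read every time the instance comes back online.
How is the condition for generating a ReplicationID determined?
Every instance starts with an empty-string ReplicationID by default. Only at the first data sync between master and slave does the system check whether the master's own ReplicationID is empty. If it is empty, a ReplicationID is generated; otherwise none is generated. A master switchover therefore does not generate a new ReplicationID. We can consider the ReplicationID fixed once a cluster is first established, and it does not change unless the cluster disappears entirely.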
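The generate-only-if-empty rule can be sketched like this. The description does not say how the id is generated, so the 40-character random hex string and the names `GenerateReplicationID` and `EnsureReplicationID` are assumptions made for illustration.

```cpp
#include <cassert>
#include <random>
#include <string>

// Hypothetical generator: a 40-character random hex string.
std::string GenerateReplicationID() {
  static const char kHex[] = "0123456789abcdef";
  std::random_device rd;
  std::mt19937_64 gen(rd());
  std::uniform_int_distribution<int> nibble(0, 15);
  std::string id;
  id.reserve(40);
  for (int i = 0; i < 40; ++i) {
    id.push_back(kHex[nibble(gen)]);
  }
  return id;
}

// Called on the master at the first master/slave sync.
// Returns true if a new id was generated, false if one already existed.
bool EnsureReplicationID(std::string& master_id) {
  if (!master_id.empty()) {
    return false;  // already established: a master switchover keeps the old id
  }
  master_id = GenerateReplicationID();
  return true;
}
```

Calling `EnsureReplicationID` a second time leaves the id unchanged, which is what lets slaves keep doing incremental syncs after a switchover.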
When is the ReplicationID cleared?
There are two cases:
Note that in a production environment this operation is usually performed by operations staff.
close #1952