fix: add replication-id to implement master/slave Replication #1951

Merged
merged 4 commits into from
Sep 8, 2023

@Mixficsol Mixficsol commented Sep 1, 2023

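Background

Currently, when Pika performs a master switchover, the other slave nodes run a full sync against the new master. Before the switchover, however, the data on the former slaves and on the new master is essentially identical (while the old master was running normally, all slaves held essentially the same data). We therefore believe that a switchover should trigger an incremental sync between the slaves and the new master rather than a full sync.

Proposed improvement

When the first slave instance runs slaveof ip port against the master, the master generates a ReplicationID, and the generated ReplicationID is persisted to the slave. When a Pika instance starts, ReplicationID:Offset defaults to 0:-1. We also provide an interface for operations staff, ClearReplicateID: operators run this command on a node that has left the cluster, and it resets that node's ReplicationID to the default 0.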

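Choosing between full and incremental replication

When a slave instance connects to a master, it sends the slaveof command to start master/slave sync. The slave sends the ReplicationID it holds to the master, where it is compared with the master's ReplicationID and Offset so that the missing incremental data can be identified. However, if the Offset gap in the master's buffer is too large, or the ReplicationID held by the slave cannot be matched on the current master, the slave performs a full sync.

When a slave instance has to restart and wants only an incremental sync afterwards, it only needs to send its ReplicationID and Offset to the master.
The cases for incremental and full replication:

  1. The slave's ReplicationID cannot be matched on the master, Offset = -1: full replication

  2. The slave's ReplicationID can be matched on the master, but the Offset gap is too large: full replication

  3. The slave's ReplicationID can be matched on the master, and the Offset gap is within range: incremental replication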
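
The three cases above can be sketched as a small decision function. This is only an illustration and not Pika's implementation: `SyncMode`, `SlaveState`, `MasterState`, and `ChooseSyncMode` are hypothetical names, and "Offset gap too large" is assumed to mean that the slave's Offset falls outside the range the master still buffers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical types for illustration; none of these names come from Pika.
enum class SyncMode { kFull, kIncremental };

struct SlaveState {
  std::string replication_id;  // empty: the slave has never joined a group
  int64_t offset;              // -1: nothing has been replicated yet
};

struct MasterState {
  std::string replication_id;
  int64_t backlog_start;  // oldest offset still held in the master's buffer (assumed)
  int64_t backlog_end;    // newest offset written by the master (assumed)
};

SyncMode ChooseSyncMode(const SlaveState& slave, const MasterState& master) {
  assert(master.backlog_start <= master.backlog_end);
  // Case 1: the ReplicationID cannot be matched, or there is no usable offset.
  if (slave.replication_id.empty() ||
      slave.replication_id != master.replication_id || slave.offset < 0) {
    return SyncMode::kFull;
  }
  // Case 2: the ID matches, but the offset lies outside the buffered range.
  if (slave.offset < master.backlog_start || slave.offset > master.backlog_end) {
    return SyncMode::kFull;
  }
  // Case 3: the ID matches and the offset is within range.
  return SyncMode::kIncremental;
}
```

A slave that reports a matching ReplicationID and an Offset inside the buffered range is the only combination that yields `SyncMode::kIncremental`; every other combination falls back to a full sync.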

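Questions

  • ClearReplicateID calls configset underneath to change the ReplicationID value, and that value lives in memory. If the instance crashes and restarts at that point, will the ReplicationID be lost?

    Our idea is that ClearReplicateID calls configset to modify the ReplicationID value, and then calls an overloaded rewrite method so that the ReplicationID value is persisted to disk.

  • Is the ReplicationID guaranteed to be unique within a cluster?

    Yes. A cluster maintains only one ReplicationID. It is generated when the first slave runs slaveof against the master: the master generates it and persists it to its config file, and the ReplicationID is also persisted to the slave's config file.

  • Can a slave still obtain the ReplicationID after it disconnects and reconnects?

    Yes. Because the ReplicationID is persisted in each instance's config file, it can be read every time the instance comes back online.

  • How is the condition for generating a ReplicationID determined?

    Every instance starts with an empty-string ReplicationID by default. Only at the first master/slave data sync does the system check whether the master's own ReplicationID is empty: if it is, a ReplicationID is generated; otherwise none is generated. A master switchover therefore does not generate a new ReplicationID. We can consider the ReplicationID fixed once the cluster is first established; it does not change unless the cluster disappears entirely.

  • When is the ReplicationID cleared?

    There are two cases:

    1. When an instance calls shutdown and the process exits, ClearReplicateID is called to clear the ReplicationID
    2. When an instance leaves the current cluster, ClearReplicateID is called to clear the ReplicationID

    Note that in production this operation is usually performed by operations staff.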
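
The lifecycle described in these answers (generate only when empty, persist on every change, clear on leaving) can be sketched as follows. `ReplicationIdStore` and its methods are hypothetical names, the `persist` callback merely stands in for writing the value back to the config file, and the 40-character random hex format is an assumption, since the PR description does not say how the ID is produced.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <utility>

// Hypothetical helper for illustration; not part of Pika.
class ReplicationIdStore {
 public:
  // `persist` stands in for rewriting the config file on disk.
  explicit ReplicationIdStore(std::function<void(const std::string&)> persist)
      : persist_(std::move(persist)) {
    assert(persist_ != nullptr);
  }

  // Called on the master at the first master/slave sync.
  // Generates an ID only if none exists, so a switchover keeps the old one.
  const std::string& EnsureGenerated() {
    if (id_.empty()) {
      id_ = RandomHex(40);
      persist_(id_);
    }
    return id_;
  }

  // Called when the instance leaves the group; the cleared value is persisted
  // too, so a crash and restart does not bring the old ID back.
  void Clear() {
    id_.clear();
    persist_(id_);
  }

  const std::string& id() const { return id_; }

 private:
  static std::string RandomHex(std::size_t len) {
    static const char kDigits[] = "0123456789abcdef";
    std::random_device rd;
    std::mt19937_64 gen(rd());
    std::uniform_int_distribution<int> pick(0, 15);
    std::string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) out.push_back(kDigits[pick(gen)]);
    return out;
  }

  std::string id_;
  std::function<void(const std::string&)> persist_;
};
```

Persisting inside both `EnsureGenerated` and `Clear` reflects the first question above: changing only the in-memory value would be lost if the instance restarted.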

close #1952

@Mixficsol Mixficsol changed the title fix: add replication-id to implement master/slave Replication WIP: add replication-id to implement master/slave Replication Sep 1, 2023
@Mixficsol Mixficsol changed the title WIP: add replication-id to implement master/slave Replication fix: add replication-id to implement master/slave Replication Sep 4, 2023
if (meta_sync.run_id() == "" || g_pika_server->master_run_id() != meta_sync.run_id()) {
LOG(INFO) << "Run id is not equal, need to do full sync, remote master run id: " << meta_sync.run_id()
<< ", local run id: " << g_pika_server->master_run_id();
if (meta_sync.run_id() == "" || g_pika_conf->replication_id() != meta_sync.replication_id()) {
Collaborator

1

Collaborator Author

done

conf/pika.conf Outdated
max-rsync-parallel-num : 4

# replicaiton-id
Collaborator

Please add an English comment here that explains what the ReplicationID is.

Collaborator Author

done

chejinge previously approved these changes Sep 7, 2023
@@ -462,6 +463,18 @@ class DiskRecoveryCmd : public Cmd {
std::map<std::string, uint64_t> background_errors_;
};

class ClearReplicateIDCmd : public Cmd {
Contributor

ClearReplicateIDCmd ==> ClearReplicationIDCmd

Collaborator Author

done

@@ -288,6 +288,7 @@ class ConfigCmd : public Cmd {
void ConfigSet(std::string& ret);
void ConfigRewrite(std::string& ret);
void ConfigResetstat(std::string& ret);
void ConfigRewriteReplicateID(std::string& ret);
Contributor

ConfigRewriteReplicateID => ConfigRewriteReplicationID

Collaborator Author

done

@@ -462,6 +463,18 @@ class DiskRecoveryCmd : public Cmd {
std::map<std::string, uint64_t> background_errors_;
};

class ClearReplicateIDCmd : public Cmd {
public:
ClearReplicateIDCmd(const std::string& name, int arity, uint16_t flag) : Cmd(name, arity, flag) {}
Contributor

ClearReplicateIDCmd ==> ClearReplicationIDCmd

Collaborator Author

done

@@ -51,6 +51,7 @@ const std::string kCmdNameQuit = "quit";
const std::string kCmdNameHello = "hello";
const std::string kCmdNameCommand = "command";
const std::string kCmdNameDiskRecovery = "diskrecovery";
const std::string kCmdNameClearReplicateID = "clearreplicateid";
Contributor

kCmdNameClearReplicateID => kCmdNameClearReplicationID

Collaborator Author

done

@@ -577,6 +587,7 @@ class PikaConf : public pstd::BaseConf {

int Load();
int ConfigRewrite();
int ConfigRewriteReplicateID();
Contributor

ConfigRewriteReplicateID => ConfigRewriteReplicationID

Collaborator Author

done

@@ -1324,6 +1324,8 @@ void ConfigCmd::Do(std::shared_ptr<Slot> slot) {
ConfigRewrite(config_ret);
} else if (strcasecmp(config_args_v_[0].data(), "resetstat") == 0) {
ConfigResetstat(config_ret);
} else if (strcasecmp(config_args_v_[0].data(), "rewritereplicateid") == 0) {
Contributor

rewritereplicateid => rewritereplicationid

Collaborator Author

done

if (meta_sync.run_id() == "" || g_pika_server->master_run_id() != meta_sync.run_id()) {
LOG(INFO) << "Run id is not equal, need to do full sync, remote master run id: " << meta_sync.run_id()
<< ", local run id: " << g_pika_server->master_run_id();
if (meta_sync.replication_id() == "" || g_pika_conf->replication_id() != meta_sync.replication_id()) {
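Collaborator

If replication_id serves as the unique identifier of a group, then an empty meta_sync.replication_id indicates an abnormal situation. In that case I don't think the state should keep flowing forward as normal: the node can keep sending metasync requests, or the state can transition to kerror.
The condition g_pika_conf->replication_id() != meta_sync.replication_id() needs to be split. If the two differ because the local replication_id is empty, this is a newly added empty node, and a full sync is indeed required. If the local replication_id is non-empty but different, the node previously belonged to another group; as I understand it, this needs manual intervention to determine whether it was an operator error or the environment was not cleaned up.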
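
The split proposed in this comment could look roughly like the sketch below. It only illustrates the reviewer's suggestion and is not the code that was merged: `MetaSyncAction` and `ClassifyMetaSync` are hypothetical names, and how each outcome maps onto Pika's slave state machine is left open.

```cpp
#include <cassert>
#include <string>

// Hypothetical outcome type for illustration; not part of Pika.
enum class MetaSyncAction {
  kRetryOrError,        // master sent no replication_id: abnormal
  kFullSync,            // local replication_id is empty: new empty node
  kManualIntervention,  // local replication_id belongs to another group
  kProceed,             // IDs match: continue with the normal sync flow
};

MetaSyncAction ClassifyMetaSync(const std::string& local_id,
                                const std::string& remote_id) {
  if (remote_id.empty()) return MetaSyncAction::kRetryOrError;
  if (local_id.empty()) return MetaSyncAction::kFullSync;
  if (local_id != remote_id) return MetaSyncAction::kManualIntervention;
  return MetaSyncAction::kProceed;
}

// Small self-check that exercises each branch once.
inline void SelfCheck() {
  assert(ClassifyMetaSync("a", "") == MetaSyncAction::kRetryOrError);
  assert(ClassifyMetaSync("", "a") == MetaSyncAction::kFullSync);
  assert(ClassifyMetaSync("b", "a") == MetaSyncAction::kManualIntervention);
  assert(ClassifyMetaSync("a", "a") == MetaSyncAction::kProceed);
}
```

Checking the remote ID first means an empty response is never mistaken for a fresh-node case, which is the distinction the comment asks for.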

LOG(INFO) << "Run id is not equal, need to do full sync, remote master run id: " << meta_sync.run_id()
<< ", local run id: " << g_pika_server->master_run_id();
if (meta_sync.replication_id() == "" || g_pika_conf->replication_id() != meta_sync.replication_id()) {
LOG(INFO) << "Replication id is not equal, need to do full sync, remote replication id: " << meta_sync.replication_id()
Collaborator

When this feature ships, we need to consider how to upgrade smoothly from 3.5 to 3.5.1. It would be best to try the upgrade process on a 3.5 test cluster first.

Collaborator

> When this feature ships, we need to consider how to upgrade smoothly from 3.5 to 3.5.1. It would be best to try the upgrade process on a 3.5 test cluster first.

The standalone version of 3.5.0 has not gone live yet, so this can be set aside for now; merge this code in and treat it as the 3.5.0 binary.

wanghenshui previously approved these changes Sep 8, 2023
include/pika_admin.h (resolved)
@@ -117,6 +117,7 @@ message InnerResponse {
required bool classic_mode = 1;
repeated DBInfo dbs_info = 2;
required string run_id = 3;
required string replication_id = 4;
Collaborator

This shouldn't be required, should it? Otherwise there will be compatibility issues.
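
The compatibility concern is that, in proto2, a message lacking a `required` field fails to parse, so a peer built before the field existed could no longer be understood. If the field were optional instead, the receiver would have to handle its absence. A minimal sketch of that handling follows, using a hand-written stand-in rather than the protobuf-generated class: `MetaSyncResponse`, its `has_replication_id` flag, and `ReplicationIdOrEmpty` are all hypothetical.

```cpp
#include <cassert>
#include <string>

// Hand-written stand-in for the decoded response; the real class is
// generated by protobuf and is not reproduced here.
struct MetaSyncResponse {
  std::string run_id;
  bool has_replication_id;     // false when the peer predates the field
  std::string replication_id;  // meaningful only if has_replication_id
};

// Treats a missing field like an empty ReplicationID, so an older peer is
// handled by the same path as a master that has not generated an ID yet.
std::string ReplicationIdOrEmpty(const MetaSyncResponse& resp) {
  return resp.has_replication_id ? resp.replication_id : std::string();
}

// Small self-check covering the absent and present cases.
inline void SelfCheckOptionalField() {
  assert(ReplicationIdOrEmpty(MetaSyncResponse{"run", false, ""}).empty());
  assert(ReplicationIdOrEmpty(MetaSyncResponse{"run", true, "abc"}) == "abc");
}
```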

src/pika_conf.cc Outdated
@@ -687,6 +689,30 @@ int PikaConf::ConfigRewrite() {
return static_cast<int>(WriteBack());
}

int PikaConf::ConfigRewriteReplicationID() {
std::string userblacklist = suser_blacklist();
Collaborator

This variable doesn't appear to be used?

Collaborator Author

done


// First synchronization between the master and slave
if (g_pika_conf->replication_id() != meta_sync.replication_id()) {
LOG(INFO) << "Replication id is not equal, need to do full sync, remote replication id: " << meta_sync.replication_id()
Collaborator

The log message here may need to change. Judging from the log alone, it reads much the same as the message on line 133.
As I understand it, the logic reaches this point because the current node is joining the replication_group as a new, empty node, so the log message should be distinguishable from the one on line 133.

Collaborator Author

done

wangshao1 previously approved these changes Sep 8, 2023
chejinge previously approved these changes Sep 8, 2023
@Mixficsol Mixficsol dismissed stale reviews from chejinge, wangshao1, and wanghenshui via 4754a92 September 8, 2023 08:26
conf/pika.conf Outdated
max-rsync-parallel-num : 4

# The synchronization mode of Pika primary/secondary replication is determined by ReplicationID. ReplicationID in one cluster are the same
Collaborator

Change this to replication_cluster here; otherwise readers will confuse it with the "cluster" of cluster mode.

@chejinge chejinge merged commit a3e5243 into OpenAtomFoundation:unstable Sep 8, 2023
11 checks passed
@Mixficsol Mixficsol deleted the add_replicationid branch September 25, 2023 09:19
bigdaronlee163 pushed a commit to bigdaronlee163/pika that referenced this pull request Jun 8, 2024
…omFoundation#1951)

* Add replication-id to implement master/slave Replication

---------

Co-authored-by: wuxianrong <[email protected]>
cheniujh pushed a commit to cheniujh/pika that referenced this pull request Sep 24, 2024
…omFoundation#1951)

* Add replication-id to implement master/slave Replication

---------

Co-authored-by: wuxianrong <[email protected]>
Development

Successfully merging this pull request may close these issues.

Use ReplicationID for master/slave replication
5 participants