shaorui0/README.md

Rui Shao

Senior Site Reliability Engineer by day. I build production control systems for large-scale Kubernetes and cloud environments (6 AWS regions, 30+ clusters, 600+ nodes). Outside work: AI infrastructure, agentic workflows, and tools that extend what one person can operate alone.

Blog: shaorui0.github.io · GitHub: @shaorui0 · Email: [email protected]

Core positioning

I treat production as a controllable system: make change safer, isolate failure, shorten recovery, and keep efficiency sustainable under SLO guardrails.

What I focus on

  • Safer change: progressive rollouts, pause/rollback gates, compatibility preflights (CRDs/webhooks/add-ons).
  • Traffic control across clusters/regions: weighted routing for canary, cutover, failover, and staged failback.
  • Observability that's actionable during incidents: service-impact signals (latency/error/saturation, SLO burn), incident views that join metrics/logs/changes/deps.
  • Cost work with reliability guardrails: Reserved/On-Demand/Spot by workload criticality + interruption tolerance; lifecycle/tiered storage; topology-aware placement.
  • On-call recovery: symptom-first triage (refused vs timeout, waiting vs upstream latency, app vs node pressure), smallest-safe mitigation, codified runbooks.

Experience highlights

Datavisor — Senior Site Reliability Engineer (Japan) (2025/10–Present); Site Reliability Engineer (China) (2025/03–2025/10)

AI Agent Engineering

  • Built an autonomous SRE triage agent as a Claude Code skill, used in daily on-call to auto-investigate production alerts via structured debug trees and MCP tool calls (VictoriaMetrics, Grafana, Loki).
  • Designed a 3-layer, code-level safety system (query validation, cluster-tier enforcement, scope tracking) that keeps all agent operations read-only; unknown environments default to the most restrictive policy.
  • Engineered a 132-file knowledge base (6-cluster routing, 7 executable debug trees, 35 incident cases) with JSONL observability tracing for every tool call and safety decision.
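
As an illustration only, the three safety layers named above (query validation, cluster-tier enforcement, scope tracking) could be sketched in Python as follows. This is not the agent's actual code: the class name `SafetyGate`, the mutating-verb list, the tier names, and the per-tier target limits are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical list of verbs treated as mutating (not taken from the source).
MUTATING_VERBS = {"delete", "apply", "patch", "scale", "restart", "exec", "edit"}

# Hypothetical per-tier limit on distinct targets an agent session may touch
# in one cluster. The tier names and numbers are illustrative assumptions.
TIER_MAX_TARGETS = {"dev": 20, "staging": 10, "prod": 3}
STRICTEST_TIER = "prod"


@dataclass
class SafetyGate:
    cluster_tiers: dict                          # cluster name -> tier name
    touched: dict = field(default_factory=dict)  # cluster name -> set of targets

    def validate_query(self, query: str) -> bool:
        """Layer 1: accept only queries that contain no mutating verb."""
        return not any(word in MUTATING_VERBS for word in query.lower().split())

    def tier_for(self, cluster: str) -> str:
        """Layer 2: unknown clusters fall back to the most restrictive tier."""
        return self.cluster_tiers.get(cluster, STRICTEST_TIER)

    def check(self, cluster: str, target: str, query: str):
        """Run all three layers; return (allowed, reason)."""
        if not self.validate_query(query):
            return False, "query is not read-only"
        tier = self.tier_for(cluster)
        seen = self.touched.setdefault(cluster, set())
        # Layer 3: scope tracking caps how many distinct targets are touched.
        if target not in seen and len(seen) >= TIER_MAX_TARGETS[tier]:
            return False, f"scope limit reached for tier '{tier}'"
        seen.add(target)
        return True, f"allowed under '{tier}' policy"


gate = SafetyGate(cluster_tiers={"dev-1": "dev"})
print(gate.check("dev-1", "ns/payments", "kubectl get pods"))
print(gate.check("dev-1", "ns/payments", "kubectl delete pod x"))
print(gate.check("unknown-cluster", "ns/a", "kubectl get pods"))
```

Falling back to the strictest tier for any cluster missing from the map mirrors the "unknown environments default to the most restrictive policy" rule; a real gate would likely parse queries properly rather than split on whitespace.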

Infrastructure & Reliability (AWS 6 regions | K8s 30+ clusters | 600+ nodes)

  • Led production Kubernetes upgrades (1.24→1.29) across 30+ clusters; built an automation tool that cut per-cluster upgrade time from 18–21 h to 3–4 h, with zero incidents.
  • Designed a distributed automatic traffic-switching system across 4 AWS/K8s backends, reducing failover time from a manual 5–15 min to seconds.

Observability & On-call (40 tenants | ~1.2M active series)

  • Built full-stack monitoring (VictoriaMetrics + Grafana + Loki) serving 30 clusters and 40 tenants; migrated from Prometheus Federation, eliminating weekly OOM and reducing data lag from 45s to <5s.
  • Designed 3-tier alerting with per-tenant SLI rules, enabling tenant-level fault isolation and SLA tracking.
  • Primary on-call for 30+ production clusters; resolved 20+ P1/P2 incidents with avg MTTR ~30 min.

Intel — Cloud Software Development Engineer (China) (2022/06–2025/02)

Developed Go microservices and Kubernetes operators for platform tooling; owned production monitoring infrastructure and collaborated with product teams to design end-to-end observability.

Earlier internships

  • Tencent (Software Development Engineer Intern, 2021/05–2021/09): Go gRPC microservices for live-streaming workloads (Kafka, MySQL, Redis).
  • ByteDance (Software Development Engineer Intern, 2021/03–2021/05)
  • App Annie (Web Backend Engineer Intern, 2020/11–2021/02)
  • Baidu (Data Engineer Intern, 2020/01–2020/10)

Projects

  • shaorui0/agents — SRE Triage Agent (Anthropic SDK reimplementation with full agent loop, safety gates, eval framework)
  • shaorui0/context-infrastructure — Personal AI-native context system, memory architecture, and agentic workflow infrastructure
  • shaorui0/chatgpt_flow — Browser extension to extract and organize ChatGPT conversation data

Skills

  • Languages: Python, Go, Java, Shell
  • Infra & data: AWS, Kubernetes, Helm, Docker, MySQL, ClickHouse, Redis, Kafka
  • Observability & platform: Prometheus, VictoriaMetrics, InfluxDB, Grafana, Loki, Jenkins, Ansible
  • Human languages: Chinese, English

Education

  • Master — University of Science & Technology Beijing, Computer Science and Technology (2019/09–2022/06)
  • Bachelor — Wenhua College, Electronic Information Engineering (2014/09–2018/06)

Side interests

Outside of production work, I spend a lot of time at the intersection of AI and infrastructure:

  • Building agentic workflows and multi-agent systems — not just using LLMs but wiring them into real operational loops
  • AI-native tooling: context systems, memory architectures, orchestration patterns (heavy Claude Code / OpenCode user and contributor)
  • Writing about SRE mental models, AI collaboration, and what it actually takes to operate production systems at scale
