<h1>How We Migrated Millions of Data Without Downtime</h1>
<p>Recently, my team and I managed to migrate millions of our users’ records with no downtime. In this post, I’m going to share why we did it, how we did it, and what we learned from it.</p>
<h2 id="background--why-do-we-need-to-migrate-anyway">Background – Why Do We Need To Migrate, anyway?</h2>
<p>Initially, our organization relied on a single monolithic database (MongoDB) to manage a wide range of functions, including user identification, authentication, authorization, content checking, and payments. However, as the demands on our system have grown, it has become clear that this single, all-encompassing database hinders our ability to manage and expand our services efficiently and effectively.</p>
<p>To achieve separation of concerns and improve efficiency, our organization decided to undertake a database migration project. Through this migration, we aim to create dedicated databases for specific functions, such as authentication & authorization, two-factor authentication, and passwordless login.</p>
<p>In short:<br />
We need a dedicated database to store users’ account data – for authentication and/or authorization purposes.</p>
<h2 id="setup-the-goals">Setup The Goals</h2>
<p>Before the migration, our architecture looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673317805539/533b23a0-c64d-4110-92a9-69ef788cb84c.png" alt="Original architecture before the database separation/migration" /></p>
<p>It has changed to this:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673317935043/06d00324-3954-4137-b2cd-3ce659ab98d5.png" alt="The Architecture after implementing the migration" /></p>
<p>Please note that in this post we refer to the Main DB as the old database and the new Account DB as the new database.</p>
<h2 id="research--preparation">Research & Preparation</h2>
<blockquote>
<p>“If I only had an hour to chop down a tree, I would have spent the first 45 minutes sharpening my axe.” – Abraham Lincoln</p>
</blockquote>
<p>This is the most important step. Before starting the migration, we spent time on preparation and research to make sure the migration would run as expected, with minimal (or even zero) impact on the live production environment.</p>
<p>In this step, we identified the entities to be migrated, calculated their size, analyzed what kind of database we needed for the new Account DB, and finally defined the migration steps.</p>
<h3 id="identify-entities-to-migrate">Identify Entities To Migrate</h3>
<p>After analyzing the current implementation of the <code class="language-plaintext highlighter-rouge">auth</code> service, we found that we needed to migrate at least five entities (called <code class="language-plaintext highlighter-rouge">collections</code> in MongoDB, or <code class="language-plaintext highlighter-rouge">tables</code> in SQL databases). Those collections are: <code class="language-plaintext highlighter-rouge">users</code>, <code class="language-plaintext highlighter-rouge">access_tokens</code>, <code class="language-plaintext highlighter-rouge">teams</code>, <code class="language-plaintext highlighter-rouge">authentications</code>, and <code class="language-plaintext highlighter-rouge">clients</code>.</p>
<p>In total, around 32 million documents – about 25 GiB of data – had to be migrated from our <code class="language-plaintext highlighter-rouge">Main DB</code> (old DB) to the new <code class="language-plaintext highlighter-rouge">Account DB</code>.</p>
<h3 id="what-kind-of-database-do-we-need">What Kind of Database do We Need?</h3>
<p>As mentioned earlier, our Main DB runs on a MongoDB server, and it runs very well. So, what kind of database are we going to use in this separation/migration project?</p>
<p>This might be debatable, but after several discussions we went with MongoDB (again? Yes). Here are our considerations:</p>
<ul>
<li>Easy to scale – High Availability</li>
<li>We don’t need strong transaction guarantees here</li>
<li>No requirements for joining many different types of data</li>
<li>MongoDB also provides the <a href="https://www.mongodb.com/docs/manual/core/index-ttl/">Time To Live Index</a> – we can auto-delete expired <code class="language-plaintext highlighter-rouge">access_tokens</code> without adding an additional service such as a job scheduler (see the sketch after this list)</li>
</ul>
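<p>As a quick illustration, a TTL index on <code class="language-plaintext highlighter-rouge">access_tokens</code> could be created with the official MongoDB Go driver roughly as follows. This is only a sketch, not our production code; the connection string, database name, and the <code class="language-plaintext highlighter-rouge">expires_at</code> field name are assumptions for the example.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(ctx)

    // TTL index: MongoDB removes a document automatically once its `expires_at`
    // time is in the past. Database, collection, and field names are assumptions.
    coll := client.Database("account").Collection("access_tokens")
    _, err = coll.Indexes().CreateOne(ctx, mongo.IndexModel{
        Keys:    bson.D{{Key: "expires_at", Value: 1}},
        Options: options.Index().SetExpireAfterSeconds(0),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("TTL index created")
}
</code></pre></div></div>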
<h3 id="migration-strategy">Migration Strategy</h3>
<p><strong>Can we do this migration in one go and be done with it?</strong><br />
The answer is <strong>NO</strong>. Many services currently read from and write to the collections we need to migrate in the Main DB, so we can’t simply move those collections to the new DB and switch everything to read/write from/to the new DB at once.</p>
<p>In the simplest terms, yes, the plan is just to migrate all of the data from the old DB to the new one and then switch the services to read/write to the new DB. Unfortunately, it is not that straightforward, because we are working with a monolithic database that receives a lot of read and write requests from many services. Instead, we have to split the overall migration process into several granular steps or phases.</p>
<p><strong>How do we actually migrate the data?</strong>
So, let’s split the migration process into several actionable phases:</p>
<ul>
<li>“Double Write” Any New Changes</li>
<li>Copy data from the Main DB (old DB) to the new Account DB – Do we need downtime?</li>
<li>Make sure the data in the Main DB and the new Account DB is always in sync</li>
<li>Once we’re confident enough, we can start reading from the new Account DB</li>
<li>Then, finally, we can stop writing related data to the Main DB</li>
</ul>
<h2 id="migration-implementation">Migration Implementation</h2>
<h3 id="double-write-any-new-changes--aka-replication">“Double Write” Any New Changes – a.k.a Replication</h3>
<p>At this point, we need a mechanism for synchronizing any new changes in the Main DB (old DB) to the new Account DB, so that every change made to the Main DB is mirrored in the new Account DB. How? Here are the solutions we considered:</p>
<p><em>It’s important to note that there are other options available, but for the purpose of this post, we will be focusing specifically on the comparison of two options, and explaining why we chose one over the other.</em></p>
<p><strong>Synchronization via API end-points</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673574181465/32bb1758-c092-41a3-bdd3-9db535621673.png" alt="Synchronization via API end-points Architecture" /></p>
<p>So, here is the flow:</p>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">auth</code> service establishes connections to the new Account DB</li>
<li>The <code class="language-plaintext highlighter-rouge">auth</code> service reserves API end-points for synchronizing new changes to <code class="language-plaintext highlighter-rouge">users</code>, <code class="language-plaintext highlighter-rouge">access_tokens</code>, <code class="language-plaintext highlighter-rouge">teams</code>, <code class="language-plaintext highlighter-rouge">clients</code>, and <code class="language-plaintext highlighter-rouge">authentications</code> collections in the new <code class="language-plaintext highlighter-rouge">Account DB</code></li>
<li>Quipper Platforms such as: <code class="language-plaintext highlighter-rouge">LEARN API</code>, <code class="language-plaintext highlighter-rouge">LINK API</code>, <code class="language-plaintext highlighter-rouge">Back-Office</code>, and other services will act as the clients of the API end-points defined by <code class="language-plaintext highlighter-rouge">auth</code> service</li>
<li>On every change that happens to those collections in Main DB, Quipper Platforms will make an HTTP (or RPC) call to <code class="language-plaintext highlighter-rouge">auth</code> service to synchronize those changes</li>
<li>The <code class="language-plaintext highlighter-rouge">auth</code> service will receive each request and process it (basically, CRUD to the new Account DB)</li>
</ul>
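<p>To make the flow above a bit more concrete, here is a rough sketch of what one of those synchronization end-points could have looked like in the <code class="language-plaintext highlighter-rouge">auth</code> service. It is purely illustrative – we did not build it this way in the end – and the route, payload shape, database name, and helper names are assumptions.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "encoding/json"
    "log"
    "net/http"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// syncUserHandler would receive a changed user document from a platform service
// and upsert it into the new Account DB. Upserting by _id keeps retries idempotent.
func syncUserHandler(accountDB *mongo.Database) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var user bson.M
        if err := json.NewDecoder(r.Body).Decode(&user); err != nil {
            http.Error(w, "invalid payload", http.StatusBadRequest)
            return
        }
        _, err := accountDB.Collection("users").ReplaceOne(
            r.Context(),
            bson.M{"_id": user["_id"]},
            user,
            options.Replace().SetUpsert(true),
        )
        if err != nil {
            http.Error(w, "failed to sync user", http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }
}

func main() {
    ctx := context.Background()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://account-db:27017"))
    if err != nil {
        log.Fatal(err)
    }
    http.Handle("/internal/sync/users", syncUserHandler(client.Database("account")))
    log.Fatal(http.ListenAndServe(":8080", nil))
}
</code></pre></div></div>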
<p>We think this solution is quite simple, but there are several drawbacks:</p>
<ul>
<li>This solution is not reliable enough – it is prone to network failures</li>
<li>It would also put a huge additional load on the <code class="language-plaintext highlighter-rouge">auth</code> service</li>
<li>This “mirroring” process is not the <code class="language-plaintext highlighter-rouge">auth</code> service’s responsibility anyway – we don’t want to risk our <code class="language-plaintext highlighter-rouge">auth</code> service going down while handling requests that are not even its responsibility</li>
<li>And we would have to make changes in many places: <code class="language-plaintext highlighter-rouge">API Learn</code>, <code class="language-plaintext highlighter-rouge">Educator API</code>, <code class="language-plaintext highlighter-rouge">Back-Office</code>, etc.</li>
</ul>
<p>So, we didn’t choose this option.</p>
<p><strong>Change Data Capture: The MongoDB Changestreams</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673326943934/1d5c9ef6-b038-4f5f-b225-957c603d2d7c.png" alt="Synchronization via MongoDB ChangeStreams" /></p>
<p>After careful consideration, we chose this option. Here, we fully utilize the MongoDB feature called <a href="https://www.mongodb.com/docs/manual/changeStreams/">Change Streams</a>. Our Main DB can stream every event/change that happens inside it, and at the other end, our app can watch/listen to every streamed event and process it further.</p>
<p>We also introduced a new service called <code class="language-plaintext highlighter-rouge">auth-double-writer</code>, written in Go. Its responsibility is to <em>replicate any changes</em>: it watches (listens to) every change that happens to the relevant collections in the Main DB and writes those changes to the new Account DB.</p>
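<p>To make this more concrete, here is a simplified sketch of such a watcher using the official MongoDB Go driver. It is only an illustration of the idea, not our actual <code class="language-plaintext highlighter-rouge">auth-double-writer</code> code: the connection strings and database names are assumptions, and real code also needs retries, resume-token handling (shown later), and more careful error handling.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "log"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    oldClient, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://main-db:27017"))
    if err != nil {
        log.Fatal(err)
    }
    newClient, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://account-db:27017"))
    if err != nil {
        log.Fatal(err)
    }

    // Watch only the collections we migrate and only the event types we care about.
    pipeline := mongo.Pipeline{
        bson.D{{Key: "$match", Value: bson.D{
            {Key: "ns.coll", Value: bson.D{{Key: "$in", Value: bson.A{
                "users", "access_tokens", "teams", "authentications", "clients",
            }}}},
            {Key: "operationType", Value: bson.D{{Key: "$in", Value: bson.A{
                "insert", "update", "replace", "delete",
            }}}},
        }}},
    }
    opts := options.ChangeStream().SetFullDocument(options.UpdateLookup)
    stream, err := oldClient.Database("main").Watch(ctx, pipeline, opts)
    if err != nil {
        log.Fatal(err)
    }
    defer stream.Close(ctx)

    for stream.Next(ctx) {
        collName := stream.Current.Lookup("ns", "coll").StringValue()
        docKey := stream.Current.Lookup("documentKey").Document()
        target := newClient.Database("account").Collection(collName)

        switch stream.Current.Lookup("operationType").StringValue() {
        case "delete":
            _, err = target.DeleteOne(ctx, docKey)
        default: // insert, update, replace: upsert the full document
            fullDoc, ok := stream.Current.Lookup("fullDocument").DocumentOK()
            if !ok {
                continue // e.g. the document was deleted before the lookup
            }
            _, err = target.ReplaceOne(ctx, docKey, fullDoc, options.Replace().SetUpsert(true))
        }
        if err != nil {
            log.Printf("failed to replicate change to %s: %v", collName, err)
        }
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}
</code></pre></div></div>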
<h4 id="mongodb-change-streams">MongoDB Change Streams</h4>
<blockquote>
<p>Change Streams allows applications to access real-time data changes without the complexity and risk of tailing the <a href="https://www.mongodb.com/docs/manual/reference/glossary/#std-term-oplog">oplog</a>. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them.</p>
</blockquote>
<ul>
<li>Change Streams is available for <a href="https://www.mongodb.com/docs/manual/replication/">replica sets</a> and <a href="https://www.mongodb.com/docs/manual/sharding/">sharded clusters</a></li>
<li>Watch a Collection, Database, or Deployment</li>
<li>Modify Change Stream Output – <code class="language-plaintext highlighter-rouge">$addFields</code>, <code class="language-plaintext highlighter-rouge">$match</code>, <code class="language-plaintext highlighter-rouge">$project</code>, etc</li>
<li>MongoDB Changestream is resumable – <code class="language-plaintext highlighter-rouge">resumeAfter</code>, <code class="language-plaintext highlighter-rouge">startAfter</code></li>
<li>Use Cases:
<ul>
<li>Extract, Transform, and Load (ETL) services</li>
<li>Cross-platform synchronization</li>
<li>Collaboration functionality</li>
<li>Notification services</li>
</ul>
</li>
<li>Change Events (v6.0):
<ul>
<li><code class="language-plaintext highlighter-rouge">create</code></li>
<li><code class="language-plaintext highlighter-rouge">createIndexes</code></li>
<li><code class="language-plaintext highlighter-rouge">delete</code></li>
<li><code class="language-plaintext highlighter-rouge">drop</code></li>
<li><code class="language-plaintext highlighter-rouge">dropDatabase</code></li>
<li><code class="language-plaintext highlighter-rouge">dropIndexes</code></li>
<li><code class="language-plaintext highlighter-rouge">insert</code></li>
<li><code class="language-plaintext highlighter-rouge">invalidate</code></li>
<li><code class="language-plaintext highlighter-rouge">modify</code></li>
<li><code class="language-plaintext highlighter-rouge">rename</code></li>
<li><code class="language-plaintext highlighter-rouge">replace</code></li>
<li><code class="language-plaintext highlighter-rouge">shardCollection</code></li>
<li><code class="language-plaintext highlighter-rouge">update</code></li>
</ul>
</li>
</ul>
<h3 id="copy-data-from-old-db-to-the-new-db">Copy Data from Old DB to the New DB</h3>
<p>We have two options on how we will execute the actual migration step. The first one is with downtime required, and the second one is without downtime.</p>
<p><strong>With Downtime</strong>
Here are the steps:</p>
<ol>
<li>Turn off <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
<li>Turn off Quipper services (downtime)</li>
<li>Run the job to copy the data from Main DB to the New Account DB</li>
<li>Turn back on <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
<li>Turn back on Quipper services</li>
</ol>
<p><strong>Without Downtime</strong>
It is possible to run the migration without downtime since our <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service has a “pause and resume” capability (thanks to MongoDB Changestream’s resume token). Here are the steps:</p>
<ol>
<li>Turn off <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
<li>Run the job to copy the data from Main DB to the New Account DB</li>
<li>Turn back on <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
</ol>
<p>We determined that the latter option is preferable: a migration with no downtime is the best course of action, as it maintains continuity of service for our users and minimizes potential disruption.</p>
<h3 id="execute-the-migration-with-zero-downtime">Execute The Migration With Zero-Downtime</h3>
<p>We finally managed to execute the migration with zero downtime (cheers!). The question is: how did we do that? Let me explain.</p>
<p>The main reason we managed to execute the migration with zero downtime is that <strong>MongoDB Change Streams are resumable</strong>, and <code class="language-plaintext highlighter-rouge">auth-double-writer</code> takes full advantage of this capability.</p>
<p>We designed <code class="language-plaintext highlighter-rouge">auth-double-writer</code> so that it can be paused and then resume from the point where it left off. It can therefore keep listening to the event stream as if there had been no disruption.</p>
<p>This is what actually happened:</p>
<ul>
<li>When we turn off the <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service, it stores the last Change Stream resume token it has processed in a persistent datastore (see the sketch after this list)</li>
<li>Then, we executed our main task: copy the data from the Main DB to the new Account DB
<ul>
<li>We’ve carefully tested this step</li>
<li>We’ve run the job several times before (for testing purposes)</li>
<li>Based on the test results, we calculated that on average this job takes 30 minutes to run. This is a safe number, since our MongoDB oplog retains the Change Stream events for about one hour</li>
</ul>
</li>
<li>We turned the <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service back on. It picks up the resume token from the datastore, so it continues listening from the point at which it was turned off</li>
<li>We checked data integrity and compared the size and the number of records between the Main DB and the new Account DB. Thankfully, everything matched</li>
<li>Now the data is fully replicated from the Main DB to the new Account DB and kept in sync</li>
</ul>
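<p>For illustration, the pause-and-resume behaviour could look roughly like the sketch below. This is a minimal sketch, not our actual implementation: the file path, database name, and connection string are assumptions, and where the token is persisted (a file, Redis, a MongoDB collection, etc.) is an implementation choice.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "log"
    "os"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

const tokenFile = "/var/lib/auth-double-writer/resume-token.bson"

// loadResumeToken returns the last persisted resume token, or nil on the first run.
func loadResumeToken() bson.Raw {
    data, err := os.ReadFile(tokenFile)
    if err != nil {
        return nil
    }
    return bson.Raw(data)
}

// saveResumeToken persists the token after each processed event, so a restarted
// process can resume exactly where the previous one stopped.
func saveResumeToken(token bson.Raw) error {
    return os.WriteFile(tokenFile, token, 0o600)
}

func main() {
    ctx := context.Background()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://main-db:27017"))
    if err != nil {
        log.Fatal(err)
    }

    opts := options.ChangeStream().SetFullDocument(options.UpdateLookup)
    if token := loadResumeToken(); token != nil {
        // Resume from the saved position; the oplog must still contain this token.
        opts.SetResumeAfter(token)
    }

    stream, err := client.Database("main").Watch(ctx, mongo.Pipeline{}, opts)
    if err != nil {
        log.Fatal(err)
    }
    defer stream.Close(ctx)

    for stream.Next(ctx) {
        // ... replicate the event to the new Account DB (see the earlier sketch) ...
        if err := saveResumeToken(stream.ResumeToken()); err != nil {
            log.Printf("failed to persist resume token: %v", err)
        }
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}
</code></pre></div></div>
<p>The important constraint, as described above, is that the copy job has to finish while the oplog still holds the saved token; that is why the measured 30-minute runtime against a one-hour oplog window gave us a comfortable margin.</p>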
<h2 id="conclusion">Conclusion</h2>
<p>As we have seen throughout this article, database migration is a complex process that requires careful planning and execution. However, the benefits of separating concerns, improving performance, and ensuring continuity of service through zero-downtime migration make it worth the effort. With this migration, our organization will be better equipped to handle future demands, and we will continue to deliver the best possible service to our users.</p>
<p>Thank you for taking the time to read this post. See you later.</p>
<p><em>Originally published at <a href="https://tirasundara.hashnode.dev">https://tirasundara.hashnode.dev</a></em>.</p>
<h1>Automate handling a number of Pull Requests by Renovate in Terraform Monorepo</h1>
<p>Original article in Japanese: <em><a href="https://blog.studysapuri.jp/entry/2022/02/18/080000">Renovate の大量の Pull Request を処理する技術</a></em></p>
<p>In this post, I’d like to introduce techniques for handling a large number of pull requests from <a href="https://docs.renovatebot.com/">Renovate</a> in a Terraform Monorepo.</p>
<h2 id="background">Background</h2>
<p>We manage a Terraform Monorepo, and recently we’ve migrated its CI from AWS CodeBuild to GitHub Actions and <a href="https://github.com/suzuki-shunsuke/tfaction">tfaction</a>.</p>
<p><em><a href="https://devs.quipper.com/2022/02/25/terraform-github-actions.html">2022-02-25 Migrate Terraform CI from AWS CodeBuild to GitHub Actions</a></em></p>
<p>We have about 400 working directories (Terraform States), and the following tool versions are managed in each working directory.</p>
<ul>
<li>Terraform</li>
<li>Terraform Provider</li>
<li><a href="https://github.com/terraform-linters/tflint">tflint</a></li>
<li>tflint plugin</li>
<li><a href="https://github.com/aquasecurity/tfsec">tfsec</a></li>
<li>etc</li>
</ul>
<p>If a single package is used by multiple services in a Monorepo, Renovate updates all of them in a single pull request by default. We use <a href="https://docs.renovatebot.com/configuration-options/#additionalbranchprefix">additionalBranchPrefix</a> to separate pull requests per working directory.</p>
<p>e.g.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"additionalBranchPrefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{{packageFileDir}}-"</span><span class="p">,</span><span class="w">
</span><span class="nl">"commitMessageSuffix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"({{packageFileDir}})"</span><span class="p">,</span><span class="w">
</span><span class="nl">"matchManagers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"terraform"</span><span class="p">,</span><span class="w">
</span><span class="s2">"regex"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This way, when a tool is updated, nearly 400 pull requests need to be merged.
Reviewing such a large number of pull requests one by one is difficult for humans and not worth the effort.
Therefore, if CI succeeds and the result of <code class="language-plaintext highlighter-rouge">terraform plan</code> shows no change, it is desirable to merge automatically.
If the number of pull requests that can be merged per day is too small, we wouldn’t be able to fully process the pull requests and tools wouldn’t be updated properly.</p>
<h2 id="solution">Solution</h2>
<p>To handle a large number of pull requests from Renovate automatically, we took the following actions.</p>
<ol>
<li>Enable <a href="https://docs.renovatebot.com/configuration-options/#automerge">automerge</a></li>
<li>Enable <a href="https://docs.renovatebot.com/configuration-options/#platformautomerge">platformAutomerge</a></li>
<li>Set <a href="https://docs.renovatebot.com/configuration-options/#prhourlylimit">prHourlyLimit</a> to 0</li>
<li>Set <a href="https://docs.renovatebot.com/configuration-options/#prconcurrentlimit">prConcurrentLimit</a> to 5</li>
<li>Limit branchConcurrentLimit too</li>
<li>Update the feature branch and enable automerge automatically when the automerge is disabled due to the update of base branch</li>
<li>Close the pull request and delete the feature branch immediately when CI fails</li>
<li>Skip terraform plan and apply for updates other than Terraform and Terraform Provider</li>
<li>Install not only <a href="https://github.com/apps/renovate-approve">Renovate Approve</a> but also <a href="https://github.com/apps/renovate-approve-2">Renovate Approve 2</a> to prevent missed approvals</li>
<li>Set <a href="https://docs.renovatebot.com/configuration-options/#prpriority">prPriority</a> to prevent some tools from blocking other tools’ update</li>
<li>Replace <code class="language-plaintext highlighter-rouge">GITHUB_TOKEN</code> with a GitHub App token to prevent API rate limiting</li>
</ol>
<h3 id="1-enable-automerge">1. Enable automerge</h3>
<p>If you enable automerge, Renovate will merge pull requests automatically.
If one approval is required to merge pull requests, you can use <a href="https://github.com/apps/renovate-approve">Renovate Approve</a>.</p>
<p>However, automerge alone is known to take a rather long time to merge pull requests – it can take several hours. To address this, you can enable platformAutomerge.</p>
<h3 id="2-enable-platformautomerge">2. Enable platformAutomerge</h3>
<p>When platformAutomerge is enabled, pull requests are merged as soon as the conditions are met, using GitHub’s auto-merge feature.</p>
<h4 id="notes-on-github-automerge">Notes on GitHub Automerge</h4>
<p>Please use GitHub Automerge carefully; otherwise your pull requests may be merged even if CI fails.</p>
<ul>
<li><a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-auto-merge-for-pull-requests-in-your-repository">You have to enable Allow auto-merge in the repository setting</a></li>
<li>The base branch must be protected by <a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/managing-a-branch-protection-rule">Branch Protection Rule</a>
<ul>
<li>You must select at least one status check in <code class="language-plaintext highlighter-rouge">Status checks that are required</code>, otherwise Automerge cannot be enabled</li>
</ul>
</li>
</ul>
<p>Be aware that a pull request will be merged even if checks other than the ones you selected in <code class="language-plaintext highlighter-rouge">Status checks that are required</code> fail.
It will also be merged if a GitHub Actions job is skipped by <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idif">if</a>.</p>
<p><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/build-matrix">We run GitHub Actions’ multiple jobs in parallel by build matrix</a>, but it is difficult to add those jobs to <code class="language-plaintext highlighter-rouge">Status checks that are required</code> because executed jobs are changed dynamically.
So <a href="https://github.com/suzuki-shunsuke/tfaction-example/blob/c3dff91fbcd7df77171c13878e3382cf001c8232/.github/workflows/test.yaml#L163-L170">we add a job which depends on the build matrix, and add it to <code class="language-plaintext highlighter-rouge">Status checks that are required</code></a>.</p>
<p>There is still a problem that the pull request would be merged even if other workflows fail, but we tolerate this because it rarely happens and we can fix it when it does.</p>
<h3 id="3-set-prhourlylimit-to-0">3. Set prHourlyLimit to 0</h3>
<p>Renovate has several limits that restrict the creation of pull requests.
Note that even if a limit is unlimited by default, it may be restricted by the preset <a href="https://docs.renovatebot.com/presets-config/#configbase">config:base</a>.</p>
<table>
<thead>
<tr>
<th>config</th>
<th>default</th>
<th><code class="language-plaintext highlighter-rouge">config:base</code></th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://docs.renovatebot.com/configuration-options/#prhourlylimit">prHourlyLimit</a></td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td><a href="https://docs.renovatebot.com/configuration-options/#prconcurrentlimit">prConcurrentLimit</a></td>
<td>0</td>
<td>10</td>
</tr>
<tr>
<td><a href="https://docs.renovatebot.com/configuration-options/#branchconcurrentlimit">branchConcurrentLimit</a></td>
<td><code class="language-plaintext highlighter-rouge">prConcurrentLimit</code></td>
<td> </td>
</tr>
</tbody>
</table>
<p>prHourlyLimit is limited to 2 by <code class="language-plaintext highlighter-rouge">config:base</code>, which means that only 2 pull requests will be created per hour.
So, explicitly set it to 0 so that an unlimited number of pull requests can be created.</p>
<h3 id="4-set-prconcurrentlimit-to-5">4. Set prConcurrentLimit to 5</h3>
<p>Renovate tries to create as many pull requests as possible within the above limits.
If Terraform CI runs terraform plan and apply for many of them at the same time, CI would probably fail due to API rate limiting.
Also, GitHub Automerge may be automatically disabled when the base branch is updated.</p>
<p>For these reasons, we set prConcurrentLimit to 5.</p>
<h3 id="5-limit-branchconcurrentlimit-too">5. Limit branchConcurrentLimit too</h3>
<p><a href="https://docs.renovatebot.com/configuration-options/#branchconcurrentlimit">branchConcurrentLimit</a> is a limit based on the number of branches.
I thought we didn’t have to limit pull requests by the number of branches, so I set it to 0 at first, but that was a mistake.
It seems that branches are created even if no pull requests are created, so more than 1000 branches were created unnecessarily.
Since branchConcurrentLimit is the same as prConcurrentLimit by default, we explicitly set only prConcurrentLimit and not branchConcurrentLimit.</p>
<h3 id="6-update-the-feature-branch-and-enable-automerge-automatically-when-the-automerge-is-disabled-due-to-the-update-of-base-branch">6. Update the feature branch and enable automerge automatically when the automerge is disabled due to the update of base branch</h3>
<p>GitHub Automerge may be automatically disabled when the base branch is updated.</p>
<p><img src="https://user-images.githubusercontent.com/13323303/150432569-0b1f3f01-d09d-4b26-842e-3d0cccb24f33.png" alt="image" /></p>
<p>To merge these pull requests automatically,
you can update feature branches and re-enable automerge automatically by GitHub Actions.</p>
<p><img src="https://user-images.githubusercontent.com/13323303/153967962-3f6c456c-b307-47f5-8125-47368fa252c2.png" alt="image" /></p>
<p><a href="https://github.com/suzuki-shunsuke/reenable-automerge-action">https://github.com/suzuki-shunsuke/reenable-automerge-action</a></p>
<h3 id="7-close-the-pull-request-and-delete-the-feature-branch-immediately-when-ci-fails">7. Close the pull request and delete the feature branch immediately when CI fails</h3>
<p>Since we have set prConcurrentLimit and branchConcurrentLimit, leaving Renovate pull requests open will limit the number of new pull requests that can be created.
Therefore, we decided to close pull requests that could not be automerged and delete feature branches automatically.</p>
<p><a href="https://github.com/suzuki-shunsuke/renovate-autoclose-action">https://github.com/suzuki-shunsuke/renovate-autoclose-action</a></p>
<p><a href="https://docs.github.com/en/search-github/searching-on-github/searching-issues-and-pull-requests">You can search closed pull requests with simple query</a> like <code class="language-plaintext highlighter-rouge">is:pr is:unmerged author:app/renovate</code>, and can also be found in Renovate’s <a href="https://docs.renovatebot.com/key-concepts/dashboard/">Dependency Dashboard</a>.</p>
<h3 id="8-skip-terraform-plan-and-apply-for-updates-other-than-terraform-and-terraform-provider">8. Skip terraform plan and apply for updates other than Terraform and Terraform Provider</h3>
<p><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/renovate">With tfaction, CI would fail if the result of terraform plan of pull request by Renovate is not No Change to prevent dangerous changes from being applied by terraform apply</a>.
However, sometimes tools such as tfsec and tflint couldn’t be updated due to this failure.</p>
<p>tfsec and tflint are not related to terraform plan and apply, so you don’t have to run terraform plan and apply to update them.</p>
<p><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/support-skipping-terraform-renovate-pr">Since tfaction v0.4.9, tfaction supports skipping terraform plan and apply in Renovate pull requests</a>,
so we’re using the feature.</p>
<p>This also speeds up CI and prevents API rate limiting.</p>
<h3 id="9-install-not-only-renovate-approve-but-also-renovate-approve-2-to-prevent-approve-omissions">9. Install not only Renovate Approve but also Renovate Approve 2 to prevent approve omissions</h3>
<p>We don’t know the reason, but sometimes <a href="https://github.com/apps/renovate-approve">Renovate Approve</a> does not approve pull requests as expected.
So we also installed <a href="https://github.com/apps/renovate-approve-2">Renovate Approve 2</a> to prevent missed approvals.
This app is meant to be used when two approvals are needed, but we think it can also be used to prevent missed approvals.
So far, we haven’t had any missed approvals since we installed Renovate Approve 2.</p>
<h3 id="10-adjust-prpriority-properly">10. Adjust prPriority properly</h3>
<p>Frequently updated tools like Terraform and the AWS Provider may block other tools’ updates for a long time.
If you want to prioritize other tools’ updates, you can adjust <a href="https://docs.renovatebot.com/configuration-options/#prpriority">prPriority</a>.</p>
<h3 id="11-replace-github_token-to-github-apps-token-to-prevent-api-rate-limiting">11. Replace GITHUB_TOKEN to GitHub App’s token to prevent API rate limiting</h3>
<p>tfaction takes a GitHub access token as input.
By default <code class="language-plaintext highlighter-rouge">secrets.GITHUB_TOKEN</code> is used, but as the number of builds per hour increases, API rate limiting may occur.
So we switched to a GitHub App token, which has a less strict rate limit than <code class="language-plaintext highlighter-rouge">secrets.GITHUB_TOKEN</code>.
For details about the rate limits, please see the documentation:</p>
<ul>
<li><a href="https://docs.github.com/en/rest/overview/resources-in-the-rest-api#requests-from-github-actions">https://docs.github.com/en/rest/overview/resources-in-the-rest-api#requests-from-github-actions</a></li>
<li><a href="https://docs.github.com/en/developers/apps/building-github-apps/rate-limits-for-github-apps">https://docs.github.com/en/developers/apps/building-github-apps/rate-limits-for-github-apps</a></li>
</ul>
<p>To switch, you need to modify GitHub App permissions (<code class="language-plaintext highlighter-rouge">issues: read</code> is required).
Furthermore, you also need to switch GitHub Access Token for <a href="https://github.com/suzuki-shunsuke/tfaction-example/blob/c3dff91fbcd7df77171c13878e3382cf001c8232/.github/workflows/hide_comment.yaml#L19-L21">github-comment hide</a>, because <a href="https://github.com/suzuki-shunsuke/github-comment#hide">github-comment hide</a> only hides comments from the same user.</p>
<h2 id="conclusion">Conclusion</h2>
<p>As a result of the above actions, we are now able to create and merge about 500 pull requests a day into a single repository.
This number still has room for improvement (we think it could be up to 700 or so), but it is still sufficient for the current situation.
We used to check and respond to open pull requests from time to time, but by automating tasks as much as possible, we only have to deal with those that really need to be dealt with by humans, and this has reduced our workload.</p>
<h1>Vision, Mission and Values to make SRE team more sustainable</h1>
<p>My name is <a href="https://github.com/yuya-takeyama">@yuya-takeyama</a> and I am the Engineering Manager in the Global SRE Team.</p>
<p>Previously, our company had only one SRE team and I was the manager of that team, but we split the SRE team between Japan and Global because we were developing different products in Japan and other countries.
I am now in charge of launching the Global SRE Team.</p>
<p>This article is about what I did with my pre-split SRE team.
At that time, we defined our Vision, Mission, and Values with our team.</p>
<p>Quipper has a company Vision, Mission, and Identities.</p>
<ul>
<li>Vision: Distributors of Wisdom</li>
<li>Mission: Bringing the Best Education to Every Corner of the World</li>
<li>Identities: User-first, Diversity, Ownership, Fact-based, Growth</li>
</ul>
<p>Although these have been established for more than a few years, they are still as important to Quipper employees as ever.</p>
<p>However, the day-to-day work of SREs does not directly contribute to teaching and learning.
Of course, we do support them in ways they cannot see.</p>
<p>Therefore, we decided to establish a Vision as a future that is more intuitive to our team, Mission as what we should do to achieve it, and Values as the values that are important in our daily activities.
The current team, after the team split, is still working under this Vision, Mission, and Values.</p>
<p>The following is a quick introduction.</p>
<h2 id="vision-mission-and-values-of-the-sre-team">Vision, Mission, and Values of the SRE Team</h2>
<h3 id="vision-realize-a-development-organization-that-can-continue-to-create-the-best-learning-products">Vision: Realize a development organization that can continue to create the best learning products</h3>
<p>Vision is a future that is not there yet, but should be aimed for and created.</p>
<p>In this context, we consider the “development organization” to be the direct customer for the team.
The easiest way to explain the development organization is that it is the people who belong to the Product Development Division which makes Quipper products.
Designers, Developers, Product Managers, QA Engineers and SREs are part of this division.</p>
<p>In a broader sense, however, product development involves a diverse range of people on a daily basis.
In our products, the people who create content (learning materials) also play an important role in the product.</p>
<p>In order to execute our mission without falling into local optimization, it is important to understand that a wide variety of people are involved in product development in many different ways.</p>
<h3 id="mission-create-a-platform-and-culture-for-self-contained-teams-to-continue-to-deliver-product-quickly-and-safely">Mission: Create a platform and culture for self-contained teams to continue to deliver product quickly and safely.</h3>
<p>Mission is what we do every day to realize our Vision.</p>
<p>A particularly important keyword in this context is “self-contained team”.
This has been a theme I have been talking about ever since I became an Engineering Manager, even before we defined this Vision, Mission, and Values.</p>
<p>In a nutshell, the relationship between the development team and the SRE team should not be one of “ask” and “receive”.</p>
<p>For example, if a new database is needed for the development of a new service, and the development team requests it, and the SRE team creates it and hands it over to them, what kind of problems will there be?
The lead time for infrastructure provisioning becomes long because of the waiting time after the request is made. And because it is difficult for the development team to control, it becomes an uncertainty in the development schedule.
In addition, such a structure makes it difficult to motivate the development team to think about the optimal database and architecture to use, which in turn will affect the quality of the product.</p>
<p>To prevent this from happening, we provide a <a href="https://devs.quipper.com/2022/02/25/terraform-github-actions.html">Terraform Platform for self-contained teams</a>, which allows development team members to manage any kind of cloud resources – like databases, cache servers, and message queues – by themselves.</p>
<p>And for applications, we similarly provide a Kubernetes Platform for self-contained teams that allows developers to build new services by themselves. They can continuously deploy services without the help of the SRE team.</p>
<p>Platforms that include tools such as CI/CD are visible in the form of source code, but to enable development teams to work as self-contained teams, it is not enough to have the tools alone. It is also necessary to understand and practice the methodology for developing as a self-contained team.</p>
<p>At Quipper, all development teams have SLOs, and all teams are able to regularly monitor and take action when there are problems. In addition, when problems such as failures occur, each team reviews the situation through postmortem.</p>
<p>Such measures may not require much effort for a one-time event. However, to make them sustainable, it is necessary to have a system and culture, not just a mentality.</p>
<p>We aim to realize our vision by having both a platform and a culture, and by continuously evolving them.</p>
<h3 id="values">Values</h3>
<p>We have defined four Values.
There are five Quipper Identities, but we kept our Values to four because we are conscious of the number of chunks that can fit in short-term memory.</p>
<ul>
<li>Fail smart
<ul>
<li>Do not blame failure, but use it as a learning opportunity. Also, control the scope of impact and incorporate failure into the process so that the greatest return can be obtained from the least risk.</li>
<li>Failure is to be avoided if possible, but in complex systems, failure cannot be reduced to zero. It is important to face failures properly with Postmortem and other tools, rather than treating them as absolute evils.</li>
<li>It is also important to actively utilize the remaining 0.1% of the 99.9% SLO through methodologies like Canary Release.</li>
</ul>
</li>
<li>Learning
<ul>
<li>Continue to see everything as an opportunity to learn and make necessary changes in order to discover and solve unknown problems.</li>
<li>In Peter Senge’s <a href="https://www.amazon.com/dp/0385517254">The Fifth Discipline: The Art & Practice of The Learning Organization</a> and in <a href="https://www.amazon.com/dp/1942788339">Accelerate: The Science of Lean Software and DevOps</a>, the importance of learning and improvement as an organization is emphasized.</li>
<li>“What we can do now” is important, but “what we will be able to do in the future” is even more important. It is necessary for the organization to continue to overcome the challenges while continuously updating the issue setting.</li>
</ul>
</li>
<li>Borderless
<ul>
<li>Communicate and collaborate across organizational boundaries to achieve greater results.</li>
<li>Individual ability is important, but a major job that has an impact on the product cannot be accomplished alone.</li>
<li>Especially since we are a functional organization, we cannot achieve results without actively working together to overcome the borders between ourselves and the development team.</li>
</ul>
</li>
<li>Metrics-driven
<ul>
<li>Measure all issues and things, see problems not as dots but as lines, and aim for flexible and automatic solutions.</li>
<li>Even if it is not difficult to solve each thing one by one, it is necessary to monitor and detect by indicators to avoid it continuously or in advance.</li>
<li>To solve problems systematically, it is necessary to model problems as indicators and control them or deal with their side effects.</li>
</ul>
</li>
</ul>
<p>These are the ways of being that we must value and build on in each of our actions as we carry out our Mission.</p>
<p>All of the members on our team are highly capable as individuals, but we feel that there are an increasing number of problems that cannot be solved by that alone.
In order to overcome them, we need to improve the quality of our problem-solving and try to solve them at a higher level, and to do so, I believe we need to make these our action guidelines.</p>
<h2 id="how-we-defined-the-vision-mission-and-values">How we defined the Vision, Mission, and Values?</h2>
<p>Although I was the one who proposed the idea of defining the Vision, Mission, and Values, the process was done by all team members at that time.</p>
<p>We would have liked to get together and discuss the Vision, Mission, and Values around a whiteboard, but due to COVID-19, we had to discuss them remotely.</p>
<p>After explaining why we were doing this, we presented our ideas of Vision, Mission, and Values to each other, and then we explored the values within each of them in depth, breaking them down into their elements and reconstructing them.</p>
<p>The reasons for doing this were explained as follows.</p>
<ul>
<li>To enable teams to work and make decisions in the same direction.</li>
<li>As a tool for matching in recruitment</li>
<li>To make the team more attractive to work with</li>
</ul>
<p>We are now explaining our Vision, Mission, and Values in our hiring interviews and hope to attract more candidates who understand our ideas.</p>
<h1>Migrate Terraform CI from AWS CodeBuild to GitHub Actions</h1>
<p><em>Author: <a href="https://github.com/suzuki-shunsuke">@suzuki-shunsuke</a>, SRE in Quipper</em></p>
<p>Original article in Japanese: <em><a href="https://blog.studysapuri.jp/entry/2022/02/04/080000">Terraform の CI を AWS CodeBuild から GitHub Actions + tfaction に移行しました</a></em></p>
<p>In this post, I’d like to talk about how we migrated our Terraform CI from AWS CodeBuild to GitHub Actions + tfaction.</p>
<h2 id="terraform-workflow-so-far-aws-codebuild">Terraform Workflow so far (AWS CodeBuild)</h2>
<p>Originally, we ran CI on AWS CodeBuild.
Before that we used CircleCI, but we migrated to AWS CodeBuild.</p>
<p>There are two main reasons why we migrated to AWS CodeBuild.</p>
<ul>
<li>Security
<ul>
<li>You can manage AWS resources without persistent Access Keys</li>
<li>Google Cloud Platform (GCP) can also be managed without Service Account Key by <a href="https://cloud.google.com/iam/docs/workload-identity-federation">Workload Identity Federation</a></li>
</ul>
</li>
<li>Dynamic workflow
<ul>
<li>In Monorepo, we would like to run CI only in the working directory where the code was changed by pull request</li>
<li>This can be achieved by generating a <a href="https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html">buildspec</a> dynamically during build and executing <a href="https://docs.aws.amazon.com/codebuild/latest/userguide/batch-build.html">Batch Build</a> with AWS CLI</li>
<li>CircleCI now supports dynamic workflow, but at the time it did not</li>
</ul>
</li>
</ul>
<p>The first reason in particular was a major strength of AWS CodeBuild.</p>
<p>Another advantage of AWS CodeBuild is that it can be run in AWS VPC.
We manage <a href="https://www.mongodb.com/atlas">MongoDB Atlas</a> by Terraform.
Atlas supports restricting API usage by source IP address; without this restriction, a leaked API key would be very risky.
So, we only allow access from the Elastic IP address of a specific AWS VPC NAT Gateway.</p>
<h2 id="oidc-support-for-github-actions">OIDC support for GitHub Actions</h2>
<p>However, the situation has changed dramatically since GitHub Actions started supporting OIDC to access AWS and GCP without a persistent access key.</p>
<p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect">https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect</a></p>
<p>You can also run GitHub Actions in VPC by <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners">GitHub Actions’ Self-hosted Runner</a>.
Since we already run Self-hosted Runners, we thought it would be relatively easy to run Terraform with them as well.</p>
<p>As the strengths of AWS CodeBuild became available in GitHub Actions, the momentum to migrate to GitHub Actions grew.</p>
<h2 id="reasons-for-migrating-to-github-actions">Reasons for migrating to GitHub Actions</h2>
<p>We decided to migrate from AWS CodeBuild to GitHub Actions for the following reasons.</p>
<ul>
<li>No more need to sign in to AWS to see CI logs and retry CI</li>
<li>GitHub Actions’ build matrix allows for a more natural dynamic workflow</li>
<li>Leverage Action ecosystem</li>
</ul>
<h3 id="no-more-need-to-sign-in-to-aws-to-see-ci-logs-and-retry-ci">No more need to sign in to AWS to see CI logs and retry CI</h3>
<p>It’s bothersome to sign in to AWS just to see CI logs and retry CI.
With GitHub Actions, you don’t have to sign in to AWS for either.</p>
<h3 id="github-actions-build-matrix-allows-for-a-more-natural-dynamic-workflow">GitHub Actions’ build matrix allows for a more natural dynamic workflow</h3>
<p>We achieved a dynamic workflow in AWS CodeBuild by generating a buildspec, uploading it to AWS S3, and running a <a href="https://docs.aws.amazon.com/codebuild/latest/userguide/batch-build.html">Batch Build</a> with the AWS CLI.
So the build is executed in two stages, and CI takes a bit of time.
Batch Build itself also takes some time to start and finish.</p>
<p><a href="https://docs.github.com/en/actions/using-jobs/using-a-build-matrix-for-your-jobs">GitHub Actions’ build matrix</a> allows for a more natural dynamic workflow.
There is no need to dynamically generate a buildspec and upload to S3.
It also makes CI faster.</p>
<h3 id="leverage-of-action-ecosystem">Leverage of Action ecosystem</h3>
<p>You can leverage GitHub Actions’ Action ecosystem.
By replacing existing shell scripts with Actions, you can reduce the number of maintenance targets and improve maintainability.</p>
<h2 id="adopt-tfaction">Adopt tfaction</h2>
<p>We have adopted tfaction, which is GitHub Actions collection for Opinionated Terraform Workflow.</p>
<p><a href="https://github.com/suzuki-shunsuke/tfaction">https://github.com/suzuki-shunsuke/tfaction</a></p>
<p>tfaction supports almost all of the features that we had originally implemented ourselves with shell scripts,
so we expected to be able to eliminate our shell scripts entirely.</p>
<h2 id="benefit-of-migration-to-github-actions-and-tfaction">Benefit of migration to GitHub Actions and tfaction</h2>
<p>Migrating to GitHub Actions with tfaction has improved things in the following ways.</p>
<ul>
<li>Least Privilege</li>
<li>No more need to sign in to AWS to view CI logs or retry</li>
<li>Faster CI</li>
<li>Elimination of shell scripts</li>
<li>tfaction provides useful features such as <a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/follow-up-pr">automatic generation of Follow up Pull Requests</a>, <a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/auto-update-related-prs">automatic update of Pull Requests</a>, <a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/scaffold-working-dir">scaffolding working directory with GitHub Actions</a>, and so on</li>
</ul>
<h3 id="least-privilege">Least Privilege</h3>
<p>One of the main problems we had was that all Terraform builds used an IAM Role with very strong privileges.
GitHub Actions’ OIDC support allows you to use different IAM Roles per branch,
so an IAM Role with strong privileges is used only on the default branch, which executes terraform apply,
while an IAM Role with almost read-only privileges is used for pull requests.
Furthermore, IAM Roles with very limited permissions can be used in builds for tfmigrate and for non-AWS Terraform Providers.</p>
<p>tfaction provides a Terraform Module to create IAM Roles with minimal privileges.</p>
<p><a href="https://github.com/suzuki-shunsuke/terraform-aws-tfaction">https://github.com/suzuki-shunsuke/terraform-aws-tfaction</a></p>
<p>tfaction also allows you to configure IAM Roles for each working directory and GitHub Actions job (terraform plan, terraform apply, tfmigrate plan, tfmigrate apply),
so you can achieve least privilege easily.</p>
<h3 id="tfaction-specific-features">tfaction-specific features</h3>
<p>tfaction provides various useful features. Please see the official document.</p>
<ul>
<li><a href="https://speakerdeck.com/szksh/tfaction-build-terraform-workflow-with-github-actions">https://speakerdeck.com/szksh/tfaction-build-terraform-workflow-with-github-actions</a></li>
<li><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/build-matrix">https://suzuki-shunsuke.github.io/tfaction/docs/feature/build-matrix</a></li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, I introduced the migration of Terraform Monorepo Workflow from AWS CodeBuild to GitHub Actions and tfaction.
This migration has improved the Developer Experience and achieved least privilege.</p>
<h1>Scheduled-Scaling with Kubernetes HPA External Metrics</h1>
<p>Original article in Japanese: <a href="https://quipper.hatenablog.com/entry/2020/11/30/scheduled-scaling-with-hpa">Kubernetes HPA External Metrics を利用した Scheduled-Scaling</a></p>
<p>Hi, I’m @chaspy from Site Reliability Engineering Team.</p>
<p>At Quipper, we use <a href="https://quipper.hatenablog.com/entry/2020/04/10/hpa">Kubernetes Horizontal Pod Autoscaler</a> (HPA) to achieve pod auto-scaling.</p>
<p>The HPA can handle most ups and downs in traffic. However, in general, it cannot deal with a spike in traffic caused by an unexpectedly high number of users accessing the platform at once. When an unexpected increase in CPU utilization happens, it still takes about 5 minutes to scale out the nodes, even if the HPA immediately increases the Desired Replicas.</p>
<p>Compared to the scaling mechanism based on the CPU utilization, Scheduled-Scaling can be defined as a method to schedule a fixed number of nodes/pods to be scaled at a specific time in the future. The simplest way to perform Scheduled-Scaling is to just change the <code class="language-plaintext highlighter-rouge">minReplicas</code> of the HPA at a specified time. This method may be efficient if the change is only made once or around the same time every day. However, if the spikes are expected at different times, it may be difficult to change the <code class="language-plaintext highlighter-rouge">minReplicas</code> every time.</p>
<p>In this article, I will explain a case study using <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-metrics-not-related-to-kubernetes-objects">Kubernetes HPA External Metrics</a> to perform Scheduled-Scaling for traffic spike during regularly scheduled exams in the Philippines.</p>
<h2 id="background">Background</h2>
<p>In the Philippines, Quipper is already being used in schools. Teachers and students have been using it for scheduled exams, e.g., term-end exams. The teachers register the questions for the examinations in the system before the exam.</p>
<p>One day, while one such scheduled exam was about to start, some students could not log in to the portal at all. Schools and the Customer Success team were really confused because they suddenly started receiving complaints about students not being able to take the exam. After some investigation, we found that this was due to a sudden traffic spike.</p>
<p>As a temporary solution, we first avoided service downtime by setting the HPA minReplicas to a high enough value during daytime hours. However, this resulted in redundant server costs because we didn’t scale down the replicas at night or at times when there was no traffic spike.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416293-a13c1d80-dd91-11eb-9cc3-2740f61d94b6.png" alt="image" /></p>
<p>Description: The number of pods. It scales out up to 400 uniformly from 6:30 am to 7:30 pm.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416327-b44eed80-dd91-11eb-8bc1-ccb73b63a4a3.png" alt="image" /></p>
<p>Description: The number of Nodes also increases in proportion to the number of Pods.</p>
<p>To solve this problem, @naotori, the Global Division Director, asked me if it would be possible to scale the servers in advance, based on the starting time of the exams and the expected number of users. Then, @bdesmero, Global Product Development VPoE, wrote a batch script to get that data from our database. When we compared this data with the actual server metrics, we found that the server load correlates with the starting time of the exams and the expected number of users. We also determined from the metrics the maximum number of users our current architecture could handle.</p>
<p>Therefore, to optimize the number of pods/nodes that were being scaled out excessively, we decided to use the data obtained by @bdesmero as external metrics for the HPA, together with CPU-based auto-scaling, to achieve Scheduled-Scaling safely.</p>
<h2 id="mechanism-hpa-external-metrics-and-datadog-custom-metrics-server">Mechanism: HPA External Metrics and Datadog Custom Metrics Server</h2>
<p>The HPA is widely known for auto-scaling based on CPU, but autoscaling based on External Metrics has been available since API version <a href="https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#horizontalpodautoscaler-v2beta1-autoscaling">autoscaling/v2beta1</a>. Since Quipper uses Datadog, I decided to use Datadog metrics as External Metrics.</p>
<p>So how do you autoscale using a Datadog metric? The HPA Controller is designed to get metrics from the Kubernetes metrics APIs (<code class="language-plaintext highlighter-rouge">metrics.k8s.io</code>, <code class="language-plaintext highlighter-rouge">custom.metrics.k8s.io</code>, <code class="language-plaintext highlighter-rouge">external.metrics.k8s.io</code>).</p>
<p>When you set it up in accordance with <a href="https://docs.datadoghq.com/agent/cluster_agent/external_metrics/">the documentation of the Datadog Custom Metrics Server</a>, an APIService is added. <a href="https://kubernetes.io/docs/tasks/extend-kubernetes/setup-extension-api-server/">By registering the APIService, it is registered in the Aggregation Layer of the Kubernetes API</a>, and the HPA can retrieve metrics from Datadog’s metrics server via the Kubernetes API. Here is a diagram.</p>
<p><a href="https://mermaid-js.github.io/mermaid-live-editor/edit/##eyJjb2RlIjoiZ3JhcGggTFJcbiAgQVtIUEFdIC0tPnxHZXQgbWV0cmljc3wgQltBUEkgc2VydmVyXVxuICBCIC0tPiBDW0FQSVNlcnZpY2VdXG4gIEMgLS0-IERbU2VydmljZSBkYXRhZG9nLWN1c3RvbS1tZXRyaWNzLXNlcnZlcl1cbiAgRCAtLT4gRVtkYXRhZG9nLWNsdXN0ZXItYWdlbnRdXG4gIEUgLS0-IEZbRGF0YWRvZ10iLCJtZXJtYWlkIjoie1xuICBcInRoZW1lXCI6IFwiZGVmYXVsdFwiLFxuICBcInRoZW1lVmFyaWFibGVzXCI6IHtcbiAgICBcImJhY2tncm91bmRcIjogXCJ3aGl0ZVwiLFxuICAgIFwicHJpbWFyeUNvbG9yXCI6IFwiI0VDRUNGRlwiLFxuICAgIFwic2Vjb25kYXJ5Q29sb3JcIjogXCIjZmZmZmRlXCIsXG4gICAgXCJ0ZXJ0aWFyeUNvbG9yXCI6IFwiaHNsKDgwLCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSlcIixcbiAgICBcInByaW1hcnlCb3JkZXJDb2xvclwiOiBcImhzbCgyNDAsIDYwJSwgODYuMjc0NTA5ODAzOSUpXCIsXG4gICAgXCJzZWNvbmRhcnlCb3JkZXJDb2xvclwiOiBcImhzbCg2MCwgNjAlLCA4My41Mjk0MTE3NjQ3JSlcIixcbiAgICBcInRlcnRpYXJ5Qm9yZGVyQ29sb3JcIjogXCJoc2woODAsIDYwJSwgODYuMjc0NTA5ODAzOSUpXCIsXG4gICAgXCJwcmltYXJ5VGV4dENvbG9yXCI6IFwiIzEzMTMwMFwiLFxuICAgIFwic2Vjb25kYXJ5VGV4dENvbG9yXCI6IFwiIzAwMDAyMVwiLFxuICAgIFwidGVydGlhcnlUZXh0Q29sb3JcIjogXCJyZ2IoOS41MDAwMDAwMDAxLCA5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSlcIixcbiAgICBcImxpbmVDb2xvclwiOiBcIiMzMzMzMzNcIixcbiAgICBcInRleHRDb2xvclwiOiBcIiMzMzNcIixcbiAgICBcIm1haW5Ca2dcIjogXCIjRUNFQ0ZGXCIsXG4gICAgXCJzZWNvbmRCa2dcIjogXCIjZmZmZmRlXCIsXG4gICAgXCJib3JkZXIxXCI6IFwiIzkzNzBEQlwiLFxuICAgIFwiYm9yZGVyMlwiOiBcIiNhYWFhMzNcIixcbiAgICBcImFycm93aGVhZENvbG9yXCI6IFwiIzMzMzMzM1wiLFxuICAgIFwiZm9udEZhbWlseVwiOiBcIlxcXCJ0cmVidWNoZXQgbXNcXFwiLCB2ZXJkYW5hLCBhcmlhbFwiLFxuICAgIFwiZm9udFNpemVcIjogXCIxNnB4XCIsXG4gICAgXCJsYWJlbEJhY2tncm91bmRcIjogXCIjZThlOGU4XCIsXG4gICAgXCJub2RlQmtnXCI6IFwiI0VDRUNGRlwiLFxuICAgIFwibm9kZUJvcmRlclwiOiBcIiM5MzcwREJcIixcbiAgICBcImNsdXN0ZXJCa2dcIjogXCIjZmZmZmRlXCIsXG4gICAgXCJjbHVzdGVyQm9yZGVyXCI6IFwiI2FhYWEzM1wiLFxuICAgIFwiZGVmYXVsdExpbmtDb2xvclwiOiBcIiMzMzMzMzNcIixcbiAgICBcInRpdGxlQ29sb3JcIjogXCIjMzMzXCIsXG4gICAgXCJlZGdlTGFiZWxCYWNrZ3JvdW5kXCI6IFwiI2U4ZThlOFwiLFxuICAgIFwiYWN0b3JCb3JkZXJcIjogXCJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSlcIixcbiAgICBcImFjdG9yQmtnXCI6IFwiI0VDRUNGRlwiLFxuICAgIFwiYWN0b3JUZXh0Q29sb3JcIjogXCJibGFja1wiLFxuICAgIFwiYWN0b3JMaW5lQ29sb3JcIjogXCJncmV5XCIsXG4gICAgXCJzaWduYWxDb2xvclwiOiBcIiMzMzNcIixcbiAgICBcInNpZ25hbFRleHRDb2xvclwiOiBcIiMzMzNcIixcbiAgICBcImxhYmVsQm94QmtnQ29sb3JcIjogXCIjRUNFQ0ZGXCIsXG4gICAgXCJsYWJlbEJveEJvcmRlckNvbG9yXCI6IFwiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpXCIsXG4gICAgXCJsYWJlbFRleHRDb2xvclwiOiBcImJsYWNrXCIsXG4gICAgXCJsb29wVGV4dENvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcIm5vdGVCb3JkZXJDb2xvclwiOiBcIiNhYWFhMzNcIixcbiAgICBcIm5vdGVCa2dDb2xvclwiOiBcIiNmZmY1YWRcIixcbiAgICBcIm5vdGVUZXh0Q29sb3JcIjogXCJibGFja1wiLFxuICAgIFwiYWN0aXZhdGlvbkJvcmRlckNvbG9yXCI6IFwiIzY2NlwiLFxuICAgIFwiYWN0aXZhdGlvbkJrZ0NvbG9yXCI6IFwiI2Y0ZjRmNFwiLFxuICAgIFwic2VxdWVuY2VOdW1iZXJDb2xvclwiOiBcIndoaXRlXCIsXG4gICAgXCJzZWN0aW9uQmtnQ29sb3JcIjogXCJyZ2JhKDEwMiwgMTAyLCAyNTUsIDAuNDkpXCIsXG4gICAgXCJhbHRTZWN0aW9uQmtnQ29sb3JcIjogXCJ3aGl0ZVwiLFxuICAgIFwic2VjdGlvbkJrZ0NvbG9yMlwiOiBcIiNmZmY0MDBcIixcbiAgICBcInRhc2tCb3JkZXJDb2xvclwiOiBcIiM1MzRmYmNcIixcbiAgICBcInRhc2tCa2dDb2xvclwiOiBcIiM4YTkwZGRcIixcbiAgICBcInRhc2tUZXh0TGlnaHRDb2xvclwiOiBcIndoaXRlXCIsXG4gICAgXCJ0YXNrVGV4dENvbG9yXCI6IFwid2hpdGVcIixcbiAgICBcInRhc2tUZXh0RGFya0NvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcInRhc2tUZXh0T3V0c2lkZUNvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcInRhc2tUZXh0Q2xpY2thYmxlQ29sb3JcIjogXCIjMDAzMTYzXCIsXG4gICAgXCJhY3RpdmVUYXNrQm9yZGVyQ29sb3JcIjogXCIjNTM0ZmJjXCIsXG4gICAgXCJhY3RpdmVUYXNrQmtnQ29sb3JcIjogXCIjYmZjN2ZmXCIsXG4gICAgXCJncmlkQ29sb3JcIjogXCJsaWdodGdyZXlcIixcbiAgICBcImRvbmVUYXNrQmt
nQ29sb3JcIjogXCJsaWdodGdyZXlcIixcbiAgICBcImRvbmVUYXNrQm9yZGVyQ29sb3JcIjogXCJncmV5XCIsXG4gICAgXCJjcml0Qm9yZGVyQ29sb3JcIjogXCIjZmY4ODg4XCIsXG4gICAgXCJjcml0QmtnQ29sb3JcIjogXCJyZWRcIixcbiAgICBcInRvZGF5TGluZUNvbG9yXCI6IFwicmVkXCIsXG4gICAgXCJsYWJlbENvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcImVycm9yQmtnQ29sb3JcIjogXCIjNTUyMjIyXCIsXG4gICAgXCJlcnJvclRleHRDb2xvclwiOiBcIiM1NTIyMjJcIixcbiAgICBcImNsYXNzVGV4dFwiOiBcIiMxMzEzMDBcIixcbiAgICBcImZpbGxUeXBlMFwiOiBcIiNFQ0VDRkZcIixcbiAgICBcImZpbGxUeXBlMVwiOiBcIiNmZmZmZGVcIixcbiAgICBcImZpbGxUeXBlMlwiOiBcImhzbCgzMDQsIDEwMCUsIDk2LjI3NDUwOTgwMzklKVwiLFxuICAgIFwiZmlsbFR5cGUzXCI6IFwiaHNsKDEyNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpXCIsXG4gICAgXCJmaWxsVHlwZTRcIjogXCJoc2woMTc2LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSlcIixcbiAgICBcImZpbGxUeXBlNVwiOiBcImhzbCgtNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpXCIsXG4gICAgXCJmaWxsVHlwZTZcIjogXCJoc2woOCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpXCIsXG4gICAgXCJmaWxsVHlwZTdcIjogXCJoc2woMTg4LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSlcIlxuICB9XG59IiwidXBkYXRlRWRpdG9yIjpmYWxzZSwiYXV0b1N5bmMiOnRydWUsInVwZGF0ZURpYWdyYW0iOmZhbHNlfQ"><img src="https://mermaid.ink/img/eyJjb2RlIjoiZ3JhcGggTFJcbiAgQVtIUEFdIC0tPnxHZXQgbWV0cmljc3wgQltBUEkgc2VydmVyXVxuICBCIC0tPiBDW0FQSVNlcnZpY2VdXG4gIEMgLS0-IERbU2VydmljZSBkYXRhZG9nLWN1c3RvbS1tZXRyaWNzLXNlcnZlcl1cbiAgRCAtLT4gRVtkYXRhZG9nLWNsdXN0ZXItYWdlbnRdXG4gIEUgLS0-IEZbRGF0YWRvZ10iLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCIsInRoZW1lVmFyaWFibGVzIjp7ImJhY2tncm91bmQiOiJ3aGl0ZSIsInByaW1hcnlDb2xvciI6IiNFQ0VDRkYiLCJzZWNvbmRhcnlDb2xvciI6IiNmZmZmZGUiLCJ0ZXJ0aWFyeUNvbG9yIjoiaHNsKDgwLCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJwcmltYXJ5Qm9yZGVyQ29sb3IiOiJoc2woMjQwLCA2MCUsIDg2LjI3NDUwOTgwMzklKSIsInNlY29uZGFyeUJvcmRlckNvbG9yIjoiaHNsKDYwLCA2MCUsIDgzLjUyOTQxMTc2NDclKSIsInRlcnRpYXJ5Qm9yZGVyQ29sb3IiOiJoc2woODAsIDYwJSwgODYuMjc0NTA5ODAzOSUpIiwicHJpbWFyeVRleHRDb2xvciI6IiMxMzEzMDAiLCJzZWNvbmRhcnlUZXh0Q29sb3IiOiIjMDAwMDIxIiwidGVydGlhcnlUZXh0Q29sb3IiOiJyZ2IoOS41MDAwMDAwMDAxLCA5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSkiLCJsaW5lQ29sb3IiOiIjMzMzMzMzIiwidGV4dENvbG9yIjoiIzMzMyIsIm1haW5Ca2ciOiIjRUNFQ0ZGIiwic2Vjb25kQmtnIjoiI2ZmZmZkZSIsImJvcmRlcjEiOiIjOTM3MERCIiwiYm9yZGVyMiI6IiNhYWFhMzMiLCJhcnJvd2hlYWRDb2xvciI6IiMzMzMzMzMiLCJmb250RmFtaWx5IjoiXCJ0cmVidWNoZXQgbXNcIiwgdmVyZGFuYSwgYXJpYWwiLCJmb250U2l6ZSI6IjE2cHgiLCJsYWJlbEJhY2tncm91bmQiOiIjZThlOGU4Iiwibm9kZUJrZyI6IiNFQ0VDRkYiLCJub2RlQm9yZGVyIjoiIzkzNzBEQiIsImNsdXN0ZXJCa2ciOiIjZmZmZmRlIiwiY2x1c3RlckJvcmRlciI6IiNhYWFhMzMiLCJkZWZhdWx0TGlua0NvbG9yIjoiIzMzMzMzMyIsInRpdGxlQ29sb3IiOiIjMzMzIiwiZWRnZUxhYmVsQmFja2dyb3VuZCI6IiNlOGU4ZTgiLCJhY3RvckJvcmRlciI6ImhzbCgyNTkuNjI2MTY4MjI0MywgNTkuNzc2NTM2MzEyOCUsIDg3LjkwMTk2MDc4NDMlKSIsImFjdG9yQmtnIjoiI0VDRUNGRiIsImFjdG9yVGV4dENvbG9yIjoiYmxhY2siLCJhY3RvckxpbmVDb2xvciI6ImdyZXkiLCJzaWduYWxDb2xvciI6IiMzMzMiLCJzaWduYWxUZXh0Q29sb3IiOiIjMzMzIiwibGFiZWxCb3hCa2dDb2xvciI6IiNFQ0VDRkYiLCJsYWJlbEJveEJvcmRlckNvbG9yIjoiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpIiwibGFiZWxUZXh0Q29sb3IiOiJibGFjayIsImxvb3BUZXh0Q29sb3IiOiJibGFjayIsIm5vdGVCb3JkZXJDb2xvciI6IiNhYWFhMzMiLCJub3RlQmtnQ29sb3IiOiIjZmZmNWFkIiwibm90ZVRleHRDb2xvciI6ImJsYWNrIiwiYWN0aXZhdGlvbkJvcmRlckNvbG9yIjoiIzY2NiIsImFjdGl2YXRpb25Ca2dDb2xvciI6IiNmNGY0ZjQiLCJzZXF1ZW5jZU51bWJlckNvbG9yIjoid2hpdGUiLCJzZWN0aW9uQmtnQ29sb3IiOiJyZ2JhKDEwMiwgMTAyLCAyNTUsIDAuNDkpIiwiYWx0U2VjdGlvbkJrZ0NvbG9yIjoid2hpdGUiLCJzZWN0aW9uQmtnQ29sb3IyIjoiI2ZmZjQwMCIsInRhc2tCb3JkZXJDb2xvciI6IiM1MzRmYmMiLCJ0YXNrQmtnQ29sb3IiOiIjOGE5MGRkIiwidGFza1RleHRMaWdodENvbG9yIjoid2hpdGUiLCJ0YXNrVGV4dENvbG9yIjoid2hpdGUiLCJ0YXNrVGV4dERhcmtDb2xvciI6ImJsYWNrIiwidGFza1RleHRPdXRzaWRlQ29sb3IiOiJibGFjayIsInRhc2tUZXh0Q2xpY2thYmxlQ29sb3IiOiIj
MDAzMTYzIiwiYWN0aXZlVGFza0JvcmRlckNvbG9yIjoiIzUzNGZiYyIsImFjdGl2ZVRhc2tCa2dDb2xvciI6IiNiZmM3ZmYiLCJncmlkQ29sb3IiOiJsaWdodGdyZXkiLCJkb25lVGFza0JrZ0NvbG9yIjoibGlnaHRncmV5IiwiZG9uZVRhc2tCb3JkZXJDb2xvciI6ImdyZXkiLCJjcml0Qm9yZGVyQ29sb3IiOiIjZmY4ODg4IiwiY3JpdEJrZ0NvbG9yIjoicmVkIiwidG9kYXlMaW5lQ29sb3IiOiJyZWQiLCJsYWJlbENvbG9yIjoiYmxhY2siLCJlcnJvckJrZ0NvbG9yIjoiIzU1MjIyMiIsImVycm9yVGV4dENvbG9yIjoiIzU1MjIyMiIsImNsYXNzVGV4dCI6IiMxMzEzMDAiLCJmaWxsVHlwZTAiOiIjRUNFQ0ZGIiwiZmlsbFR5cGUxIjoiI2ZmZmZkZSIsImZpbGxUeXBlMiI6ImhzbCgzMDQsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsImZpbGxUeXBlMyI6ImhzbCgxMjQsIDEwMCUsIDkzLjUyOTQxMTc2NDclKSIsImZpbGxUeXBlNCI6ImhzbCgxNzYsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsImZpbGxUeXBlNSI6ImhzbCgtNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIiwiZmlsbFR5cGU2IjoiaHNsKDgsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsImZpbGxUeXBlNyI6ImhzbCgxODgsIDEwMCUsIDkzLjUyOTQxMTc2NDclKSJ9fSwidXBkYXRlRWRpdG9yIjp0cnVlLCJhdXRvU3luYyI6dHJ1ZSwidXBkYXRlRGlhZ3JhbSI6dHJ1ZX0" alt="" /></a></p>
<p>Furthermore, if you want to use <a href="https://docs.datadoghq.com/dashboards/querying/">Datadog’s metrics query</a>, register the <a href="https://github.com/DataDog/datadog-operator/blob/v0.3.1/pkg/apis/datadoghq/v1alpha1/datadogmetric_types.go">DatadogMetric CRD</a>.</p>
<p>First, <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/autoscaler_watcher.go#L222">the Datadog cluster-agent checks if the HPA spec.metrics field is external</a>, <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/autoscaler_watcher.go#L233">parses the metric name such as <code class="language-plaintext highlighter-rouge">datadogmetric@<namespace>:<name></code></a>, and then <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/autoscaler_watcher.go#L233">sets the HPA Reference field</a>.</p>
<p>The HPA then queries the external metrics server for the referenced metric, and the cluster-agent receives the request and <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/provider.go#L113">returns the value retrieved from Datadog</a>. As a side note, <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/datadogmetric_controller.go#L197">the Controller seems to keep the retrieved query result in a Local Store</a> and <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/datadogmetric_controller.go#L207">syncs it to the DatadogMetric resource</a> in the Reconcile Loop rather than querying Datadog each time.</p>
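<p>To make the naming convention concrete, here is a minimal Go sketch (my own illustration, not the Datadog cluster-agent code) of how an external metric name like <code class="language-plaintext highlighter-rouge">datadogmetric@production:timed-exam</code> splits into a namespace and a DatadogMetric name:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "strings"

// parseDatadogMetricName sketches the naming convention only:
// "datadogmetric@<namespace>:<name>" maps to (namespace, name).
func parseDatadogMetricName(metricName string) (namespace, name string, ok bool) {
	const prefix = "datadogmetric@"
	if !strings.HasPrefix(metricName, prefix) {
		return "", "", false
	}
	parts := strings.SplitN(strings.TrimPrefix(metricName, prefix), ":", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	// e.g. ("production", "timed-exam", true)
	return parts[0], parts[1], true
}
</code></pre></div></div>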
<h2 id="architecture">Architecture</h2>
<p>Next, I will explain the architecture of using the Datadog metrics server and HPA to achieve Scheduled-Scaling.</p>
<p><a href="https://mermaid-js.github.io/mermaid-live-editor/edit##eyJjb2RlIjoiZ3JhcGggVERcbiAgQVtzY2hlZHVsZXNfcmV0cmlldmVfdGltZWRfZXhhbWluYXRpb25zXVxuICBCW01vbmdvREJdXG4gIHN1YmdyYXBoIEt1YmVybmV0ZXNcbiAgR1tIUEFdXG4gIEhbYXBpIGRlcGxveW1lbnRdXG4gIEpbRGF0YWRvZyBDbHVzdGVyIEFnZW50XVxuICAgIHN1YmdyYXBoIHRpbWVkLWV4YW0tc2NoZWR1bGUtZXhwb3J0ZXIgbmFtZXNwYWNlICBcbiAgICAgIEVbdGltZWQtZXhhbS1zY2hlZHVsZS1leHBvcnRlcl1cbiAgICAgIElbYXBpLWV4YW0tZGF0YSBDb25maWdtYXBdXG4gICAgICBFIC0tPnxtb3VudHxJXG4gICAgZW5kXG4gIGVuZFxuICBGW0RhdGFkb2ddXG4gIHN1YmdyYXBoIEplbmtpbnNcbiAgICBBXG4gIGVuZFxuXG5cbiAgQSAtLT58UmV0cml2ZSBkYXRhfEJcbiAgQSAtLT58UHV0IDIwMjAtbW0tZGQudHN2fElcbiAgRiAtLT58R2V0IGhvc3Q6ODA4MC9tZXRyaWNzfEVcbiAgRyAtLT58Q2hlY2sgbWV0cmljc3xKXG4gIEogLS0-fENoZWNrIG1ldHJpY3N8RlxuICBHIC0tPnxDaGFuZ2UgcmVwbGljYXN8SCIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0IiwidGhlbWVWYXJpYWJsZXMiOnsiYmFja2dyb3VuZCI6IndoaXRlIiwicHJpbWFyeUNvbG9yIjoiI0VDRUNGRiIsInNlY29uZGFyeUNvbG9yIjoiI2ZmZmZkZSIsInRlcnRpYXJ5Q29sb3IiOiJoc2woODAsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsInByaW1hcnlCb3JkZXJDb2xvciI6ImhzbCgyNDAsIDYwJSwgODYuMjc0NTA5ODAzOSUpIiwic2Vjb25kYXJ5Qm9yZGVyQ29sb3IiOiJoc2woNjAsIDYwJSwgODMuNTI5NDExNzY0NyUpIiwidGVydGlhcnlCb3JkZXJDb2xvciI6ImhzbCg4MCwgNjAlLCA4Ni4yNzQ1MDk4MDM5JSkiLCJwcmltYXJ5VGV4dENvbG9yIjoiIzEzMTMwMCIsInNlY29uZGFyeVRleHRDb2xvciI6IiMwMDAwMjEiLCJ0ZXJ0aWFyeVRleHRDb2xvciI6InJnYig5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSwgOS41MDAwMDAwMDAxKSIsImxpbmVDb2xvciI6IiMzMzMzMzMiLCJ0ZXh0Q29sb3IiOiIjMzMzIiwibWFpbkJrZyI6IiNFQ0VDRkYiLCJzZWNvbmRCa2ciOiIjZmZmZmRlIiwiYm9yZGVyMSI6IiM5MzcwREIiLCJib3JkZXIyIjoiI2FhYWEzMyIsImFycm93aGVhZENvbG9yIjoiIzMzMzMzMyIsImZvbnRGYW1pbHkiOiJcInRyZWJ1Y2hldCBtc1wiLCB2ZXJkYW5hLCBhcmlhbCIsImZvbnRTaXplIjoiMTZweCIsImxhYmVsQmFja2dyb3VuZCI6IiNlOGU4ZTgiLCJub2RlQmtnIjoiI0VDRUNGRiIsIm5vZGVCb3JkZXIiOiIjOTM3MERCIiwiY2x1c3RlckJrZyI6IiNmZmZmZGUiLCJjbHVzdGVyQm9yZGVyIjoiI2FhYWEzMyIsImRlZmF1bHRMaW5rQ29sb3IiOiIjMzMzMzMzIiwidGl0bGVDb2xvciI6IiMzMzMiLCJlZGdlTGFiZWxCYWNrZ3JvdW5kIjoiI2U4ZThlOCIsImFjdG9yQm9yZGVyIjoiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpIiwiYWN0b3JCa2ciOiIjRUNFQ0ZGIiwiYWN0b3JUZXh0Q29sb3IiOiJibGFjayIsImFjdG9yTGluZUNvbG9yIjoiZ3JleSIsInNpZ25hbENvbG9yIjoiIzMzMyIsInNpZ25hbFRleHRDb2xvciI6IiMzMzMiLCJsYWJlbEJveEJrZ0NvbG9yIjoiI0VDRUNGRiIsImxhYmVsQm94Qm9yZGVyQ29sb3IiOiJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSkiLCJsYWJlbFRleHRDb2xvciI6ImJsYWNrIiwibG9vcFRleHRDb2xvciI6ImJsYWNrIiwibm90ZUJvcmRlckNvbG9yIjoiI2FhYWEzMyIsIm5vdGVCa2dDb2xvciI6IiNmZmY1YWQiLCJub3RlVGV4dENvbG9yIjoiYmxhY2siLCJhY3RpdmF0aW9uQm9yZGVyQ29sb3IiOiIjNjY2IiwiYWN0aXZhdGlvbkJrZ0NvbG9yIjoiI2Y0ZjRmNCIsInNlcXVlbmNlTnVtYmVyQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvciI6InJnYmEoMTAyLCAxMDIsIDI1NSwgMC40OSkiLCJhbHRTZWN0aW9uQmtnQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvcjIiOiIjZmZmNDAwIiwidGFza0JvcmRlckNvbG9yIjoiIzUzNGZiYyIsInRhc2tCa2dDb2xvciI6IiM4YTkwZGQiLCJ0YXNrVGV4dExpZ2h0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0RGFya0NvbG9yIjoiYmxhY2siLCJ0YXNrVGV4dE91dHNpZGVDb2xvciI6ImJsYWNrIiwidGFza1RleHRDbGlja2FibGVDb2xvciI6IiMwMDMxNjMiLCJhY3RpdmVUYXNrQm9yZGVyQ29sb3IiOiIjNTM0ZmJjIiwiYWN0aXZlVGFza0JrZ0NvbG9yIjoiI2JmYzdmZiIsImdyaWRDb2xvciI6ImxpZ2h0Z3JleSIsImRvbmVUYXNrQmtnQ29sb3IiOiJsaWdodGdyZXkiLCJkb25lVGFza0JvcmRlckNvbG9yIjoiZ3JleSIsImNyaXRCb3JkZXJDb2xvciI6IiNmZjg4ODgiLCJjcml0QmtnQ29sb3IiOiJyZWQiLCJ0b2RheUxpbmVDb2xvciI6InJlZCIsImxhYmVsQ29sb3IiOiJibGFjayIsImVycm9yQmtnQ29sb3IiOiIjNTUyMjIyIiwiZXJyb3JUZXh0Q29sb3IiOiIjNTUyMjIyIiwiY2xhc3NUZXh0IjoiIzEzMTMwMCIsImZpbGxUeXBlMCI6IiNFQ0VDRkYiLCJmaWxsVHlwZTEiOiIjZmZmZmRlIiwiZmlsbFR5cGUyIjoiaHNsKDMwNCwgMTAwJSwgOTYu
Mjc0NTA5ODAzOSUpIiwiZmlsbFR5cGUzIjoiaHNsKDEyNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIiwiZmlsbFR5cGU0IjoiaHNsKDE3NiwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU1IjoiaHNsKC00LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkiLCJmaWxsVHlwZTYiOiJoc2woOCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU3IjoiaHNsKDE4OCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIn19fQ"><img src="https://mermaid.ink/img/eyJjb2RlIjoiZ3JhcGggVERcbiAgQVtzY2hlZHVsZXNfcmV0cmlldmVfdGltZWRfZXhhbWluYXRpb25zXVxuICBCW01vbmdvREJdXG4gIHN1YmdyYXBoIEt1YmVybmV0ZXNcbiAgR1tIUEFdXG4gIEhbYXBpIGRlcGxveW1lbnRdXG4gIEpbRGF0YWRvZyBDbHVzdGVyIEFnZW50XVxuICAgIHN1YmdyYXBoIHRpbWVkLWV4YW0tc2NoZWR1bGUtZXhwb3J0ZXIgbmFtZXNwYWNlICBcbiAgICAgIEVbdGltZWQtZXhhbS1zY2hlZHVsZS1leHBvcnRlcl1cbiAgICAgIElbYXBpLWV4YW0tZGF0YSBDb25maWdtYXBdXG4gICAgICBFIC0tPnxtb3VudHxJXG4gICAgZW5kXG4gIGVuZFxuICBGW0RhdGFkb2ddXG4gIHN1YmdyYXBoIEplbmtpbnNcbiAgICBBXG4gIGVuZFxuXG5cbiAgQSAtLT58UmV0cml2ZSBkYXRhfEJcbiAgQSAtLT58UHV0IDIwMjAtbW0tZGQudHN2fElcbiAgRiAtLT58R2V0IGhvc3Q6ODA4MC9tZXRyaWNzfEVcbiAgRyAtLT58Q2hlY2sgbWV0cmljc3xKXG4gIEogLS0-fENoZWNrIG1ldHJpY3N8RlxuICBHIC0tPnxDaGFuZ2UgcmVwbGljYXN8SCIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0IiwidGhlbWVWYXJpYWJsZXMiOnsiYmFja2dyb3VuZCI6IndoaXRlIiwicHJpbWFyeUNvbG9yIjoiI0VDRUNGRiIsInNlY29uZGFyeUNvbG9yIjoiI2ZmZmZkZSIsInRlcnRpYXJ5Q29sb3IiOiJoc2woODAsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsInByaW1hcnlCb3JkZXJDb2xvciI6ImhzbCgyNDAsIDYwJSwgODYuMjc0NTA5ODAzOSUpIiwic2Vjb25kYXJ5Qm9yZGVyQ29sb3IiOiJoc2woNjAsIDYwJSwgODMuNTI5NDExNzY0NyUpIiwidGVydGlhcnlCb3JkZXJDb2xvciI6ImhzbCg4MCwgNjAlLCA4Ni4yNzQ1MDk4MDM5JSkiLCJwcmltYXJ5VGV4dENvbG9yIjoiIzEzMTMwMCIsInNlY29uZGFyeVRleHRDb2xvciI6IiMwMDAwMjEiLCJ0ZXJ0aWFyeVRleHRDb2xvciI6InJnYig5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSwgOS41MDAwMDAwMDAxKSIsImxpbmVDb2xvciI6IiMzMzMzMzMiLCJ0ZXh0Q29sb3IiOiIjMzMzIiwibWFpbkJrZyI6IiNFQ0VDRkYiLCJzZWNvbmRCa2ciOiIjZmZmZmRlIiwiYm9yZGVyMSI6IiM5MzcwREIiLCJib3JkZXIyIjoiI2FhYWEzMyIsImFycm93aGVhZENvbG9yIjoiIzMzMzMzMyIsImZvbnRGYW1pbHkiOiJcInRyZWJ1Y2hldCBtc1wiLCB2ZXJkYW5hLCBhcmlhbCIsImZvbnRTaXplIjoiMTZweCIsImxhYmVsQmFja2dyb3VuZCI6IiNlOGU4ZTgiLCJub2RlQmtnIjoiI0VDRUNGRiIsIm5vZGVCb3JkZXIiOiIjOTM3MERCIiwiY2x1c3RlckJrZyI6IiNmZmZmZGUiLCJjbHVzdGVyQm9yZGVyIjoiI2FhYWEzMyIsImRlZmF1bHRMaW5rQ29sb3IiOiIjMzMzMzMzIiwidGl0bGVDb2xvciI6IiMzMzMiLCJlZGdlTGFiZWxCYWNrZ3JvdW5kIjoiI2U4ZThlOCIsImFjdG9yQm9yZGVyIjoiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpIiwiYWN0b3JCa2ciOiIjRUNFQ0ZGIiwiYWN0b3JUZXh0Q29sb3IiOiJibGFjayIsImFjdG9yTGluZUNvbG9yIjoiZ3JleSIsInNpZ25hbENvbG9yIjoiIzMzMyIsInNpZ25hbFRleHRDb2xvciI6IiMzMzMiLCJsYWJlbEJveEJrZ0NvbG9yIjoiI0VDRUNGRiIsImxhYmVsQm94Qm9yZGVyQ29sb3IiOiJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSkiLCJsYWJlbFRleHRDb2xvciI6ImJsYWNrIiwibG9vcFRleHRDb2xvciI6ImJsYWNrIiwibm90ZUJvcmRlckNvbG9yIjoiI2FhYWEzMyIsIm5vdGVCa2dDb2xvciI6IiNmZmY1YWQiLCJub3RlVGV4dENvbG9yIjoiYmxhY2siLCJhY3RpdmF0aW9uQm9yZGVyQ29sb3IiOiIjNjY2IiwiYWN0aXZhdGlvbkJrZ0NvbG9yIjoiI2Y0ZjRmNCIsInNlcXVlbmNlTnVtYmVyQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvciI6InJnYmEoMTAyLCAxMDIsIDI1NSwgMC40OSkiLCJhbHRTZWN0aW9uQmtnQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvcjIiOiIjZmZmNDAwIiwidGFza0JvcmRlckNvbG9yIjoiIzUzNGZiYyIsInRhc2tCa2dDb2xvciI6IiM4YTkwZGQiLCJ0YXNrVGV4dExpZ2h0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0RGFya0NvbG9yIjoiYmxhY2siLCJ0YXNrVGV4dE91dHNpZGVDb2xvciI6ImJsYWNrIiwidGFza1RleHRDbGlja2FibGVDb2xvciI6IiMwMDMxNjMiLCJhY3RpdmVUYXNrQm9yZGVyQ29sb3IiOiIjNTM0ZmJjIiwiYWN0aXZlVGFza0JrZ0NvbG9yIjoiI2JmYzdmZiIsImdyaWRDb2xvciI6ImxpZ2h0Z3JleSIsImRvbmVUYXNrQmtnQ29sb3IiOiJsaWdodGdyZXkiLCJkb25lVGFza0JvcmRlckNvbG9yIjoiZ3JleSIsImNyaXRCb3JkZXJDb2xvciI6IiNmZjg4ODgiL
CJjcml0QmtnQ29sb3IiOiJyZWQiLCJ0b2RheUxpbmVDb2xvciI6InJlZCIsImxhYmVsQ29sb3IiOiJibGFjayIsImVycm9yQmtnQ29sb3IiOiIjNTUyMjIyIiwiZXJyb3JUZXh0Q29sb3IiOiIjNTUyMjIyIiwiY2xhc3NUZXh0IjoiIzEzMTMwMCIsImZpbGxUeXBlMCI6IiNFQ0VDRkYiLCJmaWxsVHlwZTEiOiIjZmZmZmRlIiwiZmlsbFR5cGUyIjoiaHNsKDMwNCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGUzIjoiaHNsKDEyNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIiwiZmlsbFR5cGU0IjoiaHNsKDE3NiwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU1IjoiaHNsKC00LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkiLCJmaWxsVHlwZTYiOiJoc2woOCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU3IjoiaHNsKDE4OCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIn19LCJ1cGRhdGVFZGl0b3IiOnRydWUsImF1dG9TeW5jIjp0cnVlLCJ1cGRhdGVEaWFncmFtIjp0cnVlfQ" alt="" /></a></p>
<h3 id="fetch-data-from-our-database-and-save-it-as-a-configmap">Fetch data from our database and save it as a ConfigMap</h3>
<p>See the area around <code class="language-plaintext highlighter-rouge">schedules_retrive_timed_examinations</code> at the bottom right (check the diagram above). @bdesmero created this part. <code class="language-plaintext highlighter-rouge">schedules_retrive_timed_examinations</code> gets the starting time of the exam and the corresponding number of students from our database and saves it as a TSV file. The TSV file looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>12:00 229
12:15 54
12:45 67
13:00 3684
13:15 91
13:30 4821
13:45 37
14:00 138
</code></pre></div></div>
<p>We divided the work between @bdesmero (as the web developer) and me (as the SRE).
The dependency on Jenkins and the use of a ConfigMap are drawbacks that increase the number of points of failure. Still, I think it was a reasonable choice given the short timeline and the cross-team cooperation it required.</p>
<h3 id="export-the-read-data-from-tsv-in-prometheus-format">Export the read data from TSV in Prometheus format</h3>
<p>Next, let’s take a look at the timed-exam-schedule-exporter component on the bottom left. It is written in Go and runs as a Kubernetes Deployment.</p>
<p>This component does the following:</p>
<ul>
<li>Mount the ConfigMap</li>
<li>Read the file in an infinite loop</li>
<li>Compare with the current time</li>
<li>Export the corresponding number of users in Prometheus format</li>
</ul>
<p>The key point is to export the value for 15 minutes ahead of the current time, because we want pods/nodes to start scaling out 15 minutes before users actually access the platform, given the time it takes to scale.</p>
<p>Let’s take a look at the code (it’s not that long, around 180 lines).</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"encoding/csv"</span>
<span class="s">"errors"</span>
<span class="s">"fmt"</span>
<span class="s">"io"</span>
<span class="s">"log"</span>
<span class="s">"net/http"</span>
<span class="s">"os"</span>
<span class="s">"strconv"</span>
<span class="s">"time"</span>
<span class="s">"github.com/prometheus/client_golang/prometheus"</span>
<span class="s">"github.com/prometheus/client_golang/prometheus/promhttp"</span>
<span class="p">)</span>
<span class="k">var</span> <span class="p">(</span>
<span class="c">//nolint:gochecknoglobals</span>
<span class="n">desiredReplicas</span> <span class="o">=</span> <span class="n">prometheus</span><span class="o">.</span><span class="n">NewGauge</span><span class="p">(</span><span class="n">prometheus</span><span class="o">.</span><span class="n">GaugeOpts</span><span class="p">{</span>
<span class="n">Namespace</span><span class="o">:</span> <span class="s">"timed_exam"</span><span class="p">,</span>
<span class="n">Subsystem</span><span class="o">:</span> <span class="s">"scheduled_scaling"</span><span class="p">,</span>
<span class="n">Name</span><span class="o">:</span> <span class="s">"desired_replicas"</span><span class="p">,</span>
<span class="n">Help</span><span class="o">:</span> <span class="s">"Number of desired replicas for timed exam"</span><span class="p">,</span>
<span class="p">})</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">interval</span> <span class="o">=</span> <span class="m">10</span>
<span class="n">prometheus</span><span class="o">.</span><span class="n">MustRegister</span><span class="p">(</span><span class="n">desiredReplicas</span><span class="p">)</span>
<span class="n">http</span><span class="o">.</span><span class="n">Handle</span><span class="p">(</span><span class="s">"/metrics"</span><span class="p">,</span> <span class="n">promhttp</span><span class="o">.</span><span class="n">Handler</span><span class="p">())</span>
<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
<span class="n">ticker</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">NewTicker</span><span class="p">(</span><span class="n">interval</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>
<span class="c">// register metrics as background</span>
<span class="k">for</span> <span class="k">range</span> <span class="n">ticker</span><span class="o">.</span><span class="n">C</span> <span class="p">{</span>
<span class="n">err</span> <span class="o">:=</span> <span class="n">snapshot</span><span class="p">()</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}()</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">http</span><span class="o">.</span><span class="n">ListenAndServe</span><span class="p">(</span><span class="s">":8080"</span><span class="p">,</span> <span class="no">nil</span><span class="p">))</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">snapshot</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">timeDifferencesToJapan</span> <span class="o">=</span> <span class="o">+</span><span class="m">9</span> <span class="o">*</span> <span class="m">60</span> <span class="o">*</span> <span class="m">60</span>
<span class="n">tz</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">FixedZone</span><span class="p">(</span><span class="s">"JST"</span><span class="p">,</span> <span class="n">timeDifferencesToJapan</span><span class="p">)</span>
<span class="n">t</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">()</span><span class="o">.</span><span class="n">In</span><span class="p">(</span><span class="n">tz</span><span class="p">)</span>
<span class="n">today</span> <span class="o">:=</span> <span class="n">t</span><span class="o">.</span><span class="n">Format</span><span class="p">(</span><span class="s">"2006-01-02"</span><span class="p">)</span>
<span class="c">// Configmap is mounted</span>
<span class="n">filename</span> <span class="o">:=</span> <span class="s">"/etc/config/"</span> <span class="o">+</span> <span class="n">today</span> <span class="o">+</span> <span class="s">".tsv"</span>
<span class="n">file</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to open file: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">defer</span> <span class="n">file</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
<span class="n">currentUsers</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getCurrentUsers</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tz</span><span class="p">,</span> <span class="n">file</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to get the current number of users: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">desiredReplicas</span><span class="o">.</span><span class="n">Set</span><span class="p">(</span><span class="n">currentUsers</span><span class="p">)</span>
<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">getCurrentUsers</span><span class="p">(</span><span class="n">now</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">tz</span> <span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Location</span><span class="p">,</span> <span class="n">file</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">)</span> <span class="p">(</span><span class="kt">float64</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">metricTimeDifference</span> <span class="o">=</span> <span class="o">+</span><span class="m">15</span>
<span class="c">// read input file</span>
<span class="n">reader</span> <span class="o">:=</span> <span class="n">csv</span><span class="o">.</span><span class="n">NewReader</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">reader</span><span class="o">.</span><span class="n">Comma</span> <span class="o">=</span> <span class="sc">'\t'</span>
<span class="c">// line[0] is time. i.e. "13:00"</span>
<span class="c">// line[1] is users. i.e. "350"</span>
<span class="k">var</span> <span class="n">previousNumberOfUsers</span> <span class="kt">float64</span> <span class="c">// A variable for storing the value of the previous loop</span>
<span class="k">var</span> <span class="n">index</span> <span class="kt">int64</span>
<span class="k">for</span> <span class="p">{</span>
<span class="n">index</span><span class="o">++</span>
<span class="n">parsedTSVLine</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">parseLine</span><span class="p">(</span><span class="n">reader</span><span class="p">,</span> <span class="n">now</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">Is</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">io</span><span class="o">.</span><span class="n">EOF</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">previousNumberOfUsers</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to parse a line (line: %d): %w"</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// Example:</span>
<span class="c">// line[0] line[1]</span>
<span class="c">// 17:00 4</span>
<span class="c">// 17:15 10</span>
<span class="c">//</span>
<span class="c">// Loop compares the current time with the time on line[0],</span>
<span class="c">// and if the current time is later than the current time,</span>
<span class="c">// the previous line[1] is used as gauge.</span>
<span class="c">//</span>
<span class="c">// To prepare the pods and nodes "metricTimeDifference" minutes in advance,</span>
<span class="c">// expose the value "metricTimeDifference" minutes ahead of the current value.</span>
<span class="c">// In the above example, it will expose 10 at 17:00.</span>
<span class="k">if</span> <span class="n">parsedTSVLine</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">After</span><span class="p">(</span><span class="n">now</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="n">metricTimeDifference</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Minute</span><span class="p">))</span> <span class="p">{</span>
<span class="c">// If the time of the first line is earlier than the time of the first line,</span>
<span class="c">// expose the value of the first line.</span>
<span class="k">if</span> <span class="n">previousNumberOfUsers</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">parsedTSVLine</span><span class="o">.</span><span class="n">numberOfUsers</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">previousNumberOfUsers</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">previousNumberOfUsers</span> <span class="o">=</span> <span class="n">parsedTSVLine</span><span class="o">.</span><span class="n">numberOfUsers</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">type</span> <span class="n">tsvLine</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">time</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span>
<span class="n">numberOfUsers</span> <span class="kt">float64</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">parseLine</span><span class="p">(</span><span class="n">reader</span> <span class="o">*</span><span class="n">csv</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">now</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">tz</span> <span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Location</span><span class="p">)</span> <span class="p">(</span><span class="n">tsvLine</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="n">line</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">readLineOfTSV</span><span class="p">(</span><span class="n">reader</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to read a line from TSV: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">parsedTime</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">parseTime</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">now</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to parse time from string to time: %s: %w"</span><span class="p">,</span> <span class="n">line</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">parsedNumberOfUsers</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">strconv</span><span class="o">.</span><span class="n">ParseFloat</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="m">64</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"the TSV file is invalid. The value of second column must be float: %s: %w"</span><span class="p">,</span> <span class="n">line</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{</span>
<span class="n">time</span><span class="o">:</span> <span class="n">parsedTime</span><span class="p">,</span>
<span class="n">numberOfUsers</span><span class="o">:</span> <span class="n">parsedNumberOfUsers</span><span class="p">,</span>
<span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">parseTime</span><span class="p">(</span><span class="n">inputTime</span> <span class="kt">string</span><span class="p">,</span> <span class="n">t</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">tz</span> <span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Location</span><span class="p">)</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">layout</span> <span class="o">=</span> <span class="s">"15:04"</span>
<span class="c">// parse "13:00" -> 2020-11-05 13:00:00 +0900 JST</span>
<span class="n">startTime</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">ParseInLocation</span><span class="p">(</span><span class="n">layout</span><span class="p">,</span> <span class="n">inputTime</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to parse a time string %s (layout: %s): %w"</span><span class="p">,</span> <span class="n">inputTime</span><span class="p">,</span> <span class="n">layout</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">parsedTime</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Date</span><span class="p">(</span>
<span class="n">t</span><span class="o">.</span><span class="n">Year</span><span class="p">(),</span> <span class="n">t</span><span class="o">.</span><span class="n">Month</span><span class="p">(),</span> <span class="n">t</span><span class="o">.</span><span class="n">Day</span><span class="p">(),</span>
<span class="n">startTime</span><span class="o">.</span><span class="n">Hour</span><span class="p">(),</span> <span class="n">startTime</span><span class="o">.</span><span class="n">Minute</span><span class="p">(),</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">return</span> <span class="n">parsedTime</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">readLineOfTSV</span><span class="p">(</span><span class="n">reader</span> <span class="o">*</span><span class="n">csv</span><span class="o">.</span><span class="n">Reader</span><span class="p">)</span> <span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">columnNum</span> <span class="o">=</span> <span class="m">2</span>
<span class="n">line</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">reader</span><span class="o">.</span><span class="n">Read</span><span class="p">()</span>
<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">Is</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">io</span><span class="o">.</span><span class="n">EOF</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"end of file: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"loading error: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// Check if the input tsv file is valid</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="o">!=</span> <span class="n">columnNum</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"the input tsv column is invalid. expected: %d actual: %d"</span><span class="p">,</span> <span class="n">columnNum</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="p">))</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">main()</code> and the <code class="language-plaintext highlighter-rouge">snapshot()</code> functions are the essential parts of this design.</p>
<p>In <code class="language-plaintext highlighter-rouge">main()</code>, we do some background processing using a <a href="https://gobyexample.com/tickers">ticker</a> and listen on HTTP port 8080.</p>
<p>In <code class="language-plaintext highlighter-rouge">snapshot()</code>, we read the file, get the values we need, and set them as gauge metrics in <code class="language-plaintext highlighter-rouge">desiredReplicas.Set(currentUsers)</code>.</p>
<p>The rest of the code reads and parses the lines. Basically, in the <a href="https://github.com/prometheus/client_golang">Prometheus Go client library</a>, the timestamp is set to the current time. <a href="https://docs.datadoghq.com/api/v1/metrics/#submit-metrics">In Datadog, timestamps cannot be set more than 10 minutes in the future or more than 1 hour in the past</a>, so instead of setting a future timestamp we export the value that is 15 minutes ahead with the current timestamp.</p>
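<p>To illustrate the 15-minute look-ahead, here is a minimal test sketch (my own addition, assuming the exporter code above lives in <code class="language-plaintext highlighter-rouge">package main</code> and keeps the <code class="language-plaintext highlighter-rouge">getCurrentUsers</code> signature shown there):</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"strings"
	"testing"
	"time"
)

// At 13:00 JST the exporter should already expose the 13:15 value,
// because it looks metricTimeDifference (15) minutes ahead.
func TestGetCurrentUsersLooksAhead(t *testing.T) {
	tz := time.FixedZone("JST", 9*60*60)
	now := time.Date(2021, 7, 13, 13, 0, 0, 0, tz)
	tsv := "12:00\t229\n13:00\t3684\n13:15\t91\n13:30\t4821\n"

	got, err := getCurrentUsers(now, tz, strings.NewReader(tsv))
	if err != nil {
		t.Fatal(err)
	}
	if got != 91 {
		t.Errorf("expected 91 (the 13:15 value), got %v", got)
	}
}
</code></pre></div></div>
<p>With the sample TSV from earlier, the gauge already reports 91 at 13:00 and 4821 at 13:15, which is what gives the pods and nodes their 15-minute head start.</p>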
<p>Here is an example of getting the exported metrics.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># in another window
# kubectl port-forward timed-exam-schedule-exporter-775fcc7c5b-qg6q6 8080:8080 -n timed-exam-schedule-exporter
$ curl -s localhost:8080/metrics | grep timed_exam_scheduled_scaling_desired_replicas
# HELP timed_exam_scheduled_scaling_desired_replicas Number of desired replicas for timed exam
# TYPE timed_exam_scheduled_scaling_desired_replicas gauge
</code></pre></div></div>
<h3 id="get-datadog-agent-to-scrape-the-exported-metrics">Get datadog-agent to scrape the exported metrics.</h3>
<p>We use <a href="https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes">Datadog Kubernetes Integration Autodiscovery</a>, which looks at the Pod’s annotation and fetches the metrics for us.</p>
<p>Here is the deployment manifest:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">namespace</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">replicas</span><span class="pi">:</span> <span class="m">3</span>
<span class="na">selector</span><span class="pi">:</span>
<span class="na">matchLabels</span><span class="pi">:</span>
<span class="na">app</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">template</span><span class="pi">:</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">app</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">annotations</span><span class="pi">:</span>
<span class="s">ad.datadoghq.com/timed-exam-schedule-exporter.check_names</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">["prometheus"]</span>
<span class="s">ad.datadoghq.com/timed-exam-schedule-exporter.init_configs</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">[{}]</span>
<span class="s">ad.datadoghq.com/timed-exam-schedule-exporter.instances</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">[</span>
<span class="s">{</span>
<span class="s">"prometheus_url": "http://%%host%%:8080/metrics",</span>
<span class="s">"namespace": "timed_exam",</span>
<span class="s">"metrics": ["*"]</span>
<span class="s">}</span>
<span class="s">]</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">containers</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">image</span><span class="pi">:</span> <span class="s"><aws-account-id>.dkr.ecr.<region-name>.amazonaws.com/timed-exam-schedule-exporter:<commit hash></span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">ports</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">http</span>
<span class="na">containerPort</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">livenessProbe</span><span class="pi">:</span>
<span class="na">initialDelaySeconds</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">httpGet</span><span class="pi">:</span>
<span class="na">path</span><span class="pi">:</span> <span class="s">/metrics</span>
<span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">resources</span><span class="pi">:</span>
<span class="na">limits</span><span class="pi">:</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">100Mi</span>
<span class="na">requests</span><span class="pi">:</span>
<span class="na">cpu</span><span class="pi">:</span> <span class="s">100m</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">100Mi</span>
<span class="na">volumeMounts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/etc/config</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">config-volume</span>
<span class="na">volumes</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">configMap</span><span class="pi">:</span>
<span class="na">defaultMode</span><span class="pi">:</span> <span class="m">420</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">api-exam-data</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">config-volume</span>
</code></pre></div></div>
<h3 id="use-datadog-query-to-scale-in-hpa">Use Datadog query to scale in HPA</h3>
<p>Finally, take a look at the upper left part of the diagram. It’s probably easier to understand if you look at the manifest.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">autoscaling/v2beta2</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">HorizontalPodAutoscaler</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">api</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">scaleTargetRef</span><span class="pi">:</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">api</span>
<span class="na">minReplicas</span><span class="pi">:</span> <span class="m">40</span>
<span class="na">maxReplicas</span><span class="pi">:</span> <span class="m">1000</span>
<span class="na">metrics</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">Resource</span>
<span class="na">resource</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">cpu</span>
<span class="na">target</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">Utilization</span>
<span class="na">averageUtilization</span><span class="pi">:</span> <span class="m">60</span> <span class="c1"># want 570 mcore of cpu usage. 570 / 950(requests) = 0.6</span>
<span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">External</span>
<span class="na">external</span><span class="pi">:</span>
<span class="na">metric</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">datadogmetric@production:timed-exam</span>
<span class="na">target</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">AverageValue</span>
<span class="na">averageValue</span><span class="pi">:</span> <span class="m">1</span>
</code></pre></div></div>
<p>The “type: External” entry is what we added to our existing HPA. Note that the HPA allows us to specify multiple metrics, and <a href="https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/controller/podautoscaler/horizontal.go#L261-L276">it uses the highest of the calculated replica counts</a>. Thanks to this mechanism, we can combine scaling by different metrics at specific times while keeping the usual CPU-based scaling.</p>
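<p>Conceptually, the selection works like the following sketch (a simplified illustration, not the actual HPA controller code):</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

// desiredReplicasFromMetrics sketches the idea behind the HPA's handling of
// multiple metrics: every metric proposes a replica count, and the highest
// proposal wins.
func desiredReplicasFromMetrics(proposals []int32) int32 {
	var desired int32
	for _, p := range proposals {
		if p > desired {
			desired = p
		}
	}
	return desired
}
</code></pre></div></div>
<p>For example, if the CPU metric proposes 50 replicas and the external metric proposes 400, the HPA scales to 400.</p>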
<p>Here is the DatadogMetric being referenced.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">datadoghq.com/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">DatadogMetric</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="c1"># throughput: 10 = 500 / 5000. 500 pods accept 5000 users.</span>
<span class="c1"># ref: https://github.com/quipper/xxxxxxx/issues/xxxxx</span>
<span class="na">query</span><span class="pi">:</span> <span class="s">ceil(max:timed_exam.timed_exam_scheduled_scaling_desired_replicas{environment:production}/10)</span>
</code></pre></div></div>
<p>The Datadog custom metric <code class="language-plaintext highlighter-rouge">timed_exam.timed_exam_scheduled_scaling_desired_replicas</code> represents the number of users written in the TSV file, and the Datadog query calculates how many pods are required for that number of users. For example, 3684 users at 13:00 divided by a throughput of 10 users per pod gives ceil(3684 / 10) = 369 pods.</p>
<p>By using the Datadog query, I was able to reduce the amount of code I had to write.</p>
<h2 id="how-to-apply">How to apply</h2>
<p>After having confirmed the operation in Staging, I applied the following steps in Production:</p>
<ol>
<li>Deploy the ConfigMap and timed-exam-schedule-exporter to Production, and send metrics to Datadog.</li>
<li>Apply the DatadogMetric and a test HPA to confirm that the HPA works as expected.</li>
<li>Update the HPA of the production application. The minReplicas should be large at this point.</li>
<li>Gradually lower the value of minReplicas while observing the situation.</li>
</ol>
<p>Since this is a configuration change related to production scaling, and there are many integrated parts, I had to apply it carefully.</p>
<p>Note that even if you apply only the DatadogMetric, the HPA Controller does not retrieve the metric unless an HPA references that DatadogMetric. That’s because the cluster-agent executes the DatadogMetric query and updates its status only when the HPA Controller retrieves the metric. Therefore, we used a dummy application and a dummy HPA for the verification in step 2.</p>
<p>Once I knew that the Datadog custom metric and the HPA settings were all in place, I tested the setup by setting minReplicas to a high value and then gradually decreasing it, while keeping an eye on the actual TSV file to make sure the number of replicas changed based on its data. I was able to confirm that the replicas scaled out properly.</p>
<h2 id="faq">FAQ</h2>
<h3 id="what-happens-if-the-tsv-file-is-invalid">What happens if the TSV file is invalid?</h3>
<p>The timed-exam-schedule-exporter exposes a value of 0, in which case the deployment scales by CPU.</p>
<h3 id="what-happens-if-the-communication-with-datadog-fails">What happens if the communication with Datadog fails?</h3>
<p>The Datadog cluster-agent sets the <code class="language-plaintext highlighter-rouge">DatadogMetric</code> Custom Resource’s status to Invalid, and the HPA external metric calculation shows unknown. In this case, scaling falls back to CPU.</p>
<h3 id="what-happens-if-the-timed-exam-schedule-exporter-goes-down">What happens if the timed-exam-schedule-exporter goes down?</h3>
<p>The metrics are no longer sent to Datadog, so the query results in <code class="language-plaintext highlighter-rouge">No Data</code>. As above, scaling falls back to CPU.</p>
<p>In any of these cases, thanks to HPA’s behavior regarding multiple metrics, CPU scaling kicks in even if something goes wrong with the external metrics.</p>
<h2 id="result">Result</h2>
<p>The number of pods and nodes we have got so far:</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416898-f4fb3680-dd92-11eb-893f-2a56a2219e0f.png" alt="image" /></p>
<p>The number of pods consistently scales out to as many as 400 between 6:30 am and 7:30 pm.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416910-fb89ae00-dd92-11eb-91b7-5a0ad42de28b.png" alt="image" /></p>
<p>The number of Nodes also increases in proportion to the number of Pods.</p>
<p>And this is the number of Pods and custom metrics one week after we started using Scheduled-Scaling:</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416923-02b0bc00-dd93-11eb-8a71-d7f9dec7702b.png" alt="image" /></p>
<p>The yellow line is the metric registered with <code class="language-plaintext highlighter-rouge">DatadogMetric</code> Custom Resource, and the purple line is the HPA Desired Replicas.</p>
<p>How amazing it is! When there’s no traffic spike, the scaling is executed by CPU. On the other hand, when many users are expected to use the platform, scaling is executed by the External Metrics.</p>
<p>The number of Nodes is also lower than before. The decrease in the area graph indicates that we reduced costs: the daily usage cost has gone down from $250 to $145, an estimated cost reduction of about $3,150 per month.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416931-093f3380-dd93-11eb-840c-8fa0104cbe07.png" alt="image" /></p>
<p>The purple line is the number of Nodes before we started using Scheduled-Scaling. The blue line is the number of Nodes after we started using Scheduled-Scaling.</p>
<p>We achieved flexible scaling based on the domain data of the number of users in scheduled exams. Furthermore, we were able to eliminate human intervention and reduce redundant infrastructure costs. That’s great!</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this article, I explained how to send the number of users as a Datadog custom metric and then scale on it as an External Metric with the HPA. The HPA’s support for multiple metrics enabled us to achieve Scheduled-Scaling safely alongside the existing CPU-based scaling. As a result, we were able to ensure both resource efficiency and reliability.</p>
<p>This case study has given us confidence in adopting Datadog metrics as external metrics for the HPA elsewhere. We have confirmed cases where CPU scaling does not work well for some services that use a messaging/queue system like Google Cloud Pub/Sub. I think that auto-scaling by a queue-length metric in Datadog might help scale those services properly.</p>
<p>Besides, I think this is a great example of problem solving through close communication among teams with different roles and responsibilities, including SRE, Web Developer, and Business Developer. We SREs may know how to use the Kubernetes HPA and Datadog, but we don’t know the details of the database and application features, such as the domain knowledge of the service. By sharing the problems and facing them together, we were led to success!</p>
<p><a href="https://career.quipper.com">Quipper is looking for people who want to Bring the Best Education to Every Corner of the World</a>. <a href="https://career.quipper.com/jp/jobs/sre/">SRE Team is also hiring</a>.</p>
Tue, 13 Jul 2021 14:00:00 +0000
https://devs.quipper.com/2021/07/13/schduled-scaling-with-kubernetes-external-metrics.html
https://devs.quipper.com/2021/07/13/schduled-scaling-with-kubernetes-external-metrics.htmlModifying a third-party library on a bytecode level<p>As developers in an EdTech company, it is important for us to keep up with the latest tech trends, especially when it involves one of the vital libraries our application can’t launch without: the Android Support Library. This matters because it enables us to reach a wide user base by supporting as many Android platform versions as possible.</p>
<p>But since the Android Support Library has reached its end of life, we recently migrated to androidx and updated a lot of dependencies. We survived a huge code change, but not without some hiccups. On the UI side of things, we needed to update the material-components library from version alpha01 to alpha05, and eventually alpha06, to fix an issue with our login. But this came at a price: during our happy-path testing, a side effect appeared, not in our code base but in a third-party library that we’re using.</p>
<p>We use a third-party SDK in our app for messaging between students and teachers, so this feature is critical for our app. For now, let’s call it “<strong>ChatSDK</strong>”. Upgrading Material Components can really impact a third-party library that uses UI components.</p>
<p>The Android <strong>ChatSDK</strong> crashes when <strong>ChatSDK</strong>BaseActivity calls <code class="language-plaintext highlighter-rouge">applyOverrideConfiguration</code>, throwing an</p>
<p><code class="language-plaintext highlighter-rouge">IllegalStateException(" Override configuration has already been set ")</code></p>
<p>This happens because <code class="language-plaintext highlighter-rouge">getResources()</code> or <code class="language-plaintext highlighter-rouge">getAssets()</code> has already been called.</p>
<p>In this case, <code class="language-plaintext highlighter-rouge">getResources</code> is called before <code class="language-plaintext highlighter-rouge">applyOverrideConfiguration</code> in the ContextLocaliser class.</p>
<p>After we found the root cause, we speculated that the recent library upgrades had triggered the third-party library’s crash. We tried downgrading the Material libraries to make it work, but that is not good practice, since a lot of UI code in our code base would be affected just to fix this crash, so we investigated further. Good thing there’s a dedicated forum for <strong>ChatSDK</strong> users, and we found out that we’re not alone. Surprisingly, others had been having this issue for five months. Around three months after the issue was reported, one of the company representatives commented that their engineers were working on a fix, but there was no ETA and no damage assessment.</p>
<p>Other users reported that downgrading the material support library fixes it for them, but for us it isn’t an option because appcompat is a dependency for many other (Jetpack) libraries. We are in the middle of a sprint and this feature is critical to our paying users so we have to look for a workaround. Our manager suggested that if we can reverse engineer the aar library and delete the offending line, it might work for us temporarily until <strong>ChatSDK</strong> releases a fix.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/1.png" alt="Before and after modification" /></p>
<p>In our team, we practice pair programming, so I worked with another developer to figure out the feasibility of the workaround.</p>
<p>First, we needed a Java decompiler, and since we were working on different machines, we each tried out decompilers on our own systems.</p>
<p>Step 1 - Extract java classes from the package</p>
<p>Step 2 - Modify Java Bytecode</p>
<p>Step 3 - Verify and repackage it again</p>
<p>Step 4 - Import the repackaged aar together with other dependencies</p>
<p>Then, in our project, we would import the repackaged library manually alongside the other dependencies, instead of fetching it through Gradle, for this workaround.</p>
<p>We tried a couple of decompilers, but we couldn’t modify the code down at the bytecode level and repackage it again.</p>
<p>We then found <a href="https://github.com/Col-E/Recaf">Recaf</a>, a modern Java bytecode editor.</p>
<p>But first, we had to copy and rename the base library, changing the file extension from .aar to .jar, in order to extract the classes folder before we could import it into the bytecode editor.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/2.png" alt="Dependencies" /></p>
<p>We found Recaf easier to use because it has a GUI. Just execute the downloaded jar file and a window will open. Drag in the classes.jar file and browse to the specific class file that you want to edit. Then right-click on the class name in the right-hand window and change the class mode to table.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/4.png" alt="Recaf" /></p>
<p>When on the table view, go to the methods tab and remove the target line.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/5.png" alt="Class mode to table mode" /></p>
<p>Go back to the decompiler view again and see that the method call is now removed!</p>
<p>Now we can export the classes.jar with the modified target class</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/6.png" alt="Target line removed" /></p>
<p>Now we can repackage the entire library and include it in our project manually.</p>
<p>Voila! The workaround works like a charm!</p>
<p>However, there is a side effect to this workaround: the SDK now only works in a single language, English, by default. Still, it’s better than an app crash in the meantime, until the <strong>ChatSDK</strong> developers release a fixed version.</p>
<p>As it turned out, a new version of the SDK was released with the fix the very next day, and we could just use it right away. We had mixed feelings, because just a day before we had gone down to the bytecode level of a library to find a workaround. But at least we learned a lot from it, and it may come in handy whenever we encounter similar problems in the future, so I think it was still worth trying.</p>
<p>It’s also a good thing that we didn’t have to release the workaround into production, and the fix arrived just in time before our sprint ended.</p>
Fri, 21 Aug 2020 10:00:00 +0000
https://devs.quipper.com/2020/08/21/modifying-third-party-library-from-bytecode-level.html
https://devs.quipper.com/2020/08/21/modifying-third-party-library-from-bytecode-level.htmlThe Clean Way to Handle Sendbird Webhook Using Ruby on Rails<h1 id="the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails">The Clean Way to Handle Sendbird Webhook Using Ruby on Rails</h1><p>Hi, as I said before in my <a href="https://medium.com/@rizky.syaban/why-i-choose-and-use-ruby-for-6-years-81e9322d4352">first blog</a>, I want to share about design patterns for Ruby. So I will share substantial reasons for the existence of design patterns and how a design pattern solves common problems in Ruby. To make it clear and understandable, I will explain it using a good example: the Sendbird webhook. Before we start, you can read the documentation <a href="https://docs.sendbird.com/platform/webhooks">here</a>.</p>
<p>Hi, as I said before in my <a href="https://medium.com/@rizky.syaban/why-i-choose-and-use-ruby-for-6-years-81e9322d4352">first blog</a> I want to share about design patterns for Ruby. So I will share substantial reasons for the existence of design patterns and how a design pattern solves your common problems in Ruby. And to make it clear and understandable, I will explain it using a good example: Sendbird Webhook. Before start, you can read the documentation <a href="https://docs.sendbird.com/platform/webhooks">here</a>.</p>
<p><img src="/assets/article_images/2020-04-22-the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails/rails_sendbird.png" alt="rails_sendbird" /></p>
<p>And for those of you who don’t know about design patterns: a design pattern is a general, reusable solution to common problems in Software Engineering. So I think it’s a <strong>must</strong> for a developer to know at least one design pattern, especially a Ruby developer. Why? Because Ruby is flexible, so we need something that can keep our codebases clean and understandable for every developer.</p>
<p>In this blog, I will explain my favorite design pattern, which is the Command Pattern. So are you ready? Let’s start then.</p>
<p>Like other webhook providers, Sendbird only needs one endpoint to handle all kinds of events. That part is very interesting, because the Command Pattern can solve exactly that problem. So first, create a controller for it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**app
|_controllers
|_sendbird_controller.rb**
</code></pre></div></div>
<p>And create a new action on it: webhook. Don’t forget to add it to <em>routes.rb.</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class SendbirdController
def webhook
status: 200
end
end
</code></pre></div></div>
<p>Since Sendbird doesn’t care about our process, just respond with <em>status: 200</em> immediately, and create a worker to handle the payload from Sendbird. Why use a worker? First, because Sendbird only sends the request 3 times until it receives <em>status: 200</em>, while our worker can save the payload and retry the process as many times as we want if we hit a problem, until the problem is gone. Second, because we need to respond immediately to avoid too many requests piling up on our server. Third, hmm, I think that’s it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app
|_controllers
|_sendbird_controller.rb
** |_workers
|_sendbird
|_webhook_worker.rb**
</code></pre></div></div>
<p>And call the worker from <em>sendbird_controller</em>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class SendbirdController
def webhook
::Sendbird::WebhookWorker.perform_later(params)
status: 200
end
end
</code></pre></div></div>
<p>Before we start coding the worker, let’s see Sendbird request params:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
'category': 'open_channel:create',
'created_at': 1540866408000,
'operators': [
{
'user_id': 'Jay',
'nickname': 'Mighty',
'profile_url': 'https://sendbird.com/main/img/profiles/profile_26_512px.png',
'metadata': {}
}
],
'channel': {
'name': 'Jeff and friends',
'channel_url': 'sendbird_open_channel_1_2681099203cd6b78414fe672927a43fcf3a30f09',
'custom_type': '',
'is_distinct': false,
'is_public': false,
'is_super': false,
'is_ephemeral': false,
'is_discoverable': false,
'data': ''
},
'app_id': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
}
</code></pre></div></div>
<p>The params above represent the command, and the <em>category</em> field represents the event of the command, which can serve as the key for the Command Pattern. We can see two parts in the <em>category</em> value: <em>open_channel</em>, which is the resource, and <em>create</em>, which is the event on that resource. If we used the traditional way, the worker code would look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
class WebhookWorker
def perform(params)
if params['category'] == 'open_channel:create'
# do something
elsif params['category] == 'open_channel:update'
# do something
...
end
end
end
end
</code></pre></div></div>
<p>Or</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
class WebhookWorker
def perform(params)
case params['category']
when 'open_channel:create'
# do something
when 'open_channel:update'
# do something
...
end
end
end
end
</code></pre></div></div>
<p>So what will happen next if we want to implement all kinds of events? Can you imagine that? LOL</p>
<p>So here’s the clean way to solve that problem:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
class WebhookWorker
attr_reader :params, :klass
def self.perform(params)
new(params).perform
end
def initialize(params)
module_name, klass_name = params['category'].split(':')
@params = params
@klass = "::Sendbird::Webhook::#{module_name.camelize}::#{klass.camelize}".constantize
end
def perform
klass.new(params).perform
end
end
end
</code></pre></div></div>
<p>To implement the logic for each resource and event, we only need to create a new service. For example, for <em>open_channel:create</em>, create a new service here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app
|_controllers
|_sendbird_controller.rb
**|_services
|_sendbird
|_webhook
|_open_channel
|_create.rb**
|_workers
|_sendbird
|_webhook_worker.rb
</code></pre></div></div>
<p>With this code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
module Webhook
module OpenChannel
class Create
attr_reader params
def initialize(params)
@params = params
end
def perform
# do something when create open_channel event happens
end
end
end
end
end
</code></pre></div></div>
<p>If we want to handle a new event, we simply create a new service. For example, say we now want to handle <em>group_channel:update</em>. Just create a new service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app
|_controllers
|_sendbird_controller.rb
|_services
|_sendbird
|_webhook
**|_group_channel
|_update.rb**
|_open_channel
|_create.rb
|_workers
|_sendbird
|_webhook_worker.rb
</code></pre></div></div>
<p>With this code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
module Webhook
module GroupChannel
class Update
attr_reader params
def initialize(params)
@params = params
end
def perform
# do something when update group_channel event happens
end
end
end
end
end
</code></pre></div></div>
<p>Simple, right? With this approach, we can follow <em>rubocop</em> rules about class and method length and keep each file readable. The trade-off is that you will end up with many files, which is okay for me.</p>
<p>I think that’s all. Thank you!</p>
<p>This post originally shared at <a href="https://medium.com/@rizky.syaban/the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails-334f5123703c">Medium</a></p>
Wed, 22 Apr 2020 00:00:00 +0000
https://devs.quipper.com/2020/04/22/the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails.html
https://devs.quipper.com/2020/04/22/the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails.htmlStyled System in Practice<p>In September of last year, I was assigned to a task force within the Quipper product team. We were formed to deploy a new app to market in roughly three months. Given the tight timeline, agility was a top priority, so every engineering decision had to be carefully considered.</p>
<p>I took it upon myself to prepare a styling framework for the React app we would be building. I was curious to explore a new CSS-in-JS styling methodology I discovered, called <a href="https://styled-system.com/">Styled System</a>. The project had over 5,000 stars on GitHub and apparently GitHub themselves used it to build their own design system.</p>
<p><img src="/assets/article_images/2020-04-06-styled-system-in-practice/primer-components.png" alt="Primer Components is GitHub's design system built with Styled System" /></p>
<p>The CSS-in-JS movement was alive and well by this time, but it was something I was lukewarm to because I hadn’t ever really used it at scale. Outside the JavaScript world, I’ve settled on writing my CSS <a href="https://www.smashingmagazine.com/2013/10/challenging-css-best-practices-atomic-approach/">the Atomic way</a> because it’s served me very reliably through all my previous projects. I wanted a React-y way to do something similar.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><h1</span> <span class="na">class=</span><span class="s">"text-lg font-bold text-center"</span><span class="nt">></span>
I'm being styled with atomic CSS!
<span class="nt"></h1></span>
</code></pre></div></div>
<p><em>An example of Atomic CSS (done with <a href="https://tailwindcss.com/">Tailwind CSS</a>)</em></p>
<p>If you’re not familiar with, or even a fan of, Atomic CSS, I’d encourage that you read <a href="https://adamwathan.me/css-utility-classes-and-separation-of-concerns">this blog post by Adam Wathan</a>—host of <a href="http://www.fullstackradio.com/">the excellent Full Stack Radio podcast</a>—because it chronicles our journey as an industry towards Atomic CSS and the rationale behind it. (I find that it closely parallels my own journey with CSS.) Styled System follows those same ideologies, so naturally, I had to build out the entire styling framework of our app with it. (Thanks for letting me run wild, team!)</p>
<h1 id="a-quick-primer-on-styled-system">A quick primer on Styled System</h1>
<p><img src="/assets/article_images/2020-04-06-styled-system-in-practice/styled-system.png" alt="" /></p>
<p>Styled System is a props-based styling methodology, meaning you style components by passing in styles as props (called <em>style props</em>):</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Text</span> <span class="na">color</span><span class="p">=</span><span class="s">"body"</span> <span class="na">fontSize</span><span class="p">=</span><span class="s">"2"</span><span class="p">></span>
Hello, Styled System!
<span class="p"></</span><span class="nc">Text</span><span class="p">></span>
</code></pre></div></div>
<p>It looks a little like Atomic CSS! Awesome! (Or like inline CSS, but those style rules are applied to your component via auto-generated classes, so they don’t actually create the same issues with specificity.) But, one key difference is, <em>being</em> <em>just plain CSS</em>, Styled System doesn’t require that you memorize different utility class names to apply the styles you want. You use plain old regular (albeit camelCased) CSS.</p>
<p>Take note though that the values being passed in aren’t your typical CSS values. <code class="language-plaintext highlighter-rouge">"body"</code> is not a valid CSS color name and <code class="language-plaintext highlighter-rouge">"2"</code> not a valid value without a corresponding unit. These are actually <em>theme values</em> taken from a global theme object defined at the top level of your application:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">theme</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">colors</span><span class="p">:</span> <span class="p">{</span>
<span class="na">body</span><span class="p">:</span> <span class="dl">"</span><span class="s2">#1e3f6b</span><span class="dl">"</span><span class="p">,</span>
<span class="p">},</span>
<span class="na">fontSizes</span><span class="p">:</span> <span class="p">[</span><span class="mi">12</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">20</span><span class="p">],</span>
<span class="p">};</span>
</code></pre></div></div>
<p><em><code class="language-plaintext highlighter-rouge">color="body"</code> points to <code class="language-plaintext highlighter-rouge">theme.colors.body</code> while <code class="language-plaintext highlighter-rouge">fontSize="3"</code> is <code class="language-plaintext highlighter-rouge">theme.fontSizes[3]</code></em></p>
<p>You can use this to constrain styles within a particular set of rules, like say a brand style guide or a design system. This way, your components can be made to follow the specifications handed to you by your designers (and they don’t have to scold you for being 1 pixel off, again).</p>
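<p>For those theme lookups to actually resolve, the theme object has to be provided at the root of the component tree. Here is a minimal sketch of that wiring, assuming <code class="language-plaintext highlighter-rouge">styled-components</code> (the CSS-in-JS library we paired Styled System with); the file name and import paths are placeholders:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// App.js
import React from "react";
import { ThemeProvider } from "styled-components";
import Text from "./Text";

const theme = {
  colors: { body: "#1e3f6b" },
  fontSizes: [12, 14, 16, 20],
};

// Anything rendered inside the provider can now resolve theme values
// like color="body" or fontSize="2" from its style props.
const App = () => (
  <ThemeProvider theme={theme}>
    <Text color="body" fontSize="2">Hello, Styled System!</Text>
  </ThemeProvider>
);

export default App;
</code></pre></div></div>
<p><em>A sketch of providing the global theme via <code class="language-plaintext highlighter-rouge">ThemeProvider</code></em></p>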
<p>Though to me, the main advantage to styling components this way is how it enables <em>rapid development</em>. Previously, we’d have to write our markup, then open a separate file to manage all our styles, which can become a tiresome exercise in context switching. The <em>worst part</em> of that system though—and it may seem trivial, but really it isn’t—is having to come up with appropriate class names each time.</p>
<p>Sure, it’s easy if we’re talking about naming the primary button on your site, but how about when we’re trying to target a specific button in a specific context within a specific page?</p>
<blockquote>
<p>There are only two hard things in Computer Science: cache invalidation and <em>naming things</em>.
—Phil Karlton</p>
</blockquote>
<p>But even if Styled System allows us to get away with those things, you’re likely not entirely convinced at this point. The number one thing on your mind right now might be:</p>
<blockquote>
<p>Still, why would I want all my styles in my HTML? What is this madness?!</p>
</blockquote>
<p>—which is a fair point and one the other engineers on the team weren’t shy of letting me know. But remember that because all this is happening in JavaScript, it can be easy to abstract away common patterns. If say, a heading used across the site needs a particular set of styles, it wouldn’t be ideal to have to write them over and over! You can actually create a component with all those base style rules passed in by default:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Heading.js</span>
<span class="kd">const</span> <span class="nx">Heading</span> <span class="o">=</span> <span class="p">({</span> <span class="nx">children</span><span class="p">,</span> <span class="p">...</span><span class="nx">props</span> <span class="p">})</span> <span class="o">=></span> <span class="p"><</span><span class="nc">Text</span> <span class="si">{</span><span class="p">...</span><span class="nx">props</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="nx">children</span><span class="si">}</span><span class="p"></</span><span class="nc">Text</span><span class="p">>;</span>
<span class="nx">Heading</span><span class="p">.</span><span class="nx">defaultProps</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontSize</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontWeight</span><span class="p">:</span> <span class="dl">"</span><span class="s2">bold</span><span class="dl">"</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The idea then is that instances of <code class="language-plaintext highlighter-rouge"><Heading></code> will only need to be given <em>context-specific styles</em> like <code class="language-plaintext highlighter-rouge">margin</code> or <code class="language-plaintext highlighter-rouge">textAlign</code>. This way, the styles for headings appearing within larger contexts will only be minimal and all the complex styling can remain in the underlying components.</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ArticleBlock.js</span>
<span class="k">import</span> <span class="nx">Card</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">./Card</span><span class="dl">"</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">Heading</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">./Heading</span><span class="dl">"</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">ArticleBlock</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">Card</span><span class="p">></span>
<span class="p"><</span><span class="nc">Heading</span> <span class="na">mb</span><span class="p">=</span><span class="s">"3"</span><span class="p">></span>Is Styled System the future?<span class="p"></</span><span class="nc">Heading</span><span class="p">></span>
<span class="si">{</span><span class="cm">/* ... */</span><span class="si">}</span>
<span class="p"></</span><span class="nc">Card</span><span class="p">></span>
<span class="p">);</span>
</code></pre></div></div>
<p><em>Styled System also supports property shorthands like <code class="language-plaintext highlighter-rouge">mb</code>, short for <code class="language-plaintext highlighter-rouge">marginBottom</code></em></p>
<p>You can also opt to solve this problem using <a href="https://styled-system.com/variants">the Styled System variants API</a>. Either method works, but my philosophy has been to use <code class="language-plaintext highlighter-rouge">variants</code> for rules specific to the design and components for those specific to the app.</p>
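<p>For reference, here is a rough, hypothetical sketch of what a design-level rule could look like with the <code class="language-plaintext highlighter-rouge">variant</code> helper from <code class="language-plaintext highlighter-rouge">styled-system</code>; the <code class="language-plaintext highlighter-rouge"><Badge></code> component and its variant names are made up for illustration, not something from our codebase:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Badge.js (hypothetical example)
import styled from "styled-components";
import { variant } from "styled-system";

const Badge = styled.span(
  {
    display: "inline-block",
    borderRadius: "4px",
    padding: "4px 8px",
  },
  // variant="info" or variant="warning" picks one of the style sets below;
  // the values go through the same theme-aware lookup as style props
  variant({
    variants: {
      info: { color: "white", bg: "blue" },
      warning: { color: "black", bg: "yellow" },
    },
  })
);

export default Badge;
</code></pre></div></div>
<p><em>Usage would look like <code class="language-plaintext highlighter-rouge"><Badge variant="warning">Heads up!</Badge></code></em></p>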
<h1 id="it-wasnt-all-perfect">It wasn’t all perfect</h1>
<p>All that being said though, style props still became an issue for our team because, even if we were able to limit context-specific styles to no more than 3 lines of props, some components would still require many more of <em>their own props</em> aside from that. This became an ugly mess for components that required a large mix of style and logic props:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Input</span>
<span class="na">flex</span><span class="p">=</span><span class="s">"1"</span>
<span class="na">mt</span><span class="p">=</span><span class="s">"2"</span>
<span class="na">ml</span><span class="p">=</span><span class="s">"3"</span>
<span class="na">type</span><span class="p">=</span><span class="s">"number"</span>
<span class="na">placeholder</span><span class="p">=</span><span class="s">"--"</span>
<span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">score</span><span class="si">}</span>
<span class="na">required</span><span class="p">=</span><span class="si">{</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">disabled</span><span class="p">=</span><span class="si">{</span><span class="o">!</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">onChange</span><span class="p">=</span><span class="si">{</span><span class="nx">handleFormChange</span><span class="si">}</span>
<span class="p">/></span>
</code></pre></div></div>
<p><em>An unfortunate example from our codebase</em></p>
<p>This was our biggest gripe with Styled System because it was difficult having to deal with styling and logic on the same level. When working with Atomic CSS, all the styles are at least confined under a single <code class="language-plaintext highlighter-rouge">className</code> prop, so the problem isn’t as pronounced there.</p>
<p>To address this issue, we thought at first about defining all the styles in separate objects at the top of each file, then spreading them onto each component, like so:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">scoreInputStyles</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">flex</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="na">mt</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="na">ml</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
<span class="p">};</span>
<span class="cm">/**
* Somewhere further down the file
* ...
* ...
*/</span>
<span class="p"><</span><span class="nc">Input</span>
<span class="si">{</span><span class="p">...</span><span class="nx">scoreInputStyles</span><span class="si">}</span>
<span class="na">type</span><span class="p">=</span><span class="s">"number"</span>
<span class="na">placeholder</span><span class="p">=</span><span class="s">"--"</span>
<span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">score</span><span class="si">}</span>
<span class="na">required</span><span class="p">=</span><span class="si">{</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">disabled</span><span class="p">=</span><span class="si">{</span><span class="o">!</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">onChange</span><span class="p">=</span><span class="si">{</span><span class="nx">handleFormChange</span><span class="si">}</span>
<span class="p">/>;</span>
</code></pre></div></div>
<p>But that would eliminate the advantages we talked about earlier! We’re having to move up and down the same file just to define styles. But most of all, who wants to go back to naming things again?!</p>
<p>I then realized that we could just skip the initial declaration by defining the object inline, then spread it directly onto our components like so:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Input</span>
<span class="si">{</span><span class="p">...{</span> <span class="nl">flex</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="nx">mt</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="nx">ml</span><span class="p">:</span> <span class="mi">3</span> <span class="p">}</span><span class="si">}</span>
<span class="na">type</span><span class="p">=</span><span class="s">"number"</span>
<span class="na">placeholder</span><span class="p">=</span><span class="s">"--"</span>
<span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">score</span><span class="si">}</span>
<span class="na">required</span><span class="p">=</span><span class="si">{</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">disabled</span><span class="p">=</span><span class="si">{</span><span class="o">!</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">onChange</span><span class="p">=</span><span class="si">{</span><span class="nx">handleFormChange</span><span class="si">}</span>
<span class="p">/></span>
</code></pre></div></div>
<p><em>Using object notation for your style props</em></p>
<p>Great! Now the style props can appear <em>visually distinct</em> from the rest of the props. This will make it much easier to parse through component files when wanting to focus on programming just business logic.</p>
<h1 id="our-bigger-issue">Our bigger issue</h1>
<p>That wasn’t the end of it though. We also had problems with style props not always working when applied to certain components. Ironically, this was something that occurred <em>by design</em> because Styled System actually recommends <a href="https://styled-system.com/guides/build-a-box#style-props">designing your base components to limit the style props they will accept</a>.</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">Text</span> <span class="o">=</span> <span class="nx">styled</span><span class="p">.</span><span class="nx">span</span><span class="p">(</span>
<span class="p">({</span> <span class="nx">theme</span> <span class="p">})</span> <span class="o">=></span> <span class="nx">css</span><span class="s2">`
color: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">colors</span><span class="p">.</span><span class="nx">text</span><span class="p">}</span><span class="s2">;
font-size: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">fontSizes</span><span class="p">.</span><span class="nx">body</span><span class="p">}</span><span class="s2">px;
font-family: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">fonts</span><span class="p">.</span><span class="nx">main</span><span class="p">}</span><span class="s2">;
line-height: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">lineHeights</span><span class="p">.</span><span class="nx">main</span><span class="p">}</span><span class="s2">;
`</span><span class="p">,</span>
<span class="nx">color</span><span class="p">,</span>
<span class="nx">space</span><span class="p">,</span>
<span class="nx">typography</span>
<span class="p">);</span>
</code></pre></div></div>
<p><em>The initial <code class="language-plaintext highlighter-rouge"><Text></code> component declaration in our app</em></p>
<p>The arguments passed in at the end (<code class="language-plaintext highlighter-rouge">color</code>, <code class="language-plaintext highlighter-rouge">space</code>, and <code class="language-plaintext highlighter-rouge">typography</code>) are what are called <a href="https://styled-system.com/table">style prop functions</a>. They dictate the style props that your components will respond to. Each “allows the passage” of their own group of CSS properties. Something like <code class="language-plaintext highlighter-rouge">border="5px solid black"</code> therefore, won’t work when applied to our <code class="language-plaintext highlighter-rouge"><Text></code> component because that would require the <code class="language-plaintext highlighter-rouge">border</code> style prop function. But we <em>can</em> apply <code class="language-plaintext highlighter-rouge">color</code>, <code class="language-plaintext highlighter-rouge">padding</code>, <code class="language-plaintext highlighter-rouge">margin</code>, and type styles like <code class="language-plaintext highlighter-rouge">fontWeight</code> and others.</p>
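<p>To make that concrete, here is a hypothetical sketch of how we could have extended the list ourselves if <code class="language-plaintext highlighter-rouge"><Text></code> really did need border styles (base styles omitted for brevity):</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical: same pattern as our <Text>, with `border` added to the
// list of style prop functions so border="5px solid black" gets applied.
import styled from "styled-components";
import { border, color, space, typography } from "styled-system";

const BorderableText = styled.span(color, space, typography, border);

// <BorderableText border="5px solid black">Now it works</BorderableText>
</code></pre></div></div>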
<p>The intent is to prevent components from deviating from their intended design—which is a reasonable argument—but it slowed our team down more than anything! Styles sometimes didn’t <em>just work</em>. And this happened often enough that after about the <em>nth</em> time or so, I realized that the whole thing is more trouble than it’s worth. We wouldn’t be applying these styles if they didn’t need to be there one way or another!</p>
<p>To get around this problem, <a href="https://styled-system.com/guides/build-a-box#extending">the documentation suggests two possible solutions</a>—neither without their quirks. The first is to extend your components via the <code class="language-plaintext highlighter-rouge">styled</code> function of <a href="https://styled-components.com/">the <code class="language-plaintext highlighter-rouge">styled-components</code> library</a>, then apply any additional styling through there, but this created the same issues as with defining objects like we did earlier.</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nx">styled</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">styled-components</span><span class="dl">"</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">CustomButton</span> <span class="o">=</span> <span class="nx">styled</span><span class="p">(</span><span class="nx">Button</span><span class="p">)</span><span class="s2">`
background-color: transparent;
float: right;
`</span><span class="p">;</span>
<span class="cm">/**
* Somewhere further down the file
* ...
* ...
*/</span>
<span class="p"><</span><span class="nc">CustomButton</span><span class="p">></span>Download<span class="p"></</span><span class="nc">CustomButton</span><span class="p">>;</span>
</code></pre></div></div>
<p><em>Scroll, scroll, scroll, scroll</em></p>
<p>Alternatively, <code class="language-plaintext highlighter-rouge">styled-components</code> also provides a <code class="language-plaintext highlighter-rouge">css</code> prop that will allow you to inline styles on any CSS property of your choosing, but it creates a messy API for our components because it leaves half your styles inside the <code class="language-plaintext highlighter-rouge">css</code> prop and half outside. How can we tell when to use which? Talk about confusing!</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Text</span>
<span class="si">{</span><span class="p">...{</span> <span class="nl">textAlign</span><span class="p">:</span> <span class="dl">"</span><span class="s2">center</span><span class="dl">"</span><span class="p">,</span> <span class="nx">fontSize</span><span class="p">:</span> <span class="mi">2</span> <span class="p">}</span><span class="si">}</span>
<span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="p">{</span> <span class="na">flexGrow</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="na">justifySelf</span><span class="p">:</span> <span class="dl">"</span><span class="s2">flex-end</span><span class="dl">"</span> <span class="p">}</span><span class="si">}</span>
<span class="p">/></span>
</code></pre></div></div>
<p>The bigger issue here though is that theme values no longer work inside the <code class="language-plaintext highlighter-rouge">css</code> prop, which basically brings us down to the level of writing inline styles—yikes! Fortunately, Styled System has <a href="https://styled-system.com/css">an external <code class="language-plaintext highlighter-rouge">css</code> function helper package</a>, which addresses just that issue. It opens us up to the core functionality of Styled System without the arbitrary constraints.</p>
<p>Now, we can have the benefit of applying styles to any property (through the <code class="language-plaintext highlighter-rouge">css</code> <em>prop</em>) with the ability to use theme values at the same time (via the <code class="language-plaintext highlighter-rouge">css</code> <em>function</em>)!</p>
<h1 id="getting-there">Getting there…</h1>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nx">css</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">@styled-system/css</span><span class="dl">"</span><span class="p">;</span>
<span class="p"><</span><span class="nc">Text</span> <span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="nx">css</span><span class="p">({</span> <span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span> <span class="p">})</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="dl">"</span><span class="s2">I'm color #1e3f6b!</span><span class="dl">"</span><span class="si">}</span><span class="p"></</span><span class="nc">Text</span><span class="p">>;</span>
</code></pre></div></div>
<p><em><code class="language-plaintext highlighter-rouge">css</code> prop + <code class="language-plaintext highlighter-rouge">css</code> function = ✨</em></p>
<p>From our experience, the best way to go is to pair the <code class="language-plaintext highlighter-rouge">styled-components</code> <code class="language-plaintext highlighter-rouge">css</code> <em>prop</em> with the <code class="language-plaintext highlighter-rouge">styled-system</code> <code class="language-plaintext highlighter-rouge">css</code> <em>function</em> and just leave style props by the wayside. Not only do we get themed CSS by styling our components this way, but—going back to our first issue with style props—because everything is confined to a single prop, styling and logic can still remain separate.</p>
<p>The syntax feels a bit redundant right now, but we can fix that by abstracting the <code class="language-plaintext highlighter-rouge">css</code> <em>function</em> inside of our component declarations. Therefore, instead of defining your components the way we did earlier, write them like this instead:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">Text</span> <span class="o">=</span> <span class="p">({</span> <span class="na">css</span><span class="p">:</span> <span class="nx">contextStyles</span><span class="p">,</span> <span class="nx">children</span><span class="p">,</span> <span class="p">...</span><span class="nx">props</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nt">span</span>
<span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="nx">css</span><span class="p">({</span>
<span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">text</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontSize</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontFamily</span><span class="p">:</span> <span class="dl">"</span><span class="s2">main</span><span class="dl">"</span><span class="p">,</span>
<span class="na">lineHeight</span><span class="p">:</span> <span class="dl">"</span><span class="s2">main</span><span class="dl">"</span><span class="p">,</span>
<span class="p">...</span><span class="nx">contextStyles</span><span class="p">,</span>
<span class="p">})</span><span class="si">}</span>
<span class="si">{</span><span class="p">...</span><span class="nx">props</span><span class="si">}</span>
<span class="p">></span>
<span class="si">{</span><span class="nx">children</span><span class="si">}</span>
<span class="p"></</span><span class="nt">span</span><span class="p">></span>
<span class="p">);</span>
</code></pre></div></div>
<p><em>First, pass in the default styles, then layer any of the provided styles on top through the <code class="language-plaintext highlighter-rouge">css</code> function</em></p>
<h1 id="the-holy-grail">The Holy Grail?</h1>
<p>Did you catch all that? Now, we can write our styles like this:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Text</span> <span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="p">{</span> <span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span> <span class="p">}</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="dl">"</span><span class="s2">I'm color #1e3f6b!</span><span class="dl">"</span><span class="si">}</span><span class="p"></</span><span class="nc">Text</span><span class="p">></span>
</code></pre></div></div>
<p>At this point, we might not even need the main <code class="language-plaintext highlighter-rouge">styled-system</code> package and could get away with just <code class="language-plaintext highlighter-rouge">@styled-system/css</code>. We’d still need several of the utilities from <code class="language-plaintext highlighter-rouge">styled-components</code> (like the <code class="language-plaintext highlighter-rouge">css</code> prop), but consider it a win to be able to drop the main dependency altogether and rely on just Styled System’s core functionality! (And if you’re a bit more advanced and are wondering, <em>yes</em>, this does still allow us to use Styled System’s <a href="https://styled-system.com/responsive-styles">array props for responsive styles</a>.)</p>
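<p>As a quick illustration of those responsive styles under this setup: array values are resolved against the theme’s <code class="language-plaintext highlighter-rouge">breakpoints</code> scale, mobile-first, with the first entry applying everywhere and each subsequent entry kicking in at the next breakpoint. A small sketch, assuming an array-based <code class="language-plaintext highlighter-rouge">fontSizes</code> scale like the one shown earlier:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// fontSizes[1] by default, fontSizes[2] from the first breakpoint up,
// and fontSizes[3] from the second breakpoint up
<Text css={{ fontSize: [1, 2, 3], textAlign: ["center", "left"] }}>
  I scale up as the viewport grows!
</Text>
</code></pre></div></div>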
<p>Unfortunately for our project, I only figured all this out <em>after</em> we had shipped, but if we were to go through it all again, I would have done it this way 100%. This set-up, while still preserving the core of Styled System, would also have saved us our biggest gripes with it. No mixing of style and logic props. No more arbitrary style prop constraints.</p>
<p>Just simple, isolated, and reliable styling.</p>
Mon, 06 Apr 2020 08:38:00 +0000
https://devs.quipper.com/2020/04/06/styled-system-in-practice.html
https://devs.quipper.com/2020/04/06/styled-system-in-practice.htmlLife as a Vim User at Quipper<p>Life as a Vim user is not an easy life, but it’s not a tough one either. It’s an exciting life to be a Vim user. Vim itself is a unique text editor: natively, it is a terminal-based text editor with little to no graphical user interface. Vim uses the keyboard as its main user interface, which is quite different from typical modern text editors. There are so many things to learn about Vim.</p>
<h2 id="about-me">About Me</h2>
<p>I’m a Software Engineer. I write code mostly in Ruby and Go, and recently I’ve been trying TypeScript. I consider myself a casual Vim user; I’m by no means an expert in Vim. Experience-wise I’m new: I’ve been coding in Vim only for the last 2 years.</p>
<p>Even though I’ve been using Vim for 2 years, I barely know Vim, especially its native keystrokes. It’s only in the past 3-4 months that I’ve started to learn Vim’s native keystrokes by using vanilla Vim. I’ve learned a lot of keystrokes, and it’s not easy. Vim’s <code class="language-plaintext highlighter-rouge">:help</code> is very handy and has helped me so much.</p>
<h2 id="vim-at-quipper">Vim at Quipper</h2>
<p>Fortunately, Quipper has a solid Vim community. More than 20 percent of Quipper’s software engineers use Vim at the moment, and we have several regular activities. We have <code class="language-plaintext highlighter-rouge">quipper.vim</code>, which is a sharing session; sometimes we do pair programming in Vim; and the most important thing is that it’s a nice and welcoming community! Together, we help each other do the best we can at Vim.</p>
<p>For me, sometimes <code class="language-plaintext highlighter-rouge">:help</code> is not enough, for example when I want to know about other people’s daily Vim usage, or which plugins to use together to achieve a better flow. I’m very glad that I joined the Vim community at Quipper. There is a regular Vim discussion and sharing session called <code class="language-plaintext highlighter-rouge">quipper.vim</code>, where the main agenda is to share one’s daily Vim operations. I’ve gotten tons of new insights and plugin recommendations from it. It’s a great session!</p>
<p>The other thing is the people. The Quipper Vim community is a nice and welcoming community. People are open to discussion, new ideas, even simple questions. I never felt intimidated asking <em>noobish</em> questions there. When I first joined, I was even encouraged to be more active in the wider Vim community and to attend VimConf, even though I was a new joiner. It’s an amazing community.</p>
<p>Last but not least is the chance to contribute back to the Vim community and learn from it. Quipper has been sponsoring VimConf regularly since VimConf 2018. Not only has Quipper sponsored VimConf, but Quipper also sponsored me to attend and speak at VimConf! I’ve always wanted to speak at a conference to contribute and share what I’ve learnt. Thanks to Quipper, I gave my first conference talk and also learned a lot. Quipper covered all my travel and accommodation from Jakarta to Tokyo as part of its international conference package. It’s nice to know that the company I work for cares about my growth!</p>
<p>I’m very grateful that I joined Quipper and became a member of the Quipper Vim community. It’s a blessing for me: I can learn many things and meet so many nice people!</p>
Tue, 19 Nov 2019 00:00:00 +0000
https://devs.quipper.com/2019/11/19/life-as-a-vim-user-at-quipper.html
https://devs.quipper.com/2019/11/19/life-as-a-vim-user-at-quipper.htmlSRE Operation Trails<h2 id="intro">Intro</h2>
<p>Hello! This is <a href="https://github.com/rbmrclo">@rbmrclo</a> from Site Reliability Engineering team.
Today, let me share about <strong>“Operation Trails”</strong> (a term we use in our team) which is an important part of our workflow when performing tasks that involve manual operation.</p>
<h2 id="background">Background</h2>
<p>In the SRE team, we have a 50/50 rule for how we manage our time every day.</p>
<p>To summarise, half of our day usually goes to <strong>proactive</strong> tasks which are generally the main projects that contribute to our growth as a diverse tech team (we usually have a roadmap for this).
The rest of our time is spent on <strong>reactive</strong> tasks which are essential to maintain the stability and reliability of our services, as well as to keep the development speed stable across each team.</p>
<p>It can be visualised in blocks like this:</p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/proactive-reactive-sre.png" alt="SRE tasks in Quipper" /></p>
<p>In this article, I will be focusing on our <strong>reactive tasks</strong> and explain in detail how we manage to work seamlessly within our team and avoid <em>mottainai</em> (I’ll be explaining this later).</p>
<h2 id="daily-situation">Daily Situation</h2>
<p>As a global company, each SRE member attends to the needs of multiple teams in different timezones. This also means that each member is working at their own pace.</p>
<p>Some members might be working on a normal routine today with their proactive tasks; some will be performing a maintenance task tonight (midnight!); and some might already be attending to a service outage incident while I’m writing this blog post!</p>
<p>Let’s illustrate that again with my favorite blocks.</p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/sre-isolated.png" alt="" /></p>
<p>My point here is that most of the time, each of us is working in an isolated manner. However, there’s one exception and this is when <strong>Operation Trails</strong> comes in.</p>
<h2 id="operation-trails-for-reactive-tasks">Operation Trails for Reactive Tasks</h2>
<p>Imagine that you are working on a task, with your headphones on, enjoying your favorite bubble milk tea, listening to a Queen playlist, in the zone and not to be disturbed by humans.</p>
<p>Suddenly, an alert is triggered for a specific monitor. Say the staging cluster died; hence, no developers can connect to the staging servers to test their newly implemented features - a major blocker!</p>
<p><strong>Call of duty.</strong> Upon receiving the alert message, you quickly checked the issue and created an <strong>Operation Trail</strong>.</p>
<ul>
<li>First, you informed the other SRE team members that you are now checking the issue.
<ul>
<li>You are now considered as the assigned person. (ownership is part of our culture!)</li>
<li>This is also when the operation trail starts.</li>
<li>All SRE members are now informed that someone is checking the issue. They are also watching the operation trail in parallel.</li>
</ul>
</li>
<li>Next, continuously post updates of what you’re currently doing. (who did what when - like audit trails!)
<ul>
<li>While posting updates, other SRE team members could either give suggestions, join the ongoing operation, or just watch the trail. (it all depends on the severity of the situation)</li>
</ul>
</li>
<li>Lastly, you inform everyone when the task is finished or when the issue has been resolved. :tada:</li>
</ul>
<h4 id="heres-the-birds-eye-view-of-what-happened">Here’s the bird’s eye view of what happened.</h4>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/proactive-response.gif" alt="Responding to alert (reactive task)" /></p>
<p><strong>:memo: Every operation is in a single thread</strong></p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/operation-trail.png" alt="" /></p>
<p><strong>:bell: Live reporting</strong></p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/actual-trail-1.png" alt="" /></p>
<p><strong>:white_check_mark: Avoid operation conflicts by using call-to-actions</strong></p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/cta.png" alt="" /></p>
<h2 id="summary">Summary</h2>
<h4 id="slack-threads">Slack Threads</h4>
<ul>
<li>In simple terms, operation trails are chat-based and happen real-time. We fully utilize slack threads for these.</li>
<li>An SRE member can start an operation trail and resolve it by himself/herself, or another SRE member can join the trail to speed up resolving the task at hand.</li>
</ul>
<h4 id="avoid-mottainai-もったいない">Avoid <a href="https://en.wikipedia.org/wiki/Mottainai"><strong>Mottainai</strong></a> (もったいない)</h4>
<blockquote>
<p>The term in Japanese conveys a sense of regret over waste; the exclamation “Mottainai!” can translate as “What a waste!”</p>
</blockquote>
<ul>
<li>By establishing a live reporting culture in your team, you can eliminate wasted time.
<ul>
<li>For example, when an SRE member announces that he/she is already responding to the issue, the other SRE members can just watch the trail while working on their current tasks normally. They don’t need to pause their own work, maximizing the use of their time.</li>
</ul>
</li>
<li>By actively posting updates in the operation trail, other members can provide relevant suggestions or possible solutions in order to speed up the operation.</li>
</ul>
<h4 id="being-a-team-player">Being a team-player</h4>
<ul>
<li>Operation Trails improve an individual’s communication skills, since they have to explain what’s happening and what they are doing.</li>
<li>As a spectator of the trail, you can determine whether the operation is going smoothly or a call for help is needed - evolving into a “pair operation”.</li>
<li>It also improves harmony in the team since this is one of the times when all of us in SRE team can meet and collaborate with each other, given that we have individual tasks too.</li>
</ul>
<h4 id="acknowledgements">Acknowledgements</h4>
<ul>
<li>There’s also a <a href="https://blog.kyanny.me/entry/2016/11/11/021955">blog post in japanese</a> which is the main inspiration of this post.</li>
<li>Many thanks to all SRE members for supporting and adopting this culture. (especially <a href="https://github.com/lamanotrama">@lamanotrama</a> who introduced this during his time in Quipper)</li>
</ul>
<p>Do you also have a similar live reporting culture in your team? Share it in the comments below and let’s discuss!
We are <a href="https://career.quipper.com/jp/jobs/sre/">hiring SRE members</a>. Check it out!</p>
Tue, 21 May 2019 00:00:00 +0000
https://devs.quipper.com/2019/05/21/sre-operation-trails.html
https://devs.quipper.com/2019/05/21/sre-operation-trails.html