<h1>How We Migrated Millions of Data Without Downtime</h1>
<p>Recently, my team and I managed to migrate millions of our users’ records with no downtime. In this post, I’m going to share why we did it, how we did it, and what we learned from it.</p>
<h2 id="background--why-do-we-need-to-migrate-anyway">Background – Why Do We Need To Migrate, anyway?</h2>
<p>Initially, our organization relied on a single monolithic database (MongoDB) to manage a wide range of functions, including user identification, authentication, authorization, content checking, and payments. However, as the demands on our system have grown, it has become clear that this single, all-encompassing database hinders our ability to manage and expand our services efficiently and effectively.</p>
<p>To achieve separation of concerns and improve efficiency, our organization decided to undertake a database migration project. Through this migration, we aim to create dedicated databases for specific functions, such as authentication & authorization, two-factor authentication, and passwordless login.</p>
<p>In short:<br />
We need a dedicated database to store users’ account data – for authentication and/or authorization purposes.</p>
<h2 id="setup-the-goals">Setup The Goals</h2>
<p>Before the migration, our architecture looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673317805539/533b23a0-c64d-4110-92a9-69ef788cb84c.png" alt="Original architecture before the database separation/migration" /></p>
<p>It has changed to this:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673317935043/06d00324-3954-4137-b2cd-3ce659ab98d5.png" alt="The Architecture after implementing the migration" /></p>
<p>Please note that in this post we refer to the Main DB as the old database and the new Account DB as the new database.</p>
<h2 id="research--preparation">Research & Preparation</h2>
<blockquote>
<p>“If I only had an hour to chop down a tree, I would have spent the first 45 minutes sharpening my axe.” – Abraham Lincoln</p>
</blockquote>
<p>This is the most important step. Before starting the migration, we spent time on preparation and research to make sure the migration would run as expected, with minimal (or even zero) impact on the live production environment.</p>
<p>In this step, we identified the entities to be migrated, calculated their size, analyzed what kind of database we needed for the new Account DB, and finally defined the migration steps.</p>
<h3 id="identify-entities-to-migrate">Identify Entities To Migrate</h3>
<p>After analyzing the current implementation of the <code class="language-plaintext highlighter-rouge">auth</code> service, we found that we needed to migrate at least five entities (called <code class="language-plaintext highlighter-rouge">collections</code> in MongoDB, or <code class="language-plaintext highlighter-rouge">tables</code> in SQL databases). Those collections are: <code class="language-plaintext highlighter-rouge">users</code>, <code class="language-plaintext highlighter-rouge">access_tokens</code>, <code class="language-plaintext highlighter-rouge">teams</code>, <code class="language-plaintext highlighter-rouge">authentications</code>, and <code class="language-plaintext highlighter-rouge">clients</code>.</p>
<p>In total, around 32 million documents – about 25 GiB of data – had to be migrated from our <code class="language-plaintext highlighter-rouge">Main DB</code> (old DB) to the new <code class="language-plaintext highlighter-rouge">Account DB</code>.</p>
<h3 id="what-kind-of-database-do-we-need">What Kind of Database do We Need?</h3>
<p>As mentioned earlier, our Main DB runs on a MongoDB server, and it runs very well. So, what kind of database are we going to use in this separation/migration project?</p>
<p>This might be debatable, but after several discussions we went with MongoDB (again? Yes). Here are our considerations:</p>
<ul>
<li>Easy to scale – High Availability</li>
<li>We don’t need strong transaction guarantees here</li>
<li>No requirements for joining many different types of data</li>
<li>MongoDB also provides the <a href="https://www.mongodb.com/docs/manual/core/index-ttl/">Time To Live Index</a> – we can auto-delete expired <code class="language-plaintext highlighter-rouge">access_tokens</code> without adding an additional service such as a job scheduler (see the sketch after this list)</li>
</ul>
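<p>As a quick illustration, a TTL index on <code class="language-plaintext highlighter-rouge">access_tokens</code> could be created with the official MongoDB Go driver roughly as follows. This is only a sketch, not our production code; the connection string, database name, and the <code class="language-plaintext highlighter-rouge">expires_at</code> field name are assumptions for the example.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(ctx)

    // TTL index: MongoDB removes a document automatically once its `expires_at`
    // time is in the past. Database, collection, and field names are assumptions.
    coll := client.Database("account").Collection("access_tokens")
    _, err = coll.Indexes().CreateOne(ctx, mongo.IndexModel{
        Keys:    bson.D{{Key: "expires_at", Value: 1}},
        Options: options.Index().SetExpireAfterSeconds(0),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("TTL index created")
}
</code></pre></div></div>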
<h3 id="migration-strategy">Migration Strategy</h3>
<p><strong>Can we do this migration in one go and be done with it?</strong><br />
The answer is <strong>NO</strong>. Many services currently read from and write to the collections we need to migrate in the Main DB, so we can’t simply move those collections to the new DB and switch everything to read/write from/to the new DB at once.</p>
<p>In the simplest terms, yes, the plan is just to migrate all of the data from the old DB to the new one and then switch the services to read/write to the new DB. Unfortunately, it is not that straightforward, because we are working with a monolithic database that receives a lot of read and write requests from many services. Instead, we have to split the overall migration process into several granular steps or phases.</p>
<p><strong>How do we actually migrate the data?</strong>
So, let’s split the migration process into several actionable phases:</p>
<ul>
<li>“Double Write” Any New Changes</li>
<li>Copy data from the Main DB (old DB) to the new Account DB – Do we need downtime?</li>
<li>Make sure the data in the Main DB and the new Account DB is always in sync</li>
<li>Once we’re confident enough, we can start reading from the new Account DB</li>
<li>Then, finally, we can stop writing related data to the Main DB</li>
</ul>
<h2 id="migration-implementation">Migration Implementation</h2>
<h3 id="double-write-any-new-changes--aka-replication">“Double Write” Any New Changes – a.k.a Replication</h3>
<p>At this point, we need a mechanism for synchronizing any new changes in the Main DB (old DB) to the new Account DB, so that every change made to the Main DB is mirrored in the new Account DB. How? Here are the solutions we considered:</p>
<p><em>It’s important to note that there are other options available, but for the purpose of this post, we will be focusing specifically on the comparison of two options, and explaining why we chose one over the other.</em></p>
<p><strong>Synchronization via API end-points</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673574181465/32bb1758-c092-41a3-bdd3-9db535621673.png" alt="Synchronization via API end-points Architecture" /></p>
<p>So, here is the flow:</p>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">auth</code> service establishes connections to the new Account DB</li>
<li>The <code class="language-plaintext highlighter-rouge">auth</code> service reserves API end-points for synchronizing new changes to <code class="language-plaintext highlighter-rouge">users</code>, <code class="language-plaintext highlighter-rouge">access_tokens</code>, <code class="language-plaintext highlighter-rouge">teams</code>, <code class="language-plaintext highlighter-rouge">clients</code>, and <code class="language-plaintext highlighter-rouge">authentications</code> collections in the new <code class="language-plaintext highlighter-rouge">Account DB</code></li>
<li>Quipper Platforms such as: <code class="language-plaintext highlighter-rouge">LEARN API</code>, <code class="language-plaintext highlighter-rouge">LINK API</code>, <code class="language-plaintext highlighter-rouge">Back-Office</code>, and other services will act as the clients of the API end-points defined by <code class="language-plaintext highlighter-rouge">auth</code> service</li>
<li>On every change that happens to those collections in Main DB, Quipper Platforms will make an HTTP (or RPC) call to <code class="language-plaintext highlighter-rouge">auth</code> service to synchronize those changes</li>
<li>The <code class="language-plaintext highlighter-rouge">auth</code> service will receive each request and process it (basically, CRUD to the new Account DB)</li>
</ul>
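<p>To make the flow above a bit more concrete, here is a rough sketch of what one of those synchronization end-points could have looked like in the <code class="language-plaintext highlighter-rouge">auth</code> service. It is purely illustrative – we did not build it this way in the end – and the route, payload shape, database name, and helper names are assumptions.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "encoding/json"
    "log"
    "net/http"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// syncUserHandler would receive a changed user document from a platform service
// and upsert it into the new Account DB. Upserting by _id keeps retries idempotent.
func syncUserHandler(accountDB *mongo.Database) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var user bson.M
        if err := json.NewDecoder(r.Body).Decode(&user); err != nil {
            http.Error(w, "invalid payload", http.StatusBadRequest)
            return
        }
        _, err := accountDB.Collection("users").ReplaceOne(
            r.Context(),
            bson.M{"_id": user["_id"]},
            user,
            options.Replace().SetUpsert(true),
        )
        if err != nil {
            http.Error(w, "failed to sync user", http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }
}

func main() {
    ctx := context.Background()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://account-db:27017"))
    if err != nil {
        log.Fatal(err)
    }
    http.Handle("/internal/sync/users", syncUserHandler(client.Database("account")))
    log.Fatal(http.ListenAndServe(":8080", nil))
}
</code></pre></div></div>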
<p>We think this solution is quite simple, but there are several drawbacks:</p>
<ul>
<li>This solution is not reliable enough – it is prone to network failures</li>
<li>It would also put a huge additional load on the <code class="language-plaintext highlighter-rouge">auth</code> service</li>
<li>This “mirroring” process is not the <code class="language-plaintext highlighter-rouge">auth</code> service’s responsibility anyway – we don’t want to risk our <code class="language-plaintext highlighter-rouge">auth</code> service going down while handling requests that are not even its responsibility</li>
<li>And we would have to make changes in many places: <code class="language-plaintext highlighter-rouge">API Learn</code>, <code class="language-plaintext highlighter-rouge">Educator API</code>, <code class="language-plaintext highlighter-rouge">Back-Office</code>, etc.</li>
</ul>
<p>So, we didn’t choose this option.</p>
<p><strong>Change Data Capture: The MongoDB Changestreams</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673326943934/1d5c9ef6-b038-4f5f-b225-957c603d2d7c.png" alt="Synchronization via MongoDB ChangeStreams" /></p>
<p>After careful consideration, we chose this option. Here, we fully utilize the MongoDB feature called <a href="https://www.mongodb.com/docs/manual/changeStreams/">Change Streams</a>. Our Main DB can stream every event/change that happens inside it, and at the other end, our app can watch/listen to every streamed event and process it further.</p>
<p>We also introduced a new service called <code class="language-plaintext highlighter-rouge">auth-double-writer</code>, written in Go. Its responsibility is to <em>replicate any changes</em>: it watches (listens to) every change that happens to the relevant collections in the Main DB and writes those changes to the new Account DB.</p>
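<p>To make this more concrete, here is a simplified sketch of such a watcher using the official MongoDB Go driver. It is only an illustration of the idea, not our actual <code class="language-plaintext highlighter-rouge">auth-double-writer</code> code: the connection strings and database names are assumptions, and real code also needs retries, resume-token handling (shown later), and more careful error handling.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "log"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    oldClient, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://main-db:27017"))
    if err != nil {
        log.Fatal(err)
    }
    newClient, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://account-db:27017"))
    if err != nil {
        log.Fatal(err)
    }

    // Watch only the collections we migrate and only the event types we care about.
    pipeline := mongo.Pipeline{
        bson.D{{Key: "$match", Value: bson.D{
            {Key: "ns.coll", Value: bson.D{{Key: "$in", Value: bson.A{
                "users", "access_tokens", "teams", "authentications", "clients",
            }}}},
            {Key: "operationType", Value: bson.D{{Key: "$in", Value: bson.A{
                "insert", "update", "replace", "delete",
            }}}},
        }}},
    }
    opts := options.ChangeStream().SetFullDocument(options.UpdateLookup)
    stream, err := oldClient.Database("main").Watch(ctx, pipeline, opts)
    if err != nil {
        log.Fatal(err)
    }
    defer stream.Close(ctx)

    for stream.Next(ctx) {
        collName := stream.Current.Lookup("ns", "coll").StringValue()
        docKey := stream.Current.Lookup("documentKey").Document()
        target := newClient.Database("account").Collection(collName)

        switch stream.Current.Lookup("operationType").StringValue() {
        case "delete":
            _, err = target.DeleteOne(ctx, docKey)
        default: // insert, update, replace: upsert the full document
            fullDoc, ok := stream.Current.Lookup("fullDocument").DocumentOK()
            if !ok {
                continue // e.g. the document was deleted before the lookup
            }
            _, err = target.ReplaceOne(ctx, docKey, fullDoc, options.Replace().SetUpsert(true))
        }
        if err != nil {
            log.Printf("failed to replicate change to %s: %v", collName, err)
        }
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}
</code></pre></div></div>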
<h4 id="mongodb-change-streams">MongoDB Change Streams</h4>
<blockquote>
<p>Change Streams allows applications to access real-time data changes without the complexity and risk of tailing the <a href="https://www.mongodb.com/docs/manual/reference/glossary/#std-term-oplog">oplog</a>. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them.</p>
</blockquote>
<ul>
<li>Change Streams is available for <a href="https://www.mongodb.com/docs/manual/replication/">replica sets</a> and <a href="https://www.mongodb.com/docs/manual/sharding/">sharded clusters</a></li>
<li>Watch a Collection, Database, or Deployment</li>
<li>Modify Change Stream Output – <code class="language-plaintext highlighter-rouge">$addFields</code>, <code class="language-plaintext highlighter-rouge">$match</code>, <code class="language-plaintext highlighter-rouge">$project</code>, etc</li>
<li>MongoDB Changestream is resumable – <code class="language-plaintext highlighter-rouge">resumeAfter</code>, <code class="language-plaintext highlighter-rouge">startAfter</code></li>
<li>Use Cases:
<ul>
<li>Extract, Transform, and Load (ETL) services</li>
<li>Cross-platform synchronization</li>
<li>Collaboration functionality</li>
<li>Notification services</li>
</ul>
</li>
<li>Change Events (v6.0):
<ul>
<li><code class="language-plaintext highlighter-rouge">create</code></li>
<li><code class="language-plaintext highlighter-rouge">createIndexes</code></li>
<li><code class="language-plaintext highlighter-rouge">delete</code></li>
<li><code class="language-plaintext highlighter-rouge">drop</code></li>
<li><code class="language-plaintext highlighter-rouge">dropDatabase</code></li>
<li><code class="language-plaintext highlighter-rouge">dropIndexes</code></li>
<li><code class="language-plaintext highlighter-rouge">insert</code></li>
<li><code class="language-plaintext highlighter-rouge">invalidate</code></li>
<li><code class="language-plaintext highlighter-rouge">modify</code></li>
<li><code class="language-plaintext highlighter-rouge">rename</code></li>
<li><code class="language-plaintext highlighter-rouge">replace</code></li>
<li><code class="language-plaintext highlighter-rouge">shardCollection</code></li>
<li><code class="language-plaintext highlighter-rouge">update</code></li>
</ul>
</li>
</ul>
<h3 id="copy-data-from-old-db-to-the-new-db">Copy Data from Old DB to the New DB</h3>
<p>We have two options on how we will execute the actual migration step. The first one is with downtime required, and the second one is without downtime.</p>
<p><strong>With Downtime</strong>
Here are the steps:</p>
<ol>
<li>Turn off <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
<li>Turn off Quipper services (downtime)</li>
<li>Run the job to copy the data from Main DB to the New Account DB</li>
<li>Turn back on <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
<li>Turn back on Quipper services</li>
</ol>
<p><strong>Without Downtime</strong>
It is possible to run the migration without downtime since our <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service has a “pause and resume” capability (thanks to MongoDB Changestream’s resume token). Here are the steps:</p>
<ol>
<li>Turn off <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
<li>Run the job to copy the data from Main DB to the New Account DB</li>
<li>Turn back on <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service</li>
</ol>
<p>We determined that the latter option is preferable: a migration with no downtime is the best course of action, as it maintains continuity of service for our users and minimizes potential disruption.</p>
<h3 id="execute-the-migration-with-zero-downtime">Execute The Migration With Zero-Downtime</h3>
<p>We finally managed to execute the migration with zero downtime (cheers!). The question is: how did we do that? Let me explain.</p>
<p>The main reason we managed to execute the migration with zero downtime is that <strong>MongoDB Change Streams are resumable</strong>, and <code class="language-plaintext highlighter-rouge">auth-double-writer</code> takes full advantage of this capability.</p>
<p>We designed <code class="language-plaintext highlighter-rouge">auth-double-writer</code> so that it can be paused and then resume from the point where it left off. It can therefore keep listening to the event stream as if there had been no disruption.</p>
<p>This is what actually happened:</p>
<ul>
<li>When we turn off the <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service, it stores the last Change Stream resume token it has processed in a persistent datastore (see the sketch after this list)</li>
<li>Then, we executed our main task: copy the data from the Main DB to the new Account DB
<ul>
<li>We’ve carefully tested this step</li>
<li>We’ve run the job several times before (for testing purposes)</li>
<li>Based on the test results, we calculated that on average this job takes 30 minutes to run. This is a safe number, since our MongoDB oplog retains the Change Stream events for about one hour</li>
</ul>
</li>
<li>We turned the <code class="language-plaintext highlighter-rouge">auth-double-writer</code> service back on. It picks up the resume token from the datastore, so it continues listening from the point at which it was turned off</li>
<li>We checked data integrity and compared the size and the number of records between the Main DB and the new Account DB. Thankfully, everything matched</li>
<li>Now the data is fully replicated from the Main DB to the new Account DB and kept in sync</li>
</ul>
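<p>For illustration, the pause-and-resume behaviour could look roughly like the sketch below. This is a minimal sketch, not our actual implementation: the file path, database name, and connection string are assumptions, and where the token is persisted (a file, Redis, a MongoDB collection, etc.) is an implementation choice.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "context"
    "log"
    "os"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

const tokenFile = "/var/lib/auth-double-writer/resume-token.bson"

// loadResumeToken returns the last persisted resume token, or nil on the first run.
func loadResumeToken() bson.Raw {
    data, err := os.ReadFile(tokenFile)
    if err != nil {
        return nil
    }
    return bson.Raw(data)
}

// saveResumeToken persists the token after each processed event, so a restarted
// process can resume exactly where the previous one stopped.
func saveResumeToken(token bson.Raw) error {
    return os.WriteFile(tokenFile, token, 0o600)
}

func main() {
    ctx := context.Background()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://main-db:27017"))
    if err != nil {
        log.Fatal(err)
    }

    opts := options.ChangeStream().SetFullDocument(options.UpdateLookup)
    if token := loadResumeToken(); token != nil {
        // Resume from the saved position; the oplog must still contain this token.
        opts.SetResumeAfter(token)
    }

    stream, err := client.Database("main").Watch(ctx, mongo.Pipeline{}, opts)
    if err != nil {
        log.Fatal(err)
    }
    defer stream.Close(ctx)

    for stream.Next(ctx) {
        // ... replicate the event to the new Account DB (see the earlier sketch) ...
        if err := saveResumeToken(stream.ResumeToken()); err != nil {
            log.Printf("failed to persist resume token: %v", err)
        }
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}
</code></pre></div></div>
<p>The important constraint, as described above, is that the copy job has to finish while the oplog still holds the saved token; that is why the measured 30-minute runtime against a one-hour oplog window gave us a comfortable margin.</p>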
<h2 id="conclusion">Conclusion</h2>
<p>As we have seen throughout this article, database migration is a complex process that requires careful planning and execution. However, the benefits of separating concerns, improving performance, and ensuring continuity of service through zero-downtime migration make it worth the effort. With this migration, our organization will be better equipped to handle future demands, and we will continue to deliver the best possible service to our users.</p>
<p>Thank you for taking the time to read this post. See you later.</p>
<p><em>Originally published at <a href="https://tirasundara.hashnode.dev">https://tirasundara.hashnode.dev</a></em>.</p>
<h1>Automate handling a number of Pull Requests by Renovate in Terraform Monorepo</h1>
<p>Original article in Japanese: <em><a href="https://blog.studysapuri.jp/entry/2022/02/18/080000">Renovate の大量の Pull Request を処理する技術</a></em></p>
<p>In this post, I’d like to introduce techniques for handling a large number of pull requests from <a href="https://docs.renovatebot.com/">Renovate</a> in a Terraform Monorepo.</p>
<h2 id="background">Background</h2>
<p>We manage a Terraform Monorepo, and recently we’ve migrated its CI from AWS CodeBuild to GitHub Actions and <a href="https://github.com/suzuki-shunsuke/tfaction">tfaction</a>.</p>
<p><em><a href="https://devs.quipper.com/2022/02/25/terraform-github-actions.html">2022-02-25 Migrate Terraform CI from AWS CodeBuild to GitHub Actions</a></em></p>
<p>We have about 400 working directories (Terraform States), and the following tool versions are managed in each working directory.</p>
<ul>
<li>Terraform</li>
<li>Terraform Provider</li>
<li><a href="https://github.com/terraform-linters/tflint">tflint</a></li>
<li>tflint plugin</li>
<li><a href="https://github.com/aquasecurity/tfsec">tfsec</a></li>
<li>etc</li>
</ul>
<p>If a single package is used by multiple services in a Monorepo, Renovate updates all of them in a single pull request by default. We use <a href="https://docs.renovatebot.com/configuration-options/#additionalbranchprefix">additionalBranchPrefix</a> to separate pull requests per working directory.</p>
<p>e.g.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"additionalBranchPrefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{{packageFileDir}}-"</span><span class="p">,</span><span class="w">
</span><span class="nl">"commitMessageSuffix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"({{packageFileDir}})"</span><span class="p">,</span><span class="w">
</span><span class="nl">"matchManagers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"terraform"</span><span class="p">,</span><span class="w">
</span><span class="s2">"regex"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This way, when a tool is updated, nearly 400 pull requests need to be merged.
Reviewing such a large number of pull requests one by one is difficult for humans and not worth the effort.
Therefore, if CI succeeds and the result of <code class="language-plaintext highlighter-rouge">terraform plan</code> shows no change, it is desirable to merge automatically.
If the number of pull requests that can be merged per day is too small, we wouldn’t be able to fully process the pull requests and tools wouldn’t be updated properly.</p>
<h2 id="solution">Solution</h2>
<p>To handle a large number of pull requests from Renovate automatically, we took the following actions.</p>
<ol>
<li>Enable <a href="https://docs.renovatebot.com/configuration-options/#automerge">automerge</a></li>
<li>Enable <a href="https://docs.renovatebot.com/configuration-options/#platformautomerge">platformAutomerge</a></li>
<li>Set <a href="https://docs.renovatebot.com/configuration-options/#prhourlylimit">prHourlyLimit</a> to 0</li>
<li>Set <a href="https://docs.renovatebot.com/configuration-options/#prconcurrentlimit">prConcurrentLimit</a> to 5</li>
<li>Limit branchConcurrentLimit too</li>
<li>Update the feature branch and enable automerge automatically when the automerge is disabled due to the update of base branch</li>
<li>Close the pull request and delete the feature branch immediately when CI fails</li>
<li>Skip terraform plan and apply for updates other than Terraform and Terraform Provider</li>
<li>Install not only <a href="https://github.com/apps/renovate-approve">Renovate Approve</a> but also <a href="https://github.com/apps/renovate-approve-2">Renovate Approve 2</a> to prevent missed approvals</li>
<li>Set <a href="https://docs.renovatebot.com/configuration-options/#prpriority">prPriority</a> to prevent some tools from blocking other tools’ update</li>
<li>Replace <code class="language-plaintext highlighter-rouge">GITHUB_TOKEN</code> with a GitHub App token to prevent API rate limiting</li>
</ol>
<h3 id="1-enable-automerge">1. Enable automerge</h3>
<p>If you enable automerge, Renovate will merge pull requests automatically.
If one approval is required to merge pull requests, you can use <a href="https://github.com/apps/renovate-approve">Renovate Approve</a>.</p>
<p>However, automerge alone is known to take a rather long time to merge pull requests – it can take several hours. To address this, you can enable platformAutomerge.</p>
<h3 id="2-enable-platformautomerge">2. Enable platformAutomerge</h3>
<p>When platformAutomerge is enabled, pull requests are merged as soon as the conditions are met, using GitHub’s auto-merge feature.</p>
<h4 id="notes-on-github-automerge">Notes on GitHub Automerge</h4>
<p>Please use GitHub Automerge carefully; otherwise your pull requests may be merged even if CI fails.</p>
<ul>
<li><a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-auto-merge-for-pull-requests-in-your-repository">You have to enable Allow auto-merge in the repository setting</a></li>
<li>The base branch must be protected by <a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/managing-a-branch-protection-rule">Branch Protection Rule</a>
<ul>
<li>You must select at least one status check in <code class="language-plaintext highlighter-rouge">Status checks that are required</code>, otherwise Automerge cannot be enabled</li>
</ul>
</li>
</ul>
<p>Be aware that a pull request will be merged even if checks other than the ones you selected in <code class="language-plaintext highlighter-rouge">Status checks that are required</code> fail.
It will also be merged if a GitHub Actions job is skipped by <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idif">if</a>.</p>
<p><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/build-matrix">We run GitHub Actions’ multiple jobs in parallel by build matrix</a>, but it is difficult to add those jobs to <code class="language-plaintext highlighter-rouge">Status checks that are required</code> because executed jobs are changed dynamically.
So <a href="https://github.com/suzuki-shunsuke/tfaction-example/blob/c3dff91fbcd7df77171c13878e3382cf001c8232/.github/workflows/test.yaml#L163-L170">we add a job which depends on the build matrix, and add it to <code class="language-plaintext highlighter-rouge">Status checks that are required</code></a>.</p>
<p>There is still a problem that the pull request would be merged even if other workflows fail, but we tolerate this because it rarely happens and we can fix it when it does.</p>
<h3 id="3-set-prhourlylimit-to-0">3. Set prHourlyLimit to 0</h3>
<p>Renovate has several limits that restrict the creation of pull requests.
Note that even if a limit is unlimited by default, it may be restricted by the preset <a href="https://docs.renovatebot.com/presets-config/#configbase">config:base</a>.</p>
<table>
<thead>
<tr>
<th>config</th>
<th>default</th>
<th><code class="language-plaintext highlighter-rouge">config:base</code></th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://docs.renovatebot.com/configuration-options/#prhourlylimit">prHourlyLimit</a></td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td><a href="https://docs.renovatebot.com/configuration-options/#prconcurrentlimit">prConcurrentLimit</a></td>
<td>0</td>
<td>10</td>
</tr>
<tr>
<td><a href="https://docs.renovatebot.com/configuration-options/#branchconcurrentlimit">branchConcurrentLimit</a></td>
<td><code class="language-plaintext highlighter-rouge">prConcurrentLimit</code></td>
<td> </td>
</tr>
</tbody>
</table>
<p>prHourlyLimit is limited to 2 by <code class="language-plaintext highlighter-rouge">config:base</code>, which means that only 2 pull requests will be created per hour.
So, explicitly set it to 0 so that an unlimited number of pull requests can be created.</p>
<h3 id="4-set-prconcurrentlimit-to-5">4. Set prConcurrentLimit to 5</h3>
<p>Renovate tries to create as many pull requests as possible within the above limits.
If Terraform CI runs terraform plan and apply for many of them at the same time, CI would probably fail due to API rate limiting.
Also, GitHub Automerge may be automatically disabled when the base branch is updated.</p>
<p>For these reasons, we set prConcurrentLimit to 5.</p>
<h3 id="5-limit-branchconcurrentlimit-too">5. Limit branchConcurrentLimit too</h3>
<p><a href="https://docs.renovatebot.com/configuration-options/#branchconcurrentlimit">branchConcurrentLimit</a> is a limit based on the number of branches.
I thought we didn’t have to limit pull requests by the number of branches, so I set it to 0 at first, but that was a mistake.
It seems that branches are created even if no pull requests are created, so more than 1000 branches were created unnecessarily.
Since branchConcurrentLimit is the same as prConcurrentLimit by default, we explicitly set only prConcurrentLimit and not branchConcurrentLimit.</p>
<h3 id="6-update-the-feature-branch-and-enable-automerge-automatically-when-the-automerge-is-disabled-due-to-the-update-of-base-branch">6. Update the feature branch and enable automerge automatically when the automerge is disabled due to the update of base branch</h3>
<p>GitHub Automerge may be automatically disabled when the base branch is updated.</p>
<p><img src="https://user-images.githubusercontent.com/13323303/150432569-0b1f3f01-d09d-4b26-842e-3d0cccb24f33.png" alt="image" /></p>
<p>To merge these pull requests automatically,
you can update feature branches and re-enable automerge automatically by GitHub Actions.</p>
<p><img src="https://user-images.githubusercontent.com/13323303/153967962-3f6c456c-b307-47f5-8125-47368fa252c2.png" alt="image" /></p>
<p><a href="https://github.com/suzuki-shunsuke/reenable-automerge-action">https://github.com/suzuki-shunsuke/reenable-automerge-action</a></p>
<h3 id="7-close-the-pull-request-and-delete-the-feature-branch-immediately-when-ci-fails">7. Close the pull request and delete the feature branch immediately when CI fails</h3>
<p>Since we have set prConcurrentLimit and branchConcurrentLimit, leaving Renovate pull requests open will limit the number of new pull requests that can be created.
Therefore, we decided to close pull requests that could not be automerged and delete feature branches automatically.</p>
<p><a href="https://github.com/suzuki-shunsuke/renovate-autoclose-action">https://github.com/suzuki-shunsuke/renovate-autoclose-action</a></p>
<p><a href="https://docs.github.com/en/search-github/searching-on-github/searching-issues-and-pull-requests">You can search closed pull requests with simple query</a> like <code class="language-plaintext highlighter-rouge">is:pr is:unmerged author:app/renovate</code>, and can also be found in Renovate’s <a href="https://docs.renovatebot.com/key-concepts/dashboard/">Dependency Dashboard</a>.</p>
<h3 id="8-skip-terraform-plan-and-apply-for-updates-other-than-terraform-and-terraform-provider">8. Skip terraform plan and apply for updates other than Terraform and Terraform Provider</h3>
<p><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/renovate">With tfaction, CI would fail if the result of terraform plan of pull request by Renovate is not No Change to prevent dangerous changes from being applied by terraform apply</a>.
However, sometimes tools such as tfsec and tflint couldn’t be updated due to this failure.</p>
<p>tfsec and tflint are not related to terraform plan and apply, so you don’t have to run terraform plan and apply to update them.</p>
<p><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/support-skipping-terraform-renovate-pr">Since tfaction v0.4.9, tfaction supports skipping terraform plan and apply in Renovate pull requests</a>,
so we’re using the feature.</p>
<p>This also speeds up CI and prevents API rate limiting.</p>
<h3 id="9-install-not-only-renovate-approve-but-also-renovate-approve-2-to-prevent-approve-omissions">9. Install not only Renovate Approve but also Renovate Approve 2 to prevent approve omissions</h3>
<p>We don’t know the reason, but sometimes <a href="https://github.com/apps/renovate-approve">Renovate Approve</a> does not approve pull requests as expected.
So we also installed <a href="https://github.com/apps/renovate-approve-2">Renovate Approve 2</a> to prevent missed approvals.
This app is meant to be used when two approvals are needed, but we think it can also be used to prevent missed approvals.
So far, we haven’t had any missed approvals since we installed Renovate Approve 2.</p>
<h3 id="10-adjust-prpriority-properly">10. Adjust prPriority properly</h3>
<p>Frequently updated tools like Terraform and the AWS Provider may block other tools’ updates for a long time.
If you want to prioritize other tools’ updates, you can adjust <a href="https://docs.renovatebot.com/configuration-options/#prpriority">prPriority</a>.</p>
<h3 id="11-replace-github_token-to-github-apps-token-to-prevent-api-rate-limiting">11. Replace GITHUB_TOKEN to GitHub App’s token to prevent API rate limiting</h3>
<p>tfaction takes a GitHub access token as input.
By default <code class="language-plaintext highlighter-rouge">secrets.GITHUB_TOKEN</code> is used, but as the number of builds per hour increases, API rate limiting may occur.
So we switched to a GitHub App token, which has a less strict rate limit than <code class="language-plaintext highlighter-rouge">secrets.GITHUB_TOKEN</code>.
For details about the rate limits, please see the documentation:</p>
<ul>
<li><a href="https://docs.github.com/en/rest/overview/resources-in-the-rest-api#requests-from-github-actions">https://docs.github.com/en/rest/overview/resources-in-the-rest-api#requests-from-github-actions</a></li>
<li><a href="https://docs.github.com/en/developers/apps/building-github-apps/rate-limits-for-github-apps">https://docs.github.com/en/developers/apps/building-github-apps/rate-limits-for-github-apps</a></li>
</ul>
<p>To switch, you need to modify GitHub App permissions (<code class="language-plaintext highlighter-rouge">issues: read</code> is required).
Furthermore, you also need to switch GitHub Access Token for <a href="https://github.com/suzuki-shunsuke/tfaction-example/blob/c3dff91fbcd7df77171c13878e3382cf001c8232/.github/workflows/hide_comment.yaml#L19-L21">github-comment hide</a>, because <a href="https://github.com/suzuki-shunsuke/github-comment#hide">github-comment hide</a> only hides comments from the same user.</p>
<h2 id="conclusion">Conclusion</h2>
<p>As a result of the above actions, we are now able to create and merge about 500 pull requests a day into a single repository.
This number still has room for improvement (we think it could be up to 700 or so), but it is still sufficient for the current situation.
We used to check and respond to open pull requests from time to time, but by automating tasks as much as possible, we only have to deal with those that really need to be dealt with by humans, and this has reduced our workload.</p>
<h1>Vision, Mission and Values to make SRE team more sustainable</h1>
<p>My name is <a href="https://github.com/yuya-takeyama">@yuya-takeyama</a> and I am the Engineering Manager in the Global SRE Team.</p>
<p>Previously, our company had only one SRE team and I was the manager of that team, but we split the SRE team between Japan and Global because we were developing different products in Japan and other countries.
I am now in charge of launching the Global SRE Team.</p>
<p>This article is about what I did with my pre-split SRE team.
At that time, we defined our Vision, Mission, and Values with our team.</p>
<p>Quipper has a company Vision, Mission, and Identities.</p>
<ul>
<li>Vision: Distributors of Wisdom</li>
<li>Mission: Bringing the Best Education to Every Corner of the World</li>
<li>Identities: User-first, Diversity, Ownership, Fact-based, Growth</li>
</ul>
<p>Although these have been established for more than a few years, they are still as important to Quipper employees as ever.</p>
<p>However, the day-to-day work of SREs does not directly contribute to teaching and learning.
Of course, we do support them in ways they cannot see.</p>
<p>Therefore, we decided to establish a Vision as a future that is more intuitive to our team, Mission as what we should do to achieve it, and Values as the values that are important in our daily activities.
The current team, after the team split, is still working under this Vision, Mission, and Values.</p>
<p>The following is a quick introduction.</p>
<h2 id="vision-mission-and-values-of-the-sre-team">Vision, Mission, and Values of the SRE Team</h2>
<h3 id="vision-realize-a-development-organization-that-can-continue-to-create-the-best-learning-products">Vision: Realize a development organization that can continue to create the best learning products</h3>
<p>Vision is a future that is not there yet, but should be aimed for and created.</p>
<p>In this context, we consider the “development organization” to be the direct customer for the team.
The easiest way to explain the development organization is that it is the people who belong to the Product Development Division which makes Quipper products.
Designers, Developers, Product Managers, QA Engineers and SREs are part of this division.</p>
<p>In a broader sense, however, product development involves a diverse range of people on a daily basis.
In our products, the people who create content (learning materials) also play an important role in the product.</p>
<p>In order to execute our mission without falling into local optimization, it is important to understand that a wide variety of people are involved in product development in many different ways.</p>
<h3 id="mission-create-a-platform-and-culture-for-self-contained-teams-to-continue-to-deliver-product-quickly-and-safely">Mission: Create a platform and culture for self-contained teams to continue to deliver product quickly and safely.</h3>
<p>Mission is what we do every day to realize our Vision.</p>
<p>A particularly important keyword in this context is “self-contained team”.
This has been a theme I have been talking about ever since I became an Engineering Manager, even before we defined this Vision, Mission, and Values.</p>
<p>In a nutshell, the relationship between the development team and the SRE team should not be one of “ask” and “receive”.</p>
<p>For example, if a new database is needed for the development of a new service, and the development team requests it, and the SRE team creates it and hands it over to them, what kind of problems will there be?
The lead time for infrastructure provisioning becomes long because of the waiting time after the request is made. And because it is difficult for the development team to control, it becomes an uncertainty in the development schedule.
In addition, such a structure makes it difficult to motivate the development team to think about the optimal database and architecture to use, which in turn will affect the quality of the product.</p>
<p>To prevent this from happening, we provide a <a href="https://devs.quipper.com/2022/02/25/terraform-github-actions.html">Terraform Platform for self-contained teams</a>, which allows development team members to manage any kind of cloud resources – like databases, cache servers, and message queues – by themselves.</p>
<p>And for applications, we similarly provide a Kubernetes Platform for self-contained teams that allows developers to build new services by themselves. They can continuously deploy services without the help of the SRE team.</p>
<p>Platforms that include tools such as CI/CD are visible in the form of source code, but to enable development teams to work as self-contained teams, it is not enough to have the tools alone. It is also necessary to understand and practice the methodology for developing as a self-contained team.</p>
<p>At Quipper, all development teams have SLOs, and all teams are able to regularly monitor and take action when there are problems. In addition, when problems such as failures occur, each team reviews the situation through postmortem.</p>
<p>Such measures may not require much effort for a one-time event. However, to make them sustainable, it is necessary to have a system and culture, not just a mentality.</p>
<p>We aim to realize our vision by having both a platform and a culture, and by continuously evolving them.</p>
<h3 id="values">Values</h3>
<p>We have defined four Values.
There are five Quipper Identities, but we kept our Values to four because we are conscious of the number of chunks that can fit in short-term memory.</p>
<ul>
<li>Fail smart
<ul>
<li>Do not blame failure, but use it as a learning opportunity. Also, control the scope of impact and incorporate failure into the process so that the greatest return can be obtained from the least risk.</li>
<li>Failure is to be avoided if possible, but in complex systems, failure cannot be reduced to zero. It is important to face failures properly with Postmortem and other tools, rather than treating them as absolute evils.</li>
<li>It is also important to actively utilize the remaining 0.1% of the 99.9% SLO through methodologies like Canary Release.</li>
</ul>
</li>
<li>Learning
<ul>
<li>Continue to see everything as an opportunity to learn and make necessary changes in order to discover and solve unknown problems.</li>
<li>In Peter Senge’s <a href="https://www.amazon.com/dp/0385517254">The Fifth Discipline: The Art & Practice of The Learning Organization</a> and in <a href="https://www.amazon.com/dp/1942788339">Accelerate: The Science of Lean Software and DevOps</a>, the importance of learning and improvement as an organization is emphasized.</li>
<li>“What we can do now” is important, but “what we will be able to do in the future” is even more important. It is necessary for the organization to continue to overcome the challenges while continuously updating the issue setting.</li>
</ul>
</li>
<li>Borderless
<ul>
<li>Communicate and collaborate across organizational boundaries to achieve greater results.</li>
<li>Individual ability is important, but a major job that has an impact on the product cannot be accomplished alone.</li>
<li>Especially since we are a functional organization, we cannot achieve results without actively working together to overcome the borders between ourselves and the development team.</li>
</ul>
</li>
<li>Metrics-driven
<ul>
<li>Measure all issues and things, see problems not as dots but as lines, and aim for flexible and automatic solutions.</li>
<li>Even if it is not difficult to solve each thing one by one, it is necessary to monitor and detect by indicators to avoid it continuously or in advance.</li>
<li>To solve problems systematically, it is necessary to model problems as indicators and control them or deal with their side effects.</li>
</ul>
</li>
</ul>
<p>These are the ways of being that we must value and build on in each of our actions as we carry out our Mission.</p>
<p>All of the members on our team are highly capable as individuals, but we feel that there are an increasing number of problems that cannot be solved by that alone.
In order to overcome them, we need to improve the quality of our problem-solving and try to solve them at a higher level, and to do so, I believe we need to make these our action guidelines.</p>
<h2 id="how-we-defined-the-vision-mission-and-values">How we defined the Vision, Mission, and Values?</h2>
<p>Although I was the one who proposed the idea of defining the Vision, Mission, and Values, the process was done by all team members at that time.</p>
<p>We would have liked to get together and discuss the Vision, Mission, and Values around a whiteboard, but due to COVID-19, we had to discuss them remotely.</p>
<p>After explaining why we were doing this, we presented our ideas of Vision, Mission, and Values to each other, and then we explored the values within each of them in depth, breaking them down into their elements and reconstructing them.</p>
<p>The reasons for doing this were explained as follows.</p>
<ul>
<li>To enable teams to work and make decisions in the same direction.</li>
<li>As a tool for matching in recruitment</li>
<li>To make the team more attractive to work with</li>
</ul>
<p>We are now explaining our Vision, Mission, and Values in our hiring interviews and hope to attract more candidates who understand our ideas.</p>
<h1>Migrate Terraform CI from AWS CodeBuild to GitHub Actions</h1>
<p><em>Author: <a href="https://github.com/suzuki-shunsuke">@suzuki-shunsuke</a>, SRE in Quipper</em></p>
<p>Original article in Japanese: <em><a href="https://blog.studysapuri.jp/entry/2022/02/04/080000">Terraform の CI を AWS CodeBuild から GitHub Actions + tfaction に移行しました</a></em></p>
<p>In this post, I’d like to talk about how we migrated our Terraform CI from AWS CodeBuild to GitHub Actions + tfaction.</p>
<h2 id="terraform-workflow-so-far-aws-codebuild">Terraform Workflow so far (AWS CodeBuild)</h2>
<p>Originally, we ran CI on AWS CodeBuild.
Before that we used CircleCI, but we migrated to AWS CodeBuild.</p>
<p>There are two main reasons why we migrated to AWS CodeBuild.</p>
<ul>
<li>Security
<ul>
<li>You can manage AWS resources without persistent Access Keys</li>
<li>Google Cloud Platform (GCP) can also be managed without Service Account Key by <a href="https://cloud.google.com/iam/docs/workload-identity-federation">Workload Identity Federation</a></li>
</ul>
</li>
<li>Dynamic workflow
<ul>
<li>In Monorepo, we would like to run CI only in the working directory where the code was changed by pull request</li>
<li>This can be achieved by generating a <a href="https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html">buildspec</a> dynamically during build and executing <a href="https://docs.aws.amazon.com/codebuild/latest/userguide/batch-build.html">Batch Build</a> with AWS CLI</li>
<li>CircleCI now supports dynamic workflow, but at the time it did not</li>
</ul>
</li>
</ul>
<p>The first reason in particular was a major strength of AWS CodeBuild.</p>
<p>Another advantage of AWS CodeBuild is that it can be run in AWS VPC.
We manage <a href="https://www.mongodb.com/atlas">MongoDB Atlas</a> by Terraform.
Atlas supports restricting API usage by source IP address; without this restriction, a leaked API key would be very risky.
So, we only allow access from the Elastic IP address of a specific AWS VPC NAT Gateway.</p>
<h2 id="oidc-support-for-github-actions">OIDC support for GitHub Actions</h2>
<p>However, the situation has changed dramatically since GitHub Actions started supporting OIDC to access AWS and GCP without a persistent access key.</p>
<p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect">https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect</a></p>
<p>You can also run GitHub Actions in VPC by <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners">GitHub Actions’ Self-hosted Runner</a>.
Since we already run Self-hosted Runners, we thought it would be relatively easy to run Terraform with them as well.</p>
<p>As the strengths of AWS CodeBuild became available in GitHub Actions, the momentum to migrate to GitHub Actions grew.</p>
<h2 id="reasons-for-migrating-to-github-actions">Reasons for migrating to GitHub Actions</h2>
<p>We decided to migrate from AWS CodeBuild to GitHub Actions for the following reasons.</p>
<ul>
<li>No more need to sign in to AWS to see CI logs and retry CI</li>
<li>GitHub Actions’ build matrix allows for a more natural dynamic workflow</li>
<li>Leverage Action ecosystem</li>
</ul>
<h3 id="no-more-need-to-sign-in-to-aws-to-see-ci-logs-and-retry-ci">No more need to sign in to AWS to see CI logs and retry CI</h3>
<p>It’s bothersome to sign in to AWS just to see CI logs and retry CI.
With GitHub Actions, you don’t have to sign in to AWS for either.</p>
<h3 id="github-actions-build-matrix-allows-for-a-more-natural-dynamic-workflow">GitHub Actions’ build matrix allows for a more natural dynamic workflow</h3>
<p>We achieved a dynamic workflow in AWS CodeBuild by generating a buildspec, uploading it to AWS S3, and running a <a href="https://docs.aws.amazon.com/codebuild/latest/userguide/batch-build.html">Batch Build</a> with the AWS CLI.
So the build is executed in two stages, and CI takes a bit of time.
Batch Build itself also takes some time to start and finish.</p>
<p><a href="https://docs.github.com/en/actions/using-jobs/using-a-build-matrix-for-your-jobs">GitHub Actions’ build matrix</a> allows for a more natural dynamic workflow.
There is no need to dynamically generate a buildspec and upload to S3.
It also makes CI faster.</p>
<h3 id="leverage-of-action-ecosystem">Leverage of Action ecosystem</h3>
<p>You can leverage GitHub Actions’ Action ecosystem.
By replacing existing shell scripts with Actions, you can reduce the number of maintenance targets and improve maintainability.</p>
<h2 id="adopt-tfaction">Adopt tfaction</h2>
<p>We have adopted tfaction, which is GitHub Actions collection for Opinionated Terraform Workflow.</p>
<p><a href="https://github.com/suzuki-shunsuke/tfaction">https://github.com/suzuki-shunsuke/tfaction</a></p>
<p>tfaction supports almost all of the features that we had originally implemented ourselves with shell scripts,
so we expected to be able to eliminate our shell scripts entirely.</p>
<h2 id="benefit-of-migration-to-github-actions-and-tfaction">Benefit of migration to GitHub Actions and tfaction</h2>
<p>Migrating to GitHub Actions with tfaction has improved things in the following ways.</p>
<ul>
<li>Least Privilege</li>
<li>No more need to sign in to AWS to view CI logs or retry</li>
<li>Faster CI</li>
<li>Elimination of shell scripts</li>
<li>tfaction provides useful features such as <a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/follow-up-pr">automatic generation of Follow up Pull Requests</a>, <a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/auto-update-related-prs">automatic update of Pull Requests</a>, <a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/scaffold-working-dir">scaffolding working directory with GitHub Actions</a>, and so on</li>
</ul>
<h3 id="least-privilege">Least Privilege</h3>
<p>One of the main problems we had was that all Terraform builds used an IAM Role with very strong privileges.
GitHub Actions’ OIDC support allows you to use different IAM Roles per branch,
so an IAM Role with strong privileges is used only on the default branch, which executes terraform apply,
while an IAM Role with almost read-only privileges is used for pull requests.
Furthermore, IAM Roles with very limited permissions can be used in builds for tfmigrate and for non-AWS Terraform Providers.</p>
<p>tfaction provides a Terraform Module to create IAM Roles with minimal privileges.</p>
<p><a href="https://github.com/suzuki-shunsuke/terraform-aws-tfaction">https://github.com/suzuki-shunsuke/terraform-aws-tfaction</a></p>
<p>tfaction also allows you to configure IAM Roles for each working directory and GitHub Actions job (terraform plan, terraform apply, tfmigrate plan, tfmigrate apply),
so you can achieve least privilege easily.</p>
<h3 id="tfaction-specific-features">tfaction-specific features</h3>
<p>tfaction provides various useful features. Please see the official document.</p>
<ul>
<li><a href="https://speakerdeck.com/szksh/tfaction-build-terraform-workflow-with-github-actions">https://speakerdeck.com/szksh/tfaction-build-terraform-workflow-with-github-actions</a></li>
<li><a href="https://suzuki-shunsuke.github.io/tfaction/docs/feature/build-matrix">https://suzuki-shunsuke.github.io/tfaction/docs/feature/build-matrix</a></li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, I introduced the migration of Terraform Monorepo Workflow from AWS CodeBuild to GitHub Actions and tfaction.
This migration has improved the Developer Experience and achieved least privilege.</p>
<h1>Scheduled-Scaling with Kubernetes HPA External Metrics</h1>
<p>Original article in Japanese: <a href="https://quipper.hatenablog.com/entry/2020/11/30/scheduled-scaling-with-hpa">Kubernetes HPA External Metrics を利用した Scheduled-Scaling</a></p>
<p>Hi, I’m @chaspy from Site Reliability Engineering Team.</p>
<p>At Quipper, we use <a href="https://quipper.hatenablog.com/entry/2020/04/10/hpa">Kubernetes Horizontal Pod Autoscaler</a> (HPA) to achieve pod auto-scaling.</p>
<p>The HPA can handle most ups and downs in traffic. However, in general, it cannot deal with a spike in traffic caused by an unexpectedly high number of users accessing the platform at once. When an unexpected increase in CPU utilization happens, it still takes about 5 minutes to scale out the nodes, even if the HPA immediately increases the Desired Replicas.</p>
<p>Compared to the scaling mechanism based on the CPU utilization, Scheduled-Scaling can be defined as a method to schedule a fixed number of nodes/pods to be scaled at a specific time in the future. The simplest way to perform Scheduled-Scaling is to just change the <code class="language-plaintext highlighter-rouge">minReplicas</code> of the HPA at a specified time. This method may be efficient if the change is only made once or around the same time every day. However, if the spikes are expected at different times, it may be difficult to change the <code class="language-plaintext highlighter-rouge">minReplicas</code> every time.</p>
<p>In this article, I will explain a case study using <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-metrics-not-related-to-kubernetes-objects">Kubernetes HPA External Metrics</a> to perform Scheduled-Scaling for traffic spike during regularly scheduled exams in the Philippines.</p>
<h2 id="background">Background</h2>
<p>In the Philippines, Quipper is already being used in schools. Teachers and students have been using it for scheduled exams, e.g., term-end exams. The teachers register the questions for the examinations in the system before the exam.</p>
<p>One day, while one such scheduled exam was about to start, some students could not log in to the portal at all. Schools and the Customer Success team were really confused because they suddenly started receiving complaints about students not being able to take the exam. After some investigation, we found that this was due to a sudden traffic spike.</p>
<p>As a temporary solution, we first avoided service downtime by setting the HPA minReplicas to a high enough value during daytime hours. However, this resulted in redundant server costs because we didn’t scale down the replicas at night or at times when there was no traffic spike.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416293-a13c1d80-dd91-11eb-9cc3-2740f61d94b6.png" alt="image" /></p>
<p>Description: The number of pods. It scales out up to 400 uniformly from 6:30 am to 7:30 pm.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416327-b44eed80-dd91-11eb-8bc1-ccb73b63a4a3.png" alt="image" /></p>
<p>Description: The number of Nodes also increases in proportion to the number of Pods.</p>
<p>To solve this problem, @naotori, the Global Division Director, asked me if it would be possible to scale the servers in advance, based on the starting time of the exams and the expected number of users. Then, @bdesmero, Global Product Development VPoE, wrote a batch script to get that data from our database. When we compared this data with the actual server metrics, we found that the server load correlates with the starting time of the exams and the expected number of users. We also determined from the metrics the maximum number of users our current architecture could handle.</p>
<p>Therefore, to optimize the number of pods/nodes that were being scaled out excessively, we decided to use the data obtained by @bdesmero as external metrics for the HPA, together with CPU-based auto-scaling, to achieve Scheduled-Scaling safely.</p>
<h2 id="mechanism-hpa-external-metrics-and-datadog-custom-metrics-server">Mechanism: HPA External Metrics and Datadog Custom Metrics Server</h2>
<p>The HPA is widely known for auto-scaling based on CPU, but autoscaling based on External Metrics has been available since API version <a href="https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#horizontalpodautoscaler-v2beta1-autoscaling">autoscaling/v2beta1</a>. Since Quipper uses Datadog, I decided to use Datadog metrics as External Metrics.</p>
<p>So how do you autoscale using a Datadog metric? The HPA Controller is designed to get metrics from the Kubernetes metrics APIs (<code class="language-plaintext highlighter-rouge">metrics.k8s.io</code>, <code class="language-plaintext highlighter-rouge">custom.metrics.k8s.io</code>, <code class="language-plaintext highlighter-rouge">external.metrics.k8s.io</code>).</p>
<p>When you set it up in accordance with <a href="https://docs.datadoghq.com/agent/cluster_agent/external_metrics/">the documentation of the Datadog Custom Metrics Server</a>, an APIService is added. <a href="https://kubernetes.io/docs/tasks/extend-kubernetes/setup-extension-api-server/">By registering the APIService, it is registered in the Aggregation Layer of the Kubernetes API</a>, and the HPA can retrieve metrics from Datadog’s metrics server via the Kubernetes API. Here is a diagram.</p>
<p><a href="https://mermaid-js.github.io/mermaid-live-editor/edit/##eyJjb2RlIjoiZ3JhcGggTFJcbiAgQVtIUEFdIC0tPnxHZXQgbWV0cmljc3wgQltBUEkgc2VydmVyXVxuICBCIC0tPiBDW0FQSVNlcnZpY2VdXG4gIEMgLS0-IERbU2VydmljZSBkYXRhZG9nLWN1c3RvbS1tZXRyaWNzLXNlcnZlcl1cbiAgRCAtLT4gRVtkYXRhZG9nLWNsdXN0ZXItYWdlbnRdXG4gIEUgLS0-IEZbRGF0YWRvZ10iLCJtZXJtYWlkIjoie1xuICBcInRoZW1lXCI6IFwiZGVmYXVsdFwiLFxuICBcInRoZW1lVmFyaWFibGVzXCI6IHtcbiAgICBcImJhY2tncm91bmRcIjogXCJ3aGl0ZVwiLFxuICAgIFwicHJpbWFyeUNvbG9yXCI6IFwiI0VDRUNGRlwiLFxuICAgIFwic2Vjb25kYXJ5Q29sb3JcIjogXCIjZmZmZmRlXCIsXG4gICAgXCJ0ZXJ0aWFyeUNvbG9yXCI6IFwiaHNsKDgwLCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSlcIixcbiAgICBcInByaW1hcnlCb3JkZXJDb2xvclwiOiBcImhzbCgyNDAsIDYwJSwgODYuMjc0NTA5ODAzOSUpXCIsXG4gICAgXCJzZWNvbmRhcnlCb3JkZXJDb2xvclwiOiBcImhzbCg2MCwgNjAlLCA4My41Mjk0MTE3NjQ3JSlcIixcbiAgICBcInRlcnRpYXJ5Qm9yZGVyQ29sb3JcIjogXCJoc2woODAsIDYwJSwgODYuMjc0NTA5ODAzOSUpXCIsXG4gICAgXCJwcmltYXJ5VGV4dENvbG9yXCI6IFwiIzEzMTMwMFwiLFxuICAgIFwic2Vjb25kYXJ5VGV4dENvbG9yXCI6IFwiIzAwMDAyMVwiLFxuICAgIFwidGVydGlhcnlUZXh0Q29sb3JcIjogXCJyZ2IoOS41MDAwMDAwMDAxLCA5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSlcIixcbiAgICBcImxpbmVDb2xvclwiOiBcIiMzMzMzMzNcIixcbiAgICBcInRleHRDb2xvclwiOiBcIiMzMzNcIixcbiAgICBcIm1haW5Ca2dcIjogXCIjRUNFQ0ZGXCIsXG4gICAgXCJzZWNvbmRCa2dcIjogXCIjZmZmZmRlXCIsXG4gICAgXCJib3JkZXIxXCI6IFwiIzkzNzBEQlwiLFxuICAgIFwiYm9yZGVyMlwiOiBcIiNhYWFhMzNcIixcbiAgICBcImFycm93aGVhZENvbG9yXCI6IFwiIzMzMzMzM1wiLFxuICAgIFwiZm9udEZhbWlseVwiOiBcIlxcXCJ0cmVidWNoZXQgbXNcXFwiLCB2ZXJkYW5hLCBhcmlhbFwiLFxuICAgIFwiZm9udFNpemVcIjogXCIxNnB4XCIsXG4gICAgXCJsYWJlbEJhY2tncm91bmRcIjogXCIjZThlOGU4XCIsXG4gICAgXCJub2RlQmtnXCI6IFwiI0VDRUNGRlwiLFxuICAgIFwibm9kZUJvcmRlclwiOiBcIiM5MzcwREJcIixcbiAgICBcImNsdXN0ZXJCa2dcIjogXCIjZmZmZmRlXCIsXG4gICAgXCJjbHVzdGVyQm9yZGVyXCI6IFwiI2FhYWEzM1wiLFxuICAgIFwiZGVmYXVsdExpbmtDb2xvclwiOiBcIiMzMzMzMzNcIixcbiAgICBcInRpdGxlQ29sb3JcIjogXCIjMzMzXCIsXG4gICAgXCJlZGdlTGFiZWxCYWNrZ3JvdW5kXCI6IFwiI2U4ZThlOFwiLFxuICAgIFwiYWN0b3JCb3JkZXJcIjogXCJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSlcIixcbiAgICBcImFjdG9yQmtnXCI6IFwiI0VDRUNGRlwiLFxuICAgIFwiYWN0b3JUZXh0Q29sb3JcIjogXCJibGFja1wiLFxuICAgIFwiYWN0b3JMaW5lQ29sb3JcIjogXCJncmV5XCIsXG4gICAgXCJzaWduYWxDb2xvclwiOiBcIiMzMzNcIixcbiAgICBcInNpZ25hbFRleHRDb2xvclwiOiBcIiMzMzNcIixcbiAgICBcImxhYmVsQm94QmtnQ29sb3JcIjogXCIjRUNFQ0ZGXCIsXG4gICAgXCJsYWJlbEJveEJvcmRlckNvbG9yXCI6IFwiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpXCIsXG4gICAgXCJsYWJlbFRleHRDb2xvclwiOiBcImJsYWNrXCIsXG4gICAgXCJsb29wVGV4dENvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcIm5vdGVCb3JkZXJDb2xvclwiOiBcIiNhYWFhMzNcIixcbiAgICBcIm5vdGVCa2dDb2xvclwiOiBcIiNmZmY1YWRcIixcbiAgICBcIm5vdGVUZXh0Q29sb3JcIjogXCJibGFja1wiLFxuICAgIFwiYWN0aXZhdGlvbkJvcmRlckNvbG9yXCI6IFwiIzY2NlwiLFxuICAgIFwiYWN0aXZhdGlvbkJrZ0NvbG9yXCI6IFwiI2Y0ZjRmNFwiLFxuICAgIFwic2VxdWVuY2VOdW1iZXJDb2xvclwiOiBcIndoaXRlXCIsXG4gICAgXCJzZWN0aW9uQmtnQ29sb3JcIjogXCJyZ2JhKDEwMiwgMTAyLCAyNTUsIDAuNDkpXCIsXG4gICAgXCJhbHRTZWN0aW9uQmtnQ29sb3JcIjogXCJ3aGl0ZVwiLFxuICAgIFwic2VjdGlvbkJrZ0NvbG9yMlwiOiBcIiNmZmY0MDBcIixcbiAgICBcInRhc2tCb3JkZXJDb2xvclwiOiBcIiM1MzRmYmNcIixcbiAgICBcInRhc2tCa2dDb2xvclwiOiBcIiM4YTkwZGRcIixcbiAgICBcInRhc2tUZXh0TGlnaHRDb2xvclwiOiBcIndoaXRlXCIsXG4gICAgXCJ0YXNrVGV4dENvbG9yXCI6IFwid2hpdGVcIixcbiAgICBcInRhc2tUZXh0RGFya0NvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcInRhc2tUZXh0T3V0c2lkZUNvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcInRhc2tUZXh0Q2xpY2thYmxlQ29sb3JcIjogXCIjMDAzMTYzXCIsXG4gICAgXCJhY3RpdmVUYXNrQm9yZGVyQ29sb3JcIjogXCIjNTM0ZmJjXCIsXG4gICAgXCJhY3RpdmVUYXNrQmtnQ29sb3JcIjogXCIjYmZjN2ZmXCIsXG4gICAgXCJncmlkQ29sb3JcIjogXCJsaWdodGdyZXlcIixcbiAgICBcImRvbmVUYXNrQmt
nQ29sb3JcIjogXCJsaWdodGdyZXlcIixcbiAgICBcImRvbmVUYXNrQm9yZGVyQ29sb3JcIjogXCJncmV5XCIsXG4gICAgXCJjcml0Qm9yZGVyQ29sb3JcIjogXCIjZmY4ODg4XCIsXG4gICAgXCJjcml0QmtnQ29sb3JcIjogXCJyZWRcIixcbiAgICBcInRvZGF5TGluZUNvbG9yXCI6IFwicmVkXCIsXG4gICAgXCJsYWJlbENvbG9yXCI6IFwiYmxhY2tcIixcbiAgICBcImVycm9yQmtnQ29sb3JcIjogXCIjNTUyMjIyXCIsXG4gICAgXCJlcnJvclRleHRDb2xvclwiOiBcIiM1NTIyMjJcIixcbiAgICBcImNsYXNzVGV4dFwiOiBcIiMxMzEzMDBcIixcbiAgICBcImZpbGxUeXBlMFwiOiBcIiNFQ0VDRkZcIixcbiAgICBcImZpbGxUeXBlMVwiOiBcIiNmZmZmZGVcIixcbiAgICBcImZpbGxUeXBlMlwiOiBcImhzbCgzMDQsIDEwMCUsIDk2LjI3NDUwOTgwMzklKVwiLFxuICAgIFwiZmlsbFR5cGUzXCI6IFwiaHNsKDEyNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpXCIsXG4gICAgXCJmaWxsVHlwZTRcIjogXCJoc2woMTc2LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSlcIixcbiAgICBcImZpbGxUeXBlNVwiOiBcImhzbCgtNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpXCIsXG4gICAgXCJmaWxsVHlwZTZcIjogXCJoc2woOCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpXCIsXG4gICAgXCJmaWxsVHlwZTdcIjogXCJoc2woMTg4LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSlcIlxuICB9XG59IiwidXBkYXRlRWRpdG9yIjpmYWxzZSwiYXV0b1N5bmMiOnRydWUsInVwZGF0ZURpYWdyYW0iOmZhbHNlfQ"><img src="https://mermaid.ink/img/eyJjb2RlIjoiZ3JhcGggTFJcbiAgQVtIUEFdIC0tPnxHZXQgbWV0cmljc3wgQltBUEkgc2VydmVyXVxuICBCIC0tPiBDW0FQSVNlcnZpY2VdXG4gIEMgLS0-IERbU2VydmljZSBkYXRhZG9nLWN1c3RvbS1tZXRyaWNzLXNlcnZlcl1cbiAgRCAtLT4gRVtkYXRhZG9nLWNsdXN0ZXItYWdlbnRdXG4gIEUgLS0-IEZbRGF0YWRvZ10iLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCIsInRoZW1lVmFyaWFibGVzIjp7ImJhY2tncm91bmQiOiJ3aGl0ZSIsInByaW1hcnlDb2xvciI6IiNFQ0VDRkYiLCJzZWNvbmRhcnlDb2xvciI6IiNmZmZmZGUiLCJ0ZXJ0aWFyeUNvbG9yIjoiaHNsKDgwLCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJwcmltYXJ5Qm9yZGVyQ29sb3IiOiJoc2woMjQwLCA2MCUsIDg2LjI3NDUwOTgwMzklKSIsInNlY29uZGFyeUJvcmRlckNvbG9yIjoiaHNsKDYwLCA2MCUsIDgzLjUyOTQxMTc2NDclKSIsInRlcnRpYXJ5Qm9yZGVyQ29sb3IiOiJoc2woODAsIDYwJSwgODYuMjc0NTA5ODAzOSUpIiwicHJpbWFyeVRleHRDb2xvciI6IiMxMzEzMDAiLCJzZWNvbmRhcnlUZXh0Q29sb3IiOiIjMDAwMDIxIiwidGVydGlhcnlUZXh0Q29sb3IiOiJyZ2IoOS41MDAwMDAwMDAxLCA5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSkiLCJsaW5lQ29sb3IiOiIjMzMzMzMzIiwidGV4dENvbG9yIjoiIzMzMyIsIm1haW5Ca2ciOiIjRUNFQ0ZGIiwic2Vjb25kQmtnIjoiI2ZmZmZkZSIsImJvcmRlcjEiOiIjOTM3MERCIiwiYm9yZGVyMiI6IiNhYWFhMzMiLCJhcnJvd2hlYWRDb2xvciI6IiMzMzMzMzMiLCJmb250RmFtaWx5IjoiXCJ0cmVidWNoZXQgbXNcIiwgdmVyZGFuYSwgYXJpYWwiLCJmb250U2l6ZSI6IjE2cHgiLCJsYWJlbEJhY2tncm91bmQiOiIjZThlOGU4Iiwibm9kZUJrZyI6IiNFQ0VDRkYiLCJub2RlQm9yZGVyIjoiIzkzNzBEQiIsImNsdXN0ZXJCa2ciOiIjZmZmZmRlIiwiY2x1c3RlckJvcmRlciI6IiNhYWFhMzMiLCJkZWZhdWx0TGlua0NvbG9yIjoiIzMzMzMzMyIsInRpdGxlQ29sb3IiOiIjMzMzIiwiZWRnZUxhYmVsQmFja2dyb3VuZCI6IiNlOGU4ZTgiLCJhY3RvckJvcmRlciI6ImhzbCgyNTkuNjI2MTY4MjI0MywgNTkuNzc2NTM2MzEyOCUsIDg3LjkwMTk2MDc4NDMlKSIsImFjdG9yQmtnIjoiI0VDRUNGRiIsImFjdG9yVGV4dENvbG9yIjoiYmxhY2siLCJhY3RvckxpbmVDb2xvciI6ImdyZXkiLCJzaWduYWxDb2xvciI6IiMzMzMiLCJzaWduYWxUZXh0Q29sb3IiOiIjMzMzIiwibGFiZWxCb3hCa2dDb2xvciI6IiNFQ0VDRkYiLCJsYWJlbEJveEJvcmRlckNvbG9yIjoiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpIiwibGFiZWxUZXh0Q29sb3IiOiJibGFjayIsImxvb3BUZXh0Q29sb3IiOiJibGFjayIsIm5vdGVCb3JkZXJDb2xvciI6IiNhYWFhMzMiLCJub3RlQmtnQ29sb3IiOiIjZmZmNWFkIiwibm90ZVRleHRDb2xvciI6ImJsYWNrIiwiYWN0aXZhdGlvbkJvcmRlckNvbG9yIjoiIzY2NiIsImFjdGl2YXRpb25Ca2dDb2xvciI6IiNmNGY0ZjQiLCJzZXF1ZW5jZU51bWJlckNvbG9yIjoid2hpdGUiLCJzZWN0aW9uQmtnQ29sb3IiOiJyZ2JhKDEwMiwgMTAyLCAyNTUsIDAuNDkpIiwiYWx0U2VjdGlvbkJrZ0NvbG9yIjoid2hpdGUiLCJzZWN0aW9uQmtnQ29sb3IyIjoiI2ZmZjQwMCIsInRhc2tCb3JkZXJDb2xvciI6IiM1MzRmYmMiLCJ0YXNrQmtnQ29sb3IiOiIjOGE5MGRkIiwidGFza1RleHRMaWdodENvbG9yIjoid2hpdGUiLCJ0YXNrVGV4dENvbG9yIjoid2hpdGUiLCJ0YXNrVGV4dERhcmtDb2xvciI6ImJsYWNrIiwidGFza1RleHRPdXRzaWRlQ29sb3IiOiJibGFjayIsInRhc2tUZXh0Q2xpY2thYmxlQ29sb3IiOiIj
MDAzMTYzIiwiYWN0aXZlVGFza0JvcmRlckNvbG9yIjoiIzUzNGZiYyIsImFjdGl2ZVRhc2tCa2dDb2xvciI6IiNiZmM3ZmYiLCJncmlkQ29sb3IiOiJsaWdodGdyZXkiLCJkb25lVGFza0JrZ0NvbG9yIjoibGlnaHRncmV5IiwiZG9uZVRhc2tCb3JkZXJDb2xvciI6ImdyZXkiLCJjcml0Qm9yZGVyQ29sb3IiOiIjZmY4ODg4IiwiY3JpdEJrZ0NvbG9yIjoicmVkIiwidG9kYXlMaW5lQ29sb3IiOiJyZWQiLCJsYWJlbENvbG9yIjoiYmxhY2siLCJlcnJvckJrZ0NvbG9yIjoiIzU1MjIyMiIsImVycm9yVGV4dENvbG9yIjoiIzU1MjIyMiIsImNsYXNzVGV4dCI6IiMxMzEzMDAiLCJmaWxsVHlwZTAiOiIjRUNFQ0ZGIiwiZmlsbFR5cGUxIjoiI2ZmZmZkZSIsImZpbGxUeXBlMiI6ImhzbCgzMDQsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsImZpbGxUeXBlMyI6ImhzbCgxMjQsIDEwMCUsIDkzLjUyOTQxMTc2NDclKSIsImZpbGxUeXBlNCI6ImhzbCgxNzYsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsImZpbGxUeXBlNSI6ImhzbCgtNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIiwiZmlsbFR5cGU2IjoiaHNsKDgsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsImZpbGxUeXBlNyI6ImhzbCgxODgsIDEwMCUsIDkzLjUyOTQxMTc2NDclKSJ9fSwidXBkYXRlRWRpdG9yIjp0cnVlLCJhdXRvU3luYyI6dHJ1ZSwidXBkYXRlRGlhZ3JhbSI6dHJ1ZX0" alt="" /></a></p>
<p>Furthermore, if you want to use <a href="https://docs.datadoghq.com/dashboards/querying/">Datadog’s metrics query</a>, register the <a href="https://github.com/DataDog/datadog-operator/blob/v0.3.1/pkg/apis/datadoghq/v1alpha1/datadogmetric_types.go">DatadogMetric CRD</a>.</p>
<p>First, <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/autoscaler_watcher.go#L222">the Datadog cluster-agent checks if the HPA spec.metrics field is external</a>, <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/autoscaler_watcher.go#L233">parses the metric name such as <code class="language-plaintext highlighter-rouge">datadogmetric@<namespace>:<name></code></a>, and then <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/autoscaler_watcher.go#L233">sets the HPA Reference field</a>.</p>
<p>The HPA then queries the external metrics server for the referenced metric, and the cluster-agent receives the request and <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/provider.go#L113">returns the value retrieved from Datadog</a>. As a side note, <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/datadogmetric_controller.go#L197">the Controller seems to keep the retrieved query result in a Local Store</a> and <a href="https://github.com/DataDog/datadog-agent/blob/dca-1.9.0/pkg/clusteragent/externalmetrics/datadogmetric_controller.go#L207">syncs it to the DatadogMetric resource</a> in the Reconcile Loop rather than querying Datadog each time.</p>
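<p>To make the naming convention concrete, here is a minimal Go sketch (my own illustration, not the Datadog cluster-agent code) of how an external metric name like <code class="language-plaintext highlighter-rouge">datadogmetric@production:timed-exam</code> splits into a namespace and a DatadogMetric name:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "strings"

// parseDatadogMetricName sketches the naming convention only:
// "datadogmetric@<namespace>:<name>" maps to (namespace, name).
func parseDatadogMetricName(metricName string) (namespace, name string, ok bool) {
	const prefix = "datadogmetric@"
	if !strings.HasPrefix(metricName, prefix) {
		return "", "", false
	}
	parts := strings.SplitN(strings.TrimPrefix(metricName, prefix), ":", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	// e.g. ("production", "timed-exam", true)
	return parts[0], parts[1], true
}
</code></pre></div></div>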
<h2 id="architecture">Architecture</h2>
<p>Next, I will explain the architecture of using the Datadog metrics server and HPA to achieve Scheduled-Scaling.</p>
<p><a href="https://mermaid-js.github.io/mermaid-live-editor/edit##eyJjb2RlIjoiZ3JhcGggVERcbiAgQVtzY2hlZHVsZXNfcmV0cmlldmVfdGltZWRfZXhhbWluYXRpb25zXVxuICBCW01vbmdvREJdXG4gIHN1YmdyYXBoIEt1YmVybmV0ZXNcbiAgR1tIUEFdXG4gIEhbYXBpIGRlcGxveW1lbnRdXG4gIEpbRGF0YWRvZyBDbHVzdGVyIEFnZW50XVxuICAgIHN1YmdyYXBoIHRpbWVkLWV4YW0tc2NoZWR1bGUtZXhwb3J0ZXIgbmFtZXNwYWNlICBcbiAgICAgIEVbdGltZWQtZXhhbS1zY2hlZHVsZS1leHBvcnRlcl1cbiAgICAgIElbYXBpLWV4YW0tZGF0YSBDb25maWdtYXBdXG4gICAgICBFIC0tPnxtb3VudHxJXG4gICAgZW5kXG4gIGVuZFxuICBGW0RhdGFkb2ddXG4gIHN1YmdyYXBoIEplbmtpbnNcbiAgICBBXG4gIGVuZFxuXG5cbiAgQSAtLT58UmV0cml2ZSBkYXRhfEJcbiAgQSAtLT58UHV0IDIwMjAtbW0tZGQudHN2fElcbiAgRiAtLT58R2V0IGhvc3Q6ODA4MC9tZXRyaWNzfEVcbiAgRyAtLT58Q2hlY2sgbWV0cmljc3xKXG4gIEogLS0-fENoZWNrIG1ldHJpY3N8RlxuICBHIC0tPnxDaGFuZ2UgcmVwbGljYXN8SCIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0IiwidGhlbWVWYXJpYWJsZXMiOnsiYmFja2dyb3VuZCI6IndoaXRlIiwicHJpbWFyeUNvbG9yIjoiI0VDRUNGRiIsInNlY29uZGFyeUNvbG9yIjoiI2ZmZmZkZSIsInRlcnRpYXJ5Q29sb3IiOiJoc2woODAsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsInByaW1hcnlCb3JkZXJDb2xvciI6ImhzbCgyNDAsIDYwJSwgODYuMjc0NTA5ODAzOSUpIiwic2Vjb25kYXJ5Qm9yZGVyQ29sb3IiOiJoc2woNjAsIDYwJSwgODMuNTI5NDExNzY0NyUpIiwidGVydGlhcnlCb3JkZXJDb2xvciI6ImhzbCg4MCwgNjAlLCA4Ni4yNzQ1MDk4MDM5JSkiLCJwcmltYXJ5VGV4dENvbG9yIjoiIzEzMTMwMCIsInNlY29uZGFyeVRleHRDb2xvciI6IiMwMDAwMjEiLCJ0ZXJ0aWFyeVRleHRDb2xvciI6InJnYig5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSwgOS41MDAwMDAwMDAxKSIsImxpbmVDb2xvciI6IiMzMzMzMzMiLCJ0ZXh0Q29sb3IiOiIjMzMzIiwibWFpbkJrZyI6IiNFQ0VDRkYiLCJzZWNvbmRCa2ciOiIjZmZmZmRlIiwiYm9yZGVyMSI6IiM5MzcwREIiLCJib3JkZXIyIjoiI2FhYWEzMyIsImFycm93aGVhZENvbG9yIjoiIzMzMzMzMyIsImZvbnRGYW1pbHkiOiJcInRyZWJ1Y2hldCBtc1wiLCB2ZXJkYW5hLCBhcmlhbCIsImZvbnRTaXplIjoiMTZweCIsImxhYmVsQmFja2dyb3VuZCI6IiNlOGU4ZTgiLCJub2RlQmtnIjoiI0VDRUNGRiIsIm5vZGVCb3JkZXIiOiIjOTM3MERCIiwiY2x1c3RlckJrZyI6IiNmZmZmZGUiLCJjbHVzdGVyQm9yZGVyIjoiI2FhYWEzMyIsImRlZmF1bHRMaW5rQ29sb3IiOiIjMzMzMzMzIiwidGl0bGVDb2xvciI6IiMzMzMiLCJlZGdlTGFiZWxCYWNrZ3JvdW5kIjoiI2U4ZThlOCIsImFjdG9yQm9yZGVyIjoiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpIiwiYWN0b3JCa2ciOiIjRUNFQ0ZGIiwiYWN0b3JUZXh0Q29sb3IiOiJibGFjayIsImFjdG9yTGluZUNvbG9yIjoiZ3JleSIsInNpZ25hbENvbG9yIjoiIzMzMyIsInNpZ25hbFRleHRDb2xvciI6IiMzMzMiLCJsYWJlbEJveEJrZ0NvbG9yIjoiI0VDRUNGRiIsImxhYmVsQm94Qm9yZGVyQ29sb3IiOiJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSkiLCJsYWJlbFRleHRDb2xvciI6ImJsYWNrIiwibG9vcFRleHRDb2xvciI6ImJsYWNrIiwibm90ZUJvcmRlckNvbG9yIjoiI2FhYWEzMyIsIm5vdGVCa2dDb2xvciI6IiNmZmY1YWQiLCJub3RlVGV4dENvbG9yIjoiYmxhY2siLCJhY3RpdmF0aW9uQm9yZGVyQ29sb3IiOiIjNjY2IiwiYWN0aXZhdGlvbkJrZ0NvbG9yIjoiI2Y0ZjRmNCIsInNlcXVlbmNlTnVtYmVyQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvciI6InJnYmEoMTAyLCAxMDIsIDI1NSwgMC40OSkiLCJhbHRTZWN0aW9uQmtnQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvcjIiOiIjZmZmNDAwIiwidGFza0JvcmRlckNvbG9yIjoiIzUzNGZiYyIsInRhc2tCa2dDb2xvciI6IiM4YTkwZGQiLCJ0YXNrVGV4dExpZ2h0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0RGFya0NvbG9yIjoiYmxhY2siLCJ0YXNrVGV4dE91dHNpZGVDb2xvciI6ImJsYWNrIiwidGFza1RleHRDbGlja2FibGVDb2xvciI6IiMwMDMxNjMiLCJhY3RpdmVUYXNrQm9yZGVyQ29sb3IiOiIjNTM0ZmJjIiwiYWN0aXZlVGFza0JrZ0NvbG9yIjoiI2JmYzdmZiIsImdyaWRDb2xvciI6ImxpZ2h0Z3JleSIsImRvbmVUYXNrQmtnQ29sb3IiOiJsaWdodGdyZXkiLCJkb25lVGFza0JvcmRlckNvbG9yIjoiZ3JleSIsImNyaXRCb3JkZXJDb2xvciI6IiNmZjg4ODgiLCJjcml0QmtnQ29sb3IiOiJyZWQiLCJ0b2RheUxpbmVDb2xvciI6InJlZCIsImxhYmVsQ29sb3IiOiJibGFjayIsImVycm9yQmtnQ29sb3IiOiIjNTUyMjIyIiwiZXJyb3JUZXh0Q29sb3IiOiIjNTUyMjIyIiwiY2xhc3NUZXh0IjoiIzEzMTMwMCIsImZpbGxUeXBlMCI6IiNFQ0VDRkYiLCJmaWxsVHlwZTEiOiIjZmZmZmRlIiwiZmlsbFR5cGUyIjoiaHNsKDMwNCwgMTAwJSwgOTYu
Mjc0NTA5ODAzOSUpIiwiZmlsbFR5cGUzIjoiaHNsKDEyNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIiwiZmlsbFR5cGU0IjoiaHNsKDE3NiwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU1IjoiaHNsKC00LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkiLCJmaWxsVHlwZTYiOiJoc2woOCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU3IjoiaHNsKDE4OCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIn19fQ"><img src="https://mermaid.ink/img/eyJjb2RlIjoiZ3JhcGggVERcbiAgQVtzY2hlZHVsZXNfcmV0cmlldmVfdGltZWRfZXhhbWluYXRpb25zXVxuICBCW01vbmdvREJdXG4gIHN1YmdyYXBoIEt1YmVybmV0ZXNcbiAgR1tIUEFdXG4gIEhbYXBpIGRlcGxveW1lbnRdXG4gIEpbRGF0YWRvZyBDbHVzdGVyIEFnZW50XVxuICAgIHN1YmdyYXBoIHRpbWVkLWV4YW0tc2NoZWR1bGUtZXhwb3J0ZXIgbmFtZXNwYWNlICBcbiAgICAgIEVbdGltZWQtZXhhbS1zY2hlZHVsZS1leHBvcnRlcl1cbiAgICAgIElbYXBpLWV4YW0tZGF0YSBDb25maWdtYXBdXG4gICAgICBFIC0tPnxtb3VudHxJXG4gICAgZW5kXG4gIGVuZFxuICBGW0RhdGFkb2ddXG4gIHN1YmdyYXBoIEplbmtpbnNcbiAgICBBXG4gIGVuZFxuXG5cbiAgQSAtLT58UmV0cml2ZSBkYXRhfEJcbiAgQSAtLT58UHV0IDIwMjAtbW0tZGQudHN2fElcbiAgRiAtLT58R2V0IGhvc3Q6ODA4MC9tZXRyaWNzfEVcbiAgRyAtLT58Q2hlY2sgbWV0cmljc3xKXG4gIEogLS0-fENoZWNrIG1ldHJpY3N8RlxuICBHIC0tPnxDaGFuZ2UgcmVwbGljYXN8SCIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0IiwidGhlbWVWYXJpYWJsZXMiOnsiYmFja2dyb3VuZCI6IndoaXRlIiwicHJpbWFyeUNvbG9yIjoiI0VDRUNGRiIsInNlY29uZGFyeUNvbG9yIjoiI2ZmZmZkZSIsInRlcnRpYXJ5Q29sb3IiOiJoc2woODAsIDEwMCUsIDk2LjI3NDUwOTgwMzklKSIsInByaW1hcnlCb3JkZXJDb2xvciI6ImhzbCgyNDAsIDYwJSwgODYuMjc0NTA5ODAzOSUpIiwic2Vjb25kYXJ5Qm9yZGVyQ29sb3IiOiJoc2woNjAsIDYwJSwgODMuNTI5NDExNzY0NyUpIiwidGVydGlhcnlCb3JkZXJDb2xvciI6ImhzbCg4MCwgNjAlLCA4Ni4yNzQ1MDk4MDM5JSkiLCJwcmltYXJ5VGV4dENvbG9yIjoiIzEzMTMwMCIsInNlY29uZGFyeVRleHRDb2xvciI6IiMwMDAwMjEiLCJ0ZXJ0aWFyeVRleHRDb2xvciI6InJnYig5LjUwMDAwMDAwMDEsIDkuNTAwMDAwMDAwMSwgOS41MDAwMDAwMDAxKSIsImxpbmVDb2xvciI6IiMzMzMzMzMiLCJ0ZXh0Q29sb3IiOiIjMzMzIiwibWFpbkJrZyI6IiNFQ0VDRkYiLCJzZWNvbmRCa2ciOiIjZmZmZmRlIiwiYm9yZGVyMSI6IiM5MzcwREIiLCJib3JkZXIyIjoiI2FhYWEzMyIsImFycm93aGVhZENvbG9yIjoiIzMzMzMzMyIsImZvbnRGYW1pbHkiOiJcInRyZWJ1Y2hldCBtc1wiLCB2ZXJkYW5hLCBhcmlhbCIsImZvbnRTaXplIjoiMTZweCIsImxhYmVsQmFja2dyb3VuZCI6IiNlOGU4ZTgiLCJub2RlQmtnIjoiI0VDRUNGRiIsIm5vZGVCb3JkZXIiOiIjOTM3MERCIiwiY2x1c3RlckJrZyI6IiNmZmZmZGUiLCJjbHVzdGVyQm9yZGVyIjoiI2FhYWEzMyIsImRlZmF1bHRMaW5rQ29sb3IiOiIjMzMzMzMzIiwidGl0bGVDb2xvciI6IiMzMzMiLCJlZGdlTGFiZWxCYWNrZ3JvdW5kIjoiI2U4ZThlOCIsImFjdG9yQm9yZGVyIjoiaHNsKDI1OS42MjYxNjgyMjQzLCA1OS43NzY1MzYzMTI4JSwgODcuOTAxOTYwNzg0MyUpIiwiYWN0b3JCa2ciOiIjRUNFQ0ZGIiwiYWN0b3JUZXh0Q29sb3IiOiJibGFjayIsImFjdG9yTGluZUNvbG9yIjoiZ3JleSIsInNpZ25hbENvbG9yIjoiIzMzMyIsInNpZ25hbFRleHRDb2xvciI6IiMzMzMiLCJsYWJlbEJveEJrZ0NvbG9yIjoiI0VDRUNGRiIsImxhYmVsQm94Qm9yZGVyQ29sb3IiOiJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSkiLCJsYWJlbFRleHRDb2xvciI6ImJsYWNrIiwibG9vcFRleHRDb2xvciI6ImJsYWNrIiwibm90ZUJvcmRlckNvbG9yIjoiI2FhYWEzMyIsIm5vdGVCa2dDb2xvciI6IiNmZmY1YWQiLCJub3RlVGV4dENvbG9yIjoiYmxhY2siLCJhY3RpdmF0aW9uQm9yZGVyQ29sb3IiOiIjNjY2IiwiYWN0aXZhdGlvbkJrZ0NvbG9yIjoiI2Y0ZjRmNCIsInNlcXVlbmNlTnVtYmVyQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvciI6InJnYmEoMTAyLCAxMDIsIDI1NSwgMC40OSkiLCJhbHRTZWN0aW9uQmtnQ29sb3IiOiJ3aGl0ZSIsInNlY3Rpb25Ca2dDb2xvcjIiOiIjZmZmNDAwIiwidGFza0JvcmRlckNvbG9yIjoiIzUzNGZiYyIsInRhc2tCa2dDb2xvciI6IiM4YTkwZGQiLCJ0YXNrVGV4dExpZ2h0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0Q29sb3IiOiJ3aGl0ZSIsInRhc2tUZXh0RGFya0NvbG9yIjoiYmxhY2siLCJ0YXNrVGV4dE91dHNpZGVDb2xvciI6ImJsYWNrIiwidGFza1RleHRDbGlja2FibGVDb2xvciI6IiMwMDMxNjMiLCJhY3RpdmVUYXNrQm9yZGVyQ29sb3IiOiIjNTM0ZmJjIiwiYWN0aXZlVGFza0JrZ0NvbG9yIjoiI2JmYzdmZiIsImdyaWRDb2xvciI6ImxpZ2h0Z3JleSIsImRvbmVUYXNrQmtnQ29sb3IiOiJsaWdodGdyZXkiLCJkb25lVGFza0JvcmRlckNvbG9yIjoiZ3JleSIsImNyaXRCb3JkZXJDb2xvciI6IiNmZjg4ODgiL
CJjcml0QmtnQ29sb3IiOiJyZWQiLCJ0b2RheUxpbmVDb2xvciI6InJlZCIsImxhYmVsQ29sb3IiOiJibGFjayIsImVycm9yQmtnQ29sb3IiOiIjNTUyMjIyIiwiZXJyb3JUZXh0Q29sb3IiOiIjNTUyMjIyIiwiY2xhc3NUZXh0IjoiIzEzMTMwMCIsImZpbGxUeXBlMCI6IiNFQ0VDRkYiLCJmaWxsVHlwZTEiOiIjZmZmZmRlIiwiZmlsbFR5cGUyIjoiaHNsKDMwNCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGUzIjoiaHNsKDEyNCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIiwiZmlsbFR5cGU0IjoiaHNsKDE3NiwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU1IjoiaHNsKC00LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkiLCJmaWxsVHlwZTYiOiJoc2woOCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwiZmlsbFR5cGU3IjoiaHNsKDE4OCwgMTAwJSwgOTMuNTI5NDExNzY0NyUpIn19LCJ1cGRhdGVFZGl0b3IiOnRydWUsImF1dG9TeW5jIjp0cnVlLCJ1cGRhdGVEaWFncmFtIjp0cnVlfQ" alt="" /></a></p>
<h3 id="fetch-data-from-our-database-and-save-it-as-a-configmap">Fetch data from our database and save it as a ConfigMap</h3>
<p>See the area around <code class="language-plaintext highlighter-rouge">schedules_retrive_timed_examinations</code> at the bottom right (check the diagram above). @bdesmero created this part. <code class="language-plaintext highlighter-rouge">schedules_retrive_timed_examinations</code> gets the starting time of the exam and the corresponding number of students from our database and saves it as a TSV file. The TSV file looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>12:00 229
12:15 54
12:45 67
13:00 3684
13:15 91
13:30 4821
13:45 37
14:00 138
</code></pre></div></div>
<p>We divided the work between @bdesmero (as the web developer) and me (as the SRE).
The dependency on Jenkins and the use of a ConfigMap are drawbacks that increase the number of points of failure. Still, I think it was a reasonable choice given the short timeline and the cross-team cooperation it required.</p>
<h3 id="export-the-read-data-from-tsv-in-prometheus-format">Export the read data from TSV in Prometheus format</h3>
<p>Next, let’s take a look at the timed-exam-schedule-exporter component on the bottom left. It is written in Go and runs as a Kubernetes Deployment.</p>
<p>This component does the following:</p>
<ul>
<li>Mount the ConfigMap</li>
<li>Read the file in an infinite loop</li>
<li>Compare with the current time</li>
<li>Export the corresponding number of users in Prometheus format</li>
</ul>
<p>The key point is to export the value for 15 minutes ahead of the current time, because we want pods/nodes to start scaling out 15 minutes before users actually access the platform, given the time it takes to scale.</p>
<p>Let’s take a look at the code (it’s not that long, around 180 lines).</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"encoding/csv"</span>
<span class="s">"errors"</span>
<span class="s">"fmt"</span>
<span class="s">"io"</span>
<span class="s">"log"</span>
<span class="s">"net/http"</span>
<span class="s">"os"</span>
<span class="s">"strconv"</span>
<span class="s">"time"</span>
<span class="s">"github.com/prometheus/client_golang/prometheus"</span>
<span class="s">"github.com/prometheus/client_golang/prometheus/promhttp"</span>
<span class="p">)</span>
<span class="k">var</span> <span class="p">(</span>
<span class="c">//nolint:gochecknoglobals</span>
<span class="n">desiredReplicas</span> <span class="o">=</span> <span class="n">prometheus</span><span class="o">.</span><span class="n">NewGauge</span><span class="p">(</span><span class="n">prometheus</span><span class="o">.</span><span class="n">GaugeOpts</span><span class="p">{</span>
<span class="n">Namespace</span><span class="o">:</span> <span class="s">"timed_exam"</span><span class="p">,</span>
<span class="n">Subsystem</span><span class="o">:</span> <span class="s">"scheduled_scaling"</span><span class="p">,</span>
<span class="n">Name</span><span class="o">:</span> <span class="s">"desired_replicas"</span><span class="p">,</span>
<span class="n">Help</span><span class="o">:</span> <span class="s">"Number of desired replicas for timed exam"</span><span class="p">,</span>
<span class="p">})</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">interval</span> <span class="o">=</span> <span class="m">10</span>
<span class="n">prometheus</span><span class="o">.</span><span class="n">MustRegister</span><span class="p">(</span><span class="n">desiredReplicas</span><span class="p">)</span>
<span class="n">http</span><span class="o">.</span><span class="n">Handle</span><span class="p">(</span><span class="s">"/metrics"</span><span class="p">,</span> <span class="n">promhttp</span><span class="o">.</span><span class="n">Handler</span><span class="p">())</span>
<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
<span class="n">ticker</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">NewTicker</span><span class="p">(</span><span class="n">interval</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>
<span class="c">// register metrics as background</span>
<span class="k">for</span> <span class="k">range</span> <span class="n">ticker</span><span class="o">.</span><span class="n">C</span> <span class="p">{</span>
<span class="n">err</span> <span class="o">:=</span> <span class="n">snapshot</span><span class="p">()</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}()</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">http</span><span class="o">.</span><span class="n">ListenAndServe</span><span class="p">(</span><span class="s">":8080"</span><span class="p">,</span> <span class="no">nil</span><span class="p">))</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">snapshot</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">timeDifferencesToJapan</span> <span class="o">=</span> <span class="o">+</span><span class="m">9</span> <span class="o">*</span> <span class="m">60</span> <span class="o">*</span> <span class="m">60</span>
<span class="n">tz</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">FixedZone</span><span class="p">(</span><span class="s">"JST"</span><span class="p">,</span> <span class="n">timeDifferencesToJapan</span><span class="p">)</span>
<span class="n">t</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">()</span><span class="o">.</span><span class="n">In</span><span class="p">(</span><span class="n">tz</span><span class="p">)</span>
<span class="n">today</span> <span class="o">:=</span> <span class="n">t</span><span class="o">.</span><span class="n">Format</span><span class="p">(</span><span class="s">"2006-01-02"</span><span class="p">)</span>
<span class="c">// Configmap is mounted</span>
<span class="n">filename</span> <span class="o">:=</span> <span class="s">"/etc/config/"</span> <span class="o">+</span> <span class="n">today</span> <span class="o">+</span> <span class="s">".tsv"</span>
<span class="n">file</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to open file: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">defer</span> <span class="n">file</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
<span class="n">currentUsers</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getCurrentUsers</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tz</span><span class="p">,</span> <span class="n">file</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to get the current number of users: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">desiredReplicas</span><span class="o">.</span><span class="n">Set</span><span class="p">(</span><span class="n">currentUsers</span><span class="p">)</span>
<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">getCurrentUsers</span><span class="p">(</span><span class="n">now</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">tz</span> <span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Location</span><span class="p">,</span> <span class="n">file</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">)</span> <span class="p">(</span><span class="kt">float64</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">metricTimeDifference</span> <span class="o">=</span> <span class="o">+</span><span class="m">15</span>
<span class="c">// read input file</span>
<span class="n">reader</span> <span class="o">:=</span> <span class="n">csv</span><span class="o">.</span><span class="n">NewReader</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">reader</span><span class="o">.</span><span class="n">Comma</span> <span class="o">=</span> <span class="sc">'\t'</span>
<span class="c">// line[0] is time. i.e. "13:00"</span>
<span class="c">// line[1] is users. i.e. "350"</span>
<span class="k">var</span> <span class="n">previousNumberOfUsers</span> <span class="kt">float64</span> <span class="c">// A variable for storing the value of the previous loop</span>
<span class="k">var</span> <span class="n">index</span> <span class="kt">int64</span>
<span class="k">for</span> <span class="p">{</span>
<span class="n">index</span><span class="o">++</span>
<span class="n">parsedTSVLine</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">parseLine</span><span class="p">(</span><span class="n">reader</span><span class="p">,</span> <span class="n">now</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">Is</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">io</span><span class="o">.</span><span class="n">EOF</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">previousNumberOfUsers</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to parse a line (line: %d): %w"</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// Example:</span>
<span class="c">// line[0] line[1]</span>
<span class="c">// 17:00 4</span>
<span class="c">// 17:15 10</span>
<span class="c">//</span>
<span class="c">// Loop compares the current time with the time on line[0],</span>
<span class="c">// and if the current time is later than the current time,</span>
<span class="c">// the previous line[1] is used as gauge.</span>
<span class="c">//</span>
<span class="c">// To prepare the pods and nodes "metricTimeDifference" minutes in advance,</span>
<span class="c">// expose the value "metricTimeDifference" minutes ahead of the current value.</span>
<span class="c">// In the above example, it will expose 10 at 17:00.</span>
<span class="k">if</span> <span class="n">parsedTSVLine</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">After</span><span class="p">(</span><span class="n">now</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="n">metricTimeDifference</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Minute</span><span class="p">))</span> <span class="p">{</span>
<span class="c">// If the time of the first line is earlier than the time of the first line,</span>
<span class="c">// expose the value of the first line.</span>
<span class="k">if</span> <span class="n">previousNumberOfUsers</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">parsedTSVLine</span><span class="o">.</span><span class="n">numberOfUsers</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">previousNumberOfUsers</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">previousNumberOfUsers</span> <span class="o">=</span> <span class="n">parsedTSVLine</span><span class="o">.</span><span class="n">numberOfUsers</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">type</span> <span class="n">tsvLine</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">time</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span>
<span class="n">numberOfUsers</span> <span class="kt">float64</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">parseLine</span><span class="p">(</span><span class="n">reader</span> <span class="o">*</span><span class="n">csv</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">now</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">tz</span> <span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Location</span><span class="p">)</span> <span class="p">(</span><span class="n">tsvLine</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="n">line</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">readLineOfTSV</span><span class="p">(</span><span class="n">reader</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to read a line from TSV: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">parsedTime</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">parseTime</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">now</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to parse time from string to time: %s: %w"</span><span class="p">,</span> <span class="n">line</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">parsedNumberOfUsers</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">strconv</span><span class="o">.</span><span class="n">ParseFloat</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="m">64</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"the TSV file is invalid. The value of second column must be float: %s: %w"</span><span class="p">,</span> <span class="n">line</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">tsvLine</span><span class="p">{</span>
<span class="n">time</span><span class="o">:</span> <span class="n">parsedTime</span><span class="p">,</span>
<span class="n">numberOfUsers</span><span class="o">:</span> <span class="n">parsedNumberOfUsers</span><span class="p">,</span>
<span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">parseTime</span><span class="p">(</span><span class="n">inputTime</span> <span class="kt">string</span><span class="p">,</span> <span class="n">t</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">tz</span> <span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Location</span><span class="p">)</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">layout</span> <span class="o">=</span> <span class="s">"15:04"</span>
<span class="c">// parse "13:00" -> 2020-11-05 13:00:00 +0900 JST</span>
<span class="n">startTime</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">ParseInLocation</span><span class="p">(</span><span class="n">layout</span><span class="p">,</span> <span class="n">inputTime</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">{},</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to parse a time string %s (layout: %s): %w"</span><span class="p">,</span> <span class="n">inputTime</span><span class="p">,</span> <span class="n">layout</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">parsedTime</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Date</span><span class="p">(</span>
<span class="n">t</span><span class="o">.</span><span class="n">Year</span><span class="p">(),</span> <span class="n">t</span><span class="o">.</span><span class="n">Month</span><span class="p">(),</span> <span class="n">t</span><span class="o">.</span><span class="n">Day</span><span class="p">(),</span>
<span class="n">startTime</span><span class="o">.</span><span class="n">Hour</span><span class="p">(),</span> <span class="n">startTime</span><span class="o">.</span><span class="n">Minute</span><span class="p">(),</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">tz</span><span class="p">)</span>
<span class="k">return</span> <span class="n">parsedTime</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">readLineOfTSV</span><span class="p">(</span><span class="n">reader</span> <span class="o">*</span><span class="n">csv</span><span class="o">.</span><span class="n">Reader</span><span class="p">)</span> <span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">columnNum</span> <span class="o">=</span> <span class="m">2</span>
<span class="n">line</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">reader</span><span class="o">.</span><span class="n">Read</span><span class="p">()</span>
<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">Is</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">io</span><span class="o">.</span><span class="n">EOF</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"end of file: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"loading error: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// Check if the input tsv file is valid</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="o">!=</span> <span class="n">columnNum</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"the input tsv column is invalid. expected: %d actual: %d"</span><span class="p">,</span> <span class="n">columnNum</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="p">))</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">line</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">main()</code> and the <code class="language-plaintext highlighter-rouge">snapshot()</code> functions are the essential parts of this design.</p>
<p>In <code class="language-plaintext highlighter-rouge">main()</code>, we do some background processing using a <a href="https://gobyexample.com/tickers">ticker</a> and listen on HTTP port 8080.</p>
<p>In <code class="language-plaintext highlighter-rouge">snapshot()</code>, we read the file, get the values we need, and set them as gauge metrics in <code class="language-plaintext highlighter-rouge">desiredReplicas.Set(currentUsers)</code>.</p>
<p>The rest of the code reads and parses the lines. Basically, in the <a href="https://github.com/prometheus/client_golang">Prometheus Go client library</a>, the timestamp is set to the current time. <a href="https://docs.datadoghq.com/api/v1/metrics/#submit-metrics">In Datadog, timestamps cannot be set more than 10 minutes in the future or more than 1 hour in the past</a>, so instead of setting a future timestamp we export the value that is 15 minutes ahead with the current timestamp.</p>
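<p>To illustrate the 15-minute look-ahead, here is a minimal test sketch (my own addition, assuming the exporter code above lives in <code class="language-plaintext highlighter-rouge">package main</code> and keeps the <code class="language-plaintext highlighter-rouge">getCurrentUsers</code> signature shown there):</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"strings"
	"testing"
	"time"
)

// At 13:00 JST the exporter should already expose the 13:15 value,
// because it looks metricTimeDifference (15) minutes ahead.
func TestGetCurrentUsersLooksAhead(t *testing.T) {
	tz := time.FixedZone("JST", 9*60*60)
	now := time.Date(2021, 7, 13, 13, 0, 0, 0, tz)
	tsv := "12:00\t229\n13:00\t3684\n13:15\t91\n13:30\t4821\n"

	got, err := getCurrentUsers(now, tz, strings.NewReader(tsv))
	if err != nil {
		t.Fatal(err)
	}
	if got != 91 {
		t.Errorf("expected 91 (the 13:15 value), got %v", got)
	}
}
</code></pre></div></div>
<p>With the sample TSV from earlier, the gauge already reports 91 at 13:00 and 4821 at 13:15, which is what gives the pods and nodes their 15-minute head start.</p>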
<p>Here is an example of getting the exported metrics.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># in another window
# kubectl port-forward timed-exam-schedule-exporter-775fcc7c5b-qg6q6 8080:8080 -n timed-exam-schedule-exporter
$ curl -s localhost:8080/metrics | grep timed_exam_scheduled_scaling_desired_replicas
# HELP timed_exam_scheduled_scaling_desired_replicas Number of desired replicas for timed exam
# TYPE timed_exam_scheduled_scaling_desired_replicas gauge
</code></pre></div></div>
<h3 id="get-datadog-agent-to-scrape-the-exported-metrics">Get datadog-agent to scrape the exported metrics.</h3>
<p>We use <a href="https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes">Datadog Kubernetes Integration Autodiscovery</a>, which looks at the Pod’s annotation and fetches the metrics for us.</p>
<p>Here is the deployment manifest:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">namespace</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">replicas</span><span class="pi">:</span> <span class="m">3</span>
<span class="na">selector</span><span class="pi">:</span>
<span class="na">matchLabels</span><span class="pi">:</span>
<span class="na">app</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">template</span><span class="pi">:</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">app</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">annotations</span><span class="pi">:</span>
<span class="s">ad.datadoghq.com/timed-exam-schedule-exporter.check_names</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">["prometheus"]</span>
<span class="s">ad.datadoghq.com/timed-exam-schedule-exporter.init_configs</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">[{}]</span>
<span class="s">ad.datadoghq.com/timed-exam-schedule-exporter.instances</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">[</span>
<span class="s">{</span>
<span class="s">"prometheus_url": "http://%%host%%:8080/metrics",</span>
<span class="s">"namespace": "timed_exam",</span>
<span class="s">"metrics": ["*"]</span>
<span class="s">}</span>
<span class="s">]</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">containers</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">image</span><span class="pi">:</span> <span class="s"><aws-account-id>.dkr.ecr.<region-name>.amazonaws.com/timed-exam-schedule-exporter:<commit hash></span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam-schedule-exporter</span>
<span class="na">ports</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">http</span>
<span class="na">containerPort</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">livenessProbe</span><span class="pi">:</span>
<span class="na">initialDelaySeconds</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">httpGet</span><span class="pi">:</span>
<span class="na">path</span><span class="pi">:</span> <span class="s">/metrics</span>
<span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">resources</span><span class="pi">:</span>
<span class="na">limits</span><span class="pi">:</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">100Mi</span>
<span class="na">requests</span><span class="pi">:</span>
<span class="na">cpu</span><span class="pi">:</span> <span class="s">100m</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">100Mi</span>
<span class="na">volumeMounts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/etc/config</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">config-volume</span>
<span class="na">volumes</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">configMap</span><span class="pi">:</span>
<span class="na">defaultMode</span><span class="pi">:</span> <span class="m">420</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">api-exam-data</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">config-volume</span>
</code></pre></div></div>
<h3 id="use-datadog-query-to-scale-in-hpa">Use Datadog query to scale in HPA</h3>
<p>Finally, take a look at the upper left part of the diagram. It’s probably easier to understand if you look at the manifest.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">autoscaling/v2beta2</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">HorizontalPodAutoscaler</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">api</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">scaleTargetRef</span><span class="pi">:</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">api</span>
<span class="na">minReplicas</span><span class="pi">:</span> <span class="m">40</span>
<span class="na">maxReplicas</span><span class="pi">:</span> <span class="m">1000</span>
<span class="na">metrics</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">Resource</span>
<span class="na">resource</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">cpu</span>
<span class="na">target</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">Utilization</span>
<span class="na">averageUtilization</span><span class="pi">:</span> <span class="m">60</span> <span class="c1"># want 570 mcore of cpu usage. 570 / 950(requests) = 0.6</span>
<span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">External</span>
<span class="na">external</span><span class="pi">:</span>
<span class="na">metric</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">datadogmetric@production:timed-exam</span>
<span class="na">target</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">AverageValue</span>
<span class="na">averageValue</span><span class="pi">:</span> <span class="m">1</span>
</code></pre></div></div>
<p>The “type: External” entry is what we added to our existing HPA. Note that the HPA allows us to specify multiple metrics, and <a href="https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/controller/podautoscaler/horizontal.go#L261-L276">it uses the highest of the calculated replica counts</a>. Thanks to this mechanism, we can combine scaling by different metrics at specific times while keeping the usual CPU-based scaling.</p>
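<p>Conceptually, the selection works like the following sketch (a simplified illustration, not the actual HPA controller code):</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

// desiredReplicasFromMetrics sketches the idea behind the HPA's handling of
// multiple metrics: every metric proposes a replica count, and the highest
// proposal wins.
func desiredReplicasFromMetrics(proposals []int32) int32 {
	var desired int32
	for _, p := range proposals {
		if p > desired {
			desired = p
		}
	}
	return desired
}
</code></pre></div></div>
<p>For example, if the CPU metric proposes 50 replicas and the external metric proposes 400, the HPA scales to 400.</p>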
<p>Here is the DatadogMetric being referenced.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">datadoghq.com/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">DatadogMetric</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">timed-exam</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="c1"># throughput: 10 = 500 / 5000. 500 pods accept 5000 users.</span>
<span class="c1"># ref: https://github.com/quipper/xxxxxxx/issues/xxxxx</span>
<span class="na">query</span><span class="pi">:</span> <span class="s">ceil(max:timed_exam.timed_exam_scheduled_scaling_desired_replicas{environment:production}/10)</span>
</code></pre></div></div>
<p>The Datadog custom metric <code class="language-plaintext highlighter-rouge">timed_exam.timed_exam_scheduled_scaling_desired_replicas</code> represents the number of users written in the TSV file, and the Datadog query calculates how many pods are required for that number of users. For example, 3684 users at 13:00 divided by a throughput of 10 users per pod gives ceil(3684 / 10) = 369 pods.</p>
<p>By using the Datadog query, I was able to reduce the amount of code I had to write.</p>
<h2 id="how-to-apply">How to apply</h2>
<p>After having confirmed the operation in Staging, I applied the following steps in Production:</p>
<ol>
<li>Deploy the ConfigMap and timed-exam-schedule-exporter to Production, and send metrics to Datadog.</li>
<li>Apply the DatadogMetric and a test HPA to confirm that the HPA works as expected.</li>
<li>Update the HPA of the production application. The minReplicas should be large at this point.</li>
<li>Gradually lower the value of minReplicas while observing the situation.</li>
</ol>
<p>Since this is a configuration change related to production scaling, and there are many integrated parts, I had to apply it carefully.</p>
<p>Note that even if you apply only the DatadogMetric, the HPA Controller does not retrieve the metric unless an HPA references that DatadogMetric. That’s because the cluster-agent executes the DatadogMetric query and updates its status only when the HPA Controller retrieves the metric. Therefore, we used a dummy application and a dummy HPA for the verification in step 2.</p>
<p>Once I knew that the Datadog custom metric and the HPA settings were all in place, I tested the setup by setting minReplicas to a high value and then gradually decreasing it, while keeping an eye on the actual TSV file to make sure the number of replicas changed based on its data. I was able to confirm that the replicas scaled out properly.</p>
<h2 id="faq">FAQ</h2>
<h3 id="what-happens-if-the-tsv-file-is-invalid">What happens if the TSV file is invalid?</h3>
<p>The timed-exam-schedule-exporter exposes a value of 0, in which case the deployment scales by CPU.</p>
<h3 id="what-happens-if-the-communication-with-datadog-fails">What happens if the communication with Datadog fails?</h3>
<p>The Datadog cluster-agent sets the <code class="language-plaintext highlighter-rouge">DatadogMetric</code> Custom Resource’s status to Invalid, and the HPA external metric calculation shows unknown. In this case, scaling falls back to CPU.</p>
<h3 id="what-happens-if-the-timed-exam-schedule-exporter-goes-down">What happens if the timed-exam-schedule-exporter goes down?</h3>
<p>The metrics are no longer sent to Datadog, so the query results in <code class="language-plaintext highlighter-rouge">No Data</code>. As above, scaling falls back to CPU.</p>
<p>In any of these cases, thanks to HPA’s behavior regarding multiple metrics, CPU scaling kicks in even if something goes wrong with the external metrics.</p>
<h2 id="result">Result</h2>
<p>The number of pods and nodes we have got so far:</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416898-f4fb3680-dd92-11eb-893f-2a56a2219e0f.png" alt="image" /></p>
<p>The number of pods consistently scales out to as many as 400 between 6:30 am and 7:30 pm.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416910-fb89ae00-dd92-11eb-91b7-5a0ad42de28b.png" alt="image" /></p>
<p>The number of Nodes also increases in proportion to the number of Pods.</p>
<p>And this is the number of Pods and custom metrics one week after we started using Scheduled-Scaling:</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416923-02b0bc00-dd93-11eb-8a71-d7f9dec7702b.png" alt="image" /></p>
<p>The yellow line is the metric registered with <code class="language-plaintext highlighter-rouge">DatadogMetric</code> Custom Resource, and the purple line is the HPA Desired Replicas.</p>
<p>How amazing it is! When there’s no traffic spike, the scaling is executed by CPU. On the other hand, when many users are expected to use the platform, scaling is executed by the External Metrics.</p>
<p>The number of Nodes is also lower than before. The decrease in the area graph indicates that we reduced costs: the daily usage cost has gone down from $250 to $145, an estimated cost reduction of about $3,150 per month.</p>
<p><img src="https://user-images.githubusercontent.com/10370988/124416931-093f3380-dd93-11eb-840c-8fa0104cbe07.png" alt="image" /></p>
<p>The purple line is the number of Nodes before we started using Scheduled-Scaling. The blue line is the number of Nodes after we started using Scheduled-Scaling.</p>
<p>We achieved flexible scaling based on the domain data of the number of users in scheduled exams. Furthermore, we were able to eliminate human intervention and reduce redundant infrastructure costs. That’s great!</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this article, I explained how to send the number of users as a Datadog custom metric and then scale on it as an External Metric with the HPA. The HPA’s support for multiple metrics enabled us to achieve Scheduled-Scaling safely alongside the existing CPU-based scaling. As a result, we were able to ensure both resource efficiency and reliability.</p>
<p>This case study has given us confidence in adopting Datadog metrics as external metrics for the HPA elsewhere. We have confirmed cases where CPU scaling does not work well for some services that use a messaging/queue system like Google Cloud Pub/Sub. I think that auto-scaling by a queue-length metric in Datadog might help scale those services properly.</p>
<p>Besides, I think this is a great example of problem solving through close communication among teams with different roles and responsibilities, including SRE, Web Developer, and Business Developer. We SREs may know how to use the Kubernetes HPA and Datadog, but we don’t know the details of the database and application features, such as the domain knowledge of the service. By sharing the problems and facing them together, we were led to success!</p>
<p><a href="https://career.quipper.com">Quipper is looking for people who want to Bring the Best Education to Every Corner of the World</a>. <a href="https://career.quipper.com/jp/jobs/sre/">SRE Team is also hiring</a>.</p>
Tue, 13 Jul 2021 14:00:00 +0000
https://devs.quipper.com/2021/07/13/schduled-scaling-with-kubernetes-external-metrics.html
https://devs.quipper.com/2021/07/13/schduled-scaling-with-kubernetes-external-metrics.htmlModifying a third-party library on a bytecode level<p>As developers in an EdTech company, it is important for us to keep up with the latest tech trends, especially when it involves one of the vital libraries our application can’t launch without: the Android Support Library. This matters because it enables us to reach a wide user base by supporting as many Android platform versions as possible.</p>
<p>But since the Android Support Library has reached its end of life, we recently migrated to androidx and updated a lot of dependencies. We survived a huge code change, but not without some hiccups. On the UI side of things, we needed to update the material-components library from version alpha01 to alpha05, and eventually alpha06, to fix an issue with our login. But this came at a price: during our happy-path testing, a side effect appeared, not in our code base but in a third-party library that we’re using.</p>
<p>We use a third-party SDK in our app for messaging between students and teachers, so this feature is critical for our app. For now, let’s call it “<strong>ChatSDK</strong>”. Upgrading Material Components can really impact a third-party library that uses UI components.</p>
<p>The Android <strong>ChatSDK</strong> crashes when <strong>ChatSDK</strong>BaseActivity calls <code class="language-plaintext highlighter-rouge">applyOverrideConfiguration</code>, throwing an</p>
<p><code class="language-plaintext highlighter-rouge">IllegalStateException(" Override configuration has already been set ")</code></p>
<p>This happens because <code class="language-plaintext highlighter-rouge">getResources()</code> or <code class="language-plaintext highlighter-rouge">getAssets()</code> has already been called.</p>
<p>In this case, <code class="language-plaintext highlighter-rouge">getResources</code> is called before <code class="language-plaintext highlighter-rouge">applyOverrideConfiguration</code> in the ContextLocaliser class.</p>
<p>After we found the root cause, we speculated that the recent library upgrades had triggered the third-party library’s crash. We tried downgrading the Material libraries to make it work, but that is not good practice, since a lot of UI code in our code base would be affected just to fix this crash, so we investigated further. Good thing there’s a dedicated forum for <strong>ChatSDK</strong> users, and we found out that we’re not alone. Surprisingly, others had been having this issue for five months. Around three months after the issue was reported, one of the company representatives commented that their engineers were working on a fix, but there was no ETA and no damage assessment.</p>
<p>Other users reported that downgrading the material support library fixes it for them, but for us it isn’t an option because appcompat is a dependency for many other (Jetpack) libraries. We are in the middle of a sprint and this feature is critical to our paying users so we have to look for a workaround. Our manager suggested that if we can reverse engineer the aar library and delete the offending line, it might work for us temporarily until <strong>ChatSDK</strong> releases a fix.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/1.png" alt="Before and after modification" /></p>
<p>In our team, we practice pair programming, so I worked with another developer to figure out the feasibility of the workaround.</p>
<p>First, we needed a Java decompiler, and since we were working on different machines, we each tried out decompilers on our own systems.</p>
<p>Step 1 - Extract java classes from the package</p>
<p>Step 2 - Modify Java Bytecode</p>
<p>Step 3 - Verify and repackage it again</p>
<p>Step 4 - Import the repackaged aar together with other dependencies</p>
<p>Then, in our project, we would import the repackaged library manually alongside the other dependencies, instead of fetching it through Gradle, for this workaround.</p>
<p>We tried a couple of decompilers, but we couldn’t modify the code down at the bytecode level and repackage it again.</p>
<p>We then found <a href="https://github.com/Col-E/Recaf">Recaf</a>, a modern Java bytecode editor.</p>
<p>But first, we had to copy and rename the base library, changing the file extension from .aar to .jar, in order to extract the classes folder before we could import it into the bytecode editor.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/2.png" alt="Dependencies" /></p>
<p>We found Recaf easier to use because it has a GUI. Just execute the downloaded jar file and a window will open. Drag in the classes.jar file and browse to the specific class file that you want to edit. Then right-click on the class name in the right-hand window and change the class mode to table.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/4.png" alt="Recaf" /></p>
<p>When on the table view, go to the methods tab and remove the target line.</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/5.png" alt="Class mode to table mode" /></p>
<p>Go back to the decompiler view again and see that the method call is now removed!</p>
<p>Now we can export the classes.jar with the modified target class</p>
<p><img src="/assets/article_images/2020-08-21-modifying-third-party-library-from-bytecode-level/6.png" alt="Target line removed" /></p>
<p>Now we can repackage the entire library and include it in our project manually.</p>
<p>Voila! The workaround works like a charm!</p>
<p>However, there is a side effect to this workaround: the SDK now only works in a single language, English, by default. Still, it’s better than an app crash in the meantime, until the <strong>ChatSDK</strong> developers release a fixed version.</p>
<p>As it turned out, a new version of the SDK was released with the fix the very next day, and we could just use it right away. We had mixed feelings, because just a day before we had gone down to the bytecode level of a library to find a workaround. But at least we learned a lot from it, and it may come in handy whenever we encounter similar problems in the future, so I think it was still worth trying.</p>
<p>It’s also a good thing that we didn’t have to release the workaround into production, and the fix arrived just in time before our sprint ended.</p>
Fri, 21 Aug 2020 10:00:00 +0000
https://devs.quipper.com/2020/08/21/modifying-third-party-library-from-bytecode-level.html
https://devs.quipper.com/2020/08/21/modifying-third-party-library-from-bytecode-level.htmlThe Clean Way to Handle Sendbird Webhook Using Ruby on Rails<h1 id="the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails">The Clean Way to Handle Sendbird Webhook Using Ruby on Rails</h1><p>Hi, as I said before in my <a href="https://medium.com/@rizky.syaban/why-i-choose-and-use-ruby-for-6-years-81e9322d4352">first blog</a>, I want to share about design patterns for Ruby. So I will share substantial reasons for the existence of design patterns and how a design pattern solves common problems in Ruby. To make it clear and understandable, I will explain it using a good example: the Sendbird webhook. Before we start, you can read the documentation <a href="https://docs.sendbird.com/platform/webhooks">here</a>.</p>
<p>Hi, as I said before in my <a href="https://medium.com/@rizky.syaban/why-i-choose-and-use-ruby-for-6-years-81e9322d4352">first blog</a> I want to share about design patterns for Ruby. So I will share substantial reasons for the existence of design patterns and how a design pattern solves your common problems in Ruby. And to make it clear and understandable, I will explain it using a good example: Sendbird Webhook. Before start, you can read the documentation <a href="https://docs.sendbird.com/platform/webhooks">here</a>.</p>
<p><img src="/assets/article_images/2020-04-22-the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails/rails_sendbird.png" alt="rails_sendbird" /></p>
<p>And for those of you who don’t know about design patterns: a design pattern is a general, reusable solution to common problems in Software Engineering. So I think it’s a <strong>must</strong> for a developer to know at least one design pattern, especially a Ruby developer. Why? Because Ruby is flexible, so we need something that can keep our codebases clean and understandable for every developer.</p>
<p>In this blog, I will explain my favorite design pattern, which is the Command Pattern. So are you ready? Let’s start then.</p>
<p>Like other webhook providers, Sendbird only needs one endpoint to handle all kinds of events. That part is very interesting, because the Command Pattern can solve exactly that problem. So first, create a controller for it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**app
|_controllers
|_sendbird_controller.rb**
</code></pre></div></div>
<p>And create a new action on it: webhook. Don’t forget to add it to <em>routes.rb.</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class SendbirdController
def webhook
status: 200
end
end
</code></pre></div></div>
<p>Since Sendbird doesn’t care about our process, just respond with <em>status: 200</em> immediately, and create a worker to handle the payload from Sendbird. Why use a worker? First, because Sendbird only sends the request 3 times until it receives <em>status: 200</em>, while our worker can save the payload and retry the process as many times as we want if we hit a problem, until the problem is gone. Second, because we need to respond immediately to avoid too many requests piling up on our server. Third, hmm, I think that’s it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app
|_controllers
|_sendbird_controller.rb
** |_workers
|_sendbird
|_webhook_worker.rb**
</code></pre></div></div>
<p>And call the worker from <em>sendbird_controller</em>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class SendbirdController
def webhook
::Sendbird::WebhookWorker.perform_later(params)
status: 200
end
end
</code></pre></div></div>
<p>Before we start coding the worker, let’s see Sendbird request params:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
'category': 'open_channel:create',
'created_at': 1540866408000,
'operators': [
{
'user_id': 'Jay',
'nickname': 'Mighty',
'profile_url': 'https://sendbird.com/main/img/profiles/profile_26_512px.png',
'metadata': {}
}
],
'channel': {
'name': 'Jeff and friends',
'channel_url': 'sendbird_open_channel_1_2681099203cd6b78414fe672927a43fcf3a30f09',
'custom_type': '',
'is_distinct': false,
'is_public': false,
'is_super': false,
'is_ephemeral': false,
'is_discoverable': false,
'data': ''
},
'app_id': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
}
</code></pre></div></div>
<p>The params above represent the command, and the <em>category</em> field represents the event of the command, which can serve as the key for the Command Pattern. We can see two parts in the <em>category</em> value: <em>open_channel</em>, which is the resource, and <em>create</em>, which is the event on that resource. If we used the traditional way, the worker code would look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
class WebhookWorker
def perform(params)
if params['category'] == 'open_channel:create'
# do something
elsif params['category] == 'open_channel:update'
# do something
...
end
end
end
end
</code></pre></div></div>
<p>Or</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
class WebhookWorker
def perform(params)
case params['category']
when 'open_channel:create'
# do something
when 'open_channel:update'
# do something
...
end
end
end
end
</code></pre></div></div>
<p>So what will happen next if we want to implement all kinds of events? Can you imagine that? LOL</p>
<p>So here’s the clean way to solve that problem:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
class WebhookWorker
attr_reader :params, :klass
def self.perform(params)
new(params).perform
end
def initialize(params)
module_name, klass_name = params['category'].split(':')
@params = params
@klass = "::Sendbird::Webhook::#{module_name.camelize}::#{klass.camelize}".constantize
end
def perform
klass.new(params).perform
end
end
end
</code></pre></div></div>
<p>To implement the logic for each resource and event, we only need to create a new service. For example, for <em>open_channel:create</em>, create a new service here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app
|_controllers
|_sendbird_controller.rb
**|_services
|_sendbird
|_webhook
|_open_channel
|_create.rb**
|_workers
|_sendbird
|_webhook_worker.rb
</code></pre></div></div>
<p>With this code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
module Webhook
module OpenChannel
class Create
attr_reader params
def initialize(params)
@params = params
end
def perform
# do something when create open_channel event happens
end
end
end
end
end
</code></pre></div></div>
<p>If we want to handle a new event, we simply create a new service. For example, say we now want to handle <em>group_channel:update</em>. Just create a new service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app
|_controllers
|_sendbird_controller.rb
|_services
|_sendbird
|_webhook
**|_group_channel
|_update.rb**
|_open_channel
|_create.rb
|_workers
|_sendbird
|_webhook_worker.rb
</code></pre></div></div>
<p>With this code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module Sendbird
module Webhook
module GroupChannel
class Update
attr_reader params
def initialize(params)
@params = params
end
def perform
# do something when update group_channel event happens
end
end
end
end
end
</code></pre></div></div>
<p>Simple, right? With this approach, we can follow <em>rubocop</em> rules about class and method length and keep each file readable. The trade-off is that you will end up with many files, which is okay for me.</p>
<p>I think that’s all. Thank you!</p>
<p>This post originally shared at <a href="https://medium.com/@rizky.syaban/the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails-334f5123703c">Medium</a></p>
Wed, 22 Apr 2020 00:00:00 +0000
https://devs.quipper.com/2020/04/22/the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails.html
https://devs.quipper.com/2020/04/22/the-clean-way-to-handle-sendbird-webhook-using-ruby-on-rails.htmlStyled System in Practice<p>In September of last year, I was assigned to a task force within the Quipper product team. We were formed to deploy a new app to market in roughly three months. Given the tight timeline, agility was a top priority, so every engineering decision had to be carefully considered.</p>
<p>I took it upon myself to prepare a styling framework for the React app we would be building. I was curious to explore a new CSS-in-JS styling methodology I discovered, called <a href="https://styled-system.com/">Styled System</a>. The project had over 5,000 stars on GitHub and apparently GitHub themselves used it to build their own design system.</p>
<p><img src="/assets/article_images/2020-04-06-styled-system-in-practice/primer-components.png" alt="Primer Components is GitHub's design system built with Styled System" /></p>
<p>The CSS-in-JS movement was alive and well by this time, but it was something I was lukewarm to because I hadn’t ever really used it at scale. Outside the JavaScript world, I’ve settled on writing my CSS <a href="https://www.smashingmagazine.com/2013/10/challenging-css-best-practices-atomic-approach/">the Atomic way</a> because it’s served me very reliably through all my previous projects. I wanted a React-y way to do something similar.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><h1</span> <span class="na">class=</span><span class="s">"text-lg font-bold text-center"</span><span class="nt">></span>
I'm being styled with atomic CSS!
<span class="nt"></h1></span>
</code></pre></div></div>
<p><em>An example of Atomic CSS (done with <a href="https://tailwindcss.com/">Tailwind CSS</a>)</em></p>
<p>If you’re not familiar with, or even a fan of, Atomic CSS, I’d encourage that you read <a href="https://adamwathan.me/css-utility-classes-and-separation-of-concerns">this blog post by Adam Wathan</a>—host of <a href="http://www.fullstackradio.com/">the excellent Full Stack Radio podcast</a>—because it chronicles our journey as an industry towards Atomic CSS and the rationale behind it. (I find that it closely parallels my own journey with CSS.) Styled System follows those same ideologies, so naturally, I had to build out the entire styling framework of our app with it. (Thanks for letting me run wild, team!)</p>
<h1 id="a-quick-primer-on-styled-system">A quick primer on Styled System</h1>
<p><img src="/assets/article_images/2020-04-06-styled-system-in-practice/styled-system.png" alt="" /></p>
<p>Styled System is a props-based styling methodology, meaning you style components by passing in styles as props (called <em>style props</em>):</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Text</span> <span class="na">color</span><span class="p">=</span><span class="s">"body"</span> <span class="na">fontSize</span><span class="p">=</span><span class="s">"2"</span><span class="p">></span>
Hello, Styled System!
<span class="p"></</span><span class="nc">Text</span><span class="p">></span>
</code></pre></div></div>
<p>It looks a little like Atomic CSS! Awesome! (Or like inline CSS, but those style rules are applied to your component via auto-generated classes, so they don’t actually create the same issues with specificity.) But, one key difference is, <em>being</em> <em>just plain CSS</em>, Styled System doesn’t require that you memorize different utility class names to apply the styles you want. You use plain old regular (albeit camelCased) CSS.</p>
<p>Take note though that the values being passed in aren’t your typical CSS values. <code class="language-plaintext highlighter-rouge">"body"</code> is not a valid CSS color name and <code class="language-plaintext highlighter-rouge">"2"</code> not a valid value without a corresponding unit. These are actually <em>theme values</em> taken from a global theme object defined at the top level of your application:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">theme</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">colors</span><span class="p">:</span> <span class="p">{</span>
<span class="na">body</span><span class="p">:</span> <span class="dl">"</span><span class="s2">#1e3f6b</span><span class="dl">"</span><span class="p">,</span>
<span class="p">},</span>
<span class="na">fontSizes</span><span class="p">:</span> <span class="p">[</span><span class="mi">12</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">20</span><span class="p">],</span>
<span class="p">};</span>
</code></pre></div></div>
<p><em><code class="language-plaintext highlighter-rouge">color="body"</code> points to <code class="language-plaintext highlighter-rouge">theme.colors.body</code> while <code class="language-plaintext highlighter-rouge">fontSize="3"</code> is <code class="language-plaintext highlighter-rouge">theme.fontSizes[3]</code></em></p>
<p>You can use this to constrain styles within a particular set of rules, like say a brand style guide or a design system. This way, your components can be made to follow the specifications handed to you by your designers (and they don’t have to scold you for being 1 pixel off, again).</p>
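<p>For those theme lookups to actually resolve, the theme object has to be provided at the root of the component tree. Here is a minimal sketch of that wiring, assuming <code class="language-plaintext highlighter-rouge">styled-components</code> (the CSS-in-JS library we paired Styled System with); the file name and import paths are placeholders:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// App.js
import React from "react";
import { ThemeProvider } from "styled-components";
import Text from "./Text";

const theme = {
  colors: { body: "#1e3f6b" },
  fontSizes: [12, 14, 16, 20],
};

// Anything rendered inside the provider can now resolve theme values
// like color="body" or fontSize="2" from its style props.
const App = () => (
  <ThemeProvider theme={theme}>
    <Text color="body" fontSize="2">Hello, Styled System!</Text>
  </ThemeProvider>
);

export default App;
</code></pre></div></div>
<p><em>A sketch of providing the global theme via <code class="language-plaintext highlighter-rouge">ThemeProvider</code></em></p>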
<p>Though to me, the main advantage to styling components this way is how it enables <em>rapid development</em>. Previously, we’d have to write our markup, then open a separate file to manage all our styles, which can become a tiresome exercise in context switching. The <em>worst part</em> of that system though—and it may seem trivial, but really it isn’t—is having to come up with appropriate class names each time.</p>
<p>Sure, it’s easy if we’re talking about naming the primary button on your site, but how about when we’re trying to target a specific button in a specific context within a specific page?</p>
<blockquote>
<p>There are only two hard things in Computer Science: cache invalidation and <em>naming things</em>.
—Phil Karlton</p>
</blockquote>
<p>But even if Styled System allows us to get away with those things, you’re likely not entirely convinced at this point. The number one thing on your mind right now might be:</p>
<blockquote>
<p>Still, why would I want all my styles in my HTML? What is this madness?!</p>
</blockquote>
<p>—which is a fair point and one the other engineers on the team weren’t shy of letting me know. But remember that because all this is happening in JavaScript, it can be easy to abstract away common patterns. If say, a heading used across the site needs a particular set of styles, it wouldn’t be ideal to have to write them over and over! You can actually create a component with all those base style rules passed in by default:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Heading.js</span>
<span class="kd">const</span> <span class="nx">Heading</span> <span class="o">=</span> <span class="p">({</span> <span class="nx">children</span><span class="p">,</span> <span class="p">...</span><span class="nx">props</span> <span class="p">})</span> <span class="o">=></span> <span class="p"><</span><span class="nc">Text</span> <span class="si">{</span><span class="p">...</span><span class="nx">props</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="nx">children</span><span class="si">}</span><span class="p"></</span><span class="nc">Text</span><span class="p">>;</span>
<span class="nx">Heading</span><span class="p">.</span><span class="nx">defaultProps</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontSize</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontWeight</span><span class="p">:</span> <span class="dl">"</span><span class="s2">bold</span><span class="dl">"</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The idea then is that instances of <code class="language-plaintext highlighter-rouge"><Heading></code> will only need to be given <em>context-specific styles</em> like <code class="language-plaintext highlighter-rouge">margin</code> or <code class="language-plaintext highlighter-rouge">textAlign</code>. This way, the styles for headings appearing within larger contexts will only be minimal and all the complex styling can remain in the underlying components.</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ArticleBlock.js</span>
<span class="k">import</span> <span class="nx">Card</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">./Card</span><span class="dl">"</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">Heading</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">./Heading</span><span class="dl">"</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">ArticleBlock</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">Card</span><span class="p">></span>
<span class="p"><</span><span class="nc">Heading</span> <span class="na">mb</span><span class="p">=</span><span class="s">"3"</span><span class="p">></span>Is Styled System the future?<span class="p"></</span><span class="nc">Heading</span><span class="p">></span>
<span class="si">{</span><span class="cm">/* ... */</span><span class="si">}</span>
<span class="p"></</span><span class="nc">Card</span><span class="p">></span>
<span class="p">);</span>
</code></pre></div></div>
<p><em>Styled System also supports property shorthands like <code class="language-plaintext highlighter-rouge">mb</code>, short for <code class="language-plaintext highlighter-rouge">marginBottom</code></em></p>
<p>You can also opt to solve this problem using <a href="https://styled-system.com/variants">the Styled System variants API</a>. Either method works, but my philosophy has been to use <code class="language-plaintext highlighter-rouge">variants</code> for rules specific to the design and components for those specific to the app.</p>
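<p>For reference, here is a rough, hypothetical sketch of what a design-level rule could look like with the <code class="language-plaintext highlighter-rouge">variant</code> helper from <code class="language-plaintext highlighter-rouge">styled-system</code>; the <code class="language-plaintext highlighter-rouge"><Badge></code> component and its variant names are made up for illustration, not something from our codebase:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Badge.js (hypothetical example)
import styled from "styled-components";
import { variant } from "styled-system";

const Badge = styled.span(
  {
    display: "inline-block",
    borderRadius: "4px",
    padding: "4px 8px",
  },
  // variant="info" or variant="warning" picks one of the style sets below;
  // the values go through the same theme-aware lookup as style props
  variant({
    variants: {
      info: { color: "white", bg: "blue" },
      warning: { color: "black", bg: "yellow" },
    },
  })
);

export default Badge;
</code></pre></div></div>
<p><em>Usage would look like <code class="language-plaintext highlighter-rouge"><Badge variant="warning">Heads up!</Badge></code></em></p>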
<h1 id="it-wasnt-all-perfect">It wasn’t all perfect</h1>
<p>All that being said though, style props still became an issue for our team because, even if we were able to limit context-specific styles to no more than 3 lines of props, some components would still require many more of <em>their own props</em> aside from that. This became an ugly mess for components that required a large mix of style and logic props:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Input</span>
<span class="na">flex</span><span class="p">=</span><span class="s">"1"</span>
<span class="na">mt</span><span class="p">=</span><span class="s">"2"</span>
<span class="na">ml</span><span class="p">=</span><span class="s">"3"</span>
<span class="na">type</span><span class="p">=</span><span class="s">"number"</span>
<span class="na">placeholder</span><span class="p">=</span><span class="s">"--"</span>
<span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">score</span><span class="si">}</span>
<span class="na">required</span><span class="p">=</span><span class="si">{</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">disabled</span><span class="p">=</span><span class="si">{</span><span class="o">!</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">onChange</span><span class="p">=</span><span class="si">{</span><span class="nx">handleFormChange</span><span class="si">}</span>
<span class="p">/></span>
</code></pre></div></div>
<p><em>An unfortunate example from our codebase</em></p>
<p>This was our biggest gripe with Styled System because it was difficult having to deal with styling and logic on the same level. When working with Atomic CSS, all the styles are at least confined under a single <code class="language-plaintext highlighter-rouge">className</code> prop, so the problem isn’t as pronounced there.</p>
<p>To address this issue, we thought at first about defining all the styles in separate objects at the top of each file, then spreading them onto each component, like so:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">scoreInputStyles</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">flex</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="na">mt</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="na">ml</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
<span class="p">};</span>
<span class="cm">/**
* Somewhere further down the file
* ...
* ...
*/</span>
<span class="p"><</span><span class="nc">Input</span>
<span class="si">{</span><span class="p">...</span><span class="nx">scoreInputStyles</span><span class="si">}</span>
<span class="na">type</span><span class="p">=</span><span class="s">"number"</span>
<span class="na">placeholder</span><span class="p">=</span><span class="s">"--"</span>
<span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">score</span><span class="si">}</span>
<span class="na">required</span><span class="p">=</span><span class="si">{</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">disabled</span><span class="p">=</span><span class="si">{</span><span class="o">!</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">onChange</span><span class="p">=</span><span class="si">{</span><span class="nx">handleFormChange</span><span class="si">}</span>
<span class="p">/>;</span>
</code></pre></div></div>
<p>But that would eliminate the advantages we talked about earlier! We’re having to move up and down the same file just to define styles. But most of all, who wants to go back to naming things again?!</p>
<p>I then realized that we could just skip the initial declaration by defining the object inline, then spread it directly onto our components like so:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Input</span>
<span class="si">{</span><span class="p">...{</span> <span class="nl">flex</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="nx">mt</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="nx">ml</span><span class="p">:</span> <span class="mi">3</span> <span class="p">}</span><span class="si">}</span>
<span class="na">type</span><span class="p">=</span><span class="s">"number"</span>
<span class="na">placeholder</span><span class="p">=</span><span class="s">"--"</span>
<span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">score</span><span class="si">}</span>
<span class="na">required</span><span class="p">=</span><span class="si">{</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">disabled</span><span class="p">=</span><span class="si">{</span><span class="o">!</span><span class="nx">hasCorrespondingCriteria</span><span class="si">}</span>
<span class="na">onChange</span><span class="p">=</span><span class="si">{</span><span class="nx">handleFormChange</span><span class="si">}</span>
<span class="p">/></span>
</code></pre></div></div>
<p><em>Using object notation for your style props</em></p>
<p>Great! Now the style props can appear <em>visually distinct</em> from the rest of the props. This will make it much easier to parse through component files when wanting to focus on programming just business logic.</p>
<h1 id="our-bigger-issue">Our bigger issue</h1>
<p>That wasn’t the end of it though. We also had problems with style props not always working when applied to certain components. Ironically, this was something that occurred <em>by design</em> because Styled System actually recommends <a href="https://styled-system.com/guides/build-a-box#style-props">designing your base components to limit the style props they will accept</a>.</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">Text</span> <span class="o">=</span> <span class="nx">styled</span><span class="p">.</span><span class="nx">span</span><span class="p">(</span>
<span class="p">({</span> <span class="nx">theme</span> <span class="p">})</span> <span class="o">=></span> <span class="nx">css</span><span class="s2">`
color: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">colors</span><span class="p">.</span><span class="nx">text</span><span class="p">}</span><span class="s2">;
font-size: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">fontSizes</span><span class="p">.</span><span class="nx">body</span><span class="p">}</span><span class="s2">px;
font-family: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">fonts</span><span class="p">.</span><span class="nx">main</span><span class="p">}</span><span class="s2">;
line-height: </span><span class="p">${</span><span class="nx">theme</span><span class="p">.</span><span class="nx">lineHeights</span><span class="p">.</span><span class="nx">main</span><span class="p">}</span><span class="s2">;
`</span><span class="p">,</span>
<span class="nx">color</span><span class="p">,</span>
<span class="nx">space</span><span class="p">,</span>
<span class="nx">typography</span>
<span class="p">);</span>
</code></pre></div></div>
<p><em>The initial <code class="language-plaintext highlighter-rouge"><Text></code> component declaration in our app</em></p>
<p>The arguments passed in at the end (<code class="language-plaintext highlighter-rouge">color</code>, <code class="language-plaintext highlighter-rouge">space</code>, and <code class="language-plaintext highlighter-rouge">typography</code>) are what are called <a href="https://styled-system.com/table">style prop functions</a>. They dictate the style props that your components will respond to. Each “allows the passage” of their own group of CSS properties. Something like <code class="language-plaintext highlighter-rouge">border="5px solid black"</code> therefore, won’t work when applied to our <code class="language-plaintext highlighter-rouge"><Text></code> component because that would require the <code class="language-plaintext highlighter-rouge">border</code> style prop function. But we <em>can</em> apply <code class="language-plaintext highlighter-rouge">color</code>, <code class="language-plaintext highlighter-rouge">padding</code>, <code class="language-plaintext highlighter-rouge">margin</code>, and type styles like <code class="language-plaintext highlighter-rouge">fontWeight</code> and others.</p>
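<p>To make that concrete, here is a hypothetical sketch of how we could have extended the list ourselves if <code class="language-plaintext highlighter-rouge"><Text></code> really did need border styles (base styles omitted for brevity):</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical: same pattern as our <Text>, with `border` added to the
// list of style prop functions so border="5px solid black" gets applied.
import styled from "styled-components";
import { border, color, space, typography } from "styled-system";

const BorderableText = styled.span(color, space, typography, border);

// <BorderableText border="5px solid black">Now it works</BorderableText>
</code></pre></div></div>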
<p>The intent is to prevent components from deviating from their intended design—which is a reasonable argument—but it slowed our team down more than anything! Styles sometimes didn’t <em>just work</em>. And this happened often enough that after about the <em>nth</em> time or so, I realized that the whole thing is more trouble than it’s worth. We wouldn’t be applying these styles if they didn’t need to be there one way or another!</p>
<p>To get around this problem, <a href="https://styled-system.com/guides/build-a-box#extending">the documentation suggests two possible solutions</a>—neither without their quirks. The first is to extend your components via the <code class="language-plaintext highlighter-rouge">styled</code> function of <a href="https://styled-components.com/">the <code class="language-plaintext highlighter-rouge">styled-components</code> library</a>, then apply any additional styling through there, but this created the same issues as with defining objects like we did earlier.</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nx">styled</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">styled-components</span><span class="dl">"</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">CustomButton</span> <span class="o">=</span> <span class="nx">styled</span><span class="p">(</span><span class="nx">Button</span><span class="p">)</span><span class="s2">`
background-color: transparent;
float: right;
`</span><span class="p">;</span>
<span class="cm">/**
* Somewhere further down the file
* ...
* ...
*/</span>
<span class="p"><</span><span class="nc">CustomButton</span><span class="p">></span>Download<span class="p"></</span><span class="nc">CustomButton</span><span class="p">>;</span>
</code></pre></div></div>
<p><em>Scroll, scroll, scroll, scroll</em></p>
<p>Alternatively, <code class="language-plaintext highlighter-rouge">styled-components</code> also provides a <code class="language-plaintext highlighter-rouge">css</code> prop that will allow you to inline styles on any CSS property of your choosing, but it creates a messy API for our components because it leaves half your styles inside the <code class="language-plaintext highlighter-rouge">css</code> prop and half outside. How can we tell when to use which? Talk about confusing!</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Text</span>
<span class="si">{</span><span class="p">...{</span> <span class="nl">textAlign</span><span class="p">:</span> <span class="dl">"</span><span class="s2">center</span><span class="dl">"</span><span class="p">,</span> <span class="nx">fontSize</span><span class="p">:</span> <span class="mi">2</span> <span class="p">}</span><span class="si">}</span>
<span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="p">{</span> <span class="na">flexGrow</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="na">justifySelf</span><span class="p">:</span> <span class="dl">"</span><span class="s2">flex-end</span><span class="dl">"</span> <span class="p">}</span><span class="si">}</span>
<span class="p">/></span>
</code></pre></div></div>
<p>The bigger issue here though is that theme values no longer work inside the <code class="language-plaintext highlighter-rouge">css</code> prop, which basically brings us down to the level of writing inline styles—yikes! Fortunately, Styled System has <a href="https://styled-system.com/css">an external <code class="language-plaintext highlighter-rouge">css</code> function helper package</a>, which addresses just that issue. It opens us up to the core functionality of Styled System without the arbitrary constraints.</p>
<p>Now, we can have the benefit of applying styles to any property (through the <code class="language-plaintext highlighter-rouge">css</code> <em>prop</em>) with the ability to use theme values at the same time (via the <code class="language-plaintext highlighter-rouge">css</code> <em>function</em>)!</p>
<h1 id="getting-there">Getting there…</h1>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nx">css</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">@styled-system/css</span><span class="dl">"</span><span class="p">;</span>
<span class="p"><</span><span class="nc">Text</span> <span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="nx">css</span><span class="p">({</span> <span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span> <span class="p">})</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="dl">"</span><span class="s2">I'm color #1e3f6b!</span><span class="dl">"</span><span class="si">}</span><span class="p"></</span><span class="nc">Text</span><span class="p">>;</span>
</code></pre></div></div>
<p><em><code class="language-plaintext highlighter-rouge">css</code> prop + <code class="language-plaintext highlighter-rouge">css</code> function = ✨</em></p>
<p>From our experience, the best way to go is to pair the <code class="language-plaintext highlighter-rouge">styled-components</code> <code class="language-plaintext highlighter-rouge">css</code> <em>prop</em> with the <code class="language-plaintext highlighter-rouge">styled-system</code> <code class="language-plaintext highlighter-rouge">css</code> <em>function</em> and just leave style props by the wayside. Not only do we get themed CSS by styling our components this way, but—going back to our first issue with style props—because everything is confined to a single prop, styling and logic can still remain separate.</p>
<p>The syntax feels a bit redundant right now, but we can fix that by abstracting the <code class="language-plaintext highlighter-rouge">css</code> <em>function</em> inside of our component declarations. Therefore, instead of defining your components the way we did earlier, write them like this instead:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">Text</span> <span class="o">=</span> <span class="p">({</span> <span class="na">css</span><span class="p">:</span> <span class="nx">contextStyles</span><span class="p">,</span> <span class="nx">children</span><span class="p">,</span> <span class="p">...</span><span class="nx">props</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nt">span</span>
<span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="nx">css</span><span class="p">({</span>
<span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">text</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontSize</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span><span class="p">,</span>
<span class="na">fontFamily</span><span class="p">:</span> <span class="dl">"</span><span class="s2">main</span><span class="dl">"</span><span class="p">,</span>
<span class="na">lineHeight</span><span class="p">:</span> <span class="dl">"</span><span class="s2">main</span><span class="dl">"</span><span class="p">,</span>
<span class="p">...</span><span class="nx">contextStyles</span><span class="p">,</span>
<span class="p">})</span><span class="si">}</span>
<span class="si">{</span><span class="p">...</span><span class="nx">props</span><span class="si">}</span>
<span class="p">></span>
<span class="si">{</span><span class="nx">children</span><span class="si">}</span>
<span class="p"></</span><span class="nt">span</span><span class="p">></span>
<span class="p">);</span>
</code></pre></div></div>
<p><em>First, pass in the default styles, then layer any of the provided styles on top through the <code class="language-plaintext highlighter-rouge">css</code> function</em></p>
<h1 id="the-holy-grail">The Holy Grail?</h1>
<p>Did you catch all that? Now, we can write our styles like this:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p"><</span><span class="nc">Text</span> <span class="na">css</span><span class="p">=</span><span class="si">{</span><span class="p">{</span> <span class="na">color</span><span class="p">:</span> <span class="dl">"</span><span class="s2">body</span><span class="dl">"</span> <span class="p">}</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="dl">"</span><span class="s2">I'm color #1e3f6b!</span><span class="dl">"</span><span class="si">}</span><span class="p"></</span><span class="nc">Text</span><span class="p">></span>
</code></pre></div></div>
<p>At this point, we might not even need the main <code class="language-plaintext highlighter-rouge">styled-system</code> package and could get away with just <code class="language-plaintext highlighter-rouge">@styled-system/css</code>. We’d still need several of the utilities from <code class="language-plaintext highlighter-rouge">styled-components</code> (like the <code class="language-plaintext highlighter-rouge">css</code> prop), but consider it a win to be able to drop the main dependency altogether and rely on just Styled System’s core functionality! (And if you’re a bit more advanced and are wondering, <em>yes</em>, this does still allow us to use Styled System’s <a href="https://styled-system.com/responsive-styles">array props for responsive styles</a>.)</p>
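<p>As a quick illustration of those responsive styles under this setup: array values are resolved against the theme’s <code class="language-plaintext highlighter-rouge">breakpoints</code> scale, mobile-first, with the first entry applying everywhere and each subsequent entry kicking in at the next breakpoint. A small sketch, assuming an array-based <code class="language-plaintext highlighter-rouge">fontSizes</code> scale like the one shown earlier:</p>
<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// fontSizes[1] by default, fontSizes[2] from the first breakpoint up,
// and fontSizes[3] from the second breakpoint up
<Text css={{ fontSize: [1, 2, 3], textAlign: ["center", "left"] }}>
  I scale up as the viewport grows!
</Text>
</code></pre></div></div>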
<p>Unfortunately for our project, I only figured all this out <em>after</em> we had shipped, but if we were to go through it all again, I would have done it this way 100%. This set-up, while still preserving the core of Styled System, would also have saved us our biggest gripes with it. No mixing of style and logic props. No more arbitrary style prop constraints.</p>
<p>Just simple, isolated, and reliable styling.</p>
Mon, 06 Apr 2020 08:38:00 +0000
https://devs.quipper.com/2020/04/06/styled-system-in-practice.html
https://devs.quipper.com/2020/04/06/styled-system-in-practice.htmlLife as a Vim User at Quipper<p>Life as a Vim user is not an easy life, but it’s not a tough one either. It’s an exciting life to be a Vim user. Vim itself is a unique text editor: natively, it is a terminal-based text editor with little to no graphical user interface. Vim uses the keyboard as its main user interface, which is quite different from typical modern text editors. There are so many things to learn about Vim.</p>
<h2 id="about-me">About Me</h2>
<p>I’m a Software Engineer. I write code mostly in Ruby and Go, and recently I’ve been trying TypeScript. I consider myself a casual Vim user; I’m by no means an expert in Vim. Experience-wise I’m new: I’ve been coding in Vim only for the last 2 years.</p>
<p>Even though I’ve been using Vim for 2 years, I barely know Vim, especially its native keystrokes. It’s only in the past 3-4 months that I’ve started to learn Vim’s native keystrokes by using vanilla Vim. I’ve learned a lot of keystrokes, and it’s not easy. Vim’s <code class="language-plaintext highlighter-rouge">:help</code> is very handy and has helped me so much.</p>
<h2 id="vim-at-quipper">Vim at Quipper</h2>
<p>Fortunately, Quipper has a solid Vim community. More than 20 percent of Quipper’s software engineers use Vim at the moment, and we have several regular activities. We have <code class="language-plaintext highlighter-rouge">quipper.vim</code>, which is a sharing session; sometimes we do pair programming in Vim; and the most important thing is that it’s a nice and welcoming community! Together, we help each other do the best we can at Vim.</p>
<p>For me, sometimes <code class="language-plaintext highlighter-rouge">:help</code> is not enough, for example when I want to know about other people’s daily Vim usage, or which plugins to use together to achieve a better flow. I’m very glad that I joined the Vim community at Quipper. There is a regular Vim discussion and sharing session called <code class="language-plaintext highlighter-rouge">quipper.vim</code>, where the main agenda is to share one’s daily Vim operations. I’ve gotten tons of new insights and plugin recommendations from it. It’s a great session!</p>
<p>The other thing is the people. The Quipper Vim community is a nice and welcoming community. People are open to discussion, new ideas, even simple questions. I never felt intimidated asking <em>noobish</em> questions there. When I first joined, I was even encouraged to be more active in the wider Vim community and to attend VimConf, even though I was a new joiner. It’s an amazing community.</p>
<p>Last but not least is the chance to contribute back to the Vim community and learn from it. Quipper has been sponsoring VimConf regularly since VimConf 2018. Not only has Quipper sponsored VimConf, but Quipper also sponsored me to attend and speak at VimConf! I’ve always wanted to speak at a conference to contribute and share what I’ve learnt. Thanks to Quipper, I gave my first conference talk and also learned a lot. Quipper covered all my travel and accommodation from Jakarta to Tokyo as part of its international conference package. It’s nice to know that the company I work for cares about my growth!</p>
<p>I’m very grateful that I joined Quipper and became a member of the Quipper Vim community. It’s a blessing for me: I can learn many things and meet so many nice people!</p>
Tue, 19 Nov 2019 00:00:00 +0000
https://devs.quipper.com/2019/11/19/life-as-a-vim-user-at-quipper.html
https://devs.quipper.com/2019/11/19/life-as-a-vim-user-at-quipper.htmlSRE Operation Trails<h2 id="intro">Intro</h2>
<p>Hello! This is <a href="https://github.com/rbmrclo">@rbmrclo</a> from Site Reliability Engineering team.
Today, let me share about <strong>“Operation Trails”</strong> (a term we use in our team) which is an important part of our workflow when performing tasks that involve manual operation.</p>
<h2 id="background">Background</h2>
<p>In the SRE team, we have a 50/50 rule for how we manage our time every day.</p>
<p>To summarise, half of our day usually goes to <strong>proactive</strong> tasks which are generally the main projects that contribute to our growth as a diverse tech team (we usually have a roadmap for this).
The rest of our time is spent on <strong>reactive</strong> tasks which are essential to maintain the stability and reliability of our services, as well as to keep the development speed stable across each team.</p>
<p>It can be visualised in blocks like this:</p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/proactive-reactive-sre.png" alt="SRE tasks in Quipper" /></p>
<p>In this article, I will be focusing on our <strong>reactive tasks</strong> and explain in detail how we manage to work seamlessly within our team and avoid <em>mottainai</em> (I’ll be explaining this later).</p>
<h2 id="daily-situation">Daily Situation</h2>
<p>As a global company, each SRE member attends to the needs of multiple teams in different timezones. This also means that each member is working at their own pace.</p>
<p>Some members might be working on a normal routine today with their proactive tasks; some will be performing a maintenance task tonight (midnight!); and some might already be attending to a service outage incident while I’m writing this blog post!</p>
<p>Let’s illustrate that again with my favorite blocks.</p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/sre-isolated.png" alt="" /></p>
<p>My point here is that most of the time, each of us is working in an isolated manner. However, there’s one exception and this is when <strong>Operation Trails</strong> comes in.</p>
<h2 id="operation-trails-for-reactive-tasks">Operation Trails for Reactive Tasks</h2>
<p>Imagine that you are working on a task, with your headphones on, enjoying your favorite bubble milk tea, listening to a Queen playlist, in the zone and not to be disturbed by humans.</p>
<p>Suddenly, an alert is triggered for a specific monitor. Say the staging cluster died; hence, no developers can connect to the staging servers to test their newly implemented features - a major blocker!</p>
<p><strong>Call of duty.</strong> Upon receiving the alert message, you quickly checked the issue and created an <strong>Operation Trail</strong>.</p>
<ul>
<li>First, you informed the other SRE team members that you are now checking the issue.
<ul>
<li>You are now considered as the assigned person. (ownership is part of our culture!)</li>
<li>This is also when the operation trail starts.</li>
<li>All SRE members are now informed that someone is checking the issue. They are also watching the operation trail in parallel.</li>
</ul>
</li>
<li>Next, continuously post updates of what you’re currently doing. (who did what when - like audit trails!)
<ul>
<li>While posting updates, other SRE team members could either give suggestions, join the ongoing operation, or just watch the trail. (it all depends on the severity of the situation)</li>
</ul>
</li>
<li>Lastly, you inform everyone when the task is finished or when the issue has been resolved. :tada:</li>
</ul>
<h4 id="heres-the-birds-eye-view-of-what-happened">Here’s the bird’s eye view of what happened.</h4>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/proactive-response.gif" alt="Responding to alert (reactive task)" /></p>
<p><strong>:memo: Every operation is in a single thread</strong></p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/operation-trail.png" alt="" /></p>
<p><strong>:bell: Live reporting</strong></p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/actual-trail-1.png" alt="" /></p>
<p><strong>:white_check_mark: Avoid operation conflicts by using call-to-actions</strong></p>
<p><img src="/assets/article_images/2019-05-21-sre-operation-trails/cta.png" alt="" /></p>
<h2 id="summary">Summary</h2>
<h4 id="slack-threads">Slack Threads</h4>
<ul>
<li>In simple terms, operation trails are chat-based and happen real-time. We fully utilize slack threads for these.</li>
<li>An SRE member can start an operation trail and resolve it by himself/herself, or another SRE member can join the trail to speed up resolving the task at hand.</li>
</ul>
<h4 id="avoid-mottainai-もったいない">Avoid <a href="https://en.wikipedia.org/wiki/Mottainai"><strong>Mottainai</strong></a> (もったいない)</h4>
<blockquote>
<p>The term in Japanese conveys a sense of regret over waste; the exclamation “Mottainai!” can translate as “What a waste!”</p>
</blockquote>
<ul>
<li>By establishing a live reporting culture in your team, you can eliminate wasted time.
<ul>
<li>For example, when an SRE member announces that he/she is already responding to the issue, the other SRE members can just watch the trail while working on their current tasks normally. They don’t need to pause their own work, maximizing the use of their time.</li>
</ul>
</li>
<li>By actively posting updates in the operation trail, other members can provide relevant suggestions or possible solutions in order to speed up the operation.</li>
</ul>
<h4 id="being-a-team-player">Being a team-player</h4>
<ul>
<li>Operation Trails improve an individual’s communication skills, since they have to explain what’s happening and what they are doing.</li>
<li>As a spectator of the trail, you can determine whether the operation is going smoothly or a call for help is needed - evolving into a “pair operation”.</li>
<li>It also improves harmony in the team since this is one of the times when all of us in SRE team can meet and collaborate with each other, given that we have individual tasks too.</li>
</ul>
<h4 id="acknowledgements">Acknowledgements</h4>
<ul>
<li>There’s also a <a href="https://blog.kyanny.me/entry/2016/11/11/021955">blog post in japanese</a> which is the main inspiration of this post.</li>
<li>Many thanks to all SRE members for supporting and adopting this culture. (especially <a href="https://github.com/lamanotrama">@lamanotrama</a> who introduced this during his time in Quipper)</li>
</ul>
<p>Do you also have a similar live reporting culture in your team? Share it in the comments below and let’s discuss!
We are <a href="https://career.quipper.com/jp/jobs/sre/">hiring SRE members</a>. Check it out!</p>
Tue, 21 May 2019 00:00:00 +0000
https://devs.quipper.com/2019/05/21/sre-operation-trails.html
https://devs.quipper.com/2019/05/21/sre-operation-trails.html