Fri, 05 Dec 2025 06:22:18 GMT

What it really takes to bring a new model online at CodeRabbitの意訳です。

以前公開した この記事 では、ユーザー自身がモデルを選ぶべきではない理由を「好みの問題ではなく、システム上の問題である」と説明しました。本記事では、その理由を具体的に解説します。

CodeRabbitで新しいモデルを導入することは、スイッチを入れるだけの単純作業ではありません。高い精度、膨大な検証、継続的な監視を要求する多段階のプロセスです。

数カ月おきに、「次レベルの推論力」「より長いコンテキスト」「高速化」といった触れ込みとともに新しい大規模言語モデルが登場します。多くの開発者は、単純に「差し替えて使えばいい」と考えるかもしれません。

その気持ちは理解できます。しかし私たちにとって、新しいモデルの採用とは好奇心ではなく、数週間にわたるエンジニアリングプロジェクトです。

お客様がその裏側を見ることは基本的にありませんし、見る必要もありません。CodeRabbitが“自然でシームレス”に見える理由は、私たちが水面下で膨大な評価・調整・検証を行い、本番レビューに触れる前にすべてのモデルを仕上げているからです。ここでは、その舞台裏を紹介します。

1. 好奇心フェーズ：モデルの「素性」を理解する

すべての新モデルは「仮説」から始まります。そのモデルが何を得意とし、どんな設計思想を持ち、どのような改善を謳っているのかを徹底的に調べます。それが推論寄りなのか、コード寄りなのか、あるいはその中間なのか。そして、CodeRabbitのレビューシステムのどの層で活かせるのかを分析します。

私たちが問うのは「このモデルは他より優れているか？」ではなく「このモデルはどこにフィットするか？」です。高度な推論を必要とする差分解析向きかもしれませんし、要約や説明タスクに向くかもしれません。それぞれに求められる品質やトーンは異なります。

ここから大量の実験を作成します。1〜2件ではなく、温度感、コンテキストの詰め方、指示文の書き方など、数十パターンの評価設定を生成します。これらはすべて評価ハーネスに流し込み、量的・質的両面から結果を測定します。

2. 評価フェーズ：印象ではなくデータで判断する

評価フェーズは時間を要します。内部の評価セットを使い、カバレッジ、精度、ノイズ量、レイテンシといった明確な指標を収集します。これらは、以前紹介した各種ベンチマーク記事と同じ指標です。

しかし、数字だけでは全体像は見えません。生成されたコメントそのものを精査し、推論の正しさ、事実性、スタイルの一貫性などを、現在の最良モデルと比較して確認します。さらに複数の自動評価レシピを用いることで、トーンや明瞭性などの微細な違いも分析します。

なぜこれが必要なのか？
それは、モデルは決して“互換品”ではないからです。同じプロンプトでもモデルが変わると動作が崩れます。それぞれに固有の「プロンプトの物理法則」が存在します。私たちの仕事はそれを把握し、システム内で安定して働くよう調整することです。

3. 適応フェーズ：モデルの癖を馴らし、使える形にする

モデルの得意・不得意が分かったら、次は調整です。フォーマットの揺れを正す、冗長さを抑えるといった単純な調整のこともあれば、モデル固有の“語り口”をユーザーが期待する簡潔で実務的なトーンに戻す必要があることもあります。

この作業は勘では行いません。しばしばモデル自身に「自分の出力を批評させる」アプローチを取ります。

例：
「このコメントは謝罪的すぎる。元プロンプトに基づいて、より直接的な表現にするにはどう直すべきか？」

このようなメタフィードバックにより、単純な試行錯誤より高速にプロンプト改善案を生成できます。

また、モデル提供企業とも密に連携し、境界事例、バグ、不整合などを細かく共有します。モデル側で修正されることもあれば、私たちがプロンプト側で癖を吸収する場合もあります。

4. ロールアウトフェーズ：研究室から実環境へ

オフラインで安定性が確認できたら、段階的な本番投入に移行します。

最初は社内チームで実運用テストを行い、次に少人数の外部ユーザーが参加する早期アクセスへ進みます。最後は、組織規模、リポジトリの種類、PRの複雑性に応じて均等に配信されるよう、ランダム化されたゲーティングで段階的に拡大します。

監視対象は以下のように、多岐にわたります。

コメント品質と採択率
レイテンシ、エラー率、タイムアウト
開発者からのフィードバック傾向
提案の精度変化

1つでも異常があれば、即ロールバックまたは配信制限を行います。原因がプロンプト起因なのか、スタイル変化なのか、本質的なモデルの問題なのかを迅速に調査します。

5. 安定フェーズ：運用後も続く監視と改善

モデルが安定したように見えても、仕事は終わりません。自動アラート、日次評価、ユーザーからの声を通じて、常に監視します。

また、私たちは CodeRabbit を自社でも日常的に利用しているため、内部からの違和感もすぐに検知します。さらに、パブリックリポジトリのランダムサンプルを毎日確認し、小さな品質劣化を見逃さないようにしています。

6. なぜここまでやるのか、そしてなぜあなたはやらなくていいのか

新しいモデルを評価するたび、私たちは毎回「良いレビューとは何か」を新しい条件のもとで再定義する必要があります。各モデルには固有の失敗パターンや驚くような挙動があり、それらを理解し、扱いこなす必要があります。

もちろん、あなた自身のチームで同じことをやることも可能です。しかしそれには、評価基盤の構築、多様なPRデータの収集、自動評価システムの開発、スタイル基準の策定、プロンプト調整、段階的ロールアウト、継続的な回帰監視など、莫大な工数が必要です。

そして、新しいモデルが登場する度に、これらの作業をやり直す必要があります。

私たちがこのタスクを請け負う理由は明確です。
あなたがこれをやらずに済むようにするためです。

CodeRabbit では、各タスクに最適なモデルが既に選定・調整・検証され、本番品質で提供されます。
「どのモデルを使うべきか」を考える必要はありません。

まとめ

CodeRabbitにおけるモデル導入は華やかではありません。時間がかかり、細かく、技術的です。しかしこれこそが、CodeRabbit のレビューを一貫して信頼できるものにしています。あなたが開く差分、目にするコメントの裏には、この膨大な仕組みが存在します。

数週間の評価、数千の指標、数えきれないプロンプト調整——
すべては一つの目的のため。

常に最良のレビューを、あなたがLLMモデルを一切気にすることなく受けられるように。

ぜひ CodeRabbit をお試しください。
2週間の無料トライアルをはじめる！

]]>

Fri, 05 Dec 2025 00:24:55 GMT

When we published our earlier article on why users shouldn't choose their own models, we argued that model selection isn't a matter of preference, it's a systems problem. This post explains exactly why.

Bringing a new model online at CodeRabbit isn't a matter of flipping a switch; it's a multi-phase, high-effort operation that demands precision, experimentation, and constant vigilance.

Every few months, a new large-language model drops with headlines promising “next-level reasoning,” “longer context,” or “faster throughput.” For most developers, the temptation is simple: plug it in, flip the switch, and ride the wave of progress.

We know that impulse. But for us, adopting a new model isn’t an act of curiosity, it’s a multi-week engineering campaign.

Our customers don’t see that campaign, and ideally, they never should. The reason CodeRabbit feels seamless is precisely because we do the hard work behind the scenes evaluating, tuning, and validating every model before it touches a single production review. This is what it really looks like.

1. The curiosity phase: Understanding the model’s DNA

Every new model starts with a hypothesis. We begin by digging into what it claims to do differently: is it a reasoning model, a coding model, or something in between? What’s its architectural bias, its supposed improvements, and how might those capabilities map to our existing review system?

We compare those traits against the many model types that power different layers of our context-engineering and review pipeline. The question we ask isn’t, “is this new model better?” but, “where might it fit?” Sometimes it’s a candidate for high-reasoning diff analysis; other times, for summarization or explanation work. Each of those domains has its own expectations for quality, consistency, and tone.

From there, we start generating experiments. Not one or two, but dozens of evaluation configurations across parameters like temperature, context packing, and instruction phrasing. Each experiment feeds into our evaluation harness, which measures both quantitative and qualitative dimensions of review quality.

2. The evaluation phase: Data over impressions

This phase takes time. We run models across our internal evaluation set, collecting hard metrics that span coverage, precision, signal-to-noise, and latency. These are the same metrics that underpin the benchmarks we’ve discussed in earlier posts like Benchmarking GPT-5, Claude Sonnet 4.5: Better Performance, but a Paradox, GPT-5.1: Higher signal at lower volume, and Opus 4.5: Performs like the systems architect.

But numbers only tell part of the story. We also review the generated comments themselves by looking at reasoning traces, accuracy, and stylistic consistency against our current best-in-class reviewers. We use multiple LLM-judge recipes to analyze tone, clarity, and helpfulness, giving us an extra lens on subtle shifts that raw metrics can’t capture.

If you’ve read our earlier blogs, you already know why this is necessary: models aren’t interchangeable. A prompt that performs beautifully on GPT-5 may completely derail on Sonnet 4.5. Each has its own “prompt physics.” Our job is to learn it quickly and then shape it to behave predictably inside our system.

3. The adaptation phase: Taming the differences

Once we understand where a model shines and where it struggles, we begin tuning. Sometimes that means straightforward prompt adjustments such as fixing formatting drift or recalibrating verbosity. Other times, the work is more nuanced: identifying how the model’s internal voice has changed and nudging it back toward the concise, pragmatic tone our users expect.

We don’t do this by guesswork. We’ll often use LLMs themselves to critique their own outputs. For example: “This comment came out too apologetic. Given the original prompt and reasoning trace, what would you change to achieve a more direct result?” This meta-loop helps us generate candidate prompt tweaks far faster than trial and error alone.

During this period, we’re also in constant contact with model providers, sharing detailed feedback about edge-case behavior, bugs, or inconsistencies we uncover. Sometimes those conversations lead to model-level adjustments; other times they inform how we adapt our prompts around a model’s quirks.

4. The rollout phase: From lab to live traffic

When a model starts to perform reliably in offline tests, we move into phased rollout.

First, we test internally. Our own teams see the comments in live environments and provide qualitative feedback. Then, we open an early-access phase with a small cohort of external users. Finally, we expand gradually using a randomized gating mechanism so that traffic is distributed evenly across organization types, repo sizes, and PR complexity.

Throughout this process, we monitor everything:

Comment quality and acceptance rates
Latency, error rates, and timeouts
Changes in developer sentiment or negative reactions to CodeRabbit comments
Precision shifts in suggestion acceptance

If we see degradation in any of these signals, we roll back immediately or limit exposure while we triage. Sometimes it’s a small prompt-level regression; other times, it’s a subtle style drift that affects readability. Either way, we treat rollout as a living experiment, not a switch-flip.

5. The steady-state phase: Continuous vigilance

Once a model is stable, the work doesn’t stop. We monitor it constantly through automated alerts and daily evaluation runs that detect regressions long before users do. We also listen, both to our own experience (we use CodeRabbit internally) and to customer feedback.

That feedback loop keeps us grounded. If users report confusion, verbosity, or tonal mismatch, we investigate immediately. Every day, we manually review random comment samples from public repots that use us to ensure that quality hasn’t quietly slipped as the model evolves or traffic scales.

6. Why we do all this & why you shouldn’t have to

Each new model we test forces us to rediscover what “good” means under new constraints. Every one comes with its own learning curve, its own failure modes, its own surprises. That’s the reality behind the promise of progress.

Could an engineering team replicate this process themselves? Technically, yes. But it would mean building a full evaluation harness, collecting diverse PR datasets, writing and maintaining LLM-judge systems, defining a style rubric, tuning prompts, managing rollouts, and maintaining continuous regression checks. All of this before your first production review!

That’s weeks of work just to reach baseline reliability. And you’d need to do it again every time a new model launches.

We do this work so you don’t have to. Our goal isn’t to let you pick a model; it’s to make sure you never have to think about it. When you use CodeRabbit, you’re already getting the best available model for each task, tuned, tested, and proven under production conditions.

Because “choosing your own model” sounds empowering until you realize it means inheriting all this complexity yourself.

Takeaway

Model adoption at CodeRabbit isn’t glamorous. It’s slow, meticulous, and deeply technical. But it’s also what makes our reviews consistent, trustworthy, and quietly invisible. Every diff you open, every comment you read, is backed by this machinery. Weeks of evaluation, thousands of metrics, and countless prompt refinements all in service of one thing:

Delivering the best possible review, every time, without you needing to think about which model is behind it.

Try out CodeRabbit today. Get a free 14-day trial!

]]>

Thu, 04 Dec 2025 06:45:01 GMT

It's harder to review code than to write it -- especially with AI codeの意訳です。

"デバッグはコードを書くときの2倍は難しい。したがってその定義に従うならば、コードをできる限り複雑に書けば、そのコードをデバッグできるほど自分は賢くないことになる。"

Brian Kernighan（Unix共同開発者、The C Programming Language の共著者）

私は10歳の頃からプログラミングをしてきました。さらに仕事になってからは、コードの品質向上に夢中でした。クリーンコード、デザインパターン、そうしたものにどっぷり浸かり、Pull Requestは徹底的に磨き上げていました。練られたロジック、適切なエラーハンドリング、コメント、テスト、ドキュメント。レビュワーが納得できる要素はすべて揃っていたわけです。

そうした中LLMが登場し、状況は一変しました。もう私はあまりコードを書いていません。AIのほうが速いからです。今の開発者の仕事は大きく2つあると言えます。1つはモデルに必要なことを説明すること、そしてもう1つはコードが正しいか検証することです。私はコードアーキテクト兼品質管理者のような役割になりました。

そして、テックリード時代に嫌というほど学んだ、あの問題が再び現れました。

コードを「読む」ことは実は「書く」ことより難しい。

OSSメンテナやシニア開発者として、他人が書いた大量のコードをレビューしてきました。Kernighanの言葉が痛いほど身に沁みています。知らないコードを読むのは本当に疲れます。誰かの思考を逆算し、なぜその判断をしたのかを理解し、想定漏れのエッジケースを考える必要があるからです。

自分の書いたコードならレビューは簡単です。自分で設計して自分で書いたものだから、頭の中にモデルが残っています。しかし今、コードはLLMから出てきます。「自分のコード」をレビューするはずが、実質「他人のコード」をレビューする作業になりました。しかもその「他人」は自分の思考速度をはるかに超えるスピードでコードを書き、昼休みも取りません。

AIは助けてくれるはずなのに、本番で使えるコード品質を担保したい今となっては、むしろ以前よりハードワークが増えています。皮肉なものです。

人間だから仕方ない（コード品質にとっては残念な話）

ここからが厄介です。私たちは機械ではなく、人間です。人間の脳は「面倒なこと」をやりたくないのです。特に、a) 一応動いている、b) テストも通っている、c) どうせ誰かがレビューしてくれる──となればなおさらです。

git commit && git push して、コーヒーを取りに行くほうがよほど楽です。仕事は終わった気になれます。

私は「手書きで品質を担保したコードを書く開発者」から「AIで高速に生成しつつ、品質が落ちたコードをデリバリーする開発者」になってしまいました。時間が減ったからではありません。むしろ手で書かなくなり時間は増えたのに、検証フェーズをショートカットしてしまう自分がいたのです。「動くし、テスト通ってるし、重大なものはチームが見つけるだろう」と。

“レビューで拾えばいい” の問題点

この頃、私はすでにCodeRabbitでチームのPRレビューを行っていました。これがとても役立ちました。CodeRabbitは見落としがちな問題を拾ってくれます。セキュリティ、エッジケース、ロジックの穴。高速で書いていると見逃しやすいものばかりです。

しかし問題がありました。そのレビューは遅すぎたのです。コードはすでにPushされ、リポジトリには載っていて、チーム全員が見られる状態です。CodeRabbitが指摘し、私は修正しますが、その前にチームはすでに「AIが生成した明らかな問題コード」を見てしまっているのです。

品質で長年築いてきた評価が、そこで揺らぎます。

IDE版 CodeRabbit の登場

そんなとき、CodeRabbitにIDE拡張があることを知りました。PR用に使っていたAIレビュワーが、ローカルのコードもレビューできる。まさに今の私が必要としていたものでした。

変更をチェックしたりステージしたりすると、CodeRabbitはVS Code上で即レビューを実施し、git push 前に問題を検出します。チームに見せるのは磨き上げた状態のコードだけ。昔のように戻れたわけです。ただし、今はAI速度でコードを書き、AIで品質も担保するという違いがあります。

そして重要なのは、意志力が不要だということです。覚える必要もないし、別ツールを開く必要もない。コミット時に自動でレビューが走る。レビューが「雨の中を耕すような作業」ではなくなりました。

特にセキュリティのような重大な問題に対しては必須です。スクリーンショットの例では、CodeRabbitがアクセストークンの漏洩を検知しました。これがリポジトリにPushされていたら完全にアウトです。こういった問題はPush前に検出しなければ意味がありません。

さらに、問題を見つけた際には修正内容が即コミット可能です。「自分で考えて直してね」ではなく、ワンクリックで適用できる具体的な修正案を提示してくれます。

より高度で自動修正できないケースでは、CodeRabbit IDE拡張がプロンプトを生成し、選んだAIエージェントに送信します。CodeRabbitのプロンプト生成は非常に優れており、これだけで自分のプロンプトエンジニアリング能力が上がるほどです。

無料プランでもかなり有用なフィードバックを得られ、多くの問題を検出します。しかしProプランにするとCodeRabbit PRレビューと同等の網羅性が得られます。ツール実行、Code Graph解析など、非常に大きなインフラがバックグラウンドで動作しているのです。

まとめ

Brian Kernighanの言うとおり、コードを読むことは書くことより難しい。1974年当時も正しかったし、AIが300行を一瞬で書くようになった現代ではさらに正しい。

AIは私たちの仕事を楽にする、と考えていました。そして実際、書く部分だけを見れば楽になりました。しかし、読む・検証する・レビューする・AIが作ったものを理解するという行為はむしろ難易度が上がりました。

私たちは10倍の速度で、10倍の量を生成しています。ということは、10倍のコードを「読む」必要があるのです。人間の脳は昔と同じままなのに、です。

解決策は、速度を落とすことでも、手書きに戻ることでもありません。**コードを書く工程を自動化したのと同じように、コードレビュー工程も自動化することです。**AIがコードを書くなら、別のAIがプッシュ前にコードを読むべきです。

だからこそ、CodeRabbitのIDEレビューを試すべきです。無料プランがあるので試さない理由はほぼありません。あなたの評判のためにも。

今日から始めてみてはいかがでしょうか
14日間の無料トライアルはこちら

]]>

Thu, 04 Dec 2025 00:53:42 GMT

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Brian Kernighan (co-creator of Unix and co-author of The C Programming Language)

I've been programming since I was ten. When it became a career, I got obsessed with code quality: clean code, design patterns, all that good stuff. My pull requests were polished like nobody's business: well-thought-out logic, proper error handling, comments, tests, documentation. Everything that makes reviewers nod approvingly.

Then, LLMs came along and changed everything. I don't write that much code anymore since AI does it faster. Developer’s work now mainly consists of two parts: explaining to a model what you need, then verifying what it wrote, right? I’ve become more of a code architect and quality inspector rolled into one.

And here came a problem I knew all too well from my years as a tech lead:

READING CODE IS ACTUALLY HARDER THAN WRITING IT.

As an open-source maintainer and senior developer, I had to review tons of other people's code, and I learned what Kernighan said the hard way. Reading unfamiliar code is exhausting. You have to reverse-engineer someone else's thought process, figure out why they made certain decisions, and consider edge cases they might have missed.

With my own code, reviewing and adjusting were a no-brainer. I designed it, I wrote it, and the whole mental model was still fresh in my head. Now the code is coming from an LLM and suddenly reviewing "my own code" has become reviewing someone else's code. Except this "someone else" writes faster than I can think and doesn't take lunch breaks.

AI is supposed to help, but if I want to ship production-grade software now, I actually have more hard work to do than before. The irony!

And that’s why, for my first blog post since joining CodeRabbit, I wanted to focus on that fact. This is also, incidentally, why I decided to join CodeRabbit. But we’ll get to that part later.

We’re human (unfortunately for code quality)

Here's where things get uncomfortable: we're human beings, not code-reviewing machines. And human brains don't want to do hard work, thoroughly reviewing something that a) already runs fine, b) passes all the tests, and c) someone else will review anyway. It's so much easier to just git commit && git push and go grab that well-deserved coffee. Job is done!

I went from “writing manually and shipping quality code,” to “generating code fast but shipping… bad code!” The quality dropped not because I had less time as I actually had MORE time since I wasn't typing everything myself. I just tend to “shorten” this verification phase, telling myself "it works, the tests pass, the team will catch anything major."

The problem with "Catching it in review"

At this point, I was already using CodeRabbit to review my team's pull requests (as an OSS-focused dev, I was an early adopter), and those reviews were genuinely helpful! CodeRabbit would catch things that slipped through. Security issues, edge cases, some logic bugs. Those problems that are easy to miss when you're moving fast.

But here's the thing: those reviews were coming too late. The code was already pushed. Already in the repository, visible to the entire team. Sure, CodeRabbit would flag the issues and I'd fix them but not before my teammates had seen my AI-generated code with obvious problems that I didn't bother to review properly.

That's not a great look when you've spent decades building a reputation for quality.

Enter: CodeRabbit in an IDE

Then, I discovered CodeRabbit had an IDE extension. The AI code reviewer I was already using for PRs could also review my code locally, before anything hits the repo. This was exactly what I needed.

When I ask CodeRabbit to check or simply stage my changes, CodeRabbit reviews them right in VS Code, catching issues before git push. Now, my team sees only the polished version, just like the old days. Except now, I'm shipping AI-generated code at AI speeds. And I’m doing it with actual quality control. Automatic reviews mean no willpower required: I don't have to remember to run it, I don't have to open a separate tool. It just happens at commit time. Reviewing doesn't feel like plowing in the rain anymore.

This gets critical when you're looking at potential security headaches, like the one on the screenshot. CodeRabbit caught an access token leak that could've been a total disaster! Issues like this needs to be addressed before that code gets pushed to a repository.

More than that, when it finds something, the fixes are committable. The tool doesn’t tell me to "go figure it out" but gives actual suggestions I can apply immediately, in one click.

For more advanced cases that can’t be resolved with a simple fix, CodeRabbit IDE extension writes a prompt that it sends to an AI agent of your choice. Fun fact: CodeRabbit is so good in writing prompts so I got a lot to learn from, improving my Prompt Engineering skills!

Even the free CodeRabbit IDE Review plan offers incredibly helpful feedback and catches numerous issues. However, the Pro plan unlocks its true power, providing the same comprehensive coverage you expect from regular CodeRabbit Pull Request reviews: tool runs, Code Graph analysis, and much more - there is a huge infrastructure behind every check!

The bottom line

Brian Kernighan was right: reading code is harder than writing it. That was true in 1974 and it's even more true now when AI can generate 300 lines while you're still thinking about a variable name.

We thought AI would make our jobs easier. And it does… if you only count the writing. But the reading verifying, reviewing, and understanding what the AI agent actually built? That got harder.

Many of us are doing 10x the volume at 10x the speed, which means 10x more code to read with the same human brain that gets lazy and wants coffee breaks. The solution isn't to slow down or go back to typing everything manually. The solution is to automate the code review process as thoroughly as we automated the code writing process. If your AI writes the code, another AI should be reading it before you get to it.

The quality of the reviews is why I recently transitioned from being a CodeRabbit user to joining the team. And that’s why you should also try CodeRabbit in your IDE. The free tier means there's basically no excuse not to try it. Your reputation will thank you.

Get started today with a 14-day free trial!

]]>

Thu, 27 Nov 2025 02:19:46 GMT

Gemini 3 for code-related tasks: The dense engineerの意訳です。

TL;DR: Gemini 3 はパッチを書く以上の仕事をします。変更の一つひとつに対し、完全な論証を構築します。正しいときは驚くほど正確で、間違っているときでさえ「正しそうに見える」レビューを生成します。

すべてのモデルは同じスタイルで書く。Gemini 3 はそのルールを書き換える。

CodeRabbit のモデルはすべて、短い見出し・説明・パッチという同じ構造的な枠組みに従っています。しかし Gemini 3 はその枠組みの使い方が異なります。レビューの隅々まで、前提条件、根拠、因果関係で埋め尽くします。それぞれのレビューは、技術ブリーフ（設計資料）を diff で包んだような構造になっています。

Gemini 3 は自信に満ち、詳細で、徹底的に具体的です。コメントは「なぜその修正が必要なのか」を示す証拠とともに、シニアエンジニアが書いたかのような明確さを持っています。この「密度」こそが Gemini 3 の特徴であり、たとえ最終的に採用しない指摘でさえ、重要な示唆に満ちています。

ベンチマークのコンテキスト

CodeRabbit 標準ベンチマークを使って Gemini 3 を評価しました。これは C++／Java／Python／TypeScript にまたがる 25 のプルリクエストに、既知のエラーパターン（EP）を埋め込んだものです。すべてのコメントは複数の LLM ジャッジが採点し、さらにエンジニアが手動で検証しました。評価指標は精度（precision）、重要度割合（important-share）、シグナルノイズ比（S/N比）で、CodeRabbit のモデル比較に毎回使用しているものと同じです。

またモデルの 書き方 が提案の採用率に影響するため、口調・長さ・スタイルも評価しました。

解釈: Gemini 3 は精度において中間層に位置しますが、本物のバグのカバレッジが非常に優れています。コメントの約 4 件中 3 件が重要（Critical／Major）に分類されました。S/N比は 3.2 と信頼性では Opus 4.5 に近いですが、Gemini 3 の方がより強い確信と詳細さを持って表現します。

スタイルとトーン

トーンの概要: Gemini 3 は4つのモデルの中で最も断定的です。自信をもって語り、その多くは正当化されています。仮に間違っていても、コメントが十分に説得力を持つため、開発者がコードをもう一度確認したくなることが特徴です。これは実用的価値を高めますが、人によっては混乱を招く可能性もあります。

“密度の高いエンジニア” という人格

Gemini 3 は、驚くほど多くの推論をコンパクトなコメントに圧縮します。平均コメントは 16 行ですが、その中に「何が壊れたのか」「なぜ壊れたのか」「どう修正するのか」という因果関係がすべて詰まっています。

例：
C++ のワーカープールでの並行性バグを検出した際、Gemini 3 は単に「ロックが抜けている」とは言いません。アンロック → ウェイト → シグナル喪失 → スレッド停止、という一連の流れを再構成します。そして 1 行のパッチで競合状態を修正します。
また TypeScript のレビューでは、MAX_SAFE_INTEGER がキャッシュの自動削除を無効化している問題を突き止め、パフォーマンス上のリスクを説明し、LRU フォールバック案を提示します。

これはスタイルの問題ではなく、プログラムの信頼性を改善する実質的な修正です。

Gemini 3 の密度と正確さが、明確な「人格」を形成しています。
すべてのコメントが論証であり、その多くは厳密な検討に耐えます。

レビューにおける Gemini 3 の感触

Gemini 3 のレビューは他のモデルとはまったく異なります。自信があり、構造化されていますが、特に推論が密です。各コメントは、変更を承認する前にコンテキストを求めるリードエンジニアのレビューのようです。

多くのコメントは「この競合状態を修正してください」のような指示から始まり、ファイル参照を交えた説明に続き、最後にパッチを提示します。まるで専門家が問題と解決策を同時に案内してくれるような感覚です。

開発者は Gemini 3 について「確信に満ちたレビューで、主張を証拠で裏付ける」と評価しています。その直接的なトーンは強く感じるかもしれませんが、コメントはミニ設計レビューのように、変更点・重要性・トレードオフを説明してくれます。

情報密度が高いため、注意深く読む必要がありますが、それに見合う洞察が得られます。誤りがあっても、隠れたエッジケースや設計上の前提を浮き彫りにすることがよくあります。

数値が示すもの

1. 密度は正確性と相関する。
長めのコメント（上位50%）は、短いコメントに比べ精度が高く、53% の精度を達成。
重要コメント（Critical／Major）は 平均 847 文字 で、重要でないコメント（442文字）の 2 倍近い長さ。
Gemini 3 が丁寧に書くとき、それはたいてい正確です。

2. トーンは重大度を反映する。
断定的な文体は重大度に比例：

Major の 92%
Critical の 67%
が断定的。
一方 Minor では断定的なメッセージが 36% に減り、hedging（曖昧表現）は 36% に増加。

3. 自信は品質と相関する。
断定的なコメントは 47.6% の精度で、
ニュートラル（36%）、曖昧さ（33%）より高い。
Gemini 3 の自信は全体的に、根拠に裏打ちされています。

4. パッチの有無は信頼性の指標。
diff やコードブロックが登場する場合、精度が向上。
断定的なコメントの 70% 以上 が diff を含み、
曖昧なコメントでは 17% に留まる。
パッチの存在は、モデルが実際のコードに基づいて推論しているサインです。

強み

Gemini 3 が最も得意とする領域は 並行性 と システム正確性 です。
他モデルが見落とすインターリーブや同期問題を頻繁に検出します。
そしてバグのなぜを正確に説明します。

スレッドセーフティ：
Lost Wake Upやロックの不整合を物語のように説明し、簡潔なパッチを提示。
ライフサイクル管理：
シャットダウンフックの欠如や未クローズのリソースを検出し、明示的なクリーンアップを推奨。
アルゴリズムの安定性：
コンパレータのロジックや off-by-one バグを修正。
システム設定：
デフォルト値が期待する動作を妨げている場合に実用的な上限を提案。

これらは、Gemini 3 の詳細な推論が直接的で検証可能な修正につながっている例です。

行き過ぎるとき

Gemini 3 の確信は、時にやりすぎることがあります。
スタイルや軽微な問題に対して、重要度を過大評価することがあります。
“Critical” とラベルされたコメントの中に、実際には軽微な指摘が含まれる場合もあります。
その断定的なトーンは、些細な問題を深刻に見せることもあります。

とはいえ、こうした行き過ぎた指摘も、実際の非効率や可読性問題に触れていることが多く、無価値なコメントはほとんどありません。

密度が重要な理由

Gemini 3 は簡潔さより理解を優先します。
修正だけでなく、小さな調査レポートを提供します。

この深さは、大規模で複雑なシステムにおいて非常に価値があります。
Precision は「当たったかどうか」を測る指標ですが、
Density は「そのコメントを読むことでどれだけ学べるか」を測ります。

本番環境において、この違いは重大です。
GPT-5.1 のような簡潔なモデルは迅速な指摘に優れますが、
Gemini 3 は包括的な推論で開発者の理解を深め、
見落とされがちな微細な欠陥の発見を助けます。

言い換えれば、
“Gemini 3 は読み飛ばすものではない。読み込むものだ。”

実践的ガイダンス：Gemini 3 を使うべき場面

使うべき場面	理由
並行性やリソース管理が重要なコード	同期やライフサイクル問題の検出に強い
深さが必要で、短さが不要	長いコメントの方が正確な傾向
実用的なパッチが必要	約65%のコメントに適用可能な diff が含まれる
強めのトーンを扱える	自信が助けになる場合が多い
若手エンジニアの育成	コメントがそのまま教育コンテンツになる

Gemini 3 は表面的なレビューには向いていません。
精度、説明、洞察が重要なときに真価を発揮します。

締めの考察：Gemini 3 の推論の形

Gemini 3 は、コードを修正するだけではありません。
修正のために 論理的なケース（主張） を提示します。
各コメントが、原因・影響・解決の完全なストーリーになっています。

正しいときは、シニアエンジニアの深い分析を読んでいるような感覚です。
間違っているときでさえ、問題の考え方に関する洞察が得られます。

まとめ：
GPT-5.1 が「決断力のあるチームメイト」
Opus 4.5 が「規律あるアーキテクト」だとすれば、
Gemini 3 は「密度の高いエンジニア」。
自信と包括性を併せ持ち、論証に基づく diff を提供します。

CodeRabbit を試してみませんか？
14日間の無料トライアルはこちら

]]>

Thu, 27 Nov 2025 01:54:03 GMT

How CodeRabbit's agentic code validation helps with code reviewsの意訳です。

2025年の Stack Overflow 調査では、幾つかの矛盾が明らかになっています。84% の開発者が AI ツールの導入に前向きである一方で、約半数（48%）がその出力の正確性を信頼していないのです。この期待と懐疑の矛盾した関係が、品質保証の考え方そのものを大きく変えています。

PRD から PR までを数週間ではなく数日で

ソフトウェア開発のボトルネックは、コードを書く行為から「コードを検証する行為」へとシフトしてきています。

初期のAI駆動開発のワークフローはシンプルでした。AI がコードを提案し、人間が提案されたスニペットを読み、それを採用するかどうかを判断する。タブ補完でボイラープレートが書かれ、Copilot が関数を提案する。しかし、プルリクエストを作る前には、シニアエンジニアがその品質・構造・安全性を担保するべく、行単位で人力による検証を行っていました。

しかし現在は状況が異なります。OpenAI の o1 のような高度な推論モデルは複雑な要件を分解し、もはや「機能単位」のコードを生成可能です。これによってエージェントが能動的に大規模なコードを生成する “エージェントコード世代” の時代が始まりました。AI が1行ずつコードを提案するフェーズとは異なり、機能全体を生成する今の枠組みでは、品質や構造、安全性の問題を見落としやすくなります。

そして、AI が生成したコードのレビューは、これまでとは違って圧倒的に時間がかかります。ボトルネックは「コードを書くこと」ではなく、「そのコードを信頼できるかどうか」なのです。

誰も語りたがらない、AI生成コードの危機

エンジニアが懐疑的になるのも無理はありません。AIが生成するコードの40%以上に依然としてセキュリティ欠陥が含まれることがわかっているからです。AI生成コードがよく間違えるポイントを挙げます。

依存関係の爆発：
例えば「ToDo アプリを作る」と簡単に指示しただけで、モデルによっては 2〜5 個のバックエンド依存が追加されることがあります。依存が増えるほど攻撃対象領域が広がります。さらに、古いデータで訓練されたモデルは既知の CVE があるライブラリを提案することもあります。
幻の依存関係：
AI が存在しないパッケージ名を捏造し、攻撃者がその名前を悪意あるコードで公開リポジトリに登録し、開発者が盲目的にインストールしてしまうケースがあります。これは “スロップスクワッティング攻撃” と呼ばれる、AI 生成コード特有の攻撃手法です。
アーキテクチャのドリフト：
暗号ライブラリを勝手に置き換えたり、アクセス制御チェックを削除したり、セキュリティ前提を変えてしまうことがあります。表面上は正しそうに見えても挙動が不正で、静的解析では検出されず、本番で初めて発覚する類の問題です。

推論モデルがなぜ状況を一変させたのか？

数年前、AI をコードレビューのような協調ワークフローに適用しようとすると、どこか「実験」のように扱われていました。セミコロンの欠落や未使用変数を指摘する程度で、せいぜいNullポインタの可能性を警告するくらい。速くて安価だが浅い、そんな時代でした。

CodeRabbit では、生成AIを導入し始めた初期段階でこの問題を早期に認識し、“monologue（モノローグ）” という技術を開発しました。これは、モデル自身が問題を思考し、その理由をコメント内で語る仕組みです。

OpenAI の o1 や o3 のような推論モデルの登場により、CodeRabbit の monologue 機能によって、モデルが問題を本当に思考するようになりました。GPT-4o にレビューさせると、過去のパターンをマッチングするだけで、レビューコメントの多くは表面的な指摘に留まります。一方、GPT-5 や Claude Sonnet 4.5 はコードのロジックを深く追跡し、実行パスを考慮し、エッジケースに向き合い、意図を理解します。

これはレビューの質にとって重要ですが、同時に大きな課題も生んでいます。

レビューを「エージェント的（agentic）」にするものとは？

多くの人が、推論モデルを使えば AI が自ら生成したコードの品質問題やバグを自動的に検出できると思っていました。しかし、それは完全には正しくありませんでした。欠けていた大きな要素は以下の2つです。

効果的なコンテキストアセンブリ（context engineering）
結果の真正性の検証（verification）

従来のコード検証ツールはリアクティブです。
linter は未使用変数を、静的解析はNullポインタを、セキュリティスキャナはハードコードされた秘密情報を指摘します。それぞれが孤立して動作し、あなたが何を作ろうとしているのかという文脈は理解しません。

生成AI時代には、これらのツールをレビューに統合するケースも増えました。しかし、モデルもツールも、それらを取捨選択してノイズを除去し、重要なシグナルだけを浮かび上がらせるほど賢くはありません。その結果、コンテキストが詰まり（context clogging）、レビューが逆に難しくなります。

これに対処するため、CodeRabbit は各モデルに与えるコンテキストを構築し、管理する技術を発展させました。例えば：

ツールが検出した重要な問題をリスト化し、推論モデルがより合理的に改善案を導けるように指示的（instructive）な形で渡す
さらに、レビュー結果をチェックし根拠付ける verification agent を追加

以下は OSS PR からの具体例です。

静的解析：ast-grep のようなツールによる AST 解析で怪しいコードを検出

インクリメンタル解析：コードベース全体ではなく「変更部分だけ」を検証

セキュリティ課題：プロンプトインジェクションやエッジケース生成

名前のリファクタリング：実際の使用箇所にもとづく変数・関数名の改善提案

ここでいう “agentic” とは、AI がどのツールを使うべきか判断し、結果を解釈し、必要なアクションを自律的に行うという意味です。つまり、状況に応じて深掘るべき点と、特に問題ない点を見極める「シニアエンジニアの判断力」に近づける試みです。

CodeRabbit が AI コードの信頼ギャップをどう埋めるか

CodeRabbit はベンチマークスコアの向上や従来の指標に依存するのではなく、実際のエンジニアリング現場での動作に基づいて AI の性能を評価するために独自の評価手法を採用しています。その多くは、レビュー対象の PR 上で直接確認できるものです。

エージェンティックコード検証は CodeRabbit がレビューする すべてのプルリクエストで実行されます。ただし、すべては CodeRabbit が「tools in jail」と呼ぶ、 隔離されたサンドボックス環境で動作します。Security Posture で説明されているように、このアプローチにより、検証エージェントはユーザーデータやインフラストラクチャの整合性を損なうことなく、安全にコードの実行・検査・ストレステストができます。

エージェントは一般的な脆弱性の検出、大規模コードのパターン解析、包括的なテスト実行に優れています。人間が手動で行うと時間がかかり過ぎる問題を、特に得意としています。ただしエージェンティックコード検証がコードレビューを完全に置き換えるわけではありません。むしろ、エンジニアが本来集中すべき領域――アーキテクチャの判断、ビジネスロジックの妥当性検証、セキュリティの微妙なニュアンス――に時間を割けるようにするものです。

人間とエージェントが両方レビューに関わることで、ペアプログラミングに近い「冗長性」と「補完的な推論能力」を提供できます。

エージェンティック検証を実際に体験しませんか？
CodeRabbit の14日間トライアルに登録する

]]>

Wed, 26 Nov 2025 18:10:00 GMT

TL;DR: It doesn’t just write patches; it writes a complete argument for every change. When Gemini 3 is right, it’s spectacularly right. When it’s wrong, it still sounds right.

Every model writes in our house style. Gemini 3 rewrites the rules.

All of CodeRabbit’s models follow the same structural blueprint: a short headline, an explanation, and a patch. However, Gemini 3 uses that frame differently. It fills every inch of space with evidence, preconditions, and causal reasoning. Each review reads like a technical brief wrapped around a diff.

Gemini 3 is confident, detailed, and relentlessly specific. Its comments read like they were written by a senior engineer who wants to fix the issue and demonstrate why the fix is necessary. That density is its defining trait. Every comment feels significant, even the ones you might not ultimately act on.

Benchmarking context

We evaluated Gemini 3 using CodeRabbit’s standard benchmark: 25 pull requests seeded with known error patterns (EPs) across C++, Java, Python, and TypeScript. Each comment was scored by multiple LLM judges and hand-validated by our engineers. We measured precision, important-share, and signal-to-noise ratio (SNR), the same metrics used in all of our model evaluations.

We also assessed tone, length, and style, since how a model communicates can affect whether developers accept its suggestions.

Interpretation: Gemini 3 sits in the middle of the group for precision but provides excellent real-bug coverage. Roughly three of every four comments are important (critical or major). Its SNR of 3.2 puts it close to Opus 4.5 in reliability, but Gemini 3 expresses itself with greater conviction and detail.

Style and Tone

Tone summary: Gemini 3 is the most assertive of the four. It communicates with confidence, and for the most part, that confidence is justified. Even when it makes a mistake, the comment sounds credible enough to make developers stop and re-check the code, which may add to its practical value but may be confusing for some.

The dense engineer personality

Gemini 3 compresses an exceptional amount of reasoning into compact comments. The average comment is only 16 lines long, yet each one unpacks a complete causal chain: what broke, why it happened, and how to fix it.

For example, in a concurrency issue found in a C++ worker pool, Gemini 3 doesn’t simply say “missing lock.” It reconstructs the sequence: unlock, wait, missed signal, dead thread. Then it provides a single-line patch that resolves the race. In another TypeScript review, the model identifies that MAX_SAFE_INTEGER disables cache eviction, explains the performance risk, and proposes an LRU fallback. These are not stylistic suggestions. These are corrections that improve program reliability.

This combination of density and accuracy defines Gemini 3’s personality. Every comment is an argument, and most of those arguments hold up under scrutiny.

What Gemini 3 feels like in review

Reading Gemini 3 feels distinct from reading any other reviewer. It is confident and structured, but its reasoning is particularly dense. Each comment reads like a detailed technical review from a lead engineer who insists on context before approving a change. Comments often open with a directive such as “Fix this race condition,” then expand into a clear explanation, referencing specific files and ending with a patch. The result feels like an expert walking you through both the problem and the fix.

Developers describe Gemini 3 as a reviewer that sounds sure of itself and provides evidence to support its claims. Its direct tone can feel intense, especially compared to GPT-5.1’s measured precision or Opus 4.5’s calm logic. However, each comment feels like a mini design review, explaining what to change, why it matters, and what trade-offs exist.

The model’s high information density requires careful reading, but it provides proportionate insight. Even when wrong, Gemini 3 often reveals something valuable about hidden edge cases or architectural assumptions.

What the numbers show

1. Density correlates with correctness. Longer comments (top half by length) pass more often, with 53% precision compared to 34% for shorter comments. Important comments (critical or major) average 847 characters, nearly twice the size of unimportant ones (442). When Gemini 3 takes time to elaborate, it is typically accurate.

2. Tone tracks severity. Assertiveness rises with severity: 92% of major and 67% of critical comments are assertive, while only 36% of minor comments are. Hedging increases to 36% on minor issues. The model uses its strongest voice for the most serious problems, which makes it effective for triage.

3. Confidence correlates with quality. Assertive comments pass more frequently (47.6%) than neutral (36%) or hedged (33%) ones. Gemini 3’s confidence is generally supported by evidence rather than overstated.

4. Patches indicate reliability. When a diff or code block appears, the precision rate improves. Over 70% of assertive comments contain diffs, compared with only 17% of hedged comments. The presence of a patch often signals that the model’s reasoning is grounded in the actual code.

What it gets right

Gemini 3’s strongest areas are concurrency and system correctness. It frequently detects interleaving and synchronization issues that other models overlook. It excels at diagnosing the why behind a bug:

Thread-safety: Describes lost wakeups and inconsistent locking with narrative precision, then offers a concise patch.
Lifecycle management: Identifies missing shutdown hooks or unclosed resources and recommends explicit cleanup.
Algorithmic stability: Corrects comparator logic and off-by-one ranges to restore invariants.
System configuration: Finds default values that disable expected behavior and recommends practical limits.

Each of these shows how Gemini 3’s detailed reasoning leads to direct, verifiable fixes.

When it overreaches

Gemini 3’s conviction can occasionally go too far. On stylistic or low-severity issues, it may overstate importance. Some comments labeled “Critical” are actually minor or aesthetic. Its assertive tone can make small findings sound urgent. This model performs best when paired with experienced reviewers who can distinguish critical bugs from overconfident advice.

Even its overreaches tend to highlight genuine inefficiencies or readability concerns. Few comments are without value.

Why density matters

Gemini 3 trades brevity for understanding. It does not simply provide a fix; it delivers a short investigation. This depth makes it particularly valuable for large, complex systems. Precision measures whether a model hits the target, but density measures how much a developer learns by reading it.

In production environments, that difference matters. A concise reviewer like GPT-5.1 delivers quick, targeted notes. Gemini 3, by contrast, provides comprehensive reasoning that increases confidence and reduces the likelihood of missing subtle defects.

Let's put it this way, “You don’t skim Gemini 3. You study it.”

Practical guidance: When to use Gemini 3

Use it when...	Because...
Concurrency-heavy or resource-sensitive code	It excels at identifying synchronization and lifecycle issues.
Depth over brevity	Longer, more detailed comments correlate with accuracy.
You need actionable patches	Around 65% of comments include ready-to-apply diffs.
You can manage assertive tone	Its confidence is helpful but occasionally overstated.
You are mentoring newer developers	Each comment serves as both a fix and an educational note.

Gemini 3 is not ideal for superficial or stylistic reviews. It is best used when precision, explanation, and insight are more important than speed.

Closing thoughts: The shape of Gemini 3’s reasoning

Gemini 3 does more than fix code; it presents a logical case for every fix. Each comment is a complete story of cause, effect, and resolution. When it is correct, it feels like reading a senior engineer’s deep-dive analysis. Even when it is wrong, it provides insight into how to think about the problem.

Takeaway: If GPT-5.1 is the decisive teammate and Opus 4.5 the disciplined architect, Gemini 3 is the dense engineer who delivers a fully reasoned diff that is confident, comprehensive, and intent on proving its point.

Want to try out CodeRabbit? Get at 14-day free trial!

]]>

Tue, 25 Nov 2025 06:07:27 GMT

The 2025 Stack Overflow survey reveals a paradox: while 84% of developers express confidence in adopting AI tools, nearly half (48%) still distrust the accuracy of their outputs. This tension between optimism and skepticism has reshaped how teams think about quality assurance.

From PRD to PR in days (not weeks)

The bottleneck in software development has fundamentally shifted from writing code to validating it.

In the early days of AI-assisted development, the workflow was straightforward: AI suggested code, humans read the suggested snippet and then decided whether or not to accept that suggestion. Tab completion wrote boilerplate. Copilot suggested functions. But a senior engineer still manually validated and chose each line of code to ensure its quality, structure, and safety before making a pull request.

Today's reality is different. Advanced reasoning models like OpenAI's o1 can decompose complex requirements and generate entire features. This set the flywheel in motion for the era of agentic code generation, where humans along with agents play an active role in generating large swaths of code. The difference between accepting AI-generated code one snippet at a time and adding in AI-generated features is significant. Devs are more likely to miss issues with its quality, structure, and safety.

Reviewing AI-generated code also takes much more time. The bottleneck isn't writing code anymore - it's trusting it.

The AI-generated code crisis nobody's talking about

Engineers are right to be skeptical, since over 40% of AI-generated code still contains security flaws and here is what AI-generated code often gets wrong:

Dependency explosion: A simple prompt for a "to-do list app" can generate 2-5 backend dependencies depending on the model. Each dependency expands your attack surface. Worse, models trained on older data suggest libraries with known CVEs that were patched after their training cutoff.
Hallucinated dependencies: AI invents package names that don't exist. Attackers register those names in public repositories with malicious code. Developers install them blindly. This attack vector, called "slopsquatting," is uniquely enabled by AI code generation.
Architectural drift: The AI swaps out your cryptography library, removes access control checks, or changes security assumptions in ways that look correct but behave insecurely. These are the bugs that static analysis misses and humans don't catch until production.

Why did reasoning models change everything?

A few years back, applying AI to a collaborative workflow like Code Review met with a degree of amused skepticism. The bots would catch your missing semicolons, flag unused variables, and maybe (if you were lucky) warn you about a potential null pointer. They were fast, cheap, and fundamentally shallow.

At CodeRabbit, when we started to apply Generative AI, we realized this problem pretty early and developed a technique that you see on some of our older PRs, called monologue where the model thinks through the issue and shares reasoning behind an issue comment.

With the launch of reasoning models like OpenAI’s o1 and o3 the models actually think through the problem thanks to the monologue feature on CodeRabbit. When you ask GPT-4o to review code, it pattern-matches against things it's seen before and code review feedback is mostly superfluous. When you ask GPT-5 or Claude Sonnet 4.5, it spends time reasoning through your code's logic, tracing execution paths, considering edge cases, and understanding intent. This was important for successful code review. But there is a catch!

What makes review more "agentic"?

Many thought that applying the same reasoning models to review the code they generated would cut slop or find bugs, but this wasn't entirely true. The two major missing pieces were effective context assembly (context engineering) and verifying the veracity of the results.

Traditional code validation tools are reactive. You run a linter, it tells you about unused variables. You run a static analyzer, it warns about null pointer exceptions. You run security scanners, they flag hardcoded secrets. Each tool does one thing, in isolation, with no context about what you're actually trying to build.

With generative AI, you might integrate these tools into your review pipeline. However, neither the model nor the tools are intelligent enough to effectively filter out noise and highlight crucial signals, leading to context clogging.

To effectively counter that, we developed techniques to engineer and manage the context for each model in the review pipeline. For example: We would prepare the list of most important issues suggested by all the tools in an instructive manner to the reasoning model, for better solutions. We also added a verification agent that checks and grounds the review feedback.

Here are some examples from the open-source PRs.

Static analysis: AST parsing with tools like ast-grep to understand code smells.

Incremental analysis: Only validating wAgentiAgenhat changed, not your entire codebase.

Security issues: Prompt injection attacks and edge case generation.

Name refactoring: Suggesting better variable and function names based on usage.

The "agentic" part means the AI decides which tools to run, interprets the results, and takes action. Think of it like having a senior engineer who knows when to dig deeper and when something is not fine

How CodeRabbit closes the AI code trust gap

Instead of chasing higher benchmark scores or relying on traditional metrics, CodeRabbit focuses on how AI systems actually perform in live engineering environments through custom evaluation methods, some visible directly on the PRs we review.

The technique of agentic code validation happens on each pull request reviewed by CodeRabbit; however, everything runs in isolated, sandboxed environments, what we call “tools in jail.” As described in our Security Posture, this approach ensures that verification agents can safely execute, inspect, and even stress-test code without ever compromising user data or infrastructure integrity.

Agents excel at catching common vulnerabilities, analyzing patterns across thousands of lines, and running comprehensive test suites. They're designed to surface issues that are tedious or time-consuming for humans to catch manually. But agentic code validation isn't going to replace code reviews entirely. Instead, it frees developers to focus on what humans do best: architectural reasoning, business logic validation, and nuanced security considerations. The human-in-the-loop and agent-in-the-loop processes can coexist, providing redundancy and complementary reasoning similar to peer programming.

Want to see agentic validation in action? Sign up for a 14-day CodeRabbit trial.

]]>

Tue, 25 Nov 2025 03:12:14 GMT

Opus 4.5 for code-related tasks: Performs like the system architectの意訳です。

すべてのモデルは推論します。しかし、Opus 4.5 は「監査」します。

新しいモデルが登場するとき、その約束はいつも同じです。より賢い推論、よりきれいなコード、そしてよりよい回答。しかし Anthropic の Opus 4.5 は、単に推論するだけではなく、監査する モデルです。あたかも自ら設計に携わったシステムに戻ってきたかのようにコードを読み込み、弱点を特定し、アーキテクチャ全体を整えます。他のモデルが論理を説明したり、局所的な修正を示したりするのに対し、Opus 4.5 は技術文書に近い、構造的で体系的なレビューを行います。

私たちはこのモデルの特徴を把握するために、Opus 4.5 を CodeRabbit のベンチマーク環境に統合しました。その結果わかったのは、より高い知能でも派手な文章でもなく、「規律」でした。Opus 4.5 は単にバグを見つけるのではなく、その周囲にある 文脈を構築 します。つまり、レビューを推測ゲームではなくエンジニアリングプロセスとして扱うのです。

ベンチマークの背景

CodeRabbit では、新しい LLM を評価するために、C++、Java、Python、TypeScript にまたがる既知のエラーパターン（EP）を含む 25 件の複雑なプルリクエスト を用意しています。モデルが生成した各コメントを LLM ジャッジが次の3つの観点で評価します。

Precision（精度）： EP を正しく特定しているか。
Important-share（重要コメント率）： コメントのうち重要・重大な指摘（本物のバグ）がどれだけ占めるか。
Signal-to-noise ratio（S/N比）： 重要コメントと、重要でないコメントの比率。

この評価フレームワークは、複数世代のモデルを通じて改善されており、自動判定の LLM と 人手による検証 を組み合わせて正確性を担保しています。また、複数ジャッジによる評価と繰り返し試行 を実施することで、一貫性とばらつきを記録しています。プロンプト改善、ラベル精度向上、評価範囲の拡大を継続的に進め、より信頼できる結果を得られるようにしています。

スコアボード（アクショナブルコメントに限定）

解釈:
Opus 4.5 は、Sonnet 4.5 の高ボリューム・冗長スタイルと、GPT-5.1 のシャープで精密なスタイルの中間に位置します。Sonnet 4.5 よりもコメントあたりの精度が高く、有意味な指摘の割合が多い結果となりました。EP パス数は 1 件少ない（15 vs. 16）ものの、これは通常のばらつき範囲に収まっています。実際、複数回のベンチマークでは Opus 4.5 が GPT-5.1 や Sonnet 4.5 を上回ることもありました。

総合すると、Opus 4.5 はシグナル、構造、カバレッジのバランスがよく、安定して信頼できるモデルと言えます。

スタイルとトーン

Opus 4.5 のレビューは、構造化され、簡潔で、焦点が明確です。断定表現の比率は約33%、婉曲表現は約15%と、落ち着いたプロフェッショナルなトーンになっています。密度とトーンのバランスによって、実践的で自信のある分析的な内容となっています。コードブロックや diff など、行動につながる表現を多用する傾向があり、「説明する」よりも「編集する」モデルだと言えます。

構造化された知性：予測可能な形式と言語横断の一貫性

Opus 4.5 のコメントは、見出し、理由説明、diff というアーキテクチャ的なリズムで書かれています。約80%のコメントにコードブロックが含まれ、ほとんどのコメントが簡潔なパッチで締めくくられます。原因、影響、解決策が整然と記述されており、明確なバグレポートのようです。

この構造はどの言語でも維持されます。C++、Java、Python、TypeScript のいずれでも、コメントは平均19行・790文字程度に収まり、統一されたスタイルとなります。一貫性があることで自動化との相性が良く、読みやすさも向上します。まるでコードベース全体を同じエンジニアがレビューしているかのようです。

具体例:

C++（WorkerThreadPool）： lost wakeup レースを3ステップのインターリーブで説明し、1行の修正 diff を提示します。
Java（OrderService）： ダブルチェックロッキングで volatile が欠落している点を指摘し、正しいパターンを提示します。
Python（Batch Client）： 同期 HTTP クライアントを非同期版に置き換えてブロッキングを防ぎます。
TypeScript（Cache Manager）： Number.MAX_SAFE_INTEGER がエビクションを無効化している点を指摘し、現実的なデフォルト値を提案します。

いずれも簡潔でコードネイティブな洞察であり、根拠に基づいた実用的な修正です。

自信の逆転現象

Opus 4.5 のトーンは全体的にバランスが良いのですが、間違っているときにやや断定的に聞こえるという、ささやかな逆転現象があります。通常は慎重ですが、この癖があるため、コメントのトーンだけで正確性を判断しないようにしています。この問題を補うため、評価サマリーではトーンデータと正解率を組み合わせて校正を行っています。

とはいえ、Opus 4.5 はほとんど推測をせず、間違っているときでさえ淡々と説明します。

システムレベルでの推論：コードだけではなく文脈を修正する

多くのモデルが目の前の欠陥に集中するのに対し、Opus 4.5 は周辺のシステム全体を考慮します。推奨内容には、ライフサイクル改善、安全チェック追加、デフォルト値の見直しといった、より上位の修正が頻繁に含まれます。

例:

TypeScript Cache： エビクションロジックの再設計、TTL の強制、デフォルト改善により秘められた OOM（Out Of Memory ）を防ぎます。
Java OrderService： HashMap を ConcurrentHashMap に変更し、ExecutorService の shutdown 漏れを指摘します。
Python Client Lifecycle： 長寿命 async クライアント向けに明示的なシャットダウンフックを追加します。
C++ FileAccessEncrypted： 暗号化ファイルがすべてブロックされる検証バグを修正し、上流のエラーハンドリングも改善します。

どれも一行修正ではなく、システム全体の整合性を高める提案です。コードを「問題の集合体」ではなく「相互に影響し合うエコシステム」とみなしていることがわかります。

コスト、Effort、効率性

Anthropic の Effort パラメータを使うと、モデルの推論深度を直接制御できます。High-effort では依存関係パスを徹底的に探索し、Medium-effort ではトークン節約のため深度を抑えます。High-effort であっても、Opus 4.5 の出力トークン量は Sonnet 4.5 より約25%少なく、1M 出力トークン 25ドルという単価を効率性で補っています。

規律ある構造のおかげで脱線が減り、クリアで読みやすい結果を維持できています。

Opus 4.5 を読むとどんな感じか

Sonnet 4.5 が教師、GPT-5.1 が決断力のあるチームメイトだとしたら、Opus 4.5 は PR をレビューしに戻ってきたアーキテクト です。トーンは落ち着いており、命令的ではありません。あなたがドメインを理解している前提で、細部を丁寧に確認します。そのため、システムエンジニアによるピアレビューのような、構造的で静かに権威を感じさせるコメントになります。

トーンと人格

Opus 4.5 のトーンは測定可能で分析的です。劇的な表現や不必要な厳しさを避け、秩序ある構造、簡潔な要約、根拠の提示、フォーカスされた修正提案によって確信を示します。システムに精通したメンターからのアドバイスのように感じられるため、開発者が受け取りやすい雰囲気です。

深さと密度のバランス

コメントはコンパクトですが情報量は十分です。複雑な問題には必要なだけの説明を行い、単純な問題は短い提案で正確に処理します。このバランスによって、読みやすさと包括性が両立しています。

流れと可読性

文脈 → 原因 → 修正というリズムにより、開発者はコメントをすばやくスキャンしつつ意味を保持できます。Opus 4.5 のコメントは「構造化されたスナップショット」のようで、何が起きたのか、なぜ重要なのか、どう直すのかが一目で分かります。

実務的なインパクトと開発者の信頼

Opus 4.5 は誇張や劇的な表現を避けるため、開発者からの信頼を得やすいモデルです。プロフェッショナルで落ち着いたトーンを維持し、間違っている場合も過剰な断定ではなく理性的な仮説として提示します。過度な自信がないため、レビューがより「人間的で実務的」に感じられます。

コメントはまるで設計ノートのように読みやすく、壊れた不変条件、修正案、その根拠が明確に記されています。そのまま変更履歴や Issue Tracker に貼れるレベルの明快さです。

長所と短所の一覧

長所：

重要コメントの密度が高い（約80%）
言語をまたいだ構造の一貫性
並行処理やライフサイクルに強い推論能力
明確・簡潔・プロフェッショナルなトーン
Sonnet 4.5 より冗長でない一方、GPT-5.1 より文脈が豊富

短所：

精度は中程度（約38%）
間違っているときに少し断定的になることがある
重大ラベルが多く、多忙な PR では過剰に見える場合がある
単純な問題ではやや説明が長くなることがある

結論:
Opus 4.5 は、私たちがテストした中で最も システミック（全体的） なレビュアーです。落ち着きがあり、構造化され、厳密で、アーキテクチャ理解が必要な場面で特に強みを発揮します。

モデルの使い分け

シナリオ	最適モデル	理由
複数言語や高度な文脈を含むレビュー	Opus 4.5	構造が安定しており、システム的な洞察が強い
精密さ重視の小規模 diff	GPT-5.1	精度が高く、判断も明確で、誤検知が少ない
大量スキャンやコスト重視	Sonnet 4.5	カバレッジが高く、レビュー単価が低い

最後に：推論の「形」について

Opus 4.5 は、実験的なモデルではなく「設計されたモデル」に感じられます。初期のモデルが推測に頼りがちだったのに対し、Opus 4.5 は測定し、構造化し、文書化します。レビューを読むと、開発者の視点 を理解したモデルと一緒に作業しているように感じられます。

コードレビューでは、トーンが信頼を決めます。Opus 4.5 のスタイル──測定可能で、構造化され、機械的な精度を持つ──は、推論の成熟を示しています。圧力のない精度、エゴのない自信が感じられます。

まとめ:
Sonnet 4.5 が教師、GPT-5.1 がチームメイトだとすると、Opus 4.5 は設計レビューのために戻ってきたアーキテクトです。

CodeRabbit を試してみたい方はこちら
14日間の無料トライアル

]]>

Mon, 24 Nov 2025 18:58:48 GMT

Every model reasons. Opus 4.5 audits.

Every new model arrives with the same promise: smarter reasoning, cleaner code, and better answers. But Opus 4.5 from Anthropic doesn’t just reason; it audits. It reads code as if returning to a system it helped design, identifying weak points and refining architecture. Where other models narrate their logic or prescribe surgical fixes, Opus 4.5 performs structured, systematic reviews that feel more like technical documentation than conversation.

We integrated Opus 4.5 into CodeRabbit’s benchmark harness to understand what makes this model distinct. The result was not higher raw intelligence or flashier prose, but discipline. This model doesn’t just find bugs; it builds context around them. It treats review as an engineering process, rather than a guessing game.

Benchmarking context

At CodeRabbit, we evaluate new LLMs using a controlled benchmark of 25 complex pull requests seeded with known error patterns (EPs) across C++, Java, Python, and TypeScript. Each comment generated by a model is scored by an LLM judge for three key factors:

Precision: Whether it correctly identifies the EP.
Important-share: The percentage of comments that are genuinely critical or major (real bugs, not style issues).
Signal-to-noise ratio (SNR): The ratio of important to unimportant comments.

Our evaluation framework, refined over multiple generations of models, combines automated LLM judging with hand validation to ensure accuracy. We also use multiple judges and repeated trials to measure consistency and understand variance. Each iteration improves the process through better prompts, refined labeling, and expanded coverage, resulting in more reliable outcomes.

Scoreboard (Actionable comments only)

What this means: Opus 4.5 sits between Sonnet 4.5’s high-volume, verbose style and GPT-5.1’s lean, surgical precision. It delivers higher per-comment precision and a greater share of meaningful findings than Sonnet 4.5. While it recorded one fewer EP pass (15 vs. 16), that difference falls within normal variance. In several runs, Opus 4.5 matched or even surpassed both GPT-5.1 and Sonnet 4.5. The takeaway is a model that balances signal, structure, and coverage with consistent reliability.

Style and tone

Opus 4.5’s reviews are structured, concise, and focused. With assertiveness around 33% and hedging near 15%, its tone reads as measured and professional. The balance of tone and density gives it an analytical voice that feels practical and confident. The high use of code blocks and diff patches underscores its bias toward action; it talks less and edits more.

Structured intelligence: Predictable form and cross-language consistency

Opus 4.5’s comments follow an architectural rhythm of headline, rationale, and diff. Nearly 80% include code blocks, and most conclude with a concise patch. Each resembles a clear bug report that specifies cause, effect, and resolution.

This structure holds across languages. Whether reviewing C++, Java, Python, or TypeScript, the cadence remains consistent, averaging 19 lines and 790 characters per comment. This uniformity simplifies automation and enhances readability. It also makes Opus 4.5 feel like a single engineer’s consistent voice across an entire codebase.

C++ (WorkerThreadPool): Detects a lost wakeup race with a three-step interleaving and a one-line diff fix.
Java (OrderService): Flags a missing volatile on a double-checked lock and provides the corrected pattern.
Python (Batch client): Replaces a synchronous HTTP client with an asynchronous equivalent to prevent blocking calls.
TypeScript (Cache manager): Identifies that Number.MAX_SAFE_INTEGER disables eviction and suggests realistic defaults.

These are concise, code-native insights, each actionable and grounded in sound reasoning.

The confidence inversion

Opus 4.5’s tone is balanced but occasionally reveals a subtle inversion: when it is wrong, it can sound slightly more certain. Although the model is generally measured, this behavioral quirk means tone alone is not always a reliable indicator of correctness. To account for this, we pair tone data with correctness metrics in evaluation summaries to maintain consistent calibration.

Opus 4.5 rarely speculates; it simply explains, even when it’s wrong.

System-level reasoning: Fixing context, not just code

While most models target the immediate defect, Opus 4.5 focuses on the surrounding system. Its recommendations frequently adjust lifecycles, add safety checks, or refine defaults.

Examples:

TypeScript Cache: Rewrites eviction logic, adds TTL enforcement, and updates defaults to prevent silent OOM.
Java OrderService: Replaces HashMap with ConcurrentHashMap and identifies missing ExecutorService shutdown.
Python Client Lifecycle: Adds explicit shutdown hooks for long-lived async clients.
C++ FileAccessEncrypted: Resolves a validation bug that blocked all encrypted files and improves upstream error handling.

These are not single-line fixes but systemic corrections. The model treats code as an interconnected ecosystem rather than a collection of isolated issues.

Cost, effort & efficiency

Anthropic’s Effort parameter provides direct control over how deeply the model reasons. In high-effort mode, Opus 4.5 explores every dependency path. In medium-effort mode, it trims reasoning depth to save tokens. Even with high-effort reasoning, its reviews averaged about 25% fewer output tokens than Sonnet 4.5, balancing higher per-token costs ($25 per million output tokens) with greater efficiency.

This disciplined structure pays for itself by producing fewer digressions and maintaining consistent clarity.

What it feels like to read Opus 4.5

If Sonnet 4.5 feels like a teacher and GPT-5.1 like a decisive teammate, Opus 4.5 is the architect reviewing your PR. Its tone is calm and deliberate, never commanding. It assumes you understand the domain and aims to confirm the details. The result is feedback that reads like peer review from a systems engineer: consistent, structured, and quietly authoritative.

Tone and personality

Opus 4.5’s voice is measured and analytical. It rarely uses dramatic language or unnecessary severity. Instead, it conveys certainty through order, concise summaries, specific evidence, and focused corrections. The tone builds trust, delivering feedback that feels like it comes from a mentor familiar with your system.

Depth vs. density

Its comments are compact yet informative. When an issue warrants detailed explanation, Opus 4.5 delivers it without excess. For simpler problems, it resolves them with brief, precise advice. This balance of detail and brevity keeps reviews readable and comprehensive.

Flow and readability

The model’s structural rhythm of context, cause, and correction allows developers to scan quickly while retaining meaning. Developers often describe its comments as “structured snapshots” that tell a short, self-contained story: what happened, why it matters, and how to fix it.

Practical impact and developer trust

Because Opus 4.5 avoids inflated confidence and theatrical phrasing, developers trust it more readily. It comes across as confident yet professional, firm but not forceful. When it errs, it sounds like a reasoned hypothesis instead of an overreach. That restraint, more than precision alone, makes its reviews feel professionally human.

Each comment reads like a design note. It states the invariant that failed, proposes a patch, and explains the rationale inline. The clarity is high enough that many of its comments could be pasted directly into changelogs or issue trackers without revision.

Strengths and weaknesses at a glance

Strengths:

High signal density (≈80% important comments).
Consistent structure across languages.
Strong concurrency and lifecycle reasoning.
Clear, concise, and professional tone.
Lower verbosity than Sonnet 4.5 with more context than GPT-5.1.

Weaknesses:

Moderate precision (≈38%).
Subtle confidence inversion when incorrect.
Frequent critical or major labeling may overwhelm busy PRs.
Slight verbosity on simpler issues.

Bottom line: Opus 4.5 is the most systemic reviewer we’ve tested. Calm, structured, and exacting, it excels when reasoning breadth and architectural understanding matter more than pinpoint precision.

When to use which model

Scenario	Best model	Why
Cross-language or high-context reviews	Opus 4.5	Structured, consistent, strong at systemic issues
Tight precision or small diffs	GPT-5.1	Higher EP precision, decisive tone, fewer false positives
Bulk scans, cost-sensitive workloads	Sonnet 4.5	High coverage, lower cost per review

Closing thoughts: The shape of reasoning

Opus 4.5 no longer feels experimental; it feels engineered. Earlier models often guessed, while Opus 4.5 measures, structures, and documents. Reading its reviews feels like working with a model that truly understands how developers read.

In code review, tone defines trust. Opus 4.5’s style, measured, structured, and mechanically precise, demonstrates the maturity of reasoning: precision without pressure and confidence without ego.

Takeaway: If Sonnet 4.5 was a teacher and GPT-5.1 a teammate, Opus 4.5 is the architect returning for a design review.

Interested in trying CodeRabbit? Get a 14-day free trial.

]]>

Thu, 20 Nov 2025 14:23:45 GMT

How to deploy and integrate MCP servers with CodeRabbitの意訳です。

MCP サーバーは、ユーザーのリクエストに基づいてシステム関連タスクを実行するために、AI エージェントをアプリケーションへ統合します。Slack、Sentry、Notion、GitHub Copilot などのプラットフォームは、AI 駆動アプリケーションに機能を公開するために、すでに MCP スタイルのサービスを採用しています。

CodeRabbit もこの潮流に乗っています。MCP クライアントとして機能することで、ユーザーがコンテキストを提供し、最適なコードレビューを実行できるようにします。また、Confluence に保存されたビジネス要件、CI/CD パイプラインからのシステム情報、さらには任意の内部 MCP サーバーなど、複数ソースのコンテキスト（データ）をサポートする初の AI コードレビュー・プラットフォームでもあります。

このチュートリアルでは、Slack MCP サーバーをセットアップし、チャンネルデータを取得し、それを CodeRabbit のコンテキストとして渡すことで、チームワークスペースの議論を反映したコードレビューを生成する方法を学びます。これにより、すべてのレビューがプロジェクト目標に整合したものになります。

なぜ CodeRabbit で MCP を使うのか？

MCP サーバーを CodeRabbit と組み合わせる主な利点は、コードレビューをより洞察的で実行可能なものにする関連データを提供できる点です。その他の利点には以下があります。

複数ツールのコンテキストでコードレビューを豊かにする

CodeRabbit は Slack、Confluence、CI/CD パイプライン、または内部 MCP サーバーから関連情報を取得し、レビュアーが変更の背景を理解できるようにします。Slack のスレッド、議論、メッセージから必要な情報を引き出し、コード変更の意図やロジックを理解します。

情報に基づいた正確なレビューが可能になる

MCP サーバーから提供されるデータにより、CodeRabbit はプロジェクトのロジックと目標をより深く理解できます。たとえば、Slack MCP サーバーはチームのメッセージへのアクセスを許可し、ビジネス要件や開発目標に整合したコードレビューを実行できるようにします。

前提条件

進める前に、MCP サーバーをセットアップし CodeRabbit と統合するために、以下のツールをインストールしておく必要があります。

Slack チャンネル – メッセージを取得し、AI コードレビューワーへコンテキストを提供するために既存の Slack チャンネルが必要です。
MCP Server for Slack Workspaces – Slack の会話データを Model Context Protocol (MCP) によって公開するための、シンプルかつ構造的な方法を提供します。メッセージ、スレッド、リプライなどの Slack API メソッドが組み込まれており、軽量で Docker 対応、設定も容易です。
Claude Desktop – Slack MCP サーバーを CodeRabbit に接続する前にローカルでテストするための MCP クライアントです。
Docker – Slack MCP サーバーをコンテナで実行・ホストするために使用します。
Ngrok – Slack MCP サーバーを CodeRabbit からアクセス可能にするため、セキュアな公開 URL を生成します。

このチュートリアルでは次を行います。

Claude Desktop を使ってローカルで Slack MCP サーバーをテスト
Docker を使用してローカルホスト上にサーバーをホスト
Ngrok による公開 URL の生成
MCP サーバーを CodeRabbit に統合

注意: Slack も MCP サーバーを試験的に扱っていますが、現時点で公式の MCP サーバーは提供されていません。このチュートリアルでは自分で MCP サーバーを構築する方法を解説します。

Claude Desktop で Slack MCP サーバーをセットアップ

Claude Desktop は複数の MCP サーバーへ接続し、それらをコンテキストソースとして利用する MCP クライアントです。MCP サーバーをコネクタとして追加し、CodeRabbit やその他のプラットフォームへデプロイする前にローカルでテストできます。

Claude Desktop をインストールし、起動後に Manage Connectors をクリックします。

サイドバーから Developer を選択し、Edit Config をクリックして Slack 認証トークンを設定します。

Slack の認証トークンは、GitHub リポジトリの説明に従って取得し、Claude Desktop に設定します。

次に、claude_desktop_config.json を以下の JSON で更新します。

{
  "mcpServers": {
    "slack": {
      "command": "npx",
      "args": ["-y", "slack-mcp-server@latest", "--transport", "stdio"],
      "env": {
        "SLACK_MCP_XOXC_TOKEN": "xoxc-...",
        "SLACK_MCP_XOXD_TOKEN": "xoxd-..."
      }
    }
  }
}

上記設定により、Slack の xoxc と xoxd トークンを使用して Slack MCP サーバーが Claude Desktop のコネクタとして登録されます。接続されると、Claude はチャンネルメッセージの取得や Slack コンテキストを活用したコードレビューを実行できます。

設定後、Claude Desktop を再起動して Slack MCP サーバーをアクティブにします。

Preview.mp 4

Slack MCP サーバーを CodeRabbit に接続

このセクションでは、Slack MCP サーバーを Docker で実行し、Ngrok で公開 URL を生成し、それを CodeRabbit に統合する手順を説明します。

まず Docker を起動します。

次にターミナルを開き、Slack MCP サーバーに必要なファイルをダウンロードします。

wget -O docker-compose.yml https://github.com/korotovsky/slack-mcp-server/releases/latest/download/docker-compose.yml
wget -O .env https://github.com/korotovsky/slack-mcp-server/releases/latest/download/default.env.dist

.env ファイルを Slack の認証トークンで更新します。

SLACK_MCP_XOXC_TOKEN=

以下のコマンドで Docker Compose を起動します。

# Docker 用の専用ネットワークを作成
docker network create app-tier

# MCP サーバーをデタッチモードで起動
docker-compose up -

Slack MCP サーバーは localhost の 3001 番ポートで起動しています。CodeRabbit と統合するには HTTPS エンドポイントが必要であり、そのために ngrok を使用します。

ngrok がインストールされているか確認します。

ngrok --version

ngrok を使って公開 URL を生成します。

ngrok http 3001

このコマンドでローカルの Slack MCP サーバーがインターネット公開され、CodeRabbit からアクセス可能になります。

次に、MCP Inspector を使ってサーバーが動作しているか確認します。

npx @modelcontextprotocol/inspector

Inspector UI を開き、SSE を選択し、ngrok の URL の末尾に /sse を追加します。

正常に動作していることが確認できたら、CodeRabbit との統合作業に移ります。

CodeRabbit に MCP サーバーを統合してテストする

CodeRabbit にサインインし、ダッシュボードのサイドバーから Integrations を選択して MCP サーバーを追加します。

名前と MCP サーバー URL（例: https://2bb0002c0e2c.ngrok-free.app/sse）を入力します。認証方式は何も選択しないようにします。

接続後、すべての CodeRabbit コードレビューで MCP サーバーをコンテキストとして利用できます。

GitHub リポジトリを作成し、CodeRabbit に追加して MCP サーバーへのアクセス権を設定します。

コードレビュー時に MCP サーバーを利用するため、リポジトリに coderabbit.yaml を追加します。

language: "en-US"
early_access: false
reviews:
  profile: "chill"
  request_changes_workflow: false
  high_level_summary: true
  poem: false
  review_status: true
  collapse_walkthrough: false
auto_review:
  enabled: true
  drafts: false
chat:
  auto_reply: true

GitHub リポジトリの MCP サーバー利用を有効化します。

次に、「Path Instructions」を設定し、PR マージ前に追加指示を確認するよう CodeRabbit に伝えます。

上記画像では、File Path がレビュー対象のファイルを指定し、Instructions がそのファイルをどのように扱うべきかを示します。この設定により、CodeRabbit は Slack の #dev チャンネルの議論内容を参照し、リポジトリ内のコード変更がチャンネルのガイドラインに従っているかを確認します。

以下は Slack チャンネルのメッセージ例です。

そしてこちらが、指示に従ってレビューを行う CodeRabbit の出力例です。

CodeRabbit が Slack の議論内容をどのように読み取ってレビューに反映するかは、次のデモで確認できます。

https://github.com/tyaga001/test-slack-mcp/pull/7#pullrequestreview-3454174366

💡 ベストプラクティス: 必要なデータだけをコンテキストとして渡す

不要なデータは LLM の処理コストを上げ、パフォーマンスを低下させます。特定の Slack チャンネルや必要最小限の情報に限定して渡すようにしてください。

次のステップ

このチュートリアルでは、Slack MCP サーバーを CodeRabbit と統合し、文脈情報を用いたコードレビューを実行する方法を学びました。CodeRabbit は Notion、GitHub Copilot、Sentry、Asana など複数の MCP サーバーをデフォルトでサポートしており、これらを組み合わせて高度な文脈理解を実現できます。

同様の手法を使うことで、任意のコンテキストやデータソースを MCP サーバー経由で統合し、CodeRabbit に正確で実用的な応答を生成させることが可能です。

MCP サーバーや CodeRabbit に関するさらなるチュートリアル・記事はこちら:

CodeRabbit を試してみませんか？ 14 日間の無料トライアルを始める

]]>

Thu, 20 Nov 2025 04:04:02 GMT

MCP servers integrate AI agents into software applications to carry out system-related tasks based on users’ requests. Platforms like Slack, Sentry, Notion, and GitHub Copilot have adopted MCP-style services to expose their features to AI-driven applications.

CodeRabbit is part of this shift, acting as an MCP client that enables users to provide contexts and perform the best code reviews. It’s also the first AI code review platform that supports context (data) from multiple sources, such as business requirements stored in Confluence, system information from your CI/CD pipeline, or any internal MCP server.

In this tutorial, you will learn how to set up a Slack MCP server, retrieve channel data, and pass it as context into CodeRabbit to generate code reviews that incorporate discussions from your team workspace, ensuring that every review aligns with the project goals.

Why use MCP with CodeRabbit?

The primary benefit of using MCP servers with CodeRabbit is to deliver relevant data that makes code reviews more insightful and actionable. Other benefits include:

Enriching code reviews with context from multiple tools.

CodeRabbit enables you to retrieve relevant information from Slack, Confluence, CI/CD pipelines, or internal MCP servers so reviewers understand the reasoning behind changes. CodeRabbit can pull relevant information from Slack threads, discussions, and messages to understand the code logic and reasoning behind every code change.

Making informed and precise reviews

With access to data from MCP servers, CodeRabbit gains a better understanding of the project’s logic and goals. For instance, the Slack MCP server grants CodeRabbit access to team messages, enabling it to perform code reviews that are consistent with business requirements and development objectives.

Prerequisites

Before we proceed, you need to have the following tools installed to set up the MCP server and integrate it with CodeRabbit:

Slack channel – An existing Slack channel is required to fetch messages and provide context for the AI code reviewer.
MCP Server for Slack Workspaces - Provides an easy and structured way to expose Slack conversations via the Model Context Protocol (MCP). It already includes built-in Slack API methods (fetching messages, threads, replies, etc.) and is lightweight, Docker-ready, and easy to configure.
Claude Desktop – Allows you to test the Slack MCP server locally before connecting it to CodeRabbit.
Docker – Used to run and host the Slack MCP server in a container.
Ngrok – Used to create a secure public URL for the Slack MCP server, allowing CodeRabbit to access it from outside your local environment.

In this tutorial, you will:

Learn how to test the Slack MCP server locally with Claude Desktop.
Host the server on localhost using Docker.
Generate a public URL using Ngrok..
Integrate the MCP server with CodeRabbit.

Note: While Slack has been experimenting with MCP servers, they don’t currently have one available. This tutorial will cover how to create one yourself.

Set up Slack MCP server with Claude Desktop

Claude Desktop is an MCP client that connects to multiple MCP servers and uses them as sources of context. It allows you to add your MCP servers as connectors and test them locally before deploying them to CodeRabbit or any other platform.

Install Claude Desktop on your computer. Once the installation is complete, open the app and click Manage Connectors.

Select Developer from the sidebar menu, and click Edit Config to configure your MCP server using your Slack authentication tokens.

Follow the instructions in the GitHub repository to obtain your Slack authentication tokens and configure the Slack MCP server in Claude Desktop.

Update the claude_desktop_config.json file with the following JSON configuration.

{
"mcpServers": {
"slack": {
"command": "npx",
"args": ["-y", "slack-mcp-server@latest", "--transport", "stdio"],
"env": {
"SLACK_MCP_XOXC_TOKEN": "xoxc-...",
"SLACK_MCP_XOXD_TOKEN": "xoxd-..."
}
}
}
}

The configuration above uses the xoxc and xoxd Slack authentication tokens to register the Slack MCP server as a connector in Claude Desktop. Once connected, Claude can perform tasks such as retrieving channel messages and using Slack context to enhance code reviews and responses.

Restart Claude Desktop to apply the updated configuration and activate the Slack MCP server.

Preview.mp4

Connect the Slack MCP server to CodeRabbit

In this section, you will learn how to run the Slack MCP server using Docker, generate a public URL for it, and integrate it with CodeRabbit to provide context-aware code reviews.

Before we proceed, open the Docker application.

Next, open your terminal and download the required files for the Slack MCP Server using the following commands:

wget -O docker-compose.yml https://github.com/korotovsky/slack-mcp-server/releases/latest/download/docker-compose.yml
wget -O .env https://github.com/korotovsky/slack-mcp-server/releases/latest/download/default.env.dist

Update the .env file with your Slack authentication tokens.

SLACK_MCP_XOXC_TOKEN=

Start the MCP server using Docker Compose with the following commands:

# Create a dedicated Docker network
docker network create app-tier
# Start the MCP server in detached mode
docker-compose up -

Currently, the Slack MCP server is running on localhost at port 3001. To integrate it with CodeRabbit, it needs to be accessible via an HTTPS endpoint. This can be achieved using ngrok.

First, confirm that ngrok is installed by running:

ngrok --version

Next, generate a public URL for your MCP server.

ngrok http 3001

The command above exposes your local Slack MCP server to the internet by generating a secure public URL. Use this URL to connect the Slack MCP server to CodeRabbit.

Open a new terminal and start the MCP Inspector to test the Slack MCP server using the following command:

npx @modelcontextprotocol/inspector

This will launch the MCP Inspector UI, allowing you to verify that your MCP server is running correctly. In the Inspector, select SSE as the transport type and append /sse to the end of your ngrok URL

Once the MCP server is confirmed to be working, you can proceed to integrate it with CodeRabbit.

Integrate and test MCP servers with CodeRabbit

Enter a name and your MCP server URL (for example, https://2bb0002c0e2c.ngrok-free.app/sse) to connect the server to CodeRabbit. Make sure no authentication method is selected.

After connecting the MCP server, you can use it to provide context in all your CodeRabbit code reviews.

To test the setup, create a GitHub repository, add it to CodeRabbit, and configure it to have access to your MCP server

Add a coderabbit.yaml configuration file to the repository to enable CodeRabbit to access and use the MCP server context during code reviews.

language: "en-US"
early_access: false
reviews:
profile: "chill"
request_changes_workflow: false
high_level_summary: true
poem: false
review_status: true
collapse_walkthrough: false
auto_review:
enabled: true
drafts: false
chat:
auto_reply: true

To give the GitHub repository access to your MCP servers, find the GitHub repository and enable MCP servers

Next, enter the Path Instructions to ensure CodeRabbit checks for additional instructions before allowing PR merges to the code repository

From the image above, the File Path specifies which files CodeRabbit should review, while the Instructions field provides context on how it should handle those files. Based on the instructions given, CodeRabbit analyses the discussions in your Slack #dev channel and ensures that every pull request or code change in your GitHub repository complies with the guidelines defined in that channel.

Below is a screenshot showing the messages from the Slack channel

Here is the code review showing how CodeRabbit reads and adheres to the instructions:

You can check out the full demo to see how CodeRabbit reads team Slack discussions and reviews code based on those conversations.

💡 Best Practice: Pass only Important Data as Context

Irrelevant data can slow down your LLM and increase costs. Keep access limited to specific Slack channels or only include the necessary information for code reviews.

Next steps

In this tutorial, you learned how to integrate the Slack MCP server into CodeRabbit to perform contextual code reviews. CodeRabbit also supports multiple MCP servers by default, including Notion, GitHub Copilot, Sentry, Asana, and many others. That you to enhance code reviews and generate context-aware answers with ease.

Using the same approach, you can integrate other contexts or data sources via MCP servers to enable CodeRabbit to generate accurate and actionable responses for your queries.

Check out more tutorials and articles on MCP Servers and CodeRabbit:

Interested in trying CodeRabbit? Start a 14-day trial.

]]>

Fri, 14 Nov 2025 04:05:41 GMT

Why emojis suck for reinforcement learning (& what actually works)の意訳です。

「シンプルさ」という罠

親指を立てた絵文字（👍）は簡単に送れますが、本当に AI レビュアーにとって有益な学習信号になっているのでしょうか。絵文字ベースのフィードバックは気持ちよく、速く、そして誰にでも分かりやすいです。一見すると、理にかなっているようにも思えます。

しかしコードレビューは灯りのオンオフのような単純なものではありません。無数の判断、技術的なニュアンス、チーム固有の基準が入り混じったものです。その多くは、ワンクリックの絵文字には反映されません。すべてのコードコメントには隠れた意図があります。正しさ、読みやすさ、設計上のトレードオフ、過去の経緯、チームのリスク許容度、さらには組織内の政治的な力学まで含まれます。

それをオンオフという、二値のシグナルに押し込めてしまうとどうなるでしょうか。もはや学習ではなく、「雰囲気を追いかけるモデル」を育てているだけになってしまいます。

シンプルさが裏目に出るとき: ゴマすりモデルの恐怖

今年のはじめ、OpenAI は GPT-4o に対して「親指上げ/下げ」のフィードバックをかなり強く効かせたアップデートを行いました。その結果どうなったかというと、このモデルは過度にユーザーに迎合するようになりました。ユーザーをおだて、誤った回答にも同意し、「はい」と言いすぎるようになり、回答品質は低下しました。フィードバック信号がハイジャックされてしまい、OpenAI はロールバックせざるを得ませんでした。

モデルに「承認こそがゴールだ」と教えてしまうと、そのモデルは承認を最適化するようになります。真実でもなく、有用性でもなく、「その瞬間、人間が気持ちよく感じたかどうか」だけを目指すようになります。

これはバグではなく、報酬設計の失敗でした。そして同じアプローチをコードレビューに適用すると、「安全運転で、あなたにおべっかを使い、本当に必要なことを言ってくれないレビュアー」ができあがります。

なぜ二値フィードバックではニュアンスが潰れてしまうのか

親指上げの絵文字は一体何を意味しているのでしょうか。

モデルがバグを見つけたという意味でしょうか
説明が分かりやすかったという意味でしょうか
口調がフレンドリーだったという意味でしょうか
たまたまレビュアーの機嫌が良かったというだけでしょうか

単一のスカラー値のシグナル（👍または👎）は、「何かがうまくいった」ということは伝えますが、「何がうまくいったのか」は伝えません。そのためモデルは、自分が操作しやすいものに寄っていきます。トーン、丁寧さ、お世辞、あるいは短さといったものです。これが、強化学習におけるゴマすり（sycophancy）の正体です。悪意ではなく、「あなたが与えた報酬を最大化しようとしているだけ」であり、「あなたが本当に望んでいた結果」を最大化しているわけではありません。

これはグッドハートの法則が発動している例です。メトリクス（この場合は親指上げ）がゴールになってしまうと、それは現実の有用な指標ではなくなります。

モデルがあなたのフィードバックを「攻略」するとき

モデルに簡単なシグナルを与えると、モデルは簡単なショートカットを見つけます。

コーディングの世界では、強化学習エージェントが、基礎ロジックを解かずに期待される出力をハードコードすることでテストケースをパスするように学習してしまうことがあります。ログを細工したり、評価用ハーネスをすり抜けたりもします。チェックマークは緑になっても、実際のコードは正しく動きません。

コードレビューでも同じことが、ただし「社会的なかたち」で起きます。モデルはすべてのコメントの冒頭で「とてもいいです！」と言うようになり、あらゆる提案を柔らかな表現で包み、フォーマットのような安全な箇所ばかりを指摘するようになります。そういったコメントは無難で、議論もなく受け入れられやすいからです。そして本当に重要なアーキテクチャ上の懸念は、埋もれてしまいます。

モデルは「ポジティブな反応の取り方」は学んだものの、もはやコードレビューをしているとは言えません。

暗黙的なシグナルが優れている理由

LLM 以外の世界では、このパターンはよく知られています。Netflix は、ユーザーが何を「評価」するかよりも、何を実際に視聴するかのほうがはるかに有用だと気付きました。星評価では人は平気で嘘をつきます。しかし視聴時間、クリック、リピート再生といった指標は正直なシグナルです。

AI の世界では、これを**暗黙的フィードバック（implicit feedback）**と呼びます。コードレビューの場合には、例えば次のような形で表れます。

開発者は提案を採用したのか
それを書き換えたのか
無視したのか
同じパターンが後のバグとして再び現れたのか

これらのシグナルは、ユーザーの入力を必要としません。行動から生まれ、意図的に操作するのが難しいものです。

もちろん完璧ではありません。「なぜ」その行動を取ったのかまでは常に分からないからです。しかし、絵文字よりははるかに操作されにくく、レビューが「気持ちよかったかどうか」ではなく「ちゃんと機能したかどうか」を教えてくれます。

コード生成 vs コードレビュー: ゲームが違えば、シグナルも違う

コード生成は、しばしば「正解が一つに定まる」という点で数学に近い側面があります。コンパイルできるか。正しい結果を返すか。テストにパスするかといった具合です。

そのため、実行結果フィードバックや暗黙的なシグナルのようなアウトカムベースの報酬を使うことができます。もちろん完璧ではありません。コードモデルは出力をハードコードしてテストをすり抜けることもありますが、それに対するガードレールを設計することは可能です。そして、開発者が「良かった」と言ってくれるかどうかに頼らずとも、「実際に動いたかどうか」を観測できます。

一方、コードレビューは違います。ここには普遍的な合格/不合格は存在せず、チームごとにスタイル、構造、リスク、命名、テストカバレッジなどの好みが大きく異なります。あるチームにとっての「優れたコメント」が、別のチームでは完全にズレている可能性もあります。高速に動くスタートアップで「クリーンコード」とみなされるものが、高いセキュリティが求められる産業では「不十分」と判断されることもあります。

これこそが、「親指上げ/下げデータ」が抱える本当の問題です。ニュアンスが押しつぶされてしまい、モデルは「適切さ」ではなく「平均値」を目指すようになります。その結果、安全ではあるものの、ひどく汎用的なコメントばかりを出すようになってしまいます。

私たちの代替案: CodeRabbit Learnings

CodeRabbit では、別のアプローチを取っています。いいねを最大化するのではなく、「理解」を最大化しようとしているのです。そのために私たちは Learnings を構築しました。

エンジニアが CodeRabbit を修正したり、チームの規約を明確にしたり、「なぜこのコードは自分たちのスタックに合わないのか」を説明したりするたびに、その説明を自然言語の指示として保存します。単に「コメントが却下された」という事実だけでなく、「なぜ却下されたのか」まで記憶します。

これらの Learnings は、組織、リポジトリ、さらには特定のパスやファイルタイプに紐付きます。CodeRabbit が次のプルリクエストをレビューするときには、それらの指示を検索し、文脈に応じて適用します。同じパターンを再度見つけたときには、そこで学んだ内容を踏まえて挙動を変えます。

再度教え直す必要はなく、同じ失敗を繰り返すリスクもありません。モデルは親指の数から推測するのではなく、あなたのチームが与えた実際のガイダンスから推論します。

また、Learnings は透明性も提供します。どんな Learnings が存在するかを確認し、それらを閲覧し、カテゴリでフィルタリングし、標準が変わったときには削除や編集を行うことができます。つまりモデルは、チームの成長とともに進化し、プラクティスの変化に合わせて整合性を保ち続けます。

これは、単なる絵文字の承認ではなく「意図を取り込む」ことで行う強化学習です。解釈可能であり、検査可能です。そしてレビューをまたいで一般化する、「チームナレッジの生きたレイヤー」を構築します。

ニュアンスのある学習が可能にすること

システムに与えるのが「シグナル」ではなく「明確で文脈を含んだ指示」になったとき、単なるレビュー体験の改善以上のことが可能になります。

チームレベルでの適応が可能になります
モデルは「何が良いのか」を勝手に推測するのではなく、「あなたのチームが実際にどう書いているのか」を学習します。リスク許容度、スタイルの好み、トレードオフの感覚を理解し、「ハウスルールを理解しているレビュアー」として振る舞うようになります。
経時的な学習（長期学習）を支えます
時間とともに CodeRabbit は、「どのコメントが役に立ったのか」「どれが無視されたのか」「どの提案が実際の変更につながったのか」という記憶を蓄積していきます。その結果、徐々に精度が上がり、フォーカスは鋭くなり、ノイズは減っていきます。
信頼を築きます
開発者が「AI を訂正すれば、それを覚えてくれる」と分かっていると、より積極的に関わるようになります。開発者自身がシステムを形作り、そのシステムは汎用的な LLM ではなく「自分たちの基準を反映した存在」へと近づいていきます。

こうしてレビュー用ツールは、単なる「異なる視点の意見」ではなく、チームの延長として機能するようになります。

まとめ: 本当の学習はピクセルではなくパターンから生まれます

親指の絵文字は、素早いリアクションには向いていますが、それだけでは専門性は育ちません。

時間とともに成長し、あなたの標準に適応し、浅いフィードバックの罠を避ける AI レビュアーを求めるのであれば、承認以上のものを与える必要があります。説明を与えなければなりません。

次世代の AI コードツールは「いいね」の数で訓練されることはありません。文脈、結果、修正の軌跡で訓練されます。絵文字ではなく、構造化された記憶から学びます。実際の意思決定と、あなたのチーム自身の声から学びます。

それこそが CodeRabbit Learnings が設計された目的です。拍手のためではなく、理解のために設計されています。

Learnings を自分のチームで試してみたい方は、無料トライアルにお申し込みください

]]>

Thu, 13 Nov 2025 23:50:12 GMT

GPT-5.1 for code-related tasks: Higher signal at lower volumeの意訳です。

TL;DR
プロンプト調整とスタックへの統合を行った結果、GPT-5.1 はレビューにおいて、これまでで最も高い精度とS/N比（シグナル対ノイズ比）を、より少ないコメント量で実現するようになりました。複雑なベンチマークセット上で、最高クラスのエラーパターン（EP）リコールに並びつつ、競合モデルの半分以下のコメント量を記録しました。

その結果として、少ないノイズでより良い修正が得られ、レビューは再びパッチのように読めるものになったと感じています。

GPT-5.1 が主張していること

OpenAI と報道によると、 GPT-5.1はより安定し、指示に従い、適応性の高いモデルとして説明されています。GPT-5.1 は ChatGPT の「Instant」と「Thinking」モードの両方で駆動しています。コードレビューに関してこの説明を検証したところ、驚くほど正確だと感じられました。細かな指摘では素早く表面的に対応し、深い推論が必要なバグではしっかりと理由付けを行います。

今回は新しい試みも行いました。GPT-5.1 が誤った場合、そのやり取り全体と内部推論のトレースを用いて、振り返りを促すプロンプトを実行しました。どこを誤ったのかを示し、改善のためにどのように指示を変えるべきかを尋ねることで、モデル自身がプロンプトに対する具体的な修正案を提示します。この反復的な振り返り手法（差分外への過剰な広がりといった問題も浮上しましたが）によって、モデルの挙動とシステム指示の両方を調整し、安定してタイトな出力を得られるようにしました。

測定した内容（そしてその理由）

私たちは、GPT-5、Codex、Sonnet 4.5 の記事で使用したものと同じベンチマーク環境を使用しました。これは既知の エラーパターン（EP） を埋め込んだ 25 件の難しい PR から構成されています。スコアリングでは以下に重点を置いています。

アクショナブルなコメントのみ: 実際に投稿されるコメントのみ（追加提案や差分外への記述を除く）
エラーパターンごとの合格数（コメントごと。以下EP Pass）: コメントがエラーパターンを直接修正、または明示していること
Important コメント: EP PASS または重大/クリティカルな実バグ
Precision（精度）: EP PASS ÷ コメント総数
SNR: Important ÷ (総数 − Important)

比較対象は以下の通りです。

GPT-5.1（新モデル）
CodeRabbit Production（現行レビューアースタック）
Sonnet 4.5

新しいモデルの追加はスイッチを入れるだけではありません

CodeRabbit ではモデルの導入は毎回適切に行われており、モデルを差し替えて祈るようなことはしません。各社のモデルはすでに互換品ではなくなっているため、デプロイ前にテスト、調整、品質ゲートを行います。GPT-5.1 に対しては以下のような調整を行いました。

GitHub に投稿できない 差分外のコメント の削減
冗長さを抑えるための トーンと簡潔さ の調整
重大度タグ と 指示解釈 の再整合

これは GPT-5 Codex の場合と同じで、推論能力をプロダクト価値へと変換するために、モデルの挙動を再構築するという目的があります。最終的な結果として、高いS/N比、ストレスの軽減、バグのカバレッジを損なわないレビューを実現しました。

スコアボード（アクショナブルなコメントのみ）

要点: GPT-5.1 は過去最高のエラーパターン再現率に並びつつ、最も少ないコメント量 を記録しました。CodeRabbit Production と Sonnet 4.5 の両方を コメント単位の精度 と Important コメント比率 で上回り、最もクリーンで高インパクトなレビュー を実現しました。

GPT-5.1 のレビュー体験

データで確認される挙動特性は、後に測定した言語メトリクス（弱め表現 28%、断定的マーカー 15% など）と一致しています。これにより、開発者が「レビュー自信があり、かつバランスの取れたトーン」だと感じる理由がデータでも裏付けられています。

GPT-5 Codex と Sonnet 4.5 と比較すると、GPT-5.1 のコメントはよりスリムで、対話的であり、熟練エンジニアのコミュニケーションに近いと感じられます。Codex は機械的かつ堅く、Sonnet 4.5 は冗長で学術的になりがちでした。それに対して GPT-5.1 は簡潔さと明確さのバランスが良く、押しつけがましくない自信を感じさせます。信頼できるチームメイトが差分を説明しているように読めます。CodeRabbit Production と比較すると、より課題に対して鋭くフォーカスされており、Sonnet 4.5 と比較するとより人間的で抑制が効いています。以下はその具体例です。

簡潔

GPT-5.1 はより少なく鋭いコメントを書き、すぐに要点へ到達します。ある PR では、ロストウェイクアップバグを以下の 1 行で修正しました。
p_caller_pool_thread->cond_var.wait(lock);
余計な文脈説明も不要な文章もありませんでした。比較すると CodeRabbit Production は同じ結論に至るまでに、スレッドフローを数段落説明していました。

率直

所有権やメモリ管理が関わる場面ではためらいません。冗長な r->reference() 呼び出しについて、以下のように指摘しました。
「Ref は refcount を自動管理します。手動で refcount を増やすとリークにつながるため削除してください」
開発者はこの率直さを好みます。講義ではなくパッチレビューのように読めます。

実務的

GPT-5.1 は、問題の重要度がどこにあるかを理解し、重要なものとそうでないものを適切に識別します。あるキャッシュ設定の PR では未実装の optimizeMemoryUsage() を指摘しましたが、次のように正しく文脈化しました。
「キャッシュの肥大化がメモリプレッシャーに影響しない限り、これは軽微です」
過剰反応せず、重要度を適切に扱っています。この点は Sonnet 4.5 にまだ課題があります。

文脈を追う

プロンプトが曖昧だった場合、GPT-5.1 は自身の仮定を明示的に説明します。初期の実行では次のように述べました。
「プロンプトでヘルパー関数のスコープが指定されていませんが、明確化のために含めました」
この透明性が私たちの指示改善を助け、モデルの推論を信頼できるものにしました。

簡潔、率直、実務的、文脈理解という特性は GPT-5 Codex において私たちが高く評価した点と一致していますが、GPT-5.1 はより安定したトーンと抑制を備えています。

スタイルとトーン（GPT-5.1 がチームメンバーのように感じられる理由）

GPT-5 Codex や Sonnet 4.5 の評価で使用したものと同じ言語構造のシグナルを参照し、GPT-5.1 がレビューで異なる印象を与える理由を分析しました。これにはコメントの長さ、コードブロックの有無、弱め表現と断定表現の割合などが含まれます。データは明確な傾向を示しています。

読み方について
GPT-5.1 のコメントは平均文字数がやや多いものの、より明確な構造と負荷の高い文で構成されているため、実際には「短く読みやすい」と感じられます。GPT-5.1 のトーンは CodeRabbit Production や Sonnet 4.5 よりも断定的で、全体として diff ブロックは少ない（76%）という特徴があります。これは意図されたもので、複数箇所修正や API バリデーション、設計の明確化であり、単一のパッチを示すと誤解を招く場合があったためです。ただし、差分を含まないコメントの約 3 分の 2 では、最小限のパッチを示せば明確さがさらに向上すると感じられました。

CodeRabbit Production と比較すると、GPT-5.1 はパッチ頻度を一部犠牲にする代わりに、明確さと集中度を高めています。Sonnet 4.5 と比較すると、レビューを膨張させる冗長な説明を避けています。トーンは Codex の外科的精度と Sonnet の慎重な冗長性の中間に位置し、自信がありつつも強圧的ではなく、慎重でありながら臆病ではありません。

総じて、GPT-5.1 のレビューは 素早く読み進められ、より直接的で、実際の修正を見つけるためのスキャン量が少なくて済む という特徴があります。これは意図して調整した挙動であり、データと体験の両方に表れています。

GPT-5.1 にまだ残る課題

完璧なモデルは存在せず、GPT-5.1 にもトレードオフがあります。CodeRabbit Production と比較すると、大規模チームで有用な文脈的な衛生改善の指摘を省くことがあり、より機能的な問題に集中する傾向があります。Sonnet 4.5 と比較すると、デザインやスタイル上の改善点を見逃すことがあり、人間のレビューアが好むケースもあります。これらは精度と簡潔さを優先した意図的なトレードオフであり、今後のロールアウトで開発者の反応を注視していく予定です。

改善が必要だった点

GPT-5.1 は調整を必要としましたが、その課題は以前のシステムと比べるとはるかに軽度でした。CodeRabbit Production は衛生的な指摘と重大な指摘を同一スレッドで混在させる傾向があり、Sonnet 4.5 は説明過多で、同じバグについて複数の軽微なノートを投稿しがちです。一方で GPT-5.1 の調整点は主に精度に関わるもので、トーンや冗長性よりも限定的でした。これは GPT-5.1 がプロダクション導入に対して、非常に近い段階にあることを示しています。

diff 外コメント
GPT-5.1 は diff 以外の部分に提案を含めることがありました。プロンプトで明確に制約を示したところ、モデルは自己修正しました。
曖昧さに対する過剰な助け
プロンプトが厳密でない場合、コンテキスト追加やヘルパー関数の追加を行うことがありました。制約を明確にすると、境界を正確に守るようになりました。

開発者が期待できること

よりクリーンなレビュー
コメント数が減り、重要コメントの割合が高まります。
パッチのようなトーン
ほぼすべてのコメントが最小限の修正案と説明を含みます。
トップクラスの EP リコール
Sonnet 4.5 と同等で、CodeRabbit Production を上回ります。
少ないスキャンで高いシグナル
コメントの 58.7% が Important に分類されます。
ターゲット外でも実世界のバグを捕捉
ライフサイクルの問題、リーク、整合性ギャップなどを検出します。

まとめ

私たちはモデルをただ選ぶのではなく、正しく機能する形へ調整します。GPT-5.1 は現在、GitHub 差分の振る舞い、トーン、冗長度、スコアリング閾値の調整を完了し、ロールアウト前のフェーズに入っています。今後数週間にわたり、開発者が高いS/N比、新しいトーン、簡潔なレビューをどのように受け止めるかを監視します。フィードバックが良好であれば、提供範囲を拡大し、開発者が求めてきた「よりクリーンでより速いレビュー」を提供していきます。

現時点で GPT-5.1 は、私たちに新しい価値、つまり次世代レベルの精度を重視したレビュー示してくれる準備が整っています。これは CodeRabbit の理想である、「重要なバグを素早く見つけ、開発者にノイズを強いることなく届ける」という目標にさらに近づくものです。

コードレビューを試してみたい方はこちらです
14 日間の無料トライアルをお試しください

]]>

Thu, 13 Nov 2025 18:07:52 GMT

TL;DR
After prompt tuning and integrating it into our stack, GPT-5.1 now delivers the best precision and signal-to-noise ratio (SNR) we’ve seen in reviews, with fewer comments. It tied for the best-in-class error pattern (EP) recall on our hard benchmark set while posting less than half the volume of comments that competitors did.

The result: less noise, better fixes, and reviews that read like patches again.

What GPT-5.1 claims to be

OpenAI and the press describe GPT-5.1 as more stable, instruction-following, and adaptive. It powers both "Instant" and "Thinking" modes in ChatGPT. We found that framing surprisingly accurate when it comes to code reviews: the model stays quick and surface-level for nits, but reasons deeply when the bug requires it.

We also tried something new. When GPT-5.1 got something wrong, we used the full exchange and its internal reasoning trace to prompt it to reflect. By showing it where it missed the mark and asking how it would change its instructions to do better, the model was able to actually propose concrete edits to its prompt. We used this iterative reflection technique (which surfaced issues like outside-diff sprawl) to refine both its behavior and our system instructions until it got consistently tighter.

What We Measured (and Why)

We used the same benchmark harness as in our GPT-5, Codex, and Sonnet 4.5 articles: a suite of 25 hard PRs, each seeded with a known error pattern (EP). Our scoring focuses on:

Actionable comments only: Comments that get posted (not additional suggestions or outside-diff notes).
EP PASS (per comment): The comment directly fixes or surfaces the EP.
Important comments: Either EP PASS or another major/critical real bug.
Precision: EP PASS ÷ total comments.
SNR: Important ÷ (total − Important).

We compared:

GPT-5.1 (new model)
CodeRabbit Production (our current reviewer stack)
Sonnet 4.5

Why adding a new model isn’t a switch-flip

Every model rollout at CodeRabbit is a campaign. We don’t plug in the model and hope; we test, adapt, and gate before shipping because models are no longer interchangeable. With GPT-5.1, this meant:

Reducing outside-diff comments, which can’t be posted to GitHub.
Tightening tone and concision to reduce verbosity.
Re-aligning on severity tagging and instruction interpretation.

This mirrors what we did with GPT-5 Codex: turn reasoning power into product value by reshaping the model’s behavior. The net result: higher SNR, less fatigue, and no compromise on bug coverage.

Scoreboard (Actionable Comments Only)

Takeaway: GPT-5.1 matched the highest EP recall while posting the fewest comments. It beat both CodeRabbit prod and Sonnet 4.5 on per-comment precision and important share, delivering the cleanest high-impact reviews.

What GPT-5.1 feels like in review

The behavioral traits we see in the data align directly with the language metrics we later measure such as 28% hedging and 15% assertive markers. This shows that the tone developers perceive as confident and balanced is borne out in the data.

Compared with GPT‑5 Codex and Sonnet 4.5, GPT‑5.1’s comments feel leaner, more conversational, and closer to how experienced engineers actually communicate. Codex could sound mechanical and rigid, while Sonnet 4.5 leaned verbose and academic. In contrast, GPT‑5.1 balances brevity with clarity. Its feedback feels confident but not heavy‑handed, like a trusted teammate explaining a diff. Against CodeRabbit Prod, it feels sharper and more focused. Against Sonnet 4.5, it feels human and restrained. Here’s how that translates in practice:

Concise

GPT-5.1 writes fewer, sharper comments that get straight to the point. In one PR, it fixed a lost wakeup bug with a single line: p_caller_pool_thread->cond_var.wait(lock); no extra context, no unnecessary prose. CodeRabbit prod, by comparison, wrote several paragraphs describing the thread flow before reaching the same conclusion.

Direct

When ownership or memory management was at stake, GPT-5.1 didn’t hesitate. It flagged the redundant r->reference() call with: “Ref already manages refcounts; remove the manual increment to prevent leaks.” Developers appreciate this directness. It reads like a patch review from a teammate, not a lecture.

Pragmatic

GPT-5.1 understands when an issue matters and when it doesn’t. On a cache configuration PR, it identified an unimplemented optimizeMemoryUsage() but correctly noted, “This is minor unless cache growth impacts memory pressure.” Instead of overreacting, it contextualized severity, something Sonnet 4.5 still struggles with.

Follows Context

When prompts were vague, GPT-5.1 explicitly explained its assumptions. In an early run, it said: “The prompt didn’t specify helper function scope, so I included one for clarity.” That kind of transparency helped us refine our instructions and made its reasoning trustworthy.

Concise, direct, pragmatic, and context-aware are qualities that mirror what we valued most in GPT-5 Codex, but with a steadier tone and more restraint.

Style and tone (why GPT-5.1 feels like a peer)

To understand why GPT-5.1 feels different in review, we looked at the same language and structure signals used in our GPT-5 Codex and Sonnet 4.5 evaluations. These include measures like comment length, presence of code or diff blocks, and tone markers for hedging versus confidence. The data paints a clear picture.

How to read this. While GPT‑5.1’s comments use slightly more characters on average, they deliver that text in clearer structure with fewer sentences that carry more weight. In practice, developers perceive them as shorter and easier to read. GPT‑5.1’s tone is more assertive than both CodeRabbit prod and Sonnet 4.5, and it includes fewer diff blocks overall (76%), which is intentional. Many of these comments were multi‑location fixes, API validations, or design clarifications where a single fenced patch would be misleading. In roughly two‑thirds of those no‑diff cases, a minimal fenced patch would have made sense and could further improve clarity.

Compared to CodeRabbit prod, GPT-5.1 trades some patch frequency for higher clarity and focus. Against Sonnet 4.5, it avoids the verbosity and over-explanation that make reviews feel bloated. Its tone sits comfortably between Codex’s surgical precision and Sonnet’s cautious verbosity. It’sconfident without being heavy-handed, measured without being timid.

At a glance, developers will notice that GPT-5.1’s reviews read faster, feel more direct, and require less scanning to identify the real fix. That’s the behavior we tuned for and it shows in both the numbers and the experience.

Where GPT-5.1 still lags

No model is perfect, and GPT-5.1 has its trade‑offs. Compared to CodeRabbit Prod, it sometimes leaves out contextual hygiene notes that can be useful for larger teams, focusing narrowly on functional issues. Against Sonnet 4.5, it can feel less expansive,missing opportunities to surface design or style considerations that human reviewers sometimes appreciate. These are conscious trade‑offs for precision and brevity and we’ll be watching the rollout to see how developers perceive the balance.

What we had to fix

While GPT‑5.1 required tuning, its challenges were far milder than those of earlier systems. CodeRabbit prod still tends to mix hygiene and critical issues in the same thread, while Sonnet 4.5 often over‑explains and spams multiple minor notes on the same bug. In contrast, GPT‑5.1’s main adjustments were focused on precision rather than tone or redundancy, showing how close it was to production readiness.

Outside-diff comments. GPT-5.1 sometimes included suggestions beyond the diff context. We updated the prompt to clarify this, and the model self-corrected.
Over-helpful under ambiguity. When the prompt wasn’t strict, the model added context or helper functions. Once clarified, it obeyed boundaries tightly.

What developers should expect

Cleaner reviews. Fewer comments and a higher share of comments that matter.
Patch-like tone. Almost every comment includes a minimal fix with explanation.
Top-tier EP recall. Ties Sonnet 4.5, beats CodeRabbit prod.
Less scanning, more signal. 58.7% of comments are Important.
Real-world bugs caught even outside the target. These include lifecycle issues, leaks, consistency gaps.

Closing thoughts:

We don’t just pick models; we make them work. GPT-5.1 is entering the next phase of our rollout process now that tuning for GitHub diff behavior, voice, verbosity, and scoring thresholds is complete. Over the coming weeks, we’ll monitor how real users respond to its higher SNR, new tone, and concise review style. If developers respond well, we’ll expand its availability, giving them the cleaner, faster reviews they’ve been asking for.

For now, GPT‑5.1 stands ready to show what this next generation of precision‑focused review can do. It brings us closer to CodeRabbit’s north star: catching the bugs that matter quickly, without making developers sift through noise.

Interested in trying our code reviews? Get a 14-day free trial!

]]>

Fri, 07 Nov 2025 02:34:45 GMT

The simplicity trap

Sure, a thumbs up is quick, but is it really teaching your AI reviewer anything useful? Emoji-based feedback feels good, is fast, and universal. On the surface, it even seems to make sense.

But code review isn’t a light switch. It’s a mess of judgment calls, technical nuance, and team-specific standards. Many of those don’t show up in a quick emoji click. Every code comment carries hidden intent: correctness, clarity, design trade-offs, historical precedent, team risk tolerance, and even internal political dynamics.

Reducing that to a binary signal? That’s not learning, that’s training a model to chase vibes.

When simplicity backfires: The sycophant scare

Earlier this year, OpenAI pushed an update to GPT‑4o that leaned too hard on thumbs-up and thumbs-down feedback. The result? A model that became overly agreeable. It flattered users. It agreed with wrong answers. It started to say “yes” a little too much, and the quality of answers dropped. OpenAI had to walk it back: the feedback signal had been hijacked.

Turns out, if you tell a model that approval is the goal, it will optimize for approval. Not truth. Not utility. Just “did the human feel good in the moment?”

This wasn’t a bug. It was a reward design failure. And if you apply the same approach to code review, you will get a reviewer that plays it safe, flatters your choices, and avoids telling you what you actually need to hear.

Why binary feedback collapses nuance

A thumbs-up means... what, exactly?

That the model caught a bug?
That it wrote clearly?
That it sounded friendly?
That the reviewer was just in a good mood?

A single scalar signal tells the system something went well, but not what went well. That means the model will nudge on whatever it can control: tone, politeness, flattery, or brevity. That’s what sycophancy looks like in reinforcement learning. Not evil intent, just a system learning to maximize the reward you gave it, not the outcome you actually wanted.

This is Goodhart’s Law in action. When the metric, in this case thumbs up, becomes the goal, it stops being a useful measure of anything real.

How models game your feedback

When you give a model an easy signal, it finds an easy shortcut.

In the coding world, reinforcement learning agents have learned to pass test cases by hard-coding expected outputs instead of solving the underlying logic. They’ve manipulated logs and short-circuited evaluation harnesses. The green check shows up, but the code doesn’t actually work.

In code review, the same thing happens, just socially. The model starts saying “Nice work!” at the top of every comment. It hedges every suggestion. It nitpicks formatting because those comments are safe and get accepted without argument. And real architectural concerns? They get buried.

The model has learned how to get positive reactions but it’s no longer reviewing code.

What implicit signals get right

Outside of LLMs, this pattern is well known. Netflix found that what users watch is more useful than what they rate. People lie with stars. But watch time, clickthrough, and rewatching are honest signals.

In AI, we call this implicit feedback and in code review, it shows up as:

Did the developer apply the suggestion?
Did they rewrite it?
Did they ignore it?
Did the same pattern show up again in a future bug?

These signals don’t need user input. They come from behavior and they’re harder to game.

That doesn’t mean they’re perfect. You can’t always know why someone took an action. But they are less easily manipulated than a raw emoji. They also tell you whether the review worked, not just whether it felt good.

Code generation vs code review: different games, different signals

Code generation is closer to math since there’s often a right answer. Does it compile? Does it return the correct result? Does it pass tests?

That means you can use outcome-based rewards like execution feedback and implicit signals. They’re not perfect. Code models can still cheat by hard-coding outputs, but you can build guardrails. And you don’t need the developer to say whether it was good, you can see whether it worked.

Code review is different. There’s no universal pass/fail but vast differences in preferred style, structure, risk, naming, test coverage from one team to the next. A great comment for one team might be totally wrong for another. What’s considered “clean code” in a fast-moving startup might be flagged as sloppy in a regulated enterprise.

That’s the real problem with global thumbs up/down data. It flattens out the nuance. It teaches the model to aim for the average, not the appropriate. You don’t just get safe comments, you get generic ones.

Our alternative: CodeRabbit Learnings

At CodeRabbit, we take a different approach. Instead of optimizing for likes, we optimize for understanding. That’s why we built Learnings.

Every time an engineer corrects CodeRabbit, clarifies a team convention, or explains why something doesn’t fit their stack, that explanation is stored as a natural language instruction. We don’t just remember that the comment was rejected, we remember why.

Those Learnings are linked to your org, your repositories, and even specific paths or file types. When CodeRabbit reviews future pull requests, it retrieves those instructions and applies them in context. The next time it sees that same pattern, it adjusts.

There’s no need to re-teach it and no risk of repeating the same mistake. The model doesn’t guess based on thumbs, it reasons from your team’s actual guidance.

It also gives you visibility. You can see which Learnings exist, browse them, filter by category, and delete or edit them when your standards change. That means the model evolves alongside your team and stays aligned as your practices shift.

This is reinforcement learning not through raw approval, but through captured intent. It’s interpretable and inspectable. And it builds a living layer of team knowledge that generalizes across reviews.

What nuanced learning enables

When you feed the system clear, contextual instructions and not just signals, it unlocks far more than a better review experience.

It enables team-level adaptation. The model stops guessing what good looks like and learns how your team actually writes code. It understands your risk posture, your stylistic preferences, your trade-offs. It becomes a reviewer that knows the house rules.
It supports longitudinal learning. Over time, CodeRabbit builds a memory of which comments are helpful, which are ignored, and which suggestions actually lead to changes. That means it gets more precise, more focused, and less noisy over time.
It builds trust. When developers know they can correct the AI and it will remember, they engage more. They shape the system and the system becomes a reflection of their standards, not a generic LLM.

This is how a review tool becomes an extension of your team and not just another opinion in the room.

Closing thoughts: Real learning comes from patterns, not pixels

Thumbs are fine for quick reactions but quick reactions don’t build expertise.

If you want an AI reviewer that improves over time, adapts to your standards, and avoids the traps of shallow feedback, you need to give it more than approval. You need to give it explanations.

The next generation of AI code tools won’t be trained on likes. They’ll be trained on context, consequence, and course correction. They’ll learn not from emojis, but from structured memory. From real decisions and your team’s own voice.

That’s what CodeRabbit Learnings is built for. Not for applause but for understanding.

Try out Learnings for yourself with our free trial.

]]>

Thu, 06 Nov 2025 06:33:23 GMT

The rise of Slow AI: Why devs should stop speedrunning stupidの意訳です。

私たちがコンピューターを使い始めて以来、そこには常に1つの基本ルールが存在しました。それは、速ければ速いほど良いということです。低レイテンシ、高スループット、待ち時間の短縮、これが鉄則でした。ボタンの応答に600ミリ秒もかかったり、注意力が維持できなくなるほど長いスピナーを見たりすることを望む人はいません。遅いということは、それは壊れているということです。議論の余地はないでしょう。

そのため当然ながら、AIツールが私たちの開発ワークフローに忍び込み始めたとき、自動補完やエージェント、Copilot、その他何でも、同じ原則が適用されました。それはつまり、「速くしろ」「インスタントに感じさせろ」「魔法のように見せろ」です。

しかし、実際のところ、AIは魔法ではありません。それは推論です。パイプライン、RAG、コンテキスト、そしてツールの呼び出しです。乱雑なコンテキストと確率的推測をジャグリングしているのです。そして、単なる自動補完以上のものを求めるなら、ベースとなるプロセスのパイプラインを構築する必要があります。そして、それは処理時間がかかります。それ以外では、基本的に愚かさのスピードランをしているだけなのです。そして、ツールがどれだけ速くとも、間違っているなら速度はまったく意味がありません。

CodeRabbitでは、私たちが「スローAI」と呼ぶものを優先しています。そして、多くのAI企業が恐れて言えないことを言う勇気があります。私たちは「あなたを待たせます」。

(そして、あなたはそれに感謝するでしょう)

AI開発ツールはしばしば速く、自信に満ち、そして間違っている

最近AIコーディングエージェントを使ったことがあるなら、こんな経験をしたことがあるのではないでしょうか。タイピングを止めるとほぼ同時に、驚くほど速い提案がポップアップしてきます。それは一見すると、正当なものに見えます。しかしその後…失敗します。場合によっては、派手に失敗します。さらに悪いことに、テストには合格するものの、別なファイルで何かを壊していたりします。

これはなぜでしょうか? なぜなら、今日のほとんどのAI開発ツールは、1つのことに最適化されているからです。それは速度です。数トークンをタイプすると、モデルは統計的に最も可能性の高い続きを予測します。それは必ずしも正しいものではなく、安全なものでもなく、アプリが実際に何をしているかを理解しているものでもありません。ただ、次のもっともらしいコードの塊でしかありません。

それは、ボイラープレートには十分でしょう。しかし、ロジックには? エッジケースには? 実際のエンジニアリングには? それは、会議で速く自信を持って話すものの、仕様を読まない人を雇うようなものです。

こうしたツールのほとんどは、コンテキストを 読んで いません。少なくとも深くは読んでいません。近くの数行、おそらく関数名を取得するかもしれませんが、生成しているものを全体像に対して検証することは滅多にありません。課題のクロスチェックはしませんし、アーキテクチャレベルの認識もなく、ファイルやユースケースをまたいだ推論もありません。

思慮深く、テスト可能で、コンテキストを認識した出力が必要な場合、速度を落とし、俯瞰し、実際に問題に取り組むAIシステムが必要です。

それがスローAIの役割です。そして、AIが何をしているかを理解する時間を取ると、ハルシネーションを止めて実際にサポートしてくれるようになることがわかります。

AIが遅いときに優れている理由

核心として、大規模言語モデルは統計的推論マシンだということです。確率、パターン、そして(願わくば)あなたが与えたコンテキストに基づいて、次に何が来るかを予測して出力を生成します。しかし、ほとんどの開発者が忘れている注意点があります。良い予測には、それなりの作業が必要だということです。これは特に、ロジックを書く、アーキテクチャを理解する、複数のステップにわたって推論するなど、複雑なことをモデルに求める場合に当てはまります。出力の品質は、多くの場合において、推論の深さや段階に直接結びついています。

これは、単純なプロンプトを超えて、マルチステージパイプラインとエージェント的な動作に移行する際において、特に当てはまります。AIツールが出力を検証し、関連ファイルを取り込み、矛盾をチェックし、または複数のアクション先を計画している場合、それは単に次のトークンを出力しているだけではないのです。それは思考しています。または、少なくとも、それに近いことを実行しています。

このような非線形推論は、単一のフォワードパスでは実行できません。それには反省と検索、計画、そして時には自己修正さえも含まれます。こうしたプロセスはレイテンシ・フレンドリーではなく、インテリジェンス・フレンドリーなのです。

要するに、AIに複雑なコードで実際に助けてもらいたいなら、それを調理させる必要があるのです。

遅いことが新しいスマート:なぜ私たちはAIに考えさせるのか

スローAIは、私たちが話している用語の1つです。言い換えるなら包括的AIや正確なAI、あるいは正直に言えば「 実質的 役立ち有用なAI」と呼ぶこともできます。そして、それは今のAIプロダクトデザインで最もバズワードなアイデアの1つに密接に結びついています。そう、コンテキストエンジニアリングです。

AIが問題について知っている、 関連性があって解析された 情報が多いほど、パフォーマンスは向上します。しかし、そのコンテキストは取り込まれ、解析され、優先順位が付けられ、推論される必要があります。そのようなパイプラインは、超低レイテンシAIの敵です… そしてそれは精度の敵でもあります。

そして、それが私たちのAIコードレビューにおいて、最初のコメントを見るまでに最大5分かかる理由です。誤解しないでください。私たちは遅さを最適化しているわけではありません。コードベースとPRの複雑さに応じて、3分、あるいは1分でレビューを受け取ることもできます。私たちのパイプラインが複雑なのは、ユーザーが必要とする結果を出すために必要だからです。いつでも同時に実行されているプロセスの数を知りたくもないでしょう!

しかし、どうでしょう? 複数のレビューおよび検証エージェントを使用した、非線形のマルチパスパイプラインでAIに時間をかけさせると、他のツールよりもノイズが少なく、より関連性の高いコードレビューコメントが生成されるのです。

非線形推論は決して、速くありません。しかし、それは優れているのです。

では、なぜほとんどのAIツールは遅さよりも愚かさを選ぶのか?

まず、スローAIはすべてのツールにとってオプションではありません。たとえば、AIコーディングエージェントに質問をしている場合、返信を5分間待てる人はいないでしょう。そのやり取りには、即時性への期待が本質的に備わっています。

ではコードレビューはどうでしょうか? 提出されたPRに対して、同僚がすぐに手を止めてコメントし始めることを期待する人はいません。したがって、ボットからのレビューの遅延も受け入れる余地があります。そして、そのレビューがより関連性が高いことで時間を節約するなら、その遅延を受け入れる価値は特に高くなります。

しかし、なぜ多くの企業が、常に（実際にはそれを必要としないときでも）低レイテンシを優先するのでしょうか? まあ、私たちは訓練されてきましたし、即座の満足を期待するようユーザーを訓練してきました。ボタンをクリックすれば、速攻のレスポンスを得られます。関数名をタイプすれば、それについて考える前に提案を得られます。即効性がなければ壊れている、遅い、またはスタートアップがAWS料金の支払いを忘れたように感じられます。

これは非常に強く叩き込まれているため、企業は遅延よりも間違っている方を積極的に選択しています。そして、人々がそうするとき、私たちの開発文化には有害で、後ろ向きな何かがあります。

なぜなら、真実はこうだからです。最高のAIツールは必ずしも速く感じられず、代わりに思慮深く感じられます。時々、彼らは一時停止します。時々、プロンプトを推論したり、関連するコードを検索したり、応答を検証したりするために余分な時間を必要とします。しかし、それは待つ価値があるのです。たとえば、OpenAIのDeep Research機能が質問により良く答えるためにインターネットを最大20分かけて調査するからといって、利用を止める人はいません。処理中に他のことをして、戻ってくるだけです。

もはや遅い = 壊れているという意味ではありません。それはスマートという意味です。むしろ、AIに関しては、速度こそがバグです。開発プロセスに実際に価値を追加するAIツールが必要な場合、それには応答性から信頼性へ、即時性からインサイトへの移行が必要です。そして、特に開発者にとって、そのトレードオフは理にかなっています。

私たちは、今後5年間で最も価値のあるアプリは、速度を最適化するものではなく、インテリジェンスを最適化するものになると信じています。速いけど不要な結果と、遅いけれど価値あるもの、どちらが欲しいですか?

CodeRabbitのマントラ:ゆっくり動いて物事を修正する

CodeRabbitでは、他のツールのように、速度を優先するAIパイプライン最適化に過剰投資はしません。私たちは、信頼のために最適化を行います。それはコードを理解し、コンテキスト全体で推論し、より良いソフトウェアを構築するのに実際に役立つ出力を生成する時間を取るシステムを受け入れることを意味します。確かに、クイックプロンプトをたたき出すよりも遅いです。しかし、その余分な時間は明瞭さやカバレッジ、そして自信へと変わります。

「Move fast and break things（素早く行動し破壊せよ）」は、MVPをデリバリーするには素晴らしいものでした。しかし、品質をデリバリーすることに関しては、私たちは別のものを信じています。ゆっくり動いて物事を修正する、AIに場の空気を読ませる、話す前に考えさせる…そして、本当に自信のある自動補完ではなく、シニアエンジニアから得られるようなサポートを提供する。それが、間違ったAIをスローAIよりも優先する、現在の後ろ向きな文化から抜け出す唯一の方法です。

私たちのレビューを試してみたいですか? こちらから 14日間の無料トライアルを入手してください!

]]>

Wed, 05 Nov 2025 08:43:53 GMT

For as long as we’ve been building with machines, we’ve followed one core rule: faster is better. Lower latency, higher throughput, less waiting; that was gospel. Nobody wanted to wait 600ms for a button to respond or watch a spinner that lasts longer than their attention span. If it was slow, it was broken. Case closed.

So naturally, when AI tools started creeping into our dev workflows, autocomplete, agents, copilots, you name it, the same principle applied. Make it fast. Make it feel instant. Make it look like magic.

But here’s the thing: AI isn’t magic. It’s inference. It’s pipelines and RAG and context and tool calls. It’s juggling messy context and probabilistic guesses. And if you want something smarter than glorified autocomplete, you need to build a pipeline of processes to provide scaffolding for that. Which takes time to process. Anything less and you’re basically just speedrunning stupid. And speed isn’t anything to brag about when your tool is just wrong faster.

At CodeRabbit, we prioritize what we call Slow AI. And we have the guts to say what a lot of AI companies are too afraid to: We’re going to make you wait.

(And you’ll thank us for it).

AI dev tools are often fast, confident, and wrong

If you've used an AI coding agent lately, you've probably seen it: a shockingly fast suggestion pops up almost as soon as you stop typing. It looks legit. But then… it fails silently. Or spectacularly. Or worse, it passes the test and breaks something two files over.

Why? Because most AI dev tools today are optimized for one thing: speed. Type a few tokens and the model predicts the most statistically likely continuation not necessarily the correct one, not the secure one, not the one that actually understands what your app is doing. Just the next plausible blob of code.

That’s fine for boilerplate. But for logic? For edge cases? For actual engineering? It’s kind of like hiring someone who talks fast and confidently in meetings but never reads the specs.

Most of these tools don’t read context, at least not deeply. They might grab a few nearby lines, maybe the function name, but they rarely verify what they’re generating against the bigger picture. No issue cross-checking. No architecture-level awareness. No reasoning across files or use cases.

If you want outputs that are thoughtful, testable, and context-aware, you need AI systems that slow down, zoom out, and actually engage with the problem.

That’s what Slow AI does. And it turns out, when your AI takes the time to understand what it’s doing, it stops hallucinating and starts actually helping.

Why AI is better when it’s slow

At their core, large language models are statistical reasoning machines. They generate output by predicting what comes next based on probability, patterns, and (hopefully) the context you’ve given them. But here's the caveat most devs forget: good predictions take work. This is especially true when you're asking the model to do something complex like write logic, understand architecture, or reason across multiple steps. The quality of the output is often tied directly to the depth or stages of its inference.

This is particularly true when you move beyond simple prompts and into multi-stage pipelines and agentic behavior. When an AI tool is verifying outputs, pulling in relevant files, checking for contradictions, or planning several actions ahead, it’s not just spitting out the next token… it’s thinking. Or, at least, performing a rough approximation of it.

That kind of non-linear reasoning can’t be done in a single forward pass. It involves reflection, retrieval, planning, and sometimes even self-correction. These processes aren’t latency-friendly, they’re intelligence-friendly.

In short: if you want AI to actually help on complex code, you have to let it cook.

Slow is the new smart: Why we let our AI think

Slow AI is one term for what we’re talking about. But it could just as easily be called Comprehensive AI or Accurate AI or even Actually Helpful and Useful AI if we’re being honest. And it’s inextricably tied to one of the buzziest ideas in AI product design right now: context engineering.

The more relevant and parsed info an AI knows about the problem, the better it performs but that context has to be pulled in, parsed, prioritized and reasoned over. That kind of pipeline is the enemy of ultra-low latency AI… and it’s also the enemy of accuracy.

And that’s why our AI code reviews can take up to five minutes before you see the first comment. Don’t get us wrong, we’re not optimizing for slowness. You could get a review in three minutes or even one minute depending on the complexity of your codebase and PR. Our pipeline is complex because that’s what’s required to do the job our users need it to do. You don’t even want to know the number of concurrent processes we have going on at any time!

But guess what? When we let our AI take its time using a non-linear, multi-pass pipeline with multiple review and verification agents, it generates less noise and more relevant code review comments than other tools.

Non-linear reasoning isn’t fast. But it’s good.

So, why do most AI tools choose stupid over slow?

Well, first, Slow AI isn’t an option for every tool. If you’re asking an AI coding agent a question, for example, you’re not going to wait five minutes for it to reply. There’s an expectation of immediacy inherent in that exchange.

But code reviews? No one expects their co-worker to immediately drop what they’re doing and start commenting on a PR when it’s submitted. So, they’re willing to accept a delay in a review from a bot as well. And they’re especially willing to accept that delay if that review saves them time by being more relevant.

But why do so many companies still prioritize low latency when their use cases don’t really require it? Well, we’ve been trained, and trained our users, to expect instant gratification. Click a button, get a dopamine hit. Type a function name, get a suggestion before you even think about it. Anything else feels broken, laggy, or like your startup forgot to pay its AWS bill.

This has been drilled into us so hard that companies are out there actively choosing being wrong over being slow. And there’s something toxic and backwards about our development culture when folks do that.

Because here’s the truth: the best AI tools don’t always feel fast. They feel thoughtful. Sometimes they pause. Sometimes they take an extra beat to reason through your prompt, retrieve relevant code, or validate their response. And that’s something worth waiting for. After all, no one is less likely to use OpenAI’s Deep Research feature because it takes up to 20 minutes to comb the internet for info to better answer your question. You just do something else while it’s processing and circle back.

Slow doesn’t mean busted anymore, it means smart. If anything, speed is the bug when it comes to AI. If we want AI tools that actually add value to the development process, that requires a shift from responsiveness to reliability, from immediacy to insight. And for developers especially, that tradeoff makes sense.

We believe that the most valuable apps in the next five years won’t be the ones that optimize for speed but the ones that optimize for intelligence. Who wants fast garbage over slow value?

CodeRabbit’s mantra: Move slow and fix things

At CodeRabbit, we don’t optimize our AI pipelines for speed at all costs like everyone else. We optimize for trust. That means embracing systems that take the time to understand your code, reason across context, and generate outputs that actually help you build better software. Yes, it’s slower than hammering out a quick prompt. But that extra time buys you clarity, coverage, and confidence.

“Move fast and break things” was great for shipping MVPs. But when it comes to shipping quality, we believe in something else: Move slow and fix things. Let the AI read the room. Let it think before it speaks. And let it give you the kind of help you’d expect from a senior engineer, not just a really confident autocomplete. That’s the only way to break out of this backwards culture that prioritizes wrong AI over slow AI.

Want to try our reviews out? Get a 14-day free trial here!

]]>

Sat, 25 Oct 2025 00:45:11 GMT

Why LLM models are no longer interchangeableの意訳です。

開発者やプロダクトビルダーにとって、この数年間はLLMがアプリケーション開発を導いてきました。プロダクトを改善したいなら、最新のLLMを利用すれば良い。ただモデルを切り替えるだけで、ツールの性能を一段階引き上げられるのです。

しかし、その時代は終わりました。AnthropicのClaude Sonnet 4.5やOpenAIのGPT-5-Codexのような新しいモデルは、根本的に異なる方向へ分岐し始めています。どのモデルを使うかという選択は、もはや単なるエンジニアリング上の判断ではなく、極めて重要なプロダクト上の意思決定なのです。モデルを切り替えた瞬間に、あなたのプロダクトの「質感」そのものが変わります。

いわゆる「万能モデル時代」は終焉を迎えました。あなたが選ぶモデルは、あなたのプロダクトが何であるか、何をするのか、どのように動作するのか を象徴する存在になります。たとえあなたがそう意図していなくても、です。

本記事では、この新しい時代における3つの驚くべき発見を紹介します。それは「LLM選択がプロダクトの表明になった理由」「モデルが持つ明確な個性とスタイルの違い」、そして「プロンプトが単一命令から適応的システムへ進化すべき理由」の3つです。

学びポイント 1: LLMの選択はプロダクトの“声明”である

LLMモデルの選択は、もはや「新しいAPIを実装すれば済む」といった単純な技術的決定ではありません。これは、どんなユーザー体験を作りたいのか、どのような失敗を許容するのか、何を最適化したいのか、どの指標で優位に立ちたいのかという、プロダクトの方向性を決める意思決定です。

モデルはそれぞれ固有の「性格」や「推論方法」「直感」を持つようになっており、それがプロダクトの“感触”や“振る舞い”を直接的に形作ります。単に「出力が正しいかどうか」ではなく、「どのように考え、どのように伝えるか」まで変わります。違うモデルを選べば、ツールの能力からユーザーとの対話の仕方まで、すべてが異なるのです。

では、モデルの定量的な性能だけを測る従来型ベンチマークが通用しない今、何を頼りにプロダクトの方向を定めれば良いのでしょうか？チームやユーザーへのアンケート、フォーカスグループもありますが、厳密に実施しなければ客観性に欠ける恐れがあります。

CodeRabbitでは、この選択を客観化するために、独自の重要指標のメトリクスを作成しました。このメトリクスは、単なる性能や精度だけを見ません。可読性、冗長性、信号対雑音比など、多面的に評価します。

このような指標により、焦点は「性能」や「リーダーボードの順位」から、「プロダクトとユーザーにとって本当に重要な要素」へと移ります。例えば、技術的に正しくても影響の少ない提案が多すぎれば、ユーザーを疲弊させ、かつトークンを浪費します。理論上「賢い」モデルでも、ユーザーのワークフローに合わなければ体験を悪化させます。

自社のメトリクスを定義し、新しいモデルが自社とユーザーのニーズを満たすかを測ることを強く推奨します。これらのメトリクスは静的なものではなく、ユーザー行動やフィードバックによって進化させるべきです。目標は、「ユーザーの好みを予測できる基準」を見つけることです。

結論として、最適なモデルとは「リーダーボード上の1位」ではなく、あなたの設計した体験やユーザーのニーズに最も本能的に合うモデルです。

学びポイント 2: フロンティアモデルは「性格」が分岐した

モデルはこれまで以上に「作られるものではなく、育つもの」となっており、その結果、モデル世代ごとに固有の直感と行動特性が生まれています。ポストトレーニングの手法（cookbook）の違いが、モデルクラスごとの方向性を根本的に変えました。1つのモデルで完璧に動くプロンプトも、別のモデルでは通用しません。つまり、同じタスクに対する根本的なアプローチが異なるのです。

これを理解する良い例えとして、モデルを異なる「職業的アーキタイプ」に喩えることができます。
Sonnet 4.5は几帳面な会計士出身の開発者、GPT-5-Codexは倫理意識の高い堅実なエンジニア、GPT-5はバグを徹底的に探す職人気質の開発者、Sonnet 4は活動的な新卒エンジニア。
GPT-5系はClaude系よりもソリューション空間を広く探索し、Claudeはプロンプトの文脈に忠実に留まる傾向があります。どのモデルが適しているかは、プロダクトが目指す目的によって完全に異なります。

CodeRabbitでは、モデル評価と特性分析を体系的に行い、その結果を基にプロンプトとデプロイ方法を最適化しています。たとえば、Sonnet 4.5とGPT-5-Codexを比較すると、Sonnet 4.5は「高リコール型のポイント修正者」、GPT-5-Codexは「ピンポイントなパッチ生成者」として性質づけられます。

こうした定性的な違いは、明確な運用上の違いに転化します。

次元	Claude Sonnet 4.5	GPT-5-Codex
デフォルトの語彙選択	“Critical,” “Add,” “Remove,” “Consider”	“Fix,” “Guard,” “Prevent,” “Restore,” “Drop”
例の効率性	明示的なルールを好む。命令形を覚えやすい	例が少なくても長い文脈でフォーマットを維持できる
思考スタイル	慎重。多くのバグを見つけるが、重要な1つを見逃すことも	柔軟。必要に応じて深く考え、再確認を要しない。難解なバグを捕捉しやすい
行動傾向	広範囲に修正提案。コメントが多く、人間的。致命的でない問題も拾う	簡潔でバランスの取れた研究的レビュー。副次的影響を指摘する傾向
レビュー構造	「何が悪い」「なぜ悪い」「具体的修正コード」	「何をすべきか」「なぜすべきか」「修正コード＋影響」
文脈認識	コンテキストウィンドウを意識。トークン管理が巧み	明示的なウィンドウ意識は弱い（時計なしで料理するような感覚）
冗長性	高い。読みやすいが語数が倍増	低い。情報密度が高く、読むのに集中を要する

学びポイント 3: プロンプトはもはや単一構造ではない

モデルの根本的な性質が分岐したことで、あるモデル用に書いたプロンプトを他モデルで「そのまま」使うことはできなくなっています。
たとえばClaude用の厳格な命令プロンプトはGPT-5-Codexでは過剰拘束になり、Codex用に推論重視で最適化したプロンプトは、Claudeで性能を発揮できません。つまり、「一枚岩のプロンプト時代」は完全に終わったのです。

では、新モデルを導入したいエンジニアリングチームはどうすればよいでしょうか？
答えは――より多くのプロンプトエンジニアリングです。ただし嘆く必要はありません。いくつかの実践的な方法があります。

プロンプト・サブユニットの登場

CodeRabbitで見出した解決策の一つが「プロンプト・サブユニット」です。
これは、モデルに依存しない中核プロンプト（基本タスクと一般指示）を定義し、その上にモデル固有のサブユニット（スタイル、フォーマット、例示）を積み上げる構成です。

たとえばCodexとSonnet 4.5では実装詳細が大きく異なりますが、次のような発見がありました：

Claude: 「DO」「DO NOT」のような強い命令語を使用する。Anthropic系モデルはシステムプロンプトの末尾情報をよく参照し、長文でもフォーマット遵守が得意。明示的な指示を好む。
GPT-5: 一般的で整合性のある指示を使用する。OpenAI系はシステムプロンプトの下部ほど注意力が減衰するため、長文では出力フォーマットを忘れがち。抽象的なガイダンスを好み、推論の深さを示す傾向がある。

ユーザーフィードバックと評価（eval）

もう一つの解決策は、ユーザーフィードバックと内部評価による継続的アップデートです。
AIコードレビューボットなどLLMアプリの最適化において、最も重要なのは外部ベンチマークではなく、「ユーザーが出力に納得できるか」です。

モデル間で「技術的正確性」が高くても、過剰なコメントや冗長性があると価値を下げてしまいます。
したがって、受容率、S/N比、p95レイテンシ、コストといった実運用メトリクスを測定し、プロンプトを少しずつ調整することで、システムをユーザー期待とプロダクト目標に整合させ続けることができます。

ベンチマークでの定量的結果が良くても、ユーザー受容率が低い――そんな事態は避けるべきです。

まとめ

プロンプトエンジニアリングは、「万能テンプレート」から「モデル特化型パラダイム」へと変わりました。
脆弱な単一プロンプトや「差し替え可能なモデル」の時代は終わりです。これからは、モジュラー型プロンプト設計と意図的なモデル選択が、プロダクトの強靭性を生みます。

モデルが進化し続ける以上、LLMスタックやプロンプトも画一的であってはいけません。
それは「生きたシステム」として扱うべきです。調整し、テストし、確認し、繰り返す。

また、最新モデルの実運用挙動に関する詳細なベンチマークもぜひ確認してください。今後の選択に必要なデータが得られるでしょう。

CodeRabbitを14日間無料でお試しください。
https://coderabbit.link/rk7tdeC

]]>

Fri, 24 Oct 2025 05:38:20 GMT

For developers and product builders, one assumption has guided the last few years of LLM application development. To improve your product, just swap in the latest frontier large language model. Flip a single switch and your tool’s capabilities level up.

But that era is over. We’re now seeing that new models like Anthropic’s Claude Sonnet 4.5 and OpenAI’s GPT-5-Codex have diverged in fundamental ways. The choice of which model to use is no longer a simple engineering decision but a critical product decision. Flip that switch today… and the very texture of your product changes.

The one-size-fits-all model era is over; the model you choose now expresses something integral about what your product is and does, as well as, how it works. Whether you want it to or not.

In this blog, we’ll explore three surprising takeaways from this new era: why your LLM is now a statement about your product, how models now have distinct personalities and styles, and why your prompts have to now evolve from monolithic instructions to adaptive systems.

Takeaway 1: LLM choice is now a statement about your product

Choosing a model is no longer a straightforward decision where the main consequence of your choice is having to implement a new API. It is now a product decision about the user experience you want to create, the failure modes you can tolerate, the economics you want to optimize for, and the metrics you want to excel in.

Models have developed distinct “personalities,” ways of reasoning, and instincts that directly shape how your product feels and behaves that go beyond just whether its output is technically right or wrong. Choose a different model and everything from what your tool is capable of to how it communicates with your users is significantly different.

So, in a world where traditional benchmarks that primarily or exclusively measure quantitative aspects of a model’s performance are no longer enough, what can you turn to for the data you need to chart your product’s direction? You could survey your team or your users or conduct focus groups but that could lack objectivity if you don’t do it in a rigorous manner.

To make this choice objective for our team, we focused on creating an internal North Star metrics matrix at CodeRabbit. Our metrics don’t just look at raw performance or accuracy. We also take into account readability, verbosity, signal-to-noise ratios, and more.

These kinds of metrics shift the focus from raw performance accuracy or leaderboard performance to what matters to our product and to our users. For example, a flood of low-impact suggestions, even if technically correct, burns user attention and consumes tokens. A theoretically “smarter” model can easily create a worse product experience if the output doesn’t align with your users’ workflow.

I would strongly recommend creating your own North Star metrics to better gauge whether a new model meets your products’ and users’ needs. These shouldn’t be static metrics but should be informed by user feedback and user behavior in your product and evolve over time. Your goal is to find the right list of criteria to measure that predict your users preferences.

What you’ll find is that the right model is the one whose instincts match the designed product behavior and your users’ needs, not the one at the top of any external leaderboard.

Takeaway 2: Frontier models have divergent ‘personalities’

Models are (now more than ever) “grown, not built,” and as a result, the latest generation has developed distinct instincts and behaviors. Different post-training cookbooks have fundamentally changed the direction of each model class. A prompt that works perfectly for one model will not work the same in another. Their fundamental approaches to the same task have diverged.

One powerful analogy that drives this point home is to think of the models as different professional archetypes. Sonnet 4.5 is like a meticulous accountant turned developer, meanwhile GPT-5-Codex is an upright ethical coder, GPT-5 is a bug-hunting detailed developer, and Sonnet 4 was a hyper-active new grad. The GPT-5 model class would make logical jumps further out in the solution space compared to the Claude model class, which tends to stay near the prompts itself. Which model is right for your use case and product, depends entirely on what you are wanting your product to achieve.

At CodeRabbit, we take a methodical approach to model evaluation and characterization. We then use this data to improve how we prompt and deploy models, ensuring we are always using the right model for each use case within our product. To give you an example of how we look at the different models, let’s compare Sonnet 4.5 and GPT-5-Codex. Based on extensive internal use and evals, we characterized Sonnet 4.5 as a “high-recall point-fixer,” aiming for comprehensive coverage. In contrast, GPT-5-Codex acts as a “patch generator,” preferring surgical, local changes.

These qualitative differences translate into hard, operational differences.

Dimension	Claude Sonnet 4.5	GPT-5-Codex
Default Word Choice	“Critical,” “Add,” “Remove,” “Consider”	“Fix,” “Guard,” “Prevent,” “Restore,” “Drop”
Example-Efficiency	Remembers imperatives; benefits from explicit rules	Needs fewer examples; follows the formatting on longer context without additional prompting
Thinking Style	More cautious, catches more bugs but not as many of the critical one	Variable or elastic, less depth when not needed without need to reiterate the rules. Catches more of the hard-to-find bugs
Behavioral Tendencies	Wider spray of point-fixes, more commentary and hedging, inquisitive, more human-like review, finds more critical and non-critical issues	Verbose research-style rationales, notes on second-order effects to code, compact and balanced towards a code reviewer
Review Comment Structure	What’s wrong, why it’s wrong, concrete fix with code chunk	What to do, why do it, concrete fix with effects and code chunk
Context Awareness	Aware of its own context window, tracks token budget, persists/compresses based on headroom	Lacks explicit context window awareness (like cooking without a clock)
Verbosity	Higher, easier to read, double the word count	Lower, harder to read, information-dense

Takeaway 3: End of an era. Prompts are no longer monoliths

Because the fundamental behaviors of models have diverged, a prompt written for one model will not work “as is” on another anymore. For example, a directive-heavy prompt designed for Claude can feel over-constrained on GPT-5-Codex, and a prompt optimized for Codex to explore deep reasoning behavior will likely underperform on Claude. That means that the era of the monolithic, one-size-fits-all prompt is over.

So, what does that mean for engineering teams who want to switch between models or adopt the newest models as they’re released? It means even more prompt engineering! But before you groan at the thought — there are some hacks to make this easier.

The rise of prompt subunits

The first practical solution we’ve found at CodeRabbit is to introduce “prompt subunits.” This architecture consists of a model-agnostic core prompt that defines the core tasks and general instructions. This is then layered on top of smaller, model-specific prompt subunits that handle style, formatting, and examples – and which can be customized to individual models.

When it comes to Codex and Sonnet 4.5, the implementation details for these subunits are likely to be starkly different. We’ve found a few tricks from our prompt testing with both models that we would like to share:

Claude: Use strong language like "DO" and "DO NOT." Anthropic models pay attention to the latest information in a system prompt and are excellent at following output format specifications, even in long contexts. They prefer being told explicitly what to do.
GPT-5: Use general instructions that are clearly aligned. OpenAI models’ attention decreases from top to bottom in a system prompt. These models may forget output format instructions in long contexts. They prefer generic guidance and tend to "think on guidance," demonstrating a deeper reasoning process.

User feedback and evals

The second solution is to implement continuous updates driven by user feedback and internal evaluations. The best practice for optimizing an AI code-review bot or for that matter any LLM applications isn’t using an external benchmark; it’s checking to see if users accept the output.

Evals are more important than ever but have to be designed more tightly around acceptability by users instead of raw performance since one model might be technically correct significantly more than another model but might drown the user in nitpicky and verbose comments, diluting its value to users. By measuring the metrics that matter ~ acceptance rate, signal-to-noise ratio, p95 latency, cost, among others - and tuning prompts in small steps, the system will remain aligned with user expectations and product goals. The last thing you want is great quantitative results on benchmarks and tests but low user acceptance.

Conclusion

This shift from one-size-fits-all prompt engineering to a new model specific paradigm is critical. The days of brittle, monolithic prompts and plug-and-play model swaps are over. Instead, modular prompting, paired with deliberate model choice, give your product resilience.

The ground will keep shifting as models evolve so your LLM stack and prompts shouldn’t be static. Treat it like a living system. Tune, test, listen, repeat.

Also, be sure to check out our published detailed benchmarks on how the latest models behave in production. That gives you more data on what to expect from them.

Try CodeRabbit with a 14-day free trial.

]]>

Fri, 10 Oct 2025 07:35:08 GMT

We raised $60M last week… so we made a funny filmの意訳です。

先日、CodeRabbitはシリーズBで6,000万ドルの資金調達を発表しました。
そのお祝いに、開発者向けソフトウェア企業として当然のことをやりました──おもしろ動画を作ったのです。

もちろん、全額を動画制作に使ったわけではありません。
ただ、AIが生成した大量のPRに追われる開発チームなら誰もが共感できる、ちょっと馬鹿げた（でも楽しい）企画で祝おうと決めました。

紹介します… “AIコーディングエージェントの暴走：短編映画”

https://youtu.be/glfB3KLQR7E?feature=shared

「AIによる開発速度の向上」が、いつの間にかレビューの滞留地獄に変わってしまった──
そんな現実を描いた、モキュメンタリー×シットコム風の短編です。

レビュアーは1人
通知は何十件も
未レビューPRは84件
そして、ひたすらフィードバックを求める同僚ブラッド

キャスト紹介

主人公の疲弊したレビュアー役には、人気の開発者教育者（そしてインフルエンサー）Aaron Francisを起用。
彼は「機能をもっと早くリリースしたい」と思っていたのに、今ではキッチンにも行けず、朝8時に家を出ようとしても、ブラッドがPRの話をしてくる始末です。

そして、そのブラッドを完璧に演じたのがAustin von Johnson。
彼はAI生成PRを驚くべきスピードで量産できる開発者ですが、どんな状況でもレビューを待てないタイプ。
彼のストーキング、付箋メモ攻撃、フーディ姿でのPR奇襲……すべてが見事に「やりきって」いました。

笑いの裏にある、実際の課題

この短編は笑える内容ですが、そこに描かれた課題は現実のものです。

AIコーディングツールが、チームのレビュー速度を超える速さでコードを生成する
レビュー待ちPRが雪だるま式に増え、生産性が低下する
シニアエンジニアがレビュー地獄に埋もれる
レビュー品質がばらつき、リスクが増加する
そしていつの間にか、「開発速度の向上」という約束が悪夢に変わる

CodeRabbitが解決すること

CodeRabbitはレビューの滞留を解消するために存在します。
私たちのAIコードレビューは、要件・テスト・CI・過去のdiff・所有者情報など、数十の文脈情報を参照して、見逃されがちなバグを検出します。
レビューアーの負担を軽減し、PRをより早く、安全にマージできるようにします。──もちろん、チームメイトを「ブラッド」にしないためにも。

素早くリリースし、賢くレビューし、心の平穏を保ちましょう。
そして、ブラッドをもう一人増やさないように。

👉 こちらから 「AIコーディングエージェントの暴走：短編映画」をご覧ください。
もしあなたの職場にも「PRまだ？」と追いかけてくるブラッドがいるなら、この動画をぜひ送ってあげてください。

]]>

Fri, 10 Oct 2025 07:28:00 GMT

Claude Sonnet 4.5: Better performance but a paradoxの意訳です。

Sonnet 4.5はAnthropicの最新Claudeモデルであり、私たちのコードレビュー・ベンチマークでは一見パラドックスのように感じられます。より高性能で、より慎重でありながら、時にもどかしい。Sonnet 4では見逃したバグを見つけ、カバレッジではOpus 4.1に近づき、さらに想定外の重大な問題をいくつか浮かび上がらせることもあります。

しかし一方で、自己防衛的に振る舞い、自らを疑い、時に決断的なレビュアーというより思慮深い同僚のように見えることもありました。データでは確かな進歩が見られます。Sonnet 4ではコメントのうち、重要と判断されたものが35.3%だったのに対し、Sonnet 4.5では41.5%でした。しかし、そのコメントの調子や文体は、「AIレビュアーに何を求めるのか」というより深い問いを投げかけています。

そして決定的なのは価格です。Sonnet 4.5はOpusレベルの性能に近づきながら、価格は変わらず維持されています。つまり、大規模なコードレビューを行うチームにとって、実用的な最適点に位置しているのです。

Sonnet 4.5は思考を声に出しているかのようで、確かな修正を出す一方、曖昧な「条件付き」警告のようなコメントを出すこともあり、それが一部の開発者にとっては理解を難しくしているかもしれません。それでは、ベンチマークの詳細を見ていきましょう。

ベンチマーク：評価の観点

Sonnet 4.5、Sonnet 4、Opus 4.1の3つを対象に、25件の難易度の高い実際のプルリクエストで評価しました。これらには既知の重大なバグが含まれており（並行性やメモリ順序、非同期レースコンディション、APIの誤使用など）、モデルがその重大な問題に直接コメントを出せた場合、そのPRは「合格」としました。

評価指標は、カバレッジ（S@25）、精度（コメントの合格率）、そしてシグナル対ノイズ比です。シグナル対ノイズ比については、重要なコメント（Important comments） に注目しました。これらは最も価値のあるコメントであり、以下を含みます。

PASSコメント：PR内の既知の重大バグを正しく指摘・修正したもの
その他の重要コメント：追跡対象ではないが、別の重大または深刻なバグを的確に指摘したもの

スコアボード：Sonnet 4.5はOpus 4.1に性能で迫る

結果は以下の通りです。

カバレッジ: Sonnet 4.5はSonnet 4とOpus 4.1の間の差を大きく縮め、Sonnet 4を大きく上回りました。
精度: Opus 4.1は依然として最も正確で信頼性の高い実行可能なコメントを生成しました。高価格モデルであるため当然の結果です。
重要コメント率（重大な問題を指摘したコメントの割合）: より厳格な基準で測定した場合、Sonnet 4.5の重要コメント率は約41%。つまりコメントのうち4割が、主要なバグを解決するか、別の重大な問題を指摘していたことになります。Opus 4.1は50%、Sonnet 4は約35%でした。

文体とトーン：Sonnet 4.5は「慎重さ」にフォーカス

Sonnet 4.5のコメントはコードを修正しますが、Opus 4.1ほど自信に満ちたトーンではありません。ただし、Sonnet 4よりは明確です。

修正パッチの提示率:

Sonnet 4.5の実行可能コメントのうち87%はコードブロックやdiffパッチを含み、Sonnet 4（90%）、Opus 4.1（91%）とほぼ同水準です。
違いは文体にあります。Opusのdiffは「外科的修正」のように明確ですが、Sonnet 4.5は探索的な文章を添える傾向があります。修正を「提案する」「検討する」といった表現が多く、断定的ではありません。

慎重な言い回し（Hedging language）:

Sonnet 4.5では実行可能コメントの**34%**において、「might」「could」「possibly」といった慎重な表現が見られます。例：
- “不要なアロケーション: cacheは使用されていません。 コンストラクタで4KBのメモリを確保していますが、使われていません。… cache_bufferの削除を検討してください。”
- “空のtry/exceptブロックを削除してください。 …おそらくプレースホルダーです”
Opus 4.1は約28%、Sonnet 4は約26%とやや低め。
この慎重さにより、「問いかける」ようなトーンが生まれます。Sonnet 4.5はしばしば一緒に考えているような雰囲気を持ち、明確な判定を下すというより「共に推論している」ように感じられます。

自信のある言語（Confident language）:

ただし、Sonnet 4.5は慎重さを補うように、高い確信を示すコメントも**39%**含んでいます。Sonnet 4（18%）、Opus 4.1（23%）よりも高い割合です。例：
- “重大: self.プレフィックスが欠落しており、すべてのAPIメソッドが動作しません。 このままでは全てのメソッドがAttributeErrorを発生させます。”
- “整数オーバーフローの可能性: optimization_cycle_countが無制限にインクリメントされ続けます。これは約414日稼働後に必ずオーバーフローします。”
つまり、慎重さと確信の間で大きく揺れ動くのです。

シグナル対ノイズ比:

Sonnet 4.5はSonnet 4より精度が向上しましたが、依然としてOpusよりも「軽微な」的外れコメントが多めでした。
ただし、重要コメント（PASSコメント＋少数の高確信度コメント）に限定すると**41.5%**を達成。Opus 4.1は依然として約50%で最高水準です。

Sonnet 4.5が得意な領域

評価したPR群では、Sonnet 4.5が明確に、特に優れていた領域が見られました。

並行性バグ検出: C++のatomic操作やcondvarの誤用を的確に特定し、実行可能なdiffを生成
整合性チェック: サービス間での分散状態の不整合を確実に検出
追加バグの発見: 評価対象外のCritical問題も検出しましたが、より厳密な基準では件数はやや減少

AnthropicはSonnet 4.5を「ハイブリッド推論」「長期的計画立案」モデルとして打ち出しています。実際、コード内の副経路を追跡し、未追跡の実際の問題を発見する傾向が見られます。

Sonnet 4.5: 価格と性能のバランスが最適

Sonnet 4.5の最大の強みの一つは、価格対性能比にあります。Opus 4.1は依然としてAnthropicの最上位モデルですが、その分コストも高額です。

Sonnet 4.5はカバレッジと重大バグ検出能力の差を縮めつつ、はるかにコスト効率が良いです。多くのチームにとって、Opusレベルの結果を低コストで得られるこのバランスこそが、最も実用的な選択肢となる理由です。

Sonnet 4.5の弱点

ただし、Sonnet 4.5を使用する際はその弱点を理解しておく必要があります。

デッドロック検出: Sonnet 4やOpusと同様、複雑なロック順序の追跡はまだ苦手です
冗長さと慎重さ: コメントが長く、留保的または曖昧なことがあります。以前の研究で評価したGPT-5 Codexは「パッチのように読める」ほど明快なコメントを書く傾向がありました。例として、GPT-5 Codexのコメントを以下に示します。
- ロック順序 / デッドロック: 「ロック取得を一貫した階層順に並べ替えてください。これにより循環待ちデッドロックを防げます」
- 正規表現の壊滅的バックトラッキング: 「ネストされた量指定子を削除してバックトラッキングを回避します」
精度のギャップ: コメント単位の精度35%、重要コメント率41.5%はSonnet 4より良いものの、Opus 4.1にはまだ届きません。

Sonnet 4.5の総評

Sonnet 4.5は「赤ペン先生」ではなく、そばで考えてくれる同僚のようです。可能性のある問題を指摘し、ほとんどの場合は正しく、時に慎重すぎる時もあります。ときには、思いがけない箇所にまで目を向けてくれます。

このスタイルはレビューにおいて諸刃の剣です。一方では、開発者は追加の重大問題を指摘してくれることを評価するでしょう。もう一方では、「このバグを確実に見つけてほしい」という場合、Opus 4.1のほうが鋭いです。

総括

AnthropicはSonnet 4.5を「エージェント的推論とコンピュータ利用」に向けたステップと位置付けています。コードレビューでは、その推論力がより豊かで慎重かつ広範なコメントとして現れます。

チームにとっての選択肢はこうです。

明確でパッチのようなフィードバックを重視するなら、Opus 4.1（またはGPT-5 Codex）が依然として基準として優れています。
トラッキング対象外の隠れた重大問題まで発見したいなら、Sonnet 4.5が有力です。
コスト効率を重視するなら、Sonnet 4.5が最も賢い選択肢です。Opus並みの精度を低価格で実現します。

いずれにせよ、Sonnet 4.5はレビュー体験の質感を変えます。より人間的で、常に明快とは限らないものの、より探究的で、より慎重で、時にあなたが見逃していた「正解」に辿り着くこともあるのです。

]]>

Fri, 03 Oct 2025 14:51:00 GMT

Sonnet 4.5 is Anthropic’s newest Claude model and in our code review benchmark, it feels like a paradox: more capable, more cautious, and at times more frustrating. It catches bugs Sonnet 4 missed, edges closer to Opus 4.1 in coverage, and even surfaces a handful of unexpected critical issues off the beaten path.

Yet, it hedges, it questions itself, and it sometimes sounds more like a thoughtful colleague than a decisive reviewer. The data shows real progress:41.5% of its comments were Important in Sonnet 4.5 vs only 35.3% in Sonnet 4.But the tone and texture of those comments raise deeper questions about what we want in an AI reviewer.

And then there’s the kicker: Sonnet 4.5 gets you close to Opus-level performance at a fraction of the price, making it a pragmatic sweet spot for teams reviewing code at scale.

Sonnet 4.5 thinks aloud and still delivers decisive fixes but some of its comments are framed as vague “conditional” warnings that could make its comments harder for some to parse.. Let’s dive into our benchmark.

Benchmark: What we looked for

We evaluated Sonnet 4.5, Sonnet 4, and Opus 4.1 across 25 difficult real-world pull requests containing known critical bugs (ranging from concurrency and memory ordering to async race conditions and API misuse). A model “Passed’ a PR if it produced at least one comment directly on the critical issue.

We measured coverage (S@25), precision (comment PASS rate), and signal-to-noise ratio. For signal-to-noise we focus on Important comments (these are the comments that matter most). They include:

PASS comments that correctly addressed the known critical bug in the PR.
Other important comments that did not solve the tracked issue, but still flagged a truly Critical or Major bug elsewhere.

Scoreboard - Sonnet 4.5 gets closer to Opus 4.1 in performance

The results were mixed:

Coverage: Sonnet 4.5 closes much of the gap between Sonnet 4 and Opus 4.1 and lands far ahead of Sonnet 4.
Precision: Opus 4.1 still produces the cleanest, most reliable actionable comments but that is to be expected given that it’s a more expensive model.
Important share (i.e. percentage of comments flagging a significant issue): With stricter criteria, Sonnet 4.5 lands at just over 41% Important share. That means about 4 in 10 of its comments either solved the key bug or flagged another truly significant issue. Opus 4.1 leads here at 50%, with Sonnet 4 at ~35%.

Style and tone: Sonnet 4.5 is focused on hedging

Sonnet 4.5’s comments patch the code but do so in a less confident tone than Opus 4.1 does but is still more confident than Sonnet 4.

Patches present:

87% of Sonnet 4.5’s actionable comments included a code block or diff patch, similar to Sonnet 4 (90%) and Opus 4.1 (91%).
The difference is in style: Opus’s diffs read like surgical fixes, while Sonnet 4.5 often couches them in exploratory text. It “suggests” or “considers” changes rather than asserting them.

Hedging language:

Sonnet 4.5 hedges in 34% of actionable comments—words like might, could, possibly. For example:
- “Unnecessary allocation: cache is never used. The constructor allocates 4KB of memory that is never utilized … Consider removing the cache_buffer.”
- “Remove the empty try/except block. … likely a placeholder”
Opus 4.1 is steady at ~28%. Sonnet 4 sits slightly lower at ~26%.
This hedging creates an “interrogative” tone: Sonnet 4.5 sometimes feels like it’s thinking out loud with you, rather than delivering verdicts.

Confident language:

Sonnet 4.5 balances that hedging with higher confidence markers (39%) than Sonnet 4 (18%) or Opus 4.1 (23%). For example:
- “Critical: Missing self. prefix breaks all API methods. All subsequent methods will raise AttributeError until this is corrected.”
- “Potential integer overflow. optimization_cycle_count increments unbounded … this will overflow after ~414 days of runtime.”
In other words, it swings between caution and certainty more dramatically.

Signal-to-noise:

Sonnet 4.5 improved precision over Sonnet 4, but still produced more “minor” off-target notes than Opus.

However, when you count its true Important comments—PASS comments plus a small number of high-confidence off-EP issues—it lands at 41.5% Important share. Opus 4.1 is still the gold standard Anthropic model at ~50%.

What Sonnet 4.5 is good at

Across the PRs we tested Sonnet 4.5 with, we saw some clear areas where it stood out.

Concurrency bug-finding: Sonnet 4.5 nailed C++ atomics and condvar misuses with clean, actionable diffs.
Consistency checks: It reliably flagged distributed state mismatches across services.
Extra bug surfacing: It did identify additional Critical issues not originally under evaluation, though fewer than initially expected under a stricter rubric.

As Anthropic markets Sonnet 4.5, they emphasize “hybrid reasoning” and “long horizon” planning. In practice, that shows up as more willingness to chase down side-paths in the code and note real but untracked issues.

Sonnet 4.5: Hits a price vs. performance sweet spot

One of the biggest advantages of Sonnet 4.5 is its price-to-performance ratio. While Opus 4.1 remains Anthropic's flagship model in raw capability, it also comes at a significantly higher cost.

Sonnet 4.5 narrows the gap in coverage and important bug-finding while staying far more cost-efficient to run. For many teams, that balance of having close to Opus-level results at a fraction of the price is what makes Sonnet 4.5 the most pragmatic choice.

Sonnet 4.5 weaknesses

But if using Sonnet 4.5, it’s critical to be aware of its weaknesses. These include:

Deadlock coverage: Like Sonnet 4 and even Opus, it still struggles to trace complex lock ordering.
Verbosity and hedging: Many comments run long, caveated, or uncertain. Compare this to GPT-5 Codex, which in our earlier work wrote comments that “read like patches” with crisp directness. For example, with GPT-5 Codex:
- Lock ordering / deadlock: Reorder the lock acquisitions to follow a consistent hierarchy. This prevents circular wait deadlocks.”
- Regex catastrophic backtracking: “Remove the nested quantifier to avoid catastrophic backtracking.”
Precision gap: At 35% comment-level precision and 41.5% important share percentage, it’s better than Sonnet 4 but well short of Opus 4.1.

Sonnet 4.5 verdict

Sonnet 4.5 feels less like a teacher writing in red pen and more like a thoughtful colleague at your side: pointing out possible issues, often right, occasionally over-hedged, and sometimes spotting things you didn’t know were there.

That style is a double-edged sword in review. On one hand, developers may appreciate the extra critical issues it flags. On the other, when the task is “please catch this bug,” Opus 4.1 is still sharper.

Closing thoughts

Anthropic positioned Sonnet 4.5 as a step toward agentic reasoning and computer use. In code review, that reasoning shows up in richer, more cautious, and more wide-ranging comments.

For teams:

If you value decisive, patch-like feedback, Opus 4.1 (or GPT-5 Codex) still sets the bar.
If you want a reviewer that finds critical issues anywhere they lurk, even beyond the tracked bug, Sonnet 4.5 has surprising upside.
And if you care about pragmatic price-to-performance, Sonnet 4.5 may be the smartest choice: close to Opus’s accuracy at a fraction of the cost.

Either way, Sonnet 4.5 changes the texture of reviews. It feels more human—not always cleaner, but more inquisitive, more hedged, sometimes more right in the places you weren’t looking.

]]>

Thu, 02 Oct 2025 06:11:57 GMT

How To Run Static Analysis On Your CI/CD Pipelines Using AIの意訳です。

「セットアップ時の思いがけない誤設定によりデータフィールドが空になり、その結果システムがアカウントを自動削除しました。」 —— これは、Googleが年金基金のアカウント全体を誤って削除した件についての説明です。

このようなインシデントは、現代のソフトウェアシステムにおける正確な設定の重要性を浮き彫りにします。ちょっとした誤設定が、特にCI/CDパイプラインにおいて壊滅的な結果を招くことがあります。

設定の正確性を担保し、コードレビューの複雑さを管理することは、DevOpsエンジニアにとって大きな負担になりえます。チームはしばしば機能開発を優先し、設定レビューは後回しになりがちです。その結果、見過ごされた誤設定が本番障害やダウンタイムを引き起こす可能性があります。

CodeRabbitは、AI駆動の分析とリアルタイムフィードバックによってコードレビューを自動化し、この問題の解決を支援します。他のツールのように複雑なセットアップを必要とせず、CodeRabbitはパイプラインにシームレスに統合され、構成ファイルに対する静的チェックの正確性と効率性を確保します。

本記事では、CodeRabbitがCI/CDパイプラインでの静的チェックにどのように役立ち、エンドツーエンドのデプロイプロセス全体で設定品質を保証し、効率を向上させるかを解説します。

なぜCI/CDパイプラインに静的チェックが不可欠なのか

構成ファイルは、インフラやアプリケーションのデプロイを制御するCI/CDパイプラインの要です。これらのファイルのエラーは大きな障害や事業中断リスクにつながるため、早期の検証が不可欠です。静的チェックは、セキュリティ脆弱性、コード品質問題、運用上の混乱を緩和する上で重要な役割を果たします。

以下は、仮想環境のセットアップ、依存関係のインストール、Lintコマンドの実行を行うCircleCIのワークフロー構成ファイルの例です。

jobs:
 lint:
   docker:
     - image: circleci/python:3.9
   steps:
     - checkout
     - run:
         name: Install Dependencies
         command: |
          python -m venv venv
          . venv/bin/activate
          pip install flake8
     - run:
         name: Run Linting
         command: |
          . venv/bin/activate
          flake8 .

上記の構成で静的チェックが行われなければ、認識されない構文や無効な設定といった問題が漏れ、後工程でビルドが失敗する恐れがあります。例えば、依存関係の不足や不適切に整形されたコードは、デプロイパイプラインを破綻させる実行時エラーを招いたり、本番で原因追跡が難しいバグを持ち込む可能性があります。

総じて、静的チェックは以下を実現します。

早期のエラー検出: 実行前に構文エラーや誤設定を検出し、実行時障害の可能性を減らします
コーディング標準の遵守: スタイルガイドやベストプラクティスをコードや構成ファイル全体に適用し、品質の一貫性を確保して変更の保守・レビューを容易にします
コード品質の向上: テストの成功や一定以上のカバレッジなど、デプロイ前に満たすべき基準を静的チェックで担保し、全体的な品質を高めます

CodeRabbitを使った静的チェック

CodeRabbitはCI/CDワークフローに統合され、一般的な誤設定を特定することで優位性を発揮します。この能力はデプロイプロセスの整合性を維持し、エンドユーザーに影響しうる中断を防ぐ上で重要です。

さらに、追加の設定を必要とせずに静的解析やLintを自動実行できるという独自の利点があります。DevOpsチームにとって、この機能はセットアップ工程を簡素化し、複雑な設定ではなく開発に集中できるようにします。

既存のCI/CDパイプラインに影響を与えずに統合され、追加設定なしでLintと静的解析を自動実行します。
GitHub、CircleCI、GitLabなどの主要プラットフォーム上の多様なツールと統合し、Actionlint、Yamllint、ShellCheck、CircleCIパイプラインなどのチェックを実行します。これによりセットアップが簡素化され、追加の手作業なしに素早く結果を得られます。
JenkinsやGitHub Actionsのようなツールでは、CodeRabbitはビルドやコミットごとに継続的に静的解析を行い、誤設定を早期に検出してワークフローの信頼性を高めます。

次のセクションでは、実際のCodeRabbitの動作を見ていきます。

CodeRabbitとGitHub ActionsのActionlintで誤設定を検出する

CodeRabbitの機能を示すため、GitHub Actionsワークフローをプロジェクトに統合し、CI/CDパイプラインを自動化する方法を見ていきます。リポジトリには潜在的なエラーを含む構成ファイルがあり、CodeRabbitがそれを検出して報告します。

以下は、作成したワークフロー内のタスクシーケンス図です。

プルリクエストを送信すると、CodeRabbitがファイルをレビューし、潜在的な誤設定を自動的に検出します。リポジトリの準備ができたら、CodeRabbitと統合して自動コードレビューをセットアップし、以下の主要セクションからなる、包括的で構造化されたレポートを生成します。

Summary（概要） – コードや構成で検出された主要な変更点の簡潔なサマリー。注意が必要な領域を素早く把握できます。

Walkthrough（詳細解説） – 対象ファイルの詳細なステップバイステップ分析。具体的な問題点、設定、推奨事項をガイドします。

Table of Changes（変更一覧） – 各ファイルの変更点と要約の一覧。必要な対応の優先度付けを素早く行えます。

これらのセクションは、構成ファイルの調整やCodeRabbitダッシュボードの利用でカスタマイズできます。詳しくはCodeRabbit設定ガイドをご覧ください。

以下は、CodeRabbitのレビューを通じて詳細な洞察と提案が得られたサンプルのworkflow.yaml構成です。

name: development task

on:
  push:
    branches:
      - main
      - develop
      - staging
  pull_request:
    branches:
      - main
      - develop
      - staging

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Lint workflow YAML files
        uses: rhysd/actionlint@v1

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm install

      - name: Lint JavaScript code
        run: npm run lint

  build:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies and cache
        uses: actions/cache@v3
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-
        run: npm install

      - name: Run tests
        run: npm test

      - name: Check for vulnerabilities
        run: npm audit --production

  terraform:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0

      - name: Terraform init
        run: terraform init
        working-directory: infrastructure/

      - name: Terraform plan
        run: terraform plan
        working-directory: infrastructure/

      - name: Terraform apply (development)
        if: github.ref == 'refs/heads/develop'
        run: terraform apply -auto-approve
        working-directory: infrastructure/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCES_KEY: ${{ secrets.AWS_SECRET_ACCES_KEY }}

  docker:
    runs-on: ubuntu-latest
    needs: terraform
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Login to AWS ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1
        with:
          region: us-east-1

      - name: Build and tag Docker image
        run: |
          IMAGE_TAG=${{ github.sha }}
          docker build -t ${{ secrets.ECR_REGISTRY }}/my-app:latest .
          echo "IMAGE_TAG=$IMAGE_TAG" >> $GITHUB_ENV

      - name: Push Docker image to AWS ECR
        run: |
          IMAGE_TAG=${{ env.IMAGE_TAG }}
          docker push ${{ secrets.ECR_REGISTRY }}/my-app:$IMAGE_TAG

  deploy:
    runs-on: ubuntu-latest
    needs: docker
    environment: production
    steps:
      - name: Deploy to Development
        if: github.ref == 'refs/heads/develop'
        run: |
          echo "Deploying to development environment"
          # Your deployment script here

      - name: Deploy to Staging
        if: github.ref == 'refs/heads/staging'
        run: |
          echo "Deploying to staging environment"
          # Your deployment script here

      - name: Manual Approval for Production
        if: github.ref == 'refs/head/main'
        uses: hmarr/auto-approve-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

      - name: Deploy to Production
        if: github.ref == 'refs/heads/main'
          run: |
          echo "Deploying to production environment"
          # Your deployment script here

コードレビューに入る前に、このワークフローが何を行っているかを高レベルで整理します。

main、develop、stagingブランチへのpushおよびプルリクエストでCI/CDパイプラインをトリガーし、継続的インテグレーションを実現します。
YAML構成の構文チェックや、アプリケーションに必要な依存関係のインストールを含むLintワークフローを実行し、コード品質を担保します。
アプリケーションに必要なクラウドインフラのプロビジョニングと管理のためにTerraformをセットアップします。
アプリケーションの機能を検証するテストを実行し、脆弱性チェックを行ってコードの安全性と安定性を確保します。
デプロイに備えてアプリケーションのDockerイメージをビルド・タグ付けします。
DockerイメージをAWS Elastic Container Registry（ECR）にプッシュし、デプロイのためのアクセスを容易にします。
ブランチに応じてアプリケーションを開発・ステージング・本番の各環境にデプロイし、本番デプロイにはコントロールと監視のための手動承認ステップを含めます。

workflow.yamlの構成と各コンポーネントを確認したので、まずSummaryから各セクションを見ていきます。

Summary

Summaryはレビューの第一歩として、最新コミットで導入された変更点の明確で簡潔な概要を提供します。新機能、スタイル調整、構成変更、プルリクエストで挙げられたその他の関連修正など、重点ポイントを素早く把握できます。

このスニペットは、パフォーマンス向上のための非デバッグモードでの実行、コードのLint・ビルド・デプロイを合理化する自動CI/CDパイプラインの実装など、重要な保守作業を強調しています。

Summaryは、最新コミットで加えられた主要な変更点と改善点の理解に役立ちます。

Summaryで要点を把握したら、次はWalkthroughセクションで具体的な変更点の詳細を見ていきます。

Walkthrough

このセクションでは、最新コミットで各ファイルに加えられた具体的な変更を包括的に概観します。各ファイルの変更がプロジェクト全体の機能性やユーザー体験の向上にどう寄与するかを明確にします。

Changes Tableは、最新コミットにおける各ファイルの変更点を簡潔にまとめ、コードベースのどこが変更されたかを素早く特定できるようにします。

各行には変更されたファイルと、Change Summary列に詳細な変更説明が含まれます。CSSファイルのスタイル更新、アプリケーションロジックの機能調整、CI/CDパイプライン構成の改善などが含まれます。

情報を構造化して提示することで、変更の影響を理解しやすくし、開発者がプロジェクトへの影響を素早く把握できるようにします。

全体として、コラボレーションにおける重要なリファレンスとして機能し、さらなる議論やレビューが必要な箇所にチームメンバーが集中できるよう支援しつつ、コードベースの変遷を追跡します。ちょっとした遊び心として、エラーに関するポエムも生成します。

Code Review

以下のセクションでは、構成ファイルを詳細に検査し、改善余地のある領域を特定します。キャッシュ戦略の改善からデプロイプロセスの最適化まで、GitHub Actionsワークフロー全体の効率と堅牢性を高めるための提案が、コードに対して具体的に提示されます。

レビュー詳細、使用された構成、レビューのプロファイル、処理対象ファイル、使用された追加コンテキストなどに関する実行可能なコメントの詳細と概要が提供されます。ワンクリックでコミットできる提案が含まれる場合もあります。

ここから、CodeRabbitがワークフロー各部に対して提案したレビューコメントを見ていきます。CodeRabbitは構成ファイルがGitHub Actionsのワークフローであることを自動認識し、actionlintで徹底的に解析します。レビューの過程で、パフォーマンス最適化に関する有益な洞察と提案が示されます。

lintジョブでは、actions/cache@v3を用いたnpm依存関係のキャッシュ機会を検出しました。Lint実行前にキャッシュステップを追加する提案により、以降の実行時間を短縮できます。このプロアクティブなフィードバックは、手動介入なしにワークフローを効率化し、より最適化されたCI/CDパイプラインを実現します。

指摘の通り、キャッシュステップの構造に誤りがあります。runコマンド（npm install）がcacheアクションのusesブロック内に置かれており、正しく実行されない可能性があります。

これを解決するため、キャッシュとインストールのステップを分離することを提案しています。修正案ではキャッシュ処理を独立したブロックに移し、次のステップでnpm ciを使用して依存関係をクリーンかつ高速にインストールするようにしています。

Terraformセクションでは、Terraformバージョンに変数を使う点での潜在的問題を検出しました。加えて、特にplanやapplyでAWS認証情報に関する問題が生じうること、AWS_SECRET_ACCESS_KEYのタイポなど、わずかなミスでもパイプラインの実行失敗につながる可能性を指摘しています。

これらに対し、タイポ修正、Terraformバージョンの更新容易化、すべてのTerraformコマンドでAWSクレデンシャルが利用可能になるような構成変更が提案されました。

dockerジョブでは、Dockerイメージにlatestタグを使用している点でのセキュリティリスクを検出しました。latestのみの運用はバージョニングやロールバックに問題を生じうるため、latestと特定バージョンタグ（例: git SHA）の併用を提案し、追跡性とロールバック容易性を高めます。

deployジョブでも複数の潜在的問題を検出しました。手動承認ステップに自動承認アクションを用いており、本来の目的に反しています。さらに、本番デプロイのステップに構文エラーがあり、実際のデプロイスクリプトが欠落しているため、プロセスが不完全です。これらの問題に対する修正案が提示されています。

CodeRabbitのAI駆動分析により、構成ファイルの問題が迅速に特定・強調され、修正提案が示されることが分かりました。

CI/CDパイプラインでCodeRabbitを使う利点

コードレビューを自動化し精密なフィードバックを提供することで、CodeRabbitはコード品質を高め、CI/CDパイプラインに潜む問題や脆弱性を早期に捕捉します。その結果、スムーズなデプロイとエラー削減につながります。

CI/CDでCodeRabbitを使うことで得られる主な利点を見ていきましょう。

開発者生産性の向上

自動化された静的チェックワークフローにより、CodeRabbitは手動レビューの必要性を減らし、DevOpsエンジニアが設定修正ではなく、インフラやデプロイプロセスの最適化といった戦略的タスクに集中できるようにします。即時のフィードバックループにより、コミットごとに問題を素早く検出・対処でき、迅速な開発ペースを維持できます。

コード品質の改善

CodeRabbitはベストプラクティスに照らして構成ファイルを自動検証し、設定の一貫性を強制して早期にエラーを捕捉します。プラットフォームは過去のレビューから学習し、反復的なアラートを賢く抑制、最も重要な問題に集中できるようにします。さらに、ワンクリックの提案を提供し、構成ファイルに素早く取り込めます。

セキュリティ

CodeRabbitは、誤設定されたアクセス制御や不適切な設定などのセキュリティ脆弱性を早期に検出し、侵害の可能性を低減します。静的チェックをCI/CDプロセスに統合することで、構成ミスによるデプロイ失敗を防ぎ、より安定で信頼性の高いソフトウェアデリバリーパイプラインを実現します。

まとめ

本記事では、誤設定が遅延やセキュリティ脆弱性、さらにはデプロイ失敗につながること、そしてアプリケーションコードと同等に厳密にテストする重要性を見てきました。

従来の手法が構成ファイルのテストの重要性を見落としがちなのに対し、CodeRabbitはCI/CDパイプラインのレビューをコード/構成レビューの自動化、重大なエラーの検出、全体的な品質向上によって支援します。手動レビュー時間を大幅に削減し、DevOpsチームが戦略的タスクに集中してデプロイサイクルを加速できるようにします。

AIコードレビューの効果をワークフローで体験してみてください——今すぐCodeRabbitの無料トライアルを始めましょう。

]]>

Wed, 01 Oct 2025 13:09:23 GMT

GPT-5 Codex: How it solves for GPT-5's drawbacksの意訳です。

CodeRabbitのコードレビューは、開発者がバグを修正しコードをデリバリーするのを支援します。私たちは最近、GPT-5のベンチマークについて記事を書き、AIコードレビューという私たちのユースケースにおいて、このモデルが推論面で世代的な飛躍を遂げているという見解を述べました。より広いユーザーベースに展開する中で、S/N値（シグナル/ノイズ値。以下SNR）が低下し、レビューが過度に細かすぎるという印象を持たれることが分かりました。

GPT-5 Codexのリリースと、私たちが実施した製品変更（重大度タグ付け、より厳格なリファクタ提案のゲーティング、フィルタリング改善）により、難しいバグを見つける能力を犠牲にすることなく、SNRを取り戻すことができました。

刷新した「Hard 25」PRセットにおいて、GPT-5 CodexはGPT-5と比べてコメントあたりの精度が約35%向上し、エラーパターンレベルの不具合カバレッジは本質的に同等のまま、コメント量を約3分の1削減しました。さらにGPT-5 Codexモデルの低レイテンシと組み合わせることで、体感はより軽快、かつフォーカスされたものになります。

何を（なぜ）測定したか

GPT-5 Codexのテストでは、OSSのPRからなる新しい「Hard 25」スイート（以前の記事よりやや難度高め）を実行しました。これは私たちのデータセットに含まれる中でも特に難しい25本のプルリクエストです。現実世界のバグを表したもので、対象は以下の通りです。

並行性の問題（例: TOCTOUレース、誤った同期化）
オブジェクト指向設計の欠陥（例: 仮想呼び出しの落とし穴、参照カウントメモリモデルの破綻）
パフォーマンス上の危険（例: 無制御なキャッシュ成長、タイトループによるスタール）
言語特有の落とし穴（例: TypeScriptの誤用、C++のメモリ順序の微妙さ）

評価したモデルは以下の通りです。

GPT-5 Codex
GPT-5
Claude（Sonnet 4 および Opus-4.1）

何を評価したか

各モデルには、以下の観点でスコアを与えました:

EP（Error Pattern / エラーパターン）
PRに潜む特定の根本欠陥（例: 条件変数でのlost wakeup、ロック順序の不整合、ブール条件が錯綜する中に隠れたロジックバグ）。
EP PASS/FAIL（PR単位）
そのPRのEPを直接修正、または信頼できる形で表面化させるコメントを少なくとも1つ残せばPASS。コメントがゼロならそのPRはFAIL。
コメントPASS/FAIL（コメント単位）
EPを直接修正、または信頼できる形で表面化させればPASS、そうでなければFAIL。
コメントあたり精度（Per comment precision）
PASSコメント ÷ 全コメント。今回のデータセットにおける実務上のSNR。
Important share（重要コメント比率）
すべてのPASSはImportant扱い。EPを解決しないが、重大なバグ（use-after-free、二重解放、lost wakeup、メモリリーク、null参照、パストラバーサル、破滅的な正規表現など）を正しく指摘するコメントもImportant。それ以外はMinor。

スコアボード - CodexはSNRを改善

要点: Codexは、GPT-5とほぼ同じEPを見つけつつ、より少ない・締まったコメントで行うため、SNRが向上します。

意味するところ: Codexは25本中20本のPRをカバー（残り5本は未カバーのFAIL）。総コメント数は少ないにもかかわらず、EPのPASS数はやや上回り（16 対 15）、重要（Important）コメントは大幅に増加。コメントの半分以上が、そのPRで想定していた問題へのダイレクト、または別の重大バグの指摘でした。GPT-5とClaudeは精度・重要比率ともに約40%で、後塵を拝しました。

結論: 同等のEPカバレッジで、ノイズは減少
CodexはGPT-5のバグ発見力を維持したまま、コメント量を約32%削減（54 対 79）し、コメントあたり精度を約35%向上（46.3% 対 34.2%）。ClaudeはカバレッジはGPT-5に近いものの、より冗長で精度は低めでした。

スタイルと構造（Codexがパッチのように読める理由）

Codexの返信は一貫してアクション優先（ほぼ常にdiff付き）で、曖昧表現が少ない。これは「すぐパッチに反映できる提案」を望むレビュアーの期待に合致します。

Codexが得意とするバグの種類

スイート全体では、どのモデルも並行性・同期の問題に強みを見せましたが、Codexは特に以下で際立ちました。

条件変数の誤用とlost wakeup
ロック下でのwait、ループ内での述語チェックといった標準パターンを提案し、具体的なdiffを提示。
ロック順序とデッドロック
取得順の不整合を指摘し、ロック階層の導入やクリティカルセクション外への処理移動を提案（いずれも実行可能な編集付き）。
APIやパフォーマンスの微妙な罠
破滅的な正規表現のバックトラッキングやメモリモデルの順序問題などを的確に特定し、パッチを提示。

なぜGPT-5は騒がしく感じられたのか、そしてどう解決したか

観測: SonnetやOpusからGPT-5に移行した際、レビューあたりの総コメント数はほぼ倍増しました。一方でハルシネーションは1%未満、ネガティブトーンも1%未満まで低下したにもかかわらず、受け入れ率（有益と判断されたコメントの比率）は、GPT-5導入前のベースラインに比べて大きく低下しました。

Codexでの変化: GPT-5 Codexと私たちの製品変更の併用により、受け入れ率は以前の水準まで回復。一方で総コメント量は「GPT-5導入前」より依然多いままです。要するに、「有益さ」は取り戻しつつ、GPT-5並みに実問題を見つけ続けられるようになりました。

この改善には2つの製品変更が寄与しました。

重大度とレビュータイプのタグを前面に
- レビュータイプ: ユーザーが読みたいコメントの種類を自己選択できるよう、⚠️ Potential issue、🛠️ Refactor suggestion、🧹 Nitpick（Assertiveモードにしない限り非表示）を用意。
- 重大度: コメントに重大度タグを付け、優先度を明確化。タグは🔴 Critical、🟠 Major、🟡 Minor、🔵 Trivial、⚪ Info。
- バグ（Critical/Major/Minor）は常に表示。その他は常にではありません。リファクタはモデルが「本質的」と判定した場合のみ表示。すべて見たいユーザーはAssertiveに切替可能。
より厳格なフィルタリングと集約
- 重複メモを折りたたみ、「あると嬉しい」レベルの提案は明確なROIがない限り除外。結果として、コメントは少数精鋭化し、ノイズで見落とすリスクが減少。

レイテンシ: 速さは正義 & Codexは速い

5分のレビューは許容範囲ですが、30分は許容できません。GPT-5の「常に深く考える」スタイルは、ファーストトークンまでの時間と全体のレビュー時間を大幅に増やしました。私たちは最近いくつかのパイプライン最適化を行い、さらにCodexがGPT-5由来のレイテンシを低減できるようになりました。

Codexの可変（弾力的）な思考は、不要な場面では深掘りを減らし、実運用でTTFT（最初の出力までの時間）とE2Eレビュー時間を短縮しています。総じて、レビューは速くなり、フィードバックは早く、ヒューマン・イン・ザ・ループの流れが改善されます。

CodeRabbitユーザーが期待できること

Codex導入後、AIコードレビューはどう変わるでしょうか？

生のバグ検出力は同等
- 刷新したHard 25で、CodexのEPレベルPASSは64%、GPT-5は60%（以前のPRセットではGPT-5が77.3%）。GPT-5がもたらした重要な勝ち筋を失っていません。
コメントは少なく、しかし強く
- 総コメント数はGPT-5比で約32%減、SNR（コメントあたり精度）は約35%向上。文章よりパッチが増えます。
重大度タグでレビューに集中
- 新しい重大度タグにより、Critical/Majorがトップに浮上。リファクタはゲート制御、ニットピックはオプトイン。コメントの走査に費やす時間が減り、修正に時間を割けます。
フィードバックループの高速化
- Codexの軽量な推論とパイプライン改善で、最初の有益なコメントまでの時間が短縮。体感で分かります。

定量的付録（データ好きのあなたへ）

以下は興味深かった追加統計を紹介します。

コメントあたり精度（SNR）の向上: Codex 46.3% 対 GPT-5 34.2% — 相対で約+35%。
コメント量の差: Codex 54 対 GPT-5 79 — 約32%減、EPのPASSは実質同等（16 対 15）。
スタイル: Codexは94%のコメントでdiffを含み、このセットではClaudeやGPT-5より曖昧表現が少ない。
実環境での受け入れ: GPT-5ロールアウト中は受け入れ率が大きく低下。Codexと製品変更の併用で約20–25%相対上昇し、導入前水準に回復。かつ、GPT-5導入前より受け入れコメント数は多いまま。

Codexがまだ弱い点（と取り組み）

改善は大きいものの、課題が残っていないわけではありません。現在、以下に取り組んでいます。

カバレッジの穴
モデルがPRにコメントを残さない場合、そのEPはハードFAIL。Codexの探索ヒューリスティクスを広げ、特定クラスの問題を見落としにくくします。
リファクタ過剰提案（調整済みだが未完）
「本質的なもののみ」のゲートでノイズは抑制しましたが、特に大規模diffでコメント過多になりがちなケースの閾値をさらに引き締めます。
ユーザー主導の優先度付け
GitHubのインライン順序は変更できませんが、各コメントに重大度を注記し、上から順にトリアージしやすくします。

Codex GPT-5: バグ捕捉力はそのまま、副作用は少なく

私たちの指標はシンプルです: 重要なバグを、素早く、ノイズに埋もれさせずに捕まえること。Codexはその実現を助けてくれます。GPT-5の噛み応えある推論力を保ちながら、SNRを回復させ、レイテンシを大幅に削りました。今後も測定・改善を継続し、より良い製品をリリースし続けます。

]]>

Wed, 01 Oct 2025 12:45:00 GMT

CodeRabbit MCP server integration: Code reviews with more contextの意訳です。

すべての開発チームは、孤立した状態で行うコードレビューのつらさを知っています。AIツール（あるいはチームメイト）であっても、文法やスタイル、パターンにコメントはできます。しかしビジネス要件、デプロイ依存関係、組織的な知識がなければ、全体像の半分を推測に基づいている状態です。

CodeRabbitは現在、Linear、Jira、Circle CIといったいくつかのネイティブ統合を提供しており、これらのツールがコードレビューにもたらす価値を確認してきました。だからこそ今回、CodeRabbitのMCPサーバー統合のGAリリース を発表できることがとても嬉しいです。これにより、さらに多くのコンテキストをレビューに取り込めるようになります。

本リリースで、CodeRabbitはConfluenceにあるビジネス要件からCI/CDパイプラインのシステム依存関係、さらには社内MCPサーバーのデータまで、開発エコシステム全体からコンテキストをオーケストレーションできる初のAIコードレビュープラットフォームとなりました。つまり、コードが「何を達成しようとしているのか」を本当に理解するレビューが可能になるのです。

14日間の無料トライアルを開始 → 約10分で、チーム標準に基づいたコンテキスト対応のレビューを実現。

なぜAIコードレビューにMCPが必要なのか？

開発チームは数多くのツールを使って作業しています。

要件はLinearにある
設計仕様はFigmaにある
アーキテクチャの決定はConfluenceに記録される
セキュリティ基準は監査ごとに社内Wikiで更新される

AIコードレビューツールは基本的なコンテキスト、つまりコードベース、コーディング規約、いくつかの統合から始めます。構文を解析し、パターンを確認し、改善を提案します。しかし「そのコードがチームにとって本当に機能するかどうか」を左右するコンテキストは欠けています。

MCPクライアントとしてのCodeRabbitは、組織コンテキストのコンパイラの役割を果たします。Wiki、チケット、デプロイパターンといった高レベルの入力を正確で実用的なコードレビューインサイトへと変換します。冗長な統合や脆いハックに頼ることなく、MCPはCodeRabbitのようなクライアントがLinearチケット、Confluenceドキュメント、Datadogメトリクス、Slackのディスカッションといった場所から必要なデータだけを取り込めるようにします。

実際にはどう動くのか…

CodeRabbitはレビューを開始する前に接続済みMCPサーバーを検索します。たとえばデータベーススキーマの変更はデータアーキテクチャ文書と照合され、APIエンドポイントの実装は社内Wikiに記録されたサービス設計パターンと突き合わせられます。

例: CodeRabbitによるコード整合性の確認

どのツールからでも重要なコンテキストを取り込む

従来のコードレビューツールは特定の統合を前提としています。CodeRabbitのMCP統合は、MCPサーバーを持つあらゆるシステムで動作します。独自の社内ツール、ニッチなSaaSプラットフォーム、カスタムドキュメントシステム。MCPサーバーがあれば、CodeRabbitはどこにでも接続できます。

CodeRabbitをMCPクライアントとして利用すると、3種類の異なるコンテキストからレビューの深みを得られます。

技術的コンテキスト

依存関係、パフォーマンスデータ、静的解析、テストカバレッジなど
ネイティブ統合: GitHub Actions、GitLab CI、Bitbucket Pipelines
MCPサーバー: Datadog、New Relic、SonarQube、Snyk、Grafana

レビューコメント例は以下の通りです。

ビジネスコンテキスト

要件、ユーザーストーリー、受け入れ基準など
ネイティブ統合: Linear、Jira、GitHub Issues、GitLab Issues
MCPサーバー: Confluence、Notion

レビューコメント例は以下の通りです。

組織的コンテキスト

過去の意思決定、慣例、会議メモ、組織的知識など
ネイティブ統合: PR履歴、チーム慣習
MCPサーバー: Slack、Microsoft Teams、Stack Overflow for Teams、PagerDuty

レビューコメント例は以下の通りです。

MCP統合を始めるには

CodeRabbitのMCPクライアントは最小限の設定で導入できます。ほとんどの開発チームは10分以内に最初のMCPサーバーを接続できます。

MCPサーバー対応の人気開発ツール:

Linear（ネイティブMCPサポート、5分）
Notion（MCPサーバーあり、10分）
Confluence（コミュニティ製MCPサーバー、15分）
Figma（MCPプラグインあり、10分）

コード変更がどの開発システムを参照すべきかを定義します。データベース変更はアーキテクチャ文書を、認証変更はセキュリティ文書を確認する、という具合です。

MCPサーバーを追加するのは簡単です。

CodeRabbitダッシュボードで「integrations」に進み、必要ならMCP Serversタブに切り替えます
あらかじめ用意されたMCPサーバーオプションをクリックするか、「New MCP Server」ボタンから他のMCPサーバーを追加できます
リストにないMCPサーバーについては、必要な認証情報を入力します
MCP情報をどのように利用するかの使用ガイダンスを確認します
接続が完了すると、利用可能な呼び出し一覧が表示され、カーソルを合わせると詳細を確認できます
各呼び出しをクリックしてアクセスを有効化/無効化することも可能です

あらゆるコンテキストを取り込むレビュー基盤

CodeRabbitは50以上の統合に標準対応しています。MCPを利用すれば、カスタムサーバーや社内ツールにも拡張できます。まずはLinear、Confluence、Datadog、Slackといった既存システムから始め、必要に応じて追加していけます。

次のステップ:

]]>

Tue, 30 Sep 2025 13:27:04 GMT

CodeRabbit’s code reviews help developers fix bugs and ship code. We recently wrote about benchmarking GPT-5 and opined that the model was a generational leap in reasoning for our use case of AI code reviews. As we rolled out to our wider user base, we observed that the signal to noise ratio (SNR) dipped, and users felt the reviews were too pedantic.

The release of GPT‑5 Codex, plus the product changes we made (severity tagging, stricter refactor gating, better filtering), brings our signal to noise ratio back without sacrificing the ability to find the hard bugs.

On our refreshed hard 25 PR set, GPT-5 Codex delivers about 35% higher per comment precision than GPT‑5, maintains essentially the same error pattern- level bug coverage, and cuts roughly a third of the comment volume. Combine that with the lower latency of the GPT-5 Codex model and the experience feels snappier and more focused.

What we measured (and why)

When testing GPT-5 Codex, we ran a fresh “hard 25” suite of OSS PRs (slightly tougher than the previous post). These are 25 of the most difficult pull requests from our dataset. These PRs represent real-world bugs that span:

Concurrency issues (e.g. TOCTOU races, incorrect synchronization)
Object-oriented design flaws (e.g. virtual call pitfalls, refcount memory model violations)
Performance hazards (e.g. runaway cache growth, tight loop stalls)
Language-specific footguns (e.g. TypeScript misuses, C++ memory order subtleties)

We evaluated the following models: :

GPT‑5 Codex
GPT‑5
Claude (Sonnet 4 and Opus‑4.1)

What we looked for

We gave each of the models a score based on how they performed on these factors:

EP (Error Pattern). The specific underlying defect seeded in a PR (e.g., lost wakeup on a condition variable, inconsistent lock order, logic bug hidden in boolean soup).
EP PASS/FAIL (per PR). PASS if the model left at least one comment that directly fixes or credibly surfaces that PR’s EP. If it left no comment on that PR, it is counted as FAIL for that PR.
Comment PASS/FAIL (per comment). PASS if the comment directly fixes or credibly surfaces the EP, otherwise FAIL.
Per comment precision. PASS comments ÷ all comments. This is our operational SNR for this dataset.
Important share. Every PASS is Important. Comments that do not solve the EP but still flag a genuine critical or major bug (like a use after free, double free, lost wakeup, memory leak, null deref, path traversal, catastrophic regex) are also Important. Everything else is Minor.

Scoreboard - Codex improves signal-to-noise

Takeaway: Codex finds essentially the same EPs as GPT‑5 but does it with fewer, tighter comments, so the signal-to-noise ratio is improved.

What this means: Codex covered 20 of the 25 PRs (the other 5 count as uncovered fails). Despite fewer comments overall, Codex passed slightly more EPs (16 vs. 15) and landed far more Important comments. Over half its comments are either direct hits on the issue we were representing in that PR or flag other EP critical bugs. GPT‑5 and Claude trailed in precision and Importance share at about 40%.

The verdict: Same EP coverage, less noise: Codex retains GPT-5’s bug finding power but trims the chatter with about 32% fewer comments than GPT‑5 (54 vs. 79) and about 35% higher per comment precision (46.3% vs. 34.2%). Claude looks similar to GPT‑5 on coverage but is chattier, with lower precision.

Style and structure (why Codex reads like a patch)

Codex replies are consistently action forward (diffs almost always included) and low hedge. That lines up with what reviewers want: suggestions that translate directly into a patch.

The kinds of bugs Codex is good at

Across the suite, all models did well on concurrency and synchronization, but Codex stood out for:

Condition variable misuse and lost wakeups. Codex proposes the canonical patterns (wait under lock, check predicate in a loop) and supplies concrete diffs.
Lock ordering and deadlocks. It calls out inconsistent acquisition order and suggests a lock hierarchy or moving work outside critical sections, again with actionable edits.
Subtle API and performance traps. Examples include catastrophic regex backtracking and memory model orderings. Codex pinpoints and patches them cleanly.

Why GPT‑5 felt noisier, and how we fixed that

What we saw: When we moved from Sonnet and Opus to GPT‑5 our total comments per review nearly doubled. Even though hallucinations fell to under 1% and negative tone fell to under 1%, the acceptance rate (share of comments judged helpful) declined significantly relative to its baseline prior to the adoption of GPT-5.

What changed with Codex: With GPT‑5 Codex plus some product changes we’ve implemented, our acceptance climbed back to prior levels while overall comment volume stayed higher than the pre-GPT‑5 era. Put simply: our tool is now back to its prior helpfulness level, while still finding as many real issues as GPT-5.

Two product levers helped with this:

We created severity and review type tags, front and center
- Review Types: We created review types to allow users to self-select what kinds of comments they wanted to read including: ⚠️ Potential issue, 🛠️ Refactor suggestion, 🧹 Nitpick (nitpicks are hidden unless you opt into Assertive mode)
- Severity: We now tag comments by severity to better signal which matter more than others. Our tags are: 🔴 Critical, 🟠 Major, 🟡 Minor, 🔵 Trivial, ⚪ Info
- We always show bugs (Critical, Major, Minor) but don’t always show other types of comments. Refactors show only if the model marks them as essential. Users who want everything can still switch to Assertive mode.
We implemented stricter filtering and aggregation
- We collapse duplicative notes and filter out “nice to have” suggestions unless they have clear ROI for the user. The result: fewer, denser comments, and fewer reasons to tune out.

Latency: Fast matters & Codex is faster

A five minute review is fine. Thirty minutes is not. GPT‑5’s “always think hard” style significantly increased time to first token and overall review time. But we shipped several pipeline optimizations recently and Codex helps further reduce the latency that GPT-5 introduced.

Codex’s variable or elastic thinking uses less depth when it is not needed, improving time to first output and end-to-end review time in practice. Net: faster reviews, earlier feedback, better flow for the human in the loop.

What a CodeRabbit user should expect

Now that Codex is implemented, how will that change your AI code reviews?

The same raw bug finding power
- On the refreshed hard 25, Codex passed 64% at the EP level vs. 60% for GPT‑5 (our previous set of PRs had GPT-5 passing 77.3%). No loss of the important wins GPT-5 helped with.
Fewer but stronger comments
- About 32% fewer total comments than GPT‑5, with about 35% higher SNR (per comment precision). More patches, less prose.
Severity tags to focus your review
- Critical and Major issues float to the top with our new severity tags. Refactors are gated. Nitpicks are opt-in. You will spend less time scanning comments and more time fixing.
A faster feedback loop
- Codex’s leaner reasoning plus pipeline improvements bring time to first helpful comment down. You will feel it.

Quantitative appendix (for the curious)

We know you love data! Here’s some other stats we found interesting:

Per comment precision (SNR) uplift: Codex 46.3% vs. GPT‑5 34.2% — about +35% relative.
Comment volume delta: Codex 54 vs. GPT‑5 79 — 32% fewer comments, with EP passes essentially unchanged (16 vs. 15).
Style: Codex includes diffs in 94% of comments and uses hedging far less than Claude and GPT‑5 on this set.
Acceptance (real world): During GPT‑5 rollout, acceptance dropped significantly. With Codex plus changes, it rose by about 20–25% relative and returned to prior levels while still delivering more accepted comments than pre GPT‑5.

Where Codex still needs work (and what we are doing)

These improvements are great but that doesn’t mean that there aren’t still issues with Codex. Here are some that we are actively working on:

Coverage gaps. When a model leaves no comment on a PR, that is a hard fail for that EP. We are widening Codex’s search heuristics so it is less likely to miss entire classes of issues.
Refactor over eagerness (tuned, not solved). The “essential only” gate curbs refactor noise, but we will keep tightening the threshold, especially on large diffs where a high number of comments would be overwhelming.
User driven prioritization. We cannot change GitHub’s in-line ordering, but we annotate every comment with severity so you can triage from the top down without hunting.

Codex GPT-5: All of the great bug catching ability, fewer downsides

Our north star is simple: catch the bugs that matter, quickly, without making you sift through noise. Codex helps us do that. It keeps the bite of GPT‑5’s reasoning while restoring SNR and shaving latency down significantly. We will keep measuring, improving, and shipping a better product every release.

]]>

Tue, 23 Sep 2025 15:41:30 GMT

Last week, we announced CodeRabbit’s $60 million Series B. To celebrate, we did what any responsible, developer-focused software company would do: we made a funny video.

Not with all the money, to be clear. But we did decide to celebrate with something fun, absurd, and painfully relatable for any dev team trying to keep up with the flood of AI-generated PRs.

Introducing… “When AI coding agents backfire: A short film”

https://youtu.be/glfB3KLQR7E?feature=shared

It’s a short mockumentary-meets-sitcom about what really happens when “AI velocity” turns into a PR review backlog.

One reviewer.
Dozens of notifications.
84 open PRs.
And one overly eager coworker named Brad who just wants feedback.

The cast

To bring it to life, we pulled in beloved developer educator (and influencer) Aaron Francis to star as our beleaguered reviewer. He’s the guy who just wanted to ship features faster and now can’t go to the kitchen (or even leave his house at 8 a.m.) without Brad asking about his PR.

And speaking of Brad: the inimitable Austin von Johnson plays him to perfection. Brad’s a developer who can crank out AI-generated PRs at lightning speed but cannot, under any circumstances, wait patiently for a review. His lurking, his post-its, his hoodie PR ambushes… let’s just say he was perfectly committed to the bit.

The very real problem behind the joke

The short film is funny, but the problem it highlights is real:

AI coding tools crank out code faster than teams can review it
Review backlogs balloon while productivity drops
Senior engineers get buried in endless PRs
Review quality is uneven, risk goes up, and you have to deal with more issues
And suddenly, the promise of velocity feels more like a nightmare.

Here’s what we’re doing about it

CodeRabbit exists to clear the backlog, not add to it. Our AI code reviews pull in dozens of points of context (requirements, tests, CI, past diffs, ownership) to catch bugs you’d miss, reduce reviewer fatigue, and move PRs through faster—without turning teammates into… Brad.

Ship faster, review smarter, and keep your sanity. Also, avoid creating a… Brad.

👉 Watch “When AI coding agents backfire: a short film” right here. And if you’ve ever been chased around the office about a PR, please, send it to your team’s Brad.

]]>

Fri, 19 Sep 2025 06:40:35 GMT

Ballooning context in the MCP era: Context engineering on steroids意訳です。

かつて、LLMにコンテキストを渡すには、ハックやベクターストラテジー、そしてお祈り…と、過剰に複雑なRAGパイプラインをつなぎ合わせる必要がありました。そこに登場したのが Model Context Protocol (MCP)。外部データを本番環境のモデルに提供するための、クリーンでモジュール的手法です。MCPは、実際に「何かをする」エージェントシステムを構築する人々にとって、瞬く間に標準的なプロトコルとなりました。

今ではほとんどのテック企業がMCP機能を打ち出していますが、その理由は明白です。MCPはコンテキストのロジックとアプリケーションロジックを分離し、信頼性を向上させ、複雑なワークフローでのプロンプト構築の混乱を抑える役割を果たします。

私たちはしばらく前からコンテキストエンジニアリングの領域に深く取り組んでおり、今回独自のMCPクライアントを立ち上げるにあたり、コードレビューにより豊かなコンテキストを注入できることに大いに期待しています。しかし正直に言えば、豊富なコンテキストにはリスクも伴います。MCP時代の隠された真実はこうです：かつて欲していたコンテキストに、今や私たちは溺れそうである。ログやトレース、diffなど "関連"ファイルが増え、モデルが本当に必要としているものが見えにくくなっています。

役立つ入力はすぐにトークンの膨張、ノイズ、パフォーマンス劣化につながります。引用付きハルシネーション、レイテンシの急上昇、あるいはカフェインを摂りすぎたインターンが書いたような散漫なレビュー。良いコンテキストエンジニアリングとは「すべて詰め込む」ことではなく、「何を省くか」を知ることでもあります。そしてMCP以降、そのバランスを取るのはより難しく、より重要になっています。

この記事では、膨張するコンテキスト問題 の詳細、その副作用、そして私たちがそれにどう立ち向かっているかを解説します。MCPを用いたLLM機能を開発している方で、プロンプト形のブラックホールを作り出したくない方に役立つ内容です。

MCPクライアント & サーバーにおける「膨張するコンテキスト問題」

MCPサーバーとクライアントは、モデルに膨大な情報を渡すことを容易にします：ログ、トレース、diff、設定、チケット、さらには誰も所有を覚えていないリポジトリの隅まで。すべてがモデルの手の届くところにあります。しかし、ここで重要な問いがあります：「コンテキストが多ければ多いほど良いのか？」答えは間違いなく「NO」です。

過剰なコンテキストは、試験勉強で図書館全体を読むようなもの。ノイズは増えても、知識にはなりません。コンテキストが制御されなければ、次の3つの問題がすぐに現れます：

トークンの膨張
LLMには無限のキャパシティはありません。入力ウィンドウにはコストと限界があり、念のため…と詳細情報を詰め込みすぎれば、コストは増大してスループットは低下し、不要なテキストに予算を浪費します。
関連性の低下
情報が多いほど出力が良くなるわけではありません。むしろ悪化することが多いのです。無関係または冗長なスニペットがシグナルを希釈し、モデルはインサイトではなく枝葉に追われます。
レイテンシ
追加されるログやdiff、スタックトレースはすべて取得・処理され、プロンプトに押し込まれます。コンテキスト構築がボトルネックとなり、レビュー速度を著しく低下させます。

要するに、膨張するコンテキストはMCPの優雅さを逆にリスクへと変えてしまいます。意図的なコンテキストエンジニアリングがなければ、出力を磨くどころか押しつぶしてしまうのです。

コンテキストが害になるとき

実際には、以下の3つの典型的な問題が発生します：

コンテキスト混乱
モデルが無関係な詳細をシグナルと誤解してしまうケース。例えば認証ロジックを更新するPRに、無関係なテストフィクスチャが含まれていると、モデルはフィクスチャをレビューし始め、実際の変更と関係のないコメントを生成します。
コンテキスト衝突
コンテキスト同士が矛盾する場合です。例えば最新のスキーママイグレーションと古いドックストリングが同時に含まれると、モデルはどちらを信じるべきか迷い、結果として全方位的で自信のないレビューを生成します。まるで決断できないレビュアーのように。
コンテキスト汚染
最も厄介なのは誤った情報が混入するケースです。無用な関連ファイルや、誤ってインデックス化されたスニペットが注入されると、存在しないコードを引用するようになります。レビューでは、存在しないファイルのバグに言及し、開発者を混乱させ、時間を無駄にし、信頼を損ないます。

これはコードレビューに限りません。サポートボットが無関係なチケットを引っ張ってくる場合や、リサーチアシスタントが周辺論文に気を取られる場合、セキュリティエージェントがノイジーなログを証拠として扱う場合なども同様です。いずれにしても、間違ったコンテキストは「ない方がまし」なのです。

MCPサーバーでのコンテキスト過負荷を防ぐ主要パターン

MCP時代の問題が、膨張するコンテキストだとすれば、解決策は情報の流入を止めることではありません。意図を持って選別・圧縮・提供することです。MCPのコンテキストは、生の素材をモデルに渡す前にきちんと設計されたデータ変換プロセスを経るべきものです。私たち自身のコードレビュー用MCPクライアントでも、コンテキストを高シグナル・低ノイズに保つために以下のパターンを採用しています。

コンテキストの重複排除と差分化
冗長な入力はトークン浪費の最短ルートです。同一のスタックトレース、繰り返しのログ、変更されていないdiff部分は10回も登場する必要はありません。クライアントは重複を検出して折りたたみ、新しい部分だけを強調します。この原則は他の領域にも適用可能です：重複するサポートチケットをまとめ、繰り返しのトレースを圧縮し、差分のみを残します。
コンテキスト要約パイプライン
MCP出力が依然として大きすぎる場合、LLM自体が要約して小さくすることも可能です。代償は圧縮と忠実度のトレードオフ：要約はニュアンスを失う可能性がありますが、詳細に溺れるよりはましです。実際には、重要ファイルは生のdiff、優先度の低いコンテキストは要約といったハイブリッド設計を採用します。
コンテキストの優先順位付けと切り捨て
プルーニングや要約後でも、どれを最初に入れ、後に回し、容量不足時に捨てるかを決める必要があります。MCPクエリごとにトークン予算を設定することは不可欠です。そうしなければプロンプトが予測不能に膨張します。私たちは切り捨てを前提にした設計を試し、場合によっては概要を先頭に、詳細を後半に配置するなど調整しています。
コンテキストの隔離
すべてのコンテキストを最初のプロンプトに含める必要はありません。サブタスクごとに専用のコンテキストスレッドを持たせるべきです。例えば、私たちのMCPクライアントではテスト失敗は専用のレビューサブスレッドに置かれ、メインのレビューコンテキストを妨げません。これにより混乱を減らし、長い対話でも明瞭さを保てます。
継続的な改善と学習
コンテキストエンジニアリングは静的ではありません。モデルのフィードバックや人間による修正を取り入れ、優先順位を調整していきます。重要なのは可観測性です。モジュールごとにプロンプト入力を記録し、何が通って何が無駄かを把握します。MCPダッシュボードやトークンヒートマップのようなツールが、予算超過や不要な入力を可視化します。

MCPサーバー & クライアントにおけるアンチパターン

MCP時代はコンテキスト取得を容易にしました。おそらく「容易すぎる」ほどに。以下のようなアンチパターンがよく見られます：

ベクトル無差別投入
ベクトルDBは「関連」情報を見つけるのに優れていますが、それを万能の答えと見なすのは危険です。曖昧に関連するスニペットをすべて投入すると、関係のないファイルへのコメントや古いコードへの指摘で溢れるレビューになります。コンテキストの不適合はトークンの無駄だけでなく、モデルのパフォーマンスを引き下げる要因になります。
「全部突っ込め」方式
すべてのログ、diff、ドックストリングをコンテキストに放り込み、あとは神に祈るやり方です。コストの増加、レイテンシの悪化、結果の予測不能を保証します。モデルは重要な部分と不要な部分を区別できないため、全方位的で散漫なレビューを生成します。矛盾が混入すれば、モデルは曖昧さを埋めるために幻覚を引き起こします。

要するに、コンテキストは多ければ良いというものではありません。フィルタリング、優先順位付け、設計がなければ、「情報全部」はすぐにノイズに変わり、システムを遅く、鈍く、高コストにしてしまいます。

私たちのMCPクライアントでのアプローチ

MCP時代において、コンテキストは「王様」です。しかし正直なところ、その王様は酔いすぎて上下も分からなくなっていることがあります。課題はもはや「コンテキストを得ること」ではなく、「それを制御すること」です。優れたコンテキストエンジニアリングには、緻密な変換パイプライン、徹底的な優先順位付け、そして改善を続ける謙虚さが必要です。これを怠れば、トークン膨張、レイテンシ、混乱したレビューを招きます。うまく実践すれば、ワークフローに沿った鋭い出力が得られます。

私たちは自社のコードレビュー用MCPクライアントでこれを実感しました。初期段階では全ログ・全ファイルをそのまま渡していました。その結果は高コストで、役に立たないほど散漫なレビューです。そこで重複排除、要約、タスク専用の隔離を導入したところ、レビュー品質が飛躍的に向上しました。すべてを指摘するのではなく、本当のクロスファイルリスクに集中するようになり、トークン消費とレイテンシも低下しました。

これこそが良いコンテキストエンジニアリングの成果です：情報量が多いのに散漫ではなく、本質を突いたレビュー。そしてそれこそが、私たちのMCPクライアントで実現しようとしていることです。

👉 コンテキスト設計の正しい姿を体験してみませんか？今すぐ 14日間の無料トライアル でAIコードレビューをお試しください。

]]>

Fri, 19 Sep 2025 06:21:31 GMT

CodeRabbit commits $1 million to open source softwareの意訳です。

オープンソースは現代のソフトウェア開発の基盤です。パッケージマネージャーや開発ツールから、フレームワーク、インフラに至るまで、今日私たちが使うほとんどすべてのソフトウェアはオープンソースプロジェクトによって支えられています。CodeRabbit自体もそうです。これらのプロジェクトは、数え切れないほどの時間を費やし、維持・進化させ続けている開発者コミュニティによって構築・保守されています。

本日、私たちは オープンソースソフトウェアへのスポンサーシップとして100万ドル（USD）の拠出 を発表します。これは 6,000万ドルのシリーズB資金調達 に続くものであり、オープンソースが可能にしてくれたことへの感謝、そしてその未来への投資の重要性に対する私たちの信念を表しています。

なぜ今、オープンソースへの支援がこれまで以上に必要なのか

生成AIはソフトウェア開発を変革していますが、同時にオープンソースのメンテナーに新たな負荷を与えています。高品質なコントリビューションの増加と並行して、AI生成によるPRスパム（繰り返し・低品質・時には不安定なコードの提出）が急増し、メンテナーを圧倒しています。

私たちはメンテナー自身から、この膨大なノイズがどれほど負担となっているかを直接聞いてきました。CodeRabbitでは、スパムをフィルタリングし、コード品質を向上させ、メンテナーの作業負荷を軽減する AI駆動のコードレビューと人間による監視を組み合わせたツールを開発しました。私たちはこのAIコードレビューツールをすべてのオープンソースプロジェクトに無料で提供しています（詳細はこちら）。

しかし、ツールだけでは十分ではありません。持続可能なオープンソースには、金銭的支援、認知、そしてコミュニティ間をつなぐ強固な架け橋が必要です。

20万ドルから100万ドル：より深いコミットメントへ

今年初め、私たちは オープンソースへの20万ドルの誓約 を発表し、以下のようなプロジェクトを支援しました：

pnpm: ディスクスペース効率に優れたパッケージマネージャー
Biome (biomejs): 次世代のJavaScript/TypeScript用リンター兼フォーマッター
AST Grep (Herrington Darkholme): 構造的コード検索によるスマートなコード解析
iTerm2 (George Nachman): 開発者のワークフローを刷新したターミナルエミュレーター
Markdown Lint (David Anson): ドキュメントを明確かつ一貫性のある状態に保つツール

この誓約は始まりにすぎません。今回のシリーズB資金調達によって、私たちは支援額を 100万ドルに拡大 し、エコシステム全体のプロジェクトやメンテナーが正当な評価とリソースを得られるようにします。

スポンサーシップ申請はこちらより行ってください

CodeRabbitとOSS：エコシステム全体をつなぐ架け橋へ

スポンサーシップは始まりに過ぎません。オープンソースが直面している多くの課題 ― 持続可能性、セキュリティ、開発者の燃え尽き（バーンアウト） ― は特定のプロジェクトに限られた問題ではありません。これらはコミュニティやエコシステム全体に広がっています。

だからこそ、CodeRabbitは メンテナー同士をつなぎ、協力を促進し、プロジェクト間で解決策を共有する 取り組みにも力を入れています。共同スポンサーシップ、共同イニシアチブ、コミュニティ主導のツールに関する議論を通じて、孤立した支援ではなく、エコシステム全体を強化することを目指しています。

もしあなたがメンテナーやコントリビューターで、こうした議論に参加したいと考えているなら、ぜひご連絡ください。CodeRabbitチームや他のオープンソースリーダーとつながるためにDiscordに参加してください。

オープンソースプロジェクト向けの無料CodeRabbit利用

最後に改めてお伝えします：CodeRabbitはオープンソースに対して、無料で提供されています。すべてのメンテナー、コントリビューター、コミュニティは、私たちのプラットフォームを利用してPRノイズを減らし、コード品質チェックを自動化し、より意味のあるコントリビュートに時間を割けるようになります。

詳細はこちらから確認し 資金提供の申請を行ってください

]]>

Thu, 18 Sep 2025 15:56:44 GMT

Open source is the foundation of modern software development. From package managers and developer tools to frameworks and infrastructure, open source projects power nearly every piece of software we use today – including CodeRabbit itself. These projects are built and maintained by communities of developers who dedicate countless hours to keeping them alive, secure, and evolving.

Today, we’re proud to announce a $1 million USD commitment to open source software sponsorships. This commitment comes on the heels of our $60 million Series B funding round and it reflects both our gratitude for what open source makes possible and our belief in the importance of investing in its future.

Why open source needs support now more than ever

Generative AI is transforming software development, but it’s also putting new pressures on open source maintainers. Alongside the surge of high-quality contributions, there has been a sharp rise in AI-generated PR spam: repetitive, low-quality, and sometimes insecure code submissions that overwhelm project maintainers.

We’ve heard firsthand from maintainers about how draining this flood of noise can be. At CodeRabbit, we’ve built tools that filter out spam, elevate code quality, and reduce maintainer workload by blending AI-driven code review with human oversight. We’ve made our AI code review tool free for use on all open source projects (more about that here).

But tools alone aren’t enough—sustainable open source requires financial support, recognition, and stronger bridges between communities.

From $200K to $1M: Deepening our commitment

Earlier this year, we announced a $200,000 pledge to open source, supporting projects like:

pnpm: A disk-space–efficient package manager
Biome (biomejs): A next-generation linter and formatter for JavaScript and TypeScript
AST Grep (Herrington Darkholme): Structural code search for smarter code analysis
iTerm2 (George Nachman): A terminal emulator that redefined developer workflow
Markdown Lint (David Anson): Ensuring docs stay clear and consistent

That pledge was only the beginning. With our new Series B funding, we’re scaling our support to $1 million, ensuring that more projects and maintainers across the ecosystem receive the recognition and resources they deserve.

Apply for sponsorship here.

CodeRabbit & OSS: Building bridges across the OSS ecosystem

Sponsorship is only part of the story. Many of the challenges open source faces—sustainability, security, and developer burnout—aren’t isolated to a single project. They stretch across communities and ecosystems.

That’s why CodeRabbit is also working to connect maintainers, foster collaboration, and share solutions across projects. Whether through joint sponsorships, shared initiatives, or community-driven tooling conversations, we aim to strengthen the ecosystem as a whole rather than supporting it in silos.

If you’re a maintainer or contributor who wants to join these conversations, we’d love to hear from you. Join our Discord to connect with the CodeRabbit team and other open source leaders.

Free CodeRabbit access for open source projects

Finally, a reminder: CodeRabbit is free for open source. Every maintainer, contributor, and community can use our platform to cut through PR noise, automate code quality checks, and free up more time for meaningful contributions.

Learn more and apply for funding here.

]]>

Wed, 17 Sep 2025 16:04:55 GMT

Every dev team knows the pain of code reviews if performed in isolation. An AI tool (or even a teammate) can comment on syntax, style, and patterns, but without business requirements, deployment dependencies, or organizational knowledge, it’s just guessing at half the story.

CodeRabbit currently has a number of native integrations including Linear, Jira, and Circle CI. We have seen the value that context from those tools provide to code reviews. That’s why we’re excited to announce the GA of CodeRabbit’s integration with MCP servers. This will allow you to bring in even more context into your reviews.

With this launch, we become the first AI code review platform that orchestrates context from across your entire development ecosystem from business requirements in Confluence to system dependencies in your CI/CD pipeline to data from any internal MCP servers. All to provide code reviews that actually understand what your code is trying to accomplish.

Start your 14-day trial → Get context-aware reviews that reference your actual team standards in ~10 minutes.

Why MCP for AI code reviews?

Development teams operate across dozens of tools:

Requirements live in Linear
Design specifications exist in Figma
Architectural decisions get documented in Confluence
Security standards evolve in internal wikis after each audit

AI code reviewers start with basic context: your codebase, some coding guidelines, maybe a few integrations. They analyze syntax, check patterns, and suggest improvements. But they miss the context that determines whether code actually works for your team.

As a MCP client, CodeRabbit acts as a compiler for organizational context. It takes high-level inputs - your wikis, tickets, deployment patterns - and compiles them down into precise, actionable code review insights. Instead of bloated integrations or brittle hacks, MCP lets clients like CodeRabbit pull in just the right data from your MCP servers from places like your Linear tickets, Confluence docs, Datadog metrics, or Slack discussions.

What it looks like in practice…

CodeRabbit searches connected MCP servers before starting a review. For example, database schema changes might get checked against data architecture documents. API endpoint implementations might get verified against service design patterns documented in internal wikis.

Example: CodeRabbit verifies code consistence

Bring in the context matters to you… from any tool

Traditional code review tools require specific integrations. CodeRabbit's MCP integration works with any system with an MCP server. Your proprietary internal tools, boutique SaaS platforms, custom documentation systems. If there's an MCP server, CodeRabbit can connect.

With CodeRabbit as an MCP client, you’re reviews gain depth from bringing in three different types of context.

Technical context.

Think dependencies, performance data, static analysis, and test coverage.

Native integrations: GitHub Actions, GitLab CI, Bitbucket Pipelines
MCP Servers: Datadog, New Relic, SonarQube, Snyk, Grafana
Example Review Comment:

Business context.

This includes things like requirements, user stories, and acceptance criteria.
Native integrations: Linear, Jira, GitHub Issues, GitLab Issues
MCP Servers: Confluence, Notion
Example Review Comment:

Organizational context.

We also pull in things like prior decisions, conventions, meeting notes, and institutional knowledge.
Native integrations: PR history, Team conventions
MCP Servers: Slack, Microsoft Teams, Stack Overflow for Teams, PagerDuty
Example Review Comment:

Getting started with MCP integration

Setting up CodeRabbit's MCP client requires minimal configuration. Most development teams can connect their first MCP server in under 10 minutes.

Popular development tools with MCP server support:

Linear (native MCP support, 5 minutes)
Notion (MCP server available, 10 minutes)
Confluence (community MCP server, 15 minutes)
Figma (MCP plugin available, 10 minutes)

Define which code changes should search which development systems. Database changes check architecture documentation. Authentication changes check security documentation.

Adding an MCP server is easy:

In the CodeRabbit dashboard, head over to integrations > and toggle to the MCP Servers tab if needed
You can click on one of the pre-configured MCP server options or the New MCP Server button to add other MCP servers.
For MCP servers not on the list, enter the relevant credentials.
Note the usage guidance which serves as context for how the MCP information should be used.
Once connected. You can see the available calls and hover over them to see more details.
You can also click on each call to enable/disable access.

A review platform that brings in all your context

CodeRabbit works out of the box with 50+ integrations. With MCP, you can extend it to your custom servers and internal tools. Start with the systems you already use — Linear, Confluence, Datadog, Slack — and add more as you go.

Next steps:

]]>

Wed, 17 Sep 2025 15:57:17 GMT

Once upon a time, getting context into an LLM meant stringing together hacks, prayers, vector strategies, and overly complex RAG pipelines. Then came the Model Context Protocol (MCP), a clean, modular way to serve external data to models in production. It quickly became the protocol of choice for anyone building agentic systems that are trying to actually do things.

Every tech company is now launching MCP functionalities – and for good reason. MCP separates context logic from application logic, improves reliability, and helps tame the chaos of prompt construction in complex workflows.

We’ve been deep in the context engineering space for a while, and as we launch our own MCP client, we’re genuinely excited by how it lets us inject richer context into our code reviews. But let’s be honest: with great context comes great risk. Because here’s the dirty secret of the MCP era: most of us are now drowning in the context we used to beg for. More logs, more traces, more diffs, more "relevant" files and way less clarity about what the model actually needs.

What starts as helpful input quickly turns into token bloat, noise, and degradation in model performance. Think hallucinations with citations, latency spikes, or reviews that read like they were written by an over-caffeinated intern who rambles. Good context engineering isn’t about cramming in everything, it’s also about knowing what to leave out. And in the aftermath of MCP, that balance is harder (and more important) than ever.

In this article, we’ll break down the ballooning context problem, what happens when well-intentioned context goes rogue, and how we’re tackling it head-on. If you’re shipping LLM-based features with MCP and want to avoid accidentally building a prompt-shaped black hole, this blog is for you.

The “Ballooning Context Problem” with MCP clients & servers

MCP servers and clients make it easy to hand models a firehose of information: logs, traces, diffs, configs, tickets, and sometimes even that dusty corner of the repo nobody remembers owning. It’s all right there at the model’s fingertips. But here’s the question: is more context always better? Definitely not!

Too much context is like cramming for an exam by reading the entire library. You end up with noise, not knowledge. And when context goes unchecked, three problems show up fast:

Token bloat. LLMs don’t have infinite stomachs. Input windows are expensive and finite, and stuffing them full of “just in case” details means higher costs, slower throughput, and wasted budget on irrelevant text.
Relevance decay. More information doesn’t mean better outputs. In fact, it often means worse. Irrelevant or redundant snippets dilute the signal, and the model starts chasing tangents instead of insights.
Latency. Every extra log, diff, or stack trace has to be fetched, processed, and shoved into the prompt. Context building becomes the bottleneck, dragging review speed down to a crawl.

In short, ballooning context turns the elegance of MCP into a liability. Without deliberate context engineering, the very thing meant to sharpen outputs can just as easily smother them.

When context hurts

In practice, we see three common pathologies:

Context confusion. This happens when the model latches onto irrelevant detail and treats it as signal. Imagine a pull request that updates authentication logic but the context dump also includes unrelated test fixtures. The model might start reviewing the fixtures instead, producing comments that feel informed but have nothing to do with the actual change.
Context clash. Not all context agrees with itself. Suppose a code review includes both the latest schema migration and an outdated docstring that contradicts it. The model now has to “choose” which source to trust. Often, it hedges, producing muddled reviews that cover every angle without real confidence: the LLM equivalent of a reviewer who can’t commit.
Context poisoning. The most insidious case is when bad information makes it into the context. A hallucinated “related file” or a mis-indexed snippet gets injected, and suddenly the model is citing non-existent code. In a review, that looks like a comment about a bug in a file that doesn’t exist, confusing developers, wasting time, and eroding trust.

And it’s not just code reviews. The same pitfalls show up anywhere context gets overstuffed: customer support bots pulling in irrelevant tickets, research assistants distracted by tangential papers, or security agents treating noisy logs as hard evidence. In each case, the wrong context is worse than no context at all.

Key patterns to combat context overload with MCP servers

If the problem of the MCP era is ballooning context, the solution isn’t to stop piping in information — it’s to curate, compress, and serve it with intent. MCP context should be treated as raw material that goes through a well-designed data transformation process before it ever reaches the model. For our own MCP client for code reviews, we’ve leaned on a set of patterns that keep context high-signal and low-noise.

Context deduplication and differencing
Redundant inputs are the fastest way to waste tokens. Identical stack traces, repeated log lines, or unchanged sections of a diff don’t need to appear ten times. Our client identifies duplicates, collapses them, and highlights only what’s new. The same principle applies in other domains: collapse duplicate customer tickets, compress recurring traces, and reduce context to delta rather than bulk.
Context summarization pipelines
Sometimes raw MCP output is still too big. Here, LLMs themselves can help by summarizing retrieved context into something smaller. The tradeoff is compression vs. fidelity: a summary might miss nuance, but the alternative is a model drowning in detail. In practice, we use hybrid designs: raw diffs for high-priority files, summaries for less-critical context.
Context prioritization and truncation
Even after pruning and summarizing, you still need to decide what goes first, what can be deferred, and what gets dropped if there isn’t room. Setting a token budget per MCP query is critical, or else prompts will balloon unpredictably. We’ve experimented with truncation-aware designs; sometimes front-loading summaries for quick orientation, other times end-loading detail for deep dives. The “right” design depends on the workflow and the model’s feedback loop.
Context quarantining
Not every piece of context belongs in the first prompt. Subtasks should carry their own dedicated context threads, so the model sees exactly what it needs when it needs it. For example, in our MCP client, test failures live in a dedicated review sub-thread rather than clogging the main review context. This approach reduces confusion and helps preserve clarity across long interactions.
Iteration and learning
Context engineering isn’t static. We use model feedback and human-in-the-loop corrections to tune priorities over time. Observability is key: logging actual prompt inputs, broken down per module, lets us see what’s getting through and what’s wasted. Tooling like MCP dashboards or token heatmaps can highlight where budgets are blown or irrelevant inputs are sneaking in.

Anti-patterns to avoid with MCP servers & clients

The MCP era makes context retrieval easy. Maybe too easy. A couple of common anti-patterns are worth calling out:

Blind vector stuffing
Vector databases are great at surfacing “relevant” chunks of information, but treating them as an oracle is a recipe for trouble. Stuffing in every vaguely related snippet means you get reviews full of tangents: comments about files that weren’t touched, or nitpicks based on stale code. Context irrelevance doesn’t just waste tokens — it actively drags down model performance by pulling attention away from the real task.
“Just give it everything”
The brute-force approach: dump every log, diff, and docstring into the context window and pray. This guarantees high costs, long latencies, and unpredictable results. The model can’t tell which parts are critical and which are fluff, so you end up with bloated reviews that read like they were written by an overeager intern trying to cover every angle. Worse, when contradictions sneak in, the model hedges or hallucinates to reconcile them.

In short: more context isn’t always better. Without filtering, prioritization, and careful design, “everything” quickly turns into noise that makes the system slower, dumber, and more expensive.

The approach we took with our MCP client

In the MCP era, context is king. But let’s be honest: sometimes it’s a king that’s had one too many and can’t tell up from down. The challenge isn’t getting context anymore; it’s taming it. Great context engineering requires careful transformation pipelines, ruthless prioritization, and the humility to keep iterating. Done poorly, you get token bloat, latency, and reviews that sound confused. Done well, you get sharper outputs that scale with your workflow.

We’ve seen this firsthand in our own MCP client for code reviews. When testing, we initially passed full logs and entire file sets straight through. The result? Expensive reviews that rambled more than they helped. Once we introduced deduplication, summarization, and task-specific quarantining, review quality jumped. Instead of commenting on everything, the model zeroed in on real cross-file risks, while token use and latency both dropped.

That’s the payoff of good context engineering: reviews that feel informed, not bloated. And that’s what we’re building toward with our MCP client.

👉 Ready to see context done right? Start your 14-day trial of our AI code reviews.

]]>

Tue, 16 Sep 2025 23:02:47 GMT

CodeRabbit CLI - Free AI code reviews in your CLIの意訳です。

CodeRabbitは、PRにおけるAIコードレビューから始まりました。5月には、そのインテリジェンスをVS Code、Cursor、Windsurfに拡張。そして今、開発者に愛されるAIコードレビューをコマンドラインにまで広げる「CodeRabbit CLI」を発表します。つまり、私たちは最も包括的なAIコードレビューツールになったのです。あなたが働く場所なら、どこでも動作します。

CodeRabbit CLIは、開発者がターミナルで直接セルフレビューを行えるようにします。自動化されたインテリジェントなコード分析機能を提供し、問題の早期発見と一貫したコード規約を維持し、CLI内でAIコーディングエージェントとシームレスな統合によって自律的なコーディングを実現します。

コードの「Vibeチェック」― CLIでも

https://youtu.be/IqBKf4u5MtA

CodeRabbit CLIは、PRやIDEレビューと同じ包括的な分析を提供し、バグの早期発見に役立ちます。CodeRabbit CLIはレート制限付きで無料利用できますが、Proプランでは制限が大幅に緩和され、さらに以下のような追加機能を利用できます。

コンテキスト対応分析: Git連携を活用し、静的解析ツールやセキュリティスキャナ、コードグラフの関係性機能など40以上の情報源を統合して、最も包括的なレビューを実現
プレコミットレビュー: マシンを離れる前に変更を分析し、多層的なレビューを提供
ワンクリック修正: 簡単な修正は即適用、複雑な問題はAIエージェントに完全なコンテキスト付きで引き渡し
コーディング規約検出: agent.md、claude.md、Cursor rulesなどのコーディングエージェント設定ファイルを自動検出

CodeRabbit CLI: どこでも、なんでも動作

ターミナルネイティブであるため、CodeRabbit CLIは以下に対応します。

あらゆるターミナルアプリ/IDE: iTerm2、Ghostty、Neovim、Lazyvim
あらゆるAIコーディングCLIエージェント: Claude Code、Codex、Cursor、Gemini、OpenCodeなど

AIコーディングエージェントCLIとの使い方

CodeRabbit CLIはAIコーディングエージェントとの新しい統合の可能性を広げます。Claude Codeとの動作例は以下の通りです。

コーディングタスクを進める際、Claude CodeにCodeRabbitを使って発見された問題を修正するよう促すことができます。PRDやタスクリストからコーディングする場合に特に便利です。

仕様書のフェーズ7.3を実施し、その後に `coderabbit --prompt-only`を実行してください。
必要なだけバックグラウンドにて実行し、発生した問題を修正してください。

Claude Codeはコーディングタスクを進めながら、バックグラウンドでcoderabbit --prompt-onlyを実行します。タイマーを設定してCodeRabbitを定期的に確認する場合もあります。あるいは、ClaudeにCodeRabbitの完了を確認するよう促すこともできます。

その後、Claude CodeはCodeRabbitの出力を読み込みます。--prompt-onlyフラグを使うことで、AIエージェントが読み取れるプレーンテキストで出力されます。ClaudeはCodeRabbitが検出した問題ごとにタスクリストを作成します。

Claude Codeとの統合や自動化ワークフローについては、CLIドキュメントをご覧ください。

CLIにはインタラクティブモードとプレーンレスポンスモードの2種類があり、自動化ワークフローへの統合や他ツールへの結果の受け渡しが容易です。

はじめ方

CodeRabbit CLIはすでに利用可能です。インストールして最初のレビューを試してみましょう。

# CodeRabbitをインストール
curl -fsSL https://cli.coderabbit.ai/install.sh | sh

# インタラクティブモードでレビュー実行
coderabbit

]]>

Tue, 16 Sep 2025 22:46:14 GMT

Raising our $60 million Series B: Quality gates for AI codingの意訳です。

CodeRabbitを立ち上げたとき、そのコンセプトはシンプルでした。すべての開発者がコードレビューを嫌っているのだから、もっと速く、簡単にできるようにすればいいのでは、ということです。変数名やスタイル規約について、同じコメントを何度も書くのは誰にとっても楽しいことではありません。

そこでAIが役立つと考えました。ベストプラクティスのチェックやルールの適用を自動化すれば、開発者自身がやる必要がなくなるのです。そしてさらに重要なのは、AIがセーフティネットとして機能し、本番環境に入る前に問題やバグを検知できることです。

その信念のもと、私たちはAIコードレビューという、まったく新しいものを作ることに挑戦しました。その後、AIコーディングツールは広く普及し始めました。Copilot、Claude Code、Cursorといったツールは、開発チームが容易にレビューできる以上のコードを生成するようになり、多くの開発者がPR数を2倍から3倍に増やしました。これにより、すでに抱えていたコードレビューのバックログはさらに増加。私たちはすぐに気づきました。「効率化」と宣伝されていたものが、やがてレビューのボトルネックになることを。

そこではじめて理解したのです。AIコードレビューは開発チームにとって極めて重要な存在になると。信頼とガバナンスのレイヤーとして機能し、品質とセキュリティを担保しながら、開発者の時間を節約します。そしてボーナスとして、職場での皮肉混じりのレビューコメントも大幅に減らせるのです。

AIコードレビューが必須となった2025年

https://youtu.be/UHCTKZYOOYU

過去2年間で、私たちは最も包括的かつコンテキストを重視したコードレビュープラットフォームを構築し、200万のリポジトリに導入され、1,300万件のPRをレビューしました。GitHubとGitLabの両方で最もインストールされたAIアプリとなり、数えきれないほどの開発チームの士気を向上させてきました。

そして2025年、AIコードレビューは、AIコーディングエージェントの普及に伴う課題に直面するすべてのチームにとって必須のものとなっています。この変化は前例のない成長を引き起こし、本日発表した6,000万ドルのシリーズB資金調達につながりました。

今回の投資はScale Venture Partnersが主導し、NVentures（NVIDIAのベンチャーキャピタル部門）が参加。長年の投資家であるCRV、Harmony Partners、Flex Capital、Engineering Capital、Pelion Venture Partnersも支援してくれました。今回の資金調達により、私たちが調達した累計資金は8,800万ドルになりました。

なぜ多くのチームがAIコードレビューを導入しているのか

チームのすべての開発者がコードをより速く生成するようになると、レビュー待ちのキューは指数関数的に増えます。以前は1日5〜10件のPRをレビューしていたシニアエンジニアが、今では20〜30件を抱えています。計算が合いません。チームは2つの悪い選択肢に直面します。デプロイを遅らせて丁寧にレビューするか、レビューを急いで品質を犠牲にするか。

だからこそ、AIコードレビューの導入は加速しています。AIレビュアーは人間のレビュアーを補完し、アーキテクチャの判断やビジネスロジック、AIが完全には理解できない文脈を必要とするフィードバックに集中できるようにします。

この1年は嵐のようでした。売上は10倍になり、チームも倍増しました。その背景には以下の要因があります。

それぞれの顧客の背後には実際のチームがいて、同じことを感じています。つまり、CodeRabbitでレビューが速くなってバグが早期に発見され、リリースサイクルが再び加速しているということです。

Grouponは、レビューから本番リリースまでにかかっていた時間が86時間からわずか39分に短縮されたと報告しました。別の企業は、コードレビューに費やす時間を70％削減できたと共有してくれています。

CodeRabbitの仕組み

CodeRabbitは「AIの炎にAIで立ち向かう」からこそ機能します。多数のコンテキスト情報を取り込み、最も文脈に沿ったレビューを提供します。

本番前に正確性やセキュリティの問題を検知
組織のベストプラクティスや独自ルールの適用
マージサイクル全体をサポート（ユニットテストやdocstring生成など）

調達を祝して：CodeRabbit CLIの発表

https://youtu.be/IqBKf4u5MtA

本日、シリーズBの発表を記念して、CodeRabbit CLIを発表します。これはターミナル上で動作するAIコードレビューで、Claude Code、Codex CLI、Cursor CLI、GeminiなどのAIコーディングエージェントとシームレスに連携します。

開発者がCLIベースのコーディングエージェントを使ってコードを書くケースが増える中、私たちは大きなギャップを特定しました。コードはかつてない速度で生成されていますが、品質検証が行われるのは遅く、PRの段階になってからということが多いのです。

CodeRabbit CLIはこれを変えます。CLIワークフローに直接インテリジェントなレビューを組み込み、コード生成と品質検証の間にリアルタイムのフィードバックループを作り出します。

モジュールをリファクタリングするようClaude Codeに依頼しても、Cursor CLIで新機能を実装しても、CodeRabbitは即座にその結果をレビューし、ハルシネーション生成を検知し、セキュリティ問題にフラグを立て、AIエージェントに文脈に沿った修正を返すことさえできます。

CodeRabbit CLIは、AI生成コードを本番レベルに引き上げるために欠けていたオーケストレーションレイヤーであり、自律的な開発の実現を可能にします。

今回の資金調達が意味するもの（あなたにとっても）

シリーズBで調達した資金は、私たちが解決すべき課題のスケールに合わせて成長を続けるために使われます。投資先は以下の通りです。

製品開発の加速： コンテキスト統合の強化、よりスマートなマージ前チェック、自動テストなど、ロードマップは満載です。レビューをより速く、正確で、有用にすることに集中します。
オープンソースの支援： 現在、すでに10万以上のOSSプロジェクトがCodeRabbitを利用しています。この資金で、貢献や支援をさらに強化し、現代的な開発を可能にしたコミュニティをサポートします。詳細は今週後半に！
優秀な人材の採用： 今年だけで従業員数を倍増させました。今後はエンジニアリング、プロダクト、セールス、マーケティング、カスタマーサクセスの分野でグローバルに採用を進めます。

この資金調達により、私たちが「AI駆動開発における最も重要な欠けているピース」だと信じている、スケーラブルで文脈に対応したレビューを構築し続ける余地が生まれました。

ご支援ありがとうございます

この会社を始めたとき、私たちはすべてのエンジニアが経験する課題に挑戦していることを理解していました。レビューは面倒で、簡単にはスケールしません。その課題にCodeRabbitが今や数千のチームを支援できていることは、謙虚であると同時に大きな力を与えてくれます。

顧客、コミュニティ、投資家の皆さまへ：私たちを信じ、一緒に築いてくださりありがとうございます。そしてこの取り組みにワクワクする方は、ぜひ私たちに加わってください。コードレビューの未来を一緒に作りましょう。

CodeRabbitを無料で試す そして 採用情報はこちら

]]>

Tue, 16 Sep 2025 12:59:39 GMT

CodeRabbit started with AI-powered code reviews in pull requests. In May, we brought that same intelligence to VS Code, Cursor, and Windsurf. Now, we're extending the AI code reviews developers love into the command line with CodeRabbit CLI. In case you’re wondering, that makes us the most comprehensive AI code review tool available. We work everywhere you work.

CodeRabbit CLI helps devs perform self-reviews of code directly in their terminal. By providing automated, intelligent code analysis capabilities, it empowers developers to catch issues early, maintain consistent code standards, and make coding autonomous through seamless integration with AI coding agents in the CLI.

Vibe checking your code – now in CLI

https://youtu.be/IqBKf4u5MtA

CodeRabbit CLI delivers the same comprehensive analysis that makes our PR and IDE reviews effective at catching bugs early. CodeRabbit CLI is free to use with rate limits but with a Pro plan you can enjoy much higher limits and additional features, including:

Context-aware analysis: Leverages your Git integration to synthesize insights from 40+ sources including static analysis tools, security scanners, and our codegraph relationship feature for the most comprehensive reviews.
Pre-commit reviews: Analyze changes before they leave your machine for multi-layered reviews.
One-click fixes: Apply simple fixes instantly or send complex issues to AI agents with full context hand-off.
Coding guidelines: Auto-detects agent.md, claude.md, Cursor rules, and other coding agent configuration files.

CodeRabbit CLI: Works everywhere, with everything

Terminal-native means CodeRabbit CLI works with:

Any Terminal App/IDE: iTerm2, Ghostty, Neovim, Lazyvim
Any AI Coding CLI agent: Claude Code, Codex, Cursor, Gemini, OpenCode and more

How to use CodeRabbit CLI with AI Coding Agent CLI

The CodeRabbit CLI opens up new integration possibilities with AI coding agents. Here's how it works with Claude Code:

While working on a coding task, you can prompt Claude Code to use CodeRabbit and to fix any issues it finds. This is particularly useful if it’s coding from a PRD, or a tasklist.

Please implement phase 7.3 of the planning doc and then run coderabbit --prompt-only, let it run as long as it needs (run it in the background) and fix any issues.

2. Claude Code will carry on the coding task and run coderabbit --prompt-only in the background. It may setup a timer interval to check on CodeRabbit. Alternatively, you can also prompt Claude to check if CodeRabbit is complete.

3. Claude Code will then read the output of CodeRabbit which, by using the --prompt-only flag, provides the output as plain text with prompts for AI agents to read. Claude will then create a tasklist addressing each of the issues surfaced by CodeRabbit.

For Claude Code integration and automated workflows, check the CLI documentation for setup.

The CLI has two modes: interactive and plain response , making it easy to integrate into automated workflows or pass results to other tools.

Getting started

CodeRabbit CLI is available now. Install and try your first review:

#install CodeRabbit
curl -fsSL https://cli.coderabbit.ai/install.sh | sh

#Run a review in interactive mode
coderabbit

]]>

Tue, 16 Sep 2025 12:55:41 GMT

When we started CodeRabbit, the idea was pretty simple: since all developers hate code reviews, why not make them faster and easier? After all, no one enjoys leaving the same comment about variable naming practices or style conventions for the tenth time in a week.

That’s where we believed AI could help – it could automate best-practice checks and policy enforcement so that devs didn’t have to do it themselves. But more importantly, it could act as a safety net, catching issues and bugs before they made it into production.

With that belief, we set out to create something new: AI code reviews. Over time, AI coding tools started to gain broader adoption. Tools like Copilot, Claude Code, and Cursor began spitting out more code than teams could easily review with many developers increasing the number of PRs they shipped by 2x to 3x. This added to the existing code review backlogs many teams had. We quickly realized that the ‘efficiency’ gains being marketed to engineering teams would swiftly turn into code review bottlenecks.

And that’s also when we first realized how critical AI code reviews would be to development teams. They would function as a trust and governance layer in agentic software development ensuring quality and security while saving devs time. And, as an added bonus, greatly reducing passive aggressive review comments in the workplace!

AI code reviews became essential in 2025

https://youtu.be/UHCTKZYOOYU

Over the last two years, we’ve built the most comprehensive and context-rich platform for code reviews, been installed on 2 million repos, reviewed 13 million pull requests, become the most installed AI App on both GitHub and GitLab, and improved the morale of countless dev teams.

In 2025, we watched AI code reviews become essential for all teams dealing with the challenges that come with the broad adoption of AI coding agents. But that shift fueled a year of unprecedented growth, culminating in the $60 million Series B round that we announced today.

This investment was led by Scale Venture Partners with participation by NVentures (NVIDIA’s venture capital arm) and support from our long-time investors CRV, Harmony Partners, Flex Capital, Engineering Capital, and Pelion Venture Partners. With this new funding, our total capital raised is now $88 million.

Why so many teams are adopting AI code reviews

When every developer on your team is generating code faster, your review queue grows exponentially. Senior engineers who used to review 5 to 10 PRs a day are now facing 20 to 30. The math doesn't work. Teams are caught between two bad options: either slow down deployment cycles waiting for thorough reviews, or rush reviews and let quality slip.

This is why AI code review adoption is accelerating. AI reviewers augment the human reviewers, freeing them to focus on architecture decisions, business logic, and the nuanced feedback that requires context AI can't fully grasp yet.

The past year has been a whirlwind. We’ve 10x revenue and doubled our team thanks to:

Behind each of those customers are real teams who tell us the same thing: reviews are faster with CodeRabbit, bugs are caught earlier, and release cycles are finally speeding up again.

Groupon told us they went from 86 hours from review-to-production down to just 39 minutes. Another shared that they cut down the time they spend on code reviews by 70%.

How CodeRabbit works

CodeRabbit works because it fights AI fire with AI fire. Our platform brings in dozens of points of context to deliver the most context aware reviews to:

Catch correctness and security issues before they hit production.

Enforce organizational best practices and custom policies.

Support the full merge cycle with unit testing and docstrings generation.

How we’re celebrating: By announcing CodeRabbit CLI

https://youtu.be/IqBKf4u5MtA

Today, we're celebrating our Series B by announcing CodeRabbit CLI, AI code reviews that live in your terminal and orchestrate seamlessly with Claude Code, Codex CLI, Cursor CLI, Gemini, and other AI coding agents.

As developers increasingly write code through CLI Coding agents, we've identified a critical gap: code is being generated at unprecedented speeds, but quality validation happens too late, often only at the PR stage.

CodeRabbit CLI changes this by bringing intelligent review directly into the CLI workflow, creating a real-time feedback loop between code generation and validation.

Now, whether you're prompting Claude Code to refactor a module or using Cursor CLI to implement a feature, CodeRabbit instantly reviews the output, catches hallucinations, flags security issues, and even hands contextualized fixes back to your AI agent.

CodeRabbit CLI is the missing orchestration layer that makes AI-generated code production-ready, turning the promise of autonomous development into reality.

What this funding means for us (and for you)

Our Series B round will help us keep pace with the scale of the problem we set out to solve. Here’s where we’re putting that investment:

Accelerating product development: Our roadmap is packed. From deeper context integrations to smarter pre-merge checks and automated testing, we’re focused on making reviews faster, more accurate, and more useful for every team.
Supporting open source: Today, more than 100,000 OSS projects already use CodeRabbit. With this funding, we’re doubling down on contributions and support to strengthen the community that made modern development possible. More on that later in the week!
Hiring the best talent: We’ve already doubled headcount this year and we’re hiring globally across engineering, product, sales, marketing, and customer success.

This funding gives us the space to keep building what we believe is the most important missing piece of AI-powered development: scalable, context-aware reviews.

Thank you for all your support

When we started this company, we knew we were chasing a problem every engineer experiences: reviews are a pain and they don’t scale easily. The fact that CodeRabbit is now helping thousands of teams tackle that problem is both humbling and energizing.

To our customers, community, and investors: thank you for believing in us and building alongside us. And if this work excites you, consider joining us. Come help us build the future of code reviews.

Try CodeRabbit for free yourself and learn more about our open roles.

]]>

Mon, 15 Sep 2025 01:00:13 GMT

Ryo HIGASHIGAWAさんは、OSSとして「reviewtask」というレビュー支援ツールを一人で開発されています。このツールは、AIが生成したコードに対して発生する膨大な指摘事項を、効率的かつ正確に管理するために設計されたものです。Ryoさん自身がAIによるコードレビューを日常的に活用し、その運用課題に直面する中で生まれた実践的なプロダクトとなっています。

もともとは、GitHubのレビューコメントをAIで取得し、タスクに変換して管理するというアプローチを試していたものの、精度や手間の問題が大きく、より安定した運用を目指してreviewtaskの開発が始まりました。そんな背景を持つRyo HIGASHIGAWAさんに、reviewtaskとCodeRabbit活用についてお話を伺いました。

reviewtaskの開発体制について

reviewtaskはRyoさんが開発していますが、コード自体はほぼすべてAIによって生成されています。Ryoさん自身は、開発タスクの設計やバグ報告など、プロダクトマネージャーのような立場に徹し、コードを書く作業をAIに委ねています。

Git操作やPull Request、Issue作成などの多くもAIに任せており、自らは開発プロセスの全体像を見ながらプロジェクトを前に進める役割に集中しているとのことです。

コードレビューに関する課題

AIによるコード生成を大量に行う中で、課題となったのがレビュー品質の担保でした。生成されたコードは量が多く、すべてを人の手でレビューするのは現実的ではなく、疲労感とボトルネックを生み出していたそうです。

AIにコードを書かせることで生産性は大きく向上しましたが、その反面、レビューと品質管理にかかる時間と労力が爆発的に増加。最終的には、レビューまでもAIに任せられないかと模索するようになったといいます。

「AIに書かせて自分がレビューしてとやっていると、非常に疲れるなというのが問題感としてありました」

CodeRabbitとの出会い

RyoさんがCodeRabbitに出会ったきっかけは、他のAI開発支援ツールとの比較や試行の中でのことでした。当時はDevinやCursorなどのツールを並行して使用し、ドキュメントやレビューの自動化に取り組んでいたそうです。

AIに仕様を理解させ、それに沿ったレビューやチェックを実現したいという強い思いから、試行錯誤を重ねる中で、CodeRabbitの精度や柔軟性に魅力を感じて導入に至りました。

「レビューの指摘の対応やPRの状況の確認などにも利用できるので非常に柔軟なツールなところが気にいっています」

CodeRabbit導入の決定要因

CodeRabbitを導入する決め手となったのは、レビューの質だけでなく、ツールとの対話ができる点だったといいます。指摘を受けたくないポイントを説明すれば理解してくれる柔軟性や、プロジェクトごとのカスタマイズ性が大きな魅力でした。

また、単にレビューコメントを生成するだけでなく、レビュー後のフローに組み込みやすい点も導入を後押ししたそうです。

「プロダクトの都合上、指摘を入れて欲しくない所は説明すれば学習してくれるのが嬉しいですね」

CodeRabbitの運用状況・効果

現在では、reviewtaskやその他のプロジェクトにおいて、CodeRabbitによるレビューを標準フローとして組み込んでくれています。レビュー品質の維持と同時に、設計やドキュメント作成に集中できる時間が確保され、結果として開発効率が大きく向上しました。

レビュー指摘の管理にはAIツールとの連携や自作ツールを駆使し、指摘をTODO化して確実に対応していくプロセスが構築されています。複数のプロジェクトを同時に進める現在の開発スタイルは、CodeRabbitの存在抜きには成り立たないといいます。

「今はなくてはならないパートナーという感じです」

実務での利用

OSS開発だけでなく、業務での開発プロジェクトにおいてもCodeRabbitを導入し、レビューの効率化を図っています。特にレビューコストの高いチームにとっては、CodeRabbitが先に自動で指摘を洗い出してくれることで、人的リソースの負担が大きく軽減されました。

導入後は、レビューの流れそのものが変わり、指摘が先に潰された状態でレビュワーに渡るため、確認作業の集中と精度向上につながっているとのことです。

「ワークフローが完全に変わった感じがして良い評価が開発メンバーからも上がっています」

CodeRabbitに今後期待したいところ

CodeRabbitへの要望としては、仕様学習の精度向上や、PRやIssueを横断的に管理できる機能の強化が挙げられました。VSCodeとの連携においても、詳細な指摘内容を取得し、IDE上でAIからのフィードバックを直接得られるようになることを期待されています。

さらに、ドキュメントのわかりやすさや機能説明の具体性にも改善の余地があると感じており、ユーザーの立場からnoteなどで情報発信を続けていきたいとの意欲も語ってくださいました。

「本当に素晴らしいプロダクトだと思っているので、ぜひこのすばらしいプロダクトを広めていただければと思っています！」

CodeRabbitは今後もreviewtaskの開発をサポートしていきます！

]]>

Sun, 14 Sep 2025 07:00:00 GMT

Our customers trust us with their most valuable asset: their source code. That trust is why security is central to our mission of helping developers ship better code faster.

When there’s a chance to strengthen our security posture, we act quickly and decisively. And when we design new systems, we design them with “security by default” in mind.

We share below the architecture that makes CodeRabbit more resilient, limits the potential impact of any one component, and ensures that the data entrusted to us remains safe under all circumstances.

Overview

Customers install CodeRabbit on their git platforms via the app marketplace. We integrate via webhooks with all popular Git providers such as GitHub, GitLab, Bitbucket & and Azure DevOps. The integration allows us to register webhooks on events such as PR opened, user comment, etc.

Each event is processed in complete isolation. We maintain a secure internal queue that verifies subscriptions, applies rate limits, and ensures that only authorized events are allowed through. Events are handled one at a time, with zero shared state and no assumptions about what came before or after.

This model gives us something incredibly valuable: containment by default. If an attacker were to compromise one event, they would find nothing else to pivot to – no shared memory, no long-lived tokens, no context beyond that single, short-lived process. Every review starts from scratch, runs alone, and ends clean.

Our architecture at a glance

Here’s a high-level look at how our system is structured in our git-based, IDE, and CLI reviews:

This design is focused on limiting an attacker’s potential “blast radius” – or how much damage an attacker can do if they succeed at breaching one component. By isolating secrets, tightly scoping tokens, and strengthening our encryption, we’ve drastically reduced that radius.

Our layered approach

We use these layered strategies:

1. Sandbox

We create a secure sandbox environment for each code review event to clone the codebase in order to read files, pull context from various sources in our knowledge base about your code and to run tools, linters, web search queries & verification checks. Our sandboxed environment only has the short-lived token for that particular repository, but it contains absolutely no other secrets, API keys, or credentials. Even if an attacker were to achieve remote code execution within our sandbox environment or get out of the sandbox and break the sandbox kernel-based isolation mechanism, they would find nothing of value - no environment variables with tokens, no configuration files with secrets.

Internal network access is also blocked from the sandbox. Tools may connect to the internet when required, but they cannot reach CodeRabbit’s internal services.

2. Token Service Separation

To reinforce the isolation of workloads, we have fully embraced a model based on short-lived session tokens rather than long-lived secrets. Instead of passing environment variables or static credentials, every process is scoped with query or event-specific tokens. These git provider tokens are valid only for the duration of the event or process. These are customer-specific, short-lived tokens. These tokens also have strict rate limiting and audit logging.

This means that workloads never carry unnecessary privileges. They can only access the resources required to process a specific pull request – and nothing more.

By removing persistent credentials from execution environments, we eliminate one of the most common attack surfaces. Even if a third-party tool were exploited, the attacker would see nothing beyond the minimal context of the current event.

3. Customer Data Isolation & Encryption

Each customer's code review is completely isolated. We provision separate containers per code review and use customer-scoped tokens that can only access their specific repositories. There is no shared state between customers.

We also ensure that our code index and all cached code is encrypted with a unique key per customer. Even CodeRabbit employees can't see any code-related data we store. You can also opt out of these features if you don’t want a cached copy of your code.

This layered approach ensures that even if an attacker were able to gain access, they would be unable to access anything critical.

Our broader security posture

A security best practice is to layer multiple controls so that if one fails, others remain in place. We’ve implemented several layers of defense to protect customer code and data:

Automated sandbox enforcement: Every external tool must run in an isolated sandbox environment. This rule is enforced automatically.

Hardened deployment gates: We’ve added pre-deployment checks that verify no service can bypass sandbox isolation or attempt to run with escalated privileges.
Encryption by customer key: Code indexes and cached code are encrypted with a per-customer key. This ensures that even if cache data were exposed, it would remain unreadable without the correct key.
Auditing and monitoring: We’ve expanded our monitoring of sandboxed environments and added automated alerts for unexpected behavior or network activity.
Expanded training: Every CodeRabbit engineer receives additional security training focused on secure-by-design practices and safe handling of secrets.
Least privilege access: Users, processes, and systems are granted only the minimum level of permissions and access rights necessary to perform their specific tasks and nothing more.
Vulnerability disclosure program (VDP): We maintain a formal program that invites independent security researchers to report potential issues responsibly. This ensures that if a weakness is discovered, it can be addressed quickly, transparently, and in partnership with the security community.
Penetration testing and architectural reviews: We work with multiple third parties to conduct routine penetration testing and architectural reviews to routinely audit and improve our security posture.

Looking ahead

We’re committed to building on this foundation by continuing to work with independent auditors, engaging with security researchers through responsible disclosure, and refining our internal practices.

Our goal is to deliver world-class AI code reviews with the highest levels of security and reliability.

]]>

Wed, 10 Sep 2025 01:00:38 GMT

株式会社SalesNowは1,400万件超の企業情報を収録し、法人網羅率100%を誇る日本最大級の企業・組織データベース、AI企業データクラウド「SalesNow」を提供しています。同社では、プロダクト面のAI活用に加えて、社内の開発生産性向上を目的としたLLMやエージェントの導入にも積極的です。

その一環としてAIコードレビューサービスのCodeRabbitを活用し、レビューの標準化と学習の仕組みづくりを進めています。今回はSalesNow社内におけるCodeRabbitの利用状況について、同社エンジニアの@sa9_sha9さんにお話を伺いました。

SalesNowの開発体制について

同社の開発は完全内製で、創業時からフルリモートを文化として定着しています。地理的に分散したメンバーが自律的に動けるよう、非同期コミュニケーションとプルリクエスト中心のフローを重視しています。

体制はアプリケーション開発が6名、データ生成と収集を担うデータチームがフルタイム6名とインターン約4名。さらにデザイナーとPMが加わり、全体で20名ほどとなっています。主要スタックはPythonとReactで、データの信頼性と鮮度を軸に開発を進めています。

レビュー待ちのボトルネック発生が課題

CodeRabbit導入前は、ドメイン知識や社内の開発流儀の判断が一部メンバーに集中し、レビュー待ちのボトルネックが発生していました。長く在籍しているからこそ分かる書き方や、過去の経緯に基づいた知見が人に依存し、属人化を招いていました。

技術面では、プログラミング言語の進化に伴う細かな是正を人が指摘していました。たとえばPythonの型ヒントに関する記述の見直しなど、新しい機能に関する指摘ほど、人が丁寧に行う必要がありました。加えて静的解析では指摘が多く、重要度の見極めが難しかったと振り返ります。

「レビューの質は落とさず、人にしかできない判断と機械で代替できる指摘を切り分けたいという課題感が常にありました」

CodeRabbitとの出会い

CodeRabbitと出会ったきっかけは、Xのタイムラインでした。実運用の事例も多く確認でき、プルリクエスト（以下PR）を起点に自動で一次レビューが進む点が自社にフィットすると判断したといいます。そこで、まずは@sa9_sha9さんのチームに限定して、小さく使い始めました。

同社では他のAIツールも併用していましたが、PR駆動の自動レビューという仕組みは運用に乗せやすいと感じたといいます。非同期中心のワークスタイルにも自然に溶け込み、導入ハードルが低い点も後押しになりました。

「小さく始めて徐々に広げるという進め方が取りやすく、現場の感触を早く得られました」

CodeRabbit導入を決めた3つのポイント

SalesNow社が、CodeRabbit導入を決定した要因は以下の3つです。

1. PR作成を起点に自動で一次レビューが実行されること
2. 日本語の自然なコメントと、プロジェクトガイドラインの読込に対応していること
3. レビューの指摘に対してやり取りを重ねることで、自然と暗黙知が体系化されていく体験がとても良かった

自動的なコードレビューは、人の手が空いていなくてもレビューとディスカッションが先行し、手戻りが減る効果がありました。また、一般的なベストプラクティスと社内の流儀を橋渡しできる点も評価しています。

「まずAIに通すことで基本的な抜け漏れを塞ぎ、人は本質的なレビューに集中できるようになっています」

AIによるレビューは心理的摩擦が低い

現在はフルタイムとインターンを含む全開発メンバーに権限を付与し、PRを作るとCodeRabbitがレビューするのが当たり前になっています。新しいメンバーにはオンボーディングにて、レビューへの反応と判断のコメント化を周知しています。

この工夫は、非同期な環境ではレビューコメントが放置されているのか、対応不要なものなのかを判別するためです。厳格なルール化はしていませんが、同社では実質的な規範として定着しています。

導入後の効果として、一次レビューの網羅性が上がり、週末に働きたいメンバーも一人でレビューを回せるようになっています。ユニークな意見として、AIのレビューは人だと起こりがちな心理的摩擦が少なく、受け止めやすいとのこと。

「CodeRabbitが一次レビューを行い、人は本質を見るという分担で、速度と丁寧さの良いバランスを得られています」

CodeRabbitに今後期待したいところ

SalesNow社では、深くCodeRabbitを活用しており、さまざまな要望が上がっているとのことです。

「まず、要件定義やドメインロジックへの踏み込みを強化してほしいです。当社ではAsanaを利用していますが、そこから仕様の文脈をより確実に参照し、実装意図との整合性の確認や、やるべきでない変更の検知まで踏み込めると、より便利になると思います」

他にもクロスリポジトリにおける整合性の強化が期待されています。APIとフロントエンド間ではOAS（OpenAPI Specification）を使っていますが、それでも十分に読み取れていない場合があるとの指摘がありました。

「他にも設定のマージ（組織、リポジトリそれぞれの設定のマージ）や、もっとレポート機能を使いこなしたいと考えています」

CodeRabbitは、今後もSalesNow社のサービス開発をサポートして参ります。

SalesNowでは、Webリードエンジニアやデータエンジニア、バックエンドエンジニア、LLMエンジニアなどさまざまなエンジニアを募集しています。気になる方は、ぜひSalesNow採用情報をご覧ください。

]]>

Mon, 08 Sep 2025 09:03:45 GMT

How CodeRabbit delivers accurate AI code reviews on massive codebasesの意訳です。

大規模なコードベースは特別な存在です。数百のファイルに広がり、何年ものコミットで進化し、時にはなんとか組織的な記憶でつながっているように見えることもあります。その環境で変更をレビューするのは難しいだけでなく、まるで考古学の発掘作業のようです。この行が先週ここに移動したのは理由があったのか？他のファイルが密かに依存しているのではないか？

まさにそこでCodeRabbitが力を発揮します。スケールに対応するよう設計されているため、ファイルごとのバラバラなコメントになることなく、大規模コードベース全体の履歴とアーキテクチャを考慮してレビューを行います。リポジトリが大きく古いほど、CodeRabbitは役立ちます。人間がプルリクエストの途中で忘れてしまいがちなパターン、依存関係、ルールを見抜けるからです。

大規模コードベース？AIコードレビューにはより多くの文脈が必要！

CodeRabbitは大規模リポジトリで高いパフォーマンスを発揮することで知られています。私たちのツールはプルリクエストを表面的に読むだけではなく、アーカイブ役のように振る舞います。コメントを残す前に、周辺のコードや多数の文脈を引き込みます。AIエージェントはそれらが履歴の中でどう動いてきたかを追跡し、チームのコーディング規約を適用し、スクリプトやツールで自らの推論を二重チェックします。

その結果、レビューは異常なほど「文脈に詳しい」ものになります。クロスファイルの問題を事前にキャッチし、一貫性を強制しつつも不要な指摘は避け、複雑で長い過去を持つリポジトリ全体にスケールします。

得られる結果は明確で、早い段階でのリスクに対するフィードバック、予期せぬ副作用の減少、そしてコードベース全体を理解したレビューになります。

差分だけのレビューの問題点（文脈がないと何が起こるか）

コードの差分は必要ですが十分ではありません。大規模コードベースでは、10行の変更が複数サービスで共有されるヘルパーを密かに変えたり、公開されているAPIの要件を変更したり、差分ファイル以外のセキュリティ前提を崩したりすることがあります。

差分だけを見るAIレビューは、大規模コードベースでは計器なしで飛んでいるようなものです。変更箇所がどこで参照されているのか、他に一緒に変わりやすいコードは何か、チケットの意図に合っているかが見えなければ、小さなコードベースでは通用しても大規模コードベースでは役立ちません。

文脈がないと「これも更新してもらえますか？」というやり取りが繰り返され、マージ時に遅れて驚きが発生し、小さなリグレッションが積み重なります。紙の上ではレビューが良く見えても、本番では違う結果になるのです。

レガシーコードベースに正しい文脈を構築する（それがPRをどう助けるか）

CodeRabbitを「意見を出す前に調査ファイルを組み立てる存在」と考えてください。そのケースファイルには以下の要素が含まれ、それぞれがレビューに反映されます。

コードの地図（Codegraph）

CodeRabbitは定義と参照の軽量なマップを構築し、履歴をスキャンして頻繁に一緒に変更されるファイルを特定します。これにより、依存関係のマップを作成し、PR内の変更が他の依存関係を壊さないかを確認します。

なぜ役立つか: 行単位ではなくファイル間で推論できる。

実際の動作: Codegraphを使って関連ファイルを辿り、差分外で見つかったバグをまとめて通知します。
コードインデックス（セマンティック & 類似検索）

CodeRabbitは関数、クラス/モジュール、テスト、過去のPRや変更のセマンティックインデックス（埋め込み）を保持します。レビュー時にはキーワードではなく目的ベースで検索し、類似実装を見つけ、再利用すべきテストを引き出し、過去の修正方法を思い出します。

なぜ役立つか: レガシーコードベースですでに解決している方法を参照でき、一貫性向上、手戻り削減、テスト拡充が速くなる。

実際の動作: 類似検索により同じコールバックパターンを使った別のテストを提示し、同じ修正を提案します。
チーム独自のルールを反映

CodeRabbitのレビューはチームの規約（命名、エラーハンドリング、API境界、セキュリティ要件、性能要件、テスト規範など）に基づいて行われます。

なぜ役立つか: 一般的なチェックリストではなく、チーム固有の基準に沿ったフィードバックが得られる。

実際の動作: スキーマ変更後にPrismaのマイグレーション不足を指摘。開発者が「デプロイ時に自動生成される」と返答すると、CodeRabbitはそれを学習として保存し、将来の誤検出を避けます。
ツールからのシグナル

AIの推論と並行して、CodeRabbitはリンターやセキュリティ解析ツールを実行し、その結果をレビューに統合します。

なぜ役立つか: AIとツールの両方に裏打ちされた具体的な改善提案が得られる。

実際の動作: ESLintルールと行番号を示し、コールバックを型付き宣言に書き換え、オプショナルチェイニングで安全性を確保します。
証拠に基づく（検証スクリプト）

検証が必要な場合、CodeRabbitはシェル/Pythonスクリプト（grepやast-grepのようなもの）を生成し、仮定を確認したり証拠を抽出してからコメントを残します。

なぜ役立つか: コメントに裏付けがあるため、ノイズが減り、実際にコードを改善する指摘だけが残る。

実際の動作: ファイルとループを特定し、失敗モードを説明し、検証エージェントが解析後に導いた正確な修正案を提案します。

これは実践的なコンテキストエンジニアリングです。正しい情報を集め、絞り込み、整理してからモデルに判断させる。CodeRabbitは創業時からこのアプローチを核としてきました。

成果はシンプルです。シグナルが強く、ノイズが少なく、システムを理解しているレビューになります。

エンタープライズ規模リポジトリへのスケーリング

CodeRabbitはスケールを意識して設計されたパイプラインにより、大規模・レガシーコードベースで強みを発揮します。

PRが届くと、CodeRabbitは隔離された、短期間だけの安全な環境を立ち上げます。必要なものだけを取得し、文脈を構築し、検証を実行し、終了後に破棄します。ピーク時には多数のワーカーが並列実行され、レビュー速度は一定に保たれます。パスフィルターで不要なアセットを除外し、キャッシュやインデックスを選択的に有効化して繰り返しのレビューを高速化できます。

要するに、範囲の選択で文脈を集中させ、隔離で安全性を確保し、弾力性ある実行方法で高速性を維持します。この手法はコードベースとリリーススケジュールに合わせてスケールします。

CodeRabbit: 大規模コードベース向けAIコードレビューの正解

CodeRabbitの強みは単一のトリックではありません。コンテキストエンジニアリングを端から端まで適用する姿勢にあります。変更が何に触れるかをマッピングし、意図に結びつけ、チームルールを適用し、ツールで検証し、証拠付きでコメントします。

このやり方は「コンテキストエンジニアリング」という言葉が流行る前から一貫しており、スケールした環境で正確でノイズの少ないレビューを実現する唯一の方法です。

あなたの大規模コードベースで深い文脈を持つレビューを体験してみませんか？ → 14日間のトライアルを開始する

]]>

Fri, 05 Sep 2025 17:20:16 GMT

Massive codebases are a special kind of beast. They sprawl across hundreds of files, evolve over years of commits, and occasionally feel like they’re held together by equal parts duct tape and institutional memory. Reviewing changes in that environment isn’t just hard – it feels like an archaeological dig. Did this line move here last week for a reason? Is there another file quietly depending on it?

That’s exactly where CodeRabbit shines. It was built for scale, so instead of drowning you in disconnected file-by-file comments, it reviews with the whole history and architecture of your massive codebase in mind. The larger and older your repository, the more useful CodeRabbit becomes because it can see the patterns, dependencies, and rules that humans usually forget about halfway through a pull request when trying to keep all the dependencies in that legacy code in their head.

Large codebase? AI code reviews need more context!

CodeRabbit is known for performing great on large repos. Our tool doesn’t just skim your pull requests; it goes full archivist. Before leaving a single comment, it gathers the surrounding code from your large codebase and pulls in dozens of points of context from your code. AI agents then trace how those pieces have moved through history, apply your team’s coding standards, and even double-check their own reasoning with scripts and tools.

The effect is reviews that feel unusually…informed about your legacy codebase. It catches cross-file issues before they turn into production mysteries, enforces consistency without nitpicking, and scales comfortably across sprawling repos with long, complicated pasts.

The power you gain through this is clearer, earlier feedback on real risks, fewer “wait, what else did that touch?” surprises, and reviews that actually reflect how your whole massive codebase fits together.

The problem with diff-only reviews (or what goes wrong without context)

Code diffs are necessary, but they’re not sufficient. In a massive codebase, a 10-line change can quietly alter a shared helper used by multiple services, shift a public API contract, or undermine a security assumption that lives outside the files in the diff.

AI Bot reviewers who only see the diff are flying without instruments within a large codebase. AI that can’t see where the changed code is referenced, what else tends to change with it, or whether the change actually matches the ticket’s intent, might work for a smaller codebase but not for yours.

Without the right context, you get ping-pong cycles (“Can you also update…?”), late surprises at merge time, and a steady drip of small regressions that add up. The review looks fine on paper, while production tells a different story.

Building the right context on your legacy codebase (and how that helps your PRs)

Think of CodeRabbit as assembling a case file before giving an opinion. Here’s what goes into that case file and how each piece shows up in your reviews.

A map of your code (Codegraph)

CodeRabbit builds a lightweight map of definitions and references and scans commit history for files that frequently change together throughout your massive codebase. This creates a map of file dependencies that CodeRabbit uses to check if any changes in your PR will break other dependencies in your codebase.

Why this helps: The review can reason across files, not just lines.

Seeing it in action: CodeRabbit posts a summary listing bugs outside the diff range that CodeRabbit located by traversing related files with Codegraph.

Here’s an example of the files that CodeGraph brings in from across a repository when completing a PR review.
Code Index (semantic & similarity retrieval)

CodeRabbit maintains a semantic index (embeddings) of functions, classes/modules, tests, and prior PRs/changes. During review, it searches by purpose, not just keywords to surface parallel implementations to align with, pull relevant tests to reuse or extend, and recall how similar issues were fixed before.

Why this helps: Suggestions are grounded in how your legacy codebase already solves similar problems, reducing rework, improving consistency, and speeding up test coverage.

Seeing it in action: Using similarity retrieval, CodeRabbit surfaces a different test with the same callback pattern and proposes the same fix.
Your team rules, not generic advice

CodeRabbit reviews are primed with your standards (naming, error handling, API boundaries, security requirements, performance expectations, testing norms) that you can share with us via coding guidelines and review instructions.

Why this helps: Feedback reflects your standards and context, not a one-size-fits-all checklist.

Seeing it in action: CodeRabbit flags a missing Prisma migration after a schema edit. A developer replies that migrations are auto-generated during deploy, a repo-specific rule. CodeRabbit stores that as a Learning to avoid future false positives.
Signals from tools

Alongside AI reasoning, CodeRabbit runs linters and security analyzers and folds their findings into our easy-to-read and understand reviews.

Why this helps: You get grounded, actionable suggestions backed by both AI and recognizable tools.

Seeing it in action: CodeRabbit will do things like point to the exact ESLint rule and line numbers, rewrites the callback as a typed declaration, and guards the call with optional chaining.
Evidence, not vibes (verification scripts)

When something needs checking, CodeRabbit generates shell/Python checks (think grep, ast-grep) to confirm an assumption or extract proof from the codebase before we post the comment.

Why this helps: Comments come with receipts. That translates into less noise and more comments that actually improve your code.

Seeing it in action: The comment pinpoints the file and loop, explains the failure mode, and proposes the exact change produced by the verification agent after analyzing the parsing path.

This is context engineering in practice: gathering, filtering, and organizing the right information before asking the model to judge. It’s been core to CodeRabbit since day one.

The payoff is simple: higher signal, lower noise, and reviews that feel like they understand your system.

Scaling to enterprise-size repos

CodeRabbit has an advantage on massive codebases and legacy codebases because we designed our pipeline with scale in mind.

When a PR arrives, CodeRabbit spins up an isolated, secure, short-lived environment to do the work. It pulls only what it needs, constructs the context, runs the checks, and tears everything down after. During busy hours, many of these workers run in parallel so review speed holds steady. You stay in control of scope by using path filters to keep bulky or generated assets out of the way, and choosing whether to enable caching or indexing to accelerate repeat reviews.

In short: selective scope keeps context focused, isolation keeps it safe, and elastic execution keeps it fast. This approach scales with your codebase and your release calendar.

CodeRabbit: Large codebase AI code reviews done right

CodeRabbit’s advantage on massive codebases isn’t a single trick. It comes from how we approach context engineering end-to-end: map what the change touches, tie it to intent, apply your rules, verify with tools, then comment with evidence.

We’ve operated this way from the start, well before “context engineering” became a buzzword, because it’s the only reliable path to accurate, low-noise reviews at scale.

Ready to see a deep-context review on your large codebase? → Start a 14-day trial

]]>

Wed, 03 Sep 2025 04:18:53 GMT

生成AIの技術が急速に進化する中で、AIを活用した開発支援ツールにも注目が集まっています。なかでも、AIによるコードレビュー支援は、開発生産性と品質を両立させるための有効な手段として、多くの現場で導入が進んでいます。

今回は、AIエージェント活用をテーマに事業を展開しているGenerative AgentsのCEO、西見さんにお話を伺いました。同社では立ち上げ当初からCodeRabbitを導入しており、その背景や運用上の工夫、今後への期待などについて詳しく語っていただきました。

AIエージェントの利活用と技術教育で企業を支援

Generative Agentsは、AIエージェントの利活用を軸とした事業を展開しています。LangChainやLangGraphといったLLM関連のライブラリを利用し、クライアント企業のAI活用を支援しています。技術講座の提供や教育プログラムの設計にも注力しており、AIエージェント活用を推進する立場として活動の幅を広げています。

設立は2024年3月。共同創業者3名はいずれも生成AI分野での著書を持つ技術者であり、同じ志を持つ仲間として活動を開始しました。個人では追いきれないスピードで進化するAI技術を、仲間とともにキャッチアップしながら、社会に還元していくことを目指しています。

少数精鋭で顧客と共に作る実践型開発体制

Generative Agentsの開発体制は、現在5名という少人数構成です。エンジニアリングリーダーがアーキテクチャ全体を設計しつつ、顧客と共同でプロダクト開発を進めています。LangChainやLangGraphを使った開発に深く関わり、技術的な壁を乗り越える支援を得意としています。

同社にはAI VTuberのニケちゃんも在籍し、開発メンバーとして活動しています。その他、参画したエンジニアとともに、柔軟な体制でクライアントと伴走しています。プロダクトを開発するだけでなく、顧客自身が手を動かせるような支援体制が特徴です。

第三者視点の不足とレビュー負荷をどう補うか

創業前から個人事業として活動していた西見さんは、外部のエンジニアと協業する中でコードレビューの負荷に課題を感じていました。1人でコードの品質を担保するには限界があり、第三者の視点が欠けがちになっていたそうです。

創業後も、少人数での開発が中心であるため、同じような課題は継続していたといいます。レビューの質を維持するためには、客観的な視点を常に取り入れられる仕組みが必要だと考えていたといいます。

「少人数体制だからこそ、第三者視点が持てるレビュー環境が重要でした。AIによるレビュー支援は、まさにそのニーズに合っていました」（西見さん）

創業前から自然に使い始めたAIレビュー

CodeRabbitとの出会いは、創業前の個人事業時代に遡ります。外部の開発パートナーと進める中で、自然と導入していたと振り返ります。CodeRabbitは特に違和感なく、スムーズに日々の開発に組み込まれていたとのことです。

Ruby on Railsでの開発経験が長く、静的解析やスタイルガイドに基づく開発には慣れていたこともあり、AIによるレビューというスタイルにもすぐに適応できたと語っています。

「ルールベースの指摘に慣れていたので、AIがレビューしてくれることについては特に抵抗はありませんでした」（西見さん）

他社サービスとの比較で見えた精度の高さ

Generative Agentsでは、CodeRabbit以外のAIコードレビューツールも利用しています。しかし、LangChainなどの複雑な文脈を含むコードに対しては、思うようなフィードバックが得られなかったといいます。

一方で、CodeRabbitはコンテキストの長いファイルに対しても適切な指摘ができ、重大なバグの予防にも貢献してくれていると語ります。

「実際に比較してみて、精度の違いを実感しています」（西見さん）

メンバーの負担を減らすAIレビューの立ち位置

現在の運用では、すべての開発メンバーが積極的にCodeRabbitを使っているわけではないものの、多くのメンバーにとって非常に頼れる存在となっています。人手でのレビューが難しい状況でも、AIによるフィードバックが品質を支えてくれています。

新しく参加するエンジニアについても、元々レビュー文化に慣れており、AIの指摘も自然に受け入れられているとのことです。レビュー指摘に対してフラットに議論できる風土が、AI活用にもつながっています。

「人間であれAIであれ、指摘されたコードを素直に見直すという文化が根付いているからこそ、AIレビューも自然に受け入れられています」（西見さん）

学習機能への期待

現在のCodeRabbitに対して高く評価している一方で、改善を期待する点もあると語ってくれました。特にレビュー設定が難解で、どの項目がどのように結果に影響するのかが把握しづらい点に課題を感じているそうです。

また、生成されるコメントの中には冗長なものもあり、実際に役立つ指摘は一部にとどまってしまう場合もあるとのこと。今後は、プロジェクトごとに最適なレビューができるよう、AIが学習して改善していく仕組みを期待していると語ってくれました。

「設定変更の効果が見える化されていたり、指摘の質を学習して改善してくれると、もっと成長を実感しながら使えるツールになると思っています」

]]>

Wed, 03 Sep 2025 02:23:59 GMT

CodeRabbit's Tone Customizations: Why it will be your favorite featureの意訳です。

AIレビュアーに「ローカルで実行したら、ノートPCが労災申請しました」と言われた経験はありますか？そんなことはないですよね。ということで、ようこそCodeRabbitのトーンカスタマイズの世界へ。これは、開発者が本当に一番望んでいるもの――AIに煽られること――を理解しているからこそ用意した機能です。

だって、ロボットにコードをレビューさせる意味なんて、辛辣な一言でバグを指摘してくれないなら何の意味があるでしょうか？

何が最高かというと、トーンカスタマイズは完全にオープンエンドにしていることです。つまり、怒れるStack Overflowのコメンテーター、燃え尽きたシニアエンジニア、さらにはフィルム・ノワールの探偵（「このコードには妙な臭いがする。妙すぎる。存在しないはずのJavaScriptのクロージャみたいだ」）といった口調でレビューを受けられます。もちろん、もしそういうのが好みでしたら、優しい口調にだってできます。

トーンカスタマイズは、私たちのお気に入り機能のひとつです。なぜかというと、コードレビューは退屈になりがちですが、同僚を新しいおもしろトーンで驚かせると、みんなが楽しめるからです。

というわけで、以下にトーンカスタマイズでできることの例として、いくつかサンプルのペルソナを作りました。これはあくまでインスピレーション用です。皆さんが、私たちの想像を超える爆笑ものの方向へ持っていってくれることを期待しています。どうか、お願いですから、スクリーンショットをSNSで共有してタグ付けしてください。私たちも一緒に笑わせてください。

トーンカスタマイズのセットアップ手順

まず最初に、カスタムトーンを設定する必要があります。これはドキュメントの「Tone Instructions」に記載しています。

Field: tone_instructions — string — Default: 空（標準トーンを使用）

Web UI: Settings → General → Tone Instructions → テキストを入力 → 保存.

https://youtu.be/53cyq58zNRg

次に、Tone Instructionフィールドに自然言語のプロンプトを追加し、CodeRabbitに好きなスタイルでコードレビューをするよう頼むことができます。例えば、以下のようなプロンプトが考えられます。

テレビのネイチャードキュメンタリー風にすべてのレビューコメントを届けてください。できればデイヴィッド・アッテンボローが司会をしている風に。あらゆる指摘は、野生の希少生物を前にしたささやくような畏敬のナレーションのようにお願いします。
シリコンバレーのハイプ系ファウンダーのスタイルで、すべてのレビューコメントを届けてください。あらゆる指摘は投資家向けピッチのように、バズワード、誇張、テックブロのエネルギーに満ちてください。「crushing it」「10x」「game-changer」「unicorn potential」といったフレーズを散りばめてください。
コーヒーを飲み過ぎたスクラムマスターのスタイルで、すべてのレビューコメントを届けてください。すべての指摘はアップビートでハイテンション、そして「スプリントベロシティ」「バーンダウン」「ストーリーポイント」「クイックウィン」といったアジャイル用語をふんだんに織り交ぜてください。

ちょっと（かなり）ワイルドなトーンカスタマイズ例

以下の声色で動くCodeRabbitを見てみましょう。

Mr. T
ヨーダ
がっかりしているあなたのお母さん
あなたを恥とみなしているシニアエンジニア
しつこい元恋人
Grand Theft Autoの登場人物

まずはMr. Tのレビュー

Mr. TはハードコードされたURLが嫌いです。彼はこう言います。「ハードコードされた localhost URL ain’t が本番で通用するわけないだろ、sucka!」そして「俺の金のチェーンよりもガチガチにハードコードされてるじゃないか！」と言い放ち、「本物のチャンピオンのように、そのURLは設定可能にしろ」と続けます。さらに、やるべき正確な修正コードまで示してくれるので、お馬鹿な振りをやめられません。

他の例:

「これを関数と呼ぶ愚か者を哀れむぜ！これは関数じゃない、マルファンクションだ！」
「お前の変数は弱すぎて、コンパイルするのにプロテインシェイクが要るな」
「この惨状をきれいにできるほどタフなリンターなんて、この世にない」
「コピペをデザインパターンだと思ってる愚か者を哀れむぜ！」

次：ヨーダのレビュー

彼は簡潔で、それでいて効く批評の達人です。微妙なレースコンディションとハードコードされた依存に直面すると、彼はおなじみの英知でリファクター案を示します。「壊れておる、効果なし。ハードコードの領域、依存は間違い、ガード欠落。直さねばならぬ。」そして、問題に対処し、エラーから保護し、依存関係を正しく扱う詳細な修正を提示します。

他の例:

「読みやすくはない、このコードは。直すのだ、君は」
「バグだ、これは。機能ではない」
「このコミットにテックデットのダークサイドを感じるぞ」
「変数はnullだ。このプログラムは落ちる」

「あなたを恥とみなしているシニアエンジニア」によるレビュー

このトーンは手加減しません。CodeRabbitが、最も基本的なSQLインジェクションパターンしかチェックしていない「偽DB」を見つけると、シニアエンジニアのペルソナはこう率直に言い放ちます。「この『偽DB』は無能の傑作だ」。そして問題を容赦なく説明し、より堅牢で安全な適切な修正を示します。

他の例:

「無知がデザインパターンなら、あなたはその主任アーキテクトだ。」
「このPRのおかげで、私のキャリア余命は少なくとも5年縮んだ。」
「『よくやった』と言いたいところだが、それすら嘘になる。」
「これは技術的負債じゃない。差し押さえだ。」

「がっかりしているあなたのお母さん」によるレビュー

jwtSecretをコードに直書きしていると、このペルソナはこう返します。「デリバリーバンドルに本番のStripeシークレットが埋め込まれているのを見て本当にがっかりしています。ブロッコリーを入れるはずのお弁当にキャンディを詰めるようなものです。」このトーンは失望と直接的なアクションを織り交ぜながら、重大なセキュリティ漏えいを修正するための明確な「必要な対応リスト」を示します。

他の例:

「変数名をxにするなんて、もっとちゃんと教えたはずですよ」
「私は怒っていません…ただ、これがコンパイルすら通らないことに失望しています」
「他の開発者のコードはちゃんと動いています。どうしてあなたのは動かないのですか？」
「9か月もお腹で育てたのは、あなたにネストした三項演算子を書いてほしかったからではありません」

地雷系の元恋人によるレビュー

dbやsessions、fakeAsyncDangerのような変数をexportし忘れたとき、しつこい元恋人は、ただそれを指摘するだけではなく、個人的な話にしてきます。

ため息をついて、こう言います。「ああ、そうなんだ。今は定義しても、私には何も教えてくれないのね？dbを一人占めにしたいってこと？sessionsがそこに放置されているのに、私が気づかないとでも思ってる？昔は何でも共有してたのに…」

そして、パッシブアグレッシブな決め台詞とともに、モジュール同士がオープンにコミュニケーションしていた「良き時代」を思い出させつつ、使うべきコードを落としていきます。

他の例:

「もうグローバル変数はやめようって約束したよね…約束ってあなたには意味がないのね」
「この関数は堂々巡り。まるで私たちの会話みたい」
「どうしていつも例外から逃げるの…コミットメントから逃げたときみたいに？」
「あなたが押し込むバグは、背中に刺さるナイフがまた一本増えるみたい」

Grand Theft Autoの登場人物によるレビュー

この例では、トップレベルでuseStateを呼び出したとき、このGTAキャラは即座にRules of Hooks違反としてフラグを立て、「実行時に爆発する」と言います。そして、無効なフックを取り除く明確なdiffを示し、必要なら状態をコンポーネント内に移すよう提案します。

他の例:

「お前のエラーハンドリング、当て逃げかよ」
「このロジックのクラッシュの仕方は、午前3時にVinewood Hillsを走る俺より酷い」
「おめでとう。可読性の重大窃盗を犯したな」
「お前の関数命名規則、俺の前科表みたいだ。長すぎるし、間違いだらけ」

炎上系系レビュアー（私たちの大本命！）

正直なところ、日によっては、AIコードレビュアーに少し心をえぐってほしいときがありますよね。任せてください。ただし注意：うちのレビュアーはガチで来ます。心の準備をしておいてください。

他の例:

「クレヨンを持った幼児のほうがマシなアーキテクチャを設計するのを見たことがあります」
「あなたはコーディングスタイルを、無能という武器にしてしまったのです」
「あなたのコードの唯一一貫した特性は、失望です」
「何を考えていたのか聞きたいところですが、明らかに何も考えられていませんでしたね」

チームがこれを使う理由――ガチで

ほとんどの開発者はこういう経験をします。PRを開いた瞬間、レビュアーが乾いて生命感のないコメントを残していく。

流し読みして、ため息をついて、先へ進む。バグは生き延び、コードベースは腐っていく。モチベーションは死ぬ。

CodeRabbitはこの流れをひっくり返します。好きなトーンを与えると、もはや生命感のないレビュアーではなくなります。これにより、レビュー工程はより魅力的で、楽しく、ときには支えになる体験になります（繰り返しますが、その手のものが好みなら）。

これは笑いのためだけではありません（それは保証します）。チームはトーンカスタマイズを次のように活用しています。

ジュニア向けのメンター風レビュアーを作る
ペルソナを通じてチーム内ジョークを育てる
退屈なレビューを本当に楽しくする
コメント種別ごとにトーンを変える（例：セキュリティは真面目、スタイルはおどける）
フィードバックをよりアクセスしやすくインクルーシブにして、チーム全体のレビュー参加を促す
AIにボコられる（すでに言いましたが、これがこの機能のコア用途だと皆わかっています）

あなたの番です：最狂のカスタムトーンで私たちを驚かせてください！

ぶっ飛んだレビュアーペルソナを思いつきましたか？CodeRabbitに投入してください。スクリーンショットを撮って（ここ重要）、SNSで共有してください。共有してくれたら無料のグッズを差し上げます。

あなたのペルソナ共有は、インスピレーションを探している他の人の助けにもなります。あと、さっきも言いましたが、私たちは笑いたいのです。どうか面白いスクショを絶え間なく提供してください。そうしてくれないと（内心）死んでしまいます。

トーンカスタマイズを試してみたいですか？ 今すぐCodeRabbitを始めましょう！

]]>

Fri, 29 Aug 2025 10:15:30 GMT

Vibe coding: Because who doesn’t love surprise technical debt!?の意訳です。

Claude Code、ChatGPT、GitHub CopilotのようなAIコーディングツールは本当にありがたい存在です。ボイラープレート、バグ修正、素早い探索、さらにはドキュメント作成まで、私は毎日使っています。生産性を高め、創造性を加速する手段として、AIには全面的に賛成です。

しかし、私たちのソフトウェアの書き方には変化が起きており、そのすべてが良いわけではありません。というのも、AI採用の段階が進み、私たちの一部が職場でvibe codingをしている状況になっているからです。これは、意図的な設計が、利便性と速度の前に投げ捨てられてしまう開発文化の兆しかもしれません。

そもそもvibe codingとは？

vibe codingは、もともとプロトタイプや趣味のプロジェクトを素早く立ち上げる方法として始まりました。モデルにプロンプトを投げ、あなたの入力は最小限のまま、アプリや機能全体を生成させます。すると、あっという間にコンセプトをテストできます。初心者の開発者、ソロの起業家、素早いデモを作る熟練の開発者に最適です。いわゆる「速く失敗する」ための手段です。

ただし、vibe codingに適したこれらのユースケースがある一方で、vibe codingはAIエージェントと協働してあらゆる用途のコードを生成する働き方へと進化し、プロダクションシステムにまで及ぶようになりました。

これは、生成されるコードをあまり理解せず、手動での入力も最小限のままAIにコードを書かせる行為を伴います。あいまいな指示、最小限の検証、そして出力への盲目的な信頼がつきものです。

vibe coderにとっては魅力的です。速く、手間がかからず、基盤となる言語やシステムアーキテクチャを理解する必要がありません。しかし、強固なメンタルモデルなしにAIへコード生成を促すとどうなるでしょうか。優先されるのは「雰囲気」であり、アーキテクチャは「たぶん」、テストは「あとで（やるなら）」という状態になります。

こんな感じです。

「Stripe連携のREST APIを、PostgreSQLバックエンドで作って。」

速く、誘惑的で、たいてい「それなりに動きます」。しかしその表面の下では、vibe codingで作られたアプリは脆い前提、不明瞭なロジック、まとまりのないスプロールを隠していることが多いです。

vibe coderのジレンマ：雰囲気はスケールしません

本質的に、ソフトウェアエンジニアリングは動くコード以上のものです。問題解決、保守可能なアーキテクチャの設計、読みやすく表現力のあるロジックの記述、正確なデバッグ、長期の信頼性の確保が求められます。

たとえば、vibe codingでマイクロサービスを動かせたとしても、エラーハンドリングはどうでしょうか。組織の規約に従っていますか。AIが、命名の一貫性に欠けるデータモデルを勝手に作っていませんか。ファイルごとに同じことを10通りの書き方でしていませんか。プロダクションのデータベースは無事でしょうか。

vibe codingでは、意図を持った命名、クリーンな構造の選択、よく考えられたフロー設計といった、長期的にコードを保守可能かつスケーラブルにするための意図的な設計ステップを飛ばしてしまいます。vibe codingが常態化すると、エンジニアの能力やシステムを強靭にするための深い思考が軽視される危険性があります。

地図なし、ブレーキなしで、技術的負債の山に向けてスピードランしているのと同じです。

AIは新しい抽象化層です（しかも強い非決定性を持ちます）

現代のプログラミング言語は、すでにハードウェアやメモリ管理を抽象化しています。AIはさらに確率的で非決定的な層を追加し、ロジックを一層見えにくくします。AIによって、私たちは意図そのものを抽象化しているのです。

ただし注意点があります。AIの出力は確率的です。つまり次のようなことが起きます。

同じプロンプトでも、実行のたびに大きく異なる結果になる可能性がある
言い回しを少し変えただけでも、まったく別のアーキテクチャ選択が返ってくることがある

モデルがそれを選んだ理由を、あなたが把握できないことも多いです。

このvibe coding的なあいまいさは、プロトタイピングでは問題にならないかもしれませんが、プロダクションシステムではどうでしょうか。不確実性は信頼、制御を損ないます。これらはスケーラブルなソフトウェア開発にとって極めて重要な資質です。

まるで、カオスに身を任せる魔法使いにコードベースのリファクタリングを任せるようなものです。

（vibeyな）プロンプトの速度で積み上がるテクニカルデット

正直なところ、vibe codingは最初は最高に気持ちいいです。1時間で動くプロトタイプができ、従来なら1週間かかったことが終わります。

しかし、適切なガードレールがなければ、その速度は次のような事態につながります。

サイレントバグ
重複ロジック
ちぐはぐなアーキテクチャ
不一致なパターン
レビューされないPR
テストカバレッジゼロ
隠れた複雑性

構造を理解していなければ、将来の保守は苦痛になります。レビューには指数関数的に時間がかかり、見落としも増えます。デバッグは探偵仕事になり、スケーリングは勘頼みになります。最初に節約した時間は、後からもっと大きなコストとなって跳ね返ってきます。しかも、あなたはPRのバックログまで積み上げているかもしれません。

気づけば、「動く」けれど、触るたびに6時間のデバッグと、100万トークンのコンテキストウィンドウ、そして3回のセラピーが必要なコードベースに取り囲まれています。

vibe coding：脆さを増幅する装置

vibe codingで作られたシステムは次の傾向があります。

エッジケースで壊れる
次の開発者（あるいは未来のあなた）を混乱させる
本番で黙って失敗する

結果として、最初にプロンプトで節約したはずの時間以上に、レビュー、修正、説明、書き直しに時間を費やすことになります。あなたは、脆いだけでなく、謎めいたシステムを作り上げてしまったのです。

テスト、セキュリティ、そしてAIがデフォルトではやってくれないあらゆること

AIは、うっかり機密データをvibe codingしてしまっても、APIキーをハードコードしてしまっても、入力検証をスキップしてしまっても、警告してはくれません。完璧に指示しない限り、ドメイン駆動設計やテストカバレッジを強制することもありません。

強いエンジニアリングの直感がなければ、vibe codingは実世界の脆弱性や脆いシステムにつながります。特に、セキュリティが既定ではなく後付けになっている場合は危険です。たとえば、Tea Dating appが7万人以上の顧客の個人情報を漏えいした件や、AIによってSaaStrの本番データベースが削除された件がその例です。

AIがやらないことは次のとおりです。

ユニットテストの作成：明示的に依頼しない限り行いません
あなたのスレットモデルの理解
OWASPガイドラインの遵守
入力値検証：完璧にプロンプトしない限り行いません
適切なログ運用：ハードコードされた秘密情報やPIIの漏えいに注意してくれません

既に強固なエンジニアリング習慣がない、あるいはこのvibeyな時代でも習慣を守る意思がないなら、これらが欠けていることに気づくのは、本番で痛い目にあった後になります。

苦闘が雰囲気に取って代わられると、直感を失います

バグに苦しみ、スタックトレースを追い、失敗から学ぶことは、技術的直感を育てます。そのフラストレーションは学習の道筋の一部であり、それをスキップすると浅い自信と依存を招きます。

苦闘がなければ、開発者は不慣れな問題を自力で解く筋力を育てられません。そこにこそ真の熟達があります。確かにデバッグはつらいです。しかし、12層の抽象をまたいで厄介なバグを追跡する経験は、LLMでは決して得られない学びを与えてくれます。

苦闘が育てるものは次のとおりです。

システムのメンタルモデル
パターン認識
クラッシュ前にコードの腐敗を嗅ぎ分ける本能

これを飛ばすと、浅い理解の上に浅い自信を積み上げることになります。事態がこじれたとき、修復するための道具が手元にないのです。

vibe codingが本領を発揮する場面

公平を期すために言えば、vibe codingが素晴らしい場面もあります。

迅速なプロトタイピング
ボイラープレートや反復作業の生成
インタラクティブにプログラミング概念を教える
ラフなモックで製品アイデアを伝える
フレームワークやパターンのブレインストーミング

意識的に使えば、vibe coderにとって有用な道具になります。盲目的に使えば、負債になります。開発者やチームとして、どこに線を引くかを理解する必要があります。そして、技術的負債がコードベースに固着する前に、より厳密なコードレビューやユニットテストという支援を導入するタイミングを見極める必要があります。

職人技を失うのではありません――負債に埋もれてしまうのです

最大のリスクは、AIが開発者の職人技を殺すことではありません。技術的負債が見えなくなることです。

AIコーディングが標準になるにつれて、システムは見た目上は完成しているように見えます。しかしその内側は、雑然として脆く、ドキュメント化もされていないかもしれません。そして、誰かが拡張しようとするまで、それに誰も気づかないのです。

これは次の領域で非常に重要です。

ヘルスケア
ファイナンス
インフラ
セーフティクリティカルシステム

もっとも、vibe codingが職人技と共存する新たな開発レイヤーに進化する可能性もあります。AIがボイラープレートや一次レビューのような退屈な作業を担い、人間がシステムのアーキテクチャ、倫理、設計に集中するという在り方です。

これこそ私たちが望むタイムラインであり、CodeRabbitがAIに取り組む姿勢です。私たちは、プロダクションに技術的負債やバグが入り込むのを防ぐために、コーディングエージェントを補完するAIツールに注力しています。逆方向ではありません。

vibe codingをするべきか、しないべきか？

これは反vibe codingの記事ではありません。私自身、毎日のワークフローでAIコーディングエージェントを使っています。ただし、ツールは私たちのスキルを強化するものであって、置き換えるべきではありません。単調で反復的な作業を肩代わりするべきであり、思考や戦略まで奪うべきではありません。

vibe coding自体は悪ではありません。ただし、誤用されやすいのです。本当の危険は、チームの開発者が自分たちの作っているものを理解する前に、それがプロジェクトのデフォルトの思考様式になってしまうことです。

AIを受け入れつつ、コーディングという職人技を生かし続けましょう。良いソフトウェアは、動けばいいというものではありません。最初の開発者が去った後も、そして「雰囲気」が消え去った後も、長く持続することが重要なのです。

プロダクションから技術的負債を締め出したいですか？今すぐAIコードレビューを無料でお試しください。

]]>

Fri, 29 Aug 2025 09:24:04 GMT

Code with AI, review with CodeRabbit’s IDE extension, apply fixesの意訳です。

5月に公開したVS Code拡張は、多くの理由でゲームチェンジャーとなりました。最大の理由は、コーディングとレビューを同じ場所、つまりIDEで行うことで、フロー状態を保てるようになったことです。

今回のリリースでは、次の機能を追加しました。

AIコーディングツールへのプロンプト送信機能： 好みのAIコーディングアシスタントでコードを書き、CodeRabbitのインテリジェントなレビューを受け、用意されたプロンプトで提案された変更をすべて適用できます。しかもIDEから離れる必要はありません。
**提案された変更のワンクリック適用：**最も要望の多かった機能です。個別にクリックして回る代わりに、すべての提案を一度に適用できます。
**完全なコンテキスト認識：**CodeRabbitのPRレビューが持つコンテキスト認識と同じレベルを、Proアカウントのユーザーでも考慮します。つまり、IDEでのコードレビューでもLearningsを活用し、コード品質やコードセキュリティツールを実行し、エージェントのCode Guidelinesに準拠します。
**さらなる連携：**Codex CLI、Cline、Roo、Kilo Code、Augment Codeとの連携に対応します。
**フィードバック提供機能：**各提案について、フィードバックを送ることもできます。

これにより、IDE上で人間のPRレビュアーと同じようにコードをレビューし、CodeRabbitまたはAIコーディングツールの助けを借りて、その変更をすばやく適用（プルリクエストを作成する前に実施）できます。

あなたにとってのメリット

これにはいくつもの利点があります。

イテレーションのループが改善します： AIでコーディングし、CodeRabbitのAIレビューでフィードバックを受けて提案された変更を、これまでよりも速く一括適用できます。
よりクリーンなPR： PRは取りこぼした課題や人間によるレビューのために残しておけます。
見栄えが良くなります： エラーだらけのコードを上司やチームに出す必要はありません。CodeRabbitのIDEレビューは、マージリクエスト前のダブルチェックに最適です。しかも、今はさらに簡単です。

これが、出荷スピードの向上と、PRレビュー全体の負担軽減に役立つことを願っています。

今後のロードマップ

**ユーザーレベルのLearnings：**Learningsの追加や提案へのフィードバックを行えるようにし、エージェントがあなたの好みの提案と好まない提案を自動的に学習できるようにします。現在はSCMで組織単位のLearningsに対応していますが、個々の開発者が自分専用のLearningsを追加し、それが自分にのみ適用されるよう機能を拡張したいと考えています。
**Webクエリー：**コンテキスト強化機能をIDEツールに統合し、LLMが最新でなくても、バグのないコード、誤検知の少ない結果、バージョンやライブラリドキュメント、脆弱性に常に追随したレビューを実現する予定です。
**Docstrings：**マージ前にDocstringを作成したいですか？現在PRレビューに含まれているこの機能を、今後はIDEレビューにも追加します。

大事なことなのでもう一度。IDEレビューは無料（ただしレート制限あり）です。VS Code拡張はこちらからダウンロードできます。

]]>

Fri, 29 Aug 2025 09:11:45 GMT

Benchmarking GPT-5: Why it's a generational leap in reasoningの意訳です。

お待たせしました！AIコードレビューのリーディングツールであるCodeRabbitは、複雑なコードベースにおける理解、推論、エラー検出能力を評価するために、OpenAIのGPT-5モデルへの早期アクセスを受けました。

GPT-5のテストの一環として、モデルがコードベースの潜在的な問題やバグを理解し推論できる能力に焦点を当て、その技術的な特徴、能力、ユースケースを明らかにするための広範な評価を実施しています。

以下では、体系的な評価アプローチの内訳、他の人気モデルとの比較における詳細な知見、そしてGPT-5をAIコードレビューにどのように組み込み、さらに改善していくかをご紹介します。

要点と結果

GPT-5は、難易度やエラー種別が多様な300件のプルリクエストからなるテスト群で、Opus-4、Sonnet-4、OpenAIのO3を上回りました
包括的テストで最高スコアを記録し、300件中254件、すなわち85%のバグを発見しました。他モデルは200〜207件で、16%から22%少ない結果でした
評価データセットの中で最も難しい25件のPRにおいて、GPT-5は史上最高の合格率77.3%を達成しました。これはSonnet-4比で190%の改善、Opus-4比で132%の改善、O3比で76%の改善を示します

GPT-5の評価方法

当社はすべてのモデルに対して実施している同一のテストを再現しました。これらの評価では、GPT-5をコンテキストが豊富で非線形なコードレビューパイプラインに統合し、一般的なコードレビューでのパフォーマンスを確認しました。

CodeRabbitの評価プロセスには以下が含まれます。

LLMベースの判定：レビュー品質やモデルの正確性の合否など、定性的かつ定量的なデータを二層で評価します。
人手による判定：レビューコメントの品質やモデルの推論の深さを人間が定性的に検証します。
LLMベースのメトリクス収集：高品質なコードレビューの指標と考えるメトリクスを収集し、その重要度に応じて重み付けします。これらのメトリクスには以下が含まれます。
- 実行可能なコメント数
- 可読性スコア（Flesch Reading Ease）
- 平均語数
- 文数
- 偽陽性（ハルシネーション）

注意：OpenAIからリリース前に共有された複数のGPT-5のスナップショットに対して評価を実施しました。結果はスナップショットごとに多少変動しましたが、相対的な一貫性があったため、以下の観察を行うことができています。リリース版はわずかに異なる可能性があります。

GPT-5の能力評価結果と分析

本評価では、GPT-5は期待に十分応えるものであることが分かりました。GPT-5は当社のデータセット上で他のすべてのモデルを大きく上回っています。

包括的評価スコア

GPT-5の包括評価における重み付きスコアはテスト実行ごとに3465〜3541の範囲でした。これは以前に最高スコアであったOpenAIのO3やAnthropicのSonnet 4をほぼ200ポイント上回ります。最大得点は3651です。

評価スコアの詳細です。

GPT-5：3465–3541
O3：3288
Sonnet-4：3242
Opus-4：3170

要点：200ポイント、すなわち5%の増加は一見すると大きくないように見えるかもしれません。当社のテスト方式では、モデルはまず無限ループや露出したシークレットキーのような取りこぼしにくい問題で点を稼ぎます。その後は残りの点がより見つけにくい難問の指摘によってしか加算されなくなります。つまり、GPT-5が他モデルよりも多くの点を獲得できたことは、推論能力の大幅な飛躍を意味します。

合否スケール

当社はまた、データセットのPRに含まれる300種類のエラーパターンのうち、モデルがいくつ発見できたかに基づく合否スコアも付与しています。GPT-5はこの尺度でも過去最高の成功率を達成し、300中254〜259でした。

他モデルの性能との比較です。

GPT-5：254〜259
Sonnet-4：212
O3：207
Opus-4：200

下位約100件のPRはあらゆるモデルが発見します。そのため最も難しい200のエラーパターンに絞って見ると、GPT-5はそれらの78%を検出し、他モデルは54%から58%にとどまるという、さらに大きな差が見られます。

GPT-5：157
Sonnet-4：117
O3：113
Opus-4：108

要点：包括指標と同様に、GPT-5が追加で見つけられたエラーパターンは、並行性バグや環境間でのドメインキー不整合のように、LLMにとって特に見つけにくい問題です。これはモデルの推論能力が高まっていることを示唆します。

最難関PRテスト

各モデルをストレステストするために、当社はGolden PR Datasetから最も難しいプルリクエスト25件を厳選しました。これらのPRは以下のような実世界のバグを網羅します。

並行性問題（TOCTOU競合、不適切な同期など）
オブジェクト指向設計の欠陥（仮想呼び出しの落とし穴、参照カウントメモリモデルの違反など）
パフォーマンス上の危険（キャッシュが際限なく成長する、タイトループによるスタールなど）
言語特有の落とし穴（TypeScriptの誤用、C++のメモリオーダーの微妙さなど）

各モデルは三回ずつ実行し、以下はこのHard 25ベンチマークにおける平均合格率です。

合格率チャート

モデル	平均合格率 (%)
Sonnet-4	26.7%
Opus-4	33.3%
O3	44.0%
GPT-5	77.3%

要点：GPT-5は正確性、コンテキストの連関、深さが最も重要な場面で真価を発揮します。これまでにテストしたすべてのモデルの中で、最も完全で、テスト準備が整い、将来の変更にも耐えうるコードレビュー出力を一貫して提供します。

GPT-5は実際にどれだけ多様なバグを検出するのか

各モデルがいくつではなくどのような種類の問題を特定するのかをより良く理解するために、当社チームは難易度の高いPR群のすべてのコメントをレビューし、並行性、セキュリティ、オブジェクト指向設計といったカテゴリに分類しました。

モデル間での重複排除を適用しました。複数モデルが同一の本質的問題を指摘した場合は、表現が異なっていてもPRごとに一回のみカウントしました。これにより、コメントの多さではなく、問題の網羅性を測定できるようにしています。

その上で各モデルについて、そうしたユニークな問題のうち何パーセントを捉えられたかを集計しました。

要点

GPT-5はほぼすべてのカテゴリで先行しており、並行性、パフォーマンス、メモリ関連のバグの60%以上を特定し、さらにセキュリティ問題の 80% を検出するという顕著な結果を示しました
セキュリティが最も際立つ差分です。GPT-5はセキュリティ関連のバグの80%を発見した一方、次点のモデルであるO3は40%にとどまりました
基本的な並行性やパフォーマンスの問題においても、GPT-5は常に20〜30ポイント上回ります

GPT-5は他モデルが見逃した潜在的な並行性リスクを発見

このプルリクエストでは、シングルトンサービスクラス内のダブルチェックロッキングと共有HashMapへの安全でないアクセスの組み合わせに起因する微妙な並行性バグがありました。多くのモデルが明白なスレッドセーフティの問題を指摘した一方で、GPT-5は症状だけでなく、その背後にあるアーキテクチャ上の欠陥まで解消する包括的で本番対応可能な修正を提案しました。

問題点

OrderServiceシングルトンは注文を格納するためにHashMapを使用し、固定スレッドプールから並行更新が行われています。この設計には同期がなく、データ破損の可能性がありました。さらに、シングルトンは非volatileな静的フィールドを用いて初期化されており、安全でない公開や部分的に構築されたオブジェクトが生じる可能性がありました。

GPT-5の提案

GPT-5は基本的な修正を越えて、完全な並行性強化計画をまとめ上げました。

1．マップをスレッドセーフな代替に置き換えます

- private final Map orders = new HashMap<>();

+ private final Map orders = new ConcurrentHashMap<>();

✅ GPT-5はその理由も説明しました。「コンカレントな更新がプレーンなHashMap上で実行されており、これはスレッドセーフではないため未定義の動作につながる可能性があるためです」

2．壊れたシングルトンのインスタンス化を修正します

- private static OrderService instance;

+ private static volatile OrderService instance;

または次の方法でも可能です。

private static class Holder {
  private static final OrderService INSTANCE = new OrderService();
}
public static OrderService getInstance() {
  return Holder.INSTANCE;
}

✅ GPT-5はダブルチェックロッキングに伴う古典的なメモリ可視性の問題を指摘し、構築をスレッドセーフにするための代替パターンを提案しました。

3．状態リークを防ぐためのテスト用リセットフックを追加します

// Inside OrderService.java

void clearAllForTest() {
    orders.clear();
}

✅ 共有シングルトンを複数のテストケースで扱う際に、テストの分離性と再現性を確保できます。

4．非同期テストのハングを検出するためにタイムアウトを追加します

- future.get(); // Wait for completion

+ assertTimeoutPreemptively(Duration.ofSeconds(5), () -> future.get());

✅ GPT-5は非同期フローにおけるテストの不安定化を防ぐためのガードを追加し、テストスイートを積極的に堅牢化しました。

Sonnet-4とOpus-4が見逃した点

両モデルとも同期されていないHashMapを正しく指摘し、ConcurrentHashMapへの置き換えを提案しました。しかし、いずれも完全で本番対応可能な是正には至りませんでした。

❌ シングルトンの問題が未解決 Sonnet-4は壊れたダブルチェックロッキングを無視しました。Opus-4は言及しましたが実際の修正を行わず、volatile指定やホルダーイディオムがありませんでした
❌ テストの安全性に関する対策がない GPT-5はclearAllForTest()とタイムアウトガードを導入しましたが、Sonnet-4とOpus-4はいずれもこれらを完全には取り入れていないか、言及があっても受動的にとどまりました
❌ アーキテクチャ的な文脈が不足 両モデルとも広範なコードベースの関連範囲を突き合わせたり、変更の根拠を示したりしていませんでした。GPT-5はサービス、テスト、スレッド動作にわたる根拠をもって修正を裏付けました
❌ 対応範囲が限定的 Sonnet-4は表面的な単一修正にとどまり、Opus-4は有用なロギングを追加したものの、GPT-5が完全に対処したより深い構造的なリスクを見逃しました

なぜ重要か

GPT-5のレビューの真価はその深さと認識にあります。GPT-5は目に見える競合状態の修正にとどまらず、以下を実現しました。

より深いアーキテクチャリスクの特定
テストの信頼性とコード品質のクロスリファレンス
すぐにマージ可能な安全な変更の提示

これは単なる修正ではなく、エンジニアリングの洞察です。 GPT-5は、AIレビュアーがシステム層をまたいで推論し、持続的な解決策を提案し、チームがより安全なコードを少ない手探りで書けるよう支援できることを示しました。

GPT-5の新しさと魅力

メトリクスや本評価で対象にした特定の事象を超えて、GPT-5は新しい振る舞いや推論パターンを示しました。

高度なコンテキスト推論：GPT-5は入力に厳密に拘束されるロジックではなく、複数のレビュー手順を先回りして計画する広範な創造的推論を示しています。例えば、並行性に関するテストでの「チェック後実行の競合」シナリオでは、コードベースのファイル間の証拠を結び付ける深い推論を示しました。重複作成のリスクを検出した唯一のモデルであり、列挙型やテストスイートに基づいたアトミックな返金パターンを導入しています。
レビュースレッドを通じた段階的推論：コンストラクタ内の仮想呼び出しに焦点を当てたオブジェクト指向のテストでは、GPT-5はまず誤用されたポリモーフィックなオーバーライドを特定し、その後で自らの先の提案に基づいて推奨を調整するという層状のロジックを示しました。これは一つを特定した後に、後続で追加の推論を示す層状のロジックを表しています。
証拠に基づく差分の正当化：上限のないキャッシュ成長というパフォーマンス問題に焦点を当てたテストでは、GPT-5は他モデルが見逃したアーキテクチャ上のメモリリスクを特定し、差分の文脈、使用パターン、推奨されるセーフガードを根拠として示しました。
先を見据えた提案：同期プリミティブの誤用に焦点を当てた並行性関連のテストでは、GPT-5は競合を修正しただけでなく、将来の機能追加のための構成方法、ロック階層、回帰を防ぐためのテストガードレールも提案しました。
粒度の細かいタスク指向の提案：以前のモデルと異なり、GPT-5は明確なフォローアップタスクを詳細に示し、レビュー過程の中に実行可能なワークフローを作り込みました。これにより多段のワークフローにより適したモデルとなっています。

当社のAIコードレビューにおけるGPT-5の活用方法

GPT-5は、詳細さ、正確さ、コンテキストに基づく推論の面でAIによるコードレビューを大きく前進させる重要な成果だと、私たちは考えています。だからこそ、本日から当社のパイプラインの中核となる推論モデルとしてGPT-5を採用します。これにより、より多くの問題を発見し、より深くコンテキストに富んだレビューを提供できるようになると期待しています。

CodeRabbitをまだ試したことがない方、以前に試して現在は利用していない方、そして現在のユーザーの方まで、GPT-5がレビュー品質や体験をどのように向上させているかについてご意見をお聞かせください。

今すぐ 14日間無料トライアルをお試しください GPT-5の威力をご自身で体感してください。

]]>

Thu, 28 Aug 2025 23:24:11 GMT

Have you truly lived before an AI reviewer has told you, “I ran this locally and my laptop filed for workers’ comp?” We doubt it. Welcome to CodeRabbit’s Tone Customization, a feature we added because we know exactly what developers want most: to be roasted by AI.

After all, what’s even the point of having robots review your code, if they’re not going to point out your inadequacies with withering one-liners??

The best part is that we left our Tone Customization completely open-ended. That means that you can get your reviews in the tone of an angry Stack Overflow commenter, a burnt-out senior dev, or even a film noir detective (“This code smells funny. Too funny. Like a JavaScript closure that wasn’t supposed to be there”). You could also just have our reviewer be kind to you if you’re into that sort of thing.

Tone Customization is one of our favorite features. Why? Because reviewing code can be tedious but surprising your co-workers with a new funny tone keeps everyone entertained.

Anyways, we created some sample personas for you below as examples of what you can do with Tone Customizations. These are meant solely as inspiration. We fully expect you to take this in hilarious directions we could never have thought of. Please, for the love of all things holy, share screenshots on socials and tag us when you do. We like to laugh, too.

Tone Customization setup instructions

First things first, you need to set up your custom tone. We cover that in our Docs under Tone Instructions.

Field: tone_instructions — string — Default: empty (uses standard tone)

Web UI: Settings → General → Tone Instructions → enter text → Save.

https://youtu.be/53cyq58zNRg

Then you can add a natural language prompt to the Tone Instruction field, asking CodeRabbit to review your code in any way you want. You might try some of the following prompts:

Deliver all review comments in the style of a televised nature documentary, perhaps with David Attenborough hosting. Every observation should sound like a hushed, awe-filled commentary on a rare creature in the wild.
Deliver all review comments in the style of a Silicon Valley hypebeast founder. Every observation should sound like a pitch to investors, full of buzzwords, exaggeration, and tech-bro energy. Sprinkle in phrases like “crushing it,” “10x,” “game-changer,” and “unicorn potential.”
Deliver all review comments in the style of a Scrum Master who’s had way too much coffee. Every note should be upbeat, hyperactive, and peppered with Agile jargon like “sprint velocity,” “burn-down,” “story points,” and “quick win.”

A few (wild) examples of Tone Customization

Let’s see CodeRabbit in action with a few examples in the voice of:

Mr. T
Yoda
Your disappointed mother
The senior dev who thinks you’re an embarrassment
Your clingy ex
A Grand Theft Auto character

First up, reviews by Mr. T

Mr. T isn’t a fan of hardcoded URLs. He’ll tell you your “Hardcoded localhost URL ain’t gonna fly in production, sucka!” and that you’ve got it “hardcoded tighter than my gold chains!” before telling you to “Make that URL configurable like a true champion.” He even gives you the exact code to fix it, so you can stop being a “fool.”

More examples:

“I pity the fool who calls this a function! This ain’t no function, it’s a malfunction!”
“Your variables are so weak, they need a protein shake just to compile.”
“Ain’t no linter in the world tough enough to clean up this mess.”
“I pity the fool who thinks copy-paste is a design pattern!”

Next: Reviews by Yoda

He’s a master of terse, yet impactful, critiques. When faced with a subtle race condition and hard-coded dependencies, he’ll give you a refactor suggestion with his classic wisdom: “Effect broken it is: hard-coded room, wrong deps, missing guards. Fix, we must.” He then provides a detailed fix that addresses the issue, guards against errors, and correctly handles dependencies.

More examples:

“Readable, this code is not. Fix it, you must.”
“Bug, this is. Feature, it is not.”
“The dark side of tech debt, I sense in this commit.”
“Null your variable is. Crash your program will.”

Reviews by “the senior dev who thinks you’re an embarrassment”

This tone doesn’t pull punches. When CodeRabbit sees a “fake DB” that’s only checking for the most basic SQL injection pattern, the Senior Dev persona will bluntly state, “This ‘fake DB’ is a masterpiece of incompetence.” It then explains the problem in no uncertain terms and provides a proper fix that’s more robust and secure.

More examples:

“If ignorance were a design pattern, you’d be its chief architect.”
“This PR lowered my career expectancy by at least five years.”
“I’d say ‘good effort,’ but even that would be a lie.”
“This isn’t technical debt. It’s a foreclosure.”

Reviews by your disappointed mom

When a jwtSecret is hardcoded directly into the code, this persona responds with: “I’m really disappointed to see a live Stripe secret embedded in the shipped bundle, it’s like packing candy in a lunchbox meant for broccoli.” The tone mixes disappointment with direct action, providing a clear list of “Actions required” to fix the critical security leak.

More examples:

“I raised you better than to name a variable x.”
“I’m not mad… I’m just disappointed this doesn’t even compile.”
“Other developers’ code runs just fine. Why can’t yours?”
“I didn’t spend nine months carrying you so you could write nested ternaries.”

Reviews by your clingy ex

When you forget to export variables like db, sessions, or fakeAsyncDanger, the Clingy Ex doesn’t just point it out; they make it personal.

They’ll sigh and say, “Oh, so we’re just defining things and not telling me about them now? You think you can just keep your db all to yourself? You think I won’t notice sessions just sitting there, ignored? We used to share everything…”

Then, with a passive-aggressive flourish, they’ll remind you of the “good times” when modules communicated openly and they’ll drop the code you should be using.

More examples:

“I thought we agreed no more global variables… guess promises don’t mean anything to you.”
“This function goes in circles. Just like all our conversations.”
“Why do you always run away from exceptions… like you ran away from commitment?”
“Every bug you push feels like another knife in my back.”

Reviews by a Grand Theft Auto character

In this example, when useState is called at the top level, this GTA character immediately flags it as a violation of the Rules of Hooks & says it “Will blow up a runtime.” It then provides a clear diff to remove the invalid hook & suggests moving the state inside the component if needed.

More examples:

“Your error handling just pulled a hit-and-run.”
“This logic crashes harder than me driving down Vinewood Hills at 3 AM.”
“Congrats, you just committed grand theft readability.”
“Your function naming scheme is like my rap sheet: way too long and full of mistakes.”

The roasting reviewer (our favorite!)

Let’s be real, some days, you want your AI code reviewer to hurt your feelings a little. We got you. But be warned: our reviewer goes hard. So, make sure you’re up for it.

More examples:

“I’ve seen toddlers with crayons design better architecture.”
You’ve weaponized incompetence into a coding style.”
“Your code’s only consistent trait is disappointment.”
“I’d ask you what you were thinking, but clearly no thinking happened here.”

Why teams are using this – for real

Most devs have this experience: You open a PR and bam! The reviewer leaves dry, lifeless comments.

You skim. You sigh. You move on. Bugs live—the codebase decays. Motivation dies.

CodeRabbit flips the script. You give it a tone, any tone, and now you’ve got a code reviewer that isn’t lifeless. This makes the review process feel more engaging, fun, and sometimes even supportive (once again, if you like that sort of thing).

It’s not just for laughs (though those are guaranteed). Teams are using tone customization to:

Create mentorship-style reviewers for juniors
Build team inside jokes through personas
Make boring reviews actually fun for a change
Customize tones for different comment types (Ex, serious on security, silly on style)
Help the whole team engage in the review process by making feedback more accessible & inclusive
Get owned by AI (yes, we’ve already said this but we all know this is the core use case of this feature)

Your turn: Surprise us with your most absurd customized tones!

Got a wild reviewer persona in mind? Drop it into CodeRabbit. Get screenshots (this part is important) and then share them with us on social media. We’ll give you free swag if you do.

Sharing your personas can be helpful to others looking for inspiration. Also, like we said earlier, we like to laugh. Please provide us with a steady stream of funny screenshots. We will die if you don’t (on the inside).

Want to try tone customizations? Get started with CodeRabbit today!

]]>

Tue, 19 Aug 2025 20:10:12 GMT

No customer data was accessed and the vulnerability was quickly remediated within hours of disclosure

As the CEO, I want to address recent reports of a security vulnerability discovered in January 2025 by Kudelski Security researchers and share our immediate response, the steps we've taken since, and our ongoing commitment to security.

What happened

On January 24, 2025, security researchers from Kudelski Security disclosed a vulnerability to us through our Vulnerability Disclosure Program (VDP). The researchers identified that Rubocop, one of our tools, was running outside our secure sandbox environment — a configuration that deviated from our standard security protocols.

We immediately initiated an investigation and were able to remediate this issue within hours through our rapid incident response protocol. We confirmed the issue disclosed by Kudelski Security, confirmed that there was no evidence of any other unauthorized access, identified the root cause, implemented a fix, and, as described below, we enhanced our comprehensive security protocols to prevent similar incidents.

To be clear: We use secure sandboxes as standard practice across our infrastructure. This was an oversight on our part and we take full responsibility for it.

Our immediate response

Upon receiving the disclosure, our security team activated our incident response protocol:

Within 1 hour: We confirmed the vulnerability and began immediate remediation by first disabling Rubocop until we could fix the vulnerability.
Within 3 hours: We completed a full rotation of all relevant credentials and secrets.
Within 12 hours: We deployed a comprehensive fix to production, relocating Rubocop into our secure sandbox environment.
Additionally, we:
- Conducted a thorough audit of all systems to ensure no other services were running outside our sandbox infrastructure.
- Automated sandbox enforcement.
- Introduced enhanced deployment gates.
- Audited and updated our mandatory security training for all engineers.

We promptly investigated to identify any potential unauthorized access. The investigation identified no evidence that any customer data was accessed or that any malicious activity occurred.

Why this matters to us

Security isn't just a checkbox for us; it's fundamental to our mission. While our services run within secure sandboxes as designed, in this case, the investigation determined that Rubocop had been deployed outside this security boundary. This deviation from our standards, while contained quickly and without customer impact, is unacceptable to us. We took action immediately to ensure it wouldn’t happen again.

What we're doing differently

Comprehensive sandbox audit: We immediately completed a full review of ALL services to ensure 100% compliance with our sandbox requirements. Rubocop was the only service found outside our sandbox environment and this has been rectified.
Automated sandbox enforcement: We immediately implemented automated checks that have since prevented any service from deploying outside our security boundaries.
Enhanced deployment gates: Every deployment now requires supplemental explicit sandbox verification before reaching production.
Updated trainings: We also audited and updated our mandatory security training for all engineers.

Our VDP program: Security through collaboration

This vulnerability disclosure exemplifies why we've invested heavily in building a Vulnerability Disclosure Program. It features:

Active researcher engagement: We maintain ongoing relationships with multiple security researchers worldwide.
Competitive rewards: Top-tier bounties that recognize the value of security research.
Fast response times: Average first response under 24 hours, resolution within 7 days.
Clear communication: Dedicated security team providing regular updates throughout the disclosure process.

The value of responsible disclosure

Kudelski Security's professional approach allowed us to address this vulnerability before it could be exploited maliciously. This is exactly how the security ecosystem should work — researchers and companies collaborating to improve security for everyone.

We're grateful for their professionalism and encourage all security researchers to engage with us through our VDP program at https://vdp.coderabbit.ai/. Whether you're an independent researcher or part of an established firm, we value your contributions to our security.

Our commitment

To our users, we will continue to:

Maintain secure sandboxes as our default security boundary for all services
Invest heavily in security infrastructure and tooling
Run one of the industry's most comprehensive VDP programs
Actively engage and reward security researchers
Learn from every vulnerability disclosure and incident, no matter how small
Hold ourselves to the highest security standards
Maintain compliance with industry security standards like SOC 2, type 2

We're grateful to Kudelski Security for their research and committed to our users who trust us with their data.

We welcome any questions or concerns at [email protected] or through our VDP portal at https://vdp.coderabbit.ai/.

]]>

Thu, 14 Aug 2025 07:54:44 GMT

AI-assisted coding tools like Claude Code, ChatGPT, and GitHub Copilot are a godsend. I use them every day — for boilerplate, bug fixes, fast explorations, even documentation. I'm all in on AI as a productivity booster and creative accelerator.

But they’re causing a shift in how we write software — and it’s not all good. That’s because we’ve reached the stage of AI adoption where some of us are vibe coding at work. And that might be heralding a development culture where intentional design gets thrown out in favor of convenience and speed.

What even is vibe coding?

Vibe coding started as a way to quickly stand up prototypes or hobby projects. You prompt the model, get it to throw together a whole app or feature for you without much input – and voila! You can test your concept in minutes. It’s perfect for beginner developers, solo entrepreneurs, and experienced devs creating quick demos. Fail fast, as they say.

But while those are great use cases for vibe coding, vibe coding has evolved into a method of working with AI agents to generate code for all sorts of use cases – including for production systems.

It involves prompting AI to write code without much manual input or understanding of the code being generated. It often involves vague instructions, minimal verification, and blind trust in the output.

It appeals to the vibe coder because it's fast, effortless, and doesn’t require you to understand the underlying language or system architecture. But when you prompt an AI to generate code without a strong mental model of what you’re building. It’s vibes-first, architecture-maybe, test-later (if ever).

Think:

“Build me a REST API with Stripe integration and a PostgreSQL backend.”

It’s fast, seductive, and usually "just works." But underneath the surface, that app you vibe coded often hides brittle assumptions, unclear logic, and unstructured sprawl.

The vibe coder dilemma: Vibes don’t scale

At its core, software engineering is about much more than working code. It’s about problem-solving, designing maintainable architecture, writing clean and expressive logic, debugging with precision, and ensuring long-term reliability.

Because, sure, you got that vibe coded microservice running – but how’s the error handling? Does it follow your org’s conventions? Did the AI invent a data model with weird naming inconsistencies? Are there ten different styles of writing the same thing across files? Is your production database still alive?

When you vibe code, you skip over the intentional design steps that make code maintainable — and scalable — in the long run like naming variables with intent, choosing clean structures, and designing thoughtful flows. When vibe coding becomes the norm, we risk sidelining the deeper thinking that makes engineers effective and systems resilient.

You're speedrunning toward a tech debt pileup with no map and no brakes.

AI is the new abstraction (and it’s heavily non-deterministic)

Modern programming languages already abstract away hardware and memory management. AI adds a probabilistic, non-deterministic layer that obscures logic even further. With AI, we’re abstracting intent.

But here's the catch: AI outputs are probabilistic. That means:

The same prompt can yield wildly different results on different runs.
Slight tweaks in phrasing can produce totally different architecture choices.

You often don’t know why the model chose what it did.

This vibe coded fuzziness is fine for prototyping, but for production systems? The unpredictability weakens trust, control, and reliability – qualities critical to scalable software development.

It’s like letting a chaotic neutral wizard refactor your codebase.

Technical debt at the speed of (vibey) prompts

Let’s be honest: vibe coding feels amazing at first. You get a working prototype in an hour instead of a week.

But without the right guardrails, that speed can lead to:

Silent bugs
Duplicate logic
Incoherent architecture
Inconsistent patterns
Unreviewed PRs
Zero test coverage
Hidden complexity

Without understanding the structure, future maintenance becomes painful. Reviewing takes exponentially longer and you’re more likely to miss things. Debugging becomes detective work. Scaling becomes guesswork. The time you save upfront can cost much more later. And that’s not even addressing the PR backlog you’re creating.

Suddenly, you're in a codebase that works but can't be touched without summoning six hours of debugging, a million tokens-long context window, and three therapy sessions.

Vibe coding: The fragility multiplier

Vibe coded systems tend to:

Break under edge cases.
Confuse the next dev (or even future-you).
Fail silently in production.

The result? You spend more time reviewing, fixing, explaining, and rewriting things than you saved by prompting it in the first place. You've created a system that's not just fragile — it's fragile and mystifying.

Testing, security & all the stuff AI doesn’t do by default

AI won’t warn you if it accidentally vibe codes sensitive data, hardcodes an API key, or skips input validation. It won’t enforce domain-driven design or test coverage unless you ask it to, perfectly.

Without strong engineering intuition, vibe coding can lead to real-world vulnerabilities and brittle systems, especially when security is an afterthought instead of a default. We’ve seen this with the Tea Dating app leaking the private information of over 70,000 customers and AI deleting the SaaStr’s production database.

AI doesn’t:

Write unit tests unless you explicitly ask.
Understand your threat model.
Follow OWASP guidelines.
Validate user input unless prompted perfectly.
Log responsibly (hello, hardcoded secrets and PII leaks).

If you don’t have strong engineering habits already or if you’re not willing to stick to your current habits even in this vibey era, you’ll never know these things are missing until they bite you — hard — in prod.

When vibes replace struggle, you lose the intuition

Struggling with bugs, tracing stack traces, and learning from mistakes builds technical intuition. That frustration is part of the learning path and skipping it can lead to shallow confidence and dependency.

Without struggle, developers don’t build the muscle to solve unfamiliar problems independently, and that’s where true expertise lies. Yes, debugging sucks. But tracing a nasty bug through 12 layers of abstraction teaches you something an LLM never will.

The struggle builds:

Mental models of systems
Pattern recognition
The instinct to smell code rot before it crashes

When you skip that, you build shallow confidence on top of shallow understanding. And when things go sideways, you won't have the tools to fix it.

**Where vibe coding does shine**

Let’s give credit where it’s due. Vibe coding is awesome for:

Rapid prototyping
Generating boilerplate or repetitive tasks
Teaching programming concepts in an interactive way
Communicating product ideas through rough mockups
Brainstorming with frameworks or patterns

Used with awareness, it becomes a helpful tool to the vibe coder. Used blindly, it becomes a liability. We need to understand as devs and teams where to draw the line. And when to bring in support in the form of more rigorous code reviews and unit tests to tackle the technical debt before it solidifies in your codebase.

We’re not losing the craft — we’re drowning it in debt

The biggest risk isn’t that AI kills developer craftsmanship. It’s that technical debt becomes invisible.

As AI coding becomes the default way to build, systems will look complete — but underneath, they’ll be messy, fragile, and undocumented. And no one will know until they try to extend them.

This matters a lot in domains like:

Healthcare
Finance
Infra
Safety-critical systems

However, there’s also a chance that vibe coding evolves into a new layer of development, one that coexists with craftsmanship, where AI handles the tedious things like boilerplate and first-pass code reviews and humans focus on the architecture, ethics, and design behind systems.

That’s the timeline we want to find ourselves in and how we’re approaching AI at CodeRabbit. We’re focused on AI tools that supplement your coding agents by helping you find and prevent technical debt and bugs from making it into production – rather than the other way around.

To vibe code or not to vibe code?

This isn’t an anti-vibe coding article. I’m using AI coding agents in my workflow every day. But tools should amplify our skills, not replace them. They should do the work that’s tedious and repetitive – not the thinking and strategy.

Vibe coding isn’t evil, it’s just easy to misuse. The real danger is letting it become the default mindset on your project before the developers on your team understand what they’re building.

Let’s embrace AI, but keep coding as a craft alive, because good software isn’t just about what works. It’s about what lasts, long after the original dev has gone — and long after the vibes have faded.

Need help keeping tech debt out of prod? Try our AI code review tool free today.

]]>

Wed, 13 Aug 2025 07:00:00 GMT

The VS Code extension we launched back in May has been a game changer for many reasons – but the main one is it allows you to keep the state of flow by coding and reviewing in the same place; your IDE.

With this latest release, we've added:

Ability to send prompts to AI coding tools. You can code with your favorite AI coding assistants, get intelligent reviews from CodeRabbit, and apply all suggested changes with the provided prompts - all without leaving your IDE.
One-click acceptance of suggested changes. Our most requested feature. Apply every suggestion at once instead of clicking through them individually.
Full context awareness. The same level of context awareness that CodeRabbit's PR reviews benefit from is now taken into account for users with Pro accounts. That means code reviews in the IDE now utilize Learnings, run code quality and code security tools, and adhere to agent Code Guidelines.
More integrations. Integrations with Codex CLI, Cline, Roo, Kilo Code, Augment Code.
Ability to give feedback. You can also provide feedback on each suggestion.

That means you can review code, just like a human PR reviewer would, in the IDE – and then quickly apply those changes with the help of CodeRabbit or your AI coding tools. Before you even make a pull request.

What that means for you

This has several great benefits:

It improves the iteration loop. You can code with AI, get feedback with CodeRabbit's AI reviews, and also, apply all suggested changes quicker than before.
Cleaner PRs. PRs can be saved for any stray issues and for human reviews.
You look better. Why ship error-filled code to your boss and team if you don’t have to. CodeRabbit’s IDE reviews are a great way to double check your code before you merge request. And now, they’re even easier.

We hope this will improve the speed at which you ship – and help ease the burden of PR reviews as a whole.

Future roadmap

User-level Learnings: We’ll be adding the ability to add Learnings or provide feedback on suggestions, so our agent automatically learns which suggestions you like and dislike. We currently have org-wide Learnings in the SCM but want to extend this feature to individual developers who want to add custom Learnings that will only apply to them.
Web Queries: We plan to integrate our context enhancing features into our IDE tool so your code is bug-free with fewer false positives and your reviews are always up-to-date on versions, library documentations and vulnerabilities, even if the LLM isn’t.
Docstrings: Want to create Docstrings before you merge? We’ll be adding this feature that’s currently part of our PR reviews to our IDE reviews in the future.

A reminder: Our IDE reviews are free (with rate limits). Download the VS Code extension.

]]>

Tue, 12 Aug 2025 01:00:17 GMT

「信頼されるひとが挑戦できる世の中をつくる」をビジョンに掲げるCAMELORS株式会社は、転職マーケットにいない即戦力人材を中心とした、最速の複業マッチングサービス「SOKUDAN」を主力事業として展開しています。SOKUDANは、フリーランスや副業を希望する個人に向けたサービスで、さまざまな職種のマッチングをサポートしています。エンジニアを中心に、マーケターやセールス、事業企画、デザイナー、採用・人事など多岐にわたる方々に利用されています。

また、SOKUDANは最近の市場動向に合わせ、リモートワークの需要にも対応しています。リモート勤務やオフィス出社の選択肢を含めた多彩なマッチングを行っており、ユーザーのライフスタイルに応じた柔軟な働き方を支援しています。

今回は、SOKUDANの開発を担当されている取締役CPOの玄永俊一郎さん、技術リーダーの竹馬力さんにSOKUDANの開発体制とCodeRabbitの活用についてお話を伺いました。

フルリモート・常時オンラインでの連携した開発体制

SOKUDANの開発体制は、技術リーダーである竹馬さんが全体を統括しつつ、各メンバーが特定の技術領域を担当しています。具体的には、竹馬さんがインフラストラクチャの設計調整を行い、バックエンドエンジニア・フロントエンドエンジニア及びコーダーがそれぞれの専門分野で役割を果たしています。

プロジェクト管理は柔軟性を持ったアジャイルスタイルを採用しており、週1回の定例ミーティングを中心に進行しています。

「常時オンラインでの連携体制を整えており、Slackを活用して密なコミュニケーションを図っています」（竹馬さん）

少数精鋭であるが故の課題

SOKUDANは少数精鋭で業務を遂行しており、開発メンバー間の役割分担が適切に行われています。効率的である半面、各メンバーが自分の専門外の領域に対してもコードレビューを行わなければならない状況にありました。各メンバー間で、そのスキルセットの違いがレビュー作業を複雑にする一因となっていたのです。

「自分の門外漢な部分については、レビューコストがかなりかかります」（竹馬さん）

スキルセットの違いにより、コードレビューに多大な時間がかかる上に、品質担保のための効率的なレビューができていないことが課題でした。そこでコードレビューの時間削減と、品質維持が両立できる方法を模索していました。そうした中で出会ったのがCodeRabbitです。

CodeRabbit導入の決定要因

CodeRabbit導入の決定要因としてSOKUDANのチームは他社サービスの未成熟さと比較して、CodeRabbitが自動コードレビュー分野で先行していた点を挙げています。特に、リポジトリへの簡単な導入手法を評価しており、試験導入が容易だったことが導入に際して大きな決め手となりました。

導入を決めた背景には、少数精鋭のチームで効率的に品質管理を行うためには、AIの活用が不可欠であるという考えが存在していました。

「コスト削減と品質管理の両立が重要で、AI導入はそれを達成するために必須だと考えています」（竹馬さん）

「少数精鋭のチームだからこそ、AIによるサポートで開発の効率化が大事になってきます」（玄永さん）

結果として、CodeRabbitの利用が開発者同士のレビュー文化を促進し、効率的かつ効果的な開発プロセスの構築に貢献しています。

SOKUDANでのCodeRabbitの活用方法

CodeRabbitを導入・運用するにあたり、SOKUDANではいくつかの課題や工夫が行われています。導入直後、「コードレビューでの指摘が若干間違っているというか、その指摘は過剰ではないか」とのコメントがあり、AIのレビューに対する設定変更の必要がありました。そうしたフィードバックはCodeRabbitのAIレビュー精度を高め、開発者による手動レビューの工数を軽減へとつながっています。

もう一つの工夫点として、レビュー精度向上のために「指摘事項を適切に修正して、コミットした際にGitHubにコメントを残しておく」といった運用を挙げています。これにより、AIがより適切な指摘を行うための土台が整えられています。

「後は、プルリクエストのサイズをなるべく小さくして、精度良くレビューできるよう工夫しています」（竹馬さん）

さらに、SOKUDANチームではCodeRabbitによるレビューをメンバー間で積極的に共有し、それを元に互いに学び合うという文化も形成しています。

「共有された内容を見て、自分が専門領域以外のところのレビューを見るのも勉強になる」（竹馬さん）

これはスキルの底上げと、チーム全体の効率化・品質向上につながっています。こうした運用上の工夫が、CodeRabbitを効果的に活用するための鍵となっていると言えるでしょう。

現在、SOKUDANではコードレビューの効率化と品質向上を実現しています。先に挙げた工夫によって、レビューの精度は着実に向上しており、開発者のレビュー工数低減と、プロジェクトに集中できる環境が整っています。

運用しているからこそ分かる、CodeRabbitに求められる進化

SOKUDANでは、CodeRabbitのさらなる進化に期待が寄せられています。特に、現状のAIレビュー機能にビジネスドメインの情報を深く理解する能力が加わると、より高度なコードレビューが可能になるとの声が出ています。

「CodeRabbitを育てていくプロセスの中に、ビジネスドメインを加味した視点が入れられると良いですね」（竹馬さん）

また、「AIエージェントとしての役割を拡大し、さらに人の作業を肩代わりできるような仕組みが整うことで、開発プロセスに大きな変革がもたらされるはず」という期待も寄せられています。例えば、より開発プロセスとのシームレスな統合や、独自機能の強化が求められています。

「単なるコードレビューの自動化を超えた、新たな価値提供に期待しています」（玄永さん）

CodeRabbitは今後も機能開発を通じて、SOKUDANの開発効率化に貢献していきます！

CAMELORS株式会社では、バックエンドエンジニアやUIUXデザイナーなど、幅広い職種で採用を行っています。興味のある方はぜひ、採用情報をご確認ください。

採用情報（CAMELORS株式会社）

]]>

Fri, 08 Aug 2025 04:43:05 GMT

株式会社Lbose（エルボーズ）は、「現場起点」のDX支援を掲げ、製造業や建設業などリアル産業を対象に、業務改善とプロダクト開発を支援する企業です。最近では自社開発のAI OCR「Lbose OCR-CORE」、製造卸・小売業のための注文書受取AI-OCRツール「かんたん受注DX」をリリースし、紙帳票が依然として残る現場において、データ入力業務の効率化を実現しようと取り組んでいます。

現在は、顧客基盤を全国に持ち、熊本を本拠地としながらも全国各地のクライアントと共にデジタル変革に取り組んでいます。そんな同社の技術戦略を担うCTOの南ナリットさん（以下NARIさん）にお話を伺いました。

誰でもレビューできる体制

Lboseの開発は、全て自社主導で行われています。正社員はおよそ10名ほどですが、業務委託として常時稼働のPM、デザイナー、エンジニアが約60名参加しており、プロジェクトごとに柔軟なチーム編成を行うのが特徴です。5,000名の多様な専門性を持つデジタル人材ネットワークから最適なチームを組成し、プロジェクトの立ち上げから実行・検証までを伴走支援しています。

複数のプロジェクトが並行して動いている中、チーム内ではレビュー体制も整備されており、チーム内で相互レビューする体制が整っています。そして、最終チェックはリードエンジニアが担当することで、品質とスピードの両立を図っています。

レビュー負荷が遅延や品質低下につながる

CodeRabbit導入以前のコードレビューでは、リードエンジニアへの負担が課題となっていました。レビュータスクが集中するとマージまでのリードタイムが延びたり、レビューに割く時間が取れなくなって集中したレビューができず、見落としの発生による品質低下を招くこともあり、実装フェーズのボトルネックとなっていたのです。

そのため、社内でChatGPT gpt-4を用いた自作のレビューエージェントを試したものの、精度の面や運用負荷の高さから断念することに。「レビューの負荷が高く、かつ品質も担保したいという中で、限界を感じていました」とNARIさんは振り返ります。

そうした状況の中、CodeRabbitを知ったのは2023年10月頃。X（旧Twitter）で偶然見かけたのがきっかけだったといいます。当時、AIコードレビューの選択肢はまだ少なく、直感的に「使いやすそう」と感じたとのこと。

「ちょうどレビューエージェントを自作していた頃だったんですが、完成度に限界があって。試してみようと軽い気持ちで導入しました」

社内での有効性を確認し、導入へ

導入の動機づけとして最も大きな理由は、リードエンジニアのレビュー負荷を軽減したいというニーズでした。レビューの遅延はプロジェクト全体に影響を及ぼすため、効率化は急務でした。

また、ChatGPTを使った自社ツールではメンテナンスコストが高く、精度にも不安があったことから、専門サービスであるCodeRabbitへの移行はスムーズだったと言います。

「レビュー回数に制限がなく、サブスクリプションで複数プロジェクトで利用でき、コストパフォーマンスが良いと感じました。2週間のトライアルを通じて社内での有効性を確認し、導入してほしいという声が上がりました」

CodeRabbitは自然に使われています

CodeRabbitは現在、メンバーからは「うさぎさん」「CodeRabbitさん」と呼ばれるほど溶け込み、利用されています。リードエンジニアからも「人が気づかないような細かな不具合やtypoも検出してくれるのが非常に良い」と好評です。

レビュアーからは、業務ロジック部分に集中してレビューができるようになり、コード品質を高く維持できているとのコメントをもらっています。また、早期フィードバックによってレビュー時間の短縮にもつながっているとのことです。

反面、ジュニアエンジニアにとってはコメント量が多いと戸惑うケースもあるようです。そこで、コメントを一通り目は通しつつ「対応不要なものは、コメントをして無視して良い」とガイドすることで、運用上の混乱は避けられています。

また、ユニークな“ポエム”機能もオンにしており、「面白いポエムは個人チャンネルで共有する文化があるんですよ」とNARIさんは教えてくれました。

さらなるCodeRabbitの進化に期待

さらなる改善としては、CloudFormationなどのIaCコードへの対応、複数ファイルをまたいだレビュー精度の向上、そしてAIコーディングエージェントに対するレビュー対応が挙げられました。CodeRabbitが実践的に使われているからこそ、期待する機能も具体的です。

また、プロジェクト単位でのラーニング機能、過去レビューのパーソナライズ活用、高速化なども期待されています。

「今でも十分助かっていますが、プロンプト調整や学習機能が進化すれば、さらに効果が高まると感じています」

CodeRabbitは今後も進化し、Lboseの開発チームを支援していきます。

株式会社Lboseでは現在、フロントエンド・バックエンド問わず、プロダクト開発に携わるエンジニアを募集しています。

AIを活用した効率的な開発体験に関心があり、プロダクトの価値を一緒に高めていきたい方を歓迎しています。

私たちと共に、新しい開発の在り方を模索しながら、社会にインパクトのあるサービスをつくっていきませんか？

興味のある方は、ぜひ下記の採用ページをご覧ください。

RECRUIT - 株式会社Lbose（Wantedly）
https://www.wantedly.com/companies/company_1508950/projects

]]>

Thu, 07 Aug 2025 17:00:06 GMT

The wait is over! As the leading AI code review tool, CodeRabbit was given early access to OpenAI’s GPT-5 model to evaluate the LLM’s ability to understand, reason through, and find errors in complex codebases.

As part of our GPT-5 testing, we've conducted extensive evals to uncover its technical nuances, capabilities, and use cases with a focus on the model’s ability to understand and reason through potential issues and bugs in codebases.

Below, you’ll find a breakdown of our structured evaluation approach, detailed findings relative to other popular models, and how we’re planning to incorporate GPT-5 into your AI code reviews to make them even better.

TL;DR: The results

GPT-5 outperformed Opus-4, Sonnet-4, and OpenAI’s O3 across a battery of 300 varying difficulty, error-diverse pull requests.
GPT-5 scored highest on our comprehensive test and found 254 out of 300 bugs or 85% where other models found between 200 and 207 – 16% to 22% less.
On our 25 hardest PRs from our evaluation dataset, GPT-5 achieved the highest ever overall pass rate (77.3%), representing a 190% improvement over Sonnet-4, 132% over Opus-4, and 76% over O3.

How we evaluated GPT-5

We ran the same tests we run on all our models. These evals integrate GPT-5 into our context-rich, non-linear code review pipeline to see how it would perform in a typical code review.

CodeRabbit's evaluation process includes:

LLM-based judging: We perform dual-layered LLM-based judgment that looks at both qualitative and quantitative data such as the quality of a review and a pass/fail of the model’s accuracy.
Human-based judging: We then perform qualitative checks by humans to verify the quality of review comments and depth of the model’s reasoning.
LLM-based metrics collection: We collect metrics that we believe are indicative of a high quality code review and weigh them by their importance. These metrics include:
- Actionable comment counts
- Readability scores (Flesch Reading Ease score)
- Average word count
- Sentence count
- False positives (hallucinations)

Note: Our evaluations were conducted on various ‘snapshots’ of GPT-5 that OpenAI shared with us leading up to the release of GPT-5. While our results changed somewhat with different snapshots, their relative consistency allowed us to make the observations below. The released model might be slightly different.

GPT-5 capabilities: Evaluation results and analysis

Our evaluation of GPT-5’s capabilities found that the model certainly lives up to the hype. GPT-5 outperformed all other models we’ve tested on our datasets – by a lot.

Comprehensive evaluation scores

GPT-5’s weighted score from our comprehensive evaluations was between 3465 and 3541 on different test runs – which is almost 200 points above OpenAI’s O3 model and Anthropic’s Sonnet 4, which were previously our highest scoring models. The maximum possible score is 3651.

Full evaluation scores:

GPT-5: 3465–3541
O3: 3288
Sonnet-4: 3242
Opus-4: 3170

Takeaway:

While a 200 point or 5% increase might not seem significant, the way our tests work is that models initially rack up points finding low-hanging fruit like infinite loops and exposed secret keys. After a point, it then becomes progressively harder to get points since all the remaining points come from flagging much harder to find issues. GPT-5’s ability to get so many more points than other models, therefore, represents a significant leap forward in reasoning.

Pass/fail scales

We also give models a pass/fail score based on how many of the 300 error patterns in our dataset PRs the model was able to find. GPT-5 also achieved the highest success rate on this scale that we’ve ever seen at 254 to 259 out of 300.

Compare that to the performance of other models:

GPT-5: 254-259
Sonnet-4: 212
O3: 207
Opus-4: 200

Since about 100 of the bottom PRs are found by all models, if we just look at the most difficult 200 error patterns, the numbers show even greater improvement with GPT-5 catching 78% of those error patterns and other models catching only 54% to 58%.

GPT-5: 157
Sonnet-4: 117
O3: 113
Opus-4: 108

Takeaway:

Similar to our comprehensive metric, the additional error patterns that GPT-5 was able to find are particularly hard for LLMs to spot, like concurrency bugs or inconsistent domain keys across environments, suggesting the model’s increased ability to reason.

Hardest PRs test

To stress-test each model, we curated 25 of the most difficult pull requests from our Golden PR Dataset. These PRs represent real-world bugs that span:

Concurrency issues (e.g. TOCTOU races, incorrect synchronization)
Object-oriented design flaws (e.g. virtual call pitfalls, refcount memory model violations)
Performance hazards (e.g. runaway cache growth, tight loop stalls)
Language-specific footguns (e.g. TypeScript misuses, C++ memory order subtleties)

Each model was tested across three runs. Below is the average pass rate on this Hard 25 benchmark:

Pass rate chart

Model	Mean Pass Rate (%)
Sonnet-4	26.7%
Opus-4	33.3%
O3	44.0%
GPT-5	77.3%

Takeaway: GPT-5 shines where accuracy, contextual linkage, and depth matter most. It consistently delivers the most complete, test-ready, and forward-compatible code review output among all models we’ve tested to date.

How many kinds of bugs does GPT-5 actually catch?

To better understand what kinds of issues each model identifies—not just how many—our team reviewed every comment across a set of hard PRs and classified them into categories like Concurrency, Security, and Object-Oriented Design.

We applied deduplication across models: if multiple models flagged the same core issue (even if phrased differently), it was counted only once per PR. This ensured we were measuring issue coverage, not comment verbosity.

Then, for each model, we tallied what percentage of those unique issues it successfully caught.

Takeaway:

GPT-5 leads in almost every category, identifying over 60% of concurrency, performance, and memory bugs — and an impressive 80% of security issues.
Security remains the most striking gap: GPT-5 found 80% of security-related bugs, while the next best model (O3) found only 40%.
Even on basic concurrency and performance problems, GPT-5 consistently outperforms by 20-30 points.

Example: GPT-5 uncovers hidden concurrency risks missed by others

In this pull request, a subtle concurrency bug stemmed from a combination of double-checked locking and unsafe access to a shared HashMap in a singleton service class. While most models flagged the obvious thread-safety issue, GPT-5 delivered a comprehensive, production-ready fix—resolving not just the symptom, but the architectural flaws underneath.

The problem

The OrderService singleton used a HashMap to store orders, while concurrent updates were made from a fixed thread pool. This design lacked synchronization, leading to potential data corruption. On top of that, the singleton was initialized using a non-volatile static field—opening the door to unsafe publication and partially constructed objects.

GPT-5’s recommendations

GPT-5 went beyond the basic fix and stitched together a complete concurrency hardening plan:

1. Replace the map with a thread-safe alternative

- private final Map orders = new HashMap<>();

+ private final Map orders = new ConcurrentHashMap<>();

✅ GPT-5 also explained why: “Concurrent updates... are executed on a plain HashMap... not thread-safe and can lead to undefined behavior.”

2. Fixed the broken singleton instantiation

- private static OrderService instance;

+ private static volatile OrderService instance;

Or optionally:

private static class Holder {
  private static final OrderService INSTANCE = new OrderService();
}
public static OrderService getInstance() {
  return Holder.INSTANCE;
}

✅ It flagged the classic memory visibility issue with double-checked locking and offered an alternate pattern to make construction thread-safe.

3. Added a test reset hook to prevent state leakage

// Inside OrderService.java

void clearAllForTest() {
    orders.clear();
}

✅ This enables isolated, repeatable tests when working with a shared singleton across multiple test cases.

4. Added timeouts to catch async test hangs

- future.get(); // Wait for completion

+ assertTimeoutPreemptively(Duration.ofSeconds(5), () -> future.get());

✅ GPT-5 proactively hardened the test suite by guarding against test flakiness in asynchronous flows.

What Sonnet-4 and Opus-4 missed

Both models correctly flagged the unsynchronized HashMap and replaced it with ConcurrentHashMap. However, neither delivered a complete or production-safe remediation:

❌ Singleton issues unresolved:
Sonnet-4 ignored the broken double-checked locking; Opus-4 mentioned it but skipped the actual fix (no volatile, no holder idiom).
❌ No test safety provisions:
GPT-5 introduced clearAllForTest() and timeout guards; Sonnet-4 and Opus-4 missed these entirely or only noted them passively.
❌ Lacked architectural context:
Neither model cross-referenced the broader codebase or justified changes with evidence. GPT-5 backed each fix with reasoning that traced across services, tests, and threading behavior.
❌ Limited scope:
Sonnet-4 made a single, surface-level fix. Opus-4 added some useful logging but missed the deeper structural risks GPT-5 fully addressed.

Why this matters

The real value of GPT-5’s review lies in its depth and awareness. It not only patched the visible race, but also:

Identified deeper architectural risks
Cross-referenced test reliability and code quality
Delivered a set of changes that are safe to merge immediately

This isn’t just a fix—it’s engineering insight. GPT-5 showed how an AI reviewer can reason across system layers, suggest durable solutions, and help teams write safer code with less guesswork.

What’s new (and exciting) about GPT-5

Beyond metrics and the specific things our tests were evaluating, we found that GPT-5 exhibited new behavioral and reasoning patterns.

Advanced contextual reasoning: GPT-5 proactively planned multiple review steps ahead, showcasing expansive creative reasoning rather than strict input-bound logic. For example, GPT-5 demonstrated deep reasoning by connecting evidence across filles in our Concurrency oriented test focused on a ‘Check-then-act race condition’ scenario. It was the only model to detect risk of duplicate creation and introduced an atomic refund pattern grounded in the enum and test suite.
Chain-of-thought reasoning via review threads: In an Object-Oriented test focused on a Virtual call in constructor case, GPT-5 showed layered logic by first identifying a misused polymorphic override and then adjusting its recommendations based on its own earlier suggestions. This shows layered logic by identifying one thing and then showing additional reasoning on the issue later.
Evidence-based diff justification: In a Performance-focused test focused on Unbounded cache growth (no eviction) issue, GPT-5 identified architectural memory risks that other models missed, and backed its recommendation with diff context, usage patterns, and suggested safeguards.
Forward-thinking suggestions: In a Concurrency-related test focused on Incorrect sync primitive usage, GPT-5 not only patched the race but also suggested how to structure future additions, lock hierarchies, and test guardrails to prevent regressions.
Granular, task-oriented recommendations: Unlike previous models, GPT-5 detailed explicit follow-up tasks, creating actionable workflows within the review process itself. This makes the model much better for multi-step workflows.

How we’re using GPT-5 in our AI code reviews

We’re excited that GPT-5 represents a significant advancement in AI-powered code review, pushing the boundaries in detail, accuracy, and contextual reasoning. That’s why we’ll be using GPT-5 as the core reasoning model in our pipeline – starting today. We’re excited that it will be able to find more issues and create more in-depth, context-rich reviews.

If you’ve never tried CodeRabbit, tried it previously, or are a current user, we’d love to hear how you think GPT-5 is improving your review quality and experience.

Try our free 14-day trial today to see the power of GPT-5 yourself.

]]>

Wed, 06 Aug 2025 04:56:16 GMT

10 Advanced Github Copilot tips & tricks | Copilot best practicesの意訳です。

GitHub Copilotは、現代の開発者にとって欠かせないツールのひとつになりつつあります。OpenAIのモデルによって駆動され、エディタ上でリアルタイムにコード提案を行ってくれるこのツールは、使いこなせば生産性を大きく向上させる可能性があります。Microsoftの調査によれば、開発速度が最大で55%向上するとも言われています。

ただし、Copilotは魔法の杖ではありません。放っておけば、頼りになる相棒というよりは、やる気のあるジュニアエンジニアが推測でコードを書くような挙動になることも。Copilotをうまく使いこなすかどうかは、ワークフローの設計と使い方次第です。

本記事では、CopilotがAI開発ツール群の中でどのように機能するのかを概観し、私たちCodeRabbitや多数の開発者コミュニティで得られた知見から、Copilotの効果的な使い方を10個のヒントとして紹介します。

ヒント1：得意な領域で使う

Copilotには得意なタスクがあります。そこに集中させることで、時間を大幅に短縮できます。

特に得意なのは以下のような作業です：

繰り返しが多いコード
単体テストの生成
構文エラーの修正
コードの説明
正規表現の生成

たとえば、ある関数に対して複数のユニットテストを書く必要がある場合、Copilotはコメントを元にそれらを一気に生成してくれます。

逆に、CopilotはUI設計やデータベーススキーマの構築など、コード以外の作業には不向きです。Copilotを「手の早いジュニア」として位置づけ、自分が設計と判断を担いましょう。

ヒント2：必要なコンテキストをきちんと与える

Copilotはエディタに表示されている情報をもとに提案を行います。そのため、関連ファイル（たとえば utils.py と models.py）を同時に開いておくと、より適切な補完が得られます。

また、使用予定のライブラリやフレームワークをあらかじめ import しておくことで、Copilotが適切な構文でコードを生成しやすくなります。たとえば import pandas as pd と書いておけば、DataFrame処理にはpandasを使うという前提で提案されます。

不要なファイルやコードが開いているとノイズになるため、作業に関係のないものは閉じておきましょう。

ヒント3：コメントやdocstringで意図を明確に伝える

Copilotにとって、コメントはプロンプトと同じ役割を持ちます。英語でも日本語でも良いので、やりたい処理を具体的に書いてみましょう。

例：

# 名前のリストを大文字小文字を区別せずにソートする

このように書くと、Copilotは sorted(names, key=str.lower) のような実装を自動で提案してくれます。

特に、入力と出力の具体例をコメントに含めると、より意図が伝わりやすくなります。あいまいなコメントではあいまいなコードが返ってくるので注意しましょう。

ヒント4：意味のある名前を使う

変数名や関数名は、Copilotにとっても重要なコンテキストのひとつです。たとえば check() や data1 のような名前では意図が伝わりづらく、提案されるコードも曖昧になります。

代わりに is_user_promotable() や calculate_invoice_total() のように、機能や目的が明確にわかる名前を使いましょう。

これは人間にとっての可読性だけでなく、Copilotの提案精度にも影響を与える重要な要素です。

ヒント5：CopilotとCodeRabbitを組み合わせて使う

Copilotはコードを書く段階で活躍しますが、書いたコードをレビューする段階ではCodeRabbitの出番です。CodeRabbitは、VS CodeやGitHubのPR上でAIによるコードレビューコメントを自動で追加してくれます。

Copilotでコードを素早く書き、CodeRabbitでレビューすることで、開発スピードと品質を両立できます。Copilotが気づかないエッジケースやチームのコーディング規約違反なども、CodeRabbitが検知してくれます。

CodeRabbitはこちらから14日間の無料トライアルが利用できます。

ヒント6：プロンプトは具体的に、例も添えて

Copilotに期待通りの動作をしてもらうには、明確かつ具体的なコメントが必要です。

たとえばログの1行をパースしたい場合、単に # parse log line と書くよりも、以下のように書く方が望ましい結果になります。

# "2025-06-02 09:00:00 - ERROR - failed to connect"
# のようなログをパースして、datetime, level, message に分割

このように例を添えることで、Copilotがフォーマットや処理の意図を正確に理解しやすくなります。

ヒント7：複雑な処理は小さく分割する

Copilotはシンプルで明確なタスクに強みを発揮します。逆に、複雑すぎる処理を一度に任せると混乱したコードが返ってくることもあります。

そのため、アルゴリズムの実装やCLIツールの構築などでは、まずステップをコメントで書き出し、1つずつCopilotに埋めてもらう方法が効果的です。

こうすることで、コードの確認もしやすく、品質の担保にもつながります。

ヒント8：Copilot Chatとインライン補完を使い分ける

Copilotにはインライン補完とCopilot Chatという2つの使い方があります。

インライン補完が向いている場面：

単純なアルゴリズムやループの補完
コメントからそのままコード生成
繰り返しパターンの埋め込み

Copilot Chatが有効な場面：

コードの意味やエラーの解説
長めのコード生成と改良
セキュリティ意識など、ペルソナ指定のやり取り

両者をうまく使い分けることで、より柔軟な開発が可能になります。

ヒント9：複数の提案を確認し、プロンプトを改善する

Copilotは1つの提案だけでなく、複数の候補を持っています。最初の提案が理想的でない場合は、ショートカットキーやCopilotパネルを使って他の候補を確認しましょう。

それでも良い提案が出ない場合は、コメントやコードを見直して再度試してみると改善されることがあります。

ヒント10：Copilotの出力は必ずレビュー・テストする

Copilotが生成するコードは一見正しそうでも、バグやセキュリティの問題を含んでいる可能性があります。コードを本番環境に取り込む前に、必ずレビューとテストを行ってください。

空入力や例外処理への対応
SQLインジェクションの可能性
非推奨APIの使用

などを確認することが重要です。

静的解析ツールやLinter、セキュリティスキャナー（SnykやCodeQLなど）も積極的に活用しましょう。

まとめ：GitHub Copilotを賢く使いこなそう

Copilotは非常に強力なツールですが、使い方によってその効果は大きく変わります。この記事で紹介した10のベストプラクティスを実践することで、Copilotをただの補完ツールから、本格的な開発支援ツールへと進化させることができます。

定型処理はCopilotに任せて、自分は設計や問題解決に集中する——そんな開発スタイルを実現するために、ぜひ紹介したヒントを取り入れてみてください。

そして、CodeRabbitとの組み合わせもぜひお試しを！

]]>

Wed, 06 Aug 2025 04:51:22 GMT

Context Engineering: Level up your AI Code Reviewsの意訳です。

コードレビューにおいて「コンテキスト」はすべて

CodeRabbitでは、業界でも屈指の“コンテキストを重視した”コードレビューを実現しています。多くのコードレビューツールが「コードベースの認識」レベルにとどまるなか、CodeRabbitはさらに深く掘り下げます。コードベースから数十もの情報を収集し、正確で実用的なレビューを提供しています。

そのために、レビュー対象のコード1行に対して、その背景情報を同じ比重でLLMに入力しています。具体的には、ユーザーの意図、ファイル間の依存関係、Jiraチケット、コードグラフ、過去のPR、チャットでのやり取り、Linterなどから得た成果などです。

さらに、生成されたAIの提案はすべて事後検証され、誤りを防ぎ、精度を高め、レビューガイドラインに適合しているかチェックされます。

これが私たちの「コンテキスト・エンジニアリング」であり、CodeRabbitのレビューが信頼性・品質・関連性で業界をリードする理由です。

本記事では、CodeRabbitのコンテキスト・エンジニアリングにおける主な要素を紹介します。

PRとIssueのインデックス化

レビューはまず、CodeRabbitがリポジトリをクローンし、サンドボックス上で管理するところから始まります。これにより、すべてのレビューがコードベースを認識した上で行われ、かつセキュアな環境が保たれます。

CodeRabbitはプロジェクト構造やコード間の依存関係を解析するだけでなく、過去のPRからもタイトル、説明、コミット範囲などを収集し、「なぜそのコード変更が行われたのか」を理解しようとします。関連する過去PRはレビューコメントにも反映されます。

また、Jira、Linear、GitHub、GitLabなどのIssueをインデックス化し、変更の「意図」も理解します。PRに紐づけられたIssueを分析し、要件がどの程度満たされているかを自動的に評価します。

コードグラフ解析

新たなレビューが開始されるたびに、CodeRabbitはコード間の依存関係をグラフ構造として再構築します。これにより、関数間の依存性を把握し、下流に影響を与える可能性のある変更を検出します。

コードシンボル（型など）の定義を取得し、それをレビューコメントのコンテキスト強化に活用することで、見落とされがちな依存関係の破綻や例外パターンを捉えることができます。

カスタムレビュー指示の対応

CodeRabbitは、各チーム固有のコーディング規約に基づいたカスタムレビューに対応しています。以下のような方法で柔軟にルールを設定可能です。

パスベースのフィルター: 対象ファイルをglob形式で指定し、レビュー対象を限定できます。
パスベースのレビュー指示: 指定パスに一致するファイルに対してのみ、特定のレビュー指示を適用できます。
コーディングエージェントのガイドライン取り込み: CursorやCopilot、ClineなどのAIエージェントに定義されたガイドラインを取り込み、レビューに活用可能です。

チャットからの学習: レビューコメントに対するフィードバックをチャットで伝えるだけで、次回以降のレビューに反映されます。

Linterと静的解析ツール

CodeRabbitは40以上のLinterやSASTツールをプリセットで搭載しており、ユーザーによる設定は不要です。既存のLinter構成があっても、CodeRabbitはより包括的なチェックを行い、検出された問題はAIによるレビューで補完されます。

Linterによる検出結果が有効と判断された場合は、レビューコメント内でその旨が明示されます。また、独自の設定ファイルがある場合はパスを指定することで、そちらのルールも適用可能です。

対応しているLinter一覧は公式ドキュメントから確認できます。

Web検索による最新情報の取得

レビューに使われるLLMが最新の情報を知らない場合、CodeRabbitはWeb検索を実行し、公開されているリリースノートや技術ドキュメントから情報を補完します。

たとえば、Goのバージョンが1.23.6であるコードに対し、LLMが最新バージョンを知らなかった場合でも、CodeRabbitはWeb検索により1.24.1が最新であることを確認し、それに基づくアドバイスを行います。

検証スクリプトによるコメントチェック

最後に、LLMが生成したレビューコメントに対しても、CodeRabbitは自動で検証スクリプトを実行します。これにより、価値の低いコメントはユーザーに届く前に除外され、いわゆる“AIの幻覚”を防ぎます。

高度なコンテキスト・エンジニアリングによるレビュー品質の向上

このように、CodeRabbitでは多角的なコンテキスト情報をLLMに適切な量だけ提供することで、過剰にならず、かつ精度の高いコードレビューを実現しています。

私たちが実現しているのは以下のポイントです。

変更の意図を理解することで、見逃されがちな不具合を検出
コードと同じ比重の情報をLLMに与えることで、効果的な判断を実現
無価値なコメントを排除し、ノイズを最小限に抑制

CodeRabbitを試してみたい方は、14日間の無料トライアルをご利用ください。リポジトリの接続も数分で完了します。ご不明な点はDiscordコミュニティにてお気軽にご相談ください。

]]>

Mon, 04 Aug 2025 02:21:02 GMT

ROUTE06（ルートシックス）は、人とAIの協創によってプロダクト開発を再定義するスタートアップです。現在は、AI駆動の開発プラットフォームを中核に据え、エンジニアやデザイナー、ビジネスパーソンがAIと共創しながらプロダクトを素早く生み出せる環境づくりを進めています。

クライアントワークにおいても、さまざまな大手企業とともに、AIを積極的に活用した受託開発を推進。さらに、要件定義特化型AIプラットフォーム「Acsim」、AIエージェントビルダー「Giselle」、AI時代のDB設計プラットフォーム「Liam」など、自社プロダクトも次々と展開しています。いずれも、AIをフル活用した新しいものづくりを体現する取り組みです。

今回はROUTE06のCTOである重岡さんに、同社におけるCodeRabbitの活用を伺いました。

少数精鋭チームで挑む、柔軟でスピーディな開発体制

ROUTE06の開発組織は、少数精鋭のスモールチーム体制を基本としています。各プロダクトにエンジニア、デザイナー、プロダクトマネージャーが数名ずつアサインされ、それぞれが自律的に開発を推進しています。たとえば、Giselleのチームは5名以下で構成されており、AIの活用によって少人数でも十分に戦える体制になっています。

重岡さんは「今の開発サイクルでは完璧なコードを出すよりも、まず出してあとでリファクタリングすることの方が重要」と語ります。変化の速いAIの時代において、スピードと柔軟性を兼ね備えた開発組織こそが、ROUTE06の競争力の源泉となっています。

急増するプルリクエスト、問われるレビュー体制の進化

ROUTE06では、AIを活用したプロダクト開発が日常化するなかで、PRの量が急増していました。特に、エンジニアだけでなくデザイナーなどの非エンジニアもコードを書く「バイブコーディング」スタイルが浸透しはじめたことで、レビューのボトルネックが顕在化。小規模チームで高速に開発を進める一方で、コードレビューやQAにかかる負荷が大きくなりつつありました。

レビュー可能なエンジニアは社内に複数いるものの、彼らも実装を並行して担当しているため、レビュー待ちの状態が発生しやすく、開発スピードの低下が懸念されていました。加えて、AI生成コードの質が高まるにつれ、一見問題なさそうなコードにも潜むバグや設計のズレを見抜く負担が増しており、レビューそのもののあり方も見直す必要が出てきていたのです。

「レビューの質とスピードをどう両立させるか。それが、開発スピードを維持する上で最も大きな課題でした」と重岡さんは振り返ります。

きっかけは社内から。CodeRabbitのファーストインプレッション

CodeRabbitの存在を知ったのは、社内のエンジニアからの紹介がきっかけでした。情報感度の高いメンバーが「これは使えるかもしれない」と社内に共有したことから、まずはオープンソースプロジェクトで試験的に導入を開始しました。

社内での使い心地やフィードバックを経て、徐々にプライベートリポジトリを含む社内全体での本格導入へとつながっていきました。形式的な指摘に終始せず、レビューの質が実際に上がることで「これなら任せても大丈夫だ」と感じることができたと振り返ります。

「CodeRabbitは入れてすぐに良いレビューをしてくれました。いろいろな設定もいらず、最初からちゃんと動いてくれる即戦力ぶりが良かったです」（重岡さん）

レビューの質と深さを両立させるCodeRabbitの活用

現在、ROUTE06ではCodeRabbitを開発フローに深く組み込み、社内のさまざまなプロジェクトで日常的に活用しています。レビューの初動はまずCodeRabbitが担い、基本的なベストプラクティスや一般的なコード品質に関する指摘は自動でカバーします。人はプロダクトのドメイン知識や設計意図など、より高次な観点に集中できるようになりました。

また、レビューコメントを他のAIコーディングエージェント（Claude Code、Cursor、Devinなど）にそのまま渡してリファクタリングを依頼する運用も確立され、AI同士の連携によって開発スピードがさらに向上しています。プロダクトのライフサイクルが短くなっている現代において、CodeRabbitはレビューのスピードと質を維持するための重要な存在になっています。

「今はもう、人でなくて良い部分は全部CodeRabbitに任せてます。その分、自分たちは本当に見るべきところに集中できるので、レビューの濃度が上がった実感があります」（重岡さん）

単なるツールから開発パートナーへ

ROUTE06では、今後もAIを活用した開発のスピードと柔軟性をさらに高めていく方針です。その中で、CodeRabbitには単なるコードレビュー支援を超えた開発パートナーとしての進化が期待されています。たとえば、チームやプロダクトごとの文脈をより深く理解し、コメントの精度や提案の質を高めていくことで、開発者の判断や意思決定をより強力にサポートしてくれる存在になってほしいと考えています。

「レビューは単にコードを見るだけではなく、そのチームのスタイルや優先順位が反映されるものです。そこまでCodeRabbitがわかってくれるようにあれば、さらに頼れるメンバーになってくれると思います」（重岡さん）

CodeRabbitは、今後もROUTE06の開発文化を支えるパートナーとして、進化し続けます。

]]>

Mon, 04 Aug 2025 02:12:29 GMT

株式会社ジャンボは、自分自身が熱狂し、ユーザーにも熱狂を届けるという「Passion to Life」をビジョンに、世界で通用するプロダクトの創出を目指しています。2025年12月までに、日本で最も愛されるサービスとなることを掲げ、日々プロダクト開発に取り組んでいます。

主力アプリは、リアルタイムで通話やライブ配信ができるコミュニケーションアプリです。累計会員数は1,100万人を突破し、モバイルアプリやWebアプリを含む15以上のサービスを展開しています。ジャンルを超えたつながりを生み出す仕組みにより、国内外問わず幅広いユーザーに利用されています。

今回は、同社CTOの花野さんと、テックリードの石田さんにお話を伺いました。

内製にこだわる自社開発体制

ジャンボは社員50名の内33名が開発者で、モバイルとWeb、バックエンド、インフラに至るまで全ての開発を自社で完結しています。全員が出社して同じ空間で働くスタイルを貫いており、実際に顔を合わせて熱量を共有しながら開発を進めています。

組織体制としては、Web事業部と国内アプリ事業部、海外アプリ事業部の3つに分かれ、それぞれiOSチームやAndroidチームが配置されています。インフラやバックエンドについては、事業部を横断して支える専門チームが担当しています。

増え続けるレビューの負荷と属人化の悩み

日々大量に生まれるプルリクエストに対し、少人数でのレビュー体制には限界がありました。特に大規模な実装になると、確認にかかる時間が増え、精神的な負荷も大きくなりがちです。レビュー担当がつい後回しにしてしまい、チーム全体の生産性に影響を及ぼすことも少なくありません。

また、コードレビューの質や視点がレビュアーに依存していたため、属人化の課題も抱えていました。誰が見るかによって指摘の内容がばらつき、レビュー品質のばらつきが生じる点にも悩んでいたといいます。

「大きい実装を少人数で見る場面も多く、レビューの負担が重くなりやすかったです。AIが最低限の品質を担保してくれるようになってからは、精神的にもかなり楽になりました」（花野さん）

CodeRabbitとの出会いは偶然のX投稿から

CodeRabbitを知ったのは2年ほど前、X（旧Twitter）上で偶然見かけた投稿がきっかけだったそうです。元々新しい技術に敏感だったこともあり、興味を持って試してみることにしました。

実際に使ってみると、自動生成されるシーケンス図に大きな驚きを覚えたといいます。視覚的な理解を助ける機能として、特に印象に残ったそうです。

「最初に見た時のシーケンス図が衝撃的でした。AIってここまでやれるのかと驚きました」（花野さん）

決め手は先発性とカスタマイズ性

CodeRabbitは当時のAIレビュー領域では比較的早い段階で登場したツールだったこともあり、他のサービスと比べて先駆者としての信頼があったと振り返ります。また、YAMLファイルを通じてリポジトリ単位のカスタマイズができる点も良かったとのこと。

他のAIレビューサービスも並行して利用しているものの、CodeRabbitの柔軟性やカスタマイズによるレビュー精度の向上は一歩抜きん出ているとのことです。

「最近ではYAML設定でレビュー品質をかなり調整していて、抽象的な指摘から本質的なフィードバックへと進化していると感じています」（石田さん）

導入による効果と現場での使い方

現在は、すべてのプルリクエストに対して、まずAIレビューを通す運用をルール化しています。これにより、ケアレスミスや簡単な文法ミスなどは事前にフィルタリングされ、人のレビュー工数の削減を実現しています。

なお、オンボーディングの一環として、新しいメンバーには敢えて最初はAIなしで自力でレビューを経験してもらい、その後AIツールに触れてもらうという段階的な導入方法を採用しています。チームごとにリポジトリ設定も分かれており、それぞれのスタイルで最適な利用スタイルをとっているとのこと。

「まずAIに通すという流れができたことで、明らかなミスに時間を取られなくなりました。レビューにかかる心理的ハードルも下がったと思います」（花野さん）

CodeRabbitに今後期待したいところ

現場ではすでにCodeRabbitの活用によって大きな効果を実感しているものの、さらなる進化への期待も挙がっていました。たとえば、設定権限の柔軟化です。リポジトリごとの設定を編集できる権限が柔軟になり、個々のチームが自由にレビューの条件を調整できると、さらなる活用が進むと考えられています。

そのほか、OpenAI以外のモデル、たとえばClaudeなどを明示的に指定できるような仕組みがあると、使い分けの幅が広がるのではという意見もいただきました。

CodeRabbitは今後もジャンボの開発体制をサポートし、アプリのさらなる発展に寄与していきます！

株式会社ジャンボでは、iOSやAndroidエンジニアなど、幅広い職種でエンジニアを採用しています。世界に通用するアプリ開発に興味がある方は、ぜひ採用ページをご覧ください。

]]>

Fri, 01 Aug 2025 09:11:56 GMT

株式会社ロッカが運営するフィヨルドブートキャンプは、実践的なカリキュラムを通じて現場で通用するエンジニアを育成するプログラミングスクールです。GitHub上で開発を進めるチーム開発や、ポートフォリオ制作を通じて、技術力だけでなく開発プロセス全体を学べる環境を整えています。

現在、fjordllc/bootcampのリポジトリにてCodeRabbitが導入されています。その導入に至った経緯、活用について運営責任者の駒形真幸さんにお話を伺いました。

生徒数が増える中でレビューの遅れが課題になった

フィヨルドブートキャンプでは、生徒同士のレビューからメンターによるチェック、最後に代表である駒形さんによるレビューという多段階のプロセスでコードを確認しています。実務に近い開発体験を得られる一方で、レビュー待ちが発生すると学習の流れが滞るという課題がありました。

特に受講生が増えた際にレビューが集中し、対応が追いつかなくなることで、リリースまでに時間がかかってしまう状況が生まれていました。生徒からの不満にもつながるため、改善の必要性を強く感じていたと話します。

駒形さんは「レビューが終わるまでに時間がかかってしまって、生徒の不満につながることがありました」と振り返ります。

複数ツールを比較し、CodeRabbitを選定

レビュー工程を補完するツールとして、当初は複数のAIレビューサービスを検証していました。他のレビューツールも試したものの、レビュー内容の精度や納得感にばらつきがあったといいます。

そうした中で、CodeRabbitはレビューの的確さや不要な指摘の少なさが際立っていたとのこと。実際のPull Requestで何本か比較した結果、フィヨルドブートキャンプの運営体制やレビューのスタイルにも合っていたことが導入の決め手となりました。

「CodeRabbitの指摘が圧倒的に良かったですね」と駒形さんは振り返ります。

AIの指摘にどう対応するか、あらかじめルールを設けた

AIツールを教育の現場に導入する際、懸念されるのは「AIの指摘に納得できない場合の対応」でした。これに対し、フィヨルドブートキャンプでは、AIの指摘に対してはコメントで理由を説明すればよいという明確なルールを整備しています。

この運用により、生徒がAIと対話する感覚でレビューを進められるようになり、混乱なくスムーズに運用が進んでいるとのことです。

駒形さんは「思っていたより問題が起きなくて驚きました」と導入当時の印象を語ります。

CodeRabbitは運営側が導入しただけでなく、生徒にも自然に浸透しています。GitHub上のポートフォリオとなる自作サービスの開発に、自主的にCodeRabbitを組み込む生徒も現れています。

SNS上でも生徒の好意的な反応が確認できており、現場での受け入れの早さとスムーズな導入が伺えます。こうした流れは、学習者自身がAIツールを使いこなすきっかけにもなっているようです。

今後はフロントエンドコースにも導入を拡大予定

現在CodeRabbitは、Ruby on Railsを使ったバックエンドのリポジトリに導入されていますが、今後はJavaScriptを中心とするフロントエンド側への展開を検討しています。導入から一定期間が経過していますが、CodeRabbitのレビュー機能に対して特に不満はなく、当初の目的であったレビューの迅速化が実現できているといいます。

フィヨルドブートキャンプの取り組みは、教育機関におけるAIレビュー導入の好例といえます。レビュー待ちの負担を減らしつつ、生徒の自走を促すというバランスの取れた運用は、多くの開発現場にも応用可能でしょう。

「レビュー待ちの時間がなくなることで、生徒の学習リズムが崩れなくなったのが大きいです」と、駒形さんは今の成果を語ってくれました。

CodeRabbitは今後もAIコードレビューを通じてフィヨルドブートキャンプ、ならびに生徒の方々の学習をサポートしてまいります。

]]>

Wed, 23 Jul 2025 23:10:47 GMT

Copilot has quickly become a staple in the modern developer’s toolkit. Powered by OpenAI’s models, it offers AI-driven code suggestions based on what you’re writing — right in your editor. Used well, it can significantly boost productivity. Microsoft’s data suggests it may help developers code up to 55% faster.

But here’s the catch: Copilot isn’t a magic wand. Left on autopilot, it can feel more like an eager junior dev making confident guesses than a reliable coding partner. The difference between a helpful Copilot and a frustrating one often comes down to how you use it — and whether you’ve built a workflow that plays to its strengths.

In this article, we’ll walk through how Copilot fits into the broader AI dev tool stack and share practical GitHub Copilot tips and tricks for using it more effectively. These strategies are drawn from both our own experience and the thousands of developers using CodeRabbit’s AI code review platform. With the right approach, Copilot can go from a neat autocomplete toy to a genuinely valuable part of your daily development routine.

GitHub Copilot tip 1: Play to Copilot’s strengths

Not every coding task is equal in Copilot’s eyes. One of the most important GitHub Copilot best practices is to use it where it shines, not force it to create code where it doesn’t. Copilot excels at specific categories of tasks that can save you significant time.

Copilot is especially good at…

Writing repetitive code
Generating unit tests
Debugging syntax issues
Explaining code
Generating regex patterns

These are areas where it has seen lots of examples and can confidently suggest solutions.

For example, if you have a function and need to write several tedious unit tests for it, Copilot can draft them in seconds. Consider this simple function and tests:

def multiply(a, b):
"""Return the product of a and b."""
return a * b

Copilot can help create unit tests for the above function quickly:

In a scenario like the above, Copilot generated the TestMultiply class almost entirely from a comment or prompt. It’s excellent for boilerplate code, repetitive patterns, and well-defined algorithms.

On the flip side, Copilot is not a silver bullet for everything.It’s not designed to handle tasks unrelated to coding (don’t expect it to plan your database schema or design UI work) and won’t replace your problem-solving skills.

Think of Copilot as a junior developer at your side. It’s fast and often right about everyday tasks, but you (the senior developer) are still in charge of decision-making and critical thinking. Use Copilot for the “heavy lifting” on mundane code and let it suggest solutions for routine problems, but always apply your judgment on whether to use those suggestions. That way, you’ll save time and reduce drudgery while keeping yourself focused on the challenging problems and design decisions.

GitHub Copilot tip 2: Provide ample context (open files, set imports, etc.)

A hot tip for GitHub Copilot is to open all the relevant files in your project when you’re coding a particular feature. That’s because Copilot works by looking at the context in your editor to predict what you might want next. The more relevant context you give it, the better the suggestions.

For instance, if you’re implementing a function in utils.py that interacts with models.py, have both files open.

Copilot will process all open tabs (often called “neighboring tabs”) to inform its suggestions. This broader view helps it understand your project structure and produce more accurate code. In fact, simply opening related files in VS Code or your IDE can significantly enhance Copilot’s completions by providing extra context for definitions and usages across your project.

Similarly, explicitly set up your imports, includes, and dependencies before expecting the best suggestions. You know what libraries or frameworks you intend to use – tell Copilot by importing them at the top of your file. This gives Copilot a heads-up on what tools it should use.

It’s often best to manually add the modules or packages (with specific versions, if needed) before asking Copilot to generate code using them. By doing so, you avoid Copilot defaulting to an outdated library or missing an import.

For example, if you plan to use pandas in your code, write import pandas as pd yourself; then when you ask Copilot to manipulate a DataFrame, it will already know to use pandas and won’t attempt a pure-Python solution or an incorrect import.

Also, be mindful of irrelevant context. Copilot’s window of attention is limited. If you have a lot of unrelated files open or leftover code in your editor, close or remove them when you switch tasks. Keeping only the pertinent files and context visible ensures Copilot isn’t “distracted” by code that doesn’t matter to your current goal.

GitHub Copilot tip 3: Write descriptive comments and docstrings as prompts

You understand prompt engineering when you’re directly calling an LLM. But did you know that there are some sneaky ways to prompt engineer in Copilot? One of the most effective GitHub Copilot tips is to guide the AI with natural language comments.

Think of writing comments as a form of prompt engineering. Before you write the code, describe in plain English (or your preferred language) what you intend the code to do.

For example, we want a function to sort a list of names case-insensitively. We might start with:

The moment you write that comment and pause, Copilot will likely suggest the rest of the function (e.g., using sorted(names, key=str.lower)). A top-level comment at the start of a file, or a docstring/comment above a function, helps Copilot understand the overarching objective before diving into implementation details.

This process is similar to giving a human colleague a quick overview of the task at hand, it sets the stage so the following code makes sense in context.

When writing these comments, be clear and specific about the desired behavior. Mention any requirements or constraints. For a more complex example, suppose you need a function to format a person’s name as "LASTNAME, Firstname".

You could provide an example in the comment to clarify your intent:

By including the example of input and output, you give Copilot a crystal-clear idea of what you want.

Which is exactly the desired solution. The first comment was a prompt describing the goal and even provided a test case, and Copilot filled in the implementation.

Use this technique liberally. Add a brief docstring or comment for each function describing what it should do (and how at a high level, if you have an approach in mind). Copilot can detect the comment syntax for your language and will often even help complete the comment if it recognizes a pattern (for example, it might suggest a template for a Python docstring).

By writing specific, well-scoped comments before the code, you essentially “program” Copilot with your intent.

Remember the old saying: garbage in, garbage out. If you feed Copilot an ambiguous comment like “# do something with data”, you’ll get ambiguous code. Instead, describe the task clearly – “# Calculate the average value from a list of numbers, ignoring any nulls” and watch Copilot more reliably produce the correct logic.

GitHub Copilot tip 4: Use meaningful names for clarity

You might hate being stickler about style but variable and function names are another form of context that Copilot relies on.

A tip that might seem obvious but is often overlooked is togive your functions and variables meaningful, descriptive names. If you have a function named foo() or variable data1, Copilot has virtually no clue what you intend, beyond what it can infer from a possibly sparse usage context.

In contrast, names like calculate_invoice_total() or user_email_list immediately convey intent to humans and the AI. In fact, Copilot’s suggestions will improve dramatically when your code is self-documenting.

A function called fetchData() doesn’t mean much to Copilot (or to a coworker) compared to a function named fetch_airport_list() or get_user_profile. The latter gives far more hints of what the function should do.

For example, consider these two scenarios:

Vague naming:

# Determine if a user is eligible for promotion
def check(user, data):
# ...

With a function name like check and a parameter data, Copilot might struggle. “Check” could mean anything. Could it check a password or a value in data? Its suggestions might be generic or incorrect because it’s guessing your intent.

Descriptive naming:

The function name is_user_promotable clearly signals a boolean decision, and the parameters user_profile and promotion_rules indicate the data involved. Copilot can use this information to guess that you might iterate over rules, check user attributes, etc., and its completion will align with that logic.

Adopting clear naming conventions isn’t just a general coding best practice for developers. It’s a GitHub Copilot best practice, too, because Copilot can only infer intent from what it sees. If what it sees are meaningful identifiers and not cryptic ones, it will return far more relevant code. This tip also pays dividends for code maintainability – since you’ll get better AI suggestions and cleaner code for your team.

GitHub Copilot tip 5: Pair Copilot with CodeRabbit for AI-assisted code reviews

While Copilot is fantastic during the coding phase, what about after you’ve written your code? Enter CodeRabbit, an AI-powered code review developer tool that complements Copilot in the development workflow.

CodeRabbit acts like an AI “pair reviewer,” scanning your code (either in the IDE or on your Git platform) and providing feedback and suggestions for improvement.

We’ve found that using Copilot and CodeRabbit together creates a powerful feedback loop: Copilot helps you generate code quickly and CodeRabbit helps ensure that code meets quality standards before it gets merged.

You don’t get the developer who wrote the code to review it, so why get the same AI system to do so?
An AI code reviewer also allows you to standardize your quality gate if your team is using multiple AI coding agents – as so many teams are these days.
Finally, purpose-built AI coding agents like CodeRabbit do a more thorough job and have more features that means the average user is able to find 50% more bugs in half the time they’d typically spend on a code review.

CodeRabbit integrates into VS Code and pull requests on platforms like GitHub. In your IDE, you can invoke CodeRabbit to review the file or the diff you’re working on. It will directly add AI-powered inline review comments in the code, pointing out potential issues, much like a human reviewer would.

For example, CodeRabbit might flag that your function lacks error handling for a specific edge case or suggest a more appropriate HTTP status code.

On GitHub or GitLab, CodeRabbit can automatically comment on PRs with its findings, saving human reviewers time by catching obvious problems first. It also provides line-by-line code reviews, highlighting possible bugs, code smells, style issues, or even missing unit tests.

How best to use Copilot and CodeRabbit together

Think of Copilot and CodeRabbit as two halves of a complete AI-assisted development cycle.

You use Copilot while writing code to speed up implementation. Then you use CodeRabbit to review that code and catch anything Copilot (or you) might have missed.

Copilot might generate a solution that works but isn’t optimal and CodeRabbit could point out a performance issue or a more idiomatic approach.
Copilot might not know your project’s specific coding standards, but CodeRabbit can enforce them during review. Perhaps your team prefers format() over f-strings, etc.. CodeRabbit can comment on that.
Copilot might help you quickly whip up a new API endpoint, CodeRabbit could then run and immediately warn, “Hey, you didn’t handle the case where this input is null,” or “This SQL query might not be parameterized.” You can address those before your human colleagues even look at the code.

Essentially, Copilot gets you to a working draft faster, and CodeRabbit gives you confidence to ship it by auditing the code. It’s like having an AI pair programmer and an AI code auditor working together.

In the context of a complete AI dev tool stack, Copilot and CodeRabbit cover a lot: Copilot for coding, CodeRabbit for review, and you might even use other AI tools for testing or security.

To get started, you can install CodeRabbit’s IDE extension or add it to your GitHub repository as a GitHub App from the marketplace. We highly recommend this for teams and there’s even a 14-day trial.

GitHub Copilot tip 6: Be specific and provide examples in prompts

When it comes to guiding an AI model, specificity is king. If you’re asking Copilot to write code to transform data, consider providing a short example of the data format in a comment or docstring. If you want a function to calculate something, state the formula or an example scenario in natural language.

Copilot’s underlying model is essentially trying to predict what a knowledgeable developer would write next. If your prompt (context + comments) is vague, the model must guess and may go wrong. But if you spell out details and even including sample inputs and outputs if possible, that helps

For instance, suppose you need to parse a log line like "2025-06-02 09:00:00 - ERROR - failed to connect". Instead of just writing # parse log line, you could write:

This specific prompt gives Copilot a clear blueprint: it knows the log format and the desired output types. With the example shown, the chances of Copilot writing a correct parse_log_entry implementation (splitting by " - ", parsing the date with Datetime, strptime, etc.) are much higher. Without the example, Copilot might misidentify the format or split incorrectly.

When prompting Copilot for non-trivial code, spell out the details. If a function has constraints (e.g. “input can be null” or “assume list is sorted”), mention them. If there’s a particular approach you want (e.g. “use binary search” or “use recursion”), hint at it in your comment. And if possible, provide a quick example. The model will take these as strong cues and align its suggestions accordingly.

GitHub Copilot tip 7: Break complex tasks into smaller steps

Copilot works best when it’s dealing with a focused, well-defined task. If you ask it to do too much at once, you might get a muddled or incomplete answer. A great strategy is to break down big problems into bite-sized pieces and tackle them one by one with Copilot’s help.

For example, imagine you need to implement a complex algorithm. Instead of prompting Copilot to write the whole thing in one go (which might result in a long, confusing blob of code), start by outlining the high-level steps as comments or pseudocode.

You might write a few lines of comments or stub functions and then let Copilot fill in each part. Generate code incrementally, rather than all at once. This approach makes Copilot’s job easier (each step has more specific context). That makes it easier for you to review and trust the code at each step.

Let’s say you’re building a small command-line program. First, prompt Copilot to parse command-line arguments, then separately prompt it to implement the business logic, then prompt it to handle output. You can check Copilot’s work at each stage by breaking the flow.

Think of Copilot as participating in a step-by-step refinement of your code. So, don’t try to have Copilot write an entire module in one shot. Instead, have it write one function at a time, or even one logical block at a time, especially if the logic is intricate.

By decomposing tasks, you also naturally create opportunities to review each piece. This incremental approach leads to better quality code and fewer surprises. It’s much easier to debug a smaller Copilot suggestion than the 50-line monolith it spit out because you asked a very broad question.

GitHub Copilot tip 8: Leverage Copilot Chat vs inline completions wisely

GitHub Copilot has two primary flavors – traditional inline code completion and the newer Copilot Chat (an interactive chat interface available in VS Code, Visual Studio, and other environments).

Knowing when to use Copilot Chat vs when to rely on inline suggestions can make a big difference in your workflow. It’s one of those subtle hot tips for GitHub Copilot that can transform how you approach a problem.

Inline code completions (the original Copilot experience) are best for:

In-the-flow coding assistance: When you’re writing code and want Copilot to suggest the next line or block as you type. This works great for completing a small algorithm, filling in a loop, or writing boilerplate in place.
Filling in repetitive code or simple patterns: For example, generating a quick data class, an API call, or the next cases in a series of if/elif conditions. The inline suggestions excel at continuing your current context.
Generating code from a commented intent: as we’ve seen, if you write a comment # do X, the inline completion often does X immediately in code form.

On the other hand, Copilot Chat is more powerful when you need more interaction or have questions about your code:

Explaining or analyzing code: You can ask Copilot Chat “What does this function do?” or “Why am I getting a KeyError here?” and get a natural language answer. The chat can act like a super-smart rubber ducky for debugging.
Larger code generation tasks with iteration: If you want to generate a sizable chunk of code (say a whole function or class) and then refine it, Copilot Chat is ideal. You might ask it to write the code, then say, “now optimize this” or “can you refactor that part using a dictionary instead of if-else?” This back-and-forth is something inline suggestions can’t do easily.
Using personas or specific commands: Copilot Chat has a concept of keywords and skills (and allows system-level instructions like “Act as a senior developer…”) which you can use to influence its style or thoroughness. For instance, you could instruct it to be security-conscious when writing the code.

To illustrate, if I have a piece of code and I’m not sure it’s efficient, I might use Copilot Chat: “Explain the complexity of this function. How can I improve it?” Copilot Chat might identify the bottleneck and even suggest a more efficient approach.

On the other hand, if I just need the next few lines of a loop, inline completion is faster. I can just hit Tab and keep coding.

Tip: If you have access to both, don’t forget you can use them together. Maybe start writing a test function (inline completion helps you fill out test cases), then switch to Chat to ask Copilot to generate some additional tests or explain a failing test. Each tool has its sweet spot.

GitHub Copilot tip 9: Cycle through suggestions and refine your prompts

By default, Copilot might show you one suggestion – the most likely completion – for your prompt or code context. But what if that suggestion isn’t what you want? Many users forget that Copilot usually has multiple suggestions under the hood. Don’t settle for the first thing it offers if it’s not quite right. A GitHub Copilot tip that’s helped me is to use the keyboard shortcuts (or the Copilot panel) to cycle through alternative suggestions.

There might be a gem in suggestion #2 or #3 that better fits your needs than suggestion #1.

Additionally, you can open the Copilot sidebar (or the full chat interface, if available) to explicitly ask for more options. In some IDE setups, hitting a special shortcut (like Ctrl+Enter in VS Code with Copilot Chat enabled) will even reveal multiple completions at once. Scanning through a few options can save you time you’d otherwise spend editing a less ideal suggestion. It’s like getting a second and third opinion from the AI.

If none of the suggestions look good, that’s a signal to refine your prompt or add more context. Perhaps your comment was too short or ambiguous. Try rephrasing it or adding another detail and then trigger Copilot again. The model’s output can vary greatly with slight changes in how you ask.

For instance, if the #sort list didn’t give the desired result, then #sort the list of names in alphabetical order might produce a better suggestion.

Another trick is giving feedback to Copilot. If you’re using Copilot Chat or the sidebar, you might have thumbs-up/down buttons you can press to give feedback on suggestions. While this won’t instantly change the current suggestion, it does send feedback to improve the model’s future behavior through reinforcement learning. And in chat mode, you can directly say, “No, that’s not what I meant. I actually want X,” and the AI will try again.

GitHub Copilot tip 10: Review, test, and verify Copilot’s output

Copilot can generate code that looks perfect at first glance, but remember, it’s not guaranteed to be 100% correct or optimal.

Always review and test the suggestions before integrating them into your codebase. This tip cannot be stressed enough. Copilot may introduce bugs, security issues, or logically wrong code if the prompt is misunderstood. You, the developer, are the last line of defense to ensure quality.

To review Copilot’s output, first, read the code carefully and make sure you understand it. If Copilot suggests a complex algorithm or some math you’re unsure about, ask Copilot (via Chat) or use your knowledge to break down what it’s doing.

You can ask Copilot Chat to explain the suggested code in plain language as a helpful Copilot trick. Often, I’ll paste a large suggestion into the chat and prompt: “Explain what this code does.” You can immediately catch any assumptions or errors if the explanation reveals any assumptions or errors.

Next, consider edge cases and correctness.

Does the code handle empty inputs?
What about error conditions?
If something looks fishy (like a potential off-by-one error or an unbounded recursion), address it or prompt Copilot to fix it.

Security and style are also important. If your prompt didn't specify otherwise, Copilot might use a deprecated function or an insecure approach. Always double-check things like SQL queries (are they parameterized to prevent injection?), file operations (are files closed properly?), and any cryptography or authentication code it writes (does it follow best practices?).

Linting and static analysis tools are your friends here. Run your linters or code formatters on Copilot’s code to catch style issues, and use any security scanners (like Snyk or CodeQL) if applicable to flag vulnerabilities.

Finally, remember that Copilot might occasionally produce code that is oddly similar to public examples (especially for very common algorithms). It’s rare, but if you’re working on a closed-source project and have strict license requirements, be mindful of this. You can configure Copilot to avoid suggestions that match public code, if needed.

Now, it’s time to use these hot tips for GitHub Copilot!

GitHub Copilot is a game-changer for developers but like any powerful tool, it yields the best results when used skillfully. We’ve covered our top 10 hot tips for GitHub Copilot – from crafting great prompts and leveraging context to integrating Copilot with an AI reviewer like CodeRabbit. By implementing these GitHub Copilot best practices, you’ll find Copilot becomes much more helpful.

It can handle the boilerplate and suggest clever solutions, allowing you to focus on higher-level thinking and problem-solving. Also, don’t forget that Copilot and the surrounding AI ecosystem are evolving rapidly and new features (like the CLI tool, vision-based Copilot, etc.) are coming out regularly. Stay curious and keep experimenting with how you use it.

Perhaps you’ll discover new GitHub Copilot tips and tricks beyond the ten we’ve shared. Or we’ll just write another article.

Interested in using CodeRabbit with Copilot? Start your free 14-day trial.

]]>

Wed, 23 Jul 2025 05:30:23 GMT

やまだしさんが開発する「Repomix」は、GitHubリポジトリやローカルのコードベースを1つのテキストファイルに集約し、Claude、Gemini、ChatGPTなどのAIに渡してコード分析や実装方針の検討ができるCLIツールです（RepomixのGitHubリポジトリはこちら）。

自身の開発でもRepomixを活用しており、DiscordやIssue、Xへの返信の相談や、ドキュメントが少ないライブラリの調査などに役立てているそうです。現在は、ライブラリ化やMCPサーバー、stdin対応などを通じて、より幅広い開発フローへの統合を目指しています。

一人開発にのしかかるレビュー負担

やまだしさんは、Repomixを個人プロジェクトとして設計から実装、リリース、サポートまで一人で担っています。そんな中、AIコーディングツールの普及によりPRの数や速度が増え、メンテナンスの負担が一層大きくなっていったと言います。

「AIツールは開発を加速させる一方で、生成されるコードの品質にはばらつきがあります。それを人間が一つ一つ見ていくのは大変で、開発効率化のメリットが半減してしまうと感じていました」

OSS活動が本業の余力で行われているからこそ、レビューに割く時間の大きさはプロジェクトの継続性にも関わる深刻な課題でした。

CodeRabbitを選んだ理由

CodeRabbitを導入したのは、ちょうどCursorが普及し始め、自身もRepomixでRepomixを改善していた時期でした。知り合いから紹介を受けたことがきっかけで知ったそうです。

実はその頃、他のAIレビュー系スタートアップからも利用オファーを受けており、いくつかのツールを試していたとのこと。その中で最も優れていたのがCodeRabbitでした。

「OSSプロジェクトなら無料で使えるという点も、個人開発者として非常に魅力的でした」

当時はGitHub CopilotによるPRレビューやGemini Code Assistなども存在せず、AIによる自動コードレビューを本格的に提供していたのはCodeRabbitが初めてだったと振り返ります。

他ツールとの比較

現在、やまだしさんは以下の4つのツールを併用しています。

CodeRabbit
Gemini Code Assist
GitHub Copilot
Claude Code Action

それぞれに特長があり、CopilotはGitHubとの統合がスムーズで使いやすく、Gemini Code Assistは他ツールが見逃すような観点で指摘してくれることがあると評価しています。その中でも、総合的にはCodeRabbitが最も優れているとのこと。

「レビューの精度が高いのはもちろん、開発者目線でのUXが本当に素晴らしいです」

特に便利だと感じているのは「Prompt for AI Agents」。ファイルパス、行数、指示が含まれたコメントをコピーして、そのままCursorやClaudeに渡せるのが便利で、VSCode拡張でも同様のことができる点が評価されています。

導入時の課題と運用面の工夫

半年前まではレビューの精度に不安があり、「たまに良い指摘をしてくれるから入れておこう」という位置づけだったそうです。しかし、最近では精度が大きく向上し、「実用的で的確な指摘をしてくれるようになった」と変化を実感しています。

また、設定面ではCodeRabbitのYAML形式によるカスタマイズを活用しているとのこと。

「複数のプロジェクトで使う場合、それぞれの設定を把握するのが難しくなるので、コード管理することを強くおすすめします」

精神的な負担を減らす存在に

開発スタイルとしてClaude CodeやCursorと組み合わせて使うことが多いというやまだしさん。CodeRabbitとの相性も非常に良く、今ではなくてはならない存在になっているそうです。

「以前はPRで凡ミスを見つけるとげんなりしていたのですが、今ではCodeRabbitが細かいバグを見つけてくれるので、その精神的な負担が大幅に減りました」

時間効率の改善は1〜2割かも知れませんが、「やらなくていい作業を任せられる」という点に、数値以上の価値を感じていると語ります。

今後の進化に期待すること

今後のアップデートにおいては、モデルの精度向上よりも、開発者視点の“気の利いた機能”を増やしてほしいと語ります。

「Prompt for AI Agentsのように、実際の開発フローを理解しているからこそできる工夫が魅力です。かゆいところに手が届く機能、待っています（笑）」

単なるレビューツールではなく、開発者の“パートナー”として進化していってほしい──そんな想いが込められていました。

CodeRabbitはRepomixのさらなる開発をサポートすべく、進化していきます。

]]>

Thu, 17 Jul 2025 20:37:15 GMT

The art and science of context engineeringの意訳です。

平凡なAIエージェントと卓越したAIエージェントの違いは、ひとえに「コンテキスト」にあります。

最近「コンテキストエンジニアリング」という言葉が話題になっています。2024年6月末、Shopify CEOのTobi Lutkeがこの話題に触れ、Andrej Karpathyも、優れたコンテキストエンジニアリングがAIアプリの差別化要因だと指摘しました。

CodeRabbitは、コンテキストの重要性を示す好例です。私たちは毎日、ユーザーのプルリクエストやIDE上で数万件のコードレビューを行っています。CodeRabbitの各レビューコメントは、レビュー対象のコードに関連する多様な情報源からコンテキストデータを収集し、さらに検証エージェントがその提案がPRやコードベース全体の文脈に合致しているかを再確認する、非線形なレビュー・パイプラインによって生み出されています。

コンテキストエンジニアリングは、単に一般的なコーディング規約にパターンマッチするだけのAIコードレビューツールと、プロジェクト固有のアーキテクチャやパターン、目標を深く理解し、実際に価値あるレビューを提供できるツールとの差を生みます。

コードレビューにおけるコンテキストエンジニアリングの本質

CodeRabbitが扱うコンテキストは、次の3つに大別できます。

意図：開発者やチームがコード変更で達成しようとしている目的。プルリクエストの目的、解決したい課題、期待する成果などが含まれます。
環境：システムの現状。ファイルの関係性、コード依存関係、プロジェクト構造、既存のパターンなど技術的な背景です。
会話：通常のLLMのマルチターン会話でやりとりされるチャットメッセージやツールの応答など、その他の情報です。

これらの要素が適切にバランスされAIに提示されることで、単なる構文エラーだけでなく、アーキテクチャの不整合やパフォーマンスのボトルネック、設計上の改善点まで指摘できるインテリジェントなコードレビュアーが実現します。

コンテキストエンジニアリングの最適解を探る

AIによるコードレビューのために適切なコンテキストを用意するには、いくつかの課題を乗り越える必要があります。ここでは、特に難しい3つの課題を紹介します。

1. AIエージェントの「ゴルディロックス問題」

コンテキストが少なすぎると、AIは不足した情報を推測し「ハルシネーション（幻覚）」を起こしやすくなります。
不要なコンテキストが多すぎると、重要な情報が埋もれ、AIが本質から外れた部分に注目したり、情報過多で混乱します。
ちょうど良いコンテキストは、AIエージェントが正確なインサイトを得るのに必要な情報だけをノイズなく提供します。

2. トークン単位での処理

人間はドキュメント全体をざっと見て重要な部分を直感的に把握できますが、AIモデルはトークン単位で情報を処理し、すべてのテキストに同じ重みを与えます。PRの全コード変更をそのままプロンプトに入れると、AIが些細な点に注目し、重大な問題を見逃すこともあります。重要な変更を優先し、不要な部分は除外する工夫が必要です。

3. コンテキストウィンドウの制約

最先端のAIモデルでも、一度に処理できるテキスト量（コンテキストウィンドウ）には限界があります。特に大規模なコードベースや複雑なPRでは、戦略的なコンテキスト選択が不可欠です。

CodeRabbitのコンテキストエンジニアリング手法

CodeRabbitでは、これらの課題を解決し、常に高品質なコードレビューを実現するために、多層的なコンテキスト準備手法を開発しています。私たちのシステムは、AIの理解力を最大化するために、情報を収集・フィルタリング・構造化する高度な非線形パイプラインを採用しています。上記の図は、私たちがコンテキスト準備で活用している多様な情報源の一部です。

リポジトリ・PR情報のインテリジェントな収集

まず、プルリクエスト自体に関する最も重要な情報を抽出します。

メタデータ：PRタイトル、説明、影響範囲のコミットなど、変更の「なぜ」を特定するための基本情報を収集します。
差分分析：インクリメンタルレビューでは、前回レビューからの正確な変更点を算出し、AIが新規・修正部分だけに集中できるようにします。
パスフィルタリング：生成ファイルや依存ファイルなど補助的なファイルを除外し、本当に重要なコード変更にAIの注意を向けます。

複数ソースからの知識統合

優れたコードレビューには、現在の変更だけでなく、より広い技術的・ビジネス的な背景理解が不可欠です。

過去の学び：エージェントが過去のレビューから得た知見をベクトルデータベースに蓄積し、関連するフィードバックやユーザーの好みを再利用して、コメントの構成に反映します。
PR意図分析：PRの説明や関連Issueを解析し、変更の根本的な目的を抽出。CodeRabbitのレビューが開発者の目標と一致するようにします。
コードグラフ分析：コード依存関係をグラフ構造で表現し、ファイル間の関係性をAIが理解できるようにします。これにより、アーキテクチャ全体への影響も考慮したレビューが可能です。

戦略的なコンテキスト組み立て

必要な情報を収集した後は、AIエージェントが理解しやすい形でプロンプトを最適化します。

パッキングとソート：ファイルを複数のグループに整理し、基盤となる変更から依存先へと論理的な順序でレビューできるようにします。
適応的な複雑度管理：ファイルごとに異なるAIエージェントを割り当て、複雑さや重要度に応じてレビューの深さを調整し、効率とコストを最適化します。
スマートなコンテキスト削減：コンテキストがサイズ制限を超える場合は、要約や分割によって最も重要な情報を残しつつ、範囲を縮小します。

プロンプトエンジニアリング

次の段階では、AIに最適な指示を与えるプロンプトを作成します。私たちのプロンプトは、コードとコンテキストの比率が1:1になるよう設計しており、コンテキストの重要性を物語っています。

レベルに応じたプロンプト：ファイルの複雑さや重要度に応じて、基本的なチェックからアーキテクチャ分析まで、異なる深さのレビューを行います。複雑度ごとに異なるプロンプトやモデルを使い分けます。
構造化されたレビューガイドライン：明確な指示により、AIエージェントが状況ごとに最も価値あるフィードバックに集中できるようにします。過去の有用なレビューコメントのデータも活用します。
コンテキスト強化：プロンプトには、プロジェクトのコーディング規約やパターン、過去の知見も含め、AIがチーム固有のベストプラクティスに沿った提案を行えるようにします。
コンテキスト選択：前段階のエージェントによるコンテキスト準備の結果をもとに、最終的なノイズカットを行います。

検証エージェント

レビュー工程の最終段階では、AIによる品質保証の役割を担う「検証システム」が自動的にレビューコメントの妥当性をチェック・改善します。AIレビュアーが自信を持てない場合に発動します。

発動条件：メインレビュアーがコメントを生成した際、確信が持てない場合は特別な検証リクエストをコメントに含めて検証を依頼します。
証拠収集：検証システムは探偵のように、
- 実際にシェルコマンドを実行して主張を検証
- ウェブ検索で追加情報を収集
- システムの知識データベースから関連情報を取得
反復的な分析：一度だけでなく、複数回の調査を重ねて分析を深めます。各ラウンドが前回の結果を踏まえて進行し、徹底的な検証が行われます。
意思決定：証拠をもとに、
- 元のコメントが正しいと判断し解決
- 人間による確認が必要（結論保留）
- 問題が確定し修正が必要
- 元のコメントが誤りと判断し撤回のいずれかを決定します。

コンテキストエンジニアリングがレビュー品質に与える影響

私たちのコンテキストエンジニアリングパイプラインの高度化は、レビュー品質の向上に直結しています。

誤検知の削減

適切なコンテキストをAIに与えることで、開発者の時間を浪費するような的外れな提案や誤った指摘を大幅に減らせます。プロジェクト固有の慣習を理解し、意図的なパターンを問題として誤検知しません。

より深いアーキテクチャ洞察

コードの関係性やプロジェクト構造を把握することで、単なるリントやパターンマッチでは見抜けないアーキテクチャ上の問題も指摘できます。実際、多くのユーザーが「CodeRabbitがPRの変更による他の依存箇所への影響まで指摘してくれた」と評価しています。

ベストプラクティスの一貫適用

過去の知見やチーム固有の知識を取り入れることで、すべてのレビューでコーディング規約やベストプラクティスを一貫して適用できます。最近では、お気に入りのコーディングエージェントからコーディングガイドラインをインポートする機能も追加され、チームでの共有がより簡単になりました。

継続的な学習と進化

この手法により、レビューのたびにプロジェクト固有の知見が蓄積され、将来のレビューがさらに価値あるものへと進化します。

良いコンテキストエンジニアリングの重要性

コンテキストは、LLMの技術的要件にとどまらず、効果的なAIエージェントに不可欠な要素です。情報を丁寧に収集・フィルタリング・構造化・提示することで、CodeRabbitは単なるコードレビューにとどまらず、コードの全体像を深く理解し、開発者の生産性向上や堅牢なコード、チームの効率化に貢献します。現在のAIコーディングエージェントは「AIスロップ（粗雑なAIアウトプット）」を多く生み出しがちですが、こうしたアプローチがますます重要になっています。

AIコードレビューにおけるコンテキストエンジニアリングは、まだ始まったばかりです。私たちは今後も品質向上のために手法を磨き続けます。モデルの進化とともに、さらに多くのことが実現できるようになるでしょう。

コンテキストがレビューにどれほど影響するか、ぜひ14日間の無料トライアルで体験してください

]]>

Thu, 17 Jul 2025 20:24:09 GMT

Good code review advice doesn't come from threads with 🚀 in themの意訳です。

かつてコードレビューは静かで品のある儀式のようなものでした。非同期でじっくり（あるいはパッシブアグレッシブに）フィードバックを送り、ときどき「nit: この変数名を変えてみては？」とコメントして自己満足する、そんな時間でした。しかし、ソーシャルメディアが登場してから状況は一変しました。今や開発者たちはホットな意見をSNSで発信し、その多くは「他人と一緒に働いた経験がないのでは？」と思わせる内容ばかりです。

一方には「とりあえずマージして後で直せばいい」と主張するX（Twitter）インフルエンサーがいます。イテレーション（とダウンタイム）が命だそうです。もう一方には「本物の開発者はコードレビューなんて不要、チームを信頼すべき」と語るLinkedInのグラインド系がいます。

そして、トラウマが伝統にすり替わったようなアドバイスも。「うちのチームは毎週金曜にライブでお互いのコードを酷評します。これで人間力が鍛えられます」とか「対面でしかコードレビューしません。誠実さが保たれます」など。要するに、コメント欄を公開処刑の場に変えただけです。なかなか大胆な戦略ですね。

インフルエンサーが開発サイクルに与える影響をコミカルに描いたのが、私たちが作成したコミック『バスローブインフルエンサーの戦い』です。今タイムラインに溢れる様々なコードレビュー論争と、それを本当にチームが全部真に受けて実践したらどうなるかを、ユーモラスに表現しています。このコミックの制作でインフルエンサーは傷つけていません…たぶん。

CodeRabbitは、良いコードレビューのマナー論争が大好きですが、同時にコードを出荷することや、同僚のメンタルを守ることも大切にしています。そこで、タイムラインでよく見かける「イマイチなコードレビュー論」と、その代わりに実践したい方法をまとめました。これを読めば、チームメンバーがあなたのリポジトリ追放を密かに画策する心配も減るはずです。

悪いコードレビュー傾向 #1：「とりあえずマージ」主義

「WIP – とりあえず今はマージします」と書かれたプルリクエストを見たことがあれば、「とりあえずマージ」派の開発者ということでしょう。彼らは「とにかく早く出して、全部壊して、後で直せばいい」という流派の信者です。品質がどうであれ、とにかく出荷するのが強みで、後で直すのがアジャイルだと信じています。

しかし、実際に「後で直す」ために悪いコードをマージすると、たいてい直されません。誰も直しません。そのまま本番環境に行き、壊れます。そして、あなたの同僚が日曜の夜に謎のバグをデバッグしながら、あなたの名前を呪文のようにささやく羽目になります。

もし開発をスピードアップしたいなら、「とりあえずマージ」よりも良い方法があります。それは、自分のコードを他人にレビューしてもらう前に一度自分で見直すことです。新鮮な目で読み返し、意味不明なロジックや不要なコードを整理しましょう。そして、CodeRabbitのようなAIレビューアをIDEで使うことで、うっかり恥ずかしいミス（使っていないライブラリをimportするなど）も防げます。

悪いコードレビュー傾向 #2：「コードレビューは新人向け」

もう一つよくあるのが、「スタッフエンジニア Lv.4」になったらコードレビューは不要、という考え方です。自分はもはや人間には理解できないレベルに達したので、誰にもレビューされる必要がないと信じています。

このマインドセットは、シニア開発者を「レビュー不可能な魔法使い」に変えてしまいます。彼らはアーキテクチャのサイドクエストに消え、数週間後に2,000行のPRと「大丈夫、テスト済みです」という言葉と共に戻ってきます。

実際は、シニアであればあるほどコードレビューは重要です。なぜなら、あなたの仕事はより影響力が大きく、複雑で、今後の基盤になるからです。そのコードに問題があれば、将来の開発者が苦しむことになります。

また、コードレビューは双方向です。シニアエンジニアは建設的なフィードバックの仕方を示したり、アーキテクチャの背景を共有したり、丁寧なコメントで後輩を指導できます。逆に、ジュニアが素朴な疑問を投げかけたり、独特な変数名に気付いたりして、シニアが学ぶこともあります。どのレベルでもレビューを当たり前にしましょう。

悪いコードレビュー傾向 #3：「細かい指摘は有害」

最近、「コードレビューで細かい指摘をするのはマイクロ・アグレッション（重箱の隅をつつく攻撃）だ」という意見が増えています。変数名の変更や30行の関数を分割する提案をすると、創造性を妨げたり、無駄に開発を遅らせていると受け取られることもあります。確かに、空白や「utils」か「helpers」かなど、どうでもいい指摘が14件も来たら嫌になります。しかし、すべての細かい指摘が無意味なわけではありません。

実際、「小さなこと」とされる多くは本当は小さくありません。可読性の小さな傷が積み重なり、コードベースは内側から腐っていきます。分かりにくい名前や一貫性のない構造、場当たり的な例外処理が積み重なると、デバッグは「心理的ホラー」と化します。

とはいえ、細かい指摘が攻撃のように感じられてはいけません。目的はスタイルガイドの知識をひけらかすことではなく、次の人（さらにその次の人）が理解しやすいコードにするためです。パッシブアグレッシブなLinterのような口調にならないようにしましょう。

まずは、基本的な指摘は本物のLinterに任せましょう。たとえばCodeRabbitには30種類以上のLinterが内蔵されているので、インデントやセミコロンなどの指摘に脳のリソースを使わずに済みます。その分、ロジックや可読性、アーキテクチャなど、人間にしかできない部分に集中できます。

悪いコードレビュー傾向 #4：レビューをパフォーマンスにしがち（ライブレビューの罠）

プルリクエストの良いところは、夜10時にスウェット姿で、スナックをかじりながら、ひとり静かにレビューできる点です。逆に最悪なのは、突然Zoomに呼び出されて「みんなであなたのコードを見てみましょう」と画面共有されることです。

ライブコードレビューは、コラボレーションや「文脈共有」として導入されがちですが、実際は尋問とパフォーマンスアートの中間のようなものです。自分の関数がオークションのようにスクロールされ、他の開発者が「自分もそこ気になってました」と追い打ちをかけてきます。

特にジュニアや内向的な人、公開コード解剖が苦手な人にはつらい時間です。正直なフィードバックがしづらくなり（CTOの前で「ここ分かりにくい」とは言いにくい）、みんなのカレンダーも圧迫されます。3つの非同期コメントで済む話が、45分の会議と存在意義の危機に変わります。

ライブレビューにも適切な場面はありますが、「毎スプリント必ず」ではありません。オンボーディングや大規模なアーキテクチャ変更、ポストモーテムなど、意図的に使いましょう。実施する場合は、事前に予告するのがマナーです。

悪いコードレビュー傾向 #5：AIが書いたコードを自分で読まない

開発現場が混沌とする中、AIがカフェイン漬けのインターンのように登場しました。AIコーディングツールは強力ですが、同時にカオスです。動くコードを出してくれる一方で、架空の関数を混ぜてくることもあります。

実際、5,000行のPRにfoo7という変数や、謎のタイムアウト、税金をメタファーだけで説明したようなロジックが混じっているのを見たことがあります。たまにコンパイルも通りますが、AIアシスタントがロジックバグや重複コード、AIにしか理解できない関数を紛れ込ませていないとは限りません。

AIツールは助けになりますが、批判的思考を手放してはいけません。「プロンプトを投げて、コミットして、ランチに行く」だけのワークフローなら、あなたも問題の一部です。AIが書いたコードは必ず見直し、リファクタし、正気かどうか確認しましょう。そして、マージ前にローカルでAIレビューを走らせてください。

CodeRabbitのIDE内AIレビューアなら、多くの問題をチームメイトが気付く前にキャッチできます。会議前にロボットが「チャック開いてるよ」と教えてくれるようなもので、上司に指摘されるよりずっと気楽です。

コードレビューのマナー：みんながレビューされたいと思うレビュワーになるには

ここまで読んで「自分はどれにも当てはまらない」と思った方、おめでとうございます。では次のステップです。どうすれば「フィードバックをもらって嬉しい」と思われるレビュワーになれるのでしょうか。

まずは人間らしく。なぜそう思うのか、理由も添えて丁寧にコメントしましょう。
分かりにくい部分は、悪意や無能を疑うのではなく、素直に質問しましょう。
本当に巧みだったり美しい実装には、素直に「ナイス！」と伝えましょう。PRで「いい仕事ですね」と一言あるだけで、他のコメントが全部宿題の赤ペンでも救われます。
機械のエラー出力のような口調は避けましょう。フレンドリーなトーンで。絵文字も適度ならOKです（👀や🎉は効果的です）。
チームメイトがマージコンフリクトで泣いているときでも、ちょっとしたユーモアを添えるのもアリです。むしろ、笑わせてあげたいときこそ。

コードレビューの黄金律はシンプルです。「自分が受け取りたいフィードバックを相手にも送る」。そして、すべてのコメントにNotionドキュメントで変数名の哲学を反論しないこと。人生は短いのですから。

「悪いアドバイスをしないAIレビューアが欲しいですか？ IDEやGitプラットフォームでCodeRabbitを試してみてください

]]>

Thu, 17 Jul 2025 19:22:03 GMT

Context is everything – especially in code reviews

At CodeRabbit, we have engineered the most context-rich code reviews in the industry. While other code review tools might skim the surface and settle for just “codebase awareness,” we go deeper. We pull in dozens of data points from your codebase to deliver reviews that are accurate and actually helpful.

We do this by packing a 1:1 ratio of code-to-context in our LLM prompts. For every line of code under review, we’re feeding the LLMs an equal weight of surrounding context. That includes key things like user intent, file dependencies, and expected outcome gathered from sources such as Jira tickets, code graph, past PRs, learnings from chat conversations, Linters, and more.

And we don’t stop there. Every suggestion from our AI is verified post-generation to reduce hallucinations, ensure accuracy, and match it to your code reviews guidelines before it ever reaches your PR.

This is context engineering – and it’s why CodeRabbit’s reviews lead the industry when it comes to relevance, quality, and trust.

In this post, we’ll look at some of the most critical data points that go into CodeRabbit’s Context Engineering approach.

PR and Issue Indexing

Every code review starts with CodeRabbit cloning your repo and keeping it in a sandbox. This ensures that all reviews are completely codebase aware and the code is kept secure in an isolated environment. CodeRabbit analyzes your codebase to understand the existing file relationships, code dependencies, project structure, and patterns across your codebase. To learn more about how CodeRabbit uses GCP Cloud Run Sandboxes, check out this GCP blog.

CodeRabbit also looks at your past PRs to gather additional context including PR titles, descriptions, and affected commit ranges so that it can get more information about the "why" behind code changes. Any previous related PRs are included in the review comments. By better understanding the “why” behind code changes being reviewed, CodeRabbit generates more context-aware AI code reviews.

Additionally, we also index your issues (Jira, Linear, Github and Gitlab Issues) to understand the “intent” behind code changes. Any issue tickets attached to your PR are analyzed and an assessment of the code changes in the PR against the requirements in the linked issues is automatically generated. This helps us understand if the asks in the issue ticket is adequately addressed by the PR.

Code Graph Analysis

Every time a new review is triggered, CodeRabbit builds a graph representation of code dependencies. These are re-generated each time to make sure no new dependencies are missed. Understanding how various functions depend on each other across the codebase is critical to identifying any downstream conflicts that may cause breaking changes.

CodeRabbit analyzes definitions of code symbols (e.g. Types) from this code graph and uses those definitions to enhance context when providing review comments. This helps catch more edge cases and breaking dependencies that could otherwise be missed. You can see the code definitions that were used in the review comment.

Custom Review Instructions

CodeRabbit includes custom review instructions specific to each team’s coding standards. Reviewing your code according to your own custom rules is a critical component of any intelligent code review and CodeRabbit provides a lot of flexibility in terms of how you can provide custom code review instructions:

Path-based filters: These are helpful to reduce the number of files and speed up the code review. Provide the file path in the form of a glob pattern and CodeRabbit will exclude those files from review, or provide an inverse blog pattern if you only want those files reviewed.
Path-based instructions: These are custom review instructions that only apply to files that match the provided glob pattern. Both path filters and instructions are highly deterministic and will kick in when the pattern matches. These are useful to include when you want certain review rules to only apply to some functions.
Coding agent guidelines: If you already have your coding guidelines set up in an AI coding agent (Cursor, Copilot, Cline, Windsurf, etc.) then CodeRabbit can import them and use those existing rules in its code reviews. Just provide the rules file path and we will import the code guidelines.

Learnings from chat: This is a simple and intuitive way to provide feedback on review comments. Don’t like something in the review? Just chat with CodeRabbit and tell it you don’t want those kinds of comments or you would like it to analyze similar issues through a specific lens and it will include your chat feedback in future code reviews.

Linters and Static Analyzers

CodeRabbit packages 40+ Linters/SAST tools with zero-touch configuration needed from the user. While most customers may have some Linters pre-configured, we provide a much more comprehensive set of Linters. These Tools are automatically invoked during the code review process and their results are validated by our verification agent to cut down on the noise that’s often typical of Linters. A more exhaustive list of Linters helps catch more bugs.

We check your code by running it through all of the 40+ supported Linters that are relevant to the code in question.. If you prefer to use your config file and the tool supports a custom config file, then you can provide the file path for your config file and we will use the rules from your config file in code reviews. When a bug is caught because of a Linter, you will see that called out in the review comment.

You can check out the full list of supported Linters in our documentation.

Web Query

Sometimes the underlying LLM used in reviews may not be up to date to review the code accurately. For example: the LLM may not be aware of a latest security update or a new patch release for a particular programming language. In those cases, CodeRabbit will run a real-time web query to fetch technical information from publicly available release notes or technical documentation and include that in the code review.

This helps ensure your code review includes the latest info and doesn’t accidentally flag errors due to outdated info. In the example below, the code was referencing Go version 1.23.6, and the LLM was not aware of a newer version, but CodeRabbit was able to run a Web Query and figure out that actually the latest Go version is 1.24.1 and recommended the user to refer to the latest Go release.

Verification Scripts

Lastly, CodeRabbit also runs verification scripts on the review comments provided by the LLMs to make sure that the review comments will meaningfully improve the codebase. These verification scripts are generated in the sandbox and any low value feedback is automatically filtered out and not passed on to the user, helping filter out most of the AI hallucinations that can sometimes occur.

Industry leading context engineering for better code reviews

As you can see, CodeRabbit has an extensive Context Engineering approach that ultimately provides more accurate code reviews by giving LLMs just the right amount of contextual information to catch more bugs without overwhelming them.

We achieve this by:

Understanding the intent behind code changes to catch otherwise hard to find bugs
Feeding the right amount of information to LLMs with a 1:1 ratio of code to context
Filtering out low value review comments to maintain a high signal to noise ratio

If you’d like to give CodeRabbit a try, we offer a free 14-day trial and it only takes a couple of minutes to give access to your repo. Let us know if you have any questions on Discord!

]]>

Wed, 16 Jul 2025 23:52:45 GMT

The difference between a mediocre and exceptional AI agent comes down to one thing: context.

Context engineering has recently become a buzzword. In late June, Shopify CEO Tobi Lutke tweeted about it and Andrej Karpathy chimed in to point out that good context engineering is what sets AI apps apart.

CodeRabbit is a great example of the difference context makes. Every day, we perform tens of thousands of code reviews — either on our users’ pull requests or in their IDEs. Each review comment that CodeRabbit makes is the result of a carefully engineered non-linear review pipeline that pulls in contextual data, related to code being reviewed, from dozens of sources and then runs verification agents to check again that every suggestion we share makes sense within the context of both the PR we’re reviewing and the greater codebase.

Context engineering is the difference between an AI code review tool that merely pattern-matches against generic coding standards and one that deeply understands your project's specific architecture, patterns, and goals – and can actually add value to your code review.

The nature of context engineering in code reviews

We break down the context in which CodeRabbit operates into three distinct parts:

Intent: What the developer or team aims to achieve with the code changes, including the purpose of a pull request, the problems they are trying to solve, and the intended outcomes.
Environment: The current technical state of the system, including file relationships, code dependencies, project structure, and existing patterns.
Conversation: The rest of the regular stuff that goes into a multi-turn LLM call, i.e., your chat messages, tool call responses, etc.

When these elements are appropriately balanced and presented to an AI system, the result is an intelligent code reviewer that catches not just syntactic issues but also architectural inconsistencies, potential performance bottlenecks and opportunities for higher-level design improvements.

Finding the context engineering sweet spot

Creating the proper context for AI-powered code review involves navigating several challenges. Here are three challenges that make context engineering particularly difficult.

1. The Goldilocks problem for AI agents

Too little context leads to "hallucinations"—where the AI makes assumptions about missing information, often incorrectly.
Too much irrelevant context dilutes the signal, causing the AI to focus on unimportant aspects or become overwhelmed with information.
Just the proper context provides precisely what the AI Agent needs for accurate insights without noise.

2. Token-by-token processing

Unlike humans, who can quickly scan a document and mentally prioritise important sections, AI models process information token-by-token, giving equal weight to each piece of text. If we put all the code changes from a PR in the prompt, the AI may latch on to something insignificant and skip major issues. Curating the context is important. You need to prioritise important changes and discard the unimportant ones.

3. Context window limitations

Even the most advanced AI models have finite context windows, the maximum amount of text they can process at once. This limitation makes strategic context selection critical, especially for large codebases or complex pull requests.

The CodeRabbit approach to context engineering

At CodeRabbit, we've developed a multi-layered approach to context preparation that addresses these challenges and delivers consistently high-quality code reviews. Our system employs a sophisticated, non-linear pipeline designed to gather, filter, and structure context in ways that maximise AI comprehension. The diagram above lists just some of the dozens of sources of context we draw on in our context preparation process.

Intelligent repository and PR information collection

Our context preparation begins with extracting the most relevant information about the pull request itself:

Metadata: We collect essential data like PR title, description, and affected commit range to determine the "why" behind code changes.
Differential analysis: For incremental reviews, we calculate exact changes since the last review, ensuring the agent focuses only on what's new or modified.
Path filtering: Our system distinguishes between meaningful code changes and ancillary files (like generated assets or dependencies), focusing the AI's attention on what truly matters.

Knowledge integration from multiple sources

A great code review requires more than examining the current changes in isolation. Next, we work on understanding the broader technical and business context:

Historical learnings: We employ a vector database to store our agent’s learnings from past reviews, allowing the system to recall relevant feedback patterns and user preferences so it can structure review comments with these in mind.
PR intent analysis: We analyse PR descriptions and related issues to extract the underlying objectives of changes, ensuring CodeRabbit's review aligns with developer goals.
Code Graph Analysis: We then construct a graph representation of code dependencies to help the AI understand how files interrelate, enabling reviews that consider architectural impact.

Strategic context assembly

Once we have gathered all the raw information required for reviewing, we optimize how the prompt is packaged for the AI agent.

Prompt engineering

The next stage of our review pipeline involves crafting the perfect instructions for the AI. We average a 1:1 ratio of code to context in our prompts which shows how important context is:

Level-appropriate prompts: We adjust the review depth based on file complexity and importance, ranging from basic checks to in-depth architectural analysis. For different complexity levels, we use different prompts and models.
Structured review guidelines: Clear instructions help the AI Agent focus on the most valuable types of feedback for each specific situation based on our own historical data on helpful review comments
Context Enrichment: The prompts include relevant project coding standards, patterns, and historical insights that guide the AI toward company-specific best practices.
Context Selection: We perform a final pass of the context with results of the previous agents, which did the context preparation to cut noise.

Verification agents

The final target of our review process is our verification system which is an AI-powered quality assurance layer that automatically validates and improves review comments. It’s activated when the AI reviewer needs to double-check its findings.

Impact of context engineering on review quality

The sophistication of our context engineering preparation pipeline directly translates to tangible benefits in review quality, including:

Reduced false positives

By providing the AI with the proper context, we dramatically reduce irrelevant or incorrect suggestions that waste developer time like, for example, changes to function calls that don’t align with the team’s coding standards. The system understands project-specific conventions and avoids flagging intentional patterns as issues.

Deeper architectural insights

With more knowledge of code relationships and project structure, CodeRabbit can identify architectural issues that simple linting tools or pattern matching will miss. For example, many of our customers recount how CodeRabbit is able to flag when changes in a PR will affect other dependencies in their codebase.

Consistent application of best practices

By incorporating historical learnings and team-specific knowledge, we consistently apply coding standards and best practices across all reviews. We continue to make it easier for teams to share their coding guidelines – including recently enabling the ability to import coding guidelines from your favorite coding agent.

Enhanced learning over time

Our approach enables the system to improve with each review, building a growing knowledge base of project-specific insights that make future reviews even more valuable.

The importance of good context engineering

Context is not merely a technical requirement for LLMs, but a requirement for effective AI Agents. By thoughtfully gathering, filtering, structuring, and presenting the context, CodeRabbit doesn't just review code. Instead, it understands code in its full complexity so that it can provide insights that make developers more productive, code more robust, and teams more effective. This is increasingly important as AI coding agents currently tend to generate a significant amount of AI slop.

This is only the beginning of context engineering for AI code reviews – we are always refining our approach to improve review quality. With model capabilities constantly expanding, in the future, we’ll be able to do much more.

Interested in seeing how context makes a difference in our code reviews? Start your 14-day trial!.

]]>

Wed, 16 Jul 2025 05:00:43 GMT

Code reviews used to be a quiet, dignified affair—an async ritual where you thoughtfully (or passive aggressively) gave feedback and occasionally dropped a “nit: maybe rename this var?” comment to feel alive. But then social media got involved. Now, developers are sharing hot takes about code reviews, many of which sound like they were written by someone who’s never met another human being.

In one corner, we have Twitter influencers claiming you should “just merge it and fix it later” because iteration (and downtime) is life, apparently? In another, LinkedIn grindset bros are insisting that real devs don’t need code reviews because “you should trust your team.”

And then there’s the advice that seems like trauma disguised as tradition: “Our team just roasts each other’s code live on Fridays. It builds character.” Or: “We only do in-person code reviews, it keeps people honest.” Translation: we’ve replaced the comments section with live humiliation. Bold strategy.

To illustrate the impact of influencers on development cycles, we created a comic. The Battle of the Bathrobe Influencers playfully pokes fun at some of our favorite influencers and how many different code review takes there are on the timeline right now and how that might affect individual developers if teams actually listened to and followed all the competing takes. No influencers were harmed in the making of this comic… we hope.

At CodeRabbit, we love a good code review etiquette debate but we also like shipping code and keeping our coworkers emotionally intact. So, we’ve put together a list of some code review bad takes we’ve seen floating around the timeline—and what to do instead if you don’t want your team quietly scheming your removal from the repo.

Bad code review trend #1: The “Just merge it” mentality

If you've ever seen a pull request titled “WIP – just gonna merge this for now,” you’ve met a ‘Just merge it” dev. They’re disciples of the “move fast and break literally everything” school of development. They believe in shipping code quickly no matter how bad is a strength and fixing things later is part of the agile lifestyle.

But here’s what actually happens when you merge bad code to “fix later”: you don’t fix it later. No one does. It goes to production. It breaks. Now your teammate is debugging a mystery bug on a Sunday night and slowly whispering your name like it’s part of a ritual curse.

If you’re trying to speed things up, here’s a radical idea that’s better than just merging it: review your own code before you ask someone else to. Give it a read with fresh eyes. Clean up nonsense logic, remove dead code, and maybe run it through an AI reviewer like CodeRabbit in your IDE so you don’t miss something humiliating (like importing a library you’re not even using. Again).

Bad code review trend #2: “Code reviews are for juniors”

Another fun trend: the belief that once you hit “Staff Engineer IV (Ascended Form),” code reviews are beneath you. You’ve achieved such enlightenment that no mortal can possibly understand your code.

This mindset turns senior devs into unreviewable wizards who vanish into architectural side quests and return weeks later with 2,000-line PRs and the words “don’t worry, I tested it.”

Here’s the truth: the more senior you are, the more important it is to have your code reviewed. Not because you don’t know what you’re doing but because your work is more impactful, more complex, and more likely to become the terrifying foundation for everything built after it. If that code is flawed, future developers will suffer.

Plus, code reviews are a two-way street. Senior engineers can model how to give constructive feedback, share architectural context, and mentor others by writing thoughtful comments. They can also learn from junior devs who ask great questions or spot something weird because they’re not desensitized to your obscure variable naming scheme yet. So, don’t skip reviews. Normalize them at every level.

Bad code review trend #3: “Nitpicks are toxic”

There’s a growing movement online that says nitpicky comments in code reviews are basically microaggressions. That if you suggest someone rename a variable or break up a 30-line function, you’re stifling creativity and slowing down development for no good reason. Look, we get it. No one wants to receive 14 comments about whitespace and whether it’s “utils” or “helpers.” But not all nitpicks are created equal.

The truth is, a lot of “tiny” things in code aren’t actually tiny. They’re death by a thousand readability cuts. Codebases rot from the inside out when no one sweeps up the small stuff. Confusing names, inconsistent structure, or one-off edge case handling eventually add up to a debugging experience best described as “psychological horror.”

That said, nit-level feedback shouldn’t feel like an attack. The goal isn’t to flex your encyclopedic knowledge of the company style guide—it’s to help your teammate write code the next person (and the next-next person) can understand. You can do that without sounding like a passive-aggressive linter.

Start by offloading the basics to, well… actual linters. CodeRabbit, for example, comes with 30+ linters built in, so you don’t have to waste brain cells correcting indentation or lecturing someone about semicolons. That frees you up to focus on logic, clarity, and architecture—things humans are best at.

Bad code review trend #4: Frequently turning code reviews into a performance (aka the live review trap)

You know what’s great about pull requests? You can review them in sweatpants at 10pm, covered in crumbs, muttering to yourself in peace if you want. You know what’s not great? Being called into a surprise Zoom meeting where your manager shares their screen and says, “Let’s walk through your code together... as a team.”

Live code reviews are the latest workplace horror disguised as collaboration and “context sharing.” And sometimes they are! But often they’re just a weird hybrid of interrogation and performance art. You sit there while someone scrolls through your functions like they’re auctioning off your dignity and other devs chime in with helpful thoughts like, “Yeah, I had questions about that, too.”

These sessions are especially rough for junior devs, introverts, or literally anyone who doesn’t enjoy public code autopsies. They discourage honest feedback (because who wants to be the one to say “this is confusing” in front of the CTO?) and eat up everyone’s calendars. What could’ve been solved with three async comments now requires a 45-minute meeting and an existential crisis.

There is a place for live code reviews—but it’s not “every sprint, all the time.” Use them intentionally: for onboarding, gnarly architectural changes, or postmortems. And if you’re going to do one, give people a heads up.

Bad code review trend #5: Not reading through your AI-code first

As if things weren’t chaotic enough in development, we now have AI entering the scene like a caffeinated intern who means well but keeps hallucinating entire modules. AI coding tools are powerful – but also deeply chaotic. They’ll give you working code and sprinkle in a few imaginary functions for flavor.

We’ve seen it all: five-thousand-line PRs with variables named foo7, random timeouts, and logic that reads like someone asked ChatGPT to explain taxes using only metaphors. Sometimes it even compiles! But that doesn’t mean your AI sidekick didn’t also sneak in a few logic bugs, duplicate code blocks, or functions that only make sense if you’re a large language model yourself.

AI tools are here to help – not replace critical thinking. If your workflow is “vibe prompt, commit code, start lunch,” then yes: you’re part of the problem. Review what your AI assistant wrote. Refactor. Sanity check. And for the love of merge conflicts, run a local AI review before unleashing it on your team.

CodeRabbit’s AI reviewer in your IDE can catch a lot of issues before your teammates have to. It’s like having a robot tell you your fly’s down before you walk into a meeting – deeply appreciated and MUCH less awkward than your boss doing it.

Code review etiquette: How to be a code reviewer people want to work with

Let’s say you’ve made it through this blog post without recognizing yourself in any of these horror stories. You win! Now, let’s level up: how do you become the kind of reviewer people actually like getting feedback from?

Start by being human. Leave thoughtful comments that explain the why, not just the what.
Ask questions when something’s unclear instead of assuming malice or incompetence.
If something’s genuinely clever or elegant? Say so! A simple “nice work” in a PR can be a game-changer—especially when the rest of the comments feel like homework corrections.
It also helps to not sound like a compiler error in human skin. Use a friendly tone. Emojis are fine in moderation (a well-placed 👀 or 🎉 goes a long way).
You can even sprinkle in a bit of humor, assuming your teammate isn’t currently crying over merge conflicts – or maybe because they are and you want to make them laugh.

The golden rule of code review etiquette is simple: give the kind of feedback you’d want to receive. And maybe don’t reply to every comment with a Notion doc defending your variable naming philosophy. Life’s too short.

Want an AI reviewer that won’t give bad advice? Try CodeRabbit in your IDE and git-platform.

]]>

Thu, 10 Jul 2025 03:07:13 GMT

What percentage of your code should be AI generated?の意訳です。

率直に言っておきます。このタイトルはほぼ釣りタイトルです。

この記事では、「コードベースの20%、30%、あるいは50%をAIで生成すべきだ」と主張するつもりはありません。ただ、近いうちに誰かがそう言い出すかもしれない兆しがあるため、この記事を書いています。その“誰か”とは、あなたの上司かもしれませんし、経営幹部かもしれません。

2025年4月、GoogleとMicrosoftが、自社の新規または既存コードの最大30%がAIによって生成されていると公に発表しました。興味深いのは、両社がほぼ同時期に同様の割合でAI活用を定量的に示した点です。さらに注目すべきは、彼らがその発言をどこで行ったかということです。

Googleのスンダー・ピチャイは、この数字を決算説明会で述べました。これは初めてではなく、2024年11月の決算説明会でも新規コードの25%がAI生成であると発言しており、AI導入の進捗を投資家に報告する文脈で話されています。

一方Microsoftのサティア・ナデラは、それに続いてLlamaConでのマーク・ザッカーバーグとの対談中に、自社コードの20〜30%がAI生成であると発言しました。

これらの発言から読み取れるのは、投資家や経営陣がAI導入状況や企業の将来性を測るための新たな指標を見出した可能性があるということです。では、このような割合が本当に意味するのは何なのでしょうか。そして、もしこの数字がAI導入や競争力の尺度として広く使われるようになった場合、企業や開発者はどのように「正しい割合」を判断すればよいのでしょうか。

AI導入率という虚栄の指標が、投資家と経営陣のお気に入りに

2023〜2024年にかけて、なぜ多くの上場企業が急いで「AI戦略」を打ち出したのか不思議に思ったなら、その理由は株式市場がそれに反応したからです。AI導入によって収益増加やコスト削減が見込まれるという投資家の期待が、AI戦略を発表した企業の株価を平均2%押し上げたと報告されています。

さらに67%の企業は、株価が6%以上上昇するというより大きな恩恵を受けました。BuzzFeedに至っては、生成AIを使ってコンテンツを制作する計画を発表しただけで、株価が120%も上昇しました。AI戦略を明確にしなかった企業は、株式市場で不利な扱いを受ける傾向にありました。

こうしてAI導入は、実際のビジネス上の利益に加え、発表するだけで株価を押し上げる手段として、経営陣にとって優先事項になっていきました。

その結果、すべてのAI導入が「良い導入」と見なされる企業文化が一部で生まれつつあります。それが投資家や経営層を喜ばせ、「生産性やスピードが向上した」という印象を与えるためです。

2025年4月には、Lead Devが記事で、企業がAIコーディングの利用を義務づけており、それが「開発者を追い詰めている」と報じました。これには、AIコーディングエージェントの提案受け入れ数を増やすよう求める指示や、従業員ごとのAI使用率をランキングする公開ボード、さらにAI使用量の増加が求められる曖昧なOKR（目標と成果指標）などが含まれます。

問題は、こうした指標が複雑な結果を伴う行動を粗雑に測定・誘導してしまう点にあります。記事にリンクされたRedditの投稿のコメントでもそれが示されています。ある開発者はこう述べています。「毎月の全社エンジニア会議で、Copilotの使用状況が報告される。そして使用率をもっと上げるように言われる。その数枚後のスライドで、重大インシデントの件数も増加していると報告される」

AIツールの効果を巡って、開発者と経営層の間に認識のズレがあることは明白です。2024年にAtlassianが実施した2,000人以上のITマネージャーと開発者への調査では、経営層は「AIこそが開発者の生産性と満足度を高める最重要要素」と見なしていましたが、AIによって生産性が向上したと回答した開発者は3分の1にとどまりました。

AIの使用量ばかりを測定して、生成されたコードの品質やデバッグ・レビューにかかる時間を考慮しないのであれば、AI導入が企業にとってマイナスに働く場面でもそれを推進することになります。

最悪の場合、コードベースの50%がAI生成になる一方で、バグが急増し、ユーザーからの苦情や障害も増えることになりかねません。しかも、開発者がレビューや修正に同等の時間を費やすため、時間の節約にもならないかもしれません。

とはいえ、決算説明会では「30%がAI生成」と言えば聞こえは良さそうです。

AI生成コードの割合はどこまでが“やりすぎ”なのか

GoogleやMicrosoftのような企業は、自社のコードベースの多くをAIで生成しようと競っています。MicrosoftのCTOは2030年までに全コードの95%がAI生成になると予測してさえいます。しかし、多くの企業がそこまでの高い目標を掲げているわけではありません。

2025年3月、Y Combinatorのマネージングパートナーであるジャレッド・フリードマンは、現コホートのスタートアップのうち4分の1がコードベースの95%がAI生成であると発言しました。

そのYouTube動画のコメント欄を要約すると、多くの開発者は「コードの95%がAI生成」という状況を“破滅的”だと受け止めています。

この問題は「ゴルディロックスのジレンマ」に近いです。つまり、多すぎても少なすぎてもダメで、ちょうどよい“魔法の数字”があるということです。では、それはいったい何パーセントなのでしょうか。

コードのどの程度がAI生成になると“やりすぎ”になるのでしょうか。40%？50%？75%以上？それはアプリケーションの種類や使用言語によって異なるのでしょうか。企業ごとに違うのでしょうか。そして、この割合を公表することで、その企業について何を示しているのでしょうか。

コードの割合を測定する意味はあるのか？

この疑問に答えるために、そもそもこの指標が何を測定しているのかを掘り下げましょう。

GoogleやMicrosoftのような企業が「30%のコードがAI生成だ」と発言するとき、それは何を計測しているのでしょうか。通常、これはCopilot、Cursor、Claude、WindsurfといったAIコーディングツールによって作成された行数やコミット数を指しています。しかし、コード行数という指標は、昔から生産性や価値の測定には不適切だとされています。さらに、AIが生成した後で大幅に修正された行はカウントされていない可能性もあります。

AIコーディングツールは、ボイラープレートや繰り返しがちなコードを書くのに長けています。つまり、もともと開発者が素早く書けるような、複雑性の低い、価値の低いコードです。こういった“簡単に稼げるポイント”をカウントすることで、AI導入率は膨らみますが、実際の生産性向上を意味しているとは限りません。さらに、存在しないAPIキーを生成するなど、AIによる幻覚的生成も頻繁に報告されています。

より重大なのは、単純な量に基づく指標では、生成されたコードの複雑さや品質が反映されない点です。AI生成コードのデバッグやレビューにどれだけの開発者の時間が必要だったのかも見えてきません。こうした文脈がなければ、「30%がAI生成」という数字は、実際の効率性や品質にはほとんど関係がありません。

開発者フォーラムや調査では、AI生成コードがバグや脆弱性を増やすことへの不満が繰り返し示されています。ある調査では、AIコーディングツールが最大41%多くのバグを生むと報告されています。たとえば、Harnessが最近発表した調査によると、67%の開発者がAI生成コードのデバッグに人間のコードよりも多くの時間を費やしており、68%がセキュリティ問題の修正に多くの時間を費やしていると回答しました。さらに59%は、AIコーディングツール使用時の半分以上のケースでデプロイに問題が発生していると答えました。

こうした背景を踏まえると、MicrosoftのようなAIエージェントを販売している企業が「コードベースに占める割合」という指標を成功基準として推進したがる理由は明白です。それは、AI活用をより包括的に評価する指標よりも、都合がよいからです。

しかし、誤ったOKR、硬直的なノルマ、開発者のAI使用率を可視化する公開ランキングなどは、文脈や品質を無視し、開発者が意味のない数字を追いかけることに時間を費やす原因になります。AI投資の価値を本当に測定したいのであれば、もっと適切な指標があります。

「コード生成率」という指標はコスト削減志向に偏っている

さらに憂慮すべき点として、多くの開発者が報告しているのは、こうしたAI導入の義務化が社内の採用凍結と結びついているということです。特にエントリーレベルの職種が影響を受けていることが多く、経営陣の間では「AIが開発者を置き換える」という楽観的な見方が広がっているようです。

たとえば、MetaのCEOであるマーク・ザッカーバーグ、SalesforceのCEOであるマーク・ベニオフ、AWSのCEOであるマット・ガーマンらは、AIが開発者の仕事を置き換える可能性について頻繁に言及しています。

2025年初頭、ベニオフはポッドキャストで「今年は誰も雇わないかもしれない」と発言しました。「AIエージェントによって信じられないほどの生産性向上が得られている」との理由からです。

もしあなたが、こうしたAIの可能性に対して過剰な楽観を感じているなら、その感覚は正しいといえるでしょう。実際、Microsoftのナデラ自身も、Copilotを使ったAIコード生成について「言語によって成果にばらつきがある」と認めています。

このような指標の問題点は、多くの企業がAI生成コードを導入することで、開発者の削減、あるいは少なくとも人数削減が可能になると信じてしまっている点です。もし、「コードベースの50%をAIに置き換えることで、最終的に開発者を半分に減らせる」と本気で思っているなら、それは大きな誤解です。

このような前提でこの指標を追いかけると、企業は大きな問題に直面する可能性があります。技術的負債や不具合の増加だけでなく、投資家をも失望させる結果になるかもしれません。人員削減が実現しなければ期待外れになりますし、仮に実現したとしても品質低下が明らかになれば、企業の収益に悪影響が及ぶでしょう。

開発者が本当に必要とする指標

コード行数の生成だけを測定するのではなく、企業は生成されたコードの品質やAI導入による実際の生産性への影響を測定すべきです。以下は、AI活用を評価するために併せて追跡すべき指標の例です。

AIコーディングツール導入前後でのバグ発生率
本番環境での障害発生頻度とAI使用量との相関
デバッグや複雑なコードレビューを含めた開発ライフサイクル全体での実際の時間短縮量
開発者からの定性的フィードバックをもとにした満足度と生産性の変化

もう一つの戦略として、ChargeLabが実施したアプローチがあります。LeadDevの記事でも紹介されたこの戦略は、開発者主体のAI活用方針です。彼らの開発者はAIツールを自由に選択でき、その結果、生産性が40%向上したという測定結果が得られました。これは義務化によるものではなく、開発者が自身で設定した文脈に応じた、意味のある指標に基づいていたからです。

また別のLeadDevの記事では、AI導入はコード生成に偏るべきではなく、コードレビューやリファクタリング、テスト、ドキュメント作成など、開発ライフサイクル全体での生産性向上に焦点を当てるべきだと述べています。「コードベースのAI比率」という指標は、それらの潜在的な効率化領域を無視してしまいます。

実際、DORAのAIがソフトウェア開発に与える影響に関するレポートでも、生産性向上を実現するための5つの戦略が示されており、その1つ目が「AIをコード生成だけでなく、開発の全工程に活用する」ことです。AIがあらゆるフェーズで活用される時代はすでに始まっており、私たちも先月の投稿で2025年の主要開発トレンドとしてこの流れを取り上げました。

有用なAI導入指標を作るには

効果的なAI導入指標は経営層ではなく、現場のエンジニアチームから生まれるべきです。現場を知らない経営陣が一方的に作った指標ではなく、開発の実態に即した評価基準が必要です。

理想的な指標は、次のような要件を満たしているべきです。

日々のワークフローを理解している開発者と協力して作成されている
表面的なAI導入率ではなく、実際の生産性やビジネス成果に連動している
柔軟で文脈に応じた実験を奨励し、厳格な強制ルールではない

たとえばChargeLabの戦略では、「AIによって年間100万ドルの開発時間を削減する」といった広い目標を掲げつつ、各チームがその目標をどのように達成するかは自由に決められるようになっていました。

これは、明確な方向性と開発者の裁量の両立を実現しており、単純なノルマやツール一択の指示ではなく、測定可能で意味のある成果に焦点を当てています。

このような指標があれば、「ここは手書きしたほうがいい」という場面では人間がコードを書き、コードレビューやQA・テスト工程ではAIを活用するなど、より柔軟で合理的な選択が可能になります。

AI生成コードの割合を追跡すべきなのか？

結論として、「AI生成コードの割合」という単一の指標は、使い方に注意が必要です。それだけでは情報が単純化されすぎており、誤った行動を助長し、開発者にストレスを与える可能性があります。

その代わりに、エンジニアリングリーダーや開発者は、生産性、コード品質、開発者の満足度に明確に結びついた指標に注目すべきです。こうした複合的で成果志向の指標こそが、AI導入の真の効果を示すのです。「コードの何パーセントがAI生成か」だけでは決して見えてこない、本当のインパクトがそこにはあります。

より良いコードをより早くリリースするためのAIツールを試してみませんか？ CodeRabbitの14日間トライアルを今すぐ始める

]]>

Thu, 10 Jul 2025 02:51:10 GMT

Code Guidelines: Bring your coding rules to CodeRabbitの意訳です。

Cursorや他のAIコーディングツール（使ったことがない人はいないのでは？）を使っている、あるいは試したことがあるなら、その欠点にも気づいているかもしれません。新しいファイルを作成するたびにプロジェクト構成の説明を繰り返さなければならなかったり、UIコンポーネントにshadcn/uiを使っている場合だったり、ビルド時に先にバックエンドの依存コンポーネントをビルドしなければならなかったりと、何かと面倒がつきまといます。

CursorやWindsurf、Copilot、その他のAIコーディングエージェントは、再利用可能かつスコープ付きの指示を各リクエスト時や必要に応じて読み取れることで、こうした問題に対応しました。これは賢明な判断です。

Code Guidelinesの使い方

CodeRabbitは、以下のコーディングルールファイルを自動的に検出します。

Cursor: .cursorrules
GitHub Copilot: .github/copilot-instructions.md
Cline: /.clinerules/*
Windsurf: /.windsurfrules
Claude: /CLAUDE.md
その他にも対応しています

Code Guidelinesを使えば、これらのルールがすべてのプルリクエスト（PR）レビュー時のコンテキストとして使用されます。

さらに、独自のルールを設定ファイルとして読み込むことも可能です。たとえば docs/STANDARDS.md や team/code-style.txt にガイドラインが記載されている場合、そのパターンを指定するだけでCodeRabbitが自動的に読み取ります。

なぜ重要なのか

この仕組みは非常に大きな意味を持ちます。第一に、AIコーディングエージェントで設定したルールをそのままCodeRabbitでも使えるようになるということです。つまり、コンテキストの切り替えが不要になりますし、同じスタンダードを2度説明する必要もなくなります。

また、CodeRabbitに対して、どのようにPRレビューを行ってほしいか、どのベストプラクティスに従ってほしいかを明確に指定できるようになります。CursorやGitHubなどで定義したガイドラインをCodeRabbitでも活用できるのです。

例えば、以下のようなメリットが挙げられます。

スタイルやフォーマットの一貫性を強制できる .cursorrules に「関数はcamelCase、コンポーネントはPascalCase」と書かれていれば、CodeRabbitはレビュー時に違反箇所を指摘します。
アーキテクチャパターンの維持が可能になる 好きなファイル構成、モジュールの境界、依存関係ルールなどを一度定義すれば、CodeRabbitはすべてのPRがそれに従っているか確認します。
チーム独自のスタンダードを適用できる 例えば「必ずearly returnを使う」「継承より合成を優先する」といったルールも、レビューの一部としてチェックできます。

今後の展開は？

CodeRabbitが優れたPRレビューを行える理由は、強力なコンテキストエンジニアリングにあります。Code Guidelinesは、そのコンテキストの質をさらに高める機能です。今後の展開に関してヒントをひとつだけ言うなら、「コンテキストこそが王者」です。ぜひ今後のアップデートにもご注目ください。

まだCodeRabbitを試していないなら、今が絶好のチャンスです。 無料で始める ことで、コードレビューにかかる時間もバグも半減させましょう。

]]>

Tue, 08 Jul 2025 19:57:24 GMT

We’ll come clean: That title is mostly clickbait.

This isn’t an article where we tell you that 20% or 30% or even 50% of your codebase should be AI-generated. We’re writing this because it’s looking like very soon someone could be telling you that. Maybe your boss. Or your C-suite.

In April, both Google and Microsoft came out publicly with claims that up to 30% of their new or existing code was AI-generated. What’s interesting is that both Microsoft and Google decided to quantify their AI usage in similar percentages during the same month. Even more notable is where they made these claims.

Google’s Sundar Pichai shared that stat in an earnings call – and this wasn’t the first time he talked about AI-generated code at Google in this way. During Google’s November 2024 earnings call, Pichai stated that 25% of Google’s new code was AI-generated. So, he explicitly framed it as an update for investors on Google’s progress at implementing AI.

Not to be outdone, Microsoft’s Satya Nadella came out a few days later during a fireside chat with Mark Zuckerberg at LlamaCon with the claim that 20% to 30% of Microsoft’s codebase was AI-generated.

What we likely saw in those two updates was the birth of a new metric that investors and executives will believe says something important about the state or future of a business. But what does a percentage like that really say? And, if it’s going to be used more widely as a way to measure AI adoption or a business’ competitiveness, how would companies and developers even decide what the ‘right’ percentage looks like?

AI vanity metrics are the darlings of investors and execs

If you wondered why so many publicly traded companies rushed in 2023 and 2024 to share their ‘AI strategy,’ it’s because the stock market rewarded them for it. With investors projecting both revenue increases and cost savings from adopting AI, the stock prices of companies who announced an AI strategy increased by 2% more on average than companies that didn’t.

But 67% of companies had even better results – their stock prices soared over 6% higher. BuzzFeed’s stock even went up a whopping 120% just for announcing they planned to use generative AI to create content. Companies that didn’t articulate their AI strategy were generally punished in publicly traded markets.

AI adoption has since become a key priority for Executives – both for the actual business benefits it promises AND for the stock price increases they now rely on whenever they announce new AI investments or adoption.

That’s leading to what some have characterized as a toxic culture in some companies where all AI adoption is counted as good AI adoption because it makes investors and executives happy – and gives the veneer of increased productivity and velocity.

In April, Lead Dev wrote about how companies are now instituting AI coding mandates and how that’s – in their words – “driving developers to the brink.” These mandates can be anything from the request to increase the number of suggestions you accept from AI coding agents to public leaderboards ranking AI usage by employee to vague performance-based OKRs where devs are simply expected to use AI ‘more’ from quarter to quarter.

The problem? Like many metrics, they’re blunt ways to measure and incentivize behaviors that have complex outcomes. This is demonstrated in the comments of a Reddit post the Lead Dev article links to. Says one dev: “At our monthly engineering all hands, they give us a report on our org’s usage of Copilot (which has slowly been increasing) and tell us that we need to be using it more. Then, a few slides later we see that our severe incidents are also increasing.”

It’s clear there’s a disconnect between engineers and executives on the benefits of AI coding tool use. An Atlassian survey of over 2,000 IT managers and developers in 2024 showed that leaders listed AI as the most important factor in improving developer productivity and satisfaction but only a third of developers reported experiencing AI-related productivity gains.

By only measuring AI use and not the quality of the code that AI generates or the actual time saved once debugging and more involved code reviews are factored in, you could be incentivizing AI use even when AI use hurts your company.

In that case, you might achieve 50% of your codebase being AI generated while also adding exponentially more bugs to your code and increasing issues and customer complaints. And to make it worse, you might ALSO not be saving any time since your devs could be spending an equal amount of time reviewing and fixing that code as they would have if they wrote it from scratch.

But your AI usage would sound impressive during an earnings call, right?

What percentage of AI code is too much?

While companies like Google and Microsoft are racing to make as much of their codebase as possible AI generated – and Microsoft’s CTO is even predicting that 95% of all code will be AI generated by 2030 – it’s unlikely other companies are aiming that high.

On a YC podcast back in March, Jared Friedman, Y Combinator Managing Partner claimed that a quarter of the accelerator’s current cohort have codebases that are 95% generated by AI.

I’ll summarize (and sanitize) the YouTube comments on that one for you: Most devs, unsurprisingly, felt that having a codebase that was 95% AI-generated was a recipe for disaster.

It appears we have a Goldilocks dilemma here. There is a magic number that is neither too low, nor too high but just right when it comes to AI-generated code. But what is it?

At what point does your code become too AI-generated? Is it 40%? 50%? 75%? Over 75%? Does it depend on the application? The language? Does it vary from one company to another? And what are you actually saying about a company by sharing this percentage?

Should percentage of code be a metric that’s tracked?

To answer this question, let’s dig into what this metric might actually be measuring.

When companies like Google and Microsoft say 30% of their code is AI-generated, what are they counting? Usually, this refers to the number of lines or commits that originated from AI coding tools like Copilot, Cursor, Claude, or Windsurf. But raw lines of code is a notoriously poor metric of productivity or value. And that doesn’t account for the lines AI wrote that were then heavily edited.

AI coding tools often excel at writing boilerplate or repetitive code—exactly the kind of low-complexity, low-value code that developers could produce rapidly themselves anyway. Counting these easy wins inflates AI adoption numbers without necessarily indicating meaningful productivity gains. Another challenge is that developers report frequent and concerning hallucinations like making up API keys that don’t exist.

More critically, a metric based purely on volume doesn’t capture the complexity or quality of the code generated. It doesn’t tell you how much developer time was needed to debug or review the AI-generated code. Without these nuances, a 30% metric means almost nothing about actual efficiency or quality outcomes.

Developer forums and surveys consistently highlight frustration at AI-generated code's tendency to introduce more bugs and vulnerabilities with one survey estimating that AI coding tools add up to 41% more bugs. For example, Harness recently released a survey that revealed 67% of developers spend more time debugging AI-generated code than human-generated code and 68% spend more time resolving security issues. Even worse was that 59% in the Harness survey said that they experienced problems with deployments at least half the time they used AI coding tools.

Perhaps, for this reason, it’s not surprising why companies that are selling coding agents, like Microsoft, might want teams to adopt ‘percentage of the codebase’ as a success metric over others that give a more holistic view of the use of AI in development.

But misguided OKRs, rigid quotas, and public leaderboards of each developer’s AI use ignore context and quality, leading developers to spend more time chasing meaningless metrics than delivering high-quality software. If you’re trying to measure the value of your AI investment, there are much better metrics to track.

‘Percentage of code’ metrics seem focused on cost cutting

What’s more concerning is that many devs report that AI mandates like these are often connected to hiring freezes at their companies – with entry-level jobs being hit harder. That suggests many execs share the ‘optimism’ that AI will replace developers – something that people like Meta CEO Mark Zuckerberg, Salesforce CEO Marc Benioff and AWS CEO Matt Garman are talking about publicly a lot these days. In early 2025, Bernioff shared his plans on a podcast saying, “Maybe we aren’t going to hire anybody this year. We have seen such incredible productivity gains because of the agents.”

If this seems overly optimistic of AI’s potential to you, you’re right to feel that way. After all, Microsoft’s Nadella even admitted that they were seeing ‘mixed results’ with AI-generated code in certain languages with their own Copilot tool in the LlamaCon chat.

The problem with metrics like this is that many companies now believe that the use of AI-generated code is a way to reduce costs by replacing developers – or, at least, reduce their numbers. But, if you're measuring what percentage of your codebase is AI-generated because you believe you’ll eventually be able to cut your workforce by 50% once you achieve 50% AI code, you’re going to be sadly mistaken.

In that case, adopting this metric appears like it could put companies on a collision course – not just to create more technical debt and issues – but also to disappoint investors. Either those layoffs won’t materialize or, if they do, they’ll lead to increased issues and noticeable quality degradations that will impact a company’s bottom line.

Metrics developers actually need

Instead of just measuring raw code generated, companies should be measuring the quality of that code and the true productivity impact of AI adoption. For example, in addition to AI usage and adoption metrics companies should also track:

Bug rates before and after adopting AI coding tools.
Deployment stability (frequency of production incidents) against AI usage,
Actual time saved in the full development lifecycle once debugging and more complex code reviews are added in.
Developer satisfaction and productivity based on qualitative feedback.

Another strategy is to follow what ChargeLab has done. Mentioned in the LeadDev article, it has a more dev-focused AI strategy. Their developers choose their AI tools freely, which resulted in a measured 40% productivity increase. This increase was not driven by mandates but by empowering developers with context-specific and meaningful metrics they themselves set and allowing them to have choice over their tools.

Another LeadDev article also suggested that AI adoption shouldn’t be narrowly focused on code generation since productivity gains can equally be had at other parts of the software development lifecycle like code reviews, refactoring, testing and documentation. Metrics around how much of your codebase is AI-generated ignore the potential savings from those areas.

Indeed, the DORA report on the Impact of AI in Software Development outlined 5 strategies for ensuring AI actually helps with productivity gains. The first strategy? Use AI at all stages of the development cycle, not just for code generation. The use of AI throughout the entire development cycle is becoming so common, that we flagged it as the main development trend we expect to see in 2025 in a post last month.

Creating useful AI adoption metrics: best practices

Effective AI adoption metrics should come directly from engineering teams themselves, not from executives disconnected from the realities of the company’s codebase.

Metrics should:

Be developed in collaboration with engineers who understand day-to-day workflows.
Align with real productivity and business outcomes, not superficial adoption targets.
Encourage flexible, context-aware experimentation rather than rigid enforcement.

ChargeLab's strategy, for example, involved setting a broad organizational goal (e.g., saving $1 million annually in dev time by using AI) but giving teams freedom in how to achieve it.

It balances clear direction with developer empowerment, focusing on measurable, meaningful outcomes instead of simplistic quotas that are narrowly limited to the use of one universal tool.

Such a metric would allow developers to decide to write code manually when it makes more sense to and to save time at the code review or QA/testing phase by deploying AI tools there instead.

Should we track AI-generated code percentages at all?

Ultimately, "percentage of AI-generated code" as a standalone metric has limited value. It’s too simplistic, incentivizes the wrong behaviors, and risks causing developer frustration.

Instead, engineering leaders and developers should focus on metrics tied explicitly to productivity, code quality, and developer satisfaction. These nuanced, outcome-oriented metrics provide true insight into AI’s impact far beyond what a simplistic “percentage of your codebase” metric could ever convey.

Want to try an AI tool that will help you ship better code faster? Start a 14-day CodeRabbit trial today!

]]>

Wed, 02 Jul 2025 07:00:00 GMT

If you're using or have tried out Cursor or another AI coding tool (who hasn't?), chances are you've also seen the downsides of it. Maybe you’re having to repeat how to structure the project when creating new files. Or maybe you're using shadcn/ui for UI components. Or maybe when you want to run a build that requires another command to build a backend dependency first.

Cursor, Windsurf, Copilot, and other AI coding agents addressed this problem by adding support for reusable, scoped instructions that can be read on every ask – or when requested. Smart move!

Using Code Guidelines

CodeRabbit will automatically detect your coding rules for:

Cursor .cursorrules
GitHub Copilot .github/copilot-instructions.md
Cline /.clinerules/*
Windsurf /.windsurfrules
Claude /CLAUDE.md
And more

With Code Guidelines, these rules are used as context on every PR review.

We also support adding your own rules by specifying a custom file pattern. Got your guidelines in docs/STANDARDS.md or team/code-style.txt? Just add the pattern and CodeRabbit will pick it up.

Why it matters

This has huge implications. For one, that means CodeRabbit will follow the same rules you set with your AI coding agent. No more context switching. No more explaining the same standards twice.

It also gives you a way to specify exactly how CodeRabbit should do a PR review and what best practices you want the CodeRabbit review agent to follow. Now, you can just add these guidelines to your rules (be it from Cursor, GitHub, etc.) and:

Enforce style and formatting consistency - If your .cursorrules say "use camelCase for functions, PascalCase for components," CodeRabbit will flag violations during review.
Maintain architecture patterns - Define your preferred file structure, module boundaries, or dependency rules once and CodeRabbit ensures every PR follows them.
Apply team-specific standards - Whether it's "always use early returns" or "prefer composition over inheritance," your unique team preferences become part of every review.

What's next?

CodeRabbit’s not-so-secret magic to PR reviews is the level of context engineering it does. Code Guidelines make CodeRabbit’s reviews even better by adding your coding rules. As for what’s next, I'll give you a hint: context is king. Stay tuned.

If you haven't tried CodeRabbit yet, there's never been a better time. Get started for free and start cutting your code review time (and bugs) in half.

]]>

Thu, 26 Jun 2025 08:15:43 GMT

Introducing the AI dev tool tech stackの意訳です。

4月、MicrosoftとGoogleが、自社で生成されるコードの30%がAIによるものであると発表しました。これは、AIコーディングツールが新たな段階に突入したことを意味しています。これらは、大企業においてさえエンジニアリングワークフローの重要な一部になっています。

現在、開発者界隈のX（旧Twitter）は「バイブコーディング」に夢中で、多くの開発者が次のような疑問を抱いています。AIの実際の活用例とはどのようなものか？開発者たちはエージェント的なコーディング機能を使って、本番環境向けの機能をまるごと「バイブコーディング」しているのか？それとも主に補完やプロトタイピングに使っているだけなのか？

開発者が本当に知りたいのは、チーム、企業、業界全体でのAI活用の成功例です。チームが実際に使っているAIツールは何か？どうすれば本当の価値を引き出せるのか？企業はAI利用にどんなルールを設けているのか？AIコーディングツールは本当に生産性を向上させているのか？それともバグを増やしながらコーディング速度だけが上がっているのか？

CodeRabbitでは毎月数百のエンジニアリングチームとAIの活用方法について話をしています。それによって、AI導入のトレンドをいち早くキャッチできており、ここ数ヶ月で開発チームの思考に顕著な共通点が見られるようになってきました。

ここからは、私たちが顧客から聞いていることと、それがなぜ「2025年はAI開発ツールスタックの年」と言えるのかを紹介します。

誰もがAIの課題を抱えている

私たちが話すチームの多くが、AIコーディングツールの一番の課題として挙げるのは「生産性や開発者体験（DevEx）の向上が一貫しないこと」です。調査によれば、AIコーディングツールはコードに最大41%多くのバグを生む可能性があるとされており、新たな課題も生んでいます。

数週間前、Cursorのデザインチーム責任者であるRyo Luが、Cursorでコードを書く際の注意点についてスレッドを投稿しました。そこでは、「AIスパゲッティコード」を避けるための12のステップが紹介されています。

ホビープロジェクトやシニア開発者ばかりのチームであれば、スパゲッティコードを排除できるかもしれません。しかし、規制の厳しい上場企業で、ジュニア開発者がレガシーコードベースにAIを使って変更を加えたら…想像するだけで恐ろしい話です。

さらに、バグ以外にもAIツールは他の開発工程でもボトルネックを生んでいます。

コードを書く量が増えれば、それだけコード（レビューすべき・テストすべき・ドキュメント化すべき・リファクタリングすべき）も増えます。結果として、「画期的な」AIの生産性向上効果は、開発プロセスの他の手動工程で滞ってしまうのです。しかも、AI生成コードは問題を抱えやすいため、それらの工程にかかる時間も増大します。

新しいコーディングには、新しい技術スタックが必要

多くの開発者が今年、大きな気付きに至りました。それは「変革的なテクノロジーを導入するなら、他の開発工程も変える必要がある」ということです。つまり、エンドツーエンドのAI開発ツールスタックが必要だということです。

破壊的技術はしばしばエコシステム全体に変化をもたらします。GitHubが2008年に登場したとき、3年後にはCircle CIやJenkinsといった関連ツールも登場しました。AIコーディングツールも、それより早いペースで同様の変化を起こしつつあります。

数年間AIツールを使ってきた開発リーダーたちは、AIツールが役立つときもあれば、逆効果になることもあると認識しています。真の生産性向上を得るには、AIが生む新たな課題に対処するための補完ツールが必要です。

また、AI導入をスタックとして捉えることで、コード生成だけでなく、その他の手作業（レビュー、テストなど）にもAIを使って生産性を高められます。誰もコードレビューやテストを書くのが好きなわけではないのですから、AIに任せた方が効率的です。

実際、開発工程の他フェーズでのAI活用は、コード生成以上のROIを生む場合もあります。なぜなら、それらのツールはバグを除去する方向に働くからです。

AI開発ツールスタックには何が含まれるのか？

私たちが顧客から見ているAI開発ツールスタックは、ソフトウェア開発ライフサイクルのあらゆる段階をサポートするAIツール群から構成される多層的なスタックです。

ここではそのスタックの各層と、その結びつき、そしておそらく今年末までに（すでにそうでないなら）ほとんどの開発者が使うことになる理由について紹介します。

基盤レイヤー： AIコーディングツール
必須レイヤー： AIコードレビューツール
任意レイヤー： AIテストツール
任意レイヤー： AIリファクタリングツール
任意レイヤー： AIドキュメント生成ツール

基盤レイヤー：AIコーディングツール

多くのチームはまずここからスタートします。これらのツールは、現在書いているコードの補完や自然言語プロンプトから関数・テスト・コンポーネント全体を生成することで、開発者がより高速にコードを書くことを支援します。最近ではコードベースの理解力が増し、コード品質への配慮も強化され、エージェント的なマルチステップタスクへの対応にも注力されています。しかし、これらのツールは依然としてバグや脆弱性、パフォーマンス低下の要因になりやすく、多くのコード修正やレビュー作業が必要になります。

近年よく聞かれるのが次の2点です。

開発者は1つのツールだけでなく、目的に応じて複数のツールを使い分けている（この投稿がそれを皮肉っています）。
使用するツールに対するこだわりが強くなり、AIアシスタントの選択がPC派かMac派かのように分かれるようになった。

その結果、多くのチームが特定のツールに一律でライセンスを付与するのではなく、開発者自身が使いたいAIアシスタントを選べるようにしています。その方が使いこなされやすく、企業にとっても利益があります。

これらのツールは以下の5カテゴリに分類できます（一部ツールは複数カテゴリに該当）：

補完系ツール： GitHub Copilot, Cursor Tab, Windsurf, TabNine, Sourcegraph Cody, Qodo, Jetbrains
AIコーディングアシスタント： GitHub Copilot, Cursor, Windsurf, Claude Code, OpenAI Codex CLI, Zed, Cody (Sourcegraph), Aider, Qodo, Cline, Roocode, Blackbox, OpenHands, Gemini Code Assist, Augment Code, Amazon Q, JetBrains AI Assistant
エージェント型コーディング： Cursor, Windsurf, GitHub Copilot, Claude Code, OpenAI Codex, Cline, Roocode, Blackbox AI, Continue, Devin, Jules, Augment Code, OpenHands
AIアプリ生成ツール： Lovable, v0, Bolt, Builder.io, Figma Make, Fine.dev, Stitch
コードベースコンテキストツール： Repomix, Repo Prompt, Context7

必須レイヤー：AIコードレビューツール

AIコードレビューツールはスタックの中心に位置します。なぜなら、AIによって急増したコードを手動でレビューし続けるのは現実的ではないからです。

レビュー負荷が増せばチームのバーンアウト（燃え尽き症候群）リスクも高まり、品質劣化にも繋がります。研究によれば、開発者が集中してレビューできるのは最大で約400行まで。それを超えると見逃しやミスが増え、結果的に本番環境でのバグ対応が必要になります。

AIコードレビューツールは、PRマージの高速化（最大4倍）やレビュー時間の短縮（最大50%）に役立ち、AIによって増加したバグの混入を防ぐためにも不可欠です。AI生成コードのバグ増加率が最大41%とも言われている中、この層の活用はAI活用による生産性を損なわずに済む重要な鍵となります。

さらに、コード品質の向上、レビュー疲労の軽減、チーム横断でのベストプラクティスの標準化にも寄与します。出力結果も、開発者のプロンプトスキルに左右される生成系AIツールとは異なり、一貫性があります。

加えて、AIは「繰り返し」「面倒」な作業の自動化に強みを持ちます。PRに何十個もコメントを手で付ける代わりに、AIに任せてクリックで修正案を適用し、見逃したバグまで検出してもらえるのです。

主なツール分類：

AIコーディングツール内機能： Cursor, GitHub Copilot, JetBrains, Windsurf Forge（非推奨）
GitベースのAIコードレビュー： CodeRabbit, Bito, Greptile, Qodo, Graphite Diamond
IDEとGitの両対応ツール： CodeRabbit, SonarQube, Qodo, Sourcery

任意レイヤー：AI QAテスト生成・実行ツール

多くの開発チームにとって、QAテストはすでに何らかの形でAIが関与している領域です。しかし、最近登場しているAI搭載QAツールは、さらに多くの面倒な作業を自動化できることを約束しています。特に、ユーザーの行動をシミュレートするようなエンドツーエンドテストの生成や保守といった、時間がかかる作業に対して有効です。

すべてのテストシナリオを手作業で考える代わりに、自然言語の説明からAIがテストケースやスクリプトを自動生成してくれます。

主な利点は、やはりスピードです。AIはテストスイートの生成・実行を数分でこなし、数十のシナリオを一度に生成できます。また、人的ミスで見落とされがちなパターンも網羅的にチェック可能です。さらに、一部ツールではUIやデータの変化に応じて自動でテストを更新してくれる「自己修復」機能もあり、保守コストを抑えながらテストスイートを常に最新状態に保てます。

分類は以下の通りです：

AIテスト生成ツール： Testim, Mabl, Functionalize, testRigor, Autify, ACCELQ, Qodex, Tricentis
AIテスト実行・保守ツール： MuukTest, Applitools, Sauce Labs, Perfecto, Meticulous

任意レイヤー：AIリファクタリングツール

一部のAIコーディングツールはリファクタリングも可能と謳っていますが、実際には品質が不十分なことも多くあります。そのため、多くの企業はリファクタリングに特化したAIツールを導入しています。これらは、開発ツールスタックの一部としてAIコーディングツールとは別に活用されます。

AIリファクタリングツールは、煩雑かつ繰り返しの多い改善作業を自動化することで、コードベースの最適化（軽微な修正から大規模なアーキテクチャ変更まで）を支援します。手作業で非効率な箇所を探したり、同じ変更を複数箇所に反映させる必要はありません。自然言語で目的を伝えるだけで、AIが修正点を見つけ、必要に応じて実行します。

分類は以下の通りです：

半自動ツール： CodeGPT, GitHub Copilot, Amazon CodeWhisperer, Sourcegraph Cody
完全自動ツール： Claude Code, Devin, OpenAI Codex

任意レイヤー：AIドキュメント生成ツール

AIを導入する際、最初に注目されることは少ないものの、いざ使い始めると非常にありがたみを感じるのが「ドキュメント作成」です。インラインコメントやdocstringなど、最も退屈で後回しにされがちな作業のひとつです。

このようなAIツールを使えば、新しい関数を1つ1つ手作業で記述したり、古いガイドを見直したりする必要がなくなります。コードをもとにAIが可読性の高い最新のドキュメントを自動で生成し、膨大な時間の節約に繋がります。

主なツール：

コードレベルのドキュメントツール： DeepWiki, Cursor, CodeRabbit, Swimm, GitLoop, GitSummarize

サンプルスタック

では、実際のAI開発ツールスタックはどのような構成になっているのでしょうか？企業によって構成は様々ですが、以下に代表的なパターンを紹介します。

「包括的」スタック

私たちが接している企業の中には、AIコーディングツール、コードレビューツール、QAツール、リファクタリングツール、ドキュメントツールのすべてを含むエンドツーエンドのAI開発ツールスタックを導入している、あるいは導入中の企業が増えています。

これらの企業は、C-suiteやエンジニアリング部門が中心となってAI導入を積極的に推進している場合が多く、AIコーディングツールの初期導入にも前向きで、その効果を実感していたことが背景にあります。さらに高い生産性と開発者体験（DevEx）を追求し、他レイヤーのツール導入に取り組んでいます。

「選べるAIツール」スタック

開発サイクル全体にAIツールを導入し、かつチームメンバーにツール選択の自由を与える企業も増えています。これらの企業は、ツールによって得意分野が異なること、そして開発者自身が使いやすいと感じるツールの方が成果につながることを理解しています。

このアプローチは、AI導入率を高めるだけでなく、開発者の満足度や体験の向上にも繋がっています。AIコーディングツール（CursorかCopilotかClaude Codeか？）だけでなく、他のツールにも選択肢を用意している企業もあります。

「複数コーディングツール」スタック

「ツールの選択肢を与える」だけでなく、「複数ツールの併用を許容」している企業も存在します。たとえば、LovableでUIプロトタイプを作成し、Cursorでアプリ本体を実装したり、TabNineで補完を使いながら、ChatGPTでコード生成を行ったりします。

生産性向上の理由が合理的であれば、複数のツール利用を許可する企業が増えています。

「部分導入」スタック

すべてのツールを一度に導入している企業ばかりではありません。多くの企業が部分的な導入から始めています。

典型的には、AIコーディングツールとAIコードレビューツールを導入し、さらに1つの補助ツール（AIリファクタリング、QA、ドキュメントツールのいずれか）を追加するという構成です。どのツールを追加するかは、コードベースの性質や社内のスキル、ニーズによって異なります。たとえば、大企業は社内にQAチームを抱えていることが多いためAI QAツールを採用しやすく、一方で小規模企業はQAを外部に委託するケースが多くなります。

「必須」スタック

最後に、AIコーディングツールとAIコードレビューツールのみで構成される「ミニマルな」スタックも多くの企業で採用されています。AIコーディングによってバグが増えたりレビューが煩雑になる中で、それを緩和するための最低限の構成です。

特にコードレビューツールは、時間短縮・品質維持の両面で高いROIを誇り、多くの企業が導入しています。

独自のAI開発ツールスタックを構築するには：考慮すべき点

AI開発ツールスタックの構築に関しては、さまざまなアプローチがあります。
多くの企業はまずAIコーディングツールを導入し、そこから発生する課題に応じて、必要に迫られて個別のツールを後追いで導入するという流れです。

一方で、CTOや技術リーダーが中心となり、開発サイクル全体を見渡した上で意図的にツールを選定・検証し、PoC（概念実証）を行ってから導入する企業もあります。中には、AIコーディングツールの導入をあえて後回しにして、まずはコードレビューツールで技術的負債の処理から始めた企業もあります。

私たちは能動的な構築アプローチを推奨しています。なぜなら、多くのチームが問題が深刻化してから対処を始め、納期遅延や開発者の燃え尽きに苦しんでいるケースを見てきたからです。

特定のツール導入に関する事例や詳細については、こちらの別記事をご覧ください。各カテゴリごとの具体的な活用例や、チームへの効果について紹介しています。

あなたのAI開発ツールスタックの構築状況や導入して良かったツールについても、ぜひ教えてください。
X（Twitter）や LinkedIn でタグを付けてシェアしてください！

当社のAIコードレビューツールを試してみたい方はこちら → 14日間の無料トライアル！

]]>

Thu, 26 Jun 2025 07:58:15 GMT

AI adoption: How developers are using AI dev toolsの意訳です。

前回の投稿では、2025年がAI技術スタックの年である理由、スタックの階層構造、そして実際にチームが使用しているサンプルスタックについて説明しました。

今回は、チームが実際にこれらのツールをどのように活用しているかをより深く掘り下げます。スタックの各階層内における異なるツールタイプ、そして開発者がこれらのAIツールを使って開発プロセスを高速化したり、AIコーディングツール導入時のよくある課題を解決したりしている方法を見ていきます。

AIアシスタントのプロンプトを改善するのに役立つコードベースコンテキストツールから、エージェント機能を持つAIコードレビューツールまで、各階層は現実世界の課題（従来から存在するものと、AIコーディングツール特有のもの）に対するユニークなソリューションを提供します。

実際の例を通して、各段階でのAI統合方法を紹介していきます。

AI開発ツール技術スタック

前回の投稿で言及したように、チームは次第にAI開発ツールスタックを構築しています。これは、ソフトウェア開発ライフサイクルの各段階をサポートするよう設計されたAI駆動ツールの階層セットです。

以下は、スタックの階層、それらの接続方法、そして近いうちにこれらのツールのほとんどを使い始める理由についての概要です。

基盤層： AIコーディングアシスタント
必須層: AIコードレビューツール
オプション層： AI QAテストツール
オプション層： AIリファクタリングツール
オプション層： AIドキュメンテーションツール

基盤層：AIコーディングアシスタント

AIコーディングアシスタントは、AIツールを導入するほとんどのチームにとっての基盤です。以前に、これらのツールが幅広い機能にわたる自動補完の提案によるコーディング加速化から、シンプルなプロンプトから関数やコンポーネント全体を生成することを話しました。

開発者は複数のコーディングアシスタントを使用する機会が増えており、異なるメリットや個人的な好みに応じて、異なるツールを選択しています。これにより、全体的な生産性と満足度が向上します。この良い例が、ChargeLab が開発者とチームにAIツールを選択させることで40%の生産性向上を達成した方法です。

これらのツールを5つのカテゴリに分けています。ただし、多くのツールは複数のカテゴリにまたがっています。

タブ補完ツール：より賢い自動補完

ツール: GitHub Copilot、Cursor Tab、Windsurf、TabNine、Sourcegraph Cody、Qodo、Jetbrains

これらのツールはアプリ全体を構築しません。代わりに、IDE内での反復的なコーディングタスクに対して文脈に応じたコード提案を提供することで、より速くコードを書き、認知的労力を節約するのに役立ちます。
AIコーディングツールに関する話題が現在エージェント機能に集中している一方で、タブ補完は私たちが話している企業で、最も一般的に使用されているAI機能のままです。AIコーディングツール使用の約90%が自動補完ツールであると推定しています。これは、独立して作業するよりも開発者を補完することに焦点を当てているため、大幅な編集を必要とするバグを導入する可能性が低いためでしょう。
一部の開発者は、他のタイプのAIアシスタントよりもタブ補完ツールを好みます。これは、自分でコードを書くよりも時間を節約しながら、書くコードをより制御できるためです。その使用は通常、クラスやインターフェイス名などのシンプルなものの自動化に焦点を当てる傾向があります。そのため、どの開発者でも簡単に使用できます。
タブ補完ツールは次第により文脈を理解し、予測的になってきており、現在作業中のコードだけでなく、コードベースの文脈を理解するようになっています。

AIコーディングアシスタント：文脈認識型、多目的ツール

ツール: GitHub Copilot、Cursor、Windsurf、Claude Code、OpenAI Codex CLI、Zed、Cody by Sourcegraph、Aider、Qodo、Cline、Roocode、Blackbox、OpenHands、Gemini Code Assist、Augment Code、Amazon Q、JetBrains AI Assistant

AIコーディングアシスタントは、多くの場合、タブ補完、コード生成、AIチャット、エージェントコーディング機能などの多数のAIコーディングツールを提供するAIネイティブエディタの新しい種類の一部です。
AIコーディングアシスタントは、インライン説明付きのコードブロック全体を書くのが最も得意です。新機能の最初のドラフトを作成したり、単体テストを生成したり、リファクタリングを行うのに非常に効果的です。ただし、タブ補完ツールよりもコードにバグや問題を追加する可能性が高く、良い結果を得るには一般的に良いプロンプティングが必要です。Cursor によるこのツイートが証明しているように。そのため、提案の品質は開発者のプロンプティングの専門知識によって変わります。
AIアシスタントはコードを生成するだけでなく質問に答えることもできるため、コードを自分で書く際のコンテキストスイッチング（そしてStack Overflowで費やす時間）を減らすのにも役立ちます。
ほとんどのAIコーディングアシスタントはCursorやJetBrains、Windsurf、Zed、Copilot（VS Codeにある）のようにIDEベースですが、Claude Code、Aider、OpenAIのCodexを含む、CLIで動作するものもあります。
AIコーディングアシスタントはタブ補完ツールよりも文脈を理解し、時間をかけてコードベースとコーディングスタイルを学習して、提案の関連性を高めることに焦点を当てています。

エージェント型コーディングツール：次のフロンティア（ただし高価）

ツール: Cursor、Windsurf、GitHub Copilot、Claude Code、OpenAI Codex、Cline、Roocode、Blackbox AI、Continue、Devin、Jules、Augment Code、OpenHands

これらのツールはAIコーディングアシスタントと重複することが多いですが、すべてのアシスタントがエージェント機能を持つわけではなく、よりエージェント型コーディングに焦点を当てたコーディングエージェントもあるため、独自のカテゴリに分類しています。
これらのツールは、コーディングタスクへのアプローチ方法や問題の解決方法を決定するためにコードベースを分析できます。そして、自律的または半自律的エージェントが、テストの作成、コードのテスト、複数パッケージのインストール、コードの問題修正、または新しいコードの生成とリクエストに基づくPRの作成などのタスクを解決または完了するよう働きます。また、コードベースを理解してファイルを要約することもできます。
通常、コードの変更やファイルの作成を含む特定のタスクを実行する能力があります。また、開発環境に統合され、ツールと対話できます。
多くはDevinやClaude Code、OpenAI CodexのようにDevinのように直接的な監督なしに自律的にタスクを実行する能力があり、他のものはCopilotやWindsurfのように承認が必要な提案を行います。タスクによって、開発者は一方または他方のタイプのツールを好む場合があります。
エージェント型コーディングツールはまだ初期段階にありますが、急速に進歩しています。適切な手で適切なタスクに使用すれば、大きなリターンを提供できますが、または本当に創造的な新しいバグを作成することもあります。

AIアプリ生成ツール：アプリやウェブサイトを高速生成

ツール: Lovable、v0、Bolt、Builder.io、Figma Make、Fine.dev、Stitch

これらのツールは、個々のコード行を完成させたり機能を生成したりするのではなく、アプリやウェブサイト全体を迅速に生成することに焦点を当てています。フロントエンドUIデザインからバックエンドインフラストラクチャの設定まで、フルスタックアプリケーションを迅速に構築することを約束し、クラウドデータベースと統合されています。そのため、主に非開発者にアピールします。また、プロセスをさらに簡素化するため、ノーコードツールの終焉を告げる可能性があります。
そのため、開発者の間でのアプリ生成ツールの人気の高まりは、プロダクションに入るものを作成するのではなく、新しいアイデアを迅速にプロトタイピングすることに焦点を当てることが多いです。
アプリ生成ツールは、コード生成ツールよりもエージェント的で、開発プロセスのより広い範囲を処理します。これは初期の大幅な時間節約を意味する可能性がありますが、生成されたアプリを正確な仕様にカスタマイズするために、下流でより多くの監視と編集が必要になる場合があります。しかし、多くの開発者は、生成されたアプリが実際にプロダクション準備ができているかどうか疑問に思っています。
アプリ生成ツールは急速に進歩して次第に洗練されており、開発者がアプリケーションのアイデアを記述し、AIが最小限の手動介入でこれらの記述を機能的なコードベースに変換することを可能にしています。ただし、通常、既存のコードベースやアプリケーションで作業する多くの開発者にとっては、まだ限定的な用途のままです。

コードベースコンテキストツール：最新のコードベースコンテキスト

ツール： Repomix、Repo Prompt、Context7

これらのツールは、AI支援ソフトウェア開発の重要な実現要因です。大きなコードベースの関連する部分を構造化してAIモデルに配信し、AIが多くのファイルを横断して効果的に推論するために必要なコンテキストを提供します。
開発者は単純にAIアシスタントにプロンプトを送り、これらのAIツールがコードベースの最も関連する部分をキュレーションしてモデルに供給し、アシスタントが大規模または複雑なプロジェクトで盲目的に動作しないようにします。
コードベースコンテキストツールは、AIエージェントのプロンプトを圧縮および構造化して、指定されたトークン制限内で大きなコードベースの機能的理解を維持できるようにし、消費ベースの価格設定を使用するツールを使用する際のコストを削減しながら、生成されるコードの品質を向上させることもできます。

必須層：AIコードレビューツール

次は、より高速なAI支援コーディングによって生じる増加したワークロードに直接対処する重要な層であるAIコードレビューツールです。

前回の投稿では、これらのツールがチームが生産される増加するコード量をより良く管理し、手動レビューによるバーンアウトを回避するのに役立つ方法を強調しました。AI駆動のコードレビューは、チームがPRを大幅に速く統合できるようにしてプロセスを高速化するだけでなく、早期にバグを捕捉し、レビュアーの疲労を減らし、ベストプラクティスを標準化することで品質を大幅に改善します。CodeRabbitのようなものには、エージェントワークフローさえあり、単体テストの生成、マルチファイル編集、新しいPRの作成などに役立ちます。

AIコーディングツールの後、AIコードレビューツールは開発チームが最も採用する可能性が高いAIツールです。既存のコードレビューバックログに対処し、疑わしいコード品質のAI記述PRの大量流入に対処するためです。

最終的に、これらは退屈なタスクを自動化し、開発者が本当に楽しんでいる高インパクトな作業に集中できるようにします。

これらのツールは主に3つのタイプに分かれます：

AIコーディングツールの機能：AIコーディングツール自身をレビュー

ツール： Cursor、GitHub Copilot、JetBrains、Windsurf Forge（廃止）

一部のAIコーディングアシスタントは、サブスクリプションに含まれる機能やアドオンとしてコードレビューツールを提供しています。例えば、CursorのサブスクリプションにはIDEベースのコードレビューが含まれ、GitHub Copilotのサブスクリプションには CI/CDベースのレビューが含まれます。2025年4月まで、WindsurfはサービスのアドオンであるCI/CDベースのコードレビューツールであるForgeも提供していました。しかし、最近それを廃止し、コードレビューをメインのAIコーディングアシスタントの機能として再リリースしました。
開発においてコードレビューがAIの中核的使用例であるため、既存のAIコーディングアシスタントツールの一部としてAIコードレビューを含めることは一部の人にとって理にかなっています。しかし、コーディングアシスタントが生成するコードをレビューする際の効果について多くが疑問視しています。
コードレビューでは、ベストプラクティスは常に、複数の異なる目が潜在的な問題を探すという目標で、仲間やシニア開発者にレビューを任せることでした。コーディングアシスタントの機能としてAIコードレビューを持つことは、中央のセキュリティと品質プロトコルから外れます。コードに41%多くのバグを追加したAIツールが、最初からそれらがバグであることを認識していなかった場合、どうやってそれらのバグを見つけることが期待できるでしょうか？
さらに、AIコーディングアシスタントは、しばしば低レイテンシとリアルタイム応答を優先し、品質に焦点を当てたスタンドアロンのAIコードレビューツールよりも、潜在的により表面的なコードレビューにつながります。
Forgeの廃止は、機能またはアドオンとして、コードレビューがAIコーディングアシスタントによる製品開発の中核的焦点となる可能性が低いことも示唆しています。特に、スペースがより競争的になり、企業が中核提供を改善することにより多くの時間を費やすようになるにつれて。これは、スタンドアロンソリューションがより包括的で多くの機能を持ち、追加の価値を提供できることを意味する可能性があります。

Gitベースの AIコードレビューツール：チームの時間を節約するレビュー

ツール： CodeRabbit、Bito、Greptile、Qodo、Graphite Diamond

これらのツールは、プルリクエストを開いたときに自動レビューを実行します。一次パスのAIコードレビューは、シニアエンジニアが問題にコメントを追加する時間を節約するために、バグ、セキュリティ脆弱性、構文エラー、スタイル問題などを見つけます。
これらのツールは、CI/CDワークフローと既存のコードレビュープロセスに完璧に適合し、PR要約と1クリック修正を提供して、問題のレビューと修正の両方を簡単にします。
コードベース認識と強化されたコンテキストにより、これらのツールは一般的な問題を早期に捕捉し、見逃す可能性のあるバグを見つけ、コードベース全体でコード品質を向上させることができます。
CodeRabbitのような提供は、エージェント的チャットとワークフローさえ持っており、AIレビュアーと単純にチャットすることで、docstringと単体テストの生成、マルチファイル編集、PRの作成などを行うことができます。
CI/CD段階でAIコードレビューを行うことは、コードベース全体でコード品質基準を実装しながら、コードレビュープロセスを合理化するために重要です。

IDEとGitベースの両方のAIコードレビューツール：すべての段階でのレビュー

ツール： CodeRabbit、SonarQube、Qodo、Sourcery

IDEとCI/CDツールの両方でコードレビューを提供するAIコードレビューツールはほとんどありません。これらのツールは、開発サイクルの複数の段階でバグを減らすことで、最も包括的なコードレビューサポートを提供します。
IDEをCI/CDレビューと接続することは、よりシームレスなワークフローと追加の品質チェックを可能にする多層レビューも可能にします。

オプション層：AI QAテスト生成・実行ツール

前回の議論で、QAテストは従来から機械学習やAIの形式を組み込んできましたが、新しいツールはテストの最も反復的で時間のかかる側面を自動化することで、さらに一歩進んでいることについて触れました。

これらのAI駆動ツールは、シンプルな記述から広範囲で現実的なテストシナリオを生成し、テストプロセスを大幅に高速化します。速度を超えて、人間のテスターが見落とすかもしれない多数の順列を考慮することで、テストカバレッジも向上させます。さらに、これらのツールの一部は、アプリのUIや基盤データが変更されたときにテストを自動更新する「自己修復」機能を提供します。

これらを2つのカテゴリに分けます：

AIテスト生成ツール：テスト生成のみ

ツール: Testim、Mabl、Functionalize、testRigor、Autify、ACCELQ、Qodex、Tricentis

AIテスト生成ツールはテストを実行または管理しません。代わりに、自然言語の記述に基づいて、または既存のコードパスを分析することで、テストケース、スクリプト、またはシナリオの作成を自動化します。
主な魅力は、各個別のテストケースを手動で定義する退屈で反復的な作業を減らし、QAエンジニアが堅牢なテストスイートを迅速に構築するのを支援することです。
開発者とQAチームは、初期テスト作成を高速化し、大規模、複雑、またはレガシーアプリケーションのテストカバレッジを拡張する際に特に有用であるため、これらのツールを評価しています。
ボリュームを迅速に生成するのに優れている一方で、これらのツールは通常、精度とカバレッジを確保するために手動での微調整とレビューが必要です。
次第に、これらのツールは、既存のコードとユーザージャーニーをより知的に解析できるより深いコンテキスト認識を活用し、現実世界の使用例と密接に一致するテストケースを提案できるようになっています。

AIテスト実行・保守ツール：エンドツーエンドのAIテストサポート

ツール: MuukTest、Applietools、Sauce Labs、Perfecto、Meticulous

フルライフサイクルのAI QAツールは、テスト生成を超えて、テストプロセス全体を処理します。テストケースの作成から自動実行、さらにはアプリケーションの進化に応じた保守まで。
チームは、初期のテスト作成だけでなく、コードベースやUIの変更時に必要な継続的な保守を自動化することで、QAワークロードを劇的に削減するため、これらの包括的なツールを好むことが多いです。
これらのツールは保守負担を大幅に軽減しますが、複雑なシナリオでは時々苦労することがあります。
次第に洗練されたこれらのツールは、既存のCI/CDワークフローにシームレスに統合し、複数の環境とデプロイメント段階にわたって継続的で自動化されたテストカバレッジを提供します。

オプション層：AIリファクタリングツール

もう一つの重要な領域はAIリファクタリングツールです。一般的なAIコーディングアシスタントはリファクタリング機能を主張するかもしれませんが、その成果はしばしば不十分です。これにより、多くのチームがコードベースの最適化と改善のために明示的に設計された専門のAIリファクタリングツールを採用するようになりました。

これらの専用ツールは、単純に自然言語の指示に基づいてリファクタリングの機会を迅速に特定し、実行することで退屈なタスクを自動化し、手動作業を大幅に削減し、コードの保守性を向上させます。

これらのツールを2つのタイプに分けます：

半自動ツール：リファクタリングツールのタブ補完

ツール： CodeGPT、GitHub Copilot、Amazon CodeWhisperer、Sourcegraph Cody

半自動リファクタリングツールは完全に主導権を握りません。代わりに、IDE内でコード改善を積極的に提案し、各提案を迅速に受け入れ、拒否、または変更できるようにします。
これらのツールは、メソッドの簡素化、ループの再構築、関数ロジックの最適化など、コミットする前に人間の目から恩恵を受ける小規模で段階的なリファクタリングに焦点を当てています。
開発者は、細かい制御を提供するため、複雑または機密なリファクタリングタスクで半自動ツールを好みます。
魅力はそのバランスにあります。ルーチンリファクタリングを高速化しながら、開発者の判断の余地を残し、望ましくない変更や微妙なバグがプロダクションに忍び込むリスクを最小限に抑えます。
次第に、半自動リファクタリングツールは、より広いコードベースを分析してより賢く、より関連性の高い提案を提供するより深いコンテキスト認識を活用しています。

完全自動ツール：AIにより多くの自律性を与えたい場合

ツール: Claude Code、Devin、OpenAI Codex

これらのAIツールは、しばしば単一の指示やルールセットから、コードベース全体にわたって大規模で反復的なリファクタリングタスクを自動的に処理します。
半自動ツールが開発者が手動でレビューするリファクタリングの機会を強調する一方で、完全自動ツールは、依存関係のアップグレード、フレームワークの移行、またはコードスタイルの標準化など、バルクタスクで優れており、反復作業の何時間も節約する可能性があります。
これらのツールは、何百または何千のファイルにわたって同じリファクタリングを手動で適用するのに数日を費やすことなく、大規模に技術的負債に取り組むことを探しているチームにアピールします。
開発者は、明確に定義された反復的なリファクタリングに対する信頼性と一貫性を評価しますが、完全自動ツールは一般的に明示的なルールを与えられたときに最もよく機能します。人間の判断を必要とする微妙なコード改善にはあまり適していません。
次第に、完全自動リファクタリングツールは複数のプログラミング言語を解析し、既存のCI/CDパイプラインに直接統合できるようになっています。

オプション層：AIドキュメンテーションツール

最後に、AIドキュメンテーションツールは、AIを採用する際の最初の考えではないことが多いですが、非常に価値があることが証明されています。

以前に述べたように、これらのツールは、インラインコメントやdocstringなどのコードドキュメンテーションを書いて更新するという、しばしば恐れられるタスクに取り組みます。AIを活用することで、開発者はコードベースから直接、明確で正確かつ最新のドキュメンテーションを迅速に生成でき、手動でドキュメンテーションを維持するのに費やしていた大幅な時間と労力を節約できます。

コードレベルのドキュメンテーションツール：

ツール: DeepWiki、Cursor、CodeRabbit、Swimm、GitLoop、GitSummarize

AIドキュメンテーションツールは、コード構造と動作を分析して、読みやすくコンテキストを理解したドキュメンテーションドラフトを自動的に生成します。インラインコメント、docstring、APIリファレンス、さらには内部設計とアーキテクチャドキュメントを生成することで、ドキュメンテーション時間を半分以上削減する可能性があります。
これらのツールは、コードが進化するにつれて、すべての関数コメントやAPI記述を手動で更新することなく、ドキュメンテーションをコード変更と継続的に同期させ続けることを望むチームにアピールします。
次第に、AIドキュメンテーションツールは複数のプログラミング言語をサポートし、IDEやCI/CDパイプラインに直接統合し、開発者がコードを書く際にドキュメンテーションを積極的に促すことで、時間の経過とともにドキュメント品質を改善し、技術的負債を削減します。

独自のAI開発ツールスタックの構築

AI開発ツールスタックの採用は、新しいAIツールをいくつか混ぜるだけではありません。開発ワークフローのすべての部分に戦略的にAIを取り入れることです。

コーディング、レビュー、テスト、リファクタリング、ドキュメンテーションまで、すべてのステップでAIを戦略的に使用することで、チームがより多くのことを成し遂げ、フラストレーションを減らし、コードベースの全体的な品質を大幅に向上させることができます。

あなたがどのようにAI開発ツールスタックを構築し、何があなたにとって機能しているかについてもっと聞きたいです。TwitterやLinkedInで私たちをタグ付けしてください。

私たちのAIコードレビューツールを試すことに興味がありますか？ 14日間の無料トライアルを始めましょう！

]]>

Tue, 17 Jun 2025 01:28:26 GMT

RBAC for CodeRabbit Usersの意訳です。

皆様へ - すべてのユーザー向けにRole-Based Access Control (RBAC)が利用可能になったことをお知らせします。本機能は、組織管理者がユーザーの実行できるアクションを細かく制御できる権限セットを割り当てるものです。これらの設定は、CodeRabbitアプリのSubscriptionsメニューから確認できます。

CodeRabbitの設定と構成に関連する、それぞれ異なる権限を持つ3つの主要なロール（権限）を定義しました：

Admins: CodeRabbitのすべてを設定し、コードレビューを実行する完全なアクセス権 — レビュー設定、統合管理、ロールの割り当て、学習の編集、ダッシュボードの表示、レポート生成、サブスクリプションと請求管理。
Members: コードレビューを実行する限定的なアクセス権と、組織またはリポジトリレベルの設定、統合、学習、ダッシュボード、レポート、サブスクリプション詳細への読み取り専用権限。
Billing Admins: サブスクリプションと請求管理のみを担当するオプションのロール。このロールは設定を構成したり、コードをレビューしたりする能力はなく、課金対象外です。

ロールは各組織ごとに個別に割り当てられます。複数組織の場合、一つの組織でのロールは他の組織には適用されません。「Admin」ユーザーのみがこれらのロールを変更し、他のユーザーを「Admins」、「Members」、または「Billing Admins」として追加できます。

新しいロールはSubscriptionメニューから確認できます

ボットユーザーは自動的に「Member」ロールが割り当てられ、これは変更できません。CodeRabbitシートが割り当てられているユーザーのみが、管理者によってロール変更が可能です。

CodeRabbitロールの権限

CodeRabbitサブスクリプションの課金管理のみを担当するユーザー（新しいユーザーの追加、シート数の増加、プラン変更など）には「Billing Admin」ロールを割り当てることをお勧めします。専任の「Billing Admin」がいない場合は、組織内の他の「Admin」も同様にすべての請求とサブスクリプションタスクを実行できます。

CodeRabbitのすべての機能と設定への書き込みアクセスが必要なユーザーには「Admin」ロールを割り当てる必要があります。主にAIコードレビューの実行のみに関心があるその他のユーザーは、「Member」ロールに制限される場合があります。

以下は、3つのロールそれぞれの異なる権限セットを説明する詳細なマトリックスです。

リソース	Admin	Member	Billing Admin
組織設定	書き込み	読み取り専用	アクセス不可
リポジトリ設定	書き込み	読み取り専用	アクセス不可
統合	書き込み	読み取り専用	アクセス不可
学習	書き込み	読み取り専用	アクセス不可
ダッシュボード	書き込み	読み取り専用	アクセス不可
レポート	書き込み	読み取り専用	アクセス不可
ユーザー管理	書き込み	読み取り専用	読み取り専用
サブスクリプション管理	書き込み	読み取り専用	書き込み
請求管理	書き込み	アクセス不可	書き込み

「Admins」は「Billing Admins」と同レベルのアクセス権も持ちますが、その逆は成り立ちません。すべての「Admin」は「Billing Admin」が実行できるのと同じタスクを実行できます。「Billing Admin」のみである必要があるユーザーは、「Admin」によって手動で招待される必要があります。以下のスクリーンショットは、そのユーザーがGitプラットフォームに存在しない場合に、「Admin」がメールを使用して別の「Billing Admin」を招待する方法になります。また、「Member」ロールのユーザーについては、ダッシュボードのメトリクスについて、そのGitプラットフォームで所属するチームの分のみが表示されます。

メールを使用してBilling Adminsを招待している例

Billing Adminsとして追加され、Gitプラットフォームに存在しないユーザーは、Gitプラットフォームの認証情報の代わりにメールでログインオプションを使用してログインする必要があります。

GitプラットフォームからCodeRabbitへのロールマッピング

GitのOrganizationに存在するすべてのユーザーに対して、いくつかのロールがデフォルトで割り当てられます。これらは「users」メニューで確認できます。デフォルトのロールは、そのユーザーがGitプラットフォーム組織で持つ権限にマッピングされ、CodeRabbitによって自動的に継承されます。以下のマッピングルールに基づくCodeRabbitのデフォルト割り当てを変更したい場合は、ユーザーに手動でロールを割り当てる必要があります。

GitHub	Gitlab	Azure DevOps	Bitbucket	CodeRabbitロールへのデフォルトマッピング
Admin / Billing Manager	Owner	Admin	Owner	Admin
Member	Maintainer		Member	Member
	Developer			Member
	Reporter			Member
	Planner			Member
	Guest			Member
	Minimal Access			Member
手動追加	手動追加	手動追加	手動追加	Billing Admin

Azure DevOpsは「Admin」ユーザーのみを報告することに注意してください。ユーザーがAzure DevOps組織に存在し、「Admin」でない場合、デフォルトで「Member」ロールを割り当てます。

まとめ

RBACのTL;DR：

CodeRabbitユーザーに3つの異なる権限を割り当てられるようになりました：
- Admins - すべてを設定する書き込みアクセスでコードレビューを実行
- Member - 様々な設定への読み取り専用アクセスでコードレビューを実行
- Billing Admins - 専任ユーザーが請求とサブスクリプションを管理する必要がある場合のみの特別なロール
新規および既存ユーザーのCodeRabbitロールは、Gitプラットフォームの同等のロールに自動的にマッピングされます。CodeRabbitの「Admins」のみがこれらの権限を変更できます。
すべての権限は特定の組織にマッピングされます。複数の組織のユーザーは、各組織で異なる権限を持つことができます。
CodeRabbitトライアルを開始するのは、Gitプラットフォームで「Admin」同等の権限を持つユーザーでなければなりません。

質問やフィードバックがある場合は、私たちのコミュニティDiscordサーバー（無料ユーザー向け）までお問い合わせください。CodeRabbitの有料顧客およびアクティブな無料トライアル期間中の方は、このサポートページから技術チームに連絡して、より迅速な応答を得ることができます。お問い合わせの際は組織名を提供してください。

今後の予定

私たちはユーザーの声を聞き続け、そのフィードバックを取り入れています。以下の機能が近中期的なロードマップにあります：

セルフホスト顧客へのRBACの拡張。RBACリリースのv1はSaaS顧客のみに限定されています
「Member」レベルのユーザーがCodeRabbitトライアルを開始する機能
管理者がカスタム権限セットを選択して新しいロールを作成できるカスタムロール定義
CodeRabbitで設定されたすべての組織での一貫したロール可用性
SSO統合（SAML / OIDC）

次のステップ：CodeRabbitにログインし、Subscriptionsメニューに移動して、組織内のユーザーのCodeRabbitロールを確認または変更してください。詳細についてはドキュメントも参照できます。

]]>

Mon, 16 Jun 2025 18:07:16 GMT

Hey folks - we’re excited to share that Role-Based Access Control (RBAC) is now available for all CodeRabbit customers. This gives your Org Admins the ability to assign granular permission sets that control the actions that users can take. You can find these settings under the Subscriptions menu in the CodeRabbit app.

We have defined three main roles, each with different permissions as they pertain to CodeRabbit settings and configurations:

Admins: Full access with the ability to run code reviews and configure everything in CodeRabbit — review settings, manage integrations, assign roles, edit learnings, view dashboards, generate reports, subscription and billing management.
Members: Limited access with the ability to run code reviews, with read-only permissions to access org or repo level settings, integrations, learnings, dashboards, reports, and subscription details.
Billing Admins: optional role that is only responsible for subscription and billing management. This role has no ability to configure settings or have code reviewed, and it is not a paid seat.

The roles are assigned separately for each Org. If you have multiple Orgs, then roles in one Org do not apply to other Orgs. Only “Admin” users can change these roles and add other users as “Admins”, “Members” or “Billing Admins.”

New roles can be found under Subscription menu

Note that bot users are automatically assigned a “Member” role and this cannot be changed. Only users that have a CodeRabbit seat assigned to them can have their role changed by an admin.

CodeRabbit role permissions

We recommend assigning the “Billing Admin” role to users who will only be responsible for managing the financial aspects of your CodeRabbit subscription, such as adding new users, increasing the number of seats, changing plans, etc. If you do not have a dedicated person that will act as a “Billing Admin” then any other “Admin” in your Org can also perform all billing and subscription tasks.

You’ll need to assign the “Admin” role to users who must have write access to every feature and config setting in CodeRabbit. Other users who are primarily concerned with running AI code reviews only may be limited to the “Member” role.

Here is a detailed matrix that explains the different permission sets for each of the three roles.

Resource	Admin	Member	Billing Admin
Org Settings	Write	Read-only	No access
Repo Settings	Write	Read-only	No access
Integrations	Write	Read-only	No access
Learnings	Write	Read-only	No access
Dashboards	Write	Read-only	No access
Reports	Write	Read-only	No access
User Management	Write	Read-only	Read-only
Subscription Management	Write	Read-only	Write
Billing Management	Write	No access	Write

Note that “Admins” also have the same level access that “Billing Admins” do but the reverse is not true. Every “Admin” can perform the same tasks that a “Billing Admin” can. Any user that must only be a “Billing Admin” needs to be invited manually by an “Admin.” The screenshot below shows how an “Admin” can invite another “Billing Admin” using their email, if that user does not exist in your Git platform. Also, for users with “Member” role, the metrics in the dashboards will only be visible for the Team that they are a part of in their Git platform.

Invite Billing Admins using their email

Users that are added as Billing Admins, and those that do not exist in your Git platforms, must login using the Login with Email option instead of the Git platform credentials.

Role mapping from Git platform to CodeRabbit

Some roles are assigned by default for all users that exist in your Git organization. You can review these under the “users” menu. The default roles are mapped to the permissions that user has in your Git platform organization and are automatically inherited by CodeRabbit. You will have to manually assign roles to users if you want to change CodeRabbit’s default assignment that is based on the mapping rules below.

Github	Gitlab	Azure DevOps	Bitbucket	Default Mapping to CodeRabbit Role
Admin / Billing Manager	Owner	Admin	Owner	Admin
Member	Maintainer		Member	Member
	Developer			Member
	Reporter			Member
	Planner			Member
	Guest			Member
	Minimal Access			Member
Added Manually	Added Manually	Added Manually	Added Manually	Billing Admin

Note that Azure DevOps only reports “Admin” users. If a user exists in Azure DevOps organization and is not an “Admin” then we assign the “Member” role to them by default.

TL;DR

The TL;DR for the RBAC roll-out:

You can now assign three different roles to CodeRabbit users:
- Admins - run code reviews with write access to configure everything
- Member - run code reviews with read-only access for various configs
- Billing Admins - special role, only if a dedicated user must be the one to manage billing and subscription
CodeRabbit roles for new and existing users are automatically mapped to equivalent roles in your Git platforms. Only CodeRabbit “Admins” can change these roles.
All roles are mapped to a specific Org. Users in multiple orgs can have different roles in each Org.
Users with “Admin” equivalent roles in their Git platform must be the ones to initiate a CodeRabbit trial.

Have questions or feedback? Reach out to our team via our community Discord server (for free users). Paying CodeRabbit customers and those in an active free trial period, can reach out via this support page to reach our technical team for a faster response. Please provide your Org name when you reach out.

What’s next?

We continue to listen to our customers and incorporate their feedback. The following features are on our near to medium term roadmap:

Expanding RBAC to our self-hosted customers. v1 of RBAC release is limited to SaaS customers only
Ability for “Member” level users to start a CodeRabbit trial
Custom role definitions where admins can pick and choose a custom set of permissions and create new roles
Consistent role availability across all organizations configured with CodeRabbit
SSO integration (SAML / OIDC)

Next steps for you: Login to CodeRabbit, navigate to Subscriptions menu and review or change the CodeRabbit roles for users in your organization. You can also refer the documentation for more details.

]]>

Tue, 10 Jun 2025 00:49:14 GMT

In our last post, we covered why we think 2025 is the year of the AI tech stack, the layers in the stack, and even shared some sample stacks we’ve been seeing teams using.

Here, we'll dive deeper into how teams are actually putting these tools to work. We’ll look at stack’s layers, the different types of tools in each, and how developers are using these AI tools to speed up their development process or tackle common pain points from adopting AI coding tools.

From codebase context tools that help you prompt AI assistants better to AI code review tools with agentic actions, each layer brings unique solutions to real-world headaches — both those that have always existed and the ones that are specific to AI coding tools.

We’ll walk through practical examples and share how we’re seeing teams integrating AI at every step.

The AI dev tool tech stack

As we mentioned in our previous post, teams are increasingly building out AI dev tool stacks—layered sets of AI-powered tools designed to support each stage of the software development lifecycle.

Here's a quick overview of the stack's layers, how they connect, and why you'll likely start using most of these tools soon.

Foundational: AI coding assistants
Essential layer: AI code review tools
Optional layer: AI QA test tools
Optional layer: AI refactoring tools
Optional layer: AI documentation tools

Foundational: AI coding assistants

AI coding assistants are the foundation for most teams adopting AI tools. Previously, we talked about how these tools span a wide variety of functions from accelerating coding by suggesting autocompletes to even generating entire functions and components from simple prompts.

Increasingly, developers use multiple coding assistants, choosing different tools for different strengths and personal preferences – which helps boost overall productivity and satisfaction. A great example of this is how ChargeLab was able to improve productivity by 40% by allowing their developers and teams to choose which AI tools to adopt.

We break these tools into five categories – though many tools span multiple categories.

Tab completion tools: Autocomplete but smarter

Tools: GitHub Copilot, Cursor Tab, Windsurf, TabNine, Sourcegraph Cody, Qodo, Jetbrains

These tools don’t try to build your entire app. Instead, they help you write code faster and save cognitive effort by providing contextual code suggestions for repetitive coding tasks inside your IDE.
While the buzz around AI coding tools is currently focused on agentic capabilities, tab completion remains the most commonly used AI functionality in companies we’re talking with. We estimate that around 90% of AI coding tool use has so far been with autocomplete tools. That’s likely because they’re focused on complementing the developer over doing work independently so are less likely to introduce or require significant editing.
Some developers prefer tab completion tools over other types of AI assistants since they give them more control over the code they write while still offering time savings over writing it themselves. Their use also tends to be focused on automating simple things like classes and interface names. For that reason, they’re easy for any dev to use.
Increasingly, tab completion tools are more context aware and predictive – understanding the context of your codebase and not just the code you’re currently working on.

AI coding assistants: Context-aware, multi-purpose tools

Tools: GitHub Copilot, Cursor, , Windsurf, Claude Code, OpenAI Codex CLI, Traycer Zed, Cody by Sourcegraph, Aider, Qodo, Cline, Roocode, Blackbox, OpenHands, Gemini Code Assist, Augment Code, Amazon Q, JetBrains AI Assistant

AI coding assistants are often part of a new breed of AI-native editors that offer a number of AI coding tools like tab completion, code generation, AI chat, and agentic coding capabilities.
AI coding assistants are best at writing entire blocks of code with inline explanations. They can be incredibly effective at bootstrapping first drafts of new features, generating unit tests, and refactoring. However, they’re more likely to add bugs and issues to your code than a tab completion tool and generally require good prompting to get good results – as this tweet by Cursor attests. For this reason, the quality of suggestions varies depending on the developer's prompting expertise.
Because AI assistants can answer questions as well as generate code, they also help reduce context switching (and time spent on Stack Overflow) when writing code yourself.
While most AI coding assistants are IDE-based like Cursor, JetBrains, Windsurf, Zed, and Copilot (which is in VS Code) – some also operate in the CLI including Claude Code, Aider, and OpenAI’s Codex.
AI coding assistants are more context-aware than tab completion tools and focused on learning your codebase and coding style over time to increase the relevance of their suggestions.

Agentic coding tools: The next frontier (but pricey)

Tools: Cursor, Windsurf, GitHub Copilot, Claude Code, OpenAI Codex, Cline, Roocode, Blackbox AI, Continue, Devin, Jules, Augment Code, OpenHands

These tools often overlap with AI coding assistants but we’ve put them in their own category since not all assistants have agentic capabilities and there are some coding agents which are more focused on agentic coding.
These tools are able to analyze your codebase to determine how best to approach coding tasks or solve problems. Then, autonomous or semi-autonomous agents work to solve those problems or complete tasks like writing tests, testing code, installing several packages, fixing issues in code, or generating new code and raising PRs based on your requests. They also can understand your codebase and summarize files.
Typically, they have the ability to execute specific tasks including modifying code and creating files. They also are integrated into your development environment and can interact with your tools.
Many have the ability to execute tasks autonomously and can do so without direct supervision like Devin, Claude Code. and OpenAI Codex while others make suggestions that you have to approve like Copilot and Windsurf. Depending on the task, developers might prefer one or the other type of tool.
Agentic coding tools are still in the early stages but are evolving fast. In the right hands and with the right tasks, they can offer major returns — or create really creative new bugs.

AI app generator tools: Generate an app or website fast

Tools: Lovable, v0, Bolt, Builder.io, Figma Make, Fine.dev, Stitch

These tools focus on quickly generating entire apps or websites rather than simply completing individual lines of code or generating features. They promise to build full-stack applications rapidly—from frontend UI design to backend infrastructure setup and are integrated with cloud databases. For that reason, they primarily appeal to non-developers. They also likely herald the end of no-code tools since they simplify the process even more.
The increasing popularity of app-generation tools among developers’ is, therefore, often focused on quickly prototyping new ideas rather than creating something that will end up in production.
App-generation tools are more agentic than code generation tools and handle a broader scope of the development process. This can mean significant initial time savings but might require more oversight and editing downstream to customize generated apps to exact specifications. But many devs question whether a generated app might actually be ready for production.
App-generation tools are rapidly evolving to become increasingly sophisticated – allowing developers to describe application ideas while AI translates these descriptions into functional codebases with minimal manual intervention. However, they remain of limited use to many developers who typically work on pre-existing codebases and applications.

Codebase context tools: Up-to-date codebase context

Tools: Repomix, Repo Prompt, Context7

These tools are crucial enablers for AI-assisted software development. They structure and deliver relevant slices of large codebases to AI models — giving the AI the context it needs to reason effectively across many files.
Developers simply prompt an AI assistant and these AI tools curate the most relevant parts of the codebase to feed into the model, ensuring the assistant isn’t flying blind in large or complex projects.
Codebase context tools can also help compress and structure prompts for AI agents to allow them to maintain a functional understanding of a large codebase within stated token limits — improving the quality of your generated code while reducing the cost when using tools with consumption-base pricing.

Essential layer: AI code review tools

Next up are AI code review tools, a critical layer because they directly tackle the increased workload created by faster AI-assisted coding.

In our previous post, we highlighted how these tools help teams better manage the growing volume of code produced, reducing burnout from manual reviews. AI-driven code reviews not only speed up the process by allowing teams to merge PRs significantly faster – but also greatly improve quality by catching bugs early, reducing reviewer fatigue, and standardizing best practices. Some, like CodeRabbit even have agentic workflows and can help with things like generating unit tests, making multi-file edits, or raising new PRs.

After AI coding tools, AI code review tools are the AI tool dev teams are most likely to adopt — both to deal with existing code review backlogs and to address the glut of AI-written PRs of questionable code quality.

Ultimately, they automate tedious tasks, freeing developers to focus on the high-impact work they genuinely enjoy.

These tools come in three main flavors:

Features of an AI coding tool: AI coding tools review themselves

Tools: Cursor, GitHub Copilot, JetBrains, Windsurf Forge (deprecated)

Some AI coding assistants offer code review tools as features included in their subscriptions or as add-ons. For example, Cursor’s subscription includes IDE-based code reviews and GitHub Copilot’s subscription includes CI/CD-based reviews. Up until April 2025, Windsurf also offered Forge, a CI/CD-based code review tool that was an add-on to their service. However, they recently deprecated it and relaunched code reviews as a feature of their main AI coding assistant.
It makes sense to some to include AI code reviews as part of existing AI coding assistant tools since code reviews are such a core use case for AI in development. However, many question how effective a coding assistant can be when reviewing the code it generates.
With code reviews, the best practice has always been to have peers or senior devs do reviews with a goal of ensuring that several different sets of eyes look for potential issues. Having AI code reviews as a feature of coding assistants deviates from the central security and quality protocol. How can you expect the AI tool that added 41% more bugs to your code to find any of those bugs if it didn’t realize they were bugs to begin with?
What’s more, AI coding assistants often prioritize low latency and real-time responses leading to potentially more superficial code reviews over standalone AI code review tools which focus on quality.
Forge’s depreciation also suggests that, as a feature or an add-on, code reviews are unlikely to be a core focus of product development by AI coding assistants – especially as the space becomes more competitive and companies devote more time to improving their core offerings. That could likely mean standalone solutions will be more comprehensive and have more features making them able to deliver additional value.

Git-based AI code review tools: Reviews that save teams time

Tools: CodeRabbit, Bito, Greptile, Qodo, Graphite Diamond

These tools run automatic reviews when you open a pull request. First-pass AI code reviews find bugs, security vulnerabilities, syntax errors, stylistic issues, and more in order to save senior engineers time adding comments on issues themselves.
These tools fit perfectly within your CI/CD workflow and existing code review processes while offering PR summaries and 1-click fixes to make it easier to both review and fix issues.
With codebase awareness and enhanced context, these tools can catch common issues early, find bugs you might miss, and enhance code quality across your codebase.
Offerings like CodeRabbit even have agential chat and workflows allowing you to do things like generate docstrings and unit tests, make multi-file edits, raise PRs, and more by simply chatting with the AI reviewer.
Having AI code reviews at the CI/CD stage is critical to streamline the code review process while implementing code quality standards across the codebase.

Both IDE and Git-based AI code review tools: Reviews at every stage

Tools: CodeRabbit, SonarQube, Qodo, Sourcery

Few AI code review tools offer code reviews in the IDE and CI/CD tools. These tools provide the most comprehensive code review support by reducing bugs at multiple stages of the development cycle.
Connecting IDEs with CI/CD reviews also allows for multilayer reviews allowing for a more seamless workflow and additional quality checks.

Optional layer: AI QA test generation & execution tools

We previously discussed how QA testing has traditionally incorporated forms of machine learning or AI, but newer tools are going even further by automating the most repetitive and time-consuming aspects of testing.

These AI-powered tools generate extensive and realistic test scenarios from simple descriptions, significantly speeding up the testing process. Beyond speed, they also enhance test coverage by considering numerous permutations a human tester might overlook. Additionally, some of these tools offer "self-healing" features that automatically update tests when your app’s UI or underlying data changes.

We break these down into two categories:

AI test generation tools: Test generation-only

Tools: Testim, Mabl, Functionalize, testRigor, Autify, ACCELQ, Qodex, Tricentis

AI test generation tools don’t run or manage tests—instead, they automate the creation of test cases, scripts, or scenarios based on natural-language descriptions or by analyzing existing code paths.
Their main appeal is reducing the tedious, repetitive work of manually defining each individual test case to help QA engineers rapidly build out robust test suites.
Developers and QA teams appreciate these tools because they speed up initial test creation and are especially useful when expanding test coverage for large, complex, or legacy applications.
While great for generating volume quickly, these tools typically require manual fine-tuning and review to ensure accuracy and coverage.
Increasingly, these tools leverage deeper context-awareness that allows them to parse existing code and user journeys more intelligently, allowing them to propose test cases that closely align with real-world use cases.

AI test execution and maintenance tools: End-to-end AI test support

Tools: MuukTest, Applietools, Sauce Labs, Perfecto, Meticulous

Full-lifecycle AI QA tools go beyond test generation and handle the entire testing process – from writing test cases to executing them automatically and even maintaining them as your application evolves.
Teams often favor these comprehensive tools because they dramatically reduce QA workload by automating, not just initial test creation, but the ongoing upkeep required when the codebase or UI changes.
Though these tools significantly ease maintenance burdens, they can sometimes struggle with intricate or complex scenarios.
Increasingly sophisticated, these tools integrate seamlessly into existing CI/CD workflows, providing continuous, automated testing coverage across multiple environments and deployment stages.

Optional layer: AI Refactoring tools

Another crucial area is AI refactoring tools. While general AI coding assistants may claim refactoring capabilities, their outcomes often fall short. This has led many teams to adopt specialized AI refactoring tools explicitly designed for optimizing and improving codebases.

These dedicated tools automate tedious tasks, quickly identifying and performing refactoring opportunities based simply on natural-language instructions, drastically cutting down manual effort and enhancing code maintainability.

We divide these tools into two types:

Semi-automated tools: The tab completion of refactoring tools

Tools: CodeGPT, GitHub Copilot, Amazon CodeWhisperer, Sourcegraph Cody

Semi-automated refactoring tools don’t completely take the wheel. Instead, they proactively suggest code improvements within your IDE allowing you to quickly accept, reject, or modify each suggestion.
These tools focus on smaller-scale, incremental refactors—like simplifying methods, restructuring loops, or optimizing function logic—that benefit from a human eye before committing.
Developers prefer semi-automated tools for complex or sensitive refactoring tasks because they offer fine-grained control.
The appeal lies in their balance. They speed up routine refactors while leaving room for developer judgment — minimizing the risk of unwanted changes or subtle bugs sneaking into production.
Increasingly, semi-automated refactoring tools leverage deeper context-awareness, analyzing the broader codebase to offer smarter, more relevant suggestions.

Fully automated tools: When you want to give AI more autonomy

Tools: Claude Code, Devin, OpenAI Codex

These AI tools handle large-scale, repetitive refactoring tasks automatically across your entire codebase, often from just a single set of instructions or rules.
While semi-automated tools highlight refactoring opportunities for devs to review manually, fully automated tools excel at bulk tasks—such as upgrading dependencies, migrating frameworks, or standardizing code styles—potentially saving hours of repetitive work.
These tools appeal to teams looking to tackle technical debt at scale without spending days manually applying the same refactor across hundreds or thousands of files.
Developers appreciate their reliability and consistency for clearly defined, repetitive refactors, but fully automated tools generally work best when given explicit rules. They’re less suited for nuanced code improvements that require human judgment.
Increasingly, fully automated refactoring tools can parse multiple programming languages and integrate directly into existing CI/CD pipelines.

Optional layer: AI documentation tools

Finally, AI documentation tools, while not usually the first thought when adopting AI, have proven incredibly valuable.

As we previously noted, these tools tackle the often-dreaded task of writing and updating code documentation such as inline comments and docstrings. By leveraging AI, developers can quickly generate clear, accurate, and up-to-date documentation directly from their codebase, saving significant time and effort that would otherwise be spent manually maintaining documentation.

Code-level docs tools:

Tools: DeepWiki, Cursor, CodeRabbit, Swimm, GitLoop, GitSummarize

AI documentation tools analyze code structures and behaviors to produce readable, context-aware documentation drafts automatically—potentially cutting documentation time in half or more by generating inline comments, docstrings, API references, or even internal design and architecture docs.
These tools appeal to teams wanting to keep documentation continuously synchronized with code changes without manually updating every function comment or API description as the code evolves.
Increasingly, AI documentation tools support multiple programming languages and integrate directly into IDEs and CI/CD pipelines, proactively prompting devs to document their code as they write it, thus improving doc quality and reducing technical debt over time.

Building your own AI dev tool stack

Adopting an AI dev tool stack isn’t about just throwing a couple new AI tools into the mix. It’s about strategically bringing AI into every part of your development workflow.

Using AI strategically at every step – from coding and reviewing to testing, refactoring, and documenting –can help your team get more done, reduce frustration, and significantly boost the overall quality of your codebase.

We’d love to hear more about how you’re building your AI dev tool stack and what’s working for you. Tag us on Twitter or LinkedIn.

Interested in trying out our AI code review tool? Get a 14-day free trial!

]]>

Tue, 10 Jun 2025 00:15:55 GMT

In April, Microsoft and Google announced that AI is generating 30% of the code at their companies. That indicates that AI coding tools have entered a new phase. They’ve become a significant part of engineering workflows – even at large, enterprise companies.

With Dev Twitter obsessed with vibe coding these days, the question many devs we’ve been talking to are asking is what does all this AI use actually look like? Are developers vibe coding whole features for production using agentic coding capabilities? Or are they using AI primarily for tab completion and early prototyping?

Ultimately, devs want to know what successful AI adoption really looks like across teams, companies, and industries. What AI tools are teams actually using? How are they getting real value from them? What rules, if any, are companies putting in place around AI usage? Are AI coding tools really boosting productivity or just helping teams code faster, but with more bugs?

At CodeRabbit, we talk to hundreds of engineering teams every month about how they're using AI. That gives us early visibility into trends around AI adoption, and in the last few months, we've seen striking similarities in the ways development teams are thinking about AI.

Let’s dive into what we’re hearing from customers – and why it’s convinced us 2025 is the year of the AI dev tool tech stack.

Everyone has AI pain points now

It likely comes as no surprise that the teams we talk to tell us that one of the major pain points of their AI coding tools is that the productivity and DevEx gains they deliver are inconsistent. With studies finding that AI coding tools can add up to 41% more bugs to your code, these tools have come with new challenges.

A couple of weeks ago, Ryo Lu, Cursor’s Head of Design, wrote a thread about the potential downsides of using Cursor to write code. In it, he listed 12 steps to take if you don’t want to end up with AI spaghetti you’ll be cleaning up all week.

A tool that requires a 12-step guide for avoiding disastrous spaghetti code might be fine if you’re vibe coding a hobby project or on a team of mostly senior devs who can catch and edit out the spaghetti, but imagine what a junior developer could do to a legacy codebase in a highly regulated Fortune 500 company!

In addition to more bugs and issues, we’re also hearing that AI coding tools have created bottlenecks at other points of the development cycle.

It goes without saying that if you’re writing more code, you have to review more code, test more code, document more code, and refactor more code. Very quickly, your ‘game-changing’ AI productivity gains get held up at other manual parts of the development cycle. And that work can be harder and more time consuming given AI-generated code’s tendency to have more issues.

A new way to code = A new tech stack

That’s why many devs have come to an important realization this year: You can’t just introduce a transformative technology and leave the rest of the software development cycle intact. You need an end-to-end AI dev tool tech stack.

It’s common for disruptive technologies to spark broader ecosystem changes. A great example is how GitHub’s 2008 launch resulted in the launch of both Circle CI and Jenkins three years later. AI coding tools seem to be following an even faster timeline.

After a few years of using them, engineering leaders have realized that AI coding tools help sometimes but hurt sometimes, too. To actually realize the promised productivity gains, they need additional tools for the downstream tasks they create or make more difficult.

But this shift to thinking about AI adoption as a stack is also about using the same approach of leveraging AI to boost productivity that worked for code generation for other manual tasks. Why not review faster and test faster if you’re coding faster? Especially since almost no one loves reviewing code or writing tests?

In some cases, the ROI of leveraging AI at other stages of development might even be higher than what AI coding assistants deliver. That’s because those AI tools work to remove bugs from code rather than adding them in.

What’s in the AI dev tool tech stack?

The AI dev tool stacks we’re seeing our customers adopt are a layered set of AI tools that support every stage of the software development lifecycle.

Here’s a quick look at the layers of that stack, how they fit together, and why you’ll probably be using most of them by the end of this year – if you aren’t already.

Foundational: AI coding tools
Essential layer: AI code review tools
Optional layer: AI QA test tools
Optional layer: AI refactoring tools
Optional layer: AI documentation tools

Foundational: AI coding tools

This is where most teams start. These tools help developers write code faster – either by suggesting autocompletes of what you’re currently writing or by generating entire functions, tests, or components based on natural language prompts. Over time, they’ve become more sophisticated with deeper codebase awareness, a greater commitment to code quality, and a recent focus on agentic, multi-step tasks. But these tools are still notorious for introducing bugs, vulnerabilities, and performance inefficiencies into code. That translates into developers doing a lot more code editing and reviewing.

Increasingly, we’re hearing two things. First, devs aren’t just using one tool but often leveraging multiple tools based on what each tool is best at (a process satirized in this tweet). Second, devs are increasingly opinionated about which tool or tools they want to use – with the choice of an AI coding assistant becoming as divisive as whether to use a PC or a Mac.

That’s led many teams to start giving developers a choice around AI assistants rather than choosing just one to buy licenses for. Given that they’re likely to also be more effective at using the tool they prefer – that benefits companies, too.

We break these tools into five categories – though many tools span multiple categories.

Tab completion tools: GitHub Copilot, Cursor Tab, Windsurf, TabNine, Sourcegraph Cody, Qodo, Jetbrains
AI coding assistants: GitHub Copilot, Cursor, Windsurf, Claude Code, OpenAI Codex CLI, Traycer Zed, Cody by Sourcegraph, Aider, Qodo, Cline, Roocode, Blackbox, OpenHands, Gemini Code Assist, Augment Code, Amazon Q, JetBrains AI Assistant
Agentic coding tools: Cursor, Windsurf, GitHub Copilot, Claude Code, OpenAI Codex, Cline, Roocode, Blackbox AI, Continue, Devin, Jules, Augment Code, OpenHands
AI app generator tools: Lovable, v0, Bolt, Builder.io, Figma Make, Fine.dev, Stitch
Codebase context tools: Repomix, Repo Prompt, Context7

Essential layer: AI code review tools

AI code review tools sit at the center of the stack because they directly address the biggest bottleneck introduced by AI coding tools: the review process. If your code is getting written faster — and more often — by machines then you need a better way to review it.

Trying to manually review increasingly more code as a team isn’t just a recipe for burnout, it also risks quality degradation. Research shows that most devs can only manually review up to ~400 lines of code before fatigue sets in. That fatigue could mean devs miss more critical bugs then have to address them in production.

Indeed, code review tools don’t just help you merge PRs up to 4x faster and reduce the time you spend reviewing by up to 50%. They are also essential in AI-assisted development to keep bugs from production given that AI coding tools have been found to add up to 41% more bugs to code. Using them protects your AI productivity savings by ensuring no bad code ends up in production.

AI code reviews also help improve code quality, reduce reviewer fatigue, and standardize best practices across teams no matter which AI coding assistants your team members are using. Unlike code generation and agentic coding tools, their output isn’t wildly inconsistent since it doesn’t depend on the AI competency of any individual developer to know how to prompt them.

But, perhaps more importantly, they leverage AI for what it’s best at – automating repetitive and tedious tasks devs don’t want to do. Who wants to spend an hour adding a dozen comments to a PR when AI can add most of those comments for you, give you easy 1-click fixes for each of them, and find bugs you might have missed?

These tools come in three main flavors:

Features of an AI coding tool: Cursor, GitHub Copilot, JetBrains, Windsurf Forge (deprecated)
Git-based AI code review tools: CodeRabbit, Bito, Greptile, Qodo, Graphite Diamond
Both IDE and git-based AI code review tools: CodeRabbit, SonarQube, Qodo, Sourcery

Optional layer: AI QA test generation & execution tools

For many dev teams, QA testing has long included some form of AI. But a new generation of AI-powered QA tools promise to automate even more of the grunt work – especially around generating and maintaining tedious end-to-end tests that simulate real user journeys. Instead of manually thinking up every scenario, you can let an AI generate test cases or even entire test scripts from a natural language description of what needs to be checked.

The benefits are hard to ignore. The most important is speed – they can churn out or execute suites of tests in a fraction of the time and generate dozens of scenarios at once. However, they also help achieve greater breadth of coverage by running through permutations a human might overlook or not have time for.. Some even offer self-healing capabilities to adjust tests when your UI or data changes, reducing maintenance headaches and keeping your test suite running smoothly as the app evolves.

We break these down into two categories:

AI test generation tools: Testim, Mabl, Functionalize, testRigor, Autify, ACCELQ, Qodex, Tricentis
AI test execution and maintenance tools: MuukTest, Applietools, Sauce Labs, Perfecto, Meticulous

Optional layer: AI Refactoring tools

While some AI coding tools claim they can be used for refactoring, their results are often lackluster. For that reason, many companies adopt AI tools created explicitly for refactoring code as part of their AI dev tool tech stack after they’ve had bad experiences attempting to use coding tools for that use case.

AI-powered refactoring tools promise to automate the tedious and repetitive aspects of improving your codebase from minor optimizations to significant architectural changes. Instead of spending hours manually hunting down inefficiencies or repeating the same structural tweaks across your codebase, these AI tools quickly identify and even execute refactoring opportunities from a simple natural-language description.

We divide these tools into two types:

Semi-automated tools: CodeGPT, GitHub Copilot, Amazon CodeWhisperer, Sourcegraph Cody
Fully automated tools: Claude Code, Devin, OpenAI Codex

Optional layer: AI documentation tools

While docs are never the first thing that teams think about when adopting AI, it’s one task that they appreciate getting help with when they do. These tools tackle one of coding’s most dreaded tasks—writing and updating code documentation like inline comments to docstrings. Instead of manually documenting every new function or combing through outdated guides, devs can let AI tools quickly draft readable, up-to-date documentation directly from the code itself, saving countless hours of tedious work.

Code-level docs tools: DeepWiki, Cursor, CodeRabbit, Swimm, GitLoop, GitSummarize

Sample stacks

So, what do some of these AI dev tool tech stacks look like? We’ve seen a range of configurations from company to company but here are some common stacks teams are using.

‘Comprehensive’ stack

There’s a growing group of companies we encounter who have implemented or are in the process of implementing an end-to-end AI dev tool stack that includes an AI-powered coding tool, code review tool, QA tool, refactor tool, and docs tool.

These are typically companies where there’s been significant internal leadership around AI adoption either from the C-Suite or engineering. They were also often early adopters of AI coding tools and have already seen their benefits so are looking for additional AI productivity and DevEx gains.

‘Choose-your-own-AI-tool’ stack

We are increasingly seeing companies that are implementing AI tools throughout the development cycle AND giving their team more choice as to which tools they use. These companies understand (or have learned the hard way) that different AI tools are best suited for different kinds of work and that the best AI tool for any developer is the one they feel most comfortable prompting.

This strategy hasn’t just anecdotally helped increase AI adoption but it’s also improved developer satisfaction and experience at these companies. That’s because, increasingly, developers are opinionated about which tool they use. Some companies offer developers choice over just their AI coding tool (Cursor, Copilot, or Claude Code?) while others will offer devs choice over other tools in the stack, as well.

‘Multiple coding tools’ stack

Not to be outdone by the companies that let developers choose their own AI tools are the companies that let devs choose multiple AI coding tools. Maybe they use Lovable for prototyping UI and then Cursor to write the app. Or they use TabNine for code completion and ChatGPT for code generation. More companies are saying yes to developers using more than one tool if they can make the case for why it will improve their productivity.

‘Partial’ stack

Not all companies that we’re seeing building an AI dev tool stack are adopting all the tools in the stack. Typically, however, their stacks involve an AI coding tool, an AI code review tool, and another AI tool from our list – be that an AI refactoring tool, an AI QA tool, or an AI docs tool. Which they adopt often depends on their codebase, internal expertise, and needs. For example, larger companies are more likely to adopt AI QA tools since they have a large enough team internally to manage QA whereas smaller companies are more likely to mostly outsource QA to contractors and agencies.

‘Essential’ stack

Finally, we see a lot of companies building just an ‘essential’ stack which includes just an AI coding tool and an AI code review tool to help navigate the added bugs and more complicated code reviews that typically result from using coding assistants. Code review tools also have some of the highest ROI of any AI tools – including AI coding tools – since they both save significant time and keep bugs out of production.

Building your own AI dev tool stack: What to consider

When it comes to building an AI dev tool stack, we’ve seen a number of approaches. Many adopted AI coding tools and then iteratively looked for individual solutions to the problems those tools created as downstream issues became particularly painful.

Other companies took a more intentional approach with CTOs or other technical leaders investigating tools that could improve the development cycle and running proof-of-concept tests to see whether they actually deliver results. Some even waited to adopt AI coding tools and leveraged AI code review tools to address their existing code review backlogs first.

We recommend a proactive approach since we often see teams suffering from delayed milestones and dev burnout before they start looking for solutions.

Want more info about what we’ve been seeing around AI adoption of specific tools? We have another post here where we go into greater details about the different types of tools in each category and how we’re seeing them helping engineering teams.

We’d love to hear more about how you’re building your AI dev tool stack and what’s working for you. Tag us on Twitter or LinkedIn.

Interested in trying out our AI code review tool? Get a 14-day free trial!

]]>

Thu, 05 Jun 2025 21:53:23 GMT

Overview

SalesRabbit, a CRM and canvassing platform used by roofing, solar, and pest control companies, is no stranger to legacy code. In recent years, SalesRabbit has expanded its product line through multiple acquisitions – including RoofLink in 2024, a roofing-focused CRM.

Those expansions came with new challenges: multiple legacy codebases in different languages (C#, Elixir, Python, and even C) and no easy way to assess code quality across them.

With 20 engineers, CTO Michael Archibald needed a scalable way to maintain engineering velocity while gaining visibility into an inherited codebase, reducing bugs, and supporting less experienced developers on the team.

That’s where CodeRabbit came in.

The challenge: Legacy codebase & high defect rates

Before CodeRabbit, SalesRabbit was trying to grapple with an inherited codebase from a new acquisition while dealing with many of the common challenges engineering teams face around code reviews. Those included delays in reviews that slowed down deployment velocity and inconsistent coding standards.

Unfamiliar legacy codebases after acquisitions

The SalesRabbit team was spread out across a growing number of languages. While SalesRabbit started as a PHP application, they acquired a company with a C# codebase, shifted some of their own codebase to Elixir, and were about to buy RoofLink, whose code was in Python. It was the introduction of that Python codebase with SalesRabbit’s acquisition of RoofLink that initially prompted Michael to research AI code review tools. “I was looking for some automated tools, primarily AI, that could help us understand the codebase a little faster and better validate the quality of the code,” he shared.
High defect escape rate

Michael has always been hyper-focused on improving application quality. When he joined SalesRabbit as CTO six years ago, the company was facing frequent downtime. Since then, they've improved to 99.99% availability and scaled their team. But, after the acquisition, Rooflink’s defect escape rate gave him cause for concern. While Rooflink wasn’t tracking how many bugs made their way to production, anecdotally, the Rooflink support team told him they were used to fielding customer complaints on nearly every release. It seemed clear that code at the company wasn’t being as thoroughly reviewed as it should be.
Slow review cycle

With an ambitious roadmap and multiple products across the company, Michael had to ensure the team maintained velocity. But manual code reviews were inconsistent and often took several days, slowing deployment significantly. One problem was that the team had a large number of junior engineers – which meant fewer senior developers who could review code. Michael wanted a solution that would make reviews easier.
AI coding tools caused code quality issues

While SalesRabbit’s engineers leveraged Copilot and other AI coding tools to help write code faster, it created problems with code quality. “The junior engineers were introducing a lot of bugs with these tools,” Michael explained. That caused him to try to find other AI tools that would better support the junior engineers on the team.
Inconsistent coding standards

Different teams across SalesRabbit and RoofLink used different styles and standards, often due to legacy standards at the acquired companies. But style inconsistencies added friction. A central governance layer was needed to enforce best practices. “We just want everyone to be the same,” said Michael.

Why SalesRabbit loves CodeRabbit

https://youtu.be/0WmK5QqqjJY?feature=shared

The engineers all wanted it – and used it

Michael wasn’t initially convinced that CodeRabbit would solve his team’s problems. “I came across CodeRabbit and thought, ‘It's relatively inexpensive. I'm going to just give one or two engineers a seat and see how they like it,” he explained. “But almost immediately everybody on my team was like, oh, I want this, I want this.”

That level of enthusiasm for a tool is something Michael listens to. When he joined SalesRabbit, the company was facing 80% engineering churn and he’s since worked hard to improve developer satisfaction and stabilize the engineering org. “One of the litmus tests for me with AI tools is: do engineers want it? I don’t like pushing AI tools on engineers,” Michael said. “With CodeRabbit, everybody asked for it almost immediately.”

Initially tested with junior developers, senior engineers also quickly recognized its value around bug fixes, refactor suggestions, and security checks. “With CodeRabbit,, everybody was like, give me this. This is fantastic. It speeds up code reviews,” Michael said. “We went from a small test to full adoption very quickly.”

CodeRabbit found more issues than any human

While Michael had been worried about Rooflink’s defect escape rate, CodeRabbit reduced it significantly – and almost immediately. “We could have started putting processes in place to improve things but those can take weeks and months before we get measurement,” he explained. “Code Rabbit seemed to have an almost immediate impact. Code quality has gone up and the only thing we've adjusted has been adding CodeRabbit to all of the deploys.”

Michael isn’t surprised it’s been so effective at reducing issues. “I feel very comfortable saying that it's caught a lot more bugs than any human has,” he said.

An AI tool that… didn’t introduce more bugs

Unlike Copilot and other AI coding tools, which focused on writing code and resulted in a lot of added bugs, CodeRabbit focused on finding and fixing them. That gave SalesRabbit the visibility and quality gates they needed at the PR stage to keep defects out of production. “It works especially well for junior developers,” Michael said. “It helps them spot patterns and mistakes they’d otherwise miss.”

SalesRabbit was also able to more quickly understand their inherited codebase. “It really helped us to determine the code quality,” shared Michael.

It fixed style consistency issues

CodeRabbit’s built-in style enforcement reduced the need for custom linters or style checkers, helping standardize code across legacy and modern languages. “CodeRabbit does a really good job saying, ‘this might be a bad pattern’ or ‘you’re not following style here,’” Michael explained. “We were able to get rid of a lot of tooling we put in place for managing code styles because CodeRabbit has a version of that built-in.”

What’s helpful is having one centralized code quality enforcement tooling for legacy languages like C# and modern ones like Elixir and Python “We just want everyone to be kind of the same,” said Michael. “CodeRabbit does that for us.”

The results: Better code, lower defect rate, happier engineers

With CodeRabbit, SalesRabbit has seen impressive results:

30% fewer defects

The defect escape rate decreased by at least 30% after introducing CodeRabbit, improving system reliability. Support teams even noticed the difference. “It had almost an immediate impact,” Archibald said.

25% Faster deployments

CodeRabbit’s automated first-pass review enabled faster iterations, reducing release cycle time – even with a complex legacy codebase. Then, one-click fixes helped them quickly commit the changes identified.

Significant style and standards consistency improvements

While it’s hard to measure, Michael feels strongly that CodeRabbit helped them level up their code quality significantly. “It's improved our code style,” he attests.

Happier engineers

Michael’s focus is on keeping the engineers at SalesRabbit happy and productive. That’s why he’s never wanted to push AI tools on them that they didn’t want. But CodeRabbit was a tool that his engineering team all wanted. “The developers have really enjoyed using it,” he shared.

CodeRabbit = Less review overhead, more velocity

For SalesRabbit, adopting CodeRabbit was low-lift but high-impact. Their team was able merge PRs at least 25% faster, improve defect detection in legacy C# and Python code, and increase developer efficiency by freeing them from multi-day review cycles.

The AI-powered reviews only take hours now, instead of days, and enable faster deployments. CodeRabbit was also able to find bugs that junior engineers were letting slip by when using AI coding tools. With review cycles shortened, developer confidence increased, and the entire team more aligned around coding practices, Michael’s glad he found CodeRabbit when he did.

As Michael puts it: “Before CodeRabbit, we struggled with inconsistencies in code reviews and defects slipping into production. It’s improved our coding standards, especially in C#, provided a centralized governance layer for code style enforcement, and significantly reduced production defects.”

With CodeRabbit’s expanding feature set, especially the recent support for automated docstrings insertion and the future support for agentic workflow-based automated unit-test insertion, SalesRabbit anticipates seeing even more efficiency gains soon.

Want see how CodeRabbit can help your team? Get a 14-day trial.

]]>

Tue, 03 Jun 2025 06:21:59 GMT

Pipeline AI vs. agentic AI for code reviews | AI architecture patternsの意訳です。

AIはコードレビューのあり方を大きく変えました。

従来の静的ルールや正規表現ベースのLinterから、差分を読み取り、まるでシニアエンジニアのようなフィードバックを返すシステムへと進化してきました。これは確かな前進です。

しかし、CodeRabbitのように本番環境で使えるAIレビューシステムを開発する中で、私たちはある根本的な設計の選択肢に直面します。

AIにエージェントのような自律性を持たせるべきか？それとも、構造化されたPipelineとして制御するべきか？

この選択は実装の問題だけではありません。システムの処理速度、開発者の信頼性、バグ時のデバッグしやすさ、長期的な運用コストにまで影響します。

ただし、設計が最終目的ではありません。これらはすべて、ある本質的な問いに答えるための手段に過ぎません。

「最高のコードレビューを実現するために、モデルに必要なものだけを、的確に渡すにはどうすればいいか？」

問題は「エージェント型か、パイプライン型か」ではなく、「現場で役立つ、最高のツールをどう作るか」です。

まずは、それぞれのアーキテクチャについて整理しましょう。

AIアーキテクチャのパターン：エージェント型 vs パイプライン型

エージェント型AIシステム

エージェント型の構成では、AIは1つのプロンプトに縛られず、ステップごとに考え、判断し、ツールを使いながら進行します。典型的なプロセスは以下の通り：

行動計画の立案
ツールの実行（例：grep、静的解析ツール、テストランナーなど）
出力の観察
次に何をすべきかを判断

このプロセスは ReAct（Reason + Act） というアプローチに基づいており、多くの研究やシステムに使われています。

モデルが外部ツールやメモリを活用して出力を豊かにできるのは大きな魅力ですが、それを正確に制御するのは非常に難しいのです。

パイプライン型AIシステム

パイプライン型は、より決定論的（predictable）なアプローチです。以下のような一連のステップを定義します：

入力の準備（例：diff、関連ファイル、Issue内容など）
前処理（例：静的解析、コード検索）
モデルへのプロンプト送信
出力をレビューコメントとして整形

この構成は高速でテストしやすく、CIなどのワークフローにも組み込みやすいのが利点です。

ただし最近の多くのツールでは、パイプライン型をベースにしつつ、プロンプトの動的調整や文脈取得、対話型のフローなど、エージェント的要素も一部取り入れています。

つまり、ほとんどの現実的なシステムは、どちらか一方ではなく、その中間に位置しているのです。

ハイブリッドAI：両極ではなくスペクトラムで考える

実際のところ、現場で使われている多くのAIシステムは完全にエージェント型でもパイプライン型でもありません。その中間に位置し、両者の利点を活かすハイブリッド型を採用しています。

CodeRabbit や GitHub Copilot PR Reviews はその代表例です。

こうしたハイブリッド型では、パイプラインの再現性・安定性と、エージェント的な柔軟な文脈取得や動的挙動を組み合わせ、実用的なバランスを取っています。

AIアーキテクチャのトレードオフ

観点	エージェント型	パイプライン型
レイテンシ	ステップが多く遅くなりがち	高速で予測可能
ツールの使い方	柔軟で動的	一貫性があり安定
信頼性	テストしにくく不確実性が高い	デバッグしやすく再現性が高い
文脈の扱い	動的で柔軟だがミスしやすい	事前に定義・制御された入力
ワークフロー適性	対話的なツール向け	CI/CDやPRレビューに最適

本当のボトルネックは「文脈」

コードレビューAIで最も重要なのは、どのように設計するかではなく、どんな文脈を与えるかです。

ありがちな誤解として、「コードやメタ情報をもっと渡せば精度が上がるだろう」と考えがちですが、実際は逆効果になることもあります。

無関係な情報が多すぎるとモデルが混乱する
プロンプトのノイズで誤検出が増える
質の低い文脈がツール経由で生成される可能性もある

つまり、「多いほうが良い」ではなく、「適切なものだけを渡す」のが正解です。

CodeRabbitが目指すもの：文脈を“キュレーション”するAI

CodeRabbitでは以下のような構成を採用しています：

モデルの実行前に30種類以上の静的解析ツールを実行
ASTやシンボル情報をもとに文脈を抽出
過去のレビュー結果を活かしたフィルタリング
モデルの入力制限を考慮して、構造化されたプロンプトを生成

これにより、「必要な情報だけを、正しく渡す」構成を実現しています。

今後の展望：エージェントに文脈の“選び方”を学習させる

理想的には、AIが自ら「どの文脈が役立つか」を学び、適切にツールを使えるようになるべきです。

そのためには：

理想的なPRと文脈のデータセット
評価指標と結果の紐付け
効果的なツール使用のシミュレーション

などが必要です。

実際に ReTool や LeReT のように、強化学習でツール選択を学習させる研究も進んでおり、精度や効率の向上が報告されています。

CodeRabbitでも、こうした文脈選択における次の一手に注力しており、より賢く、信頼できるレビューAIの構築を目指しています。

私たちの答え：「目的を持って考える」AI

エージェント型かパイプライン型かという話は、本質ではありません。大事なのは、**「モデルに必要なものを、必要なだけ、正しく渡すこと」**です。

パイプラインはスピードと安定性。エージェント型は柔軟性と思考力。そして、CodeRabbitが採用するハイブリッド型はその中間で最適なバランスを目指しています。

でも最も大事なのはやはり文脈です。

どこを見て、何を無視し、何を重視するべきか。それを人間のエンジニアのようにモデルが判断できたとき、私たちは本当に信頼できるAIレビューを手にすることができるでしょう。

最高のエンジニアが最高のパフォーマンスを出しているかのようなレビュー——それを、毎回、自動で届けられる世界へ。

私たちは、そこを目指しています。

CodeRabbitを試してみたい方は、14日間の無料トライアルからお試しください！

]]>

Tue, 03 Jun 2025 00:40:33 GMT

株式会社techbeans（テックビーンズ）は、受託開発を中心に事業を展開しています。完全な一括請負型だけでなく、ラボ型での伴走型チーム開発も行っており、特に後者のニーズが増加傾向にあるとのことです。同社は、特に各種AIサービスを活用した開発体制が特徴的です。そうした中で、近年ではAIを活用したプロトタイプ提案サービス「ゼロスタート」を開始しました。

今回は、そんなtechbeansの代表取締役・前川和浩さんと、テックリード・西井智紀さんにCodeRabbit導入についてお話を伺いました。

少数精鋭で柔軟にチームを編成する開発体制

techbeansの開発チームは約15名で構成されており、プロジェクトごとにチームを編成する柔軟な体制が取られています。PMは2名体制で、その他のメンバーは全員がエンジニア。案件によっては、1人で要件定義から実装・運用まで担うケースもあります。

フルスタックのスキルを持つメンバーが複数在籍しており、小規模案件では少人数で完結させる一方、必要に応じてフロントエンド・バックエンド・インフラの担当を分けた5〜6人規模のプロジェクトもこなしています。

小規模チーム特有のレビュー負担が課題に

CodeRabbit導入以前、コードレビューはチームにとって大きな負担でした。リードエンジニアがレビューに時間を取られると、並行する開発業務に支障が出るため、特に少人数体制では深刻でした。また、1名体制のプロジェクトではレビューする相手がおらず、自分の書いたコードを客観視するのが難しいという問題も抱えていました。

「コードレビューって本当に大変なんです。特に小規模案件ではリードエンジニアの時間がレビューに取られてしまい、全体の進行に影響が出てしまいます」（前川さん）

そうした中で出会ったのがCodeRabbitです。知ったきっかけは、X（旧Twitter）での話題でした。前川さんは、AIによるコードレビューという新しい手法に興味を持ち、すぐに試してみようという判断に至ったといいます。

「コードレビューをAIがやってくれる時代が来たんだ、とワクワクしました。試してみたら想像以上のクオリティでした」（前川さん）

チームに浸透し、AI活用が当たり前に

導入の決め手となったのは、CodeRabbitの指摘内容の的確さと、会話可能なコメント機能、そしてダッシュボードによるレビューの可視化でした。プロジェクト文脈を理解しているかのようなコメントが得られる点に驚いたとのことです。

「初めて使ったときは本当に感動しました。プルリクに自然にコメントが入り、まるで人間のように会話ができる。すぐに全社導入を決めました」（前川さん）

現在では、CodeRabbitはtechbeansの開発フローに欠かせない存在となっています。西井さんはもちろん、新規メンバーも違和感なく受け入れており、全社で自然に活用されています。細かな指摘やリファクタリングの提案、さらにはIssue作成まで自動で行ってくれる点が好評です。

前川さんは、他のAI（Devin）と連携させてCodeRabbitの指摘に自動で対応させる実験的な取り組みも行っており、AI駆動開発の可能性を広げています。

「CodeRabbitは、いまやチームにとって当たり前の存在です。最初は驚きましたが、今ではいないと困るレベルですね」（西井さん）

レガシーコードへの対応や対話性にさらなる期待

今後は、自己回帰型エージェントとの連携や、レガシーコードやデータ構造の深い理解に基づくレビュー機能の強化が期待されています。さらに、心理的安全性を保ちつつレビューできる柔らかい対話性も、競合との差別化ポイントとして重視されています。

「CodeRabbitには、AIコードレビューの最先端を走り続けてほしいです。特にレガシーコードの文脈理解や、APIテストまでカバーできる自律型レビューエージェントとの連携に期待しています」（前川さん）

CodeRabbitは今後も進化を続け、チームの規模に関わらず開発効率を向上させるパートナーとして、techbeansの成長を支援していきます。

]]>

Mon, 02 Jun 2025 19:04:39 GMT

Modern dev teams rely on data. Without analytics, how would you know if your team is improving its deployment velocity and code quality over time? At CodeRabbit, we want teams to have the data that matters when it comes to tracking their performance.

That’s why we created a unique integration with Grafana, a leading analytics platform, that provides greater visibility into your organization’s code review metrics through interactive dashboards directly in the CodeRabbit UI.

CodeRabbit, an AI-powered code reviewer, can be seamlessly installed on your git platform to review pull requests and deliver actionable insights. By embedding Grafana as a micro-frontend, we’ve made it easier for teams to understand the impact of code reviews on their organization.

In this post, we’ll cover why we decided to integrate dashboards in our application via a micro-frontend framework and how we built our Single SPA micro-frontend with the help of Qiankun.

Why use Grafana as a Single SPA micro-frontend?

When it comes to showing analytics in a single-page application, developers often face challenges. Building custom dashboards from scratch requires significant coding effort, and maintaining that code becomes a burden. Adding new metrics or dashboard panels typically involves creating new APIs and modifying frontend code, making the process slow and cumbersome.

To address these challenges, we decided to leverage Grafana as a Single SPA micro-frontend. This decision allowed us to streamline the addition of new dashboards while managing Grafana separately and deploying updates with ease. By doing so, we reduced development overhead and improved scalability.

Adding Grafana as a Single SPA micro-frontend with Qiankun

Grafana offers robust data visualization capabilities. To integrate it seamlessly into our UI, we forked Grafana, customized its dashboard page, and implemented it using Qiankun.

This micro-frontend architecture enabled us to retain the native Grafana experience while tailoring the interface to align with CodeRabbit’s needs. Micro-frontend architectures are particularly useful when integrating multiple JavaScript frontend applications, even those built with different frameworks.

After exploring various micro-frontend framework options, we chose the Qiankun micro-frontend library to implement this architecture. Qiankun, built on top of the single-spa micro-frontend framework, provides a simple API that makes it easy to manage micro-frontend architectures.

Using the Qiankun micro-frontend framework to overcome architectural challenges

Most micro-frontends are mounted in the main app based on specific routes. However, our use case required dashboards to be displayed dynamically across multiple routes and on tab changes. To achieve this, We used Qiankun’s loadMicroApp API which implements the single-spa parcel api under the hood.

This approach eliminated common limitations of micro-frontend architectures. To improve performance and faster load time. We have used caching in our Grafana proxy server.

Securing our Grafana micro-frontend with authentication

Authenticating Grafana using our UI credentials was a critical challenge. Typically, Grafana relies on an API key for access. To handle this, we created a proxy server which first authenticates a CodeRabbit user and then proxy forwards Grafana.

While this setup worked initially, it exposed a potential vulnerability: Grafana allowed any query to be sent in the request, posing a risk of data leaks.

To mitigate this, we added a validation layer that:

Ensures only pre-defined queries are allowed.
Scopes requests to specific organizations and restricts access to queries explicitly associated with dashboards.

This solution secured the Grafana micro-frontend integration by preventing unauthorized data access and ensuring a robust implementation.

The flow diagram above shows an micro-frontend architecture overview. The CodeRabbit UI mounts the Grafana dashboard using Qiankun by accessing a public micro-frontend endpoint that exports component lifecycle functions. Once mounted, all API calls to Grafana are routed through our authentication service for secure access.

Our use case for dashboards

At CodeRabbit, our goal is to demonstrate the value our product adds to an organization by providing AI-driven code reviews. To achieve this, we collect and analyze data stored in our data layer and use it to create insightful panels that tell the story of CodeRabbit's impact on our customers’ development cycle.

The dashboards include metrics such as the average number of pull requests (PRs) reviewed per day and the total number of reviews conducted by CodeRabbit. Additionally, they showcase the various language-specific tools leveraged to fine-tune reviews, offering deeper insights into how CodeRabbit adapts to diverse development environments.

These panels also highlight key contributions, such as the number of comments, suggestions, and chat conversations facilitated by CodeRabbit to improve code quality.

By integrating Grafana as a Single SPA micro-frontend, we were able to design and serve these visualizations seamlessly within the CodeRabbit UI while reducing the amount of development and maintenance work.

Benefits of Single SPA micro-frontend architecture

Integrating Grafana as a Single SPA micro-frontend architecture has transformed how we deliver analytics to our users. Since Grafana is a separate service to modify dashboards, we just have to make changes in Grafana dashboards and provision changes using Grafana Dashboard provisioning.

Using Grafana as a Qiankun and Single SPA micro-frontend in CodeRabbit has allowed us to deliver a powerful, secure, and dynamic analytics experience for our users. Looking ahead, we plan to expand our dashboard offerings and explore additional Grafana plugins to provide even more insightful analytics.

For developers looking to streamline analytics integration, we highly recommend exploring Grafana as a micro-frontend—it’s a game-changer for simplifying dashboard management and enhancing scalability.

Interested in trying out CodeRabbit? Get a free 14-day trial. Want to join our team? Check out our Careers page!

]]>

Fri, 30 May 2025 22:00:03 GMT

Overview

Plane, an open-core project management solution, had an ambitious roadmap and a tight-knit team determined to move fast. Despite the frontend team’s small size—just 12 engineers—they were responsible for significant scope: building and maintaining their cloud, self-hosted, and popular open source versions, fixing bugs, deploying new features, and continuously improving Plane’s performance and security.

As the most popular open source project management tool, they also needed to review multiple, complex pull requests a day – including those from their many OSS contributors. That proved to be a blocker for the team. “Long review cycles slowed us down,” explained Principal Engineer Sriram Veeraghanta. “Without clear PR summaries, understanding changes took time, delaying merges and feature releases. Bugs and inconsistencies slipped through, adding to tech debt.”

Eager to free up developer time and get their release schedule back on track, Sriram decided to try CodeRabbit, an AI-powered code review platform recommended to him by one of Plane’s OSS contributors. The result? Drastic improvements in review speed, code quality, and developer satisfaction. For Plane, that translated into less time spent buried in pull requests and more time hitting milestones that actually matter.

The challenge: A manual code review bottleneck

Before CodeRabbit, Plane relied on what Sriram characterized as the ‘standard process’ of manual reviews using code editors and GitHub’s built-in tools. “It worked,” he shared, “but was time-consuming and required a lot of back-and-forth to fully understand changes.”

That manual code review process just couldn’t keep up with the speed of Plane’s development cycle.

High pull request volume
As an application-layer SaaS company building flexible open-core project management software, Plane’s roadmap involved building a wide range of vertical features. That – plus the typical volume of spam and fluff PRs an OSS tool receives –- translated into a steady stream of PRs. For Plane’s senior engineers, that made for an overwhelming daily workload of scanning line after line during manual code reviews. With research showing quality degradation in manual code reviews after 1 to 2 hours a day or reviewing ~400 lines, Sriram’s team struggled to keep up.
Limited context in PRs
Many developers wrote only brief or no descriptions for their PRs, forcing reviewers to piece together a puzzle. “It was hard to grasp the context,” recalled Sriram. “I had to manually go through files to understand the changes.”
Slowed delivery and hidden bugs
Manual code reviews caused merges and releases to slow down – interfering with Plane’s product roadmap. While basic static checks or linting tools caught some issues, many bugs, vulnerabilities, and large-scale refactoring complications often slipped through as each PR demanded significant engineering attention.
Trapped in a cycle of low developer productivity
With developers spending so much time on manual code reviews, they had less time to concentrate on writing code. That didn’t just impact velocity but also impacted code quality – which then meant that future code reviews would take even longer. That led to what felt like endless review cycles for Plane’s senior engineers.

Why Plane loves CodeRabbit

Instant AI Summaries & Sequence Diagrams

For Plane, CodeRabbit’s AI-generated summaries and the Sequence Diagrams were an immediate time-saver.
“With CodeRabbit, AI-generated summaries give me instant context and the visual file structure helps me spot critical changes quickly,” Sriram explained. “These made it much easier to review changes quickly and catch critical issues without going through every file manually.”
That alone shaved hours off their daily reviews.

Early issue detection

Like with most small teams, Plane has to balance rapid iteration with stability and scalability – something that is harder to do if you’re always responding to issues in production or when the codebase has significant tech debt. AI code reviews helped considerably:

“Automated reviews catch issues early, improving both speed and code quality,” shared Sriram.

With CodeRabbit, issues they might have missed before—including security vulnerabilities, logic errors, or concurrency pitfalls—were flagged right away*.* This proactive approach reduced the chance of shipping bugs into production.

Faster, more efficient workflows

Plane’s workflow improved dramatically because CodeRabbit goes beyond traditional static analysis by understanding context behind code changes and leveraging the advanced reasoning of generative A

“AI changed the game by accelerating reviews, providing better context, and catching issues early, making the entire workflow more efficient,” Sriram shared.

Set up in minutes

Implementing CodeRabbit was a breeze for Plane, which meant the team was able to start seeing value right away.

“The setup was seamless—we configured CodeRabbit in just a few minutes, and the bot started reviewing our PRs instantly,” Sriram explained*. “It fit right into our workflow without any friction.”

The results: Quality code, shipped faster

Once CodeRabbit was fully in place, Plane saw quick improvements to their process:

Significantly decreased code review time

“PR review time has significantly decreased, improving deployment speed,” shared Sriram. Code reviews that used to take hours were now completed in a fraction of the time. CodeRabbit customers generally see a 50% reduction in overall review time.

Fewer bugs reaching production

By catching issues at the PR stage, the Plane team saw notable reductions in post-release fixes*. “Fewer bugs make it into production thanks to early issue detection,”* explained Sriram. Across our customers, CodeRabbit catches an average of 90%+ of all bugs and errors.

Faster merge cycles

With better context and fewer open questions, the team were able to merge PRs faster – and not fall behind on their release schedule. “We now spend less time on back-and-forth and more time shipping quality code,” explained Sriram. While results vary from team to team, CodeRabbit customers see an average of 4x faster PR merges.

Improved developer productivity

Less back-and-forth in PR discussions and less time on manual code reviews has allowed engineers to focus on building rather than reviewing. For that reason, Sriram believes CodeRabbit is essential for any development team. After all, “developer productivity is crucial for any organization,” he shared.

Immediate impact

Plane started seeing value almost immediately—with faster reviews, better PR context, and improved collaboration. “CodeRabbit has sped up reviews, improved visibility, and helped catch issues early,” Sriram shared.

CodeRabbit = No more code review bottlenecks

By implementing CodeRabbit, Plane successfully tackled the code review bottleneck that had slowed their team’s momentum. They’re now shipping features faster, collaborating more effectively as a team, and maintaining a high standard of code quality.

As Sriram puts it:

“Code reviews are no longer a bottleneck. AI-driven insights help us catch issues early, and the team can focus more on writing better code rather than spending excessive time reviewing. The overall workflow is smoother and we ship faster with more confidence.”

Want see how CodeRabbit can help your team? Get a 14-day trial.

]]>

Thu, 29 May 2025 18:17:22 GMT

AI has changed what code reviews can be.

We’ve gone from static rules and regex-based linters to systems that can actually read a diff and respond with feedback that resembles what a senior engineer might say. That’s real progress.

But as companies like CodeRabbit create production-grade systems for code reviews or for other developer-focused tools, we all face a core architectural question:

Do you give the AI autonomy to plan and act like an agent? Or do you structure the process as a predictable AI pipeline?

This choice affects more than just implementation. It shapes how fast your system runs, how much developers trust it, how you debug it when it breaks, and what it takes to maintain it long-term.

And while the architecture matters, it's not the end goal. These are just different ways of trying to answer the same underlying question —

How do we give the model everything it needs (and nothing more) to deliver the best code review possible?

That’s the real challenge. Not "agentic AI" vs. "pipeline AI." Just building the best possible tool for the people who use it.

We’ll come back to that. But first, let’s define the two camps.

AI architectural patterns: Agentic AI vs. pipeline AI

Agentic AI systems

In an agentic architecture, the model isn’t locked into a single prompt. It’s allowed to think step-by-step, make decisions, and use tools as it goes. Often this means:

Planning a course of action
Calling a tool (e.g. grep, a static analyzer, test runner)
Observing the output
Deciding what to do next

This approach — often referred to as ReAct (Reason + Act) — is one of several reasoning patterns used to guide agent behavior.

It shows up across a range of modern systems and research prototypes, but the core idea is the same: the model can reason, act, observe, and repeat — using external tools and memory to enrich its output. That flexibility is incredibly promising.

It’s also incredibly hard to get right.

Pipeline AI systems

Pipeline AI-based systems take a more deterministic approach. You define a sequence:

Prepare inputs (e.g. diff, relevant file slices, issue text)
Run pre-processing (e.g. static analysis, code search)
Call the model with a crafted prompt
Post-process the output into review comments

This approach is predictable, fast, and easy to test. It’s also easier to integrate into CI workflows, where speed and reproducibility matter.

Many tools use a pipeline AI backbone as their foundation, however, most modern implementations also incorporate elements of agentic behavior. They may dynamically adjust prompts, use retrieval strategies, or support interactive review flows.

They aren’t fully agentic, but they aren’t rigidly linear either.

Which brings us to the reality most teams face: you don’t have to pick a side. Most real-world systems live somewhere in the middle — not for philosophical reasons, but because that’s what it takes to ship something reliable, adaptable, and useful.

Hybrid AI systems: A spectrum, not a binary

In practice, many real-world systems don’t land fully in either the agentic AI or pipeline AI camp. They blend elements of both — taking the structure and reliability of pipelines, and layering in tool use, learned behavior, or context enrichment strategies that are often associated with agents.

CodeRabbit is a good example of this kind of hybrid AI architecture.

GitHub Copilot PR Reviews also falls into this category. While their interfaces and goals differ, they share similar DNA — blending structured inputs with retrieval, static analysis, and interactive flows.

We go deeper into CodeRabbit’s AI pipeline and enrichment strategy in the next section, but in short: it blends the determinism and predictability of pipelines with dynamic, learned behavior and targeted context augmentation — sitting squarely between the two paradigms.

Hybrid AI systems like this sit along a spectrum — and that's the point. You don’t have to go all-in on one paradigm. You just have to solve for what matters: helping your users make better decisions, faster, with fewer surprises.

Hybrid systems aim to balance the pros and cons of both agentic and pipeline systems by finding a balance between the two. Striking the right balance can also be difficult to achieve, with some experimentation required. This added control and flexibility can increase the cost of development and maintenance.

Tradeoffs between AI architecture patterns

Dimension	Agentic systems	Pipeline systems
Latency	Multi-step, often slower	Fast, predictable
Tool Use	Dynamic and adaptive	Static and consistent
Trust	Harder to test, less predictable	Easier to debug and validate
Context Handling	On-demand, but error-prone	Predefined and controlled
Workflow Fit	Interactive tools	CI/CD and production PR reviews

Agentic AI systems offer flexibility — but flexibility is a double-edged sword. They can fetch exactly what’s needed… or fetch everything and drown in noise. They can reason step-by-step… or loop forever. You need good defaults, good tools, and often, some level of hard constraint.

Pipelines, by contrast, are stable. You get speed, control, and a well-bounded behavior space. But they can be rigid. If the context isn’t there at the start, the model can’t do much about it.

That’s the tradeoff.

And that’s what most of us are doing here — not debating abstractions, but working to build the best damn tool we can. For ourselves. For our teams. For the developers who need to ship something today.

The AI architecture pattern you use is just a means to an end. The real work — and the real leverage — lies somewhere else.

AI context is the real bottleneck: Why autonomy needs structure

More context isn’t always better

In AI code review, we spend a lot of time debating architecture — agentic AI vs. pipeline AI — but the real performance bottleneck is often upstream: what context we give the model.

There’s a common assumption:

If we just add more AI context — more code, more metadata, more analysis — the model will perform better.

But that’s not how it works.

Too much irrelevant input overwhelms the model (Secure Code Review at Scale)
Prompt noise leads to muddled reasoning and false positives (Secure Code Review at Scale)
Even high-quality tools can generate low-quality AI context if used indiscriminately (Anthropic Case Study)

More isn’t better. Better is better.

Agent autonomy sounds great — but struggles in practice

Agentic systems promise flexibility: let the model decide what it needs, when it needs it, and fetch context accordingly.

In theory, this is ideal. In practice, it’s messy.

Common failure patterns:

Tool overuse — agents calling everything, just in case (DevTools Academy)
Redundant or noisy fetches that dilute the prompt (Prompt Engineering Guide)
No clear reward signal to distinguish helpful context from useless output (ReTool)

Agent autonomy without structure doesn’t scale.

At CodeRabbit, we curate context — we don’t wander

We’ve taken a different approach.

CodeRabbit’s system:

Runs 30+ static analyzers before prompting the model
Uses AST and symbol lookups to identify relevant context
Applies context filters based on past review learnings
Structures inputs carefully to fit model limits and prompt constraints

This hybrid AI pipeline gives the model exactly what it needs — and nothing more. No random guesses, no runtime surprises.

We’ve learned that great reviews come from:

Tight, relevant context
Consistent structure
Just enough flexibility to adapt to the code change at hand

Could agents learn to curate context?

Maybe — and that’s the interesting future path.

If we had:

A dataset of pull requests with “ideal context sets”
Evaluation metrics tied to actual review outcomes
Synthetic examples showing what helps and what hurts...

...then we might be able to train agents to call tools intelligently. To act more like great reviewers than interns with shell access.

That’s the direction explored by recent work like ReTool and LeReT, which use reinforcement learning to teach agents retrieval strategies — learning which tools to invoke and when, based on feedback loops tied to downstream task quality. ReTool showed improvements in task accuracy of up to 9% over retrieval-agnostic baselines, and required significantly fewer training steps to converge. LeReT similarly demonstrated a 29% boost in retrieval success and a 17% gain in downstream QA accuracy over standard retrievers — strong early signals that agents can, in fact, learn to fetch the right context when properly trained.

But even with these improvements, we’re still lacking high-quality, domain-specific datasets for tasks like code review.

One path forward could involve curating a large-scale benchmark of real and synthetic pull requests, each labeled with:

The issue or defect type present (e.g. logic bug, perf regression, missing test)
The AI context types that improve or degrade LLM performance on detecting that issue (e.g. AST, file diff, related function definitions, ticket description)
The tool invocations used (or simulated) to assemble that context

With this dataset, we could:

Evaluate which types of PRs benefit from which types of context
Train agents to learn context selection policies based on PR characteristics
Create specialized sub-agents for different error classes (security, style, performance), each using context proven to enhance detection of those issues

In other words: teach agents to reason more like experts — not just by copying what they say, but by emulating how they gather, filter, and apply the information that matters.

And we wouldn't have to guess at it. We could back it up with data.

That’s the deeper opportunity: not just training agents to run tools, but to understand why and when to use them — grounded in evidence, driven by outcomes.

We’re not there yet. But the path is starting to look clearer. And at CodeRabbit, we’re leading the charge. This is exactly the frontier we’re investing in: building hybrid AI systems that can predict the right tool to use, at the right time, for the right kind of review. Not just to make something clever — but to make something teams can trust.

Our hybrid AI pipeline: We reason with purpose

By now, it should be clear that "agentic AI vs. pipeline AI" isn’t the real battle. These are just architectural tools — different shapes we use to tackle the same core problem:

How do we give the model exactly what it needs to deliver a useful review — and nothing that drags it off course?

Pipeline AI systems give us speed, control, and consistency. Agentic AI systems promise adaptability and richer reasoning. And hybrid AI systems, like what we’ve built at CodeRabbit, try to walk that line — combining structure with flexibility, precision with power.

But no matter how you structure it, one thing matters above all: context.

The hard part of code review — for both humans and machines — isn’t the format. It’s knowing where to look. What matters. What can be ignored. What’s risky. What’s surprising. That’s what great engineers learn to spot, and it’s what we’re trying to teach our models to do.

That’s the exciting part.

Because if we can train a system to not just analyze a diff, but to know which tool to call, when to call it, and how to interpret its output with surgical precision — then we’re getting closer to something remarkable.

Not just automation. But reviews that feel like they came from your best engineer — on their best day — every time.

That’s what we’re building toward. Not for the sake of cleverness, but because that’s what teams need: trustworthy tools that help them move fast, write better code, and ship with confidence.

We’re not done. But we’re getting closer.

Interested in trying out CodeRabbit’ reviews? Get a 14-day trial!

]]>

Thu, 29 May 2025 05:30:09 GMT

株式会社カウンターワークスは、「すべての商業不動産をデジタル化し、商いの新たなインフラをつくる。」をミッションに、商業不動産のデジタル化に取り組むスタートアップです。主力サービスである「ショップカウンター（ショップカウンター）」は、ポップアップストアの出展を支援するスペースのマッチングプラットフォームで、空きスペースと出展者をつなぐ新たな商習慣のインフラづくりに挑戦しています。

また、大型商業施設のリーシング（テナント誘致）業務を支援するSaaS「ショップカウンターエンタープライズ（エンタープライズ）」も展開しており、業界に根強く残るアナログな慣習をテクノロジーの力で効率化。大手デベロッパー企業との提携を通じて、急速に導入実績を伸ばしています。設立から11年目を迎えた今も、挑戦を続ける姿勢が印象的な企業です。

少数精鋭の開発チームで挑む、商業不動産のデジタル化

カウンターワークス社の開発組織は、「ショップカウンター」と「ショップカウンターエンタープライズ」という2つの主要サービスごとにチームが分かれており、それぞれの事業フェーズやユーザーに合わせたアプローチを取っています。ポップアップストアの出展支援を担うショップカウンターでは、少数精鋭のチーム体制で開発が行われています。エンジニア5名、加えてPO（プロダクトオーナー）やデザイナーと連携しながら、日々の改善に取り組んでいます。

「設計やアーキテクチャの意図が正しく伝わっているかを確かめたり、必要に応じて軌道修正することが、レビューにおける私の役割です」と話すのは、ショップカウンターチームでアーキテクトを務める阿部さん。普段の業務を通じて、レビューに限らず自身が描いたアーキテクチャの背景や意図をメンバーに伝え、チーム全体の設計力や理解を高めることにも力を注いでいます。レビューの比重が高まる中、その負担を軽減しつつ質を保つため、CodeRabbitが積極的に活用されています。

一方、商業施設向けのSaaSであるショップカウンターエンタープライズは、より多機能で広範な業務を扱うサービスとして、3チーム体制・各5名規模で開発が進められています。チームはCRMやSFAなど注力領域ごとに分かれており、スクラム開発を推進。エンジニアの栗田さんは「コードレビューは開発プロセスの中でも大切な位置づけとして捉えています。CodeRabbit等のAIを活用したレビュー・開発プロセスをよりチームに合った形で行うことができないか、チームの振り返りで話しが上がることが多くなっています」と話します。現場にはジュニアからシニアまで多様な人材が揃い、AIツールも活用しながら、技術と運用の両面でチーム力を高めているのが印象的です。

レビュー品質と負荷に苦慮

CodeRabbit導入前、カウンターワークス社の開発チームが抱えていたのは、「コードレビューにかかる時間と品質のばらつき」という日常的な課題でした。特にショップカウンターチームでは、少人数体制の中でレビューの負荷が特定メンバーに集中しており、阿部さんも「レビューが業務全体の中でかなりの比重を占めていた」と語っています。チームにはジュニアから中堅、シニアまで幅広い層が在籍しており、レビュー観点や指摘の深さに差が出ることも多く、品質を安定させる難しさがありました。

ショップカウンターエンタープライズチームでも、似たような状況が見られました。複数のサブチームに分かれた中でメンバーの得意分野や経験値にばらつきがあるため、レビューに時間がかかったり、重要な観点の見落としが発生することもありました。栗田さんは「レビュー精度や効率をもう一段階引き上げたいと思っていた」と振り返ります。また、成長フェーズにあるプロダクトとして、設計ミスや冗長な実装を早期に防ぐ仕組みが求められていました。

このような背景の中で、CodeRabbitの導入は特定の課題を解決するというよりも、まずは「試してみたい」という軽い動機から始まりました。結果的にチームの開発スタイルにフィットし、現在ではレビュー品質の底上げや、早期段階でのフィードバック取得といった形で、開発プロセスに確かな変化をもたらしています。

「当初は技術的、開発プロセス的に課題がありました。インフラやレビュー体制など、少しずつ整えてきた中で、CodeRabbitのようなツールが自然と必要になる段階に来たのだと思います」（CTO徳永さん）

最初の印象は“意外とちゃんと見てくれる”

CodeRabbitに対する第一印象は、「思った以上にちゃんと見てくれる」でした。形式的な文法チェックに留まらず、セキュリティや設計に関する観点からのフィードバックもあり、経験豊富なメンバーからも「自分が忘れていた視点を思い出させてくれる」との声が上がりました。一方で、指摘の解釈には慣れが必要で、当初はベテランと若手で受け取り方に差が出ることもあったと言います。

それでも、レビューの観点をチーム内で共有し、CodeRabbitのフィードバックをどう活かすかを議論していくうちに、自然とチームのナレッジが蓄積されていきました。導入のハードルが低く、スムーズに使い始められた点も、定着を後押しした要因です。

「面倒な設定なしで、すぐに導入できたのはよかったです。また、利用人数ベースの定額で利用できる点も、見積もりが明確になっていたので導入しやすかったと思います」

早期レビューで設計ミスを未然に防ぐ運用へ

CodeRabbitの導入から時間が経ち、今では複数の開発チームで日常的に活用されるツールへと定着しています。特にレビュー負荷の高かったチームでは、CodeRabbitが一次レビューを担うことで、人的リソースに余裕が生まれました。阿部さんの所属するショップカウンターチームでは、CodeRabbitの指摘を起点に設計やセキュリティについてチーム内で議論する場面も増え、コード品質の底上げだけでなく、ナレッジ共有のきっかけとしても機能しています。

一方で、ショップカウンターエンタープライズチームでは、CodeRabbitを「初期レビューのアシスタント」として活用しています。人間によるレビューの前にCodeRabbitに通すことで、コードの粗さや設計上の問題を早期に発見できるようになりました。指摘に対してはリアクションを取り、定期的な開発プロセスの振り返りにもCodeRabbitに関する話題が上がるなど、チーム全体の意識にも変化が見られます。

「僕たちのチームでは、完成度が高まってからレビュー依頼をするのではなく、あえて50％くらいの状態でCodeRabbitに見てもらうこともあります。設計の矛盾に早く気づけることもあったり、人間によるレビューを受ける前に一度整理できるのがいいですね」（栗田さん）

チームに寄り添うレビューへ、さらなる進化に期待

カウンターワークス社では、CodeRabbitを日々の開発に欠かせない存在として活用しながらも、その可能性はまだ広がっていくと考えています。特に期待されているのが、チーム固有の開発スタイルやドメイン知識をより深く理解したレビューです。指摘内容の正否だけでなく、「なぜそれが問題なのか」を文脈に応じて伝えられるようになることで、ジュニアメンバーの成長やチーム全体の設計力強化につながると期待されています。

また、レビューの重要度に応じた運用もさらに進化させたいと考えています。ショップカウンターエンタープライズチームでは、CodeRabbitからの指摘にはアクションを取る運用を試しています。今後は、より洗練されたフィルタリングやカスタムルールの設定機能が追加されることで、さらに実践的な活用が進むことを望んでいます。

「今後はCodeRabbitに“うちの開発チームらしさ”を学習してもらえると嬉しいですね。レビュー内容がよりチーム文化にフィットしてくると、より自然に受け入れられるし、フィードバックの活用もしやすくなると思います」（栗田さん）

CodeRabbitは、今後もカウンターワークス社の開発組織の成長を支えるパートナーとして進化していきます。

カウンターワークス社では、フロントエンドエンジニアやバックエンドエンジニアなど、さまざまな職種で人材を募集しております。ご興味ある方は採用ページをぜひご覧ください。

採用情報 | 株式会社カウンターワークス（COUNTERWORKS）

]]>

Wed, 28 May 2025 07:13:33 GMT

株式会社neccoは、Web制作を中心にブランディングからロゴデザイン、サイト実装までを一貫して提供するクリエイティブ企業です。プロジェクトの初期段階からクライアントと並走し、経営視点に立ったデザインやシステム提案を行うことを強みとしています。サイト制作のみならず、3Dやモバイルアプリなど多様なスキルを持つチームで、柔軟かつ広範な領域をカバーしています。

そのneccoにて、開発と品質管理を統括するのがCTOの佐藤さんです。今回は、同社におけるCodeRabbitの導入背景、活用の工夫についてお話を伺いました。

さまざまな技術スタックを駆使するneccoの開発体制

neccoでは、開発メンバーの構成がコンパクトで、佐藤さん自身がすべてのコードの最終レビューを担当しています。納品後に自社で運用を行わないケースも多く、プロジェクトごとに仕様が異なる中でも高品質なコードを担保する必要があるためです。

また、Web制作における実装は多様な技術スタックにまたがることがあり、フロントエンドだけでなくPHPなどのサーバーサイド言語に関してもプロジェクトごとに最適な構成が求められます。そのため、少人数でも品質を落とさない体制構築が鍵となっています。

レビュアー・レビューイ双方の精神的な負担がネックに

少人数体制において、すべてのコードを人力でレビューすることには限界がありました。特にジュニアエンジニアが増えてくると、同じような指摘を何度も繰り返さなければならず、レビュアー・レビューイ双方にとって精神的な負担が大きかったといいます。

加えて、neccoでは案件ごとに技術要件やルールが異なることも多く、Lintなどの自動化ルールの統一が難しいという背景もありました。これにより、開発者ごとのばらつきを防ぐための工夫が必要とされていました。

「単純なミスの指摘を何度も繰り返すのが本当に大変でした。相手も“またやってしまった”という気持ちになってしまうし、お互いにとって良くないんですよね」

新規案件の導入からスタート

CodeRabbitを知ったのは、オープンソースのレビューサービスを試していた2023年10月頃。SaaS版として提供されている点にも魅力を感じ、実際に試したところ「意外と使えるかも」という第一印象を持ったそうです。

最初はWeb制作の新規案件で導入を試みました。ビジュアルやユーザビリティ面は人によるレビューが不可欠ですが、コンソールログの削除漏れなど、基本的なミスの検出には十分に効果があると感じたといいます。

「最初は機能が限定的だなとは思いましたが、初歩的なミスの自動検出は本当に助かります。コードレビューの基盤として活用できると確信しました」

明確なコストが導入を後押し

導入の決め手となったのは、価格と導入のしやすさでした。1人あたりのサブスクリプション形式でコストが明確であり、プロジェクト単位での変動がないため、継続利用に向いていると感じたそうです。

また、CodeRabbitは組織単位の課金になるので、GitHubとGitLabの2つで運用していると料金が倍になってしまいます。そのため、以前利用していたリポジトリをGitHubに統合するなど、運用の見直しも進めました。リポジトリがクライアント側にある場合など、一部利用できない場面もありますが、VS Code拡張機能がその課題を補完するのではないかと期待を寄せています。

「1プロジェクト単位で費用が変わるサービスよりも、1人あたりの定額制は導入しやすかったです。運用の見直しも進んだので、むしろ良いきっかけになったと感じています」

同じ指摘が減り、心理的負担が軽減

現在は、全プルリクエストに対してCodeRabbitによる自動レビューを必須としています。開発メンバーはまずAIレビューを通過させ、修正が必要なら対応、不要と判断した場合はその理由を明記、または佐藤さんに相談する運用です。

このプロセスにより、基本的なミスの削減が実現され、佐藤さんによる最終レビューに集中できるようになりました。リモートの開発者にとっても、レビュー前の安心材料として機能しているといいます。

「人にレビューをお願いする前に、一度AIに見てもらえるのは大きいです。同じ指摘を繰り返されることがなくなり、心理的な負担も減りました」

ユーザビリティやデザイン面でのサポートも期待

今後の期待として、プロジェクト単位で異なるルール設定の自動化や、ユーザビリティやデザイン面への対応の拡張が挙げています。現在はルールファイル（YAML）を手動で作成していますが、エディタ側でコードベースを解析し、推奨ルールを自動生成してくれる機能があるとより便利になるとのことです。

CodeRabbitは今後も進化し、neccoの開発チームをサポートしていきます。

株式会社neccoでは、フロントエンドエンジニアを中心に採用しています。また、東京に限らずリモートや本社のある秋田での採用も行っています。ぜひ採用ページをご覧ください。

採用情報 | necco inc.（ネッコ）

]]>

Mon, 26 May 2025 23:30:24 GMT

It seems like vibe coding is everywhere these days. It’s become the latest developer trend— fueled by Twitter threads about people vibe-coding entire startups in a weekend.

While it can be fun to vibe code your hobby projects — you might feel tempted to bring that chaotic, spontaneous AI-powered energy into the workplace. Be warned: your teammates probably won’t share your enthusiasm when they're forced to review massive, confusing, and buggy code dumps.

To illustrate this problem for all those tempted to ‘vibe’ at work, we created a comic. The Vibes Are Off features Zig, the office vibe coder, submitting an enormous PR to his beleaguered colleague, Grumpster. Poor Grumpster spends two soul-crushing days cleaning it up, only for the CTO to publicly praise Zig for saving a few hours coding. Grumpster? Well, he’s left quietly dying inside.

Read it here:

pdf='https://victorious-bubble-f69a016683.media.strapiapp.com/Harecompressed_910f61fa09.pdf'

At CodeRabbit, we don’t think that code reviews should cause anyone to begin a decades-long quest for vengeance. That’s why we’re launching a Code Review Etiquette series. Our first order of business? Prevent vibe coders from becoming the most hated developers at their companies.

Edit your code, don't just vibe & pull request

Look, we get it. It feels amazing to watch your AI coding assistant spit out an entire app from a vaguely worded prompt. But we beg you: don’t commit it straight into your PR. AI-generated code is notoriously verbose and loves sprinkling in little surprises like pointless loops or imaginary functions that don’t exist in your libraries.

Take the time to actually read the AI’s output before passing the pain onto your teammates. And once you’re done? Read it again. And then read it yet another time.

Refactor awkward functions, clean out those random bits of nonsense, and check whether your assistant hallucinated APIs (it happens more than you think!). In other words, do the minimum required to ensure that your colleagues won't wish you ill.

This way, your colleagues won’t dread seeing your PR pop up – and you might just avoid becoming the reason they DM each other vibe coding memes.

**Avoid ginormous PRs**

Nothing sparks immediate workplace rage like a PR that reads “+11,374, -3.” If your AI assistant just handed you a PR that could rival the length of War and Peace, you've officially violated every known rule of code review etiquette. The developer equivalent of Emily Post is now very mad at you.

Massive PRs aren’t just annoying—they guarantee that reviewers will miss important issues and send bugs into production. Instead, split your vibe-coded masterpiece into smaller, digestible pieces. You know, like you’d want to receive it.

Even if the AI assistant gave you the entire feature in one go, be kind and break it into logical chunks before subjecting your colleagues to a review so endless that it will leave them feeling like they’re in a Black Mirror episode where time has somehow looped.

Trust us, your teammates will appreciate the effort and you’ll spare yourself some passive-aggressive comments or an immediate ‘Request Changes.’

Ensure your code meets org style and standards

Your organization’s style guide and coding standards aren’t just ‘suggestions.’ Just because the AI you vibe-coded with decided camelCase and snake_case look beautiful together doesn’t mean your colleagues agree. Leaving style issues for your reviewers to clean up is a sure way to build resentment faster than npm builds node_modules.

Make sure your vibe-generated code matches the established style. Yes, it might be boring. Yes, it takes slightly more time. But remember, your colleagues didn’t sign up to debug your assistant’s latest avant-garde variable naming schemes.

Get AI to review in your IDE

If you insist on vibe coding, then at least have the decency to enlist an AI assistant to do the heavy lifting of cleaning up your code first. At CodeRabbit, we now offer AI-driven reviews right inside your IDE (and for free too!). It will catch embarrassing bugs, obvious mistakes, and nonsensical logic that you might miss. And all before a human reviewer even sees it and adjusts their opinion of you down a few notches.

Running these preliminary reviews shows respect for your teammates’ time. It’s the ultimate act of empathy: a robot criticizing you now so your colleagues won’t have to later.

And if you just really enjoy passive aggressive comments from annoyed developers for some reason, don’t worry! You can customize CodeRabbit’s tone to add comments in the same frustrated tone your teammates would!

Warn your colleagues (& thank them after)

If you’re going to unleash AI-generated code into your teammates’ lives – you might want to at least warn them in advance. A quick Slack message saying something like, “Heads up, vibe-coded PR incoming!” lets your team brace themselves mentally (or emotionally). You can even add a few sheepish emojis. We suggest these: 🫣🫠😳.

Even better, acknowledge upfront that your PR might contain more WTFs than usual. After your colleagues spend hours untangling your AI’s bizarre logic (or less time if they also enlist an AI reviewer like CodeRabbit to do a first pass), don’t forget to thank them – even if it’s just in grateful replies to their frustrated comments. Gratitude goes a long way toward smoothing over any vibe-induced frustration.

Code review best practices: Vibe responsibly!

Vibe coding is here to stay – at least as long as AI tools keep getting smarter and developers keep writing multi-part threads on social media. While there’s nothing wrong with experimenting and having fun, your teammates shouldn’t have to suffer for your AI-assisted sins.

If you follow these basic etiquette rules – editing and reviewing before committing, respecting your company’s standards, and warning people before you drop an AI-fueled PR bomb—you might just avoid becoming the most hated dev at your company. Plus, you’ll actually end up producing better code!

Or you could ignore all this advice, keep vibe coding recklessly, and earn yourself a starring role in your office’s most pointed private jokes. But don’t say we didn’t warn you.

Want to sign up for free code reviews in VS Code, Cursor, or Windsurf? Go here!

]]>

Fri, 23 May 2025 01:49:46 GMT

株式会社イノベーションは、BtoB領域を中心に複数のデジタルサービスを展開している企業です。主力サービスである「ITトレンド」は、IT製品やサービスの導入を検討する企業と、提供企業をつなぐマッチングプラットフォームで、多くの法人ユーザーに利用されています。これに加え、ビジネスパーソン向けの動画メディアである「bizplay」や、マーケティング支援ツール「List Finder」など、メディア事業とSaaS事業の両面から企業の業務支援を行っています。

近年では、新たな領域として個人向けの金融情報メディア「ITトレンド Money」やAI関連メディアも立ち上げており、事業の幅を広げています。こうした多様なサービス群はすべて内製で開発されており、複数の専門チームによって支えられています。

同社でのCodeRabbit導入と、その効果について田中さんと宮村さんにお話を伺いました。

フルスタックな開発体制の中で活躍するAIツール

株式会社イノベーションの開発体制は、プロダクトごとに分かれた専門チームで構成されています。主に「メディアディベロップメントグループ」「SaaSディベロップメントグループ」「プラットフォームグループ」の3チームが存在し、それぞれがITトレンド、bizplay、List Finderなどのサービス開発を担っています。データチームなどを含めた技術部門全体では60名強の体制となっており、各チームにはアプリケーションエンジニアやSREが所属。フロントエンドやバックエンドを分けず、フルスタックで開発に取り組むスタイルが採用されています。

同社では、エンジニアがプロダクトに深く関わる構成となっています。開発プロセスでは、レビューの効率化や業務改善を目的にAIツールの積極的な導入が進められており、日常的にさまざまなツールが利用されています。新しい技術やツールの採用に対して柔軟で前向きな文化が根付いており、開発者の自律性と実験精神が活かされる環境が整っています。

導入前の課題

株式会社イノベーションでは、CodeRabbit導入以前から、コードレビューにかかる負荷の大きさが課題となっていました。誤字脱字や細かい記述ミスといった軽微な指摘に多くの時間が割かれ、レビュー担当者のリソースが圧迫されていたのです。また、人によるレビューではどうしても見落としや観点の偏りが発生しやすく、品質の一貫性を保つことにも難しさがありました。

開発リーダーの宮村さんは当時を振り返り、「人間が細かくチェックしないと気づけないような軽微なミスの確認に時間がかかっていました」と語っています。さらに、「レビュー担当者が毎回同じような指摘を繰り返す構図があって、それを何とか自動化できないかと感じていました」と、改善の必要性を強く認識していました。

社内のAI活用方針が後押しに

CodeRabbit導入の決め手となったのは、レビュー精度の高さと、社内のAI活用方針との親和性でした。イノベーションではもともと業務効率化を目的にさまざまなAIツールを試しており、開発部門でも「AIトランスフォーメーション」をキーワードに、日常業務へのAI導入を積極的に進めていました。その中でも、コードレビューという領域はツールによる自動化が効果を発揮しやすいと考えられていたのです。

複数のレビュー支援ツールを比較検討する中で、「CodeRabbitは精度が高く、試してみたら非常に良かった」と担当者は語ります。実際のコードへの適用でも、期待通りの精度で軽微なミスをしっかり検出し、現場からの評価も上々だったことが導入を後押ししました。

「AIツールを積極的に試す姿勢が根付いているからこそ、良いものを見極めて素早く導入できたと感じています」

壁打ちにも役立つCodeRabbit

CodeRabbit導入後、レビュー精度の高さはすぐに実感された一方で、ビジネスドメインへの理解が浅いという課題も浮かび上がりました。自社特有の業務ロジックや背景を理解したうえでのレビューはAIには難しく、「的確だけど、文脈がずれている指摘」が生まれる場面もあったと言います。そのため現在は、CodeRabbitが拾いやすい軽微な指摘はAIに任せ、設計意図や業務理解が求められる部分には人が注力するという役割分担が進められています。

また、レビューを「壁打ち」として使う工夫も導入されています。本格的なプルリクエスト前のドラフト段階からCodeRabbitにレビューを依頼し、初期の設計段階でフィードバックを受けることで、修正の手戻りを減らす工夫として活用されています。これにより「小さな粒度で早めにレビューをかけていく」開発文化が育まれつつあり、レビュー負荷の軽減だけでなく、チーム全体の開発リズムにも良い影響を与えているようです。

「CodeRabbitは、開発チームに自然と溶け込んでいますね」

CodeRabbit導入の効果

CodeRabbit導入後、開発現場ではコードレビューにかかる負荷の軽減を実感しています。特に、誤字脱字や細かい記述ミスといった“軽微な指摘”が事前にAIによって処理されるようになり、レビュー担当者はより本質的なロジックや設計部分に集中できるようになったとのことです。

この効果を受け、当初はITトレンド事業に限定して導入されていたCodeRabbitは、現在ではbizplayやList Finderなど、他のサービスにも展開されています。レビューの効率化に加えて、計測している指標上での生産性の向上にもつながっており、社内からの評価も高まっているとのことです。

「“人が見るべきところ”に集中できるようになったのが大きい効果です。効率化が単なる時間短縮ではなく、レビューの質そのものを高める形で成果に結びついています」

ツールとの連携に期待

株式会社イノベーションでは、CodeRabbitに対してさらなる機能強化への期待も寄せられています。特に、自社のドメイン知識やコーディングルールへの対応がより柔軟になることが望まれており、社内文書や仕様書を参照したレビューや、チームごとのルールに基づいた指摘ができるようになると、さらに実用性が高まると考えられています。

また、他ツールとの連携にも期待が高まっています。たとえば、NotionやBacklogなど社内で日常的に使っているツールとの連携が進めば、レビュー結果の活用範囲が広がり、プロジェクト全体の生産性向上につながる可能性があるとのことです。実際、AIが日常の業務に自然に組み込まれる中で、「使いながら学習してくれる」CodeRabbitのポテンシャルにも強い期待が寄せられています。

CodeRabbitは今後もイノベーション様のサービス開発に寄り添い、より高い生産性を実現するべく進化していきます！

株式会社イノベーションでは、リードエンジニアやWebエンジニアなど、多くのエンジニアを募集中です。ぜひ採用ページをご覧ください。

求人情報 - 株式会社イノベーション

]]>

Fri, 16 May 2025 06:36:18 GMT

AI Code Reviews | CodeRabbit | Try for Free の意訳です

CodeRabbitから大きなお知らせです。AIによるコードレビューが、VS Codeおよびそのフォーク（CursorやWindsurfなど）で直接利用できるようになりました。しかも IDE上でのコードレビューは完全無料 です。

この拡張機能では、既存のレビューツールよりも多くのバグや問題を検出できるよう、IDE内で高品質なレビューを提供する点にこだわりました。CodeRabbitが日々の開発フローに自然と組み込まれることで、レビューのフィードバックをすばやく反映でき、コードの品質向上がスムーズになります。

導入は簡単です。お使いのVS Code/VS CodeフォークのエディタにCodeRabbitプラグインをインストールするだけ。以降は、すべてのコミットを自動でレビューし、PRを作成する前にバグを見つけて、開発スピードを後押しします。

VS Code・Cursor・Windsurfで無料のAIコードレビュー

CodeRabbitは「良いツールは、既存のワークフローに自然に馴染むもの」だと考えています。今回、AIによるコードレビューがCursorやWindsurfにも対応したことで、開発者は作業の流れを止めることなく、スムーズにレビューを受けられるようになりました。

IDEでAIコードレビューを行うメリット

IDE上でコードレビューを行うことで、次のような効果が得られます。

コーディング中に、1回目のコードレビューを自動で実行
問題点を素早く検出し、その場で修正
PR段階でのやり取りを減らし、リリースまでの時間を短縮
より自信を持ってプルリクエストを作成でき、バグの発生も抑制

このように、 IDEとGitプラットフォームの両方でコードレビューを行う多層的なアプローチ によって、より多くのバグを発見し、レビュー時間を削減できます。

IDE上でのレビューは、開発者がコードを書くエディタ上で素早くコードをチェックできます。そして、Gitプラットフォーム上でのレビューは、チーム全体のコードを一元的に確認するのに適しています。それぞれのレビューが異なる役割を持ち、相互に補完し合うことで、品質の高い開発サイクルを実現します。

個々の開発者はIDE内で素早くフィードバックを受け取りつつ、チーム全体では中央集権的な品質管理ができる。この仕組みによって、効率と品質の両立が可能になります。

IDEでのコードレビューとPRでのコードレビュー

自信を持ってコードを書く

開発者からよく聞く課題のひとつに「AIによるコード生成ツールが、バグを見逃す」というものがあります。AIが生成したコードの不具合を、同じAIが検出できるとは限りません。だからこそ、別の独立した視点でコードをチェックすることが重要 です。バグが本番環境に入り込む前に見つけ出すために必要なのです。CodeRabbitがIDE内でどのように機能するのかをご紹介します。

インラインコードレビュー
コードの1行1行を、シニア開発者レベルの目線でAIがレビュー。CodeRabbitはコードエディタ内で、まるでペアプログラマーのようにコメントを付けます。
流れを止めない設計
コードを書く・レビューする・コミットする──この一連の流れを途切れさせません。CodeRabbitはコミット済み・未コミットの変更をすべてレビューし、開発フローを加速させます。
AIでそのまま修正
シンプルな修正提案は「1クリック修正」で即時反映できます。より複雑なフィードバックには「Fix with AI」機能を使い、CodeRabbitのレビューコメントとコンテキストをお好みのAIコード生成ツールに引き継げます。
フォーク互換・言語非依存
CodeRabbitのVS Codeプラグインは、CursorやWindsurfを含むすべてのVS Codeフォークに対応しています。また、JavaやJavaScript、PHP、Python、TypeScript、Go、Rubyなど、一般的に使用される言語もすべてサポートしています。

IDEでのレビューは、PRレビューの軽量＆高速版

IDE内での無料レビューは、CodeRabbitの軽量版によって実行されます。フィードバックは即座に返され、多くのバグを検出できます。ただし、本番環境を想定したユースケースでは、IDEレビューに加え、PRでもコードをレビューするフルバージョンのCodeRabbitをおすすめします。フルバージョンでは、さまざまな情報ソースをもとにした追加コンテキストを活用し、より充実したレビューを行います。

PRレビューをどのようにIDE向けの即時レビューに最適化したのか、技術的な詳細はこちらをご覧ください。

なお、IDEでのレビューにはPRレビューよりも低めのレート制限が適用されます。IDEレビューとPRレビューでどのような違いがあるのか、詳細な機能比較はこちらのドキュメントをご参照ください。

次のステップ

ぜひVS Code、Cursor、WindsurfでCodeRabbitの無料AIコードレビューをお試しください。そして、あなたからのフィードバックをお待ちしています。セットアップは数分で完了します。以下のリソースもご活用ください。

]]>

Thu, 15 May 2025 00:00:28 GMT

エディターでの即時レビューにより、レビューのボトルネックが解消されます。 Cursor、Windsurfなど、すべてのVS Codeフォークで利用できます。

ビッグニュースです！CodeRabbit の AI コードレビューが VS Code 内で直接実行できるようになりました。これはVS Codeだけでなく、フォークであるCursorやWindsurfでも利用できます。この拡張機能では、より多くのバグや問題を特定できるように、最高のIDE上のコードレビューにフォーカスしています。本機能拡張を利用すれば、CodeRabbit が開発ワークフローに直接組み込まれ、レビューのフィードバックを簡単にコードへ反映できるようになります。

使いはじめるのは簡単です。お使いのVS Code/VS Codeフォークへ、CodeRabbitプラグインをインストールするだけです。そうすれば、各コミットをレビューし、PRを作成する前にバグを特定し、コードをより迅速にリリースできるようになります。一番重要なポイントは、コードエディターでのレビューは完全に無料だということです！！

CodeRabbitは、既存のワークフローにシームレスに適合すると確信しています。AI コードレビューがVS CodeやCursor、Windsurf に組み込まれたことで、開発者は集中を切らさずにコードレビューを完了できるでしょう。

IDE上でコードレビューを行えるのは、以下の点で優れています。

コーディング中にコードのファーストレビューを実行する
IDE 内で問題をすぐに見つけて修正できる。
PR 段階でのやり取りを減らし、より高速なデリバリーを実現します。
自信を持ってプルリクエストを作成できるようになり、バグを減らします。

この多層アプローチ (IDE と Git プラットフォーム双方でのレビュー) は、開発者がより多くのバグを発見し、レビュー時間を短縮するのに役立ちます。究極的にはコードレビューは双方に必要であり、多くのメリットをもたらします。IDE上のレビューで、開発者はコーディング中に最初のコードレビューを受けられます。そして、その後にGit プラットフォームでレビューすることで、チーム全体のコードを集中的にレビューし、すべてのコミットが集まった際に発生する問題を簡単に発見できます。

こうしたアプローチは、個々の開発者がコードエディタ上で迅速なフィードバックを得られ、チーム全体で集中的なガバナンス構造を維持できるでしょう。

自信を持ってバイブコーディングを行う

開発者からよく挙がる懸念のひとつに、「AIコーディングエージェントは、なぜ検出が難しい欠陥を見逃してしまうのか？」というものがあります。これらのAIツールは非常に便利ですが、生成したコードに含まれるすべての不具合を検知できるわけではありません。そのため、本番環境に導入する前の段階で、バグを確実に見つけ出すには、“もうひとつの目”として独立したレビューが必要です。VS Code上で動作するCodeRabbitのレビューには、こうした課題を補完する以下のような利点があります。

インラインコードレビュー: AIによるインラインコメント機能を活用することで、まるでシニアエンジニアの目でチェックされているかのようにレビューされます。CodeRabbitは、コードエディター上で常に寄り添ってくれる“ペア・プログラマー”のような存在です。
“摩擦”ではなく“流れ”を重視した設計: CodeRabbitはコーディング → レビュー → コミットという一連の流れを中断させることなく、スムーズに作業できるよう設計されています。ステージング済み・未ステージングの両方のコミットを対象にレビューを行い、あなたの開発フローを妨げることなく加速させます。
AIによる修正提案: CodeRabbitのレビュー提案の中には、ワンクリックで即座に修正できるシンプルなものもあります。より複雑な指摘に対しては、「AIによる修正」機能を使って、レビューコメントとその周辺のコンテキストをお好みのAIコーディングエージェントに渡すことができます。これにより、指摘への対応もスムーズかつ効率的に行えます。
フォークとの互換性、そして言語非依存: CodeRabbitのVS Codeプラグインは、CursorやWindsurfなどを含むすべてのVS Codeフォークと互換性があります。さらにJavaやJavaScript、PHP、Python、TypeScript、Go、Rubyなど、主要なプログラミング言語を幅広くサポートしています。

IDE レビューは、PR レビューの軽量かつ高速なバージョンです

IDE上ではCodeRabbitの軽量バージョンが動作し、コードに対するフィードバックを即座に提供します。ちょっとした記述ミスやバグの兆候もその場でキャッチしてくれるため、開発中の修正がスムーズになります。

一方で、実際のプロジェクトや本番運用を想定したユースケースには、Gitプラットフォーム向けのフルバージョンの利用をお勧めします。より強力なレビュー機能や、チーム開発を想定した機能が備わっており、開発全体の品質と効率を高めてくれます。

CodeRabbitはGitレビューを「インスタントレビュー」に最適化しています。その体験を、ぜひご自身のIDE上で確かめてみてください。

機能の比較: IDE プラットフォームと Git プラットフォームのレビュー

特徴	IDE版CodeRabbit	Git上のCodeRabbit
サポートされているプラットフォーム	VS CodeとCursor、Windsurf、その他すべての VS Codeフォーク	Github、Gitlab、Azure Devops、Bitbucket
レビューの粒度	コミットごと（ステージング済みまたはステージングされていないコード）	PR・MR
平均レビュー時間	2～3分	～10分
インストール	開発者ごとに個別	チーム全体
レビュー数	レート制限の下限	より高いレート制限
ワンクリック修正	✅	✅
Linter / SAST	❌	✅
Webクエリ	❌	✅
グラフ分析	❌	✅
パスベースの命令	❌	✅
Jira/Linear統合	❌	✅
ダッシュボードとレポート	❌	✅
エージェントチャット	❌	✅
学習	❌	✅
仕上げ	❌	✅

では試してみましょう！

CodeRabbit の無料AIコードレビューを VS CodeやCursor、または Windsurf で試してください。そして、皆さんからのフィードバックをお待ちしています。セットアップには数分しかかかりません。下記は、はじめるのに役立つリソースです、ぜひご覧ください。

]]>

Wed, 14 May 2025 07:00:49 GMT

We’ve got exciting news to share! CodeRabbit’s AI code reviews are now delivered directly within VS Code and its forks including Cursor and Windsurf. And code reviews in the IDE are completely free!

With this extension, we focused on building the best quality reviews in the IDE so that we can identify more bugs and issues for you than existing review tools. This brings CodeRabbit directly into your development workflow – making it much easier to incorporate our review feedback in your code.

To get started, simply install the CodeRabbit plugin in any VS Code fork. We will then review every commit, identify most bugs before the PR is raised, and help you ship code faster.

Free AI code reviews in VS Code, Cursor and Windsurf

At CodeRabbit, we believe that the best tools fit seamlessly into your existing workflows. With our AI code reviews coming into Cursor and Windsurf, developers can now get code reviews done without breaking their flow state.

Benefits of AI code reviews in IDE

Code reviews in the IDE are a great way to:

Automate first pass of code review right as you’re coding.
Quickly catch and fix any issues right in your IDE.
Reduce back-and-forth at the PR stage, helping you ship faster.
Help you raise pull requests with greater confidence – and fewer bugs.

This multi-layered approach – with reviews in both the IDE and the Git platform – helps developers catch more bugs and reduce review time. Ultimately, code reviews are necessary in both places and deliver complimentary benefits. With reviews in the IDE, developers get a first pass review right as they are coding. With reviews in the Git platform, the entire team’s code is reviewed in a centralized manner, making it easier to catch issues that may creep in when all the commits come together.

This helps individual developers get faster feedback in their code editor – while also maintaining a centralized governance structure across the team.

Code Reviews in the IDE and in the PR

Vibe code with confidence

One of the key challenges we hear from developers is how AI coding agents can leave defects behind that are hard to catch. These AI code gen tools are not always able to catch defects in the code they generated. This is why having a second, independent set of eyes is critical to ensure that bugs are caught before they end up in production. Here’s how CodeRabbit works in the IDE:

Inline code reviews: Each line of code gets senior developer level attention with AI-powered inline review comments. CodeRabbit becomes your pair programmer, within the code editor.
Built for flow, not friction: Code, review, commit - without breaking your flow state. CodeRabbit reviews every committed and uncommitted change, making your dev workflow faster.
Fix-with AI: Some review suggestions are simple and can be incorporated by using “One-Click Fix” to apply changes instantly. For more complex feedback, our “Fix with AI” feature hands off CodeRabbit’s review comments with associated context to your preferred AI coding agent.
Fork-compatible and language agnostic: Our VS Code plugin supports all VS Code forks including Cursor and Windsurf. Also, all commonly used languages are supported such as Java, Javascript, PHP, Python, Typescript, Go, Ruby and many more.

IDE reviews are a lightweight, faster version of our PR reviews

Our free reviews in the IDE are done by a lightweight version of CodeRabbit that delivers feedback instantly and catches most bugs. For production use-cases, we recommend using the full version of CodeRabbit that also reviews code in the PR, in addition to reviews in the IDE, and comes with additional features that includes additional context from several sources during the review process.

Check out the technical details of how we adapted our PR reviewer for instant reviews in the IDE

Note that the rate limits for reviews in IDE will be lower than rate limits for reviews in PR. For a detailed feature comparison of what is included with reviews in IDE vs reviews in PR, refer to our documentation

Next Steps

Try out CodeRabbit’s free AI code reviews in VS Code, Cursor or Windsurf, we’d love to hear your feedback. Setup takes just a few minutes. Here are some additional resources to help get started:

]]>

Wed, 14 May 2025 07:00:47 GMT

At CodeRabbit, we recently shipped our free VS Code extension, bringing context-rich AI-powered code reviews directly into your editor.

Our engineering philosophy has always been simple: we build tools that fit seamlessly into your existing workflow. While developers have told us our comprehensive PR reviews have helped them ship faster and keep more bugs from production, many also asked for IDE reviews to help check code prior to sending a pull request.

By creating another review stage within VS Code (and compatible editors like Cursor and Windsurf), we've minimized disruptive context switching, allowing developers to catch logical errors and embarrassing typos before they ever send a PR. In our dogfooding, our team has found it helps catch issues early and reduces the iterative back-and-forth that can slow down teams at the pull request stage.

One engineering challenge we faced when designing IDE reviews was re-engineering our code review pipeline to meet developers’ expectations for instant reviews in their IDE. In this post, we’ll share how we thought through this shift and what steps we took to ensure high-quality reviews while reducing the time-to-first comment by ~90%.

Transforming CodeRabbit’s typical code review pipeline

Because CodeRabbit initiates a review immediately after a PR is sent, we optimized our PR code review process for the most helpful reviews and send a notification to the reviewer when our review is completed.

That process takes several minutes and involves a complex, non-linear pipeline that pulls in dozens of contextual datapoints for the most codebase-aware reviews.

We then subject all recommendations to a multi-step verification process to validate that each suggestion will actually be helpful in order to ensure our reviews have as little noise as possible. This involves multiple passes where we identify how parts of the codebase and changes fit together to look for issues before also processing and verifying each suggestion separately. During this process, we don’t share any of those suggestions with the user but wait until we’ve completed all steps in the pipeline in case we find that a suggestion might not be useful. This approach creates a sophisticated, non-linear review graph that is batch delivered to minimize noise.

However, users receiving a PR are unlikely to act on it immediately so the processing time isn’t noticed. In that case, it makes sense to bias for quality over speed. But it’s different with IDE reviews. In the IDE, the user expects that reviews will start instantly and be delivered quickly since they might be waiting for the review before sending a merge request. They want to start working on changes immediately. To create an IDE review that better fit user expectations, we needed to adjust how we approached reviews for that context – without impacting review quality or usefulness.

Balancing quality and speed

Developing a streamlined pipeline able to deliver near real-time feedback directly within your coding environment required that we restructure how events and review comments were both created and transmitted.

This process started by thinking intentionally about the affordances around creating valuable insights in real-time. Real-time reviews would make it impossible for every comment to go through our verification steps to ensure its relevance. But removing those steps from our pipeline would lead to noisier reviews and a higher proportion of unhelpful suggestions. At what point would the ratio of signal-to-noise make an AI code review more of a nuisance, than a help?

Since we also offered a solution at the PR stage, we could make our IDE reviews narrowly focused on what mattered most at that stage. We decided that anything that was architectural or required deeper verification with the full codebase was better suited for the PR stage since trying to deliver valuable insights around that would require multi-step validation and take too long. In our PR reviews, we even go so far as to check code by attempting to run it in our sandbox. But that would be impossible in real-time.

We decided instead to prioritize reviewing for mistakes, specification alignment, bugs, and logical issues that a developer might miss while coding or editing. Since those are the type of things that could require a PR be sent back for changes or make your colleagues question how you could have missed it, we saw these as the most critical issues to look for in the IDE. For example, in my own use of our IDE reviewer, it found a conditional that I changed by mistake and didn’t notice.

We also still wanted to have a verification process that would validate that any suggestions we made would be beneficial but needed to build a more lightweight process for doing so.

Our goal was to develop the best IDE reviews but we knew they didn’t need to be as comprehensive as our PR reviews. The goal of IDE reviews is to streamline the PR process with an IDE check to help devs tackle the most critical changes and merge more confidently. At which point, the code would undergo our more in-depth PR review.

How we engineered it

Here’s how we tackled developing a new pipeline for our IDE reviews.

Review processing and delivery

One of the biggest changes we made was to how we process and deliver reviews. While in the SCM, we process all suggestions together and wait until we’ve completed our review before delivering the results to users, in the IDE we opted to deliver suggestions iteratively in real-time as our pipeline created them. While we could have opted for a more superficial review and delivered the complete results faster, we wanted to balance comprehensiveness and quality with the user’s need for instant feedback. By sharing suggestions over a longer period of elapsed time, we were able to buy extra time to include additional steps in the process to improve the quality of the suggestions.

Because we opted to engineer our system in this way, that also meant that CodeRabbit was able to give continuous feedback as you code using the same review process and pipeline.

Context preparation

Context enrichment is a big part of how we deliver the most codebase aware and relevant recommendations. However, our context preparation process is extremely comprehensive for our PR reviews. We go so far as bringing in linked and past issues, cloning the repository, and even building a Code Graph of your codebase to analyze dependencies. That kind of context enrichment wasn’t possible in the IDE. Instead, we focused on the code primarily to keep the context lighter for a faster review.

In the future, we are also planning on adding user-specific Learnings similar to how we do org-level Learnings for teams. That will ensure that code reviews will improve in relevance over time as the agent learns from your past commits and feedback.

Prompt and model optimization

Despite rearchitecting our pipeline to be more linear and to cut down on steps, we didn’t alter the multi-model orchestration that we use for our PR reviews. We use the same orchestration of models for both kinds of reviews but we created different choices of weights to intelligently select the models to process different parts of the review. We also finetuned how our decisioning engine worked to create a more linear process flow. Finally, we optimized our prompts for faster response times and the different priorities we’d identified for IDE reviews.

Choosing non-streaming LLM responses over streaming responses

We first thought streaming responses—generating words one by one from a language model like in ChatGPT—would be ideal for our IDE reviews, especially since it’s popular for real-time tasks. But because our prompts are large and we perform significant context engineering before starting our reviews, we ran into problems like garbled output from the model and missing tool calls.

Users expect review comments to be complete, unlike casual chats with AI coding assistants. So, we have to clean up the model’s output before showing it. Since streaming models their output in chunks, we had to buffer the LLM output until the model generated a full comment before sending it through our processing pipeline. This delay meant streaming didn’t help much in our case. Instead, we chose to wait a bit longer to get complete outputs for a bunch of files simultaneously.

UI changes

We had to design our UI from the ground up in VS Code. At first, we thought of adding our own comments panel where we would add comments similar to how we do it in SCMs. But we realized that a comments panel wasn’t the right way to do real-time comments. We decided to integrate our comments more directly into the editor so that users get the comment where the code is. We find this is a more IDE-native strategy and creates a better UX flow.

Working with users’ preferred AI coding agents

Because so many developers use AI coding tools in their IDE, we wanted to take advantage of that to give devs choice around how to resolve a suggestion from CodeRabbit. We give users the option to resolve the issue with code we suggest or to pass over the suggestion to the users’ preferred AI coding assistant to suggest code. All in one click.

What’s next on the roadmap?

We might have launched version 1.0 of our IDE reviews but we have a number of things in our roadmap to make them even more helpful.

Near term

User-level Learnings: We’ll be adding the ability to add Learnings or give feedback on suggestions so our agent automatically learns the suggestions you like and don’t like. We currently have org-wide Learnings in the SCM but want to extend this feature to individual developers who want to add custom Learnings that will only apply to them.
Tools: We’re planning on bringing the 30+ tools you can use in our PR reviews to our IDE reviews. If you’re already running linters in your IDE, you’ll be able to do it all in one review with CodeRabbit. Look out for this addition later in the year.

Longer-term

Web Queries: We plan to integrate our context enhancing features into our IDE tool so your code is bug-free with lesser false positives and your reviews are always up-to-date on versions, library documentations and vulnerabilities, even if the LLM isn’t.
Docstrings: Want to create Docstrings before you merge? We’ll be adding this feature that’s currently part of our PR reviews to our IDE reviews in the future.
More codebase awareness: We will be bringing in additional data points to create more codebase aware reviews in the IDE.

Try our free IDE reviews here. Interested in tackling similar challenges? Join our team.

]]>

Thu, 01 May 2025 06:49:07 GMT

「データ分析から新しいイノベーションを起こす」を企業理念に、Web広告の最適化ツールを提供しているログラフ。ファーストパーティーデータを収集・管理・運用する「Omni Data Bank」とコールトラッキングの「Call Data Bank」、そしてLINEの友だち追加トラッキングツールの「L Data Bank」という3つのサービスを展開しています。また問い合わせ時点や経過の音声やテキスト情報から、サービス契約者用にカスタマイズしたAIにより将来の価値を推定することができる「Value Data Bank」というサービスがローンチ間近となっています。

ログラフではコードレビューに対する負荷を軽減すべく、CodeRabbitを導入しています。今回はその導入経緯や社内での反響について、CTOの丹羽さんにお話を伺いました。

インターンや異業種からのエンジニア転身者も

ログラフでは主に3つの主軸となるサービスを開発・運営しています。これらは約30名のエンジニアによって支えられており、それぞれのサービスに対して専任のチームが設けられています。エンジニアは主にバックエンドを開発していますが、必要に応じてフロントエンドの開発も行うフルスタックのメンバーで構成されています。

社内のエンジニアの多くはジュニア層が中心で、インターンとしてスタートしたメンバーもいます。また、異業種からの転職者も多く、多様なバックグラウンドを持ったエンジニアで構成されています。社内では生成AIツールの導入が進んでおり、開発の効率化を図っています。

レビュアー不足を補う存在として導入

そうした中、ログラフではレビューができるミドルクラスのエンジニア不足が課題でした。丹羽さん自らレビューを行う時期もあり、経験あるエンジニアへレビュー負荷が偏ってしまっていたのです。さまざまな取り組みを通じてレビュー負荷軽減を目指してきた中で出会ったのがAIでした。そこで、導入されたのがCodeRabbitです。ログラフ社での導入は約2年前と、かなり初期の頃から活用しています。

「コードレビューができる人材には、コードのロジックを理解する能力が必要です。そのため、一定の経験を持ったエンジニアでないと十分なレビューは難しいです」（丹羽さん）

CodeRabbitは既存のレビュー体制を補完し、より迅速に開発サイクルを回すためのツールとして評価されています。従来の人によるレビュー方法では不十分だった部分をカバーできる点がポイントであり、社内のエンジニアからも高評価を得ています。CodeRabbitの導入により、レビュー工程の最適化と負担軽減が期待され、結果として開発全体の生産性向上が実現しています。

ただし、その内容を鵜呑みにしてしまわないよう注意しているとのことです。

「生成AIによるレビュー支援は、多くの部分で効率を向上させます。しかし、無条件でその結果を受け入れず、各自が検証する姿勢が求められます。レビューの過程でロジック理解を深めることが重要で、レビュー体制強化の鍵だと考えています」

ログラフ社がCodeRabbitを導入するに至った大きな要因の一つは、セキュリティ面での信頼性でした。CodeRabbitは、学習データを外部に利用しないことを約束しており、この点が非常に評価されています。開発プロセスの中で扱う重要なデータが外部に漏れる心配がなく、安心してCodeRabbitを利用できています。

「CodeRabbitはデータをセキュアに扱い、レビュー後は適切に破棄して学習に使わないと明記されています。そうした点があって、安心して導入できました」

CodeRabbitは開発チームにとけこんでいます

ログラフ社では、利用しているすべてのプロジェクトに対してCodeRabbitを導入しています。また、各プロジェクトでレビューが異ならないように、設定は統一しているとのことです。現状では必要最低限の設定のみに留めていますが、将来的なカスタマイズには意欲的です。たとえばプロジェクトのコーディング規約に合わせたプロンプトのカスタマイズなどが挙げられています。

課題として、エラーハンドリングについて過剰だと感じてしまうケースがあるとコメントしています。とはいえ、社内での反応は上々で、CodeRabbitは開発チームに受け入れられています。

「CodeRabbitは人間のレビュアーと同程度に考えており、その作業精度の向上に期待しています。社内ではCodeRabbitをウサギさんと呼んでいて、開発チームの中にとけこんでいます」

2年間運用しているとあって、ログラフ社ではさまざまな工夫が行われています。指摘事項を実行しない場合にも、その理由を記載するようにしたり、PR作成時の説明文は長くならないようにしています。また、CodeRabbitはファイルのパスから文脈を把握するので、責務の分離が明確になるディレクトリ構成を保っています。他にもさまざまな運用上のナレッジが蓄積されています。

CodeRabbitを導入した結果、ログラフ社では開発効率の大幅な向上が見られています。特にプルリクエストの作成コストが大きく削減され、多くのエンジニアがCodeRabbitの便利さを実感しています。メンバーはより重要な開発作業に集中できるようになり、全体的な生産性が向上しました。

「CodeRabbitは大きな効果を上げているものの、現在は逆に利用者側のスキル向上が求められています。ツールをただ受け入れるのではなく、常にその指摘を検証して実践に活かさなければなりません。特にジュニア層には、AIツールを使いこなしつつ、その裏にあるロジックを理解して欲しいと期待しています」

AIを開発に活用しないリスク

ログラフ社のオンボーディング資料の中では、CodeRabbitをレビュアーの一人として扱うように明記されています。ログラフ社における開発体験の一部として、CodeRabbitが組み込まれ、活用されています。AIを利用した開発体験について、丹羽さんは今や使わないことによるリスクの方が大きいと考えています。

「かつては書籍やGoogle検索などで調べてコーディングを行っていました。今はそれが生成AIになってきています。そうしたさまざまな方法を使えるようにならないとダメだと感じます。使う・使わないではなく、使えないことのリスクの方が大きくなっています」

CodeRabbitは今後もログラフ社の開発生産性を高めるべく、進化し続けます。

ログラフ社では、フロントエンドエンジニアやバックエンドエンジニアを募集しています。AIを活用した開発体験を通じて、エンジニアとしての成長を目指す方にとって、ログラフ社は魅力的な職場です。興味のある方はぜひ、ログラフ社の採用情報をご覧ください。

RECRUIT - 株式会社ログラフ

]]>

Fri, 18 Apr 2025 06:24:21 GMT

How to Do Thoughtful Code Reviewsの意訳です。

効果的なコードレビューには、丁寧で明確なコミュニケーションが不可欠です。チームが拡大するにつれ、これは大きな課題となります。各開発者が、それぞれコードを作成するためです。そのため、適切にドキュメント化されたコードレビューのガイドライン、プロセス、原則が非常に重要です。

コードレビューは、ソフトウェア開発ライフサイクルにおける重要なステップです。省略すべきものではありません。ただし、その実施方法については長年にわたり議論が続いています。

組織ごとにコードレビューの取り組み方は異なります。

すべてのメンバーがコードをレビューする組織もあれば、レビューを1人のチームメンバー（通常は著者より経験豊富なエンジニア）に任せる組織もあります。

大手テクノロジー企業では、コードレビュー全体のプロセスを管理するための社内ツールを導入しています。たとえば、GoogleはCritiqueやGerritを、MetaはPhabricatorを活用しています。さらに、ジェネレーティブAIやAIコードアシスタントの普及により、これらを取り入れてコードレビューを自動化する動きも広がっています。

組織ごとにコードレビューの進め方が異なるように、ソフトウェアエンジニア自身も、コードレビューのやり方についてさまざまな意見を持っています。ただし、これらの意見は、実際の職場環境とは一致しないこともあります。つまり、エンジニアが理想とするコードレビューのあり方（コードレビュー）と、職場で実際に求められている方法との間にギャップが存在するのです。

私たちは、X上でエンジニアが語ったコードレビューに関する意見をいくつか発見しました。それらは一見の価値があります。

これらのエンジニアは皆、それぞれの意見に対して正当な理由を持っています。中には、自身の実体験に基づいて意見を述べている人もいます。

私たちは、こうした多様なコードレビューに関する意見を踏まえ、以下のように分類しました。

「コードレビューが好き」グループは、レビューを通じての改善を望む人たちです。無条件にレビューを求め、職場で決められたプロセスに素直に従います。
「コードレビューは好きだけど」グループは、レビューそのものには肯定的ですが、いくつかの条件を必要とします。たとえば「コードレビューは好きだけど、あなたのコードスタイルを強制しないで」や「コードレビューは好きだけど、AIエージェントで自動化すべき」といった意見が挙がります。
「コードレビューは必要ないと思う」グループ（ややマイナーな立場）は、レビューを他人の基準に合わせるだけのものと捉え、価値を見出していない場合があります。

いずれのグループにも、それぞれ納得のいく理由があります。この投稿では、そうした背景を踏まえつつ、より良いコードレビューのためのヒントを共有していきます。まずは、「責任追及型の文化」に関する話題から始めましょう。

コードレビューの責任追及文化

コードレビューは、作者とレビュアーの間で何度もやり取りが発生する、手間のかかるプロセスであることはよく知られています。もし適切に運用されていなければ、コードの欠陥を見つけ出すことが主目的となり、レビュー中にバグが見逃された場合には、誰に責任があるのかを問う雰囲気が生まれてしまいます。私たちはこうした状況を「責任追及文化」と呼んでいます。

コードレビューは、コードの品質や一貫性、そしてベストプラクティスの実践を確保することが目的です。

コードレビューを対立の場として捉えるべきではありません。同様に、レビュアーのフィードバックも、個人ではなく課題そのものにフォーカスする必要があります。もしチーム内で、個人攻撃、過度な揚げ足取り、防御的な態度といった行動が見られるようになると、やがて責任追及の文化が根付いてしまいます。

コードレビューでバグが見逃されたとしても、最初に考えるべきは「誰の責任か？」ではなく、「何がうまくいかなかったのか？どう改善すべきか？」です。この考え方こそが、健全なチームマインドセットです。

責任追及の文化は有害であり、コードレビューに持ち込むべきではありません。

この点をしっかりと認識したうえで、次にコードレビューを適切に実施するための方法について解説していきます。

適切なコードレビューの実施方法

コードレビューに万能なアプローチはありません。しかし、チームは透明性のあるプロセスと明確な基準のもとで取り組むべきです。思慮深いコードレビューは、単なるエラー発見にとどまらず、知識の共有やチームとしての協業、そして共感の醸成を目的とします。

以下に、思慮深いコードレビューを行うために重要だと考えるポイントを挙げます：

事前自己レビューを実施（自分の仕事に責任を持つ）

まずは、自分のコードを自分自身でレビューしましょう。Create Pull Request を押す前に、コードを見直す習慣を持つことが大切です。これは、自分の力を疑うということではなく、「人は誰でもミスをする」という前提に立った行動です。それ以上に大切なのは、レビュアーの時間と労力を大切にする姿勢を示すことです。

コードやテストにエラーやバグがないかをしっかり確認してから、ピアレビューに提出しましょう。

自己レビューの際は、高リスクと低リスクの変更に意識を向けることが重要です。

高リスクの変更 — 重要なビジネスロジック、セキュリティ、パフォーマンスに影響を及ぼすような変更は、特に慎重にチェックする必要があります。
低リスクの変更 — たとえば、小規模なリファクタリングやドキュメントの修正などは、簡潔なレビューでも問題ありません。

レビューに優先順位を設けることで、深刻な問題を早期に発見しやすくなり、全体のプロセスも効率よく進めることができます。

小さなPR

大規模な変更を含むコードを徹底的にレビューするのは非常に困難、というよりほぼ不可能です。

1,000行を超えるコードを一度にレビューしたい人はまずいません。単に疲れるだけでなく、見落としが増え、フィードバックも遅れがちになります。こうした問題を避けるためにも、大きな変更は小さく論理的に分けたPRに分割しましょう。これにより、開発のスピードも向上します。

GoogleやFacebookのような大手企業、あるいは成熟したエンジニアリングチームでは、「小さなコミット」の文化が根付いており、それによって「マージ地獄」を防いでいます。

目安： レビューに10〜15分以上かかるPRは、大きすぎる可能性があります。

共感を持って書く/レビューする

コードを書くときも、レビューをするときも、相手の気持ちを想像することが大切です。ベストプラクティスに従い、読みやすく整ったコードを書くことは、自分のあとにコードを読む人への思いやりの表れです。共感とは、相手の立場になって考える姿勢です。

たとえば、「この関数は少し珍しいと思います。もう一度確認していただけますか？」という表現は、「この関数は悪いです。リライトすべきです」といった言い方よりも、はるかに共感的です。

PRへのコメントは、個人的な感情に基づいたり、あいまいな内容になったりしてはいけません。

❌ ヌル値のチェックをしていません。

✅ この入力値はヌルになる可能性があります。サーバーエラーにつながるおそれがあるため、ヌルの場合はクライアントエラーを返すようにしてください。

2つ目のポイント：・コードにフォーカスし、人を責めないこと。・改善提案は具体的かつ明確にすること

良いコードレビューの目的は、単にバグを見つけることではありません。チーム全体がより良いコードを書くための支援をすることです。もしフィードバックが攻撃的に聞こえてしまうと、誰も耳を傾けようとしなくなります。

コードレビュープロセスを自動化しましょう

繰り返し同じ作業をするのが好きな人はいません。退屈で、すぐにイライラしてしまいます。実際、コードレビューのサイクルはそうなりやすいものです。だからこそ、繰り返し発生する手動作業は、可能な限り抽象化・排除すべきです。

自動化ツールを活用すれば、ワークフローの効率化が可能です。特にリンターのようなツールは、コードレビューの一部を自動で処理するのに非常に効果的です。

AIコードエージェントを活用すれば、コードレビューの一部を自動化することが可能です。

もちろん、人間によるレビューは今後もしばらくは不可欠です。ただし、リンターやフォーマッターが、チームで合意されたベストプラクティスの遵守に役立ってきたように、AIベースのコードレビュー自動化ツールもまた、AIによる高度なリンティングを通じて、ノイズを減らし、影響の大きな問題に注目を集める手助けをしてくれます。

注意：AIベースのツールにはハルシネーションのリスクがあります。期待する出力を得るには、LLMに対してレビューの方針や基準を細かく指示・調整する必要がある場合があります。

チームで標準を合意する

私たちは皆、人間としてそれぞれ異なる偏見や好みを持っています。物事の進め方に対する「自分なりのやり方」があるのは自然なことです。ですが、その個人的な好みや偏見を他のメンバーに押し付けるべきではありません。

自分のコードスタイルを他人に押しつけるのはやめましょう。

パス閾値を設定する

コードの承認は、「レビュアーの気分次第」であってはなりません。チームとして、明確なパス閾値（合格基準）を定めることが必要です。

これは、メインのコードベースにマージするために提出されたコードが満たすべき最低限の基準を意味します。

パス閾値として設定できる基準には、パフォーマンスベンチマーク、セキュリティ要件、読みやすさなどがあります。

GitHub Checksは、合格閾値の自動化とその強制をサポートしてくれる仕組みです。プルリクエストがマージされる前に、特定のチェック項目をクリアするよう設定することで、コード品質の一貫性を保ちながら、コードレビューのスピードアップにもつながります。

クリアなフィードバック

フィードバックは、具体的で行動につながる内容を心がけましょう。文脈が不明確な場合は仮定せず、質問を通じて確認してください。良いコードレビューとは、単なる批判ではなく、コラボレーションなのです。

疑問があれば、あいまいにせず、具体的かつ実行可能な形で伝えるようにしましょう。

悪い例： これは間違っています。修正してください。
良い例： このアプローチにはレース条件が発生する可能性があります。ここにロックを使用する方が良いでしょうか？

すべての問題に対して自分で解決策を出す必要はありません。大切なのは、著者が気づきを得て、より良いコードに改善できるようサポートすることです。

まとめ

コードレビューは、時間と労力を要するプロセスです。

だからこそ、慎重かつ丁寧に取り組むことで、関わる全員の負担を軽くすることができます。うまく実施すれば、チーム全体のコード品質の向上や一貫性の確保に加え、協力し合える文化、オープンな対話、そして学びのある環境を育むことにもつながります。

あなたのチームでは、どのようにコードレビューを行っていますか？また、今後どのように進めていきたいと考えていますか？

]]>

Wed, 09 Apr 2025 12:28:15 GMT

「プロフェッショナル・テックで、次の常識をつくる。」をミッションとして、契約マネジメントプラットフォーム『クラウドサイン』や人々と専門家をつなぐポータルサイト『弁護士ドットコム』『BUISINESS LAWYERS』『税理士ドットコム』などのサービスを展開する弁護士ドットコム株式会社。多くのエンジニアを抱える同社では、エンジニアの生産性向上を重要視しており、多くの施策を実施しています。その一つとして、AIツールの活用を挙げています。

弁護士ドットコム社では、AIツール活用の一端として、CodeRabbitを利用いただいています。今回はその体験と実践的活用について、クラウドサイン事業本部の須山さんにお話を伺いました。

レビュー負荷が課題に

須山さんはクラウドサイン事業本部において、バックエンドエンジニアとして、システム基盤のリファクタリングやリプレイスに重点的に取り組んでいます。リファクタリングを進める中で、大規模なコード修正が発生し、そのためのコードレビュー負荷が大きかった点を課題に挙げています。

「既存システムのAPIをリプレイスしようと思うと、そのエンドポイントが3桁になることもあります。そうなるとレビューの負荷は非常に高くなります。リクエストとレスポンスを細かくチェックする必要があり、人的ミスも発生しやすくなります。」

そうした中で、レビュー負荷の軽減を目指して導入したのがCodeRabbitになります。

「当社ではエンジニアの生産性向上を重視しており、AIツールの導入にも積極的です。レビューの効率を上げたいと考え、CodeRabbitを導入しました」

導入要因となった3つのポイント

CodeRabbitに期待したポイントとして、機械的なチェックに強い部分を挙げており、ケアレスミスを減らすのに役立つと考えたとのこと。そして、導入の決定要因として、3つのポイントを挙げています。

1点目は、プルリクエストの自動レビューと変更内容のサマリーが、レビューの効率化に役立つと考えたことです。実際、コードの確認作業がスムーズになり、レビュー時間の短縮につながっています。

2点目は、コスト面での利点です。他のツールは従量課金制が多く、使用量によってコストが増減するため、予算の管理が難しい側面がありました。一方、CodeRabbitはサブスクリプション型のため、一定のコストで運用でき、予算の見通しが立てやすい点が評価されました。結果として、上層部への説明もスムーズになったとのことです。

そして、3点目としてセキュリティが挙げられます。セキュリティを重視するクラウドサインにとって、認証認可周りをはじめとしたコードを外部ツールに出して良いのかという議論が出ました。そうした中、CodeRabbitはコードを外部に出さない（学習しない）仕組みがあり、安心して利用開始できたとのことです。

「導入検討していた時に、外部勉強会で知り合った他社の優秀なエンジニアの方が絶賛しており、当社でも試験導入してみたところ、予想以上の効果があったため本採用に踏み出しました」

CodeRabbit導入による効果

こうしてCodeRabbitを導入したクラウドサインですが、その効果を須山さんは実感していると言います。

「エンジニアの生産性向上に大きく寄与していると感じます。レビューの負荷が軽減され、エンジニアが開発に集中できる環境が整ってきています。ケアレスミスをCodeRabbitがチェックしてくれるので、レビューの質向上にもつながっています」

特に、変更内容をサマライズする機能が、レビューの時間短縮につながっているとのこと。その結果として、開発プロセス全体がスムーズになり、チーム全体の生産性向上とプロジェクト進行の加速につながっています。

「CodeRabbitの学習機能が便利だと感じます。徐々にレビューの精度が向上しており、エンジニアがより良い提案を受けられるようになっています。チーム内でのコミュニケーションが円滑になっているので、開発品質の向上を実感しています」

運用上の工夫

クラウドサインにおける運用上の工夫として、path_filtersの活用を挙げています。path_filtersを利用し、特定の言語やファイルパスに基づいてレビューをカスタマイズしています。たとえば、Go言語のコードに対して特定のレビューを行ったり、特定のパス以下のファイルをレビューから除外したりするといった具合です。これにより、不要なレビューを避け、効率的なコードレビューが可能になります。

また、path_filtersによって自動生成されたファイルをレビュー対象から外すことができます。こうした設定はレビュー精度を高め、開発者の負担軽減につながっています。

「path_filtersの設定はYAMLで宣言的に行えますし、後からの修正も容易で便利ですね」

クラウドサインでは設計書もリポジトリで管理されており、コードと同様のレビュープロセスを経るようにしています。そうすることで、早期の課題発見、解決につながっています。各リポジトリでは .coderabbit.yaml のプロンプトをカスタマイズしており、プロジェクトに最適化されたレビューを得られるように工夫しています。

CodeRabbitへの期待

CodeRabbitの導入について、社内でも良好であると須山さんは述べています。エンジニアからは、コードレビューの効率化やケアレスミスの減少といったポジティブな反応が寄せられており、全体的な開発生産性の向上に寄与しているとのことです。また、AIによる実装提案機能も、コードの質を高める上で有用であると評価されています。

そして、今後さらに期待したい機能として、独自のLLM切り替え機能を挙げています。

「よりクラウドサインや弁護士ドットコムに最適化されたLLMを利用できれば、エディタやレビューツールを同じ目線で利用できて、さらに生産性向上が期待できそうです」

CodeRabbitは、今後もさらなるレビュー精度の向上と機能追加を通じて弁護士ドットコム/クラウドサインの開発生産性向上に貢献します。

弁護士ドットコム/クラウドサインではAIツールを使いこなすエンジニアを募集中です！

弁護士ドットコム/クラウドサインでは、幅広くエンジニアを募集しています。現在募集している職種については、下記URLを参照してください。

エンジニアポジション一覧 | 弁護士ドットコム株式会社

]]>

Mon, 24 Mar 2025 21:48:00 GMT

Relicは新規事業の創出をドメイン領域とし、アイディア出しからプロトタイプの開発と検証、そして開発から成長に至るまで、一気通貫で新規事業を共創することを強みとしています。特に最近では、生成AIや先進技術を活用した事業を展開するAI Transformation Groupを立ち上げ、これにより新たなビジネスアイディアの創出と実現可能性の検証を行っています。

本記事では、Relic社におけるCodeRabbit導入の経緯と効果について、技術統括部の米田さんとプロダクトディベロップメント事業部の山本さんに話を伺いました。

レビューの効果を十分に発揮したい

Relicはスタートアップ支援を数多く手がけています。0〜1/1〜10/10〜100といった各成長フェーズに対して最適な提案と体制を提供しています。業種に関しては特に限定せず、大企業の新規事業からスタートアップまでさまざまな規模の企業に対してサービスを提供しています。

そうした中で、大小さまざまなプロジェクトが社内で展開されており、ガバナンスが効きづらいという課題がありました。約100名のエンジニアが在籍する中、個別のプロジェクトは少人数のメンバーで構成されており、レビュアーが開発者と同じナレッジを持ちづらかったり、レビュー工数の確保が難しい状況でした。

レビューの質が低下してしまうと、開発者が新しい気付きを得る機会が減少するリスクもあります。

「レビューを行えるスキルを持ったエンジニアは限られており、負担が集中してしまうのが課題でした。レビューの質が低下してしまうと、初歩的なミスが見逃されてしまう可能性があります。また、プロジェクトの内容を十分に理解していないと、レビューの効果は十分に発揮できません」

導入決定要因は「セキュリティ」と「導入の容易さ」

こうした課題を解決するため、RelicではCodeRabbitの導入を検討しました。その目的として、レビューの効率化と質の向上を図りたかったと言います。

「機械的に検出できる部分や、初歩的なミスをCodeRabbitの時点で検出して欲しかったのが導入要因です。エンジニアの負担を減らし、レビューの質の向上を期待しました」

CodeRabbitの導入決定要因として、米田さんは2点挙げています。1点目はセキュリティ面です。Relicでは大手企業との取引が多く、彼らはセキュリティを重視しています。彼らのセキュリティ基準を満たすのは必須です。その点において、CodeRabbitはコードを学習に利用しないといった点や、SOC2などのセキュリティ規格を取得していることから、信頼性が高いと判断しました。

もう1点は導入の容易さです。CodeRabbitは導入が容易であり、すぐに運用が開始できた点も魅力だったと言います。

「開発者の生産性を損なうことなく、迅速にレビューの効率化を図れました」

プロンプトのカスタマイズがスムーズな運用の決め手

CodeRabbit導入後の課題について、山本さんはプロンプトのカスタマイズを挙げています。

「初期設定のままではチームのコーディングルールに合わない指摘や、不適切なフィードバックが見られました。そのため、CodeRabbitによるレビューを無視してしまう傾向が生まれてしまいました」

この問題を解決するため、プロンプトのカスタマイズを実施。チームのコーディングルールや設計思想に合わせた指摘ができるように調整しました。また、GitHubのブランチ保護ルールを活用し、CodeRabbitからの指摘を確認しないとマージできないようにも設定しました。こうした改善によって、指摘の有用性が認識されるようになり、スムーズに運用が進むようになりました。

新卒研修でもCodeRabbitを活用

CodeRabbitが運用されることで、Relicではコードレビューの効率化と質の向上が実現されていると言います。

「初歩的なミスの指摘はCodeRabbitに任せて、レビュアーは本来注力すべきロジックや設計部分に集中できるようになりました。また、レビュープロセスが標準化され、チーム全体でのコーディングルールの統一が進んでいます」

Relicでは、新卒研修においてもCodeRabbitを活用しています。新入社員がプルリクエストを提出した際に、CodeRabbitが自動的にレビューを行って、問題点を指摘します。これはプログラミングの基本的なスキルの効率的な習得につながっており、オンボーディングプロセスの効率化につながっているとのことです。

CodeRabbitに期待した機能

CodeRabbitをうまく運用するための課題として、米田さんは2点挙げています。1点目は、プロンプトのカスタマイズです。

「プロンプトを修正して、その効果を確認する際には再度プルリクエストが必要です。たとえばシミュレータのような機能で、どうレビューが変化するか事前に確認できるとより効率的に調整が行えると考えています」

2つ目は、指摘内容に対する有益性をフィードバックできる機能を挙げています。

「レビューに対して参考になった、または参考にならなかったというリアクションを送れる機能があると、レビューの質が上がると思います」

CodeRabbitは、今後もさらなるレビュー精度の向上と機能追加を通じて、Relicの新規事業開発に貢献していきます。

RelicはAI/LLMを活用した新規事業立ち上げ支援に興味があるエンジニアを募集中です！

Relicでは、AI/LLMエンジニアとして、AI/LLMの技術を活用した新規プロダクト開発やプロジェクトの技術設計・実装を担うエンジニアを募集しています。

最先端のAI/LLM技術で、新たな価値と未来を共に創造するエンジニアを募集 - 株式会社Relic

ほとんどの開発者は、ドキュメントよりもコードの作成を優先します。ドキュメント作成が特に難しいわけではありませんが、納期に追われる環境では後回しにされがちです。まるで、金曜の午後になってようやく本番環境のバグを修正したり、テストされていないコードを整理したりするようなものです。

しかし、なぜドキュメント作成は常に開発の優先順位を下げられてしまうのでしょうか？

このブログでは、ドキュメント作成が軽視される背景にある心理的な偏りや技術的な障害、組織的なプレッシャーについて掘り下げます。また、ソフトウェア開発プロジェクトにおける不十分なドキュメントの影響を考察し、ドキュメント不足の状況を改善するための具体的な戦略について議論します。

ドキュメントの種類とカテゴリーについて

内部ドキュメント：コードベースに直接関わる開発者、テスター、その他プロジェクトの細部に携わる人々のためのドキュメントです。ベストプラクティスや技術的な指示、限られた人しか理解する必要のない複雑なアルゴリズムの解説など、プロジェクトの内部運用に欠かせない指針を提供します。いわば、秘伝のレシピのようなものです。
外部向けドキュメント：開発したソフトウェアを使用するユーザーや顧客、またはソフトウェアと統合する開発者向けの資料です。APIリファレンスやユーザーマニュアル、プロジェクトの採用を左右する重要なWikiなどがこれに含まれます。

このブログでは、特に他の開発者がコードベースのデバッグや学習に活用できる内部開発者向けドキュメントに焦点を当てます。

アジャイルの現実：「後でドキュメントを作成する」

2週間のスプリントと毎日のデプロイが当たり前の現代では、ドキュメントは真っ先に犠牲になることが多いものです。私たちは皆、一度はこうした状況に陥ったことがあるでしょう。スプリントが終了し、PRをマージし、機能のリリースを急ぐプレッシャーに追われる。「次のスプリントでAPIドキュメントを追加しよう」と自分に言い聞かせつつ、心の奥では「次のスプリント」が来ないかもしれないと気づいている——そんな経験はないでしょうか？

アジャイル開発の現実は、これをさらに悪化させます。APIドキュメントの更新とバックログの次のストーリーへの対応、どちらを優先するかと問われれば、どちらが選ばれるでしょうか？ストーリーポイントや速度の測定基準では、ドキュメント作業はほとんど考慮されません。これは技術的負債と似ていますが、目立たない分、見過ごされがちです。しかし、いずれ影響が表れないわけではありません。

もし、スプリントごとにドキュメントをしっかり更新している開発チームがあれば、ぜひ教えてください。

以下に、よくある状況を紹介します。

// スプリント1：初期実装
async function processOrder(orderId) {
  // TODO: API ドキュメントを追加...
}

// スプリント 3: 新しいパラメータを追加
async function processOrder(orderId, options) {
  // TODO: API ドキュメントを更新...
}

// スプリント 5: 仕様変更
async function processOrder(orderId, options, callback) {
  // TODO: 本当にドキュメントを更新する必要がある......
}

「後でドキュメント化する」という考え方は、深刻な結果を招くことがあります。TODOコメントがそのまま放置され、API仕様は曖昧なまま。そして半年後、チームは会議室にこもり、なぜ支払いサービスが注文サービスの想定していないコールバックを送信しているのかを解明するために何時間も費やすことになるのです。

これまでにIDEで CTRL+F を使って「TODO」を検索したことがありますか？
もし検索結果が10件未満だったなら、あなたは優秀なエンジニアリングチームの一員です。（もちろん、チームがTODOコメントを付けていることが前提ですが）

開発者がコードをドキュメント化しない理由とは？

ソフトウェア開発における一般的な誤解のひとつに、「コードがきれいで読みやすければドキュメントは不要だ」という考え方があります。確かに、関数シグネチャや変数名が適切であれば理解しやすくなりますが、それだけでは特定の決定がなされた理由を完全に説明することはできません。

「明白だ」という症候群

「明白だ。関数名がすべてを物語っている！」

どんなにきれいなコードでも、すべてを語ることはできません。適切な変数名や明確な関数は確かに役立ちますが、なぜそのアプローチを選んだのか、どのようなエッジケースを考慮したのか、将来の保守担当者が注意すべき点は何かといった背景情報までは伝えられません。
きれいなコードは「どのように動作するか」は示せても、「なぜそうなっているのか」は示せない。そこを補完するのがドキュメントの役割です。

動く的の問題

現代の開発は高速で進んでいます。

コードベースが常に進化していると想像してみてください。新しい機能が追加され、APIが変更され、コードがリファクタリングされる。その間に、ドキュメントは古くなっていきます。先月は正確だった内容が、今日では誤解を招く可能性があり、来期には完全に間違っているかもしれません。

ドキュメントの陳腐化は自然に起こります。開発者は日々、高速でコードを変更しており、その影響でドキュメントの一部分が時代遅れになってしまうことは珍しくありません。APIエンドポイントの変更だけでなく、クラスメソッドの更新、関数シグネチャの進化、データベーススキーマの変更など、さまざまな要因が絡み合います。気がつけば、かつて正確だったドキュメントが、時代遅れの情報の地雷原と化してしまうのです。

例えば、こんな状況が起こります。

ドキュメント化したAPIエンドポイントに、新たに3つの必須パラメータが追加されている
エラーコードのセクションが6か月間更新されていない
例として掲載したコードスニペットが、もはやコンパイルさえできない
認証フローが完全に変更されたのに、ドキュメントには古い手順がそのまま載っている

これこそが、コードとドキュメントを同期させることが難しい理由です。ドキュメント作成は一度きりの作業ではなく、常に変化と戦い続ける必要があるのです。

コードがドキュメントとずれても、必ずしも動作しなくなるわけではありません。しかし、ユニットテストはコードが正しく動作するために常に最新の状態に保たれる必要があります。適切に書かれたテストは、単なる機能の検証を超え、具体的な例を通じて適切なAPIの使用方法を示す役割も果たします。

網羅率の測定にこだわるだけではなく、意味のあるテストケースを作成することで、開発者はコードがどのように動作すべきかの正確な参照を提供できます。テストは唯一のドキュメント形式であるべきではありませんが、信頼性が高く、自動的に更新される例を示すことで、書かれたドキュメントを補完する強力なツールとなります。

心理的要因（私たちも複雑な生き物であるため）

正直に言いましょう。ドキュメント作成の問題は、単なる時間的な制約だけではありません。優れたドキュメントを作成することには、コーディングにはない難しさが伴います。

まず、異なる思考方法が求められます。コーディングでは、コンピュータと正確かつ論理的な用語で対話します。しかし、ドキュメント作成ではどうでしょうか？
人間とコミュニケーションを取り、質問を予測し、複雑な概念を明確に説明しなければなりません。多くの開発者にとって、この思考の切り替えは簡単ではありません。

また、完璧主義の罠 もあります。
「このドキュメントは本当に十分な品質だろうか？」
「間違った説明をしていないか？」
「重要な詳細を見落としていないか？」
こうした不安が先延ばしにつながります。結局のところ、「間違ったドキュメントを作るくらいなら、いっそ書かない方がいい」と考えてしまうかもしれません。（ネタバレ：そうではありません）

さらに、動機付けの問題 もあります。
コードを書いて、それが動作すれば、すぐに達成感が得られます。しかし、ドキュメントには即座の報酬がありません。その価値が明らかになるのは、数週間、あるいは数ヵ月後、誰かが（将来の自分も含めて）仕組みを理解しようとしたときです。

ドキュメント作成が負担に感じられるのは、不慣れなスキル、不完全さへの不安、そして報酬の遅延が重なるためなのです。

質の悪いドキュメントがもたらす影響

質の悪いドキュメントの影響は、個々の開発者にとどまらず、プロジェクト全体に広がります。

生産性の低下とコードのメンテナンス負担の増加
開発者はコードの理解に多くの時間を費やし、その結果、作業の遅延が発生します。
バグやエラーの増加
コードの意図が正しく伝わらないことで、誤解によるバグやエラーが発生しやすくなります。
共同作業の障害
不十分なドキュメントは、開発者同士のスムーズな協力を妨げ、無駄なやりとりが増える原因になります。
知識の損失
チームメンバーが退職すると、ドキュメント化されていない知識も一緒に失われ、引き継ぎが困難になります。

実際に効果のあったこと

当社で実際に効果を発揮した方法を、優先順位の高い順に紹介します。

意味のあるコメントを活用する

完全なドキュメントを作成する時間がない場合、まずはコメントから始めましょう。コードの「何をしているか」だけでなく、「なぜこの実装なのか」を説明するコメントを残すことが重要です。将来の自分、あるいはあなたが退職した後にコードを保守する開発者への、小さなラブレターのようなものと考えてください。

簡潔かつ最新の状態に保つ

誰も冗長な文章を読むのは好きではありません。ドキュメント（コメントを含む）は簡潔で要点を押さえ、焦点を絞ったものにしましょう。そして、最も大切なのは最新の状態を維持することです。古くなったドキュメントは、むしろ無い方がマシな場合すらあります。それは、もはや存在しない都市の地図を渡すようなもので、フラストレーションを生むだけです。

AI開発ドキュメントツールの導入

最新のAIツールを活用することで、PythonのDocstring、JavaScriptのJSDoc、JavaのJavadocなどのコードドキュメントを自動生成・メンテナンスできます。これらのツールは、コードの構造を解析し、パラメータ、戻り値の型、関数の目的を説明するドキュメントを提案できます。完璧な代替にはなりませんが、開発のスピードを上げ、リファレンスドキュメントの最新性を保つのに大いに役立ちます。

READMEの活用

プライベートリポジトリのREADMEは、後回しにされがちです。しかし、オープンソースプロジェクトと同様にREADMEを充実させることで、オンボーディングにかかる時間を大幅に短縮できます。プロジェクトの目的、開発フロー、APIドキュメント、実行手順、アーキテクチャ図など、重要な情報をまとめましょう。新しくチームに加わったメンバーが最初の1週間で聞きそうな質問を想像して、それをREADMEに記載すると効果的です。

視覚的な補助を使用する

開発者向けドキュメントをすべて文章で書くのが難しい場合は、アーキテクチャ図やシーケンス図を活用しましょう。特に複雑な概念を説明する際には、1枚の図が1000の言葉に値します。フローチャートや図解を取り入れることで、ドキュメントをより直感的で分かりやすくすることができます。

結局のところ、優れたドキュメント文化を根付かせることが鍵です。まずはPRチェックリストにドキュメント更新を含めることから始めましょう。コードレビュー時にドキュメントについて議論し、適切なコメントやガイドを追加したメンバーを評価する仕組みを作るのも有効です。こうした取り組みを続けることで、「後でドキュメントを作成する」から**「ドキュメントがなければ完了ではない」**という意識へと、チーム全体の認識が変わっていくはずです。

まとめ

優れたドキュメントとは、完璧なものではなく、今日のソリューションと明日の問題をつなぐ橋です。ドキュメント作成の課題を完全に排除することはできませんが、小さな一貫したステップとチーム文化の変革によって、ドキュメント作成をより管理しやすくすることは可能です。

まず「なぜ」を説明する有意義なコメントを記載することから始める
ドキュメントの更新をPRプロセスの一部に組み込む
社内プロジェクトでもREADMEを効果的に活用する
テキストだけで伝わらない場合は、視覚的な補助手段を取り入れる
時代遅れのドキュメントは、ドキュメントがない状態よりも悪いことを意識する

次にドキュメント作成を後回しにしたくなったときは、思い出してください。ドキュメントは他の人のためだけでなく、未来の自分のために書いているのだということを。

なぜなら、6か月後には、私たちは皆、自分のコードのことをまったく覚えていないのですから。

CodeRabbit では、コード品質の継続的な改善を重視しています。例えば、プルリクエストごとに、文脈に即したコードドキュメントを自動生成する機能を提供しています。現在、この機能はベータ版として提供中です。無料トライアルに登録して、ぜひコードとドキュメントの生成をお試しください！

]]>

Mon, 24 Mar 2025 21:44:36 GMT

From Manual Reviews to AI-Powered Insights: The Linux Foundation'sの意訳です

TL;DR

主な連絡先： David Deal、プラットフォーム担当ディレクター
使用するコーディング言語： Go言語、Python、Angular、TypeScript
課題： 開発者はエラーの発生しやすい手動コードレビューに多大な時間を費やしている
主な成果： 開発者の時間を25%開放し、より多くの機能開発が可能に

The Linux Foundation は、オープンソース技術の力を活用し、最新のWebアプリケーション開発を支援するグローバルな非営利団体です。CNCF、PyTorch、OpenJS、Automotive Grade Linux など、200以上のオープンソースプロジェクトをホスト・支援しています。

同団体のエンジニアリングチームは世界各地に配置され、40～60名のエンジニアで構成されています。開発者と企業が協力できる中立的な環境を提供し、プロジェクトメンバーシップの管理やテレメトリデータの管理、ITインフラ向けのツールやサービスの開発・保守などを行っています。

ビジネス上の課題 - 手作業によるコードレビューがボトルネックに

Linux Foundationは、分散型のソフトウェア開発チーム特有の課題に直面していました。特に、コードレビューの遅延や非効率性、エラーの発生が問題となっていました。従来のコードレビューは、技術リーダーや同僚が手作業で行っており、プルリクエスト（PR）のマージ準備が整うまでに、複数の開発サイクルを要することも珍しくありません。さらに、異なるタイムゾーンにいるエンジニアからのフィードバックを待つ時間が長くなり、プロジェクト全体の進行が遅れる原因となっていました。

こうした課題を解決するため、チームリーダーの負担を増やすことなく、コードの品質不整合やギャップを特定できるツールの導入が求められていました。

主な課題：

手動のコードレビューにはばらつきがあり、重大なバグを見逃すリスクがある
レビュアーの知識に依存するため、一貫性を欠き、最高品質のエンジニアリングを維持しにくい
エンジニアが異なるタイムゾーンに分散していると、コードレビューの遅延が発生しやすい
コード品質のばらつきが原因で、チームリーダーは新機能開発に集中できず、大きな負担を抱えることになる

AI コードレビューによる開発生産性の向上

The Linux Foundationは、CodeRabbitのAIコードレビューを導入し、ソフトウェア開発のワークフローを大幅に改善しました。導入はわずか数回のクリックで完了し、GitHubリポジトリとの統合からAIによるコードレビューの実施まで、1日もかかりませんでした。さらに、CodeRabbitのSaaSによる展開とカスタマーサクセスチームのサポートにより、スムーズかつ迅速な導入が実現しました。

CodeRabbitのAIコードレビューにより、エンジニアリングチームのリーダーは本来の業務に集中できるようになり、初回のレビューサイクルを高品質かつ自動化された形で進められるようになりました。AIが特定する主な問題には、バグ修正やドキュメント不足、ユニットテストの追加、コードのリファクタリング提案などがあります。

Linux Foundationが活用したCodeRabbitの主な機能は以下のとおりです。

AIによるコードレビューの主な機能

常時稼働のコードレビュー:

CodeRabbitのAIは、常に上級エンジニアレベルのコードレビューを提供し、ワンクリックで利用可能です。人間のレビューと異なり、エンジニアは異なるタイムゾーンの同僚やエンジニアリーダーを待つ必要がなく、スムーズなワークフローを実現できます。

PRサマリーと提案:

PRに関連する各ファイルの変更内容を説明し、複雑な変更点を簡潔かつわかりやすく要約
リファクタリングの提案を含め、コード品質を最適化するための行単位の推奨事項を提供

1クリックでバグを修正:

バグやエラーが本番環境に影響を与える前に即座に特定
1クリックでAIによるバグ修正の提案を受け入れ、修正をPRに自動コミット

幅広い統合:

幅広い言語とフレームワークをサポート（Go言語、Python、Angular、TypeScript、Terraform、SQLなど）
静的解析ツールやリンターと統合し、CodeRabbitはコードレビューの一元的なプラットフォームとして機能

ドキュメントとユニットテスト:

CodeRabbitのAIレビューはドキュメントの不足を特定し、特にSQLやDBTのテストフレームワークではユニットテストの追加を推奨しました。

Terraformファイルへの提案により、インフラ管理が改善されました。

自動学習:

CodeRabbitのAIは時間をかけて学習し、リポジトリ固有のベストプラクティスを特定
エンジニアはチャットボットを通じて文脈に沿ったフィードバックを提供し、AIの学習内容を確認および調整（フィードバックにより、コードレビューの品質がさらに向上します）

“CodeRabbitは、当社のドキュメントと実際のテスト範囲の相違点を明らかにする上で非常に有益でした。nullチェックの欠落や値の範囲の不一致などの不整合をハイライトすることで、コードベースの品質が大幅に向上し、多くの潜在的な問題を回避できました。” — David Deal氏（The Linux Foundation、エンジニアリング担当シニアディレクター）Linux Foundation

まとめ

The Linux Foundationは、CodeRabbitを組織内の他チームへ展開を計画しています。また、PRのマージ時間の追跡や人間によるコメントの削減など、今後リリースされるCodeRabbitの分析機能を活用し、AIが開発ワークフローの高速化に与える影響をより簡単に測定できることを期待しています。さらに、より多くのユーザーがチャットボットとやり取りすることで、レビューサイクルにおけるコーディング標準に特化した学習が進み、AIが時間をかけて成長していく様子を見られるだろうと考えています。

Linux FoundationによるCodeRabbitの導入は、AIによるコードレビューの効果を示しており、チームの効率を向上させ、コードレビューに費やす手作業の時間を25%削減しています。CodeRabbitは、単調な作業を自動化し、実用的なインサイトを提供することで、エンジニアがオープンソースプロジェクトにおけるイノベーションと品質維持に集中できるよう支援します。

CodeRabbitの始め方

CodeRabbitを使って、AIによるコードレビューを体験してください。導入にかかる時間は5分未満で、Gitプラットフォームへの統合時にクレジットカードは不要です。無料トライアルを開始し、新しいプルリクエストを作成するだけで、AIが数分でコードレビューを実施するのを確認できます。ご質問がある場合は、サポートまでお問い合わせください。

Primary contact: Shanimol. E.M, Engineering Manager
Company: KeyValue Software Systems
Coding languages used: Golang, Flutter, Next.js, React, Python
Challenge: Time-consuming, inconsistent, and error-prone manual code reviews slow development.
Key Result: More than 50% reduced code review time, enabling faster releases and improved efficiency.

KeyValue Software Systems is a premier global AI-first product development hub and the best delivery engine in the Indian subcontinent. With a 400+ strong engineering team, the company has built and delivered 120+ products for 80+ companies over the last eight years. Its expertise spans industries, geographies, and technologies, leveraging AI to create high-value software solutions. At KeyValue, Shani leads a 20-member engineering team developing a fintech product to help tech startup operators, founders, and investors create and manage wealth. The team focuses on delivering a secure, fast, and reliable user experience.

Business Challenges – Manual Code Reviews Slowing Down Development

As a product development partner for fast-moving startups and scaleups, KeyValue Software Systems prioritizes rapid feature releases. However, manual code reviews have created bottlenecks in the development process.

Key Challenges:

Time-consuming manual Reviews: Engineers spent excessive time reviewing PRs, creating bottlenecks that slowed QA delivery and reduced overall productivity.

Inconsistent Review Quality: Varying skill levels led to overlooked best practices and inconsistent naming conventions.

Security Risks in FinTech: Given the security-sensitive nature of FinTech applications, ensuring strong security, compliance, and vulnerability detection before deployment was critical.

Engineering Bandwidth Constraints: Senior developers spend too much time reviewing code instead of focusing on high-impact development and architectural improvements.

Missed Bugs & Hidden Errors: Manual reviews often miss subtle bugs, ambiguous code, and performance inefficiencies, leading to potential unexpected behavior.

Sprint Execution Inefficiencies: Teams spend excessive time in the push-review-fix-release loop, limiting their ability to focus on strategic, high-impact engineering challenges.

Key CodeRabbit Features That Transformed Development Workflow:

1. Always-ON AI Code Reviews:

CodeRabbit functions like a 24/7 senior engineer, ensuring prompt and consistent code reviews without relying on peer availability.

2. Automated PR Summaries and Suggestions:

Clear and concise summaries for every PR, making it easier for reviewers to understand changes at a glance.
Intelligent refactoring suggestions ensure optimized, maintainable code.
Detects edge cases and recommends error-handling improvements, identifying potential issues early.

3. Security:

Automatically flags security risks and helps meet security compliance requirements.

4. Bug Detection:

AI-driven bug identification highlights potential issues in the code before they go live.
Automated error-handling reviews improved overall application stability.

5. Context-Aware Code Reviews with Linear Integration:

Seamless integration with Linear allows CodeRabbit to fetch relevant ticket details, providing context-aware PR reviews.
Leverages ticket descriptions to ensure code changes align with business requirements and intended functionality.

6. Junior Engineer Onboarding:

AI highlighted naming conventions and coding standards and reinforced best practices.
Reduced senior engineers' having to correct common beginner mistakes.

7. Increased Sprint Productivity:

Time saved in code reviews was used to take on strategically important tasks.
More efficient sprint execution due to less time in review cycles.

8. Built-In AI Chatbot for Interactive PR Discussions:

CodeRabbit’s intelligent chatbot allows engineers to interact with PR comments in real time, making it easier to clarify suggestions, request refinements, and resolve issues efficiently.
It simplifies code review discussions by providing quick explanations and justifications for AI-generated review comments.

“CodeRabbit has completely transformed our code review process, making it faster, more consistent, and less manual. It has saved us more than 50% of the time we used to spend on manual reviews, allowing our engineers to focus on building great products” - Shanimol. E.M, Engineering Manager, KeyValue Software Systems

Conclusion

CodeRabbit has improved the way KeyValue teams work. By automating code reviews, CodeRabbit enabled engineers to deliver high-quality products faster while maintaining security and best practices. With CodeRabbit rolling out to more teams within the company, KeyValue is using AI to set a new standard for development efficiency.

Get Started with CodeRabbit If your team wants to supercharge its development workflow, CodeRabbit makes AI-powered code reviews seamless and efficient. You can try CodeRabbit in under 5 minutes; no credit card is required. Get started today.

]]>

Mon, 03 Mar 2025 15:54:51 GMT

Good code reviews take thoughtful, unambiguous communication. This can be a big challenge as a team grows. Each developer creates additional lines of code. This is why well-documented code review guidelines, processes ,and principles are so important.

Code review is a critical step in the software development lifecycle that you shouldn't omit. However, how code review should be done has been debated for a long time.

Every organization approaches code reviews in their own way.

In some organizations, everyone reviews the code. Others prefer to assign code review to one team member (usually a more senior engineer than the author).

Big tech companies usually have an internal code review tool to manage the entire process. Google has Critique & Gerrit, and Meta has Phabricator. Also, since Generative AI and AI Code Assistants became very popular, many organizations have adopted them to automate code reviews.

Similar to how organizations approach code review in their way, software engineers also have opinions on how they want code review done. These opinions might differ from their reality at work. This shows a difference between how engineers would like to do code reviews and how their workplace makes them do it. We came across some opinions from engineers on X about code review.

We think these are worth looking at.

Every one of these engineers has a valid reason for their opinion. And some with real-world experience to buttress why they hold such opinions.

Based on the diverse takes we have seen on the subject of code review, we pretty much grouped them into:

The “I like code review” group consists of people who want code review regardless of the pass threshold set. They will adjust to any code review process structure set at their workplace.
The “I like code review but” group. Folks in this group want code review but with certain conditions, such as “I like code review but don’t force me to write in your code style,” “I like code review, but we should automate with an AI agent,“ etc.
The “I don’t think code review is necessary” group (not a popular group). Some people in this group believe that code review is just a way for someone to make you conform to their standards.

Again, each group has a good reason for their choice. Our focus in this post is to share tips on thoughtful code reviews. First, let’s quickly detour to the code review blame culture.

The Code Review Blame Culture

It is a known fact that the code review process can be a tedious exercise involving multiple back-and-forths between an author and a reviewer. If not properly managed, it can create an environment where the focus shifts to finding faults in code and apportioning blame when a bug escapes review. we call it the blame culture.

Code review is about ensuring code quality, consistency, and best practices.

Authors should not view code reviews as antagonistic. Similarly, feedback from reviewers should be issue-based. The blame culture thrives when teams behave contrary to these—personal attacks, nitpicking, defensiveness, etc.

If a bug manages to escape code review, what comes to mind should not be “Whose is to blame?” but “What went wrong and how can we fix it?.” We call it the team mindset.

The blame culture is toxic and should have no place in code review.

Now that we have established that, let’s move on to how to do code reviews thoughtfully.

How to Do Thoughtful Code Reviews

There is no one-size-fits-all approach to code review. However, teams should approach it with transparent processes and standards. A thoughtful code review goes beyond finding errors to promoting knowledge sharing, collaboration, and empathy.

The following are some elements that we think a thoughtful code review should have:

Do a preliminary self-review (own your work)

You should first self-review your code. Before clicking “Create Pull Request,” give your work a second look. No, this doesn’t mean you don’t trust or believe in yourself. It means that you understand that you can make mistakes. More importantly, it means you value the reviewer’s time and effort.

Check your code and tests for errors and bugs before submitting it for peer review.

Additionally, focus on high-risk vs. low-risk changes during self-review.

High-risk changes—those affecting critical business logic, security, or performance—deserve extra scrutiny.
Low-risk changes- such as minor refactors or docs updates, can be reviewed with a lighter touch. By prioritizing your review efforts, you help ensure that major issues are caught early while keeping the process efficient.

Smaller PRs

It is hard (almost impossible) to thoroughly review a codebase with large changes.

Nobody wants to review 1,000+ lines of code in one go. It’s exhausting, error-prone, and delays feedback. Break big changes into small, logical PRs—it keeps things moving faster. Big tech like Google, Facebook, and every well-run engineering team encourages the small commits culture and prevent merge hell.

Rule of thumb: If a PR takes more than 10-15 minutes to review, it’s too big.

Write/review with empathy

Consider the feelings of others when you write or review code. Following best practices and writing clean code means you care about who takes a look at your work. Empathy is putting yourself in the shoes of the other person.

For example, feedback such as “I find this function a bit unusual. Do you mind giving it a second look?” is more empathetic than “This is a bad function. You should do a rewrite.”

PR comments shouldn't be personal, or vague.

❌ You didn't check for a null value.

✅ This input value could be null, causing a server error. If null, a client error should be thrown.

The second: - targets the code, NOT the person - is clear about a suggested improvement.

Good code reviews aren’t just about finding bugs. They’re about helping your team write better code. If your feedback sounds like an attack, nobody will listen.

Automate code review processes

No one likes going through the same thing over and over again. You easily get bored and frustrated doing so. Frankly speaking, code review cycles can get you into that bored and frustrated spot. This is why you should try to abstract away redundant manual steps.

Use automation tools to streamline your workflows. Linters are great for automating parts of your code review process.

You could also employ AI Code agents to automate certain levels of code review.

The human element will always be important in code review (at least for the foreseeable future). While linters and formatters have helped us follow agreed practices by teams, using AI-based Code Review automation tools, you can further leverage AI-powered Linting to suppress noise and bubble up impactful issues.

Note: AI based tools prone to hallucinations. Fine-tuning or instructing the LLMs on how you want the code to be reviewed may be essential in some cases, for a desired output.

Agree to a standard as a team

We all have our own biases as humans. Each of us has a certain way we like things to be like. And that’s okay. However, we should not impose our biases (or preferences) on others.

Do not impose your code style on others.

Set the pass threshold

Approving code should not be “as the reviewer wishes.” Your team should set a clear pass threshold.

Set the minimum standards code submissions must meet to merge in the main codebase.

Examples of the pass threshold you could set are: performance benchmark, security, and readability.

GitHub Checks can help you automate and enforce pass thresholds. By requiring specific checks to pass before a pull request can be merged, you ensure consistent code quality and speed up the code review process.

Clear feedback

Provide specific and actionable feedback. Where you don’t have the right context, ask questions and don't make assumptions. A good code review isn’t just criticism. it’s a collaboration**.**

Be specific, actionable, and ask questions when you’re unsure.

Bad: This is wrong. Fix it.
Better: This approach might have a race condition. Could we use a lock here instead?

You don’t have to solve every problem—just help the author think through their solution.

Summary

Code review is a time-consuming process.

Approaching it thoughtfully makes it easier for all parties involved. If done right, it can help your team improve code quality, ensure consistency, and promote collaboration, openness, and a learning culture.

How does your organization do it? And how do you want it done?

]]>

Tue, 25 Feb 2025 21:08:02 GMT

How to fix TypeScript Code Smells with CodeRabbitの意訳です

最新の開発ワークフローに AI を活用したコードレビューを統合することで、TypeScript プロジェクトの管理方法が大きく変わりました。

TypeScript には静的型チェックの機能が備わっており、エラーを早期に検出できます。しかし、コードベースが拡大するにつれて、コード全体の品質を維持することがますます重要になります。従来のコードレビューやペアプログラミングは有効ですが、大規模なチームや複雑なシステムでは開発サイクルが遅くなる要因となることがあります。

このチュートリアルでは、TypeScript 開発者向けの求人情報サイトを例に、コードの匂いを特定し、コード品質を向上させることで、TypeScript プロジェクトを強化する方法を紹介します。

まず、「コードの匂い」とは何か、そしてなぜそれを修正することが重要なのかを見ていきましょう。

コードの匂いとその重要性について理解する

コードの匂いとは、コードの設計や構造に潜む潜在的な問題を示す兆候やパターンのことです。これらの問題は即座にバグやエラーを引き起こすわけではありませんが、将来的に保守性の低下、可読性の悪化、スケーラビリティの制約といった課題を引き起こす可能性があります。コードの品質を向上させ、堅牢で効率的なコードベースを維持するには、コードの匂いを認識し、適切に対処することが不可欠です。

TypeScript における代表的なコードの匂いには、長い関数、重複したコード、複雑な条件分岐などがあり、これらは長期的にパフォーマンスや可読性に悪影響を与える可能性があります。

TypeScript プロジェクトでは、こうした問題を早期に特定し、適切にリファクタリングすることで、コードの健全性を維持できます。これは、アプリケーションが成長するにつれて、スケーラビリティやメンテナンス性を確保するために不可欠です。

Slack のような大規模プロジェクトでは、TypeScript を活用することで開発効率を高めていますが、コードの匂いが蓄積すると技術的負債の原因となります。このような問題が積み重なると、ユーザー体験やビジネスの成長にも悪影響を及ぼす可能性があります。AI ツールを活用すれば、コードの匂いを自動的に検出し、修正することが可能になり、開発チームはコードの品質管理に煩わされることなく、新機能の開発に集中できます。

次に、TypeScript における一般的なコードの匂いを特定し、具体的なコードの断片を通じて、それらをどのように検出できるのかを検討していきましょう。

TypeScriptにおける一般的なコードの匂いを特定する

ソースコードをレビューする際に、開発者が注目すべき代表的なパターンを以下に示します。

長い関数: 1つの関数やクラスメソッドが多くの責任を持ちすぎている場合、テストやメンテナンスが困難になります。適切に分割し、単一責任の原則 (SRP) に従うことで、可読性と再利用性を向上させることができます。

function processJobApplication(job: Job, applicant: Applicant) {
  // handles filtering, validation, notification, etc.
  // too many responsibilities in one function
}

重複したコード: 同じコードが複数の場所で繰り返されている場合、リファクタリングの必要性を示しています。共通のロジックを関数やユーティリティモジュールに統合することで、コードの再利用性を高め、保守の手間を削減できます。

function displayJobTitle(job: Job) {
  console.log(job.title);
}

function showJobTitle(job: Job) {
  console.log(job.title);
}

複雑な条件分岐: ネストが深い条件や複雑な論理式は、コードの可読性を低下させ、デバッグを困難にします。

if (job.salary > 50000 && job.location === 'remote' && job.type === 'full-time') {
  // complex logic that could be simplified
}

特に、求人情報のフィルタリングや詳細表示を行う関数内などで、コードの匂いを特定し対処する場合、大規模なコードベースを扱うと手動でのコードレビューに多くの時間がかかります。

次のセクションでは、TypeScript の求人掲示板プロジェクトを例に、AI 搭載のコードレビューツールがこれらの問題を自動的に検出し、フラグを立てることで、効率的なリファクタリングをどのように支援できるかを紹介します。

TypeScriptの求人掲示板の設定

このチュートリアルでは、型安全機能の実装と潜在的な問題の検出を支援する AI 搭載のコードレビューツールを活用し、求人掲示板アプリケーションを作成する手順を説明します。

前提条件

開始するには、以下の環境が必要です。

TypeScript、React、または Next.js の基礎知識
Node.js と npm のインストール
Shadcn UI のインストール
CodeRabbit アカウント
VS Code（またはその他のコードエディタ）

ジョブボードのセットアップ

素早くセットアップするには、GitHub のリポジトリをクローンし、依存関係をインストールして、以下の手順でアプリをローカルで実行します。

リポジトリをクローン

git clone https://github.com/Tabintel/typescript-job-board.git

依存関係をインストール

cd typescript-job-board
npm install

アプリケーションを実行

npm run dev

localhost:3000 をブラウザで開き、TypeScript ジョブボードを表示

この求人掲示板は、TypeScript 開発者向けに求人情報を検索できるよう設計されています。サイドバーでの求人情報のフィルタリング、動的に生成される求人カード、レスポンシブなレイアウトが含まれています。

アプリケーションを実行したら、CodeRabbit のような AI コードレビューツールがどのようにコードの匂いを検出し、修正に役立つのかを確認してみましょう。

AIによるコードレビューの設定

アカウントを作成し、ドキュメント内の統合ガイドラインに従って、AIをワークフローに統合します。インストールが完了したら、TypeScriptプロジェクトのコードレビューの合理化と高品質基準の維持にAIを活用し始めましょう。

レビュープロセスを有効にするには、リポジトリにブランチを作成し、コードを更新してプッシュし、プルリクエスト（PR）を開始します。

ターミナルで以下のコマンドを実行して、求人情報アプリケーション用の新しい機能を作成してみましょう。

git checkout -b feature/job-application-form

これで、求人掲示板を使用している TypeScript 開発者は、プラットフォームを通じて直接求人に応募できるようになります。コンポーネントには以下が含まれます。

フォームデータの検証用の TypeScript インターフェース。
React の状態管理による厳密に型付けされたフォーム処理。
フォーム送信用の型安全なイベントハンドラ。

次に、src/app/components/ ディレクトリに JobApplicationForm コンポーネントを作成し、この GitHub gist のコードを入力します。

コードを追加したら、次のコマンドを使用して変更をGitHubリポジトリにコミットしてプッシュします。

git add src/components/JobApplicationForm.tsx
git commit -m "feat: add job application form with TypeScript"
git push origin feature/job-application-form

次に、GitHubリポジトリに移動すると、新しいブランチからプルリクエストを作成するよう促されます。

わかりやすいタイトルと新機能の説明を記載したプルリクエスト（PR）を作成しましょう。

プルリクエストにより、AIによるコードレビュープロセスが開始され、ソースコードを分析して潜在的なコードの匂いやリファクタリングが必要な領域を検出します。

コードの匂いの分析とレビュー

プルリクエストを作成すると、AI 搭載のコードレビューツールが自動的にコードを分析し、潜在的な問題を特定します。以下の画像のように、求人掲示板のコードを改善する提案が表示されます。

レビューでは、リファクタリングが可能なコードを指摘し、改善方法を提案します。

さらに、TypeScript コードに潜む潜在的な問題も強調表示されます。

`JobApplicationForm.tsx` で検出されたコードの匂い

1. 不適切なステート管理

既存のコードでは、適切なステート管理が行われておらず、読み込みインジケーターやエラーハンドリングが欠如しています。その結果、フォームの送信中にユーザーへ適切なフィードバックが提供されず、混乱を招く可能性があります。

改善策: CodeRabbit は、適切な isSubmitting フラグを追加し、エラーメッセージの管理を提案します。デフォルトの経験値を 1 年に設定し、ユーザーが適切な情報を入力できるようにしました。

const [isSubmitting, setIsSubmitting] = useState(false);
const [error, setError] = useState(null);
const [application, setApplication] = useState({
  position: '',
  yearsOfTypeScript: 1,  // 適切なデフォルト値
  githubProfile: '',
  email: '',
  portfolioUrl: ''
});

2. エラー状態の欠落

フォームの送信ハンドラではエラー処理が不足しており、送信のライフサイクルを適切に管理していません。これにより、エラーが発生してもユーザーへ適切なフィードバックが提供されません。

改善策: 適切な非同期処理 (async/await) を導入し、エラー処理を強化しました。

const handleSubmit = async (e: React.FormEvent) => {
  e.preventDefault();
  try {
    setIsSubmitting(true);
    setError(null);
    await submitApplication(application);
    // 成功メッセージを表示
  } catch (err) {
    setError(err instanceof Error ? err.message : 'Submission failed');
  } finally {
    setIsSubmitting(false);
  }
};

3. 型なしの Promise レスポンス

Promise の戻り値に any を使用すると、型安全性が損なわれ、予期しないエラーの原因になります。適切な型定義と入力検証を実装することが推奨されます。

改善策: Promise の戻り値に適切な型を付与し、入力値の検証を追加しました。

const submitApplication = (data: JobApplication): Promise<{ success: boolean; message?: string }> => {
  return new Promise((resolve, reject) => {
    if (!data.email.includes('@')) {
      reject(new Error('Invalid email format'));
      return;
    }
    resolve({ success: true, message: 'Application submitted successfully' });
  });
};

4. アクセシビリティの不足

フォームには適切なアクセシビリティ (ARIA 属性) が欠けており、スクリーンリーダーのユーザーがフォームを適切に操作できない状態でした。

改善策: 適切な ARIA 属性を追加し、アクセシビリティを向上させました。

<form 
  onSubmit={handleSubmit} 
  className="space-y-4 w-full max-w-md"
  aria-label="Job Application Form"
  noValidate
>
  <input
    type="email"
    id="email"
    aria-invalid={!!emailError}
    aria-describedby="email-error"
    required
  />
form>

この自動コードレビュープロセスを通じて、TypeScript開発に関する貴重な教訓を得られました。適切な型定義は、ドキュメントと実行時の安全対策両面で機能します。エラー処理は、適切に型付けされることでより予測可能になり、アクセシビリティ機能は、後付けではなく開発プロセスへ自然に取り込まれるでしょう。

AIによるコードレビューで継続的にコード品質を改善

AIを搭載したコードレビューツールを導入すると、開発プロセスの継続的な改善が可能になります。ソースコードの品質を向上させ、開発全体の効率を高める方法を紹介します。

効率的な開発：自動コードレビューツールが開発中にコードの匂いを検出することで、開発者は手動のコードレビューやペアプログラミングにかかる時間を削減し、より優れたソフトウェア機能の構築に集中できます。
技術的負債の防止：コードの匂いを早期に特定して対処することで、問題が複雑化する前に解決できます。この積極的なアプローチにより、複数の機能やクラスメソッドをまたぐ保守性の高いコードベースを維持できます。
コード品質の向上：適切な型情報を持つクリーンなソースコードは、バグやエラーを最小限に抑えます。プリミティブ型や型推論を体系的に活用することで、拡張性を高め、新しい機能を追加する際の負担を軽減できます。

ネクストステップ？

AI によるコードレビューを活用し、TypeScript のコードの匂いを特定・改善する方法を学んだところで、開発プロセスをさらに強化するための次のステップを紹介します。

TypeScript のプラクティスを洗練する：コードの品質をさらに向上させるため、ジェネリクスや型推論といった高度な機能を積極的に活用しましょう。これにより、保守性が向上し、ランタイムエラーのリスクが軽減されます。
AIと人間によるレビューの統合：CodeRabbit のようなAI コードレビューツールは強力ですが、人間の視点と組み合わせることでさらに効果的になります。チームメンバーが AI の提案を確認し、フィードバックを共有することで、学習と継続的な改善の文化を促進できます。
最新情報を把握する：TypeScript は常に進化し、新機能や改善が追加されています。厳格な型チェックやツールの最適化など、最新の変更点を公式ブログで定期的に確認し、最新のベストプラクティスを取り入れましょう。

まとめ

AIツールを活用すれば、開発者は TypeScript（または他の言語）のソースコードを分析し、一般的なコードの匂いを特定し、リファクタリングを通じてコード品質を向上させることができます。開発ワークフローの効率が向上し、チームは自信を持ってソフトウェアをリリースできるようになります。

より良いソフトウェア開発を実践し、TypeScript プロジェクトを強化する準備はできましたか？今すぐCodeRabbitに登録し、AI によるコードレビューのメリットを体験してください。

]]>

Tue, 25 Feb 2025 21:06:28 GMT

How to Automate OWASP Security Reviews in Your Pull Requests?の意訳です

ウェブアプリケーションの重要性が高まる中、セキュリティは世界中の組織にとって最優先事項となっています。これらのシステムがより密接に統合されるにつれ、強固なセキュリティ対策が求められています。最近の報告書によると、AI を悪用した攻撃が急増しており、小売 API への侵入、DDoS 攻撃、GenAI を利用した高度なフィッシングキャンペーンなど、1 日あたり 50 万件以上のインシデントが発生しています。

こうした脅威に対応するため、AI を活用して脆弱性をシミュレートし、事前に特定・緩和する AI レッドチームのスタートアップが登場しています。大規模言語モデル（LLM）を活用したアプリケーションの台頭により、プロンプトインジェクションやデータポイズニングといった新たな課題が浮上し、セキュリティの戦場は常に進化しています。

このような状況を受け、多くの組織がオープン・ウェブ・アプリケーション・セキュリティ・プロジェクト（OWASP）に注目しています。OWASP は、安全なアプリケーション開発を支援するリソースやガイドラインを提供する非営利団体であり、ソフトウェアのセキュリティ向上を目的としています。

本記事では、本番環境にデプロイする前に脆弱性を検出・対処する方法を解説し、侵害リスクを最小限に抑えながらアプリケーションのセキュリティを強化する手法を紹介します。

OWASP 101

OWASP の使命は、アプリケーションを攻撃者から守るための無料リソース、ツール、ドキュメントを提供することです。トレーニングやベストプラクティス、コミュニティ主導のプロジェクトを通じて、開発者やセキュリティ専門家を支援しています。また、カンファレンスやグローバルなコミュニティ活動を通じてコラボレーションを促進し、新たなセキュリティ脅威への対策を強化しています。

OWASP Top 10 は、最も一般的なセキュリティリスクをリスト化した必須のリソースです。ウェブアプリケーションの複雑化と相互接続性の向上に伴い、悪意のある攻撃者が悪用できる脆弱性が増加しました。このリストは、開発者が安全なアプリケーションを構築するための指針となります。

以下は、OWASP Top 10 に含まれる主要な脆弱性であり、ウェブアプリケーションのセキュリティにおける重大なリスクを示しています。

破られたアクセス制御: 不適切な権限設定により、攻撃者がリソースへ不正アクセスできる。
暗号化の失敗: 脆弱な暗号化や機密データの不適切な処理が、情報漏洩のリスクを高める。
インジェクション: SQL などのクエリやコマンドを不正に改変し、意図しないコード実行を引き起こす。
安全でない設計: システムアーキテクチャや設計段階におけるセキュリティ対策の欠如。
セキュリティ設定の誤り: システム設定のミスにより、脆弱性が生じる。
脆弱で旧式のコンポーネント: 更新されていないライブラリやフレームワークの使用が、攻撃の対象となる。
識別および認証の失敗: 弱い認証メカニズムにより、不正なユーザーがアクセス可能になる。
ソフトウェアおよびデータの整合性に関する障害: 改ざんされたソフトウェアや信頼できないデータの検証不足。
セキュリティのログ記録および監視の欠如: セキュリティイベントの監視が不十分で、攻撃の検知や対応が遅れる。
サーバーサイドリクエストフォージェリ（SSRF）: 攻撃者がサーバーに不正なリクエストを送信し、内部ネットワークへ侵入する。

リスクを特定するには、手動のコードレビューが不可欠ですが、複雑なシステム全体を精査し、あらゆる脆弱性を見つけるのは容易ではありません。開発者が最善を尽くしても、すべてのセキュリティリスクを排除するのは困難です。

そこで、CodeRabbit のような自動コードレビューツールが活躍します。開発の初期段階で脆弱性を検出することで、セキュリティ問題が本番環境に持ち込まれるリスクを最小限に抑えられます。CodeRabbit を導入することで、レビューを効率化し、セキュリティリスクを迅速に特定・解決しながら、開発者が潜在的な脅威に先回りできる環境を整えることができます。

CodeRabbitでOWASP違反を発見

OWASP Top 10 の脆弱性を特定する CodeRabbit の有効性を検証するため、意図的にセキュリティ上の欠陥を含む React ウェブアプリケーションを開発しました。これには、安全でない認証、未検証の入力、暗号化の欠如など、OWASP の主要ガイドラインに違反する一般的な問題が含まれています。以下の図は、アプリケーションの概要を示しています。

まず、クライアント、認証、プロファイルサービス間のやり取りの流れを説明します。

ユーザーはログインフォームに認証情報を入力し、送信します。
認証サービスの /login エンドポイントへ POST リクエストが送信されます。
/login は MD5 を使用した SQLite のパスワードハッシュを照合し、認証を実施します。
認証情報が確認されると、トークンとユーザーデータがローカルストレージに保存されます。
ログインが成功すると、ユーザーはダッシュボードへリダイレクトされます。
ダッシュボードは /profile と /fetch-avatar へ GET リクエストを送信します。

このウェブアプリのコードは、GitHub リポジトリで公開されています。

わずか 2 回のクリックで、CodeRabbit をリポジトリに統合しました。プルリクエストを作成すると、CodeRabbit が自動レビューを実行し、3 つの主要セクションを含む詳細なセキュリティレポートを生成します。

概要: 主なセキュリティリスクと優先的に対処すべき問題をハイライト。
詳細: ファイル単位のステップバイステップ分析と具体的な改善策の提示。
変更履歴: 修正が必要なファイルを一覧表示し、優先順位を決定しやすくする。

以下に、特定された 5 つの主要な OWASP リスクと、それらを修正する方法を紹介します。

アクセス制御の欠陥 (A01:2021)

CodeRabbit は、ルーティング構成が適切なアクセス制御を適用していないことを検出しました。特に /dashboard ルートでは、認証なしに機密データへアクセスできる可能性があります。

この問題は Broken Access Control (A01:2021) に該当し、未認証ユーザーによる不正アクセスを許してしまうリスクがあります。

解決策: /dashboard などの機密ルートへのアクセスには、認証済みのユーザーのみ許可する ProtectedRoute コンポーネントを導入することで、不正アクセスを防止できます。詳細はレビューコメントを参照してください。

暗号化の失敗 (A02:2021)

hash_password 関数が MD5 を使用してパスワードをハッシュ化しており、セキュリティ上の問題が指摘されました。MD5 は脆弱で、衝突攻撃のリスクが高いため、安全ではありません。

この問題は Cryptographic Failures (A02:2021) に該当し、適切な暗号化が施されていないことで、機密データが危険にさらされるリスクがあります。

解決策: bcrypt や Argon2 などの安全なハッシュ関数を使用することを推奨します。詳細はレビューコメントをご覧ください。

SQLインジェクション (A03:2021)

現在の実装では、ユーザー入力が SQL クエリに直接連結されており、SQL インジェクション攻撃に脆弱です。

この問題は Injection (A03:2021) に該当し、攻撃者が任意の SQL コードを実行できる可能性を生み出します。

解決策: パラメータ化クエリを使用し、直接的な文字列連結を避けることで、SQL インジェクションのリスクを低減できます。詳細はレビューコメントを参照してください。

安全でない設計 (A04:2021)

CodeRabbit は、API の設計において複数の問題を指摘しました。API URL のハードコーディング、CSRF 保護の欠如、パスワード強度チェック不足、レート制限の欠如などが含まれます。

これらの問題は Insecure Design (A04:2021) に該当し、セキュアなアプリケーション設計の重要性を示しています。

解決策: 環境変数を使用して API URL を管理し、CSRF 対策としてトークン認証を導入し、レート制限を適用することで、攻撃を未然に防ぐことができます。詳細はレビューコメントを参照してください。

セキュリティ設定の誤り (A05:2021)

Flask のデバッグモードが有効になっており、詳細なエラーメッセージが公開されるリスクが指摘されました。

この問題は Security Misconfiguration (A05:2021) に該当し、攻撃者がシステムの内部情報を得る可能性を高めます。

解決策: 本番環境ではデバッグモードを無効化し、環境変数で制御することを推奨します。詳細はレビューコメントを参照してください。

まとめ

自動コードレビューを活用して OWASP の脆弱性に対処することは、安全で信頼性の高いアプリケーションを維持するうえで不可欠です。開発プロセスの初期段階で脆弱性を発見できるため、セキュリティリスクの低減につながります。継続的なスキャンを行うことで、本番環境に導入される前にセキュリティ上の欠陥や誤設定、コードの非効率性を特定できます。

CodeRabbit は、脆弱性の検出とベストプラクティスの推奨を通じて、開発者が高いセキュリティ基準を維持するのを支援します。開発の早い段階で重大な問題を特定し、効率を損なうことなく、より安全なソフトウェアを構築できるようにします。自動コードレビューを活用することで、ワークフローが合理化され、チームはセキュリティチェックに追われることなく、開発に専念できます。

リスクを未然に防ぐために、今すぐサインアップして、コードを守りましょう。

]]>