
How to Classify Code Review Comments? From Conventional Comments to AI Review Tool Taxonomies

Mar 26, 2026 1 min
TL;DR Three main classification systems dominate: Conventional Comments (label-based), Google's severity prefixes (Nit/Optional/FYI), and SonarQube's four quadrants (Bug/Vulnerability/Code Smell/Hotspot). AI review tools have each developed their own taxonomies, but the core dimensions consistently converge on four areas: correctness, security, performance, and maintainability.


The most common problem with code review comments: the reviewer thinks it’s a blocking issue, but the author treats it as a suggestion and moves on. It’s not anyone’s fault — the comment itself carries no classification signal, so each side interprets it differently.

This post covers three things: the mainstream comment classification standards, how AI review tools each categorize feedback, and a curated list of classic articles and research worth reading.

Three Mainstream Classification Standards

Conventional Comments — The Most Widely Adopted Label System

From conventionalcomments.org, the format is <label> [decorations]: <subject>.

Seven core labels:

| Label | Description | Blocking? |
| --- | --- | --- |
| praise | Recognize something done well | N/A |
| nitpick | Minor, trivial changes | Non-blocking |
| suggestion | Concrete improvement proposals | Context-dependent |
| issue | Problems the user will encounter | Blocking |
| question | Unsure if there’s an issue, so asking first | Non-blocking |
| thought | Extended ideas worth considering | Non-blocking |
| chore | Tasks that must be done before merging | Blocking |

Decorations eliminate ambiguity: (blocking) means must fix, (non-blocking) means suggested but not required, (if-minor) means do it while you’re at it if it’s small.

suggestion (blocking): Please rewrite this SQL query as a parameterized query to prevent injection attacks.

The value of this system is that it forces reviewers to decide at the moment of writing whether something is actually blocking or not.
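To make the format concrete, here is a minimal sketch of a parser for `<label> [decorations]: <subject>`. The names `parse_comment`, `ParsedComment`, and `DEFAULT_BLOCKING` are made up for this example and are not part of any official Conventional Comments tooling. The default blocking values are taken from the table above, with `None` standing in for "N/A" and "context-dependent".

```python
import re
from dataclasses import dataclass
from typing import Optional

# Default blocking behaviour per label, copied from the table above.
# None means "N/A" or "context-dependent".
DEFAULT_BLOCKING = {
    "praise": None,
    "nitpick": False,
    "suggestion": None,
    "issue": True,
    "question": False,
    "thought": False,
    "chore": True,
}

# <label> [decorations]: <subject>
# Decorations are an optional comma-separated list in parentheses.
COMMENT_RE = re.compile(
    r"^(?P<label>[a-z]+)\s*(?:\((?P<decorations>[^)]*)\))?:\s*(?P<subject>.+)$",
    re.DOTALL,
)


@dataclass
class ParsedComment:
    label: str
    decorations: list[str]
    subject: str
    blocking: Optional[bool]


def parse_comment(text: str) -> Optional[ParsedComment]:
    """Parse one review comment; return None if it does not follow the format."""
    match = COMMENT_RE.match(text.strip())
    if match is None or match["label"] not in DEFAULT_BLOCKING:
        return None
    raw = match["decorations"] or ""
    decorations = [d.strip() for d in raw.split(",") if d.strip()]
    # An explicit decoration always wins over the label's default.
    if "blocking" in decorations:
        blocking = True
    elif "non-blocking" in decorations:
        blocking = False
    else:
        blocking = DEFAULT_BLOCKING[match["label"]]
    return ParsedComment(match["label"], decorations, match["subject"], blocking)
```

Running `parse_comment` on the SQL example above gives the label `suggestion` with `blocking` set to `True`. A plain `suggestion:` with no decoration comes back with `blocking` as `None`, which is exactly the ambiguity the decorations exist to remove.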

Google Engineering Practices — Lightweight Severity Prefixes

Google’s eng-practices uses three prefixes:

| Prefix | Meaning |
| --- | --- |
| Nit: | Technically should be fixed but not critical |
| Optional: / Consider: | Suggested but not required |
| FYI: | For reference; not expected to be addressed in this PR |

The core principle: reviews should ask “does this code improve the overall health of the codebase?” rather than aim for perfection. Don’t block a PR over nits. Google’s review turnaround time is about 4 hours, helped by keeping changes small (35%+ modify only one file).
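As a rough sketch, the prefixes can be used to split a review into comments the author must resolve and comments that are optional. The function names `must_address` and `triage` are hypothetical, and treating every unprefixed comment as required is an assumption made for this example rather than something the prefix table itself states.

```python
# Prefixes from the table above, lower-cased for case-insensitive matching.
OPTIONAL_PREFIXES = ("nit:", "optional:", "consider:", "fyi:")


def must_address(comment: str) -> bool:
    """Return True when the comment carries none of the optional prefixes.

    Assumption for this sketch: an unprefixed comment is one the author
    is expected to resolve before merging.
    """
    return not comment.strip().lower().startswith(OPTIONAL_PREFIXES)


def triage(comments: list[str]) -> dict[str, list[str]]:
    """Split review comments into 'required' and 'optional' buckets."""
    buckets: dict[str, list[str]] = {"required": [], "optional": []}
    for comment in comments:
        key = "required" if must_address(comment) else "optional"
        buckets[key].append(comment)
    return buckets
```

With a split like this, a PR whose `required` bucket is empty has nothing left that should hold up the merge.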

SonarQube — Rule-Driven Four-Quadrant Classification

With 6,500+ rules across 35+ languages, SonarQube offers one of the most mature static analysis taxonomies in the industry.

| Type | Description | Target False-Positive Rate |
| --- | --- | --- |
| Bug | Causes runtime errors | Near 0% |
| Vulnerability | Exploitable by attackers | <20% |
| Security Hotspot | Security-sensitive, requires human judgment | Needs review |
| Code Smell | Maintainability issues | Near 0% |

Five severity levels: BLOCKER → CRITICAL → MAJOR → MINOR → INFO.
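Because the five levels are strictly ordered, they map naturally onto an ordered enum. The sketch below is illustrative only: `Severity`, `blocking_issues`, and the `(rule_key, severity_name)` input shape are assumptions for this example and do not reflect SonarQube's API or its quality gate implementation.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The five severity levels, numbered so that comparisons follow the ordering."""
    INFO = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4
    BLOCKER = 5


def blocking_issues(issues, threshold=Severity.CRITICAL):
    """Return the issues whose severity is at or above the threshold.

    `issues` is a list of (rule_key, severity_name) pairs. This shape is
    hypothetical and chosen only to keep the example small.
    """
    return [(rule, name) for rule, name in issues if Severity[name] >= threshold]
```

A team could use a filter like this to decide which findings block a merge and which are only reported.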

SonarQube 10.3+ began transitioning toward Clean Code attributes and software quality dimensions (Reliability / Security / Maintainability), gradually replacing the older classification.

Informal but Universally Understood Prefixes

| Prefix | Meaning |
| --- | --- |
| nit: | Cosmetic, not worth blocking over |
| LGTM | Looks Good To Me |
| PTAL | Please Take A Look |
| TODO: | To be handled later |
| FIXME: | Something broken that needs immediate attention |
| ACK / NAK | Acknowledged (approval) / Negative Acknowledgement (rejection); common in the Linux kernel |

AI Code Review Tool Taxonomies

Claude Code Review — Quality Over Quantity

Only three severity levels, with correctness-only as the default:

| Marker | Category | Description |
| --- | --- | --- |
| 🔴 | Normal | Bugs that would affect production |
| 🟡 | Nit | Minor issues worth fixing but non-blocking |
| 🟣 | Pre-existing | Bugs that predate this PR |

Multiple agents analyze in parallel, a validation step filters out false positives, and comments are deduplicated and ranked before being posted. It doesn’t touch formatting preferences or test coverage — unless you explicitly request it in a REVIEW.md. Every finding includes extended reasoning explaining why it was flagged.
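The validate, deduplicate, and rank steps can be pictured with a small sketch. This is not how Claude Code Review is implemented. The `Finding` shape, the `validated` flag, and the ranking order (normal first, then nit, then pre-existing) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Lower number means posted first. The category names mirror the marker table
# above; the ordering itself is an assumption for this sketch.
SEVERITY_ORDER = {"normal": 0, "nit": 1, "pre-existing": 2}


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    severity: str
    message: str
    validated: bool  # hypothetical flag set by a separate validation step


def prepare_comments(findings):
    """Drop unvalidated findings, remove duplicates, and rank the rest."""
    # 1. Validation: keep only findings that the validation step confirmed.
    confirmed = [f for f in findings if f.validated]
    # 2. Deduplication: parallel agents may report the same problem twice.
    unique = {}
    for f in confirmed:
        unique.setdefault((f.path, f.line, f.message), f)
    # 3. Ranking: most severe first, then by location for stable output.
    return sorted(
        unique.values(),
        key=lambda f: (SEVERITY_ORDER[f.severity], f.path, f.line),
    )
```

The point of the ordering is that filtering happens before anything is posted, so the author only sees findings that survived every step.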

CodeRabbit — Comprehensive Coverage

Dual-axis classification: type × severity.

Three feedback types: ⚠️ Potential issue, 🛠️ Refactor suggestion, 🧹 Nitpick (Assertive mode only).

Four severity levels (agent layer): Critical → High → Medium → Low.

A distinctive feature is that it also generates positive feedback (praise), and integrates with Jira/Linear for ticket compliance checks. The tradeoff is noise — independent benchmarks found approximately 28% of comments to be noise or based on incorrect assumptions.

GitHub Copilot Code Review — Zero Setup but Surface-Level

Five domains: Security, Performance, Code Quality, Architecture & Design, Testing & Documentation.

The advantage is zero configuration; the downside is limited depth. It tends toward surface-level suggestions (naming, formatting, common best practices). Research found it missed all security vulnerabilities across 117 files; another test showed that 31 of 47 suggestions were things ESLint would catch, and 7 were outright wrong.

Qodo PR-Agent — Highly Configurable

Qodo PR-Agent has an open-source core in which every dimension can be toggled. Auto-labels include possible security issue, review effort [1-5], and ticket compliance. Each issue is categorized by quality dimension (reliability / maintainability / security) and comes with a remediation prompt you can paste directly into an AI tool to fix it.

Configurable review sections: PR score, whether tests are included, review effort estimation, and suggestions to split the PR.

Greptile — High Signal-to-Noise Ratio

Its categories are similar to other tools (Critical Bugs / Refactoring / Performance / Validation / Nitpicks), but it deliberately limits the number of comments. Each comment carries a confidence score — it would rather say less than generate false positives. Full codebase indexing enables it to catch cross-layer issues.
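The "say less" approach amounts to a confidence threshold plus a cap on comment count. The sketch below is a generic illustration, not Greptile's implementation. The 0.0 to 1.0 confidence scale, the default threshold of 0.8, and the cap of five comments are all hypothetical values.

```python
def select_comments(candidates, min_confidence=0.8, max_comments=5):
    """Keep only high-confidence comments, most confident first, up to a cap.

    `candidates` is a list of dicts with "message" and "confidence" keys.
    Both the dict shape and the 0.0-1.0 scale are assumptions for this sketch.
    """
    kept = [c for c in candidates if c["confidence"] >= min_confidence]
    kept.sort(key=lambda c: c["confidence"], reverse=True)
    return kept[:max_comments]
```

Raising `min_confidence` trades recall for precision: fewer real issues are surfaced, but fewer false positives reach the author.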

Review Dimension Coverage Comparison

Dimension (tools compared: Claude, CodeRabbit, Copilot, Qodo, SonarQube, Greptile)
Bug / Correctness: ✅ Core
Security Vulnerabilities: ⚠️ Weak; ✅ Strongest
Performance: Extensible
Maintainability: Extensible; ✅ Core
Style / Formatting: Off by default; Assertive mode; Configurable; Low priority
Test Coverage: Off by default
Pre-existing Issue Flagging: ✅ 🟣
Positive Feedback
Ticket Compliance

Design Philosophy Comparison

| Tool | Philosophy |
| --- | --- |
| Claude Code Review | Precision over recall: defaults to correctness only, uses a validation step to filter false positives |
| CodeRabbit | Comprehensive coverage: deep multi-dimensional analysis at the cost of higher noise |
| Copilot | Low friction: zero-config GitHub integration, broad but shallow |
| Qodo | Configurable: open-source core, every dimension can be toggled and customized |
| SonarQube | Rule-driven: 6,500+ deterministic rules, AI as a supplement |
| Greptile | High signal-to-noise: prefers saying less over generating false positives, includes confidence scores |

Research from three major companies, Google (9 million reviews), Microsoft (50,000+ developers), and Meta (experiments), converges on the same conclusion: for code review to scale, its core value should be knowledge sharing, with automation handling whatever doesn’t require human judgment.

Academic Papers

Engineering Practice

Resource Collections

The Bottom Line

The classification system itself isn’t the point. What matters is that the team has shared agreement on “does this comment need to be addressed.”

The lowest-cost approach: use the nit: prefix to distinguish blocking from non-blocking — that alone solves 80% of the problem. For something more complete, adopt Conventional Comments. AI tool classifications are useful as reference, but don’t expect them to replace your team’s own judgment.

One interesting data point: CodeRabbit found that AI-generated code has 1.7x more issues per PR than human-written code, with 75% more logic errors. AI writing code and AI reviewing code are already a reality, but the final line of defense for classification and judgment is still human.

References