
arXiv Paper Quality Assessment Guide: From Endorsement Mechanisms to a Practical Checklist

May 28, 2026 1 min
TL;DR: arXiv does not perform peer review, and only about 2% of submissions are rejected. Quality judgment relies on external signals: top venue acceptance > institution + open-source reproduction > citation quality. Includes a 16-item practical checklist and a 2026 toolbox (PWC has shut down).

Over 1,000 papers are uploaded to arXiv every day. Everyone has heard that “arXiv is not peer reviewed,” but what exactly do its endorsement and moderation processes filter out, and what do they let through? This post covers arXiv’s own quality mechanisms, how to interpret external signals, tools still usable in 2026, and a checklist to run through after reading a paper.

arXiv’s Two Gatekeepers

Endorsement: A Trust Network, Not Quality Certification

Since 2004, arXiv has required first-time submitters to obtain endorsement. According to arXiv’s official documentation, an endorser’s responsibility is:

“You should not endorse the author if the author is unfamiliar with the basic facts of the field, or if the work is entirely disconnected with current work in the area.”

In other words, the endorser confirms that “this person belongs to the scientific community,” not that “this paper is correct.” New authors from recognized academic institutions typically receive automatic endorsement and never encounter this barrier in practice.

Moderation: Format Review, Not Content Review

According to arXiv’s moderation policy, moderators are volunteer domain experts with terminal degrees. They can:

  • Reclassify: Move to a more appropriate category (being moved to the general category is widely seen as a downgrade in the community)
  • Reject submissions: Wrong format, non-research papers (coursework, research proposals), plagiarism, excessive submission frequency (limit of 3 per day)

According to Scientific American’s report, about 6% of submissions are held and about 2% are rejected. In other words, roughly 98% of submissions get through, whereas Nature and Science accept fewer than 10%. arXiv’s bar is of a different kind altogether.

Once a paper is announced, it becomes a permanent academic record. arXiv only removes papers for licensing issues, and withdrawals for policy violations retain the metadata.

Conclusion: Getting on arXiv only means the format is acceptable and the author belongs to the academic community. Quality judgment must rely on external signals.

External Quality Signal Pyramid

        ┌─────────────────────────┐
        │ Top Venue Acceptance    │  NeurIPS / ICML / ICLR / ACL / CVPR
        │       (Strongest)      │
        ├─────────────────────────┤
        │ Known Institution +    │  DeepMind / FAIR / runnable code
        │ Open-Source Reproduction│
        ├─────────────────────────┤
        │   Citation Quality     │  Highly Influential Citations > raw count
        ├─────────────────────────┤
        │ arXiv Only, No         │  Requires independent verification
        │ Corroboration          │
        └─────────────────────────┘
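
Read top to bottom, the pyramid is a priority ordering: take the strongest signal a paper has and ignore the weaker ones. A minimal sketch of that reading follows. The `PaperSignals` fields and the `min_influential` cutoff are hypothetical choices for illustration, not part of any tool mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class PaperSignals:
    # Hypothetical fields, filled in by hand after skimming the paper.
    accepted_at_top_venue: bool = False
    known_institution: bool = False
    runnable_code: bool = False
    influential_citations: int = 0


def signal_tier(p: PaperSignals, min_influential: int = 5) -> str:
    """Return the strongest pyramid tier that the paper's signals support."""
    if p.accepted_at_top_venue:
        return "top venue acceptance"
    if p.known_institution and p.runnable_code:
        return "institution + open-source reproduction"
    # The cutoff is an arbitrary assumption; tune it per field and paper age.
    if p.influential_citations >= min_influential:
        return "citation quality"
    return "arXiv only: verify independently"


print(signal_tier(PaperSignals(known_institution=True, runnable_code=True)))
```

The tier is only a screening label: a paper in the bottom tier is not necessarily bad, it just has to be verified by hand.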

Conference Acceptance: The Most Direct Endorsement

An “Accepted at NeurIPS 2025” note on a paper’s front page means it passed peer review, typically by 3-4 reviewers. Approximate acceptance rates at the major AI/ML conferences:

| Tier | Conference | Acceptance Rate |
| --- | --- | --- |
| Tier 1 | NeurIPS, ICML, ICLR | ~20-25% |
| Tier 1 | ACL, EMNLP (NLP); CVPR, ICCV (CV) | ~20-25% |
| Tier 2 | AAAI, IJCAI, AISTATS, UAI | ~25-30% |

The absence of a conference annotation does not mean the paper is bad: many industry technical reports and foundation model papers (e.g., GPT-4, Llama) are never submitted to conferences. But if a paper claiming breakthrough results has neither conference acceptance nor backing from a well-known institution, extra caution is warranted.

Citation Metrics: Quality Over Quantity

The DORA Declaration explicitly opposes using Impact Factor as a proxy for individual paper quality. More meaningful approaches:

  • Semantic Scholar’s “Highly Influential Citations”: Distinguishes between “mentioned in passing in related work” and “method genuinely builds on this foundation”
  • Citation graphs: Being extended by 30 independent teams is more valuable than being mentioned in 200 papers’ related work sections
  • Citation counts are meaningless for new papers: Within 6 months of publication, citations have not yet accumulated
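
To make “quality over quantity” concrete, the share of citations that Semantic Scholar marks as highly influential can be computed from a paper record. The sketch below assumes the public Graph API paper endpoint and its `citationCount` / `influentialCitationCount` fields as I understand them; check the current API documentation before relying on it. The helper names are hypothetical, and no network request is made here.

```python
from typing import Optional
from urllib.parse import quote

# Assumed endpoint and field names of the Semantic Scholar Graph API.
S2_PAPER_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/"
FIELDS = "title,citationCount,influentialCitationCount"


def s2_url(arxiv_id: str) -> str:
    """Build the lookup URL for an arXiv identifier such as '2308.07832'."""
    paper_id = quote("ARXIV:" + arxiv_id, safe=":")
    return f"{S2_PAPER_ENDPOINT}{paper_id}?fields={FIELDS}"


def influential_share(record: dict) -> Optional[float]:
    """Fraction of citations flagged as highly influential, or None if uncited."""
    total = record.get("citationCount") or 0
    influential = record.get("influentialCitationCount") or 0
    if total == 0:
        return None  # new paper: citations have not accumulated yet
    return influential / total


# Illustrative record with made-up numbers, shaped like the JSON response.
example = {"title": "Example", "citationCount": 200, "influentialCitationCount": 30}
print(s2_url("2308.07832"))
print(influential_share(example))
```

Returning `None` for uncited papers keeps the “too new to judge” case separate from a genuinely low share.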

Open-Source Reproduction: No Code Is a Negative Signal

Since 2025, omitting code has shifted from “neutral” to a “negative signal.” But beware: a GitHub link to a repo with no commits beyond the initial README is a known superficial pattern. What truly matters is a repo that actually runs, with clearly specified seeds and a documented environment configuration.
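
The superficial-repo pattern is easy to screen for once you have a few facts about the repository. The following is a rough sketch under stated assumptions: `RepoSnapshot` is a hypothetical record you fill in by hand (or from whatever API you prefer), and the list of environment files is an illustrative guess rather than a standard.

```python
from dataclasses import dataclass, field


@dataclass
class RepoSnapshot:
    # Hypothetical summary of a repository, collected manually.
    commit_count: int
    files: list = field(default_factory=list)
    readme_text: str = ""


# Illustrative set of files that usually pin an environment.
ENV_FILES = {"requirements.txt", "environment.yml", "pyproject.toml", "Dockerfile"}


def repo_warnings(repo: RepoSnapshot) -> list:
    """Return human-readable warnings; an empty list means no obvious red flag."""
    warnings = []
    non_readme = [f for f in repo.files if not f.lower().startswith("readme")]
    if repo.commit_count <= 1 or not non_readme:
        warnings.append("placeholder repo: nothing committed beyond the README")
    if not ENV_FILES.intersection(repo.files):
        warnings.append("no environment configuration found")
    if "seed" not in repo.readme_text.lower():
        warnings.append("README does not mention random seeds")
    return warnings


print(repo_warnings(RepoSnapshot(commit_count=1, files=["README.md"])))
```

Passing these checks does not prove the code runs; it only rules out the most common empty-shell cases before you spend time cloning.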

The 2026 Paper Evaluation Toolbox

Papers With Code was shut down by Meta in July 2025, and the integrated experience that once tracked 79,817 papers, 9,327 benchmarks, and 5,628 datasets is gone (CodeSOTA record, TIB-Blog report). Here are the currently available alternatives:

| Tool | Purpose | Free |
| --- | --- | --- |
| Semantic Scholar | Citation quality analysis (Highly Influential Citations), TLDR summaries, 200M+ paper index | Yes |
| Connected Papers | Visual exploration of related fields from a seed paper (similarity-based, not citation graph) | 5 graphs/month |
| OpenReview | Read reviewer comments and scores for ICLR and other conferences directly | Yes |
| HF Daily Papers | Daily trending AI papers, community voting | Yes |
| CodeSOTA | Spiritual successor to PWC, SOTA leaderboard (with reproduction verification) | Yes |
| ar5iv / arXiv HTML | HTML version of papers, easier to read and search than PDF | Yes |
| DBLP | Verify author publication records, browse conference paper lists | Yes |

Discovery ──→ HF Daily Papers / Semantic Scholar / X

Screening ──→ Authors, institutions, conference acceptance tags

Evaluation ──→ OpenReview reviewer comments / S2 citation quality

Exploration ──→ Connected Papers related work / DBLP author records

Verification ──→ CodeSOTA / GitHub / HF Models for implementations

Red Flag List

The Paper Itself

| Red Flag | Why It’s a Problem |
| --- | --- |
| Related work cites non-existent papers | AI-generated artifact; entire paper’s credibility drops to zero |
| Tested only on self-created datasets | Cannot fairly compare with other methods |
| No ablation study | Unknown which component actually contributes |
| Reports only the most favorable metric | Selective reporting |
| No error bars / confidence intervals | Results may be random fluctuation |
| Baselines over 2 years old | Unfair comparison |
| Claims to greatly surpass SOTA but no code | Cannot be verified |
| Large discrepancy between abstract and results table numbers | Over-packaging |
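
The “no error bars” flag matters because a small gain can sit entirely inside seed-to-seed noise. Below is a minimal sketch of the check a reader can run when per-seed scores are available. It uses a normal-approximation 95% interval, which is a simplifying assumption (with only a handful of seeds a t-interval would be wider), and the scores are made up for illustration.

```python
import math
import statistics


def mean_and_ci(scores, z=1.96):
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, float("inf")  # a single run says nothing about variance
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * sem


def intervals_overlap(a, b):
    """True if the two intervals overlap, i.e. the gap may be noise."""
    mean_a, half_a = mean_and_ci(a)
    mean_b, half_b = mean_and_ci(b)
    return abs(mean_a - mean_b) <= half_a + half_b


# Made-up accuracies over five seeds each.
baseline = [71.8, 72.4, 72.1, 71.9, 72.3]
proposed = [72.2, 72.6, 71.7, 72.5, 72.0]
print(mean_and_ci(baseline), mean_and_ci(proposed))
print("gap may be noise:", intervals_overlap(baseline, proposed))
```

Overlapping intervals are a coarse heuristic, not a formal significance test, but they are enough to decide whether a headline improvement deserves a closer look.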

arXiv-Specific Pitfalls

  • Version bombing: Frequent version updates in a short period, possibly silently fixing discovered issues
  • Moved to general category: Usually a moderator’s downgrade action
  • Self-citation inflation: Heavily citing one’s own prior unreviewed arXiv papers
  • Citation cartels: A group of authors mutually citing each other to inflate numbers. According to an investigation (arXiv 2509.07257), citation cartels are already a systemic problem in academic publishing
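
Of these pitfalls, version bombing is the easiest to check mechanically, since every arXiv abstract page lists the submission date of each version. A small sketch, assuming you have copied those dates by hand; the 14-day window and the limit of 3 versions are arbitrary thresholds chosen for illustration.

```python
from datetime import date


def is_version_bombing(version_dates, window_days=14, max_versions=3):
    """True if more than `max_versions` versions fall within any `window_days` span."""
    dates = sorted(version_dates)
    for i, start in enumerate(dates):
        # Count versions from this one up to `window_days` later.
        in_window = sum(1 for d in dates[i:] if (d - start).days <= window_days)
        if in_window > max_versions:
            return True
    return False


# Made-up submission histories.
calm = [date(2025, 1, 10), date(2025, 3, 2), date(2025, 6, 20)]
busy = [date(2025, 1, 10), date(2025, 1, 12), date(2025, 1, 15), date(2025, 1, 19)]
print(is_version_bombing(calm), is_version_bombing(busy))
```

A positive result is only a prompt to diff the versions; rapid revisions right after a conference deadline or a camera-ready update are usually harmless.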

ML Reproducibility: 63.5% Success Rate

According to Raff (2019), the success rate of independently reproducing 255 papers was only 63.5% (Princeton reproducibility crisis page). Main reasons: missing code, unreported hyperparameters, random seed effects, and framework version differences.

arXiv 2407.12220 lists 43 Questionable Research Practices (QRPs), the most common of which include:

  • Train/test leakage: Test examples (or information derived from them) leaking into the training data
  • Benchmark contamination: LLM pre-training data may have already seen benchmark data
  • Unfair baseline comparisons: Carefully tuning hyperparameters for one’s own model while using defaults for baselines
  • Selective metric reporting: Only reporting the best-performing metric
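
When a paper releases its data splits, the crudest form of train/test leakage can be checked directly by looking for test examples that also appear in the training set. The sketch below is one simple way to do that, using exact matching after whitespace and case normalization. It is an illustration only, not a method taken from arXiv 2407.12220, and it will miss paraphrased or near-duplicate examples.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_examples(train, test):
    """Return the test examples whose fingerprint also occurs in the training set."""
    train_hashes = {fingerprint(t) for t in train}
    return [t for t in test if fingerprint(t) in train_hashes]


# Toy data for illustration.
train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["the cat  sat on the mat.", "Birds sing at dawn."]
print(leaked_examples(train, test))
```

Benchmark contamination in LLM pre-training is harder to detect this way because the training corpus is rarely available; there, the useful question is whether the authors report any contamination analysis at all.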

NeurIPS has adopted the ML Reproducibility Checklist, and the REFORMS framework provides a comprehensive checklist covering 8 modules and 32 items (arXiv 2308.07832).

Practical Checklist: What to Check After Reading an arXiv ML Paper

Synthesized from the REFORMS checklist, ML Reproducibility Checklist, and CodeSOTA guide:

Datasets

  • Uses standard benchmarks for the task
  • Data preprocessing has sufficient detail to reproduce
  • Train/val/test splits are standard, or custom with a stated justification

Baselines

  • Baselines are recent (within 12-18 months)
  • Baselines are run by the authors themselves, not copied from other papers
  • Baselines use the same compute budget

Metrics & Statistics

  • Reports all standard metrics for the task
  • Includes error bars or confidence intervals
  • Reports computational cost and inference speed

Reproducibility

  • Code is publicly available
  • Hyperparameters are fully listed
  • Training hardware and duration are disclosed

Integrity

  • Includes data leakage / contamination analysis
  • Shows failure cases (not just successes)
  • Limitations section honestly discusses constraints
  • Ablation study tests all key components

Per CodeSOTA’s recommendation: if more than 3 items are unchecked, treat the results as “preliminary and unverified.”
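
The checklist and the “more than 3 unchecked” rule above translate directly into a small scoring helper. This is a sketch for personal note-taking: the short item labels and the `assess` function are hypothetical names, and the threshold is the one quoted above.

```python
# Short labels for the checklist items above, grouped by section.
CHECKLIST = {
    "Datasets": ["standard benchmarks", "preprocessing reproducible", "splits standard or justified"],
    "Baselines": ["baselines recent", "baselines rerun by authors", "same compute budget"],
    "Metrics & Statistics": ["all standard metrics", "error bars or confidence intervals", "compute cost and speed"],
    "Reproducibility": ["code is public", "hyperparameters listed", "hardware and duration disclosed"],
    "Integrity": ["leakage analysis", "failure cases shown", "honest limitations", "ablation of key components"],
}


def assess(checked, max_unchecked=3):
    """Return (unchecked items, verdict) for the set of items the paper satisfies."""
    all_items = [item for items in CHECKLIST.values() for item in items]
    unchecked = [item for item in all_items if item not in checked]
    if len(unchecked) > max_unchecked:
        verdict = "preliminary and unverified"
    else:
        verdict = "passes checklist screening"
    return unchecked, verdict


# Example: a paper that satisfies only the dataset and baseline items.
satisfied = set(CHECKLIST["Datasets"]) | set(CHECKLIST["Baselines"])
missing, verdict = assess(satisfied)
print(len(missing), verdict)
```

Keeping the unchecked items in the result, rather than only the verdict, makes it easy to see which gaps to raise with the authors or to verify yourself.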

The Big Picture

Judging the quality of arXiv papers is a skill that requires practice. The core principle: arXiv’s threshold only filters format; quality judgment is up to you.

The most efficient approach is to start screening with external signals (conference acceptance, institution, open source), then use the checklist to closely examine the experimental design of papers that pass initial screening. Tools will change (the shutdown of PWC is the best example), but the judgment logic of “checking whether baselines are fair, whether ablations are complete, and whether results are reproducible” will not.
