# SecureReview: Teaching LLMs to Read Code Like a Senior Engineer
*Draft for HuggingFace blog · OpenEnv Hackathon submission, India 2026*
---
## The problem
Every existing OpenEnv environment tests the same skill — *can the agent **do** something?* Play a game, navigate a grid, call a tool, write an answer.
But there’s a different skill that matters more for the world we’re heading into: **can the agent read what’s already there, and spot what will break in production?**
Code review. Migration safety. Infrastructure misconfigurations. Vulnerable dependencies. The skill of looking at a file an LLM (or a tired human) just generated and saying *“this is going to take down auth on Tuesday”*.
That’s what **SecureReview** is — an OpenEnv environment that turns security review into a measurable RL task.
## The environment
Three review domains, all wired into the same FastAPI / Gym-style harness:
| Task | What the agent sees | What it has to find |
|---|---|---|
| `dependency_review` | `package.json`, `requirements.txt` | Vulnerable / typosquatted / hallucinated packages |
| `migration_review` | SQL migration scripts | Hot-row contention, RLS gaps, partition pruning, MVCC bloat |
| `iac_review` | Terraform, K8s YAML, Dockerfile, docker-compose, GitHub Actions | Public S3, hardcoded secrets, privileged containers, IAM wildcards |
**60+ hand-curated scenarios** across the three domains. Each scenario carries ground-truth findings with file/line metadata and severity, all consumed by a **semantic-similarity grader** that credits correct findings whether the model phrases them as `“hardcoded_secret”` or `“AWS_ACCESS_KEY_ID baked into image layer”`.
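To make that concrete, here is a minimal sketch of what one scenario record could look like. The field names (`task`, `files`, `findings`, `severity`) are illustrative assumptions, not the repo's actual schema:

```python
# Hypothetical IaC scenario -- field names are illustrative, not the actual
# SecureReview schema.
scenario = {
    "task": "iac_review",
    "files": {
        "main.tf": 'resource "aws_s3_bucket_acl" "logs" { acl = "public-read" }',
    },
    "findings": [
        {
            "category": "public_s3_bucket",  # canonical label the grader aliases against
            "file": "main.tf",
            "line": 1,
            "severity": "high",
        },
    ],
}
```

The grader only needs the canonical `category` plus severity per finding; free-form phrasing on the model side is handled by the alias tables described below.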
## The training
We ran the now-standard **hybrid pipeline**: SFT warmup on the env's ground-truth findings, then GRPO refinement against the live grader. It's the same SFT-then-RL recipe popularized by DeepSeek-R1 and widely used in modern post-training stacks.
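A condensed sketch of that pipeline using TRL's `SFTTrainer` and `GRPOTrainer`. The model name, the toy dataset rows, and the `grade()` stub are placeholder assumptions; the trainer Spaces wire in the real env grader.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

def grade(review: str) -> float:
    """Stand-in for the env's semantic-similarity grader (placeholder logic)."""
    return float("typosquat" in review.lower())

# --- Stage 1: SFT warmup on ground-truth findings (toy single-row dataset) ---
sft_data = Dataset.from_list([{
    "text": "Review this package.json:\n...\n### Findings:\n- `reqeusts` is typosquatted"
}])
sft = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",           # any small causal LM
    train_dataset=sft_data,
    args=SFTConfig(output_dir="sft-warmup", max_steps=20),
)
sft.train()
sft.save_model()                                   # checkpoint for stage 2

# --- Stage 2: GRPO refinement against the live grader ------------------------
def grader_reward(completions, **kwargs):
    # TRL passes the sampled completions; score each one with the grader.
    return [grade(c) for c in completions]

grpo = GRPOTrainer(
    model="sft-warmup",                            # resume from the SFT checkpoint
    reward_funcs=grader_reward,
    train_dataset=Dataset.from_list([{"prompt": "Review this requirements.txt:\n..."}]),
    args=GRPOConfig(output_dir="grpo-refined", max_steps=20, num_generations=4),
)
grpo.train()
```

The per-task results: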
| Task | Baseline | Trained | Δ | Wins |
|---|---|---|---|---|
| Dependency | `0.083` | `0.385` | **+0.302** | 20/24 |
| Migration | `0.170` | `0.465` | **+0.295** | 10/12 |
| IaC | `0.177` | `0.303` | **+0.126** | 6/13 |
An average **+0.24 mean-reward lift**, with individual scenarios gaining as much as **+0.91**. Each task trains in **under 30 seconds** on a single Hugging Face GPU credit.
## Why this is interesting
**The reward signal is dense by design.** Each scenario has 5–11 ground-truth findings; the grader uses category-alias dictionaries (45+ for IaC, 80+ for migration, plus CVE/package-name aliases for dependencies) so naturally phrased findings get credit. F1-based scoring with severity weighting means RL converges on an analyst-style policy: report fewer findings, but the ones that matter most.
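Here's a toy rendering of that scoring loop. The alias lists, severity weights, and function names are all made up for illustration; the real tables are far larger:

```python
# Toy alias-based matching + severity-weighted F1. Alias lists and weights
# are illustrative, not the env's real tables.
ALIASES = {
    "hardcoded_secret": ["hardcoded secret", "aws_access_key_id baked into image"],
    "public_s3_bucket": ["public s3", "bucket acl public-read"],
}
SEVERITY_WEIGHT = {"critical": 3.0, "high": 2.0, "medium": 1.0, "low": 0.5}

def canonicalize(finding_text: str) -> str | None:
    """Map a free-form finding onto a canonical category, if any alias matches."""
    text = finding_text.lower()
    for category, aliases in ALIASES.items():
        if category in text or any(a in text for a in aliases):
            return category
    return None

def weighted_f1(predicted: list[str], truth: dict[str, str]) -> float:
    """`truth` maps canonical category -> severity."""
    pred_cats = {c for f in predicted if (c := canonicalize(f))}
    tp = sum(SEVERITY_WEIGHT[truth[c]] for c in pred_cats & truth.keys())
    fn = sum(SEVERITY_WEIGHT[s] for c, s in truth.items() if c not in pred_cats)
    fp = len(pred_cats - truth.keys())  # unmatched predictions count as weight-1 FPs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(weighted_f1(
    ["AWS_ACCESS_KEY_ID baked into image layer"],
    {"hardcoded_secret": "high", "public_s3_bucket": "medium"},
))  # -> 0.8: full credit for the secret, recall penalty for missing the bucket
```

Weighting true positives and misses by severity, while counting every spurious finding against precision, is what pushes the policy toward fewer, higher-severity reports.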
**The same env scales from 1.5B to 14B.** Smaller models see larger SFT lifts because they start with more headroom below the grader's ceiling; larger models surface ceiling effects worth studying in their own right. Both are *features* the env exposes. Multi-scale runs are a one-click reproduce.
**It’s a real benchmark, not a toy.** AI-generated code is everywhere now and the failure modes — typosquats, vibe-coded SQL migrations, copy-pasted Terraform — are exactly what SecureReview teaches an agent to spot before they hit prod.
## Try it
- **Env**: [SecureReview](https://huggingface.co/spaces/sam25kat/SecureReview)
- **Trainers** (one-click reproduce):
  - [securereview-trainer](https://huggingface.co/spaces/sam25kat/securereview-trainer) (dep)
  - [securereview-trainer-migration](https://huggingface.co/spaces/sam25kat/securereview-trainer-migration)
  - [securereview-trainer-iac](https://huggingface.co/spaces/sam25kat/securereview-trainer-iac)
- **Code**: [sam25kat/Secure_Reveiw](https://github.com/sam25kat/Secure_Reveiw)
Click “Run Training” on any trainer Space: the full SFT → GRPO hybrid pipeline runs end to end, with training-loss curves and before/after reward plots, **all in one click**.
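Prefer to poke the env directly? A minimal client sketch, assuming Gym-style `/reset` and `/step` routes on the FastAPI harness; the route names, payload shapes, response fields, and port are assumptions, so check the repo for the actual API:

```python
import requests

BASE = "http://localhost:8000"  # assumed local port for the FastAPI harness

# Assumed Gym-style routes; consult the repo for the real endpoints/payloads.
obs = requests.post(f"{BASE}/reset", json={"task": "dependency_review"}).json()
print(obs["observation"])       # e.g. the package.json under review (assumed field)

review = "Finding: `reqeusts` looks typosquatted (should be `requests`). Severity: high."
result = requests.post(f"{BASE}/step", json={"action": review}).json()
print(result["reward"])         # semantic-similarity grade for the review (assumed field)
```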
---
*Built for the OpenEnv Hackathon 2026 (India) · Submission round 2.*
*The Cook House*