Create DECISION_LOOP.md
v2/DECISION_LOOP.md ADDED (+192 -0)
# ARC-AGI-3 Agent v2.1: Integrated Decision Loop

## New Modules (Phase 2)

### 1. Undo-as-Reasoning (Action-Time-Training)
**Module**: `v2/models/undo_reasoning.py`

The Undo action is a **free experiment**, not a safety net.

```
Standard loop:        ATT loop:
Act → Observe         Act → Observe → Undo → Re-observe
                          → Update world model
                          → Re-imagine all actions
                          → Pick the BEST one
                          → Execute for real
```

**Three modes**:
- **PROBE** (early game): Test each key action, then undo it. Cost: 2 actions per key × 6 keys = 12 actions. Gain: complete mechanics map + irreversibility detection.
- **VERIFY** (mid game): Only undo when the world model's prediction was wrong. Saves budget while correcting errors.
- **EXPLOIT** (late game): Never undo. Trust the model.

**Key insight**: Each probe gives the world model TWO transitions (the action and its undo), doubling the transitions collected per probed key during exploration.
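
The three modes can be read as a small policy over the remaining action budget. The following is a minimal sketch only: `UndoMode`, `choose_mode`, `should_undo`, and the `exploit_reserve` threshold are hypothetical names and values chosen for illustration, not the contents of `undo_reasoning.py`.

```python
from enum import Enum


class UndoMode(Enum):
    PROBE = "probe"      # early game: test every key, then undo
    VERIFY = "verify"    # mid game: undo only after a wrong prediction
    EXPLOIT = "exploit"  # late game: never undo


def choose_mode(untested_keys: int, actions_left: int, exploit_reserve: int = 10) -> UndoMode:
    """Pick an undo mode from the remaining budget (the reserve is an assumed threshold)."""
    if actions_left <= exploit_reserve:
        return UndoMode.EXPLOIT
    # A probe costs two actions: the action itself plus its undo.
    if untested_keys > 0 and actions_left - exploit_reserve >= 2 * untested_keys:
        return UndoMode.PROBE
    return UndoMode.VERIFY


def should_undo(mode: UndoMode, prediction_was_wrong: bool) -> bool:
    """Decide whether the action just taken should be followed by an undo."""
    if mode is UndoMode.PROBE:
        return True
    if mode is UndoMode.VERIFY:
        return prediction_was_wrong
    return False


print(choose_mode(untested_keys=6, actions_left=30))  # UndoMode.PROBE
```

With 6 untested keys and 30 actions left, probing all keys costs 12 actions and still leaves the assumed reserve intact, so the sketch selects PROBE.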

### 2. Symbolic Memory Buffer
**Module**: `v2/models/symbolic_memory.py`

Stores rules as explicit predicates:
```
[✓] R0: IF any_state() THEN key_0 → player_moves(right, 1) (100% conf, 5✓/0✗)
[✓] R1: IF player_adjacent_to(3) THEN key_2 → object_removed(3) (100% conf, 3✓/0✗)
```

**Why this matters for cross-level generalization**:
- Level 1 teaches: "key 0 = move right, key 1 = move down"
- Level 3 teaches: "key 2 = collect when adjacent"
- Level 6 requires **composing** L1 + L3: "navigate to object, then collect"
- The RSSM latent state from Level 1 degrades by Level 6. Symbolic rules don't.

**Rule Inducer**: Watches transitions and extracts patterns:
```python
# Arguments in order: grid_before, action_key, action_pos, grid_after
inducer.observe_transition(grid_before, 0, 55, grid_after)
# → Induces: IF action_on_object() THEN key_0 → player_moves(right, 1)
```
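
To make the predicate format concrete, here is a minimal sketch of how one rule record could track confirmations and violations and render itself in the notation used above. The `Rule` dataclass and its field names are assumptions for illustration; the real structure in `symbolic_memory.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: int
    condition: str      # e.g. "player_adjacent_to(3)"
    action_key: int
    effect: str         # e.g. "object_removed(3)"
    confirmed: int = 0  # transitions that matched the rule
    violated: int = 0   # transitions that contradicted it

    @property
    def confidence(self) -> float:
        total = self.confirmed + self.violated
        return self.confirmed / total if total else 0.0

    def update(self, effect_observed: bool) -> None:
        """Count one more supporting or contradicting transition."""
        if effect_observed:
            self.confirmed += 1
        else:
            self.violated += 1

    def __str__(self) -> str:
        # Assumed convention: a rule is marked confirmed once it has evidence and no violations.
        mark = "✓" if self.confirmed > 0 and self.violated == 0 else "?"
        return (f"[{mark}] R{self.rule_id}: IF {self.condition} THEN key_{self.action_key} "
                f"→ {self.effect} ({self.confidence:.0%} conf, "
                f"{self.confirmed}✓/{self.violated}✗)")


r1 = Rule(1, "player_adjacent_to(3)", action_key=2, effect="object_removed(3)")
for _ in range(3):
    r1.update(effect_observed=True)
print(r1)
```

Because a rule is plain data rather than a latent vector, it can be carried to a later level unchanged, which is the property the composition argument above relies on.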

### 3. Boredom Detector
**Module**: `v2/models/symbolic_memory.py` (BoredomDetector class)

Detects three stuck patterns:
1. **Stagnation**: Same state for N consecutive actions
2. **Repetition**: Same action tried M times with no effect
3. **Oscillation**: A→B→A→B→A→B pattern

When bored, suggests diversification:
```python
{
    "boredom_level": 0.85,
    "try_keys": [1, 2, 3, 4, 5],    # Keys NOT tried recently
    "avoid_positions": [55],        # Positions that keep failing
    "suggestion": "random_walk"     # Break out of local loop
}
```
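
The three stuck patterns can each be checked over a short history of state hashes and no-effect actions. The sketch below is illustrative only: the class name `BoredomSketch`, the thresholds, and the `observe` signature are assumptions, not the actual `BoredomDetector` API.

```python
from collections import deque


class BoredomSketch:
    """Toy detector for stagnation, repetition, and oscillation (thresholds are assumed)."""

    def __init__(self, stagnation_n: int = 4, repetition_m: int = 3):
        self.stagnation_n = stagnation_n
        self.repetition_m = repetition_m
        # Keep at least 6 states so an A→B→A→B→A→B run is visible.
        self.states = deque(maxlen=max(6, stagnation_n))
        self.noop_actions = deque(maxlen=repetition_m)

    def observe(self, state_hash: int, action_key: int, changed: bool) -> None:
        self.states.append(state_hash)
        if changed:
            self.noop_actions.clear()
        else:
            self.noop_actions.append(action_key)

    def stagnating(self) -> bool:
        recent = list(self.states)[-self.stagnation_n:]
        return len(recent) == self.stagnation_n and len(set(recent)) == 1

    def repeating(self) -> bool:
        return (len(self.noop_actions) == self.repetition_m
                and len(set(self.noop_actions)) == 1)

    def oscillating(self) -> bool:
        s = list(self.states)[-6:]
        return (len(s) == 6 and s[0] != s[1]
                and s[0] == s[2] == s[4] and s[1] == s[3] == s[5])

    def is_bored(self) -> bool:
        return self.stagnating() or self.repeating() or self.oscillating()


det = BoredomSketch()
for i in range(6):
    det.observe(state_hash=i % 2, action_key=i % 2, changed=True)
print(det.oscillating())  # True
```

A fuller version would also derive `try_keys` and `avoid_positions` from the same history; that part is omitted here because the source does not specify how those lists are computed.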

## Integrated Decision Loop (v2.1)

```
┌─────────────────────────────────────────────────────────────────────┐
│                           MAIN AGENT LOOP                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. OBSERVE grid                                                    │
│     ↓                                                               │
│  2. RECORD transition (if not first step)                           │
│     ├── Update transition buffer                                    │
│     ├── Feed action effect tracker                                  │
│     ├── Feed rule inducer → symbolic memory                         │
│     ├── Feed boredom detector                                       │
│     └── Check world model prediction accuracy                       │
│     ↓                                                               │
│  3. CHECK UNDO REASONER                                             │
│     ├── If awaiting undo result → process it, update WM             │
│     ├── If in PROBE mode → test current action with undo            │
│     └── If in VERIFY mode + prediction wrong → undo and re-learn    │
│     ↓                                                               │
│  4. CHECK BOREDOM                                                   │
│     ├── If bored → reset goal, try untested action                  │
│     └── If not bored → continue with current strategy               │
│     ↓                                                               │
│  5. SELECT ACTION (layered strategy)                                │
│     │                                                               │
│     ├── Layer 0: Undo (if ATT says to probe/verify)                 │
│     │                                                               │
│     ├── Layer 1: Symbolic rule lookup                               │
│     │   "Is there a CONFIRMED rule for my current goal?"            │
│     │   e.g., "I want to remove object(3). Rule R1 says:            │
│     │          press key 2 when adjacent to color 3"                │
│     │   IF YES → execute that rule's action                         │
│     │                                                               │
│     ├── Layer 2: Goal-directed navigation                           │
│     │   "I know the mechanics. Navigate toward subgoal."            │
│     │   Uses action semantics (key 0 = right, key 1 = down, etc.)   │
│     │                                                               │
│     ├── Layer 3: CEM planning (world model imagination)             │
│     │   Only when: WM accuracy > 70% AND goal latent is known       │
│     │                                                               │
│     ├── Layer 4: Smart cell-select probing                          │
│     │   If cell-select budget remaining → probe next candidate      │
│     │                                                               │
│     └── Layer 5: Systematic exploration (fallback)                  │
│         Use targeted exploration with effective keys                │
│     ↓                                                               │
│  6. EXECUTE action in environment                                   │
│     ↓                                                               │
│  7. ONLINE TRAIN world model (every N steps)                        │
│     ↓                                                               │
│  8. If level complete → persist state, advance level                │
│     │                                                               │
│     └── LOOP                                                        │
└─────────────────────────────────────────────────────────────────────┘
```
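
Step 5 of the loop is a priority chain: each layer either proposes an action or passes, and the first proposal wins. The sketch below shows one way to express that. The `select_action` helper, the `ctx` dictionary keys, and the lambda stand-ins for each layer are all hypothetical; only the layer order and the "WM accuracy > 70% AND goal latent is known" gate come from the diagram.

```python
from typing import Callable, Optional, Sequence, Tuple

Action = Tuple[int, Optional[int]]  # (key, pos); pos is None for plain key presses
Layer = Tuple[str, Callable[[dict], Optional[Action]]]


def select_action(layers: Sequence[Layer], ctx: dict) -> Tuple[str, Action]:
    """Return the proposal of the highest-priority layer that has one."""
    for name, propose in layers:
        action = propose(ctx)
        if action is not None:
            return name, action
    raise RuntimeError("no layer proposed an action; the fallback layer should always propose")


# Hypothetical stand-ins for the six layers, listed in priority order.
layers: Sequence[Layer] = [
    ("undo", lambda c: (c["undo_key"], None) if c.get("att_wants_undo") else None),
    ("symbolic", lambda c: c.get("rule_action")),
    ("navigation", lambda c: c.get("nav_action")),
    ("cem", lambda c: c.get("cem_action")
        if c.get("wm_accuracy", 0.0) > 0.70 and c.get("goal_latent_known") else None),
    ("cell_select", lambda c: c.get("probe_action") if c.get("cell_budget", 0) > 0 else None),
    ("explore", lambda c: (c.get("fallback_key", 0), None)),
]

ctx = {"wm_accuracy": 0.9, "goal_latent_known": True, "cem_action": (1, None)}
print(select_action(layers, ctx))  # ('cem', (1, None))
```

Keeping the layers as an ordered list makes the priority explicit and lets a layer be disabled or reordered without touching the selection logic.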

## Action Budget Allocation (for a typical level)

Assuming humans solve a level in ~15 actions, the agent targets ≤30 actions (2× the human baseline) for the 25% RHAE floor.

| Phase | Actions | What Happens |
|-------|---------|--------------|
| Undo probes | 6-12 | Test each key + undo. Learn all mechanics. |
| Cell-select probes | 3-5 | Identify click purpose (teleport/toggle/etc.) |
| Goal setup | 0 | Visual analysis of salient objects (free) |
| Navigation | 5-10 | Move to each subgoal using learned mechanics |
| Interaction | 2-5 | Execute rules at each subgoal position |
| **Total** | **16-32** | Target: < 2× human baseline (the 32-action worst case overshoots the 30-action target by 2) |
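
The totals in the table can be checked with a few lines of arithmetic. The dictionary below simply restates the table rows; the variable names are illustrative.

```python
# (min, max) actions per phase, copied from the table above.
budget = {
    "undo_probes": (6, 12),
    "cell_select_probes": (3, 5),
    "goal_setup": (0, 0),
    "navigation": (5, 10),
    "interaction": (2, 5),
}

human_baseline = 15          # approximate human action count assumed in the text
target = 2 * human_baseline  # 2x human baseline

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())

print(low, high, target)  # 16 32 30
print(high - target)      # 2
```

The best case (16) sits well inside the target, while the worst case (32) exceeds it by two actions, so the upper ends of the probe and navigation ranges cannot all be spent on the same level.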

## Module Dependency Graph

```
            ┌─────────────────┐
            │   Environment   │
            └────────┬────────┘
                     │ grid, reward, done
                     ▼
┌─────────────────────────────────────────┐
│            Transition Buffer            │
└─┬──────┬─────┬─────┬──────┬────────┬────┘
  │      │     │     │      │        │
  ▼      ▼     ▼     ▼      ▼        ▼
┌─┴──┐ ┌─┴─┐ ┌─┴─┐ ┌─┴──┐ ┌─┴──┐ ┌───┴────┐
│CNN │ │Ex-│ │Act│ │Rule│ │Bore│ │Undo    │
│WM  │ │plo│ │Eff│ │Ind-│ │dom │ │Reason- │
│RSSM│ │rer│ │Trk│ │ucer│ │Det │ │er(ATT) │
└─┬──┘ └─┬─┘ └─┬─┘ └─┬──┘ └─┬──┘ └───┬────┘
  │      │     │     │      │        │
  ▼      ▼     ▼     ▼      ▼        ▼
┌─────────────────────────────────────────┐
│             ACTION SELECTOR             │
│  (6-layer priority: undo > symbolic     │
│   > goal-directed > CEM > cell-sel      │
│   > exploration)                        │
└────────────────────┬────────────────────┘
                     │
                     ▼
             action (key, pos)
```

## What Persists Across Levels

| Component | Persists? | Why |
|-----------|-----------|-----|
| RSSM h_state, z_state | ✓ | Latent understanding of physics |
| Symbolic Memory rules | ✓ | Explicit mechanics knowledge |
| Action Effect Tracker | ✓ | "key 0 = move right" |
| Click Affordance Map | ✓ | Which cells are interactive |
| Undo knowledge | ✓ | Which keys are reversible |
| Transition Buffer | ✓ | Training data for world model |
| Boredom state | ✗ (reset) | Fresh patience per level |
| Goal/subgoals | ✗ (reset) | New objectives per level |
| Exploration phase | ✗ (reset to targeted) | Skip scan in later levels |
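
The table amounts to a level-transition routine that keeps learned knowledge and clears per-level bookkeeping. A minimal sketch follows, assuming a single `AgentState` container; the field names, the `"scan"` phase label, and `advance_level` are hypothetical and only mirror the rows above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class AgentState:
    # Carried across levels
    rssm_state: Optional[Any] = None                     # placeholder for (h_state, z_state)
    rules: list = field(default_factory=list)            # symbolic memory
    action_effects: dict = field(default_factory=dict)   # e.g. {0: "move right"}
    affordance_map: dict = field(default_factory=dict)   # which cells are interactive
    reversible_keys: set = field(default_factory=set)    # undo knowledge
    transitions: list = field(default_factory=list)      # world model training data
    # Reset on every new level
    boredom_level: float = 0.0
    subgoals: list = field(default_factory=list)
    exploration_phase: str = "scan"


def advance_level(state: AgentState) -> AgentState:
    """Clear per-level fields in place; everything else is left untouched."""
    state.boredom_level = 0.0
    state.subgoals = []
    state.exploration_phase = "targeted"  # skip the initial scan on later levels
    return state


state = AgentState(rules=["R0", "R1"], boredom_level=0.85, subgoals=["reach object 3"])
advance_level(state)
print(state.rules, state.boredom_level, state.exploration_phase)
```

Resetting in place rather than constructing a new object keeps the persisted fields from being accidentally dropped when new components are added.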

## Files

| File | Size | Purpose |
|------|------|---------|
| `v2/models/world_model.py` | 7.4M params | CNN encoder + RSSM dynamics + decoder |
| `v2/models/exploration.py` | — | Systematic multi-phase exploration |
| `v2/models/planning.py` | — | CEM planner with world model imagination |
| `v2/models/action_effects.py` | — | Action semantics learning |
| `v2/models/smart_cell_select.py` | — | Affordance map + budget + conditional mechanics |
| `v2/models/undo_reasoning.py` | — | ATT: undo-based hypothesis testing |
| `v2/models/symbolic_memory.py` | — | Rule buffer + inducer + boredom detector |