guychuk committed on
Commit
3ef3aaa
·
verified ·
1 Parent(s): bc01c42

Create DECISION_LOOP.md

Files changed (1)
  1. v2/DECISION_LOOP.md +192 -0
v2/DECISION_LOOP.md ADDED
@@ -0,0 +1,192 @@
+ # ARC-AGI-3 Agent v2.1: Integrated Decision Loop
+
+ ## New Modules (Phase 2)
+
+ ### 1. Undo-as-Reasoning (Action-Time-Training)
+ **Module**: `v2/models/undo_reasoning.py`
+
+ The Undo action is a **free experiment**, not a safety net.
+
+ ```
+ Standard loop:          ATT loop:
+   Act → Observe           Act → Observe → Undo → Re-observe
+                                 → Update world model
+                                 → Re-imagine all actions
+                                 → Pick the BEST one
+                                 → Execute for real
+ ```
+
+ **Three modes**:
+ - **PROBE** (early game): Test each key action + undo. Cost: 2 actions per key × 6 keys = 12 actions. Gain: complete mechanics map + irreversibility detection.
+ - **VERIFY** (mid game): Only undo when world model prediction was wrong. Saves budget while correcting errors.
+ - **EXPLOIT** (late game): Never undo. Trust the model.
+
+ **Key insight**: Each probe gives us TWO transitions (action + undo) for the world model, doubling learning speed during exploration.
+
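The three modes above can be sketched as a small selection function. This is a minimal illustration, not the contents of `undo_reasoning.py`: the names `UndoMode`, `probe_cost`, `choose_mode`, `should_undo`, and the `exploit_after` cut-off are all hypothetical. Only the PROBE cost (2 actions per key) and the per-mode undo behaviour come from the description above.

```python
from enum import Enum


class UndoMode(Enum):
    PROBE = "probe"
    VERIFY = "verify"
    EXPLOIT = "exploit"


def probe_cost(num_keys: int) -> int:
    """Each probe spends one action plus one undo."""
    return 2 * num_keys


def choose_mode(actions_used: int, num_keys: int, exploit_after: int) -> UndoMode:
    """Pick a mode from how far into the level the agent is.

    exploit_after is an assumed cut-off for "late game"; the source
    does not say how the phases are delimited.
    """
    if actions_used < probe_cost(num_keys):
        return UndoMode.PROBE
    if actions_used < exploit_after:
        return UndoMode.VERIFY
    return UndoMode.EXPLOIT


def should_undo(mode: UndoMode, prediction_was_wrong: bool) -> bool:
    """PROBE always undoes, VERIFY only on a wrong prediction, EXPLOIT never."""
    if mode is UndoMode.PROBE:
        return True
    if mode is UndoMode.VERIFY:
        return prediction_was_wrong
    return False
```

With 6 keys, `probe_cost(6)` gives the 12 actions quoted for PROBE mode; after that the sketch only undoes when the world model mispredicts.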
+ ### 2. Symbolic Memory Buffer
+ **Module**: `v2/models/symbolic_memory.py`
+
+ Stores rules as explicit predicates:
+ ```
+ [✓] R0: IF any_state() THEN key_0 → player_moves(right, 1)  (100% conf, 5✓/0✗)
+ [✓] R1: IF player_adjacent_to(3) THEN key_2 → object_removed(3)  (100% conf, 3✓/0✗)
+ ```
+
+ **Why this matters for cross-level generalization**:
+ - Level 1 teaches: "key 0 = move right, key 1 = move down"
+ - Level 3 teaches: "key 2 = collect when adjacent"
+ - Level 6 requires **composing** L1 + L3: "navigate to object, then collect"
+ - RSSM latent state from Level 1 degrades by Level 6. Symbolic rules don't.
+
+ **Rule Inducer**: Watches transitions and extracts patterns:
+ ```python
+ # Positional order: grid_before, action_key, action_pos, grid_after
+ inducer.observe_transition(grid_before, 0, 55, grid_after)
+ # → Induces: IF action_on_object() THEN key_0 → player_moves(right, 1)
+ ```
+
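One way to picture a stored rule is as a small record that counts confirmations and contradictions. The sketch below is an assumption, not the actual `symbolic_memory.py` structure: the `Rule` class and its field names are hypothetical, and confidence is assumed to be confirmations divided by total observations, which is consistent with the `100% conf, 5✓/0✗` lines shown earlier but is not stated in the source.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: int
    condition: str          # e.g. "player_adjacent_to(3)"
    action_key: int
    effect: str             # e.g. "object_removed(3)"
    confirmations: int = 0
    contradictions: int = 0

    @property
    def confidence(self) -> float:
        # Assumed definition: share of observations that matched the rule.
        total = self.confirmations + self.contradictions
        return self.confirmations / total if total else 0.0

    def record(self, effect_observed: bool) -> None:
        """Count one more observation for or against the rule."""
        if effect_observed:
            self.confirmations += 1
        else:
            self.contradictions += 1

    def __str__(self) -> str:
        return (
            f"R{self.rule_id}: IF {self.condition} THEN key_{self.action_key} "
            f"→ {self.effect} ({self.confidence:.0%} conf, "
            f"{self.confirmations}✓/{self.contradictions}✗)"
        )


# Usage: three confirming observations of the "collect when adjacent" rule.
r1 = Rule(1, "player_adjacent_to(3)", 2, "object_removed(3)")
for _ in range(3):
    r1.record(True)
```

Because the rule is plain data rather than a latent vector, it can be carried to a later level unchanged, which is the property the generalization bullets rely on.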
+ ### 3. Boredom Detector
+ **Module**: `v2/models/symbolic_memory.py` (`BoredomDetector` class)
+
+ Detects three stuck patterns:
+ 1. **Stagnation**: Same state for N consecutive actions
+ 2. **Repetition**: Same action tried M times with no effect
+ 3. **Oscillation**: A→B→A→B→A→B pattern
+
+ When bored, suggests diversification:
+ ```python
+ {
+     "boredom_level": 0.85,
+     "try_keys": [1, 2, 3, 4, 5],   # Keys NOT tried recently
+     "avoid_positions": [55],       # Positions that keep failing
+     "suggestion": "random_walk"    # Break out of local loop
+ }
+ ```
+
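The three stuck patterns can be checked with a short history of state hashes and actions. The following is a rough sketch under stated assumptions, not the real `BoredomDetector`: the class name `BoredomSketch`, its method names, and the default thresholds (N = 4, M = 3) are hypothetical, and states are assumed to be hashable summaries of the grid.

```python
from collections import deque


class BoredomSketch:
    """Detects stagnation, repetition and oscillation; thresholds are assumed."""

    def __init__(self, stagnation_n: int = 4, repetition_m: int = 3):
        self.stagnation_n = stagnation_n
        self.repetition_m = repetition_m
        # Keep enough states for both the stagnation and the 6-step oscillation check.
        self.states = deque(maxlen=max(6, stagnation_n))
        # Each entry is (action, had_effect).
        self.actions = deque(maxlen=repetition_m)

    def observe(self, state_hash, action, had_effect: bool) -> None:
        self.states.append(state_hash)
        self.actions.append((action, had_effect))

    def stagnating(self) -> bool:
        """Same state for the last N observations."""
        recent = list(self.states)[-self.stagnation_n:]
        return len(recent) == self.stagnation_n and len(set(recent)) == 1

    def repeating(self) -> bool:
        """Same action tried M times in a row with no effect."""
        if len(self.actions) < self.repetition_m:
            return False
        keys = {action for action, _ in self.actions}
        return len(keys) == 1 and not any(effect for _, effect in self.actions)

    def oscillating(self) -> bool:
        """A, B, A, B, A, B over the last six observations."""
        s = list(self.states)[-6:]
        if len(s) < 6:
            return False
        return s[0] != s[1] and s[0::2] == [s[0]] * 3 and s[1::2] == [s[1]] * 3

    def is_bored(self) -> bool:
        return self.stagnating() or self.repeating() or self.oscillating()
```

A caller would feed `observe` once per step and switch to a diversification suggestion like the dictionary above when `is_bored()` returns true.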
+ ## Integrated Decision Loop (v2.1)
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │                           MAIN AGENT LOOP                           │
+ ├─────────────────────────────────────────────────────────────────────┤
+ │                                                                     │
+ │  1. OBSERVE grid                                                    │
+ │     │                                                               │
+ │  2. RECORD transition (if not first step)                           │
+ │     ├── Update transition buffer                                    │
+ │     ├── Feed action effect tracker                                  │
+ │     ├── Feed rule inducer → symbolic memory                         │
+ │     ├── Feed boredom detector                                       │
+ │     └── Check world model prediction accuracy                       │
+ │     │                                                               │
+ │  3. CHECK UNDO REASONER                                             │
+ │     ├── If awaiting undo result → process it, update WM             │
+ │     ├── If in PROBE mode → test current action with undo            │
+ │     └── If in VERIFY mode + prediction wrong → undo and re-learn    │
+ │     │                                                               │
+ │  4. CHECK BOREDOM                                                   │
+ │     ├── If bored → reset goal, try untested action                  │
+ │     └── If not bored → continue with current strategy               │
+ │     │                                                               │
+ │  5. SELECT ACTION (layered strategy)                                │
+ │     │                                                               │
+ │     ├── Layer 0: Undo (if ATT says to probe/verify)                 │
+ │     │                                                               │
+ │     ├── Layer 1: Symbolic rule lookup                               │
+ │     │   "Is there a CONFIRMED rule for my current goal?"            │
+ │     │   e.g., "I want to remove object(3). Rule R1 says:            │
+ │     │          press key 2 when adjacent to color 3"                │
+ │     │   IF YES → execute that rule's action                         │
+ │     │                                                               │
+ │     ├── Layer 2: Goal-directed navigation                           │
+ │     │   "I know the mechanics. Navigate toward subgoal."            │
+ │     │   Uses action semantics (key 0 = right, key 1 = down, etc.)   │
+ │     │                                                               │
+ │     ├── Layer 3: CEM planning (world model imagination)             │
+ │     │   Only when: WM accuracy > 70% AND goal latent is known       │
+ │     │                                                               │
+ │     ├── Layer 4: Smart cell-select probing                          │
+ │     │   If cell-select budget remaining → probe next candidate      │
+ │     │                                                               │
+ │     └── Layer 5: Systematic exploration (fallback)                  │
+ │         Use targeted exploration with effective keys                │
+ │     │                                                               │
+ │  6. EXECUTE action in environment                                   │
+ │     │                                                               │
+ │  7. ONLINE TRAIN world model (every N steps)                        │
+ │     │                                                               │
+ │  8. If level complete → persist state, advance level                │
+ │     │                                                               │
+ │     └── LOOP                                                        │
+ └─────────────────────────────────────────────────────────────────────┘
+ ```
+
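Step 5 of the loop above is a priority cascade: each layer either proposes an action or declines, and the first proposal wins. A minimal sketch of that pattern follows. The function `select_action`, the `Layer` and `Action` type aliases, and the stub layers are hypothetical; only the layer order and the CEM gate (world model accuracy above 70% and a known goal latent) are taken from the diagram.

```python
from typing import Callable, Optional, Sequence, Tuple

Action = Tuple[int, Optional[int]]               # (key, pos); pos may be None
Layer = Tuple[str, Callable[[], Optional[Action]]]


def select_action(layers: Sequence[Layer]) -> Tuple[str, Action]:
    """Return the proposal of the highest-priority layer that offers one."""
    for name, propose in layers:
        action = propose()
        if action is not None:
            return name, action
    raise RuntimeError("no layer proposed an action; the fallback should always do so")


# Stub inputs for illustration only.
wm_accuracy = 0.65
goal_latent_known = True

layers = [
    ("undo", lambda: None),                       # Layer 0: nothing to probe/verify
    ("symbolic", lambda: None),                   # Layer 1: no confirmed rule for the goal
    ("goal_directed", lambda: (0, None)),         # Layer 2: move right toward subgoal
    # Layer 3: CEM only proposes when the gate from the diagram is satisfied.
    ("cem", lambda: (3, None) if wm_accuracy > 0.70 and goal_latent_known else None),
    ("cell_select", lambda: None),                # Layer 4: no probe budget left
    ("exploration", lambda: (1, None)),           # Layer 5: fallback always proposes
]

chosen = select_action(layers)
```

Here `chosen` is `("goal_directed", (0, None))`, because layers 0 and 1 decline and layer 2 is the first to propose. Keeping the layers as an ordered list makes the priority explicit and easy to reorder.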
+ ## Action Budget Allocation (for a typical level)
+
+ Assuming humans solve a level in ~15 actions, the agent targets ≤30 actions for a 25% RHAE floor.
+
+ | Phase | Actions | What Happens |
+ |-------|---------|-------------|
+ | Undo probes | 6-12 | Test each key + undo. Learn all mechanics. |
+ | Cell-select probes | 3-5 | Identify click purpose (teleport/toggle/etc.) |
+ | Goal setup | 0 | Visual analysis of salient objects (free) |
+ | Navigation | 5-10 | Move to each subgoal using learned mechanics |
+ | Interaction | 2-5 | Execute rules at each subgoal position |
+ | **Total** | **16-32** | Target: < 2× human baseline |
+
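As a quick check on the table above, the per-phase ranges can be summed and compared with the 2× human baseline. This is only arithmetic on the numbers quoted in this section; the variable names are illustrative.

```python
# (low, high) action counts per phase, copied from the budget table.
phases = {
    "Undo probes": (6, 12),
    "Cell-select probes": (3, 5),
    "Goal setup": (0, 0),
    "Navigation": (5, 10),
    "Interaction": (2, 5),
}

low = sum(lo for lo, _ in phases.values())
high = sum(hi for _, hi in phases.values())

human_baseline = 15          # "~15 actions" from the text
cap = 2 * human_baseline     # the 2x human baseline target
headroom = cap - high        # negative means the worst case overshoots the target
```

With these figures the totals come out to 16 and 32, matching the table, and the upper bound sits 2 actions above the 30-action target. A level that uses the full probe budget therefore has to save actions in navigation or interaction to stay within it.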
+ ## Module Dependency Graph
+
+ ```
+             ┌─────────────────┐
+             │   Environment   │
+             └────────┬────────┘
+                      │ grid, reward, done
+                      ▼
+ ┌─────────────────────────────────────────┐
+ │            Transition Buffer            │
+ └──┬─────┬─────┬─────┬──────┬───────┬─────┘
+    │     │     │     │      │       │
+    ▼     ▼     ▼     ▼      ▼       ▼
+ ┌──┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴──┐ ┌─┴──┐ ┌──┴─────┐
+ │CNN │ │Ex-│ │Act│ │Rule│ │Bore│ │Undo    │
+ │WM  │ │plo│ │Eff│ │Ind-│ │dom │ │Reason- │
+ │RSSM│ │rer│ │Trk│ │ucer│ │Det │ │er(ATT) │
+ └──┬─┘ └─┬─┘ └─┬─┘ └─┬──┘ └─┬──┘ └──┬─────┘
+    │     │     │     │      │       │
+    ▼     ▼     ▼     ▼      ▼       ▼
+ ┌─────────────────────────────────────────┐
+ │             ACTION SELECTOR             │
+ │ (6-layer priority: undo > symbolic      │
+ │  > goal-directed > CEM > cell-sel       │
+ │  > exploration)                         │
+ └────────────────────┬────────────────────┘
+                      │
+                      ▼
+              action (key, pos)
+ ```
+
+ ## What Persists Across Levels
+
+ | Component | Persists? | Why |
+ |-----------|-----------|-----|
+ | RSSM h_state, z_state | ✓ | Latent understanding of physics |
+ | Symbolic Memory rules | ✓ | Explicit mechanics knowledge |
+ | Action Effect Tracker | ✓ | "key 0 = move right" |
+ | Click Affordance Map | ✓ | Which cells are interactive |
+ | Undo knowledge | ✓ | Which keys are reversible |
+ | Transition Buffer | ✓ | Training data for world model |
+ | Boredom state | ✗ (reset) | Fresh patience per level |
+ | Goal/subgoals | ✗ (reset) | New objectives per level |
+ | Exploration phase | ✗ (reset to targeted) | Skip scan in later levels |
+
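The split between persisted and reset state can be sketched as a single container with a level-advance hook. This is an illustrative assumption about structure, not the agent's actual state object: `AgentState`, its field names, and the phase strings `"scan"` and `"targeted"` are hypothetical, and only a subset of the table's components is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    # Persisted across levels (rows marked as persisting in the table).
    rules: list = field(default_factory=list)
    action_effects: dict = field(default_factory=dict)
    reversible_keys: set = field(default_factory=set)
    transitions: list = field(default_factory=list)
    # Reset at each level boundary (rows marked as reset in the table).
    boredom_level: float = 0.0
    subgoals: list = field(default_factory=list)
    exploration_phase: str = "scan"

    def advance_level(self) -> None:
        """Clear per-level fields and leave learned knowledge untouched."""
        self.boredom_level = 0.0
        self.subgoals = []
        # Later levels skip the initial scan and start in targeted exploration.
        self.exploration_phase = "targeted"


# Usage: knowledge learned on one level survives the transition.
state = AgentState()
state.rules.append("R0")
state.action_effects[0] = "move right"
state.boredom_level = 0.9
state.subgoals.append("reach object 3")
state.advance_level()
```

After `advance_level()`, `state.rules` and `state.action_effects` are unchanged while the boredom level and subgoals are cleared, mirroring the table.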
+ ## Files
+
+ | File | Size | Purpose |
+ |------|------|---------|
+ | `v2/models/world_model.py` | 7.4M params | CNN encoder + RSSM dynamics + decoder |
+ | `v2/models/exploration.py` | — | Systematic multi-phase exploration |
+ | `v2/models/planning.py` | — | CEM planner with world model imagination |
+ | `v2/models/action_effects.py` | — | Action semantics learning |
+ | `v2/models/smart_cell_select.py` | — | Affordance map + budget + conditional mechanics |
+ | `v2/models/undo_reasoning.py` | — | ATT: undo-based hypothesis testing |
+ | `v2/models/symbolic_memory.py` | — | Rule buffer + inducer + boredom detector |
+