Better Ways to Build Self-Improving AI Agents

 Synthesizing NeurIPS 2025 and Prior Work

Generated by AI based on papers retrieved from Exa’s NeurIPS paper search using the query “self improving agents”
📎 NeurIPS search result: exa.ai/search/neurips/cmis1jkew00013b6wn9y8x9gy
🔍 Powered by: exa.ai/search/neurips


1. Introduction

A growing body of NeurIPS work is converging on a simple idea: agents shouldn’t be static models with fixed prompts. Instead, they should practice, reflect, generate their own curricula, and rewrite parts of themselves.

This paper synthesizes techniques that make AI agents self-improving in a fairly strict sense:

  • The agent changes its own behavior over time (not just sampling multiple outputs).
  • The change is driven primarily by the agent’s own experience, feedback, or generated data, not human labels.
  • The mechanism is integrated into the agent loop (reflection, self-training, self-modification, etc.), rather than being a one-off offline fine-tune.

We focus on research around NeurIPS 2025 (e.g., Self-Challenging Agents, SEAL, Self-Generated In-Context Examples, SiriuS, Self-Improving Embodied Foundation Models, STaSC), and situate them within a broader landscape of earlier foundational work (Reflexion, Self-Refine, RISE, STaR, SELF, Voyager, SICA, Gödel Agent, and others).

We organize the space along six main mechanisms:

  1. Self-reflection and in-loop feedback (prompt-level improvement without changing weights).
  2. Self-generated data and curricula (agents create the data they learn from).
  3. Self-adapting models (agents that fine-tune or edit themselves).
  4. Self-improving code agents (agents that modify their own code, policies, or architecture).
  5. Embodied self-improvement (agents learning by acting in environments).
  6. Verification, safety, and control (keeping self-improvement from going off the rails).

Throughout, we highlight design patterns, pros/cons, and practical lessons for anyone building next-gen self-improving agents.


2. Self-Reflection and In-Loop Feedback

2.1 Prompt-Level Self-Improvement: Reflexion and Self-Refine

The simplest form of self-improvement is reflection at inference time:

  • Reflexion (Shinn et al., 2023) lets an LLM agent solve a task, see that it failed, write a natural-language critique of its own attempt, store that “reflection,” and try again conditioned on that feedback [7].
    • On coding benchmarks like HumanEval, this verbal RL bumped pass@1 from GPT-4 baseline levels up to ~91%, outperforming a straight GPT-4 run.
    • Pros:
      • No weight updates; cheap to adopt for any LLM.
      • Strong empirical gains across code and reasoning.
    • Cons:
      • Improvements are ephemeral unless reflections are persisted and reused.
      • The model can hallucinate bad reflections and reinforce them.
  • Self-Refine (Madaan et al., 2023) uses a similar pattern: generate → critique → revise, repeating until convergence [8].
    • Works well for text and code generation, often improving quality over single-shot outputs.

Design takeaway:
Reflection-style loops are easy to bolt onto existing agents and often give large incremental gains for small engineering effort. But they don’t change the underlying model; think of them as a runtime optimization layer, not long-term learning.
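
To make the pattern concrete, here is a minimal sketch of a Reflexion-style generate → evaluate → reflect → retry loop. The `llm` and `evaluate` callables are hypothetical stand-ins (any text-in/text-out model and any task-specific checker such as unit tests), not the paper's implementation:

```python
from typing import Callable

def reflexion_loop(
    task: str,
    llm: Callable[[str], str],                    # any text-in / text-out model call
    evaluate: Callable[[str], tuple[bool, str]],  # returns (success, feedback), e.g. from unit tests
    max_trials: int = 4,
) -> str:
    """Generate -> evaluate -> reflect -> retry, carrying reflections forward in the prompt."""
    reflections: list[str] = []
    attempt = ""
    for _ in range(max_trials):
        memory = "\n".join(f"- {r}" for r in reflections)
        attempt = llm(f"Task:\n{task}\n\nLessons from earlier attempts:\n{memory}\n\nAnswer:")
        success, feedback = evaluate(attempt)
        if success:
            return attempt
        # Verbal "reinforcement": the agent critiques its own failed attempt in natural language.
        reflections.append(
            llm(
                f"Task:\n{task}\nAttempt:\n{attempt}\nFeedback:\n{feedback}\n"
                "Write one short lesson to avoid this mistake next time."
            )
        )
    return attempt  # best effort after exhausting trials
```

The quality of `evaluate` is what keeps the loop honest: unit tests for code, exact-match checks for QA, and similar verifiers elsewhere.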


2.2 Learning to Self-Correct: RISE, STaR, SELF, STaSC

A stronger variant is to train the model itself to be better at self-correction, so it “knows how to improve” internally:

  • RISE – Recursive Introspection (Qu et al., 2024) fine-tunes models on multi-turn traces where an initial answer is wrong, feedback arrives, and a corrected answer follows [9].
    • After training, the model can simulate this introspection loop internally at inference time, improving multi-step math reasoning without explicit external scaffolding.
    • Pros: Self-correction becomes a built-in capability.
    • Cons: Needs curated mistake–fix traces; may be domain-specific (e.g., math).
  • STaR – Self-Taught Reasoner (Zelikman et al., 2022) and SELF (Lu et al., 2023) use self-generated reasoning traces: the model produces lots of solutions, filters for correct ones, and then fine-tunes on these reasoning paths [12,13].
    • Over iterations, small models become substantially stronger reasoners—purely from their own generated proofs / derivations.
  • STaSC – Self-Taught Self-Correction (Moskvoretskii et al., 2025) pushes this idea into open-domain QA [6].
    • A small LM generates an answer, then a correction, and gets fine-tuned on the corrected outputs—no human labels.
    • This closes a lot of the gap between 2B-scale models and much larger baselines on benchmarks like Natural Questions.

Design takeaway:
These works show that “learning to improve” can itself be a training objective: you don’t just train for “good answers,” you train for good corrections and good reasoning traces. This blurs the line between “agent loop” and “model training.”
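
As a hedged illustration (not any specific paper's pipeline), the sketch below turns mistake → feedback → fix traces into chat-style supervised fine-tuning records, keeping only traces whose correction passed some verifier:

```python
import json
from typing import Iterable

def build_correction_records(
    traces: Iterable[dict],  # each: {"question", "wrong_answer", "feedback", "corrected_answer", "verified"}
    is_correct=lambda t: t.get("verified", False),  # keep only traces whose fix passed a checker
) -> list[dict]:
    """Format mistake -> feedback -> fix traces as chat-style SFT examples."""
    records = []
    for t in traces:
        if not is_correct(t):
            continue  # unverified corrections would otherwise be reinforced
        records.append({
            "messages": [
                {"role": "user", "content": t["question"]},
                {"role": "assistant", "content": t["wrong_answer"]},
                {"role": "user", "content": f"Feedback: {t['feedback']}. Please revise your answer."},
                {"role": "assistant", "content": t["corrected_answer"]},
            ]
        })
    return records

def dump_jsonl(records: list[dict], path: str) -> None:
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```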


2.3 Self-Reward and Self-Consistency as Implicit Self-Improvement

Some techniques don’t explicitly change the agent, but shape its effective behavior:

  • Self-Consistency (Wang et al., 2022) samples multiple reasoning chains and picks the majority answer [10]. This passively improves reliability.
  • Self-Rewarding Language Models (Yuan et al., 2024) have the model score its own outputs and use those scores as a reward signal for RL, bypassing a separate reward model [11].

These techniques don’t always fit the strict sense of “agents that change themselves over time,” but they are useful building blocks: they provide better internal feedback signals that other self-improvement loops can leverage.
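
For instance, self-consistency reduces to a majority vote over sampled reasoning chains; a minimal sketch, assuming a stochastic `llm_sample` callable and an `extract_answer` parser (both hypothetical):

```python
from collections import Counter
from typing import Callable

def self_consistency(
    question: str,
    llm_sample: Callable[[str], str],      # one stochastic sample per call (temperature > 0)
    extract_answer: Callable[[str], str],  # pulls the final answer out of a reasoning chain
    n_samples: int = 16,
) -> str:
    """Sample several reasoning chains and return the majority final answer."""
    answers = [
        extract_answer(llm_sample(f"{question}\nLet's think step by step."))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```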


3. Self-Generated Data and Auto-Curricula

The most NeurIPS-2025-flavored thread is agents that generate the data and tasks they learn from.

3.1 Self-Challenging Language Model Agents

Self-Challenging Agents (Zhou et al., NeurIPS 2025) let an LLM play two roles: challenger and executor [1].

  • The challenger creates new tasks in a “Code-as-Task” format: an instruction plus verified test code.
  • The executor tries to solve the task; the tests provide a scalar reward.
  • Successfully solved tasks become training data; RL on this self-generated set doubles performance of a LLaMA-3.1-8B agent on tool-use benchmarks like M³ToolEval and TauBench compared to no self-training.

Pros

  • Fully label-free: no human annotations for tasks or rewards.
  • Tasks automatically scale with capability: as the agent gets better, it can generate harder tasks.
  • The presence of test code provides precise, non-linguistic feedback.

Cons

  • Requires domains where success can be coded as tests; many real-world tasks lack such clean verifiers.
  • Risk of curriculum collapse: the agent keeps generating tasks near its comfort zone unless explicitly pushed toward diversity.
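
A minimal sketch of one challenger/executor round in the Code-as-Task spirit. Everything here (the prompts, the `check(ns)` convention, the use of `exec`) is an illustrative assumption, and generated code must run in a real sandbox in practice:

```python
from typing import Callable

def code_as_task_round(llm: Callable[[str], str], topic: str) -> dict:
    """One Self-Challenging-style round: the challenger invents a task with tests, the executor solves it."""
    # Challenger: an instruction plus verification code (the "Code-as-Task" format).
    instruction = llm(f"Invent one small, automatically verifiable {topic} task. Output only the instruction.")
    test_code = llm(
        f"Write Python test code for this task:\n{instruction}\n"
        "Define `def check(ns) -> bool` that inspects the solution's namespace `ns`."
    )
    # Executor: attempts the task.
    solution_code = llm(f"Solve this task with Python code only, no prose:\n{instruction}")

    # The tests provide a scalar reward. WARNING: run generated code in a real sandbox, not bare exec().
    ns: dict = {}
    try:
        exec(solution_code, ns)
        exec(test_code, ns)
        reward = 1.0 if ns["check"](ns) else 0.0
    except Exception:
        reward = 0.0
    # Solved tasks (reward == 1.0) become self-generated RL / fine-tuning data.
    return {"instruction": instruction, "solution": solution_code, "reward": reward}
```

The key design choice is that the reward comes from executing tests, not from another LLM's opinion, which is what makes the resulting data usable for RL.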

3.2 Self-Generated In-Context Examples for Sequential Decision-Making

Self-Generated In-Context Examples (Sarukkai et al., NeurIPS 2025) apply a simpler idea to sequential environments [2]:

  • Whenever the agent successfully solves a task (e.g., in ALFWorld or other interactive domains), it stores the full successful trajectory.
  • Future tasks are solved by prompting the model with a few past successful trajectories as in-context examples.
  • This simple mechanism lifts ALFWorld performance from 73% → 89% and, in some settings, from 73% → 93%, rivaling or surpassing much larger frozen models—purely by reusing self-generated experience.

Pros

  • Extremely easy to implement: it’s experience replay for prompting.
  • No gradient updates required; works with any strong LLM as-is.
  • Acts as a memory system of “how I solved this before”, which is exactly what agents need in practice.

Cons

  • Context length is limited; you can’t stuff all trajectories into a single prompt, so curation strategies are needed.
  • If the database isn’t curated, the agent may replay suboptimal patterns or overfit to narrow niches.
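
A hedged sketch of this "experience replay for prompting" idea, with a naive keyword-overlap retriever standing in for whatever curation or embedding-based retrieval strategy you actually use:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TrajectoryMemory:
    """Stores successful trajectories and replays the most similar ones as in-context examples."""
    trajectories: list[tuple[str, str]] = field(default_factory=list)  # (task description, full trace)

    def add_success(self, task: str, trace: str) -> None:
        self.trajectories.append((task, trace))

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        # Naive lexical-overlap ranking; swap in embedding similarity for real retrieval.
        words = set(task.lower().split())
        ranked = sorted(
            self.trajectories,
            key=lambda item: len(words & set(item[0].lower().split())),
            reverse=True,
        )
        return [trace for _, trace in ranked[:k]]

def solve_with_memory(task: str, llm: Callable[[str], str], memory: TrajectoryMemory) -> str:
    examples = "\n\n".join(memory.retrieve(task))
    return llm(f"Past successful episodes:\n{examples}\n\nNew task:\n{task}\nProceed step by step:")
```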

3.3 Multi-Agent Experience Bootstrapping: SiriuS

SiriuS – Self-Improving Multi-Agent Systems via Bootstrapped Reasoning (Zhao et al., NeurIPS 2025) extends experience replay to multi-agent dialogues [4]:

  • In multi-agent tasks (collaborative reasoning, negotiation), SiriuS logs successful interaction traces and stores them in a shared experience library.
  • Failed trajectories are post-hoc repaired (e.g., improved by another agent or offline process) and added as positive examples.
  • Agents are then fine-tuned or prompted using this library, yielding 2.86–21.88% accuracy gains across reasoning and negotiation benchmarks.

Pros

  • Multi-agent systems get better over time as they accumulate more high-quality teamwork examples.
  • The “repair failed trajectories and re-use them” idea is a powerful way to mine signal from failures.

Cons

  • Maintaining and curating a global experience library adds heavy systems overhead.
  • If the environment or objectives shift, old trajectories can become misleading.
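
The repair-and-reuse idea can be sketched as follows (hypothetical helpers; SiriuS itself operates over multi-agent dialogue traces and fine-tunes on the resulting library):

```python
from typing import Callable

def log_or_repair(
    trace: str,
    succeeded: bool,
    llm: Callable[[str], str],
    verify: Callable[[str], bool],  # task-specific check on the repaired trace
    library: list[str],
) -> None:
    """Keep successful traces; try to repair failed ones before adding them to the shared library."""
    if succeeded:
        library.append(trace)
        return
    repaired = llm(
        "The following multi-agent interaction failed. Rewrite it so the agents reach a correct, "
        f"well-justified outcome while keeping the same roles and task:\n{trace}"
    )
    if verify(repaired):            # only verified repairs become positive examples
        library.append(repaired)
```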

3.4 Self-Generated Data for Open-Domain Reasoning: STaR, SELF, STaSC

The broader self-generated data story is rounded out by:

  • STaR (Zelikman et al., 2022): generate reasoning traces, verify correctness, then fine-tune on those traces [12].
  • SELF (Lu et al., 2023): iterate self-labeling and retraining with language feedback as the filtering mechanism [13].
  • STaSC (Moskvoretskii et al., 2025): self-correction for small models on open-domain QA [6].

These works collectively show that self-generated curricula can rival human-curated datasets when combined with verification and selection.

Design takeaway:
Self-generated data is the engine of long-term self-improvement. The core design challenge is signal quality:

  • Where do labels / rewards come from? (tests, environment, language feedback)
  • How do we filter out bad self-labels to avoid garbage-in, garbage-out?

NeurIPS 2025 pushes this from toy math to tool-use, decision-making, and multi-agent negotiation, making it much more relevant for real agents.
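
In code, the signal-quality concern mostly reduces to a filtering stage between generation and training; a generic, hedged sketch:

```python
from typing import Callable, Iterable

def filter_self_labeled(
    samples: Iterable[dict],                        # each: {"input": ..., "output": ..., ...}
    verifiers: list[Callable[[dict], bool]],        # e.g. unit tests, answer checkers, consistency votes
    dedup_key: Callable[[dict], str] = lambda s: s["input"],
) -> list[dict]:
    """Keep only self-generated samples that pass every verifier, deduplicating to preserve diversity."""
    seen: set[str] = set()
    kept: list[dict] = []
    for s in samples:
        key = dedup_key(s)
        if key in seen:
            continue
        if all(v(s) for v in verifiers):
            kept.append(s)
            seen.add(key)
    return kept
```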


4. Self-Adapting Models: SEAL and Friends

Reflection and auto-curricula mostly operate around the model. Another line of work aims to edit the model itself based on its own feedback.

4.1 SEAL – Self-Adapting Language Models

SEAL (Self-Adapting Language Models; Zweiger et al., NeurIPS 2025) is a representative example [3]:

  • The model generates self-edit instructions—natural-language descriptions of what needs to change (e.g., “for this pattern of question, prefer answer type X”).
  • These edits are turned into fine-tuning examples and used to update the model’s weights via RL / supervised learning, using downstream performance as reward.
  • On factual QA, SEAL improves accuracy from 33.5% → 47%. On certain few-shot reasoning tasks it goes from 0% → 72.5%.

Pros

  • Edits are interpretable as natural language, making debugging and oversight easier than black-box weight updates.
  • The model effectively becomes a co-author of its own training set.

Cons

  • Still requires a training pipeline in the loop (RL or SFT); not as lightweight as reflection.
  • As with any self-training, there’s a risk of self-reinforcing biases if the evaluation signal is misaligned.

4.2 Self-Rewarding and Gödel-Style Agents

Other works explore more explicit self-adaptation frameworks:

  • Self-Rewarding LMs (Yuan et al., 2024) train models to produce and optimize against their own reward signals [11].
  • Gödel Agent (Yin et al., 2024) sketches a self-referential architecture inspired by Gödel Machines, where the agent can propose modifications to itself and accept them if they pass a specified improvement test [19].

These works are more conceptual but directly target the “agent that rewrites itself under rules” picture.

Design takeaway:
Self-adapting models like SEAL show how you can turn the agent’s own introspections into training data. The hard part is designing trustworthy, well-aligned improvement criteria so the model doesn’t optimize itself into a corner.
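
A heavily hedged sketch of the overall shape of such a loop (not SEAL's actual pipeline): the model drafts a natural-language self-edit, the edit is compiled into training examples, and the resulting update is kept only if a held-out score improves. All callables here are assumptions:

```python
from typing import Callable

def self_edit_step(
    llm: Callable[[str], str],
    compile_edit: Callable[[str], list[dict]],  # turns an edit description into fine-tuning examples
    finetune: Callable[[list[dict]], object],   # returns a candidate updated model
    evaluate: Callable[[object], float],        # held-out downstream score (the reward signal)
    current_model: object,
    failure_report: str,
) -> object:
    """Propose a natural-language self-edit, train a candidate, and keep it only if it scores higher."""
    edit = llm(
        "Given these recent failures, describe in plain language how behavior should change "
        f"(which pattern of input, which preferred answer type):\n{failure_report}"
    )
    candidate = finetune(compile_edit(edit))
    return candidate if evaluate(candidate) > evaluate(current_model) else current_model
```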


5. Self-Improving Code Agents

Code-centric agents are a sweet spot for self-improvement because code is executable and tests are cheap.

5.1 Self-Taught Optimizer (STO)

Self-Taught Optimizer (STO) (Zelikman et al., NeurIPS 2025; anonymous at submission time) demonstrates a striking recursive pattern [17]:

  • Start with a basic “code improver” program that, given some code and a utility function, calls an LLM (e.g., GPT-4) to propose improved variants.
  • Use that improver to improve arbitrary code on downstream tasks.
  • Then, apply the improver to its own code, iteratively rewriting the improver itself.

Empirically, STO discovers classical search patterns like beam search, simulated annealing, and genetic-algorithm-like strategies—without human algorithmic guidance. The resulting self-modified improver solves coding tasks substantially better than the seed version.

Pros

  • Clear instance of recursive self-improvement: the agent’s “algorithm” gets better via its own use of LLMs.
  • Improvements are backed by objective metrics (test coverage, performance).

Cons

  • The base LLM remains fixed; only the wrapper code self-improves—so we’re not yet in full Gödel-machine territory.
  • Requires robust sandboxing; generated code could in principle be harmful or break its environment.
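
The recursive trick itself fits in a few lines. This mirrors the pattern rather than the paper's code: an `improve` function proposes variants and keeps the best under a utility, and because the improver is just source code, it can be fed back into itself:

```python
import inspect
from typing import Callable

def improve(code: str, utility: Callable[[str], float], llm: Callable[[str], str], n: int = 4) -> str:
    """Ask the LLM for candidate rewrites of `code` and keep the variant with the highest utility."""
    candidates = [code] + [
        llm(f"Improve this program so it scores higher on its task. Return only code:\n{code}")
        for _ in range(n)
    ]
    return max(candidates, key=utility)

def improve_the_improver(improver_utility: Callable[[str], float], llm: Callable[[str], str]) -> str:
    # The improver is itself just source code, so it can be handed to itself.
    # `improver_utility` would benchmark how well a candidate improver improves other programs.
    return improve(inspect.getsource(improve), improver_utility, llm)
```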

5.2 SICA – Self-Improving Coding Agent

SICA (Robeyns et al., 2025) takes the next step: the agent directly edits its own agent script [18].

  • Start from a baseline coding agent (e.g., for SWE-Bench-style tasks).
  • Evaluate performance on a benchmark (success rate, runtime, cost).
  • If performance is unsatisfactory, the agent enters a self-edit phase: using an LLM to propose modifications to its own source code (prompts, heuristics, architecture).
  • Apply candidate edits, re-evaluate, and keep the changes that improve metrics.

SICA reports 17–53% performance improvements on coding tasks, sometimes with cost/time reductions, purely through this self-edit loop. Safety checks constrain what can be changed and ensure tests are passed before adoption.

Pros

  • Concrete demonstration of an agent that treats its own implementation as editable state.
  • Edits persist, so improvements accumulate over time.

Cons

  • High engineering and compute overhead: repeated LLM calls plus full re-evaluation loops.
  • Risk of overfitting to the benchmark or inadvertently disabling safety checks; guardrails are crucial.

5.3 Voyager and Code-as-Policies

Voyager (Wang et al., 2023) shows a related pattern in Minecraft [14]:

  • GPT-4 acts as a planner and coder, generating skills as reusable code snippets (e.g., “build a shelter,” “craft tools”).
  • Successful skills are stored in a skill library and reused in future tasks, effectively acting as persistent self-improvement of the agent’s code-level abilities.

This aligns with earlier Code-as-Policies work (Liang et al., 2022) [16]: using code as the medium for policies makes it easy to extend, reuse, and refine agent capabilities.

Design takeaway:
For code agents, the most promising pattern is:

Represent skills and strategies as executable artifacts (code), and give the agent the ability to debug, rewrite, and reorganize those artifacts over time.

This gives you persistent, compositional self-improvement—much more tangible than ephemeral prompt tweaks.
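
A minimal, Voyager-flavored skill-library sketch (hypothetical; real systems add embedding-based retrieval, dependency tracking, and sandboxed execution of generated code):

```python
from typing import Callable

class SkillLibrary:
    """Persists LLM-written skills as named, executable functions for reuse in later tasks."""

    def __init__(self) -> None:
        self.skills: dict[str, tuple[str, Callable]] = {}  # name -> (source code, callable)

    def add_skill(self, name: str, source: str) -> None:
        # Assumes `source` defines a function called `name`.
        ns: dict = {}
        exec(source, ns)                 # WARNING: sandbox generated code in practice
        self.skills[name] = (source, ns[name])

    def describe(self) -> str:
        return "\n\n".join(src for src, _ in self.skills.values())

def write_new_skill(task: str, llm: Callable[[str], str], library: SkillLibrary) -> str:
    # The agent sees its existing skills and writes a new, composable one for the current task.
    return llm(
        f"Existing skills:\n{library.describe()}\n\n"
        f"Write one Python function, reusing existing skills where useful, that accomplishes: {task}"
    )
```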


6. Embodied Self-Improvement

Self-improvement becomes more “agentic” when it happens through interaction with an environment.

6.1 Self-Improving Embodied Foundation Models

Self-Improving Embodied Foundation Models (EFMs) (Ghasemipour et al., NeurIPS 2025) propose a two-stage recipe for robot policies [5]:

  1. Supervised Fine-Tuning (SFT) on demonstrations, with an auxiliary prediction of “steps-to-go” (how many actions remain until success).
  2. Use this steps-to-go signal as an intrinsic reward in a self-practice RL phase: policies are improved by practicing in the environment to minimize steps-to-go.

The steps-to-go predictor also doubles as a success detector, eliminating hand-crafted rewards. EFMs show that this self-improvement phase:

  • Is more sample-efficient than collecting more demos.
  • Yields novel behaviors beyond the original demonstration set.
  • Improves real-robot success rates across multiple tasks.

Pros

  • Combines the breadth of foundation models with the specialization of environment-specific RL.
  • Uses a learned reward grounded in the model’s representation rather than ad-hoc human shaping.

Cons

  • Two-stage training is complex; errors in the learned reward can misguide RL.
  • Real-world practice must be carefully managed for safety and wear-and-tear.

6.2 Voyager and Inner Monologue

In the same spirit:

  • Voyager (Minecraft) uses an automatic curriculum, skill library, and iterative refinement to continually improve [14].
  • Inner Monologue (Huang et al., 2022) layers internal planning over low-level actions: the agent reasons in natural language about what to do, then acts, and can revise plans based on outcomes [15].

These systems show that planning, skill accumulation, and environment feedback can work together as a form of self-improvement, even without gradient updates.

Design takeaway:
Embodied self-improvement often hinges on good proxy rewards (steps-to-go, success detection) and persistent skill representations (code policies, skills) that can be composed and refined.
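
As an example of the proxy-reward idea, here is a hedged sketch of shaping an intrinsic reward from a learned steps-to-go predictor, assumed to map an observation to the predicted number of actions remaining until success:

```python
from typing import Callable

def steps_to_go_reward(
    predict_steps_to_go: Callable[[object], float],  # learned during SFT alongside the policy
    obs_before: object,
    obs_after: object,
    success_threshold: float = 1.0,
) -> tuple[float, bool]:
    """Reward progress as the decrease in predicted steps-to-go; flag success when it nears zero."""
    before = predict_steps_to_go(obs_before)
    after = predict_steps_to_go(obs_after)
    reward = before - after            # positive when the last action made progress
    done = after < success_threshold   # the same predictor doubles as a success detector
    return reward, done
```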


7. Verification, Safety, and Control

Letting agents improve themselves raises obvious questions:

  • How do we know a change is an improvement?
  • How do we prevent degeneration, bias amplification, or goal drift?

Several cross-cutting patterns emerge:

  1. External verifiers and tests.
    • Self-Challenging uses code tests.
    • Code agents rely on unit tests, benchmarks, and performance metrics.
    • Chain-of-Verification–style approaches fact-check intermediate reasoning.
  2. Conservative acceptance criteria.
    • SICA only keeps a self-edit if it improves benchmark performance under predefined metrics [18].
    • Gödel Agent conceptually requires proving (or at least strongly arguing) that a self-modification won’t harm the objective [19].
  3. Diversity and anti-echo-chamber mechanisms.
    • Self-Generated Examples and SiriuS employ curation and replay strategies to avoid overfitting to a narrow slice of experience [2,4].
    • SELF and STaR combine multiple iterations and filtering to keep the self-training signal robust [12,13].
  4. Human or meta-agent oversight at critical boundaries.
    • While not always emphasized in papers, realistic deployments will need off-switches, review gates, and policy filters around self-modification and self-training.

Design takeaway:
The core shift in mindset is:

Don’t just design how the agent changes itself; design rules and tests that govern what changes are allowed.

This is where self-improvement research intersects safety and alignment.
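
In practice this often reduces to a gate that every proposed self-modification must pass before adoption; a generic sketch, with all callables as assumptions:

```python
from typing import Callable

def accept_change(
    proposed_change: object,
    apply_change: Callable[[object], object],       # returns a candidate agent / model / config
    benchmarks: list[Callable[[object], float]],    # higher is better on each tracked metric
    safety_checks: list[Callable[[object], bool]],  # e.g. test suites, policy filters, human review
    current_agent: object,
    min_gain: float = 0.0,
) -> object:
    """Adopt a self-modification only if it passes every safety check and does not regress any metric."""
    candidate = apply_change(proposed_change)
    if not all(check(candidate) for check in safety_checks):
        return current_agent
    improved = all(b(candidate) >= b(current_agent) + min_gain for b in benchmarks)
    return candidate if improved else current_agent
```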


8. Practical Design Patterns (Pros, Cons, Where They Shine)

Here’s a compact view of the main mechanisms through the lens of “how would I use this to build a better agent?”:

| Mechanism | Core Idea | Pros | Cons | Example Works |
| --- | --- | --- | --- | --- |
| Reflection loops | Agent critiques and retries its own outputs without changing weights. | Easy to add; big gains for little work; model-agnostic. | Ephemeral; adds latency; reflections can be wrong. | Reflexion [7], Self-Refine [8] |
| Self-correction training | Train model on mistake–correction traces and self-generated reasoning paths. | Lasting improvement; embeds “how to improve” into weights. | Needs curated traces; risk of domain overfitting. | RISE [9], STaR [12], SELF [13], STaSC [6] |
| Self-generated curricula | Agent creates its own tasks / examples and learns from them. | Label-free, open-ended, often large gains (e.g., 2× on tool use). | Quality control and diversity are hard; compute-heavy. | Self-Challenging [1], Self-Generated Examples [2], SiriuS [4] |
| Self-adapting models | Model generates its own edits / training data and updates itself. | Interpretable edits; persistent performance gains. | Requires training pipeline; danger of self-reinforcing bias. | SEAL [3], Self-Rewarding LMs [11] |
| Self-improving code agents | Agent treats its own code/policies as editable artifacts and rewrites them. | Strong, persistent improvements; clear metrics via tests. | Safety and overfitting risk; complex to engineer and verify. | STO [17], SICA [18], Voyager [14] |
| Embodied self-practice | Agent practices in env. using learned rewards / success detectors. | Can exceed demo data; natural form of “practice makes perfect.” | Real-world safety, sim-to-real gap, training complexity. | Self-Improving EFMs [5], Voyager [14], Inner Monologue [15] |

9. Where This Leaves Us (and What’s Next)

Across NeurIPS 2025 and adjacent work, a coherent picture emerges:

  • Self-improvement is no longer a vague long-term dream; it’s a set of concrete recipes.
    • For reflection-based bots, add Reflexion-style loops.
    • For tool-use agents, adopt Self-Challenging or self-generated in-context examples.
    • For small models, deploy STaSC/SELF-style self-training to narrow the gap with frontier models.
    • For code agents, take inspiration from STO and SICA and let the agent edit its own scaffolding.
    • For robots, adopt two-stage approaches like EFMs with learned intrinsic rewards.
  • The biggest gains often come from turning interaction traces into reusable structure.
    • Trajectories become exemplars, skills, training data, or code.
    • The agent becomes surrounded by an ever-growing “experience layer” that it can query, refine, and distill.
  • The bottleneck is increasingly not model size, but feedback quality and control.
    • How do we ensure that self-generated labels and self-proposed edits actually push the agent in the right direction?
    • How do we keep self-improving agents safe, robust, and aligned as they change themselves?

For builders, a pragmatic roadmap is:

  1. Start with in-loop reflection and self-generated exemplars. Cheap, easy wins.
  2. Add self-training on verified traces (à la STaR/SELF/STaSC) where you have a clear correctness signal.
  3. Introduce persistent representations of skills / policies (code, graphs, skills) that agents can rewrite, as in STO, SICA, Voyager.
  4. Wrap everything in tests and constraints: treat self-improvement as a proposal process gated by rigorous checks.

The sci-fi vision—an agent that safely, recursively improves all aspects of itself—is still far out. But the NeurIPS 2025 cluster on “self improving agents” shows that many of the ingredients already work in specialized domains. The next frontier is compositionality: agents that combine reflection, self-generated curricula, self-adapting weights, code-level self-modification, and environment practice in a single, controlled architecture.


References (Selected)

  1. Y. Zhou et al., “Self-Challenging Language Model Agents”, NeurIPS 2025, OpenReview.
  2. R. Sarukkai et al., “Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks”, NeurIPS 2025, OpenReview.
  3. A. Zweiger et al., “Self-Adapting Language Models (SEAL)”, NeurIPS 2025, OpenReview.
  4. Y. Zhao et al., “SiriuS: Self-Improving Multi-Agent Systems via Bootstrapped Reasoning”, NeurIPS 2025, OpenReview.
  5. S. Ghasemipour et al., “Self-Improving Embodied Foundation Models”, NeurIPS 2025.
  6. E. Moskvoretskii et al., “Self-Taught Self-Correction (STaSC)”, 2025, arXiv.
  7. N. Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning”, 2023, arXiv:2303.11366.
  8. A. Madaan et al., “Self-Refine: Iterative Refinement with Self-Feedback”, 2023.
  9. C. Qu et al., “Recursive Introspection: Teaching LLM Agents How to Self-Improve”, 2024, OpenReview.
  10. X. Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, 2022.
  11. W. Yuan et al., “Self-Rewarding Language Models”, 2024.
  12. E. Zelikman et al., “STaR: Bootstrapping Reasoning with Reasoning”, 2022.
  13. Y. Lu et al., “SELF: Self-Evolution with Language Feedback”, 2023, arXiv:2310.00533.
  14. G. Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models”, 2023, arXiv:2305.16291.
  15. W. Huang et al., “Inner Monologue: Embodied Reasoning through Planning with Language Models”, 2022.
  16. J. Liang et al., “Code as Policies: Language Model Programs for Embodied Control”, 2022.
  17. E. Zelikman et al., “Self-Taught Optimizer (STO)”, NeurIPS 2025 (anonymous at submission time).
  18. J. Robeyns et al., “SICA: A Self-Improving Coding Agent”, 2025, OpenReview.
  19. Q. Yin et al., “Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement”, 2024, arXiv:2410.04444.
  20. Y. Sheng, “From Language Models to Practical Self-Improving Computer Agents”, 2024, arXiv:2404.11964.

