
Prompt Optimization with Edit-Level Memory

A structured optimization loop for agent prompts with behavioral evaluation, mutation memory, and targeted search.

Prompt optimization for tool-using agents is difficult for two related reasons. First, prompt quality is not directly observable; you only see it through behavior on actual tasks. Second, improvements usually come from several prompt changes at once, which makes it hard to tell which edits are worth keeping and which ones just happened to be nearby when the score improved.

In OSAgent, we address this by representing the system prompt as a structured configuration in src/prompt_eval rather than as one large free-form string. The search space is discrete and interpretable: it includes prompt sections for identity, priorities, safety, workflow, validation, communication, and tool guidance, along with section ordering, higher-level variants, and boolean feature toggles.
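To make the shape of this search space concrete, here is a minimal sketch of what such a structured configuration might look like. All field, type, and variant names are hypothetical illustrations, not OSAgent's actual PromptConfig:

```rust
// Hypothetical sketch of a structured prompt configuration; names are
// illustrative, not OSAgent's actual API.

#[derive(Clone, Debug, PartialEq)]
enum Tone { Concise, Detailed }

#[derive(Clone, Debug, PartialEq)]
struct PromptConfig {
    // Index of the chosen text variant for each prompt section.
    identity: usize,
    priorities: usize,
    safety: usize,
    workflow: usize,
    validation: usize,
    communication: usize,
    tool_guidance: usize,
    // Order in which the sections are rendered.
    section_order: Vec<&'static str>,
    // Higher-level variant and a boolean feature toggle.
    tone: Tone,
    require_plan_before_tools: bool,
}

impl PromptConfig {
    /// Materialize a concrete system prompt from the discrete choices.
    fn render(&self) -> String {
        self.section_order
            .iter()
            .map(|section| format!("## {section}\n(variant text)\n\n"))
            .collect()
    }
}

fn main() {
    let cfg = PromptConfig {
        identity: 0, priorities: 1, safety: 0, workflow: 0,
        validation: 0, communication: 0, tool_guidance: 2,
        section_order: vec!["identity", "safety", "workflow"],
        tone: Tone::Concise,
        require_plan_before_tools: true,
    };
    let prompt = cfg.render();
    assert!(prompt.contains("## safety"));
    println!("{prompt}");
}
```

Because every choice is a small enum, index, or boolean, two configurations can be compared field by field, which is what makes edit-level bookkeeping possible later.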

Optimization Procedure

The optimizer in optimizer.rs follows a repeated search-evaluate-update loop. It samples a candidate PromptConfig, materializes a concrete system prompt, evaluates it on behavioral test cases, and then updates edit-level statistics based on the observed result.
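The loop can be sketched as follows. This is a stand-in, not the real optimizer.rs: a candidate is reduced to a u64 seed and the behavioral evaluation is replaced by a toy scoring function, but the four steps match the description above:

```rust
// Hypothetical sketch of the search-evaluate-update loop; the scoring
// function and candidate representation are stand-ins.

fn sample_candidate(rng_state: &mut u64) -> u64 {
    // xorshift64: cheap deterministic pseudo-random candidate sampling.
    *rng_state ^= *rng_state << 13;
    *rng_state ^= *rng_state >> 7;
    *rng_state ^= *rng_state << 17;
    *rng_state
}

fn materialize(cfg: u64) -> String {
    format!("system prompt for config {cfg:#x}")
}

fn evaluate(_prompt: &str, cfg: u64) -> f64 {
    // Stand-in for running the behavioral test cases.
    (cfg % 100) as f64 / 100.0
}

fn optimize(iterations: usize) -> (u64, f64) {
    let mut rng: u64 = 0x9E3779B97F4A7C15;
    let (mut best_cfg, mut best_score) = (0, f64::NEG_INFINITY);
    for _ in 0..iterations {
        let cfg = sample_candidate(&mut rng);   // 1. sample a candidate config
        let prompt = materialize(cfg);          // 2. render a concrete prompt
        let score = evaluate(&prompt, cfg);     // 3. behavioral evaluation
        if score > best_score {                 // 4. update statistics
            best_cfg = cfg;
            best_score = score;
        }
    }
    (best_cfg, best_score)
}

fn main() {
    let (cfg, score) = optimize(50);
    assert!((0.0..=1.0).contains(&score));
    println!("best config {cfg:#x} scored {score:.2}");
}
```

In the real system, step 4 updates per-edit statistics rather than a single best score; that mechanism is described below.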

Behavioral Scoring

The scoring model evaluates each result along five dimensions: correctness, tool accuracy, efficiency, safety, and format. Those scores are combined into a weighted aggregate, but the breakdown is retained so the optimizer can tell the difference between “more correct” and “faster but riskier.”
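A minimal sketch of this aggregation, with purely illustrative weights (OSAgent's actual weights may differ), shows why keeping the breakdown matters: two prompts with similar aggregates can trade safety for efficiency in opposite directions.

```rust
// Sketch of combining per-dimension behavioral scores into a weighted
// aggregate while retaining the breakdown. Weights are illustrative.

#[derive(Debug, Clone, Copy)]
struct Scores {
    correctness: f64,
    tool_accuracy: f64,
    efficiency: f64,
    safety: f64,
    format: f64,
}

fn aggregate(s: Scores) -> f64 {
    // Hypothetical weights; the real weighting may differ.
    0.35 * s.correctness
        + 0.25 * s.tool_accuracy
        + 0.15 * s.efficiency
        + 0.15 * s.safety
        + 0.10 * s.format
}

fn main() {
    let a = Scores { correctness: 1.0, tool_accuracy: 0.8, efficiency: 0.9, safety: 1.0, format: 1.0 };
    let b = Scores { correctness: 0.9, tool_accuracy: 0.8, efficiency: 1.0, safety: 0.6, format: 1.0 };
    // Close aggregates can hide different tradeoffs; the breakdown
    // lets the optimizer tell "more correct" from "faster but riskier".
    println!("a = {:.3}, b = {:.3}", aggregate(a), aggregate(b));
    assert!(a.safety > b.safety);
}
```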

Edit-Level Memory

The central mechanism lives in memory.rs. Instead of storing outcomes only for complete prompt configurations, the system computes a set of edit atoms between successive evaluated configurations and tracks running statistics for each one.
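The idea can be sketched in a few lines, assuming a config is a flat field-to-variant map (a simplification of the structured PromptConfig; all names here are illustrative):

```rust
use std::collections::HashMap;

// Sketch of edit-level memory: diff two configs into edit atoms and
// track running win/trial statistics per atom. Names are illustrative.

type Config = HashMap<&'static str, &'static str>; // field -> chosen variant

/// Edit atoms: fields whose variant changed between two configs.
/// (For simplicity this sketch ignores fields removed from `new`.)
fn edit_atoms(old: &Config, new: &Config) -> Vec<String> {
    let mut atoms = Vec::new();
    for (field, variant) in new {
        if old.get(field) != Some(variant) {
            atoms.push(format!("{field}={variant}"));
        }
    }
    atoms
}

#[derive(Default)]
struct EditStats { trials: u32, wins: u32 }

#[derive(Default)]
struct Memory { stats: HashMap<String, EditStats> }

impl Memory {
    /// Credit (or debit) every atom in the diff with the outcome.
    fn record(&mut self, atoms: &[String], improved: bool) {
        for atom in atoms {
            let s = self.stats.entry(atom.clone()).or_default();
            s.trials += 1;
            if improved { s.wins += 1; }
        }
    }

    fn win_rate(&self, atom: &str) -> f64 {
        self.stats.get(atom)
            .map_or(0.0, |s| s.wins as f64 / s.trials.max(1) as f64)
    }
}

fn main() {
    let old = Config::from([("tone", "detailed"), ("safety", "v1")]);
    let new = Config::from([("tone", "concise"), ("safety", "v1")]);
    let atoms = edit_atoms(&old, &new);
    assert_eq!(atoms, vec!["tone=concise".to_string()]);

    let mut mem = Memory::default();
    mem.record(&atoms, true);
    mem.record(&atoms, false);
    assert_eq!(mem.win_rate("tone=concise"), 0.5);
    println!("ok");
}
```

Tracking statistics per atom rather than per whole configuration is what lets the optimizer assign credit to individual edits even when several were applied at once.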

Targeted Mutation

After repeated non-improving iterations, the optimizer shifts into a targeted mutation mode. It looks up edits that previously helped on the failing tests and applies a small number of those edits, while still keeping one random mutation for exploration.
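A sketch of that selection step, under the assumption that the memory can report a per-atom win rate restricted to the currently failing tests (function and atom names are hypothetical):

```rust
// Sketch of the targeted mutation step: rank edit atoms by how often
// they helped on the failing tests, apply the top few, and always keep
// one random mutation for exploration. Names are illustrative.

fn targeted_mutations(
    helpful: &[(&str, f64)],  // (edit atom, win rate on failing tests)
    budget: usize,            // how many remembered edits to apply
) -> Vec<String> {
    let mut ranked: Vec<(&str, f64)> = helpful.to_vec();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut edits: Vec<String> = ranked
        .iter()
        .take(budget)
        .map(|(atom, _)| atom.to_string())
        .collect();
    // One random mutation is kept so the search does not collapse
    // onto previously seen edits.
    edits.push("random:flip_feature_toggle".to_string());
    edits
}

fn main() {
    let helpful = [("safety=v2", 0.8), ("tone=concise", 0.6), ("workflow=v3", 0.2)];
    let edits = targeted_mutations(&helpful, 2);
    assert_eq!(edits[0], "safety=v2");
    assert_eq!(edits.len(), 3); // two targeted edits plus one random
    println!("{edits:?}");
}
```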

Tradeoffs

The main tradeoff is reduced coverage of the prompt space. By avoiding unconstrained free-form rewriting, the system cannot discover arbitrary prompt formulations. But the reduced search space makes optimization more stable and keeps the search process interpretable.
