Automated Evaluation and Prompt Refinement System Powered by Gemini
To meet these objectives, we developed an automated system that uses Gemini models to evaluate simplification quality and refine prompts. Crafting prompts for nuanced simplification, where readability must improve without losing meaning or detail, is difficult. An automated system addresses this by enabling large-scale trial and error to identify the most effective prompts.
Automated Evaluation
Manual evaluation is too slow for rapid iteration, so our system relies on two automated evaluation components:
- Readability Assessment: Moving beyond basic metrics like Flesch-Kincaid, we used a Gemini prompt to rate text readability on a scale from 1 to 10. This prompt was iteratively refined based on human feedback, allowing a more nuanced evaluation of ease of comprehension. Our tests showed that this LLM-based readability assessment aligns better with human readability judgments than Flesch-Kincaid does.
- Fidelity Assessment: Preserving meaning is essential. Using Gemini 1.5 Pro, we built a process that maps assertions from the original text onto the simplified version. It flags specific error types such as information loss, gain, or distortion, each with a severity weight, yielding a fine-grained measure of how well the simplification preserves the original meaning (completeness and entailment). A sketch of both checks follows this list.
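The sketch below shows how such LLM-based readability and fidelity scoring could be wired together using the google-generativeai Python client. The prompt wording, the score-parsing logic, and the severity weights are illustrative assumptions, not the refined production prompts or weights described above.

```python
# Illustrative sketch of LLM-based readability and fidelity scoring.
# Prompt wording, parsing, and severity weights are assumptions, not the
# production prompts refined with human feedback described in the text.
import re
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
scorer = genai.GenerativeModel("gemini-1.5-pro")

READABILITY_PROMPT = (
    "Rate the readability of the text below on a scale from 1 (very hard to read) "
    "to 10 (very easy to read). Reply with the number only.\n\nText:\n{text}"
)

# Hypothetical severity weights for the fidelity error types.
SEVERITY = {"loss": 1.0, "gain": 0.5, "distortion": 1.5}

def readability_score(text: str) -> int:
    """Ask the model for a 1-10 readability rating and parse the first integer."""
    reply = scorer.generate_content(READABILITY_PROMPT.format(text=text)).text
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0

def fidelity_score(original: str, simplified: str) -> float:
    """Map assertions in the original onto the simplified text and penalise
    information loss, gain, and distortion by their severity weights."""
    prompt = (
        "List each factual assertion in the ORIGINAL text. For each, state whether "
        "the SIMPLIFIED text preserves it, labelling any error as loss, gain, or "
        "distortion. Reply one assertion per line as: <assertion> | ok|loss|gain|distortion\n\n"
        f"ORIGINAL:\n{original}\n\nSIMPLIFIED:\n{simplified}"
    )
    lines = [l for l in scorer.generate_content(prompt).text.splitlines() if "|" in l]
    penalty = sum(SEVERITY.get(l.rsplit("|", 1)[-1].strip().lower(), 0.0) for l in lines)
    return max(0.0, 1.0 - penalty / max(len(lines), 1))  # 1.0 = fully faithful
```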
Iterative Prompt Refinement: LLMs Optimizing Each Other
The quality of the final simplification (produced by Gemini 1.5 Flash) depends heavily on the prompt it is given. We automated prompt optimization with a prompt refinement loop: a separate Gemini 1.5 Pro model uses the auto-evaluation scores for readability and fidelity to assess how well the current simplification prompt is working and to propose an improved prompt for the next iteration.
This establishes a feedback loop in which the system progressively improves its own instructions based on performance metrics, reducing the need for manual prompt crafting and making it possible to discover highly effective simplification strategies. In this project, the loop ran for 824 iterations until performance plateaued.
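A minimal sketch of such a refinement loop is shown below. It reuses the `readability_score` and `fidelity_score` helpers from the earlier sketch; the critic prompt wording, the combined score, and the fixed iteration cap are simplifying assumptions rather than the project's exact procedure.

```python
# Illustrative prompt-refinement loop: a "critic" model rewrites the
# simplification prompt based on auto-evaluation scores. The critic prompt,
# scoring blend, and stopping rule are assumptions, not the exact procedure.
import google.generativeai as genai  # readability_score / fidelity_score from the earlier sketch

simplifier = genai.GenerativeModel("gemini-1.5-flash")
critic = genai.GenerativeModel("gemini-1.5-pro")

def refine_prompt(seed_prompt: str, originals: list[str], max_iters: int = 50) -> str:
    prompt, best, best_score = seed_prompt, seed_prompt, float("-inf")
    for _ in range(max_iters):
        # 1. Simplify every document in the batch with the current prompt.
        simplified = [simplifier.generate_content(prompt + "\n\n" + doc).text
                      for doc in originals]
        # 2. Auto-evaluate readability and fidelity and combine into one score.
        score = sum(readability_score(s) / 10 + fidelity_score(o, s)
                    for o, s in zip(originals, simplified)) / len(originals)
        if score > best_score:
            best, best_score = prompt, score
        # 3. Ask the critic model to propose an improved simplification prompt.
        prompt = critic.generate_content(
            "You are optimising a text-simplification prompt.\n"
            f"Current prompt:\n{prompt}\n\n"
            f"Average combined readability+fidelity score: {score:.3f}\n"
            "Suggest a revised prompt that improves readability without losing "
            "or distorting information. Reply with the new prompt only."
        ).text
    return best
```

In practice the loop would also surface the specific fidelity errors to the critic, not just the aggregate score, so revisions can target the failure modes actually observed.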
This automated mechanism, in which one LLM evaluates another's output and refines its prompts based on performance metrics (readability and fidelity) and the specific errors observed, is a key advance: it replaces tedious manual prompt engineering and lets the system autonomously discover highly effective strategies for nuanced simplification over many iterations.