A training paradigm where models generate multiple completions and are rewarded based on the accuracy of the final answer (e.g., code passing tests or math solutions). This amplifies "Aha moments" and self-correction behaviors in reasoning models.