Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning (2025)

Zhi Zhou  Yuhao Tan  Zenan Li  Yuan Yao  Lan-Zhe Guo  Xiaoxing Ma  Yu-Feng Li

Abstract

Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, single-shot inference often yields unreliable results for complex reasoning tasks, leading researchers to explore multiple reasoning paths through methods such as perplexity and self-consistency. In this paper, we present the first theoretical error decomposition analysis of these techniques, breaking down their error into estimation error and model error. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function, while self-consistency exhibits high estimation error due to a slow error convergence rate. To overcome these limitations, we propose Reasoning-Pruning Perplexity Consistency (Rpc). This approach combines Perplexity Consistency, which seamlessly integrates LLM perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths to effectively prevent the degeneration of estimation error reduction. Theoretical analysis demonstrates that Rpc not only accelerates the convergence rate of estimation error to an exponential level but also holds strong potential for further reducing model error. Extensive empirical evaluations on seven benchmark datasets confirm that Rpc significantly improves reasoning performance, sample efficiency, and confidence reliability.

Large Language Model, Self Consistency, Math Reasoning, Code Generation

1 Introduction

Recently, large language models (LLMs) have shown significant progress in various applications such as problem solving (Lewkowycz et al., 2022; Li et al., 2024a), planning (Valmeekam et al., 2023; Deng et al., 2024), and decision making (Ouyang & Li, 2023; Sblendorio et al., 2024), demonstrating their reasoning capabilities. Since single-shot inference is not always reliable, especially in complex reasoning tasks, one often requires the LLM to produce multiple reasoning paths to improve its reasoning performance.

When multiple reasoning paths for a given problem are available, the reasoning performance is determined by the confidence estimation for each result. To achieve this, perplexity methods (Chen et al., 1998; Murugadoss et al., 2025) apply the LLM's internal probability to estimate the confidence of each reasoning path. Although the internal probability is quite accurate, reasoning-path confidence is highly insufficient for distinguishing between answers, which greatly limits the effectiveness of perplexity methods (Chen et al., 2024). In contrast, self-consistency methods (Wang et al., 2022; Chen et al., 2023b) instead establish an answer confidence using a pre-defined consistency function. However, the answer confidence cannot be directly derived from the internal probabilities of LLMs, necessitating Monte-Carlo estimation, which significantly degrades the convergence rate (Aggarwal et al., 2023; Wan et al., 2024; Wang et al., 2024).

To better understand the limitations of current methods and to guide the development of an effective and efficient LLM reasoning approach, we formulate the LLM reasoning problem and present a theoretical analysis that decomposes the reasoning error into two components: estimation error and model error. Self-consistency methods, which rely on Monte-Carlo estimation, achieve only a linear estimation error reduction rate with respect to the sample size. This linear convergence rate means the method requires a large sampling budget. For instance, implementing self-consistency with 64 samples on the MATH dataset using the GPT-4 API costs approximately $2000 (Li et al., 2024c), rendering it extremely expensive for both researchers and organizations. As for perplexity methods, their estimation error converges exponentially because they use the internal probability of LLMs. The exponential convergence rate ensures that perplexity methods can work well even with a very limited sample budget, yet their final converged result is far from satisfactory due to high model error. A comparison between self-consistency methods and perplexity methods is shown in Figure 1. This complementarity between estimation error and model error raises a chance to further improve LLM reasoning performance: Can we design a method that achieves both a fast estimation error convergence rate and low model error? To the best of our knowledge, efforts in this direction remain limited.


In this paper, we explore effectively and efficiently integrating the internal LLM probability into the self-consistency framework, allowing us to utilize an accurate probability for rapid estimation error reduction while maintaining low model error. We name this confidence estimation approach Perplexity Consistency. Our theoretical study illustrates that Perplexity Consistency provides a tight integration and can indeed achieve this goal. However, its estimation error reduction rate undesirably degenerates to a linear rate when the magnitude of the LLM's internal probability is low. To tackle this issue, we further introduce Reasoning Pruning, which automatically models the probability distribution for each reasoning problem and removes low-probability reasoning paths. Combining Perplexity Consistency and Reasoning Pruning, we propose Reasoning-Pruning Perplexity Consistency (Rpc).

Our theoretical and experimental results confirm the efficiency and effectiveness of Rpc. Specifically, on four mathematical reasoning datasets, Rpc reduces the sampling budget by at least 50% while achieving the same reasoning performance as self-consistency. Conversely, with an equal sampling budget, Rpc outperforms existing methods by 1.29% on average. Additionally, Rpc provides confidence estimates that align better with the ground truth than existing methods.

To summarize, the main contributions of the paper are:

(1) We formulate the LLM reasoning problem and offer a theoretical analysis that decomposes LLM reasoning performance into estimation error and model error. This analysis emphasizes the benefits of self-consistency while revealing its limitations when working with limited sampling budgets.

(2) Building on our theoretical framework, we introduce Rpc, which integrates Perplexity Consistency and Reasoning Pruning. This approach utilizes precise LLM probabilities and eliminates low-probability reasoning paths to enhance reasoning performance.

(3) Our theoretical analysis shows that Perplexity Consistency achieves an exponential error reduction rate in most cases, and Reasoning Pruning effectively compensates for the remaining degeneration issues.

(4) Through extensive experiments conducted on four mathematical reasoning and three code generation tasks, our proposed Rpc delivers promising results, improving both accuracy and confidence reliability.

2 Problem and Analysis

In this section, we start by outlining the problem formulation of LLM reasoning through sampling multiple reasoning paths. Then, we provide a theoretical analysis that decomposes LLM reasoning performance into estimation error and model error. Finally, we present experimental results verifying our theoretical analysis. Our theoretical and empirical analysis motivates our follow-up method design.

2.1 Problem Formulation

Consider a reasoning problem $(x, y)$, where $x$ represents the input query and $y$ represents the ground-truth answer. The LLM generates a reasoning path $\hat{t} = (t_1, \ldots, t_m)$ by sequentially sampling tokens according to the conditional probability distribution $p(t_i \mid x, t_{<i})$, where $m$ denotes the length of the reasoning path. The probability of generating the reasoning path $\hat{t}$ is defined as $p(\hat{t} \mid x)$, a.k.a. the confidence of the reasoning path $\hat{t}$. An answer extraction function $g(\cdot)$ maps the reasoning path to the final answer $\hat{y} = g(\hat{t})$, and the reasoning correctness is evaluated by the indicator function $\mathbb{I}[\hat{y} = y]$. We can extend the probability to the answer $\hat{y}$, i.e., the answer confidence, denoted as $p(\hat{y} \mid x)$.

The confidence essentially represents the probability that the reasoning path $\hat{t}$ or answer $\hat{y}$ is correct, which enables LLMs to select the most reliable solution among multiple candidates. Nevertheless, enumerating all reasoning paths or answers is infeasible; we have to estimate the LLM confidence based on a finite number $n$ of sampled reasoning paths instead. Furthermore, to measure the reasoning performance of LLMs, we use the squared error of the confidence estimate $\hat{p}(\hat{t} \mid x)$ for the reasoning path $\hat{t}$:

\[
\mathcal{E}(p) = \big(\hat{p}(\hat{t} \mid x) - \mathbb{I}[g(\hat{t}) = y]\big)^2.
\]

If we extend the confidence estimation to the answer $\hat{y}$, the squared error can be reformulated as

\[
\mathcal{E}(p) = \big(\hat{p}(\hat{y} \mid x) - \mathbb{I}[\hat{y} = y]\big)^2.
\]

Below, we analyze two confidence estimation methods, i.e., the self-consistency method (Wang et al., 2022) and the perplexity method (Huang et al., 2023). Specifically, the self-consistency method computes the answer confidence using Monte-Carlo estimation based on a consistency function $\mathbb{I}_C$, while the perplexity method directly computes the confidence of reasoning paths using internal LLM probabilities.

2.2 Theoretical Analysis

To maximize the reasoning performance of LLMs, self-consistency methods (denoted as Sc) (Xiong et al., 2024; Abbasi-Yadkori et al., 2024; Becker & Soatto, 2024) often sample $n$ reasoning paths $\tilde{t}_1, \ldots, \tilde{t}_n$ and then estimate the probability of each answer by

\[
\hat{p}^{(\textsc{Sc})}(\hat{y} \mid x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[\tilde{y}_i = \hat{y}], \quad \tilde{y}_i = g(\tilde{t}_i).
\]
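As a concrete illustration, the Sc estimate can be computed from sampled reasoning paths as in the following sketch. This is a minimal illustration rather than the authors' released code; `extract_answer` is a hypothetical stand-in for the answer extraction function $g(\cdot)$.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

def sc_confidence(paths: List[str],
                  extract_answer: Callable[[str], str]) -> Tuple[str, Dict[str, float]]:
    """Self-consistency: Monte-Carlo estimate of each answer's confidence
    as the fraction of sampled paths whose extracted answer matches it."""
    answers = [extract_answer(t) for t in paths]
    counts = Counter(answers)
    n = len(paths)
    confidence = {ans: c / n for ans, c in counts.items()}
    best_answer = max(confidence, key=confidence.get)
    return best_answer, confidence
```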

Then, the reasoning error of the Sc method for a given problem $(x, y)$ can be computed by

\[
\mathcal{E}(\hat{p}^{(\textsc{Sc})}) = \mathbb{E}_{\tilde{t}_i \sim p(t \mid x)}\big(\hat{p}^{(\textsc{Sc})}(\hat{y} \mid x) - \mathbb{I}[\hat{y} = y]\big)^2 = \mathbb{E}_{\tilde{t}_i \sim p(t \mid x)}\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}[\tilde{y}_i = \hat{y}] - \mathbb{I}[\hat{y} = y]\Big)^2.
\]

To illustrate the key factors affecting the reasoning error, we provide an error decomposition in the following proposition.

Proposition 1 (Sc Reasoning Error Decomposition).

For any input $x$ with ground-truth answer $y$, let $\hat{p}^{(\textsc{Sc})}(\hat{y} \mid x)$ denote the estimated probability of $\hat{y}$ by Sc. Then, the reasoning error $\mathcal{E}(\hat{p}^{(\textsc{Sc})})$ can be divided into two components:

\[
\mathcal{E}(\hat{p}^{(\textsc{Sc})}) = \underbrace{\frac{1}{n}\, p(\hat{y} \mid x)\big(1 - p(\hat{y} \mid x)\big)}_{\text{Estimation Error}} + \underbrace{\big(p(\hat{y} \mid x) - \mathbb{I}[\hat{y} = y]\big)^2}_{\text{Model Error}}.
\]
Remark 1.

The detailed proof is provided in Appendix A.1. The estimation error refers to the error caused by finite sampling from the LLM probability, while the model error reflects the LLM's limited reasoning capability. Note that the estimation error of Sc reduces to only the variance because the sampling is unbiased. This proposition demonstrates that, aside from the model error, which is determined by the LLM's inherent reasoning capabilities, the reasoning error is bounded by the estimation error. Moreover, the estimation error decreases only linearly with the sample size, resulting in a large error margin when the sampling budget is insufficient.

To effectively offset the estimation error, we turn to the reasoning error of perplexity methods (denoted as Ppl). In contrast to the Sc method, which estimates the answer probability via Monte-Carlo estimation, Ppl directly utilizes the internal LLM probability $p(\tilde{t}_i \mid x)$ of the sampled reasoning path $\tilde{t}_i$. Therefore, given the unique set of $n$ sampled reasoning paths $\mathcal{R} = \{\tilde{t}_1, \ldots, \tilde{t}_n\}$, the estimated probability of each reasoning path $\hat{t}$ is

\[
\hat{p}^{(\textsc{Ppl})}(\hat{t} \mid x) = \begin{cases} p(\tilde{t}_i \mid x), & \text{if } \hat{t} = \tilde{t}_i \\ 0, & \text{otherwise} \end{cases}
\;=\; \sum_{\tilde{t} \in \mathcal{R}} \mathbb{I}\big[\hat{t} = \tilde{t}\,\big]\, p(\tilde{t} \mid x).
\]
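For comparison, here is a minimal sketch of the Ppl estimate under the same setup: the path confidence $p(\hat{t} \mid x)$ is the product of the token-level conditional probabilities, recovered here from per-token log-probabilities (e.g., a `logprobs`-style output); all names are illustrative assumptions.

```python
import math
from typing import Dict, List

def path_probability(token_logprobs: List[float]) -> float:
    """p(t | x) = prod_i p(t_i | x, t_<i), computed from token log-probabilities."""
    return math.exp(sum(token_logprobs))

def ppl_confidence(sampled_paths: List[str],
                   sampled_logprobs: List[List[float]]) -> Dict[str, float]:
    """Ppl: each unique sampled reasoning path keeps its own internal
    probability; any path that was never sampled implicitly gets 0."""
    return {t: path_probability(lp)
            for t, lp in zip(sampled_paths, sampled_logprobs)}
```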

Similarly, we also use the mean squared error to measure the reasoning performance of Ppl:

\[
\mathcal{E}(\hat{p}^{(\textsc{Ppl})}) = \mathbb{E}_{\tilde{t}_i \sim p(t \mid x)}\big(\hat{p}^{(\textsc{Ppl})}(\hat{t} \mid x) - \mathbb{I}[\hat{y} = y]\big)^2 = \mathbb{E}_{\tilde{t}_i \sim p(t \mid x)}\Big(\sum_{\tilde{t} \in \mathcal{R}} \mathbb{I}\big[\hat{t} = \tilde{t}\,\big]\, p(\tilde{t} \mid x) - \mathbb{I}[g(\hat{t}) = y]\Big)^2.
\]

Now, we can obtain the following proposition.

Proposition 2 (Ppl Reasoning Error Decomposition).

For any given input $x$ with ground-truth answer $y$, let $\hat{p}^{(\textsc{Ppl})}(\hat{t} \mid x)$ denote the estimated probability of $\hat{t}$ by the Ppl method. Then, the reasoning error $\mathcal{E}(\hat{p}^{(\textsc{Ppl})})$ can be divided into two components:

\[
\mathcal{E}(\hat{p}^{(\textsc{Ppl})}) = \underbrace{\big(1 - p(\hat{t} \mid x)\big)^{n}\, p(\hat{t} \mid x)\big(2\mathbb{I}[\hat{y}_i = y] - p(\hat{t} \mid x)\big)}_{\text{Estimation Error}} + \underbrace{\big(p(\hat{t} \mid x) - \mathbb{I}[g(\hat{t}) = y]\big)^2}_{\text{Model Error}}.
\]
Remark 2.

The detailed proof is provided in Appendix A.1. Compared with Sc, the estimation error of Ppl decreases exponentially, which is much faster. However, the model error of Ppl is usually larger than that of Sc in practice. In Appendix A.2, we provide Proposition 3 to demonstrate that Sc achieves a smaller model error than Ppl in the ideal case, owing to the advantages of the consistency function.
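To make the contrast concrete, consider a purely illustrative case in which the relevant probability equals $0.6$ for both methods and the indicator equals $1$. The two estimation error terms then behave as
\[
\underbrace{\tfrac{1}{n}\cdot 0.6\cdot 0.4}_{\textsc{Sc}} = \frac{0.24}{n}
\qquad\text{vs.}\qquad
\underbrace{0.4^{\,n}\cdot 0.6\cdot(2 - 0.6)}_{\textsc{Ppl}} = 0.84\cdot 0.4^{\,n},
\]
so at $n = 8$ the Sc term is $0.03$ while the Ppl term is about $5.5\times 10^{-4}$. These numbers are hypothetical and only illustrate the linear versus exponential decay.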

2.3 Empirical Observations


To confirm our theoretical results, we conduct initial experiments on the GSM8K dataset using the InternLM2-Math-Plus 7B model. We vary the sample size $n$ from 1 to 6 and plot the accuracy curves and the estimation error in Figure 2. Additionally, we include an ablative variant called Naïve-Sc. Naïve-Sc applies Monte-Carlo estimation, consistent with Sc, but its consistency function is degraded to a naïve version that matches reasoning paths rather than answers, consistent with Ppl. In other words, the reasoning error $\mathcal{E}(\hat{p}^{(\text{Na\"ive-}\textsc{Sc})})$ of Naïve-Sc can be decomposed as

\[
\underbrace{\frac{1}{n}\, p(\hat{t} \mid x)\big(1 - p(\hat{t} \mid x)\big)}_{\text{Estimation Error}} + \underbrace{\big(p(\hat{t} \mid x) - \mathbb{I}[g(\hat{t}) = y]\big)^2}_{\text{Model Error}}.
\]

The derived results highlight the following two observations:

(I) Estimation Error. The estimation errors of both Sc and Ppl decrease as the sample size increases. However, the accuracy curves and estimation error show that Ppl has a much faster convergence rate than Sc. Specifically, Ppl reaches a stable result at $n = 3$, while Sc does not converge even at $n = 6$. Naïve-Sc confirms this result, showing a slower convergence rate, since it uses the same Monte-Carlo estimation as Sc.

(II) Model Error. Sc and Ppl ultimately converge to different results because their model errors are intrinsically different. Sc groups reasoning paths that yield the same answer through its consistency function, ensuring higher accuracy. In contrast, Ppl only estimates the probability of individual reasoning paths without considering answer-level consistency. Naïve-Sc also supports this conclusion, converging to the worst results due to its lack of a proper consistency function.

Key Insights. Our theoretical and empirical analyses point to the deficiencies and potential synergies of Sc and Ppl. Although Sc achieves a lower model error due to the advantages of the consistency function, its estimation error is hindered by a slower convergence rate. In contrast, Ppl exhibits a much faster convergence rate in estimation error using internal LLM probabilities, but this comes at the cost of a higher model error. This naturally raises a fundamental research question: Can we design a method that fuses the strengths of Sc and Ppl, achieving both a rapid estimation error convergence rate and a low model error?


3 Methodology

Based on our theoretical and empirical analysis, we propose a new method called Reasoning-Pruning Perplexity Consistency (Rpc). Specifically, we first integrate the internal LLM probability into the self-consistency framework, yielding a confidence estimation function called Perplexity Consistency (Pc). This function utilizes LLM probabilities to reduce estimation error more efficiently while maintaining a low model error. Our further analysis guides the design of a Reasoning Pruning (Rp) module that addresses the limitations of Pc by filtering out reasoning paths with low probabilities. Figure 3 shows the overall framework.

3.1 Perplexity Consistency

To improve the efficiency of estimation error reduction, we propose Pc, which directly leverages the LLM's prediction probability like Ppl, obtaining the benefit of an exponential convergence rate, and also applies the consistency function of Sc to minimize the model error. Formally, for the unique set of $n$ sampled reasoning paths $\mathcal{R} = \{\tilde{t}_1, \ldots, \tilde{t}_n\}$, the estimated probability of the answer is

\[
\hat{p}^{(\textsc{Pc})}(\hat{y} \mid x) = \sum_{\tilde{t} \in \mathcal{R}} \mathbb{I}[g(\tilde{t}) = \hat{y}]\, p(\tilde{t} \mid x),
\]

which calculates the cumulative probability of all unique reasoning paths whose answer is $\hat{y}$. Therefore, the mean squared error of Pc is

\[
\mathcal{E}(\hat{p}^{(\textsc{Pc})}) = \mathbb{E}_{\tilde{t}_i \sim p(t \mid x)}\big[\big(\hat{p}^{(\textsc{Pc})}(\hat{y} \mid x) - \mathbb{I}[\hat{y} = y]\big)^2\big].
\]
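A minimal sketch of the Pc estimate, mirroring the definition above: identical reasoning paths are kept once (the unique set $\mathcal{R}$) and their internal probabilities are accumulated per extracted answer. The helper names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def pc_confidence(sampled_paths: List[str],
                  sampled_probs: List[float],
                  extract_answer: Callable[[str], str]) -> Dict[str, float]:
    """Perplexity Consistency: sum the internal LLM probabilities of the
    unique sampled reasoning paths that lead to the same answer."""
    unique_paths: Dict[str, float] = {}
    for t, p in zip(sampled_paths, sampled_probs):
        unique_paths[t] = p  # deduplicate identical reasoning paths
    confidence: Dict[str, float] = defaultdict(float)
    for t, p in unique_paths.items():
        confidence[extract_answer(t)] += p
    return dict(confidence)
```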

Now, we present the following theorem, which explores the reasoning error decomposition of Pc.

Theorem 3 (Pc Reasoning Error Decomposition).

Assume that $k = |\{\tilde{t} \mid g(\tilde{t}) = \hat{y}\}|$ and define $\alpha := 1 - \frac{1}{k}\, p(\hat{y} \mid x)$. Then, the reasoning error $\mathcal{E}(\hat{p}^{(\textsc{Pc})})$ of Pc can be divided into two components:

\[
\mathcal{E}(\hat{p}^{(\textsc{Pc})}) = \underbrace{\alpha^{n}\, p(\hat{y} \mid x)\big(2\mathbb{I}[\hat{y} = y] - (1 + \alpha^{n})\, p(\hat{y} \mid x)\big)}_{\text{Estimation Error}} + \underbrace{\big(p(\hat{y} \mid x) - \mathbb{I}[\hat{y}_i = y]\big)^2}_{\text{Model Error}}.
\]
Remark 4.

The proof is presented in Appendix A.3. The theorem states that Pc successfully fuses the strengths of Ppl and Sc: it achieves the same level of model error as Sc while ensuring the same convergence rate as Ppl for the estimation error. In particular, the convergence rate can be computed as $\alpha^{n} p(\hat{y} \mid x) = \big(1 - \frac{1}{k}\, p(\hat{y} \mid x)\big)^{n} p(\hat{y} \mid x)$.

The convergence rate is primarily influenced by the magnitude of $p(\hat{y} \mid x)$. In most scenarios it remains exponential, facilitating rapid estimation error reduction. However, when $p(\hat{y} \mid x) \to 0$ and $n\, p(\hat{y} \mid x) \ll 1$, we only have $\alpha^{n} \to \frac{1}{1 + n\, p(\hat{y} \mid x)}$ (Kozma, 2021), so the convergence rate unexpectedly degenerates to a linear one.
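For intuition on this degenerate regime, take the hypothetical values $p(\hat{y} \mid x) = 0.01$, $k = 1$, and $n = 10$, so that $n\, p(\hat{y} \mid x) = 0.1 \ll 1$:
\[
\alpha^{n} = (1 - 0.01)^{10} \approx 0.904 \approx \frac{1}{1 + n\, p(\hat{y} \mid x)} \approx 0.909,
\]
i.e., the leading factor of the estimation error has barely decayed after ten samples, matching the linear behavior described above.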

3.2 Reasoning Pruning

Our analysis of the estimation error convergence rate suggests room for further improving the efficiency of estimation error reduction. Specifically, the rate degenerates when $p(\hat{y} \mid x)$ is too small. Therefore, we propose directly pruning these low-probability answers instead of sampling them, a module we call Reasoning Pruning (Rp).

Reasoning Pruning essentially sets $\hat{p}(\hat{y} \mid x) = 0$ when the cumulative probability of all the corresponding reasoning paths is very low, making the estimation error vanish. Although pruning low-probability answers can boost the efficiency of estimation error reduction, it also induces a degree of model error. In other words, it may discard some correct answers of low probability, thus implicitly degrading the performance of LLM reasoning.

Ideally, if we knew the probability $p(y \mid x)$ that the LLM generates the correct answer, the optimal pruning threshold would be $\tau = p(y \mid x)$. In this case, one obtains not only an exponential convergence rate but also a significant reduction in model error, as all incorrect answers $\hat{y}$ satisfying $p(\hat{y} \mid x) < \tau$ are eliminated. However, to obtain this optimal error reduction, two challenges need to be resolved: (1) we only have an estimate of $p(\hat{y} \mid x)$, since the exact value is unknown; and (2) we cannot know the ground-truth probability $p(y \mid x)$, making the threshold difficult to determine.

For the first challenge, Rp prunes reasoning paths directly instead of answers, as formalized in the following theorem.

Theorem 5 (Effectiveness of Reasoning Path Pruning).

Assume the optimal threshold $\tau = p(y \mid x)$, and let $\hat{k} = |\{\tilde{t}_i \mid g(\tilde{t}_i) = \hat{y},\, i = 1, \ldots, n\}|$, i.e., the number of samples whose answer is $\hat{y}$. Then, Rp achieves the optimal error reduction with probability at least

\[
1 - \exp\Big(-2\hat{k}\, k^{2}\Big(1 - \frac{\tau}{1 - \alpha}\Big)^{2}\Big).
\]
Remark 6.

The proof is included in Appendix A.4. The theorem guarantees that Rp achieves the optimal error reduction for each given problem $(x, y)$ with high probability. Note that the optimal error reduction not only boosts the efficiency of estimation error reduction but also effectively reduces the model error, thus improving the final reasoning capability of LLMs.

The remaining problem is determining the optimal threshold for reasoning path removal. To achieve this goal, we develop an automated strategy. Inspired by open-set recognition (Bendale & Boult, 2016), we model the probability distribution of $\Omega_1$ and $\Omega_2$ as a mixture of two Weibull distributions, representing the high- and low-probability regions.

Concretely, we define the PDF of the mixture distribution as:

\[
f(x) = w_1 f_{\text{W}}(x; k_1, \lambda_1) + w_2 f_{\text{W}}(x; k_2, \lambda_2),
\]

where the Weibull PDF (Weibull, 1951) is defined as $f_{\text{W}}(x; k, \lambda) = \frac{k}{\lambda}\big(\frac{x}{\lambda}\big)^{k-1}\exp\big(-(\frac{x}{\lambda})^{k}\big)$. We use maximum likelihood estimation to fit the parameters $(k_1, \lambda_1)$, $(k_2, \lambda_2)$, $w_1$, and $w_2$ on the probability distribution of all sampled reasoning paths for each reasoning problem. We assume that $\text{Weibull}(k_1, \lambda_1)$ is the high-probability distribution and $\text{Weibull}(k_2, \lambda_2)$ is the low-probability distribution. Then, the probability of reasoning path $\hat{t}$ belonging to the high-probability distribution is

\[
P_{\text{High}}(x) = \frac{w_1 f_{\text{W}}(x; k_1, \lambda_1)}{w_1 f_{\text{W}}(x; k_1, \lambda_1) + w_2 f_{\text{W}}(x; k_2, \lambda_2)},
\]

where $x$ is the value of the LLM internal probability.

Then, we remove sampled reasoning paths $\tilde{t}$ satisfying $P_{\text{High}}(\hat{p}(\tilde{t} \mid x)) < 0.5$, which are more likely to belong to the low-probability distribution. Moreover, to ensure the algorithm's stability when $n$ is limited, we employ the Truncated Mean method (Marazzi & Ruffieux, 1999), retaining outputs with probability greater than the overall mean. This prevents the removal of too many reasoning paths due to potentially inaccurate estimation of the mixture distribution.
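The sketch below illustrates one possible realization of Reasoning Pruning under these definitions: a two-component Weibull mixture is fit by maximum likelihood with `scipy`, $P_{\text{High}}$ is evaluated for each path probability, and a path is kept if it is assigned to the high-probability component or if its probability exceeds the overall mean (our reading of the truncated-mean safeguard). The initialization, bounds, and the labeling of the high-probability component by its larger scale are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def fit_weibull_mixture(probs: np.ndarray):
    """MLE fit of w1*Weibull(k1, lam1) + w2*Weibull(k2, lam2) to path probabilities."""
    def nll(params):
        k1, lam1, k2, lam2, logit_w = params
        w1 = 1.0 / (1.0 + np.exp(-logit_w))
        pdf = (w1 * weibull_min.pdf(probs, k1, scale=lam1)
               + (1.0 - w1) * weibull_min.pdf(probs, k2, scale=lam2))
        return -np.sum(np.log(pdf + 1e-12))

    m = float(np.mean(probs))
    x0 = [1.0, min(max(2.0 * m, 1e-3), 1.0), 1.0, max(0.5 * m, 1e-4), 0.0]  # rough start
    bounds = [(0.05, 50.0), (1e-6, 1.0)] * 2 + [(-10.0, 10.0)]
    k1, lam1, k2, lam2, logit_w = minimize(nll, x0, bounds=bounds, method="L-BFGS-B").x
    w1 = 1.0 / (1.0 + np.exp(-logit_w))
    if lam1 >= lam2:  # label the larger-scale component as the high-probability one
        return (w1, k1, lam1), (1.0 - w1, k2, lam2)
    return (1.0 - w1, k2, lam2), (w1, k1, lam1)

def reasoning_pruning(paths, probs):
    """Keep paths likely drawn from the high-probability component, plus a
    truncated-mean safeguard that retains paths above the overall mean."""
    probs = np.asarray(probs, dtype=float)
    (w_hi, k_hi, lam_hi), (w_lo, k_lo, lam_lo) = fit_weibull_mixture(probs)
    hi = w_hi * weibull_min.pdf(probs, k_hi, scale=lam_hi)
    lo = w_lo * weibull_min.pdf(probs, k_lo, scale=lam_lo)
    p_high = hi / (hi + lo + 1e-12)
    keep = (p_high >= 0.5) | (probs > probs.mean())
    return [t for t, flag in zip(paths, keep) if flag], probs[keep]
```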

Overall, we apply Reasoning Pruning to all sampled reasoning paths $\tilde{t}_1, \ldots, \tilde{t}_n$ to remove low-probability reasoning paths and then compute the confidence based on Perplexity Consistency, forming our proposed Reasoning-Pruning Perplexity Consistency (Rpc). The pseudo-code is shown in Algorithm 1 in Appendix B.
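Putting the two modules together, Rpc amounts to pruning followed by Perplexity Consistency. The snippet below simply composes the illustrative helpers sketched earlier (`reasoning_pruning`, `pc_confidence`); it is a schematic view, not the released Algorithm 1.

```python
def rpc_confidence(sampled_paths, sampled_probs, extract_answer):
    """Rpc: drop low-probability reasoning paths, then aggregate the internal
    probabilities of the surviving paths per answer (Perplexity Consistency)."""
    kept_paths, kept_probs = reasoning_pruning(sampled_paths, sampled_probs)
    confidence = pc_confidence(kept_paths, list(kept_probs), extract_answer)
    best_answer = max(confidence, key=confidence.get)
    return best_answer, confidence
```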

4 Experiments

In this section, we first conduct experiments to answer the following research questions:

  1. RQ1: Efficiency. How does Rpc reduce the number of samples required to achieve comparable performance through faster convergence?

  2. RQ2: Efficacy. How does Rpc improve reasoning performance compared to existing methods?

  3. RQ3: Reliability. How does Rpc enhance the reliability of confidence estimation compared to existing methods?

Additional discussions are devoted to further demonstrating the effectiveness of Rpc. Due to space limitations, supplementary experimental results are included in Appendix D.

Table 1. Accuracy and the number of samplings required to match the best Sc performance (InternLM2-Math-Plus 7B).

| Method | MATH (Acc. / #Samplings) | MathOdyssey (Acc. / #Samplings) | OlympiadBench (Acc. / #Samplings) | AIME (Acc. / #Samplings) |
|---|---|---|---|---|
| Best of Sc | 50.57 / 64 | 28.32 / 112 | 11.07 / 128 | 9.40 / 128 |
| Pc | 50.63 / 32 | 28.51 / 112 | 11.07 / 128 | 9.00 / 64 |
| Δ | +0.06 / -50.0% | +0.19 / -0.0% | 0.00 / -0.0% | 0.00 / -50.0% |
| Rpc | 51.16 / 32 | 29.31 / 32 | 11.07 / 64 | 9.50 / 48 |
| Δ | +0.59 / -50.0% | +0.99 / -71.4% | 0.00 / -50.0% | +0.10 / -62.5% |


4.1 Experimental Setting

In this section, we briefly introduce the comparison methods, datasets, and implementation details. Full experimental settings can be found in Appendix C.

Comparison Methods. We compare three types of LLM confidence: perplexity confidence (Chen et al., 1998) (Ppl), self-consistency confidence (Wang et al., 2022) (Sc), and verbalized confidence (Tian et al., 2023) (Verb). The verbalized confidence is computed from the probability that the LLM outputs “True” versus “False” when asked an “Is-True” question. For code generation tasks, we extract verbalized confidence scores from the model's numerical likelihood expressions obtained by prompting the LLM.

Datasets. We adopt four popular benchmarks for math reasoning: MATH (Hendrycks et al., 2021b), MathOdyssey (Fang et al., 2024), OlympiadBench (He et al., 2024), and AIME (Zamil & Rabby, 2024). For code generation tasks, we evaluate each method on three benchmarks: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and introductory-level problems of APPS (Hendrycks et al., 2021a).

Implementation Details. For math reasoning tasks, we evaluate the InternLM2-Math-Plus models with 1.8B and 7B parameters (Ying et al., 2024), as well as the DeepSeekMath-RL 7B model (Shao et al., 2024); the consistency function $\mathbb{I}_C$ is answer comparison. For code generation tasks, we evaluate the DeepSeek-Coder 33B model; the consistency function $\mathbb{I}_C$ is constructed based on semantic equivalence (Malík & Vojnar, 2021) by clustering code according to the given test cases. We set the sample size to $n = 128$ for the MathOdyssey, OlympiadBench, and AIME datasets and $n = 64$ for the MATH dataset by default. Each experiment is repeated 10 times with different random seeds, and the average performance is reported. All experiments were conducted on Linux servers with A800 and H800 GPUs.
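To make the code-generation consistency function concrete, one simple realization is to treat two candidate programs as consistent when they produce identical outputs on the given test inputs; the sketch below uses this behavioural fingerprint as the cluster key. It is a simplified stand-in for the semantic-equivalence clustering cited above, and `run_program` (sandboxed execution with timeouts) is assumed to be provided.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def behaviour_signature(run_program: Callable[[str, str], str],
                        program: str, test_inputs: List[str]) -> Tuple[str, ...]:
    """Fingerprint a candidate program by its outputs on the test inputs."""
    return tuple(run_program(program, inp) for inp in test_inputs)

def cluster_by_behaviour(programs: List[str], probs: List[float],
                         run_program: Callable[[str, str], str],
                         test_inputs: List[str]) -> Dict[Tuple[str, ...], float]:
    """Group candidates whose outputs agree on all test inputs and accumulate
    their confidences, mirroring answer-level consistency for code."""
    clusters: Dict[Tuple[str, ...], float] = defaultdict(float)
    for prog, p in zip(programs, probs):
        clusters[behaviour_signature(run_program, prog, test_inputs)] += p
    return dict(clusters)
```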

Table 2. Accuracy (↑) and ECE (↓) on four mathematical reasoning datasets (mean ± std).

| Method | MATH Acc.(↑) | MATH ECE(↓) | MathOdyssey Acc.(↑) | MathOdyssey ECE(↓) | OlympiadBench Acc.(↑) | OlympiadBench ECE(↓) | AIME Acc.(↑) | AIME ECE(↓) | Avg. Acc.(↑) | Avg. ECE(↓) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ppl | 46.99 ± 0.20 | 48.99 ± 0.19 | 27.35 ± 1.22 | 67.70 ± 1.22 | 7.27 ± 0.36 | 86.90 ± 0.35 | 5.96 ± 0.48 | 88.98 ± 0.49 | 21.90 | 73.14 |
| Verb | 26.14 ± 0.25 | 47.46 ± 0.07 | 10.06 ± 0.61 | 69.92 ± 0.88 | 3.68 ± 0.16 | 84.68 ± 0.25 | 3.17 ± 0.17 | 86.29 ± 0.20 | 10.76 | 72.09 |
| Sc | 50.57 ± 0.17 | 6.71 ± 0.18 | 28.25 ± 0.60 | 12.23 ± 0.54 | 11.07 ± 0.15 | 20.20 ± 0.16 | 9.40 ± 0.21 | 14.35 ± 0.23 | 24.82 | 13.37 |
| Rpc | 51.95 ± 0.15 | 6.41 ± 0.18 | 31.62 ± 0.75 | 9.87 ± 0.73 | 11.14 ± 0.15 | 18.86 ± 0.18 | 9.74 ± 0.23 | 14.32 ± 0.21 | 26.11 | 12.37 |


Table 3. Accuracy on four mathematical reasoning datasets with InternLM2-Math-Plus 1.8B (top) and DeepSeekMath-RL 7B (bottom).

InternLM2-Math-Plus 1.8B:

| Method | MATH | MathOdyssey | OlympiadBench | AIME |
|---|---|---|---|---|
| Ppl | 33.24 ± 0.24 | 16.56 ± 0.88 | 3.08 ± 0.20 | 1.66 ± 0.15 |
| Verb | 7.21 ± 0.17 | 2.81 ± 0.26 | 0.77 ± 0.06 | 0.26 ± 0.05 |
| Sc | 36.48 ± 0.15 | 14.52 ± 0.46 | 5.99 ± 0.17 | 2.66 ± 0.20 |
| Rpc | 37.88 ± 0.16 | 16.35 ± 0.61 | 6.52 ± 0.24 | 3.26 ± 0.24 |

DeepSeekMath-RL 7B:

| Method | MATH | MathOdyssey | OlympiadBench | AIME |
|---|---|---|---|---|
| Ppl | 42.51 ± 0.23 | 22.34 ± 1.00 | 5.90 ± 0.31 | 3.37 ± 0.46 |
| Verb | 14.29 ± 0.23 | 2.55 ± 0.24 | 2.36 ± 0.15 | 1.91 ± 0.12 |
| Sc | 53.33 ± 0.09 | 36.68 ± 0.65 | 11.29 ± 0.17 | 9.42 ± 0.23 |
| Rpc | 53.37 ± 0.11 | 37.25 ± 0.69 | 11.30 ± 0.11 | 9.52 ± 0.31 |

4.2 Empirical Results

RQ1: Efficiency. How does Rpc reduce the number of samples required to achieve comparable performance through faster convergence?

We evaluate our proposed Rpc against the standard self-consistency method on four mathematical benchmark datasets with the InternLM2-Math-Plus 7B model. For Sc, we set the number of reasoning paths to 64 for the MATH dataset and to 128 for the other datasets. We then record the best performance and minimum sampling requirements for Sc. For both Rpc and our Perplexity Consistency module (denoted as Pc), we report the minimum number of samples needed to match or exceed the performance of Sc in Table 1.

The results for Pc indicate improved convergence rates compared to Sc in several cases, while maintaining similar rates in others. These findings support both the rapid convergence and the degeneration issue of Pc described in Theorem 3. Rpc shows significant efficiency improvements, requiring fewer samples to achieve comparable performance relative to Sc. Moreover, the degeneration issues observed in Pc are effectively addressed by Rpc, validating both the effectiveness of the Reasoning Pruning module and our Theorem 5.

RQ2: Efficacy. How does Rpc improve reasoning performance compared to existing methods?

We evaluate the performance of Pc and Rpc across various sample budgets in Figure 4. The results demonstrate that Rpc achieves better performance than both Ppl (which relies on internal LLM probabilities) and Sc (which employs Monte-Carlo sampling). The detailed accuracy results, including means and standard deviations in Table 2, support these findings.

We also analyze the performance of Pc separately. The results indicate that Pc has a faster convergence rate than Sc, which aligns with Theorem 3. The significant performance gains from Pc to Rpc, as shown in 9(a) and 9(b), validate the effectiveness of the Reasoning Pruning module. This suggests that Reasoning Pruning helps reduce model error, when the LLM is well aligned, by eliminating incorrect reasoning paths that carry low LLM probability scores.


RQ3: Reliability. How does Rpc enhance the reliability of confidence estimation compared to existing methods?

To evaluate the reliability of confidence estimation, we analyze the ECE results of Rpc and the comparison methods in Table 2. ECE measures the difference between predicted probabilities and empirical accuracy, since directly computing the estimation error with ground-truth probabilities is impractical. The results demonstrate that our Rpc approach achieves lower ECE values and higher accuracy than Ppl and Sc, indicating more reliable confidence estimation. We visualize this improvement through reliability diagrams comparing Sc and Rpc on MathOdyssey in Figure 5, which clearly show the reduced calibration error of Rpc.
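For reference, ECE can be computed with standard equal-width binning over confidence scores; the sketch below uses 15 bins, an arbitrary illustrative choice rather than the paper's setting.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 15) -> float:
    """ECE: weighted average gap between mean confidence and empirical
    accuracy over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```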

4.3 Further Discussion

Results on Code Generation Tasks. To investigate whether our proposed approaches generalize to other reasoning tasks, such as code generation, we evaluate Rpc and the comparison methods on three code generation benchmarks, as illustrated in Figure 6. The results show that Rpc achieves the highest accuracy across all datasets, demonstrating its effectiveness in reasoning tasks beyond mathematics.

Results Across Model Scales and Architectures. To evaluate the generalization ability of our approaches across different model scales and architectures, we conduct additional experiments using the InternLM2-Math-Plus 1.8B and DeepSeekMath-RL 7B models. The results in Table 3 show that Rpc consistently outperforms existing methods, consistent with the results in Table 2.

5 Related Work

This paper relates to two research topics: LLM reasoning boosting and LLM confidence estimation.

LLM Reasoning Boosting.

Recent research has developed various methods to improve LLM reasoning capabilities. CoT (Kojima et al., 2022) proposes the “Let’s think step by step” prompt to guide LLMs in generating structured solutions. For complex problems, Least-to-Most (Zhou et al., 2023) introduces a decomposition strategy that breaks down challenges into manageable sub-problems. Few-shot methods (Wei et al., 2022; Fu et al., 2023; Zhang et al., 2023) leverage carefully selected examples to improve reasoning performance. To enable more comprehensive reasoning, search-based methods (Guan et al., 2025) integrate Monte Carlo Tree Search (MCTS). Recent advancements have further enhanced MCTS by incorporating reward models (Zhang et al., 2024; Park et al., 2024). To directly optimize reasoning abilities, researchers have explored fine-tuning approaches (Yu et al., 2024; Li et al., 2024b, d; Ying et al., 2024) using specialized datasets and reinforcement learning techniques (Shao et al., 2024; Luo et al., 2025).

While these methods focus on generating accurate reasoning paths, our Rpc can build upon them by utilizing multiple sampling strategies, enabling better reasoning performance.

LLM Confidence Estimation. Confidence estimation for LLMs can be categorized into three types: (1) perplexity confidence, (2) verbalized confidence, and (3) self-consistency confidence. Perplexity confidence (Huang et al., 2023; Duan et al., 2024) utilizes the geometric mean of LLM prediction probabilities (i.e., perplexity (Chen et al., 1998; Blei et al., 2003)) to evaluate model adherence (Murugadoss et al., 2025) and prompt quality (YAO et al., 2024). Verbalized confidence (Kadavath et al., 2022; Xiong et al., 2024; Tian et al., 2023) directly asks LLMs to express their confidence through various approaches, such as multi-agent deliberation (Yang et al., 2024), multi-step evaluation (Xiong et al., 2024), top-k ranking (Tian et al., 2023), few-shot prompting (Liu et al., 2024), and reflection (Dhuliawala et al., 2024; Zhao et al., 2024). Self-consistency confidence (Wang et al., 2022; Chen et al., 2023b; Cheng et al., 2024) measures the agreement among multiple generated answers to improve reasoning performance, with recent work (Xiong et al., 2024; Abbasi-Yadkori et al., 2024; Becker & Soatto, 2024) further developing this approach as a confidence metric. Recent studies recognize its computational cost and propose early stopping (Li et al., 2024c) and dynamic sampling methods (Wang et al., 2024; Wan et al., 2024; Aggarwal et al., 2023) to improve efficiency.

Our proposed Rpc integrates LLM probability with the self-consistency framework, allowing perplexity and verbalized confidence to be leveraged as well. Rpc achieves enhanced confidence estimation with a fixed set of reasoning paths, complementing existing self-consistency methods.

6 Conclusion

In this paper, we address a foundational challenge in LLM reasoning: determining the most reliable answer among multiple reasoning paths by measuring LLM confidence. We present a theoretical framework that decomposes the reasoning error into estimation error and model error, revealing that perplexity methods suffer from substantial model error due to the lack of a proper consistency function, while self-consistency suffers from high estimation error because of a slow error convergence rate. To tackle these limitations, we introduce Reasoning-Pruning Perplexity Consistency (Rpc), a confidence estimation method with two key components: Perplexity Consistency utilizes internal LLM probabilities to achieve faster estimation error convergence in most cases, and Reasoning Pruning prunes low-probability reasoning paths to prevent the remaining degeneration cases. Our theoretical analysis and extensive experiments demonstrate that Rpc achieves superior error convergence rates and reasoning performance compared to existing methods.

Limitations and Future Work. One limitation of this work is that we have only taken an initial step toward improving confidence estimation for self-consistency. The two components of Rpc can be further enhanced: Perplexity Consistency could incorporate additional probability metrics from LLM outputs, while Reasoning Pruning could be extended with more sophisticated pruning strategies, such as analyzing the probability distribution of each candidate answer. We believe our initial approach and theoretical analysis can guide future research in this promising direction.

Impact Statement

This work advances the efficiency and effectiveness of LLM reasoning with multiple reasoning paths. Our method can benefit various applications requiring reliable artificial intelligence reasoning, such as mathematical problem-solving and code generation. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Abbasi-Yadkori etal. (2024)Abbasi-Yadkori, Y., Kuzborskij, I., György, A., and Szepesvári, C.To believe or not to believe your LLM.CoRR, 2024.
  • Aggarwal etal. (2023)Aggarwal, A. M.P., Yang, Y., and Mausam.Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with llms.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, 2023.
  • Austin etal. (2021)Austin, J., Odena, A., Nye, M.I., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C.J., Terry, M., Le, Q.V., and Sutton, C.Program synthesis with large language models.CoRR, abs/2108.07732, 2021.
  • Becker & Soatto (2024)Becker, E. and Soatto, S.Cycles of thought: Measuring LLM confidence through stable explanations.CoRR, abs/2406.03441, 2024.
  • Bendale & Boult (2016)Bendale, A. and Boult, T.E.Towards open set deep networks.In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572, 2016.
  • Blei etal. (2003)Blei, D.M., Ng, A.Y., and Jordan, M.I.Latent dirichlet allocation.Journal of Machine Learning Research, 3(1):993–1022, 2003.
  • Chen etal. (2023a)Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J., and Chen, W.Codet: Code generation with generated tests.In Proceedings of the 11th International Conference on Learning Representations, 2023a.
  • Chen etal. (2021)Chen, M., Tworek, J., Jun, H., Yuan, Q., deOliveiraPinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W.H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A.N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W.Evaluating large language models trained on code.CoRR, abs/2107.03374, 2021.
  • Chen etal. (1998)Chen, S.F., Beeferman, D., and Rosenfeld, R.Evaluation metrics for language models.1998.
  • Chen etal. (2023b)Chen, X., Aksitov, R., Alon, U., Ren, J., Xiao, K., Yin, P., Prakash, S., Sutton, C., Wang, X., and Zhou, D.Universal self-consistency for large language model generation.arXiv preprint arXiv:2311.17311, 2023b.
  • Chen etal. (2024)Chen, Y., Jhamtani, H., Sharma, S., Fan, C., and Wang, C.Steering large language models between code execution and textual reasoning.CoRR, abs/2410.03524, 2024.
  • Cheng etal. (2024)Cheng, F., Zouhar, V., Arora, S., Sachan, M., Strobelt, H., and El-Assady, M.Relic: Investigating large language model responses using self-consistency.In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–18, 2024.
  • Deng etal. (2024)Deng, Y., Zhang, W., Lam, W., Ng, S., and Chua, T.Plug-and-play policy planner for large language model powered dialogue agents.In Proceedings of the 12th International Conference on Learning Representations, 2024.
  • Dhuliawala etal. (2024)Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J.Chain-of-verification reduces hallucination in large language models.In Findings of the Association for Computational Linguistics, pp. 3563–3578, 2024.
  • Duan etal. (2024)Duan, J., Cheng, H., Wang, S., Zavalny, A., Wang, C., Xu, R., Kailkhura, B., and Xu, K.Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 5050–5063, 2024.
  • Fang etal. (2024)Fang, M., Wan, X., Lu, F., Xing, F., and Zou, K.Mathodyssey: Benchmarking mathematical problem-solving skills in large language models using odyssey math data.CoRR, abs/2406.18321, 2024.
  • Fu etal. (2023)Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T.Complexity-based prompting for multi-step reasoning.In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Guan etal. (2025)Guan, X., Zhang, L.L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M.rstar-math: Small llms can master math reasoning with self-evolved deep thinking, 2025.
  • He etal. (2024)He, C., Luo, R., Bai, Y., Hu, S., Thai, Z.L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M.Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 3828–3850, 2024.
  • Hendrycks etal. (2021a)Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J.Measuring coding challenge competence with APPS.In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021a.
  • Hendrycks etal. (2021b)Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J.Measuring mathematical problem solving with the math dataset.In Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2021b.
  • Huang etal. (2023)Huang, Y., Song, J., Wang, Z., Zhao, S., Chen, H., Juefei-Xu, F., and Ma, L.Look before you leap: An exploratory study of uncertainty measurement for large language models.arXiv preprint arXiv:2307.10236, 2023.
  • Kadavath etal. (2022)Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., Showk, S.E., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., and Kaplan, J.Language models (mostly) know what they know.CoRR, abs/2207.05221, 2022.
  • Kojima etal. (2022)Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., and Iwasawa, Y.Large language models are zero-shot reasoners.In Advances in Neural Information Processing Systems, pp. 22199–22213, 2022.
  • Kozma (2021)Kozma, L.Useful inequalities, 2021.
  • Lewkowycz etal. (2022)Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V.V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V.Solving quantitative reasoning problems with language models.In Advances in Neural Information Processing Systems, pp. 3843–3857, 2022.
  • Li etal. (2024a)Li, C., Liang, J., Zeng, A., Chen, X., Hausman, K., Sadigh, D., Levine, S., Fei-Fei, L., Xia, F., and Ichter, B.Chain of code: Reasoning with a language model-augmented code emulator.In Proceedings of the 41st International Conference on Machine Learning, 2024a.
  • Li etal. (2024b)Li, C., Yuan, Z., Yuan, H., Dong, G., Lu, K., Wu, J., Tan, C., Wang, X., and Zhou, C.MuggleMath: Assessing the impact of query and response augmentation on math reasoning.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 10230–10258, 2024b.
  • Li etal. (2024c)Li, Y., Yuan, P., Feng, S., Pan, B., Wang, X., Sun, B., Wang, H., and Li, K.Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning.In Proceedings of the 12th International Conference on Learning Representations, 2024c.
  • Li etal. (2024d)Li, Z., Zhou, Z., Yao, Y., Zhang, X., Li, Y.-F., Cao, C., Yang, F., and Ma, X.Neuro-symbolic data generation for math reasoning.In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 2024d.
  • Liu etal. (2024)Liu, Y., Yang, T., Huang, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., and Zhang, Q.Calibrating llm-based evaluator.In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, pp. 2638–2656, 2024.
  • Luo etal. (2025)Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., and Zhang, D.Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.In Proceedings of the 13th International Conference on Learning Representations, 2025.
  • Malík & Vojnar (2021)Malík, V. and Vojnar, T.Automatically checking semantic equivalence between versions of large-scale C projects.In Proceedings of the 14th IEEE Conference on Software Testing, Verification and Validation, pp. 329–339, 2021.
  • Marazzi & Ruffieux (1999)Marazzi, A. and Ruffieux, C.The truncated mean of an asymmetric distribution.Computational Statistics & Data Analysis, 32(1):79–100, 1999.
  • Murugadoss etal. (2025)Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., and Sarkar, A.Evaluating the evaluator: Measuring LLMs’ adherence to task evaluation instructions.In Proceedings of the 39th AAAI Conference on Artificial Intelligence, 2025.
  • Ouyang & Li (2023)Ouyang, S. and Li, L.Autoplan: Automatic planning of interactive decision-making tasks with large language models.In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 3114–3128, 2023.
  • Park etal. (2024)Park, S., Liu, X., Gong, Y., and Choi, E.Ensembling large language models with process reward-guided tree search for better complex reasoning.CoRR, abs/2412.15797, 2024.
  • Sblendorio etal. (2024)Sblendorio, E., Dentamaro, V., Cascio, A.L., Germini, F., Piredda, M., and Cicolini, G.Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models’ feasibility in clinical decision-making.International Journal of Medical Informatics, 188:105501, 2024.
  • Shao etal. (2024)Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., and Guo, D.Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300, 2024.
  • Tian etal. (2023)Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C.D.Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5433–5442. Association for Computational Linguistics, 2023.
  • Valmeekam etal. (2023)Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S.On the planning abilities of large language models - A critical investigation.In Advances in Neural Information Processing Systems, pp. 75993–76005, 2023.
  • Wan etal. (2024)Wan, G., Wu, Y., Chen, J., and Li, S.Dynamic self-consistency: Leveraging reasoning paths for efficient LLM sampling.CoRR, abs/2408.17017, 2024.
  • Wang etal. (2022)Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., Chowdhery, A., and Zhou, D.Self-consistency improves chain of thought reasoning in language models.In Proceedings of the 11th International Conference on Learning Representations, 2022.
  • Wang etal. (2024)Wang, X., Feng, S., Li, Y., Yuan, P., Zhang, Y., Pan, B., Wang, H., Hu, Y., and Li, K.Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning.CoRR, abs/2408.13457, 2024.
  • Wei etal. (2022)Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., and Zhou, D.Chain-of-thought prompting elicits reasoning in large language models.In Advances in Neural Information Processing Systems, pp. 24824–24837, 2022.
  • Weibull (1951)Weibull, W.A statistical distribution function of wide applicability.Journal of applied mechanics, 1951.
  • Xiong etal. (2024)Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B.Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.In Proceedings of the 12th International Conference on Learning Representations, 2024.
  • Yang etal. (2024)Yang, R., Rajagopal, D., Hayati, S.A., Hu, B., and Kang, D.Confidence calibration and rationalization for LLMs via multi-agent deliberation.In International Conference on Learning Representations Workshop on Reliable and Responsible Foundation Models, 2024.
  • YAO etal. (2024)YAO, Y., Wu, H., Guo, Z., Biyan, Z., Gao, J., Luo, S., Hou, H., Fu, X., and Song, L.Learning from correctness without prompting makes LLM efficient reasoner.In Proceedings of the 1st Conference on Language Modeling, 2024.
  • Ying etal. (2024)Ying, H., Zhang, S., Li, L., Zhou, Z., Shao, Y., Fei, Z., Ma, Y., Hong, J., Liu, K., Wang, Z., Wang, Y., Wu, Z., Li, S., Zhou, F., Liu, H., Zhang, S., Zhang, W., Yan, H., Qiu, X., Wang, J., Chen, K., and Lin, D.Internlm-math: Open math large language models toward verifiable reasoning, 2024.
  • Yu etal. (2024)Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J.T., Li, Z., Weller, A., and Liu, W.Metamath: Bootstrap your own mathematical questions for large language models.In Proceedings of the 12th International Conference on Learning Representations, 2024.
  • Zamil & Rabby (2024)Zamil, P. and Rabby, G.Aime problems 1983 to 2024, 2024.
  • Zhang etal. (2024)Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J.ReST-MCTS*: LLM self-training via process reward guided tree search.In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 2024.
  • Zhang etal. (2023)Zhang, Z., Zhang, A., Li, M., and Smola, A.Automatic chain of thought prompting in large language models.In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Zhao etal. (2024)Zhao, X., Zhang, H., Pan, X., Yao, W., Yu, D., Wu, T., and Chen, J.Fact-and-reflection (far) improves confidence calibration of large language models.In Findings of the Association for Computational Linguistics, pp. 8702–8718, 2024.
  • Zhou etal. (2023)Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q.V., and Chi, E.H.Least-to-most prompting enables complex reasoning in large language models.In Proceedings of the 11th International Conference on Learning Representations, 2023.

Appendix A Theoretical Results

A.1 Proof of Proposition 1 and Proposition 2

Proof.

(Sc) First, we denote the sampling probability distribution of the LLM as $p(\hat{y}\,|\,x)$, and the confidence function as $\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}[\tilde{y}_{i}=\hat{y}_{i}]$, where $\tilde{y}_{1},\dots,\tilde{y}_{n}$ are sampled from the distribution $p(\hat{y}\,|\,x)$. Applying the error decomposition, we have

\[
\begin{aligned}
\mathcal{E}(\hat{p}^{(\textsc{Sc})})
&=\mathbb{E}_{\tilde{y}_{i}\sim p(\tilde{y}_{i}\,|\,x)}\Big[\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]\\
&=\mathbb{E}_{\tilde{y}_{i}\sim p(\tilde{y}_{i}\,|\,x)}\Big[\big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathbb{I}[\tilde{y}_{i}=\hat{y}_{i}]-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]\\
&=\mathbb{E}_{\tilde{y}_{i}\sim p(\tilde{y}_{i}\,|\,x)}\Big[\big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathbb{I}[\tilde{y}_{i}=\hat{y}_{i}]-p(\hat{y}\,|\,x)+p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}_{i}=y]\big)^{2}\Big]\\
&=\frac{1}{n}\,p(\tilde{y}_{i}\,|\,x)\big(1-p(\tilde{y}_{i}\,|\,x)\big)+\big(p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}_{i}=y]\big)^{2}.
\end{aligned}
\]

(Ppl) Another way is to use the confidence function to build the sampling probability distribution, i.e.,

\[
\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)=
\begin{cases}
p(\hat{t}\,|\,x), & \text{if there is some }\tilde{t}_{i}=\hat{t},\\
0, & \text{otherwise,}
\end{cases}
\;=\;\sum_{i=1}^{n}\mathbb{I}(\tilde{t}_{i}=\hat{t})\,p(\hat{t}\,|\,x).
\]

Now, we have

\[
\begin{aligned}
&\mathbb{E}_{\tilde{t}_{i}\sim p(\tilde{t}_{i}\,|\,x)}\big[\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-p(\hat{t}\,|\,x)\big]=-\big(1-p(\hat{t}\,|\,x)\big)^{n}p(\hat{t}\,|\,x),\\
&\mathbb{E}_{\tilde{t}_{i}\sim p(\tilde{t}_{i}\,|\,x)}\Big[\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-p(\hat{t}\,|\,x)\big)^{2}\Big]=\big(1-p(\hat{t}\,|\,x)\big)^{n}p(\hat{t}\,|\,x)^{2}.
\end{aligned}
\]

Hence,

\[
\begin{aligned}
\mathcal{E}(\hat{p}^{(\textsc{Ppl})})
&=\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]\\
&=\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-p(\hat{t}\,|\,x)+p(\hat{t}\,|\,x)-\mathbb{I}[g(\hat{t})=y]\big)^{2}\Big]\\
&=-\big(1-p(\hat{t}\,|\,x)\big)^{n}p(\hat{t}\,|\,x)^{2}+2\big(1-p(\hat{t}\,|\,x)\big)^{n}p(\hat{t}\,|\,x)\,\mathbb{I}[g(\hat{t})=y]+\big(p(\hat{t}\,|\,x)-\mathbb{I}[g(\hat{t})=y]\big)^{2}\\
&=\big(1-p(\hat{t}\,|\,x)\big)^{n}p(\hat{t}\,|\,x)\big(2\,\mathbb{I}[g(\hat{t})=y]-p(\hat{t}\,|\,x)\big)+\big(p(\hat{t}\,|\,x)-\mathbb{I}[g(\hat{t})=y]\big)^{2}.
\end{aligned}
\]

This completes the proof.∎
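As a quick numerical sanity check (ours, not part of the paper), the Sc branch of this decomposition can be verified by simulation: $\hat{p}^{(\textsc{Sc})}$ is a Binomial$(n,p)$ frequency, so its expected squared error against $\mathbb{I}[\hat{y}=y]$ should match $\frac{1}{n}p(1-p)+(p-\mathbb{I}[\hat{y}=y])^{2}$. The values of $p$ and $n$ below are made up for illustration.

```python
import random

def sc_error_monte_carlo(p, correct, n, trials=200_000):
    # Empirical E[(p_hat - I[y_hat = y])^2], where p_hat is the vote
    # frequency of y_hat over n i.i.d. samples that each agree with
    # probability p = p(y_hat | x).
    target = 1.0 if correct else 0.0
    total = 0.0
    for _ in range(trials):
        votes = sum(random.random() < p for _ in range(n))
        total += (votes / n - target) ** 2
    return total / trials

p, n, correct = 0.3, 16, True  # illustrative values only
empirical = sc_error_monte_carlo(p, correct, n)
analytic = p * (1 - p) / n + (p - 1.0) ** 2
print(f"empirical {empirical:.4f} vs analytic {analytic:.4f}")
```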

A.2 Model Error Comparison in Ideal Case

Proposition 3 (Comparison of Model Errors).

Consider a setting with infinite sampling of reasoning paths ($n\to\infty$) where each incorrect reasoning path leads to a unique answer, that is, $g(\tilde{t}_{i})\neq g(\tilde{t}_{j})$ for any $i\neq j$ with $g(\tilde{t}_{i})\neq y$ and $g(\tilde{t}_{j})\neq y$. The model errors of self-consistency ($\mathcal{E}_{\text{Model}}^{(\textsc{Sc})}$) and perplexity ($\mathcal{E}_{\text{Model}}^{(\textsc{Ppl})}$) satisfy:

\[
\mathcal{E}_{\text{Model}}^{(\textsc{Sc})}\leq\mathcal{E}_{\text{Model}}^{(\textsc{Ppl})}.\qquad(1)
\]

The inequality is strict when the consistency function identifies equivalent correct reasoning paths more than once.

Remark 7.

Although the assumptions in Proposition 3 are idealized, this special case demonstrates that the consistency function in self-consistency achieves better model error than the perplexity method. In practice, the perplexity method always gives a larger model error compared to the self-consistency method, as it does not leverage the consistency function to analyze the structural properties of specific reasoning problems. The proof is presented as follows.

Proof.

We first recall the definitions of the model error for self-consistency and perplexity:

\[
\begin{cases}
\mathcal{E}_{\text{Model}}^{(\textsc{Sc})}&=\displaystyle\sum_{\hat{y}\in\{g(\tilde{t}_{i})\,|\,i=1,\dots,n\}}\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2},\\
\mathcal{E}_{\text{Model}}^{(\textsc{Ppl})}&=\displaystyle\sum_{\hat{t}\in\mathcal{R}}\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-\mathbb{I}[g(\hat{t})=y]\big)^{2},
\end{cases}\qquad(2)
\]

where $\mathcal{R}=\text{Set}(\tilde{t}_{1},\ldots,\tilde{t}_{n})$ is the set of reasoning paths sampled from the LLM. We can compute the difference between the model errors of Sc and Ppl as follows:

\[
\begin{aligned}
\mathcal{E}_{\text{Model}}^{(\textsc{Sc})}-\mathcal{E}_{\text{Model}}^{(\textsc{Ppl})}
&=\sum_{\hat{y}\in\{g(\tilde{t}_{i})\,|\,i=1,\dots,n\}}\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}-\sum_{\hat{t}\in\mathcal{R}}\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-\mathbb{I}[g(\hat{t})=y]\big)^{2}\qquad(3)\\
&=\sum_{\hat{y}\in\{g(\tilde{t}_{i})\,|\,i=1,\dots,n\}}\underbrace{\Big[\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]}_{(A)}.
\end{aligned}
\]

Assuming infinite samplings, the unbiasedness of Sc gives us:

\[
\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}[g(\tilde{t}_{i})=\hat{y}]=\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\cdot\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x).\qquad(4)
\]

For any $\hat{y}$, consider two cases:

  1. (i)

    If $\hat{y}=y$, then $\hat{y}$ is the correct answer. We have

    \[
    \begin{aligned}
    (A)&=\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-1\big)^{2}-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-1\big)^{2}\qquad(5)\\
    &=\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)^{2}+1-2\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)\big)-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)^{2}+1-2\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)\big)\\
    &=\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)^{2}+1-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\cdot\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)^{2}-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\\
    &\leq\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)^{2}+1-\frac{\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)^{2}}{\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]}-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}].
    \end{aligned}
    \]

    Let $\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)^{2}=B^{2}$ and $\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]=C$; then

    \[
    \begin{aligned}
    (A)&\leq B^{2}+1-\frac{B^{2}}{C}-C\qquad(6)\\
    &=\frac{1}{C}\big[B^{2}C+C-B^{2}-C^{2}\big]\\
    &=\frac{1}{C}\big[(C-B^{2})(1-C)\big]\\
    &\leq 0.
    \end{aligned}
    \]

    Equality holds if $\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]=C=1$. This indicates that $(A)<0$ whenever the consistency function is effective at least once, making $\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]=C>1$.

  2. (ii)

    If $\hat{y}\neq y$, then $\hat{y}$ is incorrect. Assuming distinct answers for wrong reasoning paths, we have $\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]=1$, thus

    \[
    \begin{aligned}
    (A)&=\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}-\sum_{\hat{t}\in\mathcal{R}}\mathbb{I}[g(\hat{t})=\hat{y}]\big(\hat{p}^{(\textsc{Ppl})}(\hat{t}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\\
    &=\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}-\big(\hat{p}^{(\textsc{Sc})}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}=0,\qquad(7)
    \end{aligned}
    \]

    since only one $\hat{t}\in\mathcal{R}$ satisfies that $g(\hat{t})$ equals the incorrect answer $\hat{y}$.

Therefore, $\mathcal{E}_{\text{Model}}^{(\textsc{Sc})}-\mathcal{E}_{\text{Model}}^{(\textsc{Ppl})}\leq 0$, indicating that the model error of self-consistency is always less than or equal to the model error of perplexity under our assumptions. Moreover, if the consistency function is effective at least once, the model error of self-consistency is strictly less than the model error of perplexity.∎
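A small toy computation (ours; the path probabilities below are made up purely for illustration) shows Proposition 3 at work in the idealized infinite-sample setting: because the consistency function merges the three paths that reach the correct answer, the Sc confidence for that answer is much closer to 1, and its model error is correspondingly smaller.

```python
# Toy check of Proposition 3 in the idealized setting: three reasoning paths
# reach the correct answer "A", and each incorrect path reaches a distinct
# answer. All probabilities are illustrative, not from the paper.
paths = [            # (answer, p(t | x))
    ("A", 0.2), ("A", 0.2), ("A", 0.1),   # correct answer y = "A"
    ("B", 0.3), ("C", 0.2),               # unique incorrect answers
]
y = "A"

# With infinite sampling, the Sc confidence of an answer is the total
# probability of the paths that reach it, while Ppl scores each path alone.
answer_prob = {}
for answer, p in paths:
    answer_prob[answer] = answer_prob.get(answer, 0.0) + p

model_error_sc = sum((p - (1.0 if answer == y else 0.0)) ** 2
                     for answer, p in answer_prob.items())
model_error_ppl = sum((p - (1.0 if answer == y else 0.0)) ** 2
                      for answer, p in paths)

print(f"Sc model error  = {model_error_sc:.2f}")   # 0.38
print(f"Ppl model error = {model_error_ppl:.2f}")  # 2.22
assert model_error_sc <= model_error_ppl
```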

A.3 Proof of Theorem 3

Proof.

For a given answer $\hat{y}$, the estimated probability of Pc is $\hat{p}(\hat{y}\,|\,x)=\sum_{i=1}^{n}\mathbb{I}[g(\tilde{t}_{i})=\hat{y}]\,p(\tilde{t}_{i}\,|\,x)$, so the reasoning error of Pc can be computed as follows.

\[
\begin{aligned}
\mathcal{E}^{(\textsc{Pc})}(\hat{p})
&=\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]\\
&=\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}(\hat{y}\,|\,x)-p(\hat{y}\,|\,x)+p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big].
\end{aligned}
\]

We define $k:=|\{\tilde{t}\mid g(\tilde{t})=\hat{y}\}|$, i.e., the number of reasoning paths whose answer is $\hat{y}$ that can be covered under a limited sampling budget. Note that we further have $\mathbb{E}_{\tilde{t}\sim p(t\,|\,x)}\big[\mathbb{I}[g(\tilde{t})=\hat{y}]\,p(\tilde{t}\,|\,x)\big]=\frac{1}{k}\,p(\hat{y}\,|\,x)$, thus

\[
\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}[\hat{p}(\hat{y}\,|\,x)]
=\mathbb{E}\Big[\sum_{i=1}^{n}\mathbb{I}[g(\tilde{t}_{i})=\hat{y}]\,p(\tilde{t}_{i}\,|\,x)\Big]
=\Big(1-\big(1-\tfrac{1}{k}p(\hat{y}\,|\,x)\big)^{n}\Big)p(\hat{y}\,|\,x)
=\big(1-(1-\alpha)^{n}\big)p(\hat{y}\,|\,x),
\]

where $\alpha:=\frac{1}{k}p(\hat{y}\,|\,x)$. Again, we have

\[
\begin{aligned}
&\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\big[\hat{p}(\hat{y}\,|\,x)-p(\hat{y}\,|\,x)\big]=-(1-\alpha)^{n}\,p(\hat{y}\,|\,x),\\
&\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}(\hat{y}\,|\,x)-p(\hat{y}\,|\,x)\big)^{2}\Big]=\big(1-(1-\alpha)^{n}\big)(1-\alpha)^{n}\,p(\hat{y}\,|\,x)^{2}.
\end{aligned}
\]

Hence,

\[
\begin{aligned}
\mathcal{E}^{(\textsc{Pc})}(\hat{p})
&=\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]\\
&=\mathbb{E}_{\tilde{t}_{i}\sim p(t\,|\,x)}\Big[\big(\hat{p}(\hat{y}\,|\,x)-p(\hat{y}\,|\,x)+p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}=y]\big)^{2}\Big]\\
&=(1-\alpha)^{n}\,p(\hat{y}\,|\,x)\Big(2\,\mathbb{I}[\hat{y}=y]-\big(1+(1-\alpha)^{n}\big)p(\hat{y}\,|\,x)\Big)+\big(p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}_{i}=y]\big)^{2},
\end{aligned}
\]

which completes the proof.∎

A.4 Proof of Theorem 5

Proof.

Let us set the pruning threshold as $\tau := p(y\,|\,x)$. Then, we have

\[
\begin{aligned}
\mathcal{E}^{(\textsc{Rpc})}(\hat{p}) = {}& \underbrace{\alpha\,p(\hat{y}\,|\,x)\big(2\,\mathbb{I}[\hat{y}=y]-(1+\alpha)p(\hat{y}\,|\,x)\big)\,\mathbb{I}\big[p(\hat{y}\,|\,x)<\tau\big]}_{\text{Estimation Error}}\\
& + \underbrace{\big(p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}_{i}=y]\big)^{2}\,\mathbb{I}\big[p(\hat{y}\,|\,x)<\tau\big]}_{\text{Model Error}}.
\end{aligned}
\]

However, we only have an estimate of $p(\hat{y}\,|\,x)$, namely $\hat{p}(\hat{y}\,|\,x)=k\,\mathbb{E}_{\tilde{t}\sim p(t\,|\,x)}\big[\mathbb{I}[g(\tilde{t})=\hat{y}]\,p(\tilde{t}\,|\,x)\big]\approx\frac{k}{\hat{k}}\sum_{i=1}^{\hat{k}}p(\tilde{t}_{i}\,|\,x)$, where $\tilde{t}_{1},\dots,\tilde{t}_{\hat{k}}$ are the $\hat{k}$ sampled reasoning paths whose answer is $\hat{y}$. Therefore, the reasoning error of our approximate version can be computed as

\[
\begin{aligned}
\hat{\mathcal{E}}^{(\textsc{Rpc})}(\hat{p}) = {}& \underbrace{\alpha(p)\,p(\hat{y}\,|\,x)\big(2\,\mathbb{I}[\hat{y}=y]-(1+\alpha(p))p(\hat{y}\,|\,x)\big)\,\mathbb{I}\Big[\tfrac{1}{k}\textstyle\sum_{i=1}^{\hat{k}}p(\tilde{t}_{i}\,|\,x)<\tfrac{1}{m}\tau\Big]}_{\text{Estimation Error}}\\
& + \underbrace{\big(p(\hat{y}\,|\,x)-\mathbb{I}[\hat{y}_{i}=y]\big)^{2}\,\mathbb{I}\Big[\tfrac{1}{\hat{k}}\textstyle\sum_{i=1}^{\hat{k}}p(\tilde{t}_{i}\,|\,x)<\tfrac{1}{m}\tau\Big]}_{\text{Model Error}}.
\end{aligned}
\]

Hence, we only need to consider the probability that $\frac{1}{\hat{k}}\sum_{i=1}^{\hat{k}}p(\tilde{t}_{i}\,|\,x)>\frac{1}{m}\tau$. Using Hoeffding's inequality, we can obtain that

\[
\mathbb{P}\Big(\frac{1}{\hat{k}}\sum_{i=1}^{\hat{k}}p(\tilde{t}_{i}\,|\,x)-\frac{1}{k}p(\hat{y}\,|\,x)\geq\gamma\Big)\leq\exp\Big(-\frac{2\hat{k}\gamma^{2}}{p(\hat{y}\,|\,x)^{2}}\Big).
\]

We set $\gamma=\tau-\frac{1}{k}p(\hat{y}\,|\,x)=\tau+\alpha-1$; then

\[
\mathbb{P}\Big(\frac{1}{\hat{k}}\sum_{i=1}^{\hat{k}}p(\tilde{t}_{i}\,|\,x)\geq\tau\Big)\leq\exp\Big(-2\hat{k}k^{2}\Big(1-\frac{\tau}{1-\alpha}\Big)^{2}\Big).
\]

This completes the proof. ∎

Appendix B Pseudo Code of Rpc Method

In this section, we provide the pseudo code of Rpc. The output of Algorithm 1 is the set of reasoning paths with the highest confidence; the extraction function $g(\cdot)$ can then be used to transform these reasoning paths into answers.

Algorithm 1 Reasoning-Pruning Perplexity Consistency (Rpc)

Require: sampled reasoning paths $\tilde{t}_{1},\ldots,\tilde{t}_{n}$; LLM internal probabilities $p_{1},\ldots,p_{n}$; consistency function $\mathbb{I}_{C}(\cdot,\cdot)$
Ensure: most-confident reasoning paths with their probabilities

1: {Phase 1: Reasoning Pruning}
2: $(k_{1},\lambda_{1},k_{2},\lambda_{2},w_{1},w_{2})\leftarrow$ probability distribution parameters fitted from $p_{1},\ldots,p_{n}$  {using Subsection 3.2}
3: $p_{\text{mean}}\leftarrow\frac{1}{n}\sum_{i=1}^{n}p_{i}$
4: $\mathrm{I}_{\text{retain}}\leftarrow\{\,i\mid P_{\text{High}}(p_{i})>0.5\ \text{or}\ p_{i}\geq p_{\text{mean}}\,\}$  {$P_{\text{High}}$ is defined in Subsection 3.2}
5: {Phase 2: Perplexity Consistency}
6: $\mathrm{U}\leftarrow\mathrm{Set}(\tilde{t}_{1},\ldots,\tilde{t}_{n})$
7: $\mathrm{I}_{\text{unique}}\leftarrow\{\,i\mid\tilde{t}_{i}\in\mathrm{U}\ \text{and}\ i\in\mathrm{I}_{\text{retain}}\,\}$
8: for each reasoning path $\tilde{t}\in\mathrm{U}$ do
9:   $\mathrm{C}(\tilde{t})\leftarrow\sum_{i\in\mathrm{I}_{\text{retain}}}\mathbb{I}_{C}[\tilde{t},\tilde{t}_{i}]\,p_{i}$
10: end for
11: $\mathrm{C}_{\max}\leftarrow\max_{\tilde{t}\in\mathrm{U}}\mathrm{C}(\tilde{t})$
12: return $\{\,(\tilde{t},\mathrm{C}(\tilde{t}))\mid\tilde{t}\in\mathrm{U},\ \mathrm{C}(\tilde{t})=\mathrm{C}_{\max}\,\}$
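To make the two phases concrete, the following is a minimal Python sketch of Algorithm 1. It assumes the per-path internal probabilities $p_i$ are given, that the probability distribution is modeled by a two-component mixture whose parameters are fitted beforehand (as in Subsection 3.2), and that `consistent` implements the consistency function $\mathbb{I}_C$; the helper names and the use of SciPy's Weibull density are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy import stats


def p_high(p, k1, lam1, k2, lam2, w1, w2):
    # Posterior probability that p belongs to the high-probability component of a
    # two-component Weibull mixture (an illustrative stand-in for P_High in
    # Subsection 3.2; component 1 is assumed to be the high-probability one).
    d1 = w1 * stats.weibull_min.pdf(p, k1, scale=lam1)
    d2 = w2 * stats.weibull_min.pdf(p, k2, scale=lam2)
    return d1 / (d1 + d2 + 1e-12)


def rpc(paths, probs, consistent, mixture_params):
    """Sketch of Algorithm 1 (Rpc).

    paths          -- sampled reasoning paths t_1, ..., t_n
    probs          -- LLM internal probabilities p_1, ..., p_n
    consistent     -- consistency function I_C(t, t') -> bool
    mixture_params -- (k1, lam1, k2, lam2, w1, w2), assumed pre-fitted from probs
    """
    probs = np.asarray(probs, dtype=float)

    # Phase 1: Reasoning Pruning -- retain paths that are likely high-probability
    # or at least above the mean internal probability.
    p_mean = probs.mean()
    retain = [i for i, p in enumerate(probs)
              if p_high(p, *mixture_params) > 0.5 or p >= p_mean]

    # Phase 2: Perplexity Consistency -- each unique path accumulates the internal
    # probabilities of retained paths that are consistent with it.
    unique_paths = list(dict.fromkeys(paths))
    confidence = {t: 0.0 for t in unique_paths}
    for t in unique_paths:
        for i in retain:
            if consistent(t, paths[i]):
                confidence[t] += probs[i]

    c_max = max(confidence.values())
    return [(t, c) for t, c in confidence.items() if c == c_max]
```

In this sketch, pruning simply drops sampled paths that neither fall in the high-probability component nor exceed the mean probability, and the remaining paths vote with their internal probabilities rather than with uniform weights.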

Appendix C Detailed Experimental Settings

C.1 Datasets

For mathematical reasoning tasks, we evaluate our proposed method and the comparison methods on four mathematical datasets: MATH, MathOdyssey, OlympiadBench, and AIME.

  • MATH dataset (Hendrycks et al., 2021b) comprises challenging competition math problems; we use its 5,000 test problems for evaluation.

  • MathOdyssey dataset (Fang et al., 2024) contains 387 problems, covering advanced high-school-level, university-level, and Olympiad-level mathematics.

  • OlympiadBench dataset (He et al., 2024) contains 8,476 Olympiad-level mathematics and physics problems. We select the English problems without images, resulting in a test set of 1,284 problems.

  • AIME dataset (Zamil & Rabby, 2024) contains 993 test problems collected from the American Invitational Mathematics Examination, spanning from 1983 to 2024.

For code generation tasks, we conduct experiments on three common benchmark datasets. HumanEval (Chen et al., 2021) contains 164 hand-written Python programming problems. MBPP (Austin et al., 2021) (sanitized version) consists of 427 entry-level programming problems. We also include the introductory-level problems of APPS (Hendrycks et al., 2021a), which contains 1,000 problems.

C.2 Details of the Mathematical Reasoning Task

For all the experiments in the main paper, we use a sampling temperature of 1.0 and set the top-p parameter to 0.95.
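For reference, the sketch below shows how such a sampling configuration might be passed to a Hugging Face-style generation call; the model identifier, the prompt, the output length, and the number of returned sequences are illustrative assumptions rather than the exact experimental setup.

```python
# Minimal sketch of the sampling configuration (temperature 1.0, top-p 0.95);
# the model id, prompt, and sample count below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2-math-plus-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Problem: ...", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,          # sampling temperature used in the main experiments
    top_p=0.95,               # nucleus-sampling threshold
    max_new_tokens=512,       # illustrative generation length
    num_return_sequences=8,   # number of reasoning paths per problem (illustrative)
    return_dict_in_generate=True,
    output_scores=True,       # keep per-step scores so token log-probabilities can be recovered
)
```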

Prompt for Math Reasoning Tasks.

The InternLM2-MATH-Plus 1.8B and 7B models are chat models that facilitate conversations between two roles: “user” and “assistant”. The prompt for the “user” role is provided in Prompt C.2. Similarly, the prompt for the DeepSeek-Math 7B model is shown in Prompt C.2.

Prompt for Math Verbalized Method.

We observed that the tuned math models are challenging to prompt for generating confidence. Therefore, we adopted the method from Tian et al. (2023) to calculate the probability based on the likelihood of the first generated “True” token and the first generated “False” token. The corresponding prompt is provided in Prompt C.2.
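As a sketch of this scoring rule (under the assumption that the caller has already read off the log-likelihoods of the first generated “True” token and the first generated “False” token), the confidence can be computed as follows:

```python
import math


def verbalized_confidence(logprob_true: float, logprob_false: float) -> float:
    # Normalize the likelihoods of the first generated "True" and "False" tokens
    # into a confidence score, in the spirit of Tian et al. (2023). How the two
    # log-likelihoods are extracted from the model output is assumed to be
    # handled by the caller.
    p_true = math.exp(logprob_true)
    p_false = math.exp(logprob_false)
    return p_true / (p_true + p_false)


# Example: verbalized_confidence(-0.12, -2.30) is roughly 0.90
```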

C.3 Details of the Code Generation Task

Code Generation.

On the code generation task, we let the LLM generate a code snippet to solve a given programming problem and then evaluate its functional correctness against the ground-truth test cases provided by the dataset. In detail, we set the top-p to 0.95 and the maximum generation length to 1024 tokens. For code snippet post-processing, we first extract the code text from the code block surrounded by triple backticks (```), and then follow Chen et al. (2021) to truncate the generated code snippet before the following stop sequences: “\nclass”, “\ndef”, “\n#”, “\nif”, “\nprint”. At the same time, we obtain the log-probability of each token from the LLM response. For the “verbalization” setting, the verbalized confidence is extracted from the text generated by the LLM along with the code snippet.
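The post-processing step can be sketched as follows; the regular expression for locating the triple-backtick block is an assumption, while the stop sequences are those listed above.

```python
import re

STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def extract_code(completion: str) -> str:
    # Pull the code out of a triple-backtick block if one is present,
    # otherwise keep the raw completion.
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    code = match.group(1) if match else completion
    # Truncate before the first occurrence of any stop sequence,
    # following the post-processing of Chen et al. (2021).
    for stop in STOP_SEQUENCES:
        idx = code.find(stop)
        if idx != -1:
            code = code[:idx]
    return code
```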

Self-consistency on Code.

We follow Chen et al. (2023a) to sample 100 test cases for each programming problem from the same model. We then apply self-consistency to code at the semantic-equivalence level, based on the execution behavior of any two code snippets on this set of test cases. More formally, we implement the consistency function $\mathbb{I}_C(\cdot,\cdot)$ as an indicator of whether two code snippets are semantically equivalent, i.e., $\mathbb{I}_C(x,y)=1$ if and only if codes $x$ and $y$ produce the same execution results on this set of test cases.
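A minimal sketch of this consistency function is given below; `run_code`, which executes a code snippet on a single test input and returns its output (with sandboxing and timeouts omitted), is a hypothetical helper.

```python
from typing import Any, Callable, List


def make_consistency_fn(test_inputs: List[Any],
                        run_code: Callable[[str, Any], Any]) -> Callable[[str, str], int]:
    # Two code snippets are treated as semantically equivalent iff they produce
    # identical outputs on every sampled test input.
    def consistency(x: str, y: str) -> int:
        for t in test_inputs:
            if run_code(x, t) != run_code(y, t):
                return 0
        return 1
    return consistency
```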

Prompt for Generating Code.

The prompt for generating code consists of a header, a function signature, and a docstring, and the LLM needs to implement the body of this function. An illustration is shown in Prompt C.3.

Prompt for Code Verbalized Method.

For generating code with verbalized confidence, we added instructions for producing a verbalized confidence score, as well as format requirements to facilitate the extraction of the code and the confidence score. We also gave a simple example at the end of the prompt to help the LLM understand the format requirements. An illustration is shown in Prompt C.3.

Appendix D Detailed Experimental Results

D.1 Performance on GSM8k dataset.

In Section 2, we analyze the GSM8k dataset, which is relatively easy and thus allows accurate estimation of ground-truth probabilities with 100 samples. Due to this characteristic, we exclude the GSM8k dataset from our main experiments, as a limited number of samples is sufficient for accurate confidence evaluation. Table 4 shows the performance of the InternLM-2-MATH-Plus 7B model on GSM8k with various sampling temperatures, where the number of samples is set to n = 6 to maintain consistency with Figure 2. The results show that the Rpc method outperforms the comparison methods.


Table 4: Performance of the InternLM-2-MATH-Plus 7B model on GSM8k with various sampling temperatures (n = 6).

Temperature | Ppl | Verb | Sc | Rpc
T = 1.0 | 86.97 ± 0.29 | 63.67 ± 0.98 | 89.32 ± 0.26 | 89.45 ± 0.38
T = 1.1 | 86.78 ± 0.43 | 62.25 ± 1.00 | 89.44 ± 0.35 | 89.51 ± 0.36
T = 1.3 | 86.65 ± 0.57 | 61.29 ± 0.84 | 88.92 ± 0.32 | 89.08 ± 0.57

D.2 Results with High Sampling Temperature.

Using a high sampling temperature enables language models to produce more diverse outputs, potentially enhancing reasoning performance. However, it also increases the estimation error. To investigate the effectiveness of our approach in addressing the estimation error issue, we conducted experiments with higher sampling temperatures (T = 1.1 and T = 1.3) using the InternLM-2-MATH-Plus 7B model. The results in Table 5 indicate that our Rpc approach consistently surpasses the baseline methods. Notably, a significant performance gap persists between Rpc and Sc, indicating that our method effectively tackles the estimation error issue even under high-temperature sampling conditions.

Table 5: Performance of the InternLM-2-MATH-Plus 7B model at higher sampling temperatures.

Temperature = 1.1
Method | MATH | MathOdyssey | OlympiadBench | AIME
Ppl | 47.35 ± 0.16 | 28.59 ± 1.30 | 7.27 ± 0.23 | 6.02 ± 0.34
Verb | 25.51 ± 0.23 | 9.41 ± 0.44 | 3.66 ± 0.16 | 3.07 ± 0.15
Sc | 50.66 ± 0.22 | 27.89 ± 0.43 | 10.74 ± 0.15 | 8.73 ± 0.24
Rpc | 52.58 ± 0.14 | 32.98 ± 0.69 | 11.00 ± 0.24 | 9.30 ± 0.29

Temperature = 1.3
Method | MATH | MathOdyssey | OlympiadBench | AIME
Ppl | 47.58 ± 0.31 | 26.38 ± 1.41 | 7.76 ± 0.46 | 6.50 ± 0.41
Verb | 24.62 ± 0.33 | 8.60 ± 0.26 | 3.11 ± 0.17 | 2.29 ± 0.12
Sc | 50.65 ± 0.14 | 27.61 ± 0.67 | 10.49 ± 0.18 | 8.02 ± 0.20
Rpc | 53.12 ± 0.14 | 33.19 ± 0.56 | 10.91 ± 0.18 | 8.83 ± 0.23

D.3 Performance with Different Numbers of Samples

In Figure 4, we plot the accuracy of the InternLM-2-MATH-Plus 7B model on four mathematical reasoning datasets with different sample sizes n. Here, we give detailed results for different models and different sampling temperatures.

Different Model Scales.

The performance of the relatively small model, InternLM-2-MATH-Plus 1.8B, is presented in Figure 7. Similar conclusions can be drawn from these results. For the MathOdyssey dataset, the Ppl method shows superior performance compared to the other methods, which can be attributed to the relatively low model error of Ppl on this dataset, allowing the perplexity-based approach to function effectively. Furthermore, the Rpc method consistently outperforms the Sc method, demonstrating its ability to improve the convergence properties of the Sc method.

[Figure 7: Accuracy of the InternLM-2-MATH-Plus 1.8B model on the four mathematical reasoning datasets with different sample sizes.]

Different Sampling Temperatures.

We also evaluate the InternLM-2-MATH-Plus 7B model with sampling temperatures T = 1.1 and T = 1.3. The results, presented in Figure 8 and Figure 9, demonstrate that the Rpc method effectively improves the model's reasoning performance at higher temperatures, as it leverages the increased diversity in sampling to enhance self-consistency. In contrast, the Sc method's performance deteriorates due to increased estimation errors at higher temperatures.

[Figures 8 and 9: Accuracy of the InternLM-2-MATH-Plus 7B model on the four mathematical reasoning datasets with different sample sizes, at sampling temperatures T = 1.1 and T = 1.3.]