
Maximum Likelihood Estimation of Network Size by N Parallel k-lookups in Kademlia DHT

PDF version πŸ”—

1. Introduction

The lookup (often informally referred to as "k-lookup") is the primary and most crucial function in the Kademlia peer-to-peer distributed hash table (DHT) protocol [1]. It is an iterative algorithm designed to locate the $k$ closest nodes to a specific target ID within the network. The final set of $k$ closest nodes found has XOR distances that are minimal within the network, and the magnitude of these distances is related to the density of nodes, which itself depends on the network size $n$. Clearly, by performing lookups towards $N$ random target IDs, we gather more information about the anticipated network size. A method of estimating $n$ by processing $N$ parallel k-lookups is proposed in [2]. It is based on the fact that the distances in each ordered lookup result are order statistics with specific beta distributions which depend on $n$. Then, following the method of moments (MoM) [3], the estimator for $n$ is obtained by matching the expected value of the beta distribution with the empirical value computed by averaging distances from the lookup data. The suggested estimator is attractive from a practical point of view; however, some aspects of the provided solution are heuristic, e.g., the ways of averaging the sample data. It is also hard to assess the efficiency of the MoM estimator, because there is no way to obtain the theoretical minimum variance for an unbiased estimator.

Using the same model as in [2], the current work considers Maximum Likelihood Estimation (MLE) [4]. The MLE is generally preferred over MoM because it produces more efficient, consistent, and often unbiased estimators with stronger theoretical properties. MLE requires no heuristic assumptions and can be derived directly from the joint probability density function (PDF) of the distances from $N$ parallel k-lookups. The joint PDF also allows us to obtain the Cramér–Rao bound [5] for the minimal variance of an efficient estimator.

The main results of the work are:

  • the MLE estimator (3), which uses the average of logarithms of normalised distances from $N$ random $k$-lookups;
  • the Cramér–Rao bound (4) for minimal variance of the efficient estimator and its approximations (6–7);
  • illustration of the efficiency of the MLE estimator by results of computer simulation.

2. The model

Kademlia distance estimation works by querying the $k$ "closest" nodes to a random target ID.

Let $D_1, \dots, D_n$ be the XOR distances from a probe node to all $n$ network nodes, and let $U_i = D_i/(2^L-1)$ be the normalised distances for $L$-bit addresses. $U_1, \dots, U_n$ are modelled as independent random variables uniformly distributed on $(0, 1)$.

The result of a single lookup is the ordered set of the $k$ smallest distances $\{U_{(1)}, \dots, U_{(k)}\}$ selected from $U_1, \dots, U_n$. Here $U_{(i)}$ is the distance to the $i$-th closest node. So, we observe the first $k$ order statistics [6] from a sample of normalised distances $U_1, \dots, U_n$.

When performing $k$-lookups towards $N$ random targets, we make the assumption that all $N$ samples of distances (and hence the observed order statistics) are independent.
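Under this model, a single lookup towards a random target can be mimicked by drawing $n$ uniforms and keeping the $k$ smallest. A minimal Python sketch (the function name is illustrative, not part of the original model):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_lookup(n, k):
    """One k-lookup under the model: the first k order statistics
    U_(1) <= ... <= U_(k) of n uniform normalised distances."""
    return np.sort(rng.random(n))[:k]

print(simulated_lookup(n=1000, k=8))  # the k smallest normalised distances
```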

3. The likelihood function

For a single lookup, the likelihood function (LF) is the joint PDF of the first $k$ order statistics $U_{(1)}, \dots, U_{(k)}$ from a sample of size $n$:

\begin{gather*}
f_{U_{(1)}, \dots, U_{(k)}}(u_1, \dots, u_k \mid n) = \frac{n!}{(n-k)!} (1 - u_k)^{n-k}, \\
\text{for} \quad 0 < u_1 < \dots < u_k < 1 \tag{1}
\end{gather*}

We derive this in the Appendix, but notice an important fact now: the LF (1) depends on the observation only through $u_k$, so the optimal ML estimate of $n$ can be obtained by taking into account only the $k$-th order statistic. The statistics of lower orders are irrelevant to the estimate. In other words, to estimate the network size, only the last, $k$-th distance in the ordered lookup result is sufficient.

This conclusion generalises to the optimal Bayesian estimator, because the LF, as a part of the posterior distribution, is the only place where the observation enters.

Note that the heuristic solutions proposed in [2] take into account all distances from the lookup result.

Using a simplified notation, valid in the region where the LF is non-zero,

f_1(u \mid n) = \frac{n!}{(n-k)!} (1 - u)^{n-k}

and taking into account the assumption of independence of the samples, we get the LF for $N$ parallel $k$-lookups towards random targets:

\begin{align*}
L(n) &= \prod_{i=1}^{N} f_1(u_i \mid n) \\
&= \left[ \frac{n!}{(n-k)!} \right]^N \prod_{i=1}^{N} (1 - u_i), \tag{2}
\end{align*}

where the observation $u_i$ is the $k$-th smallest normalised distance taken from the $i$-th lookup.
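For numerical work it is convenient to evaluate the logarithm of (2) directly, rewriting the falling factorial through the log-gamma function. A sketch assuming SciPy is available; the name `loglik` is illustrative:

```python
import numpy as np
from scipy.special import gammaln  # gammaln(x) = ln Γ(x)

def loglik(n, u, k):
    """ln of the likelihood (2), with ln(n!/(n-k)!) expressed as
    gammaln(n+1) - gammaln(n-k+1); u holds the N observations,
    i.e. the k-th smallest normalised distances u_1, ..., u_N."""
    u = np.asarray(u)
    return len(u) * (gammaln(n + 1) - gammaln(n - k + 1)) \
        + (n - k) * np.log1p(-u).sum()
```

Maximising this over integer $n$ reproduces, up to rounding, the closed-form estimator derived next.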

4. Maximum likelihood estimate

The MLE estimator $\hat n$ maximises (2). This means $\hat n$ is the integer at which the likelihood sequence $L(n)$ stops increasing and starts decreasing. To find it, consider the ratio function $G(n) = L(n)/L(n-1)$ and the conditions that $\hat n$ satisfies:

  1. The likelihood is increasing up to $\hat n$: $G(\hat n) \ge 1$.
  2. The likelihood starts decreasing after $\hat n$: $G(\hat n + 1) < 1$.

As follows from (2), the ratio function is:

G(n) = \left( \frac{n}{n-k} \right)^N \prod_{i=1}^{N} (1 - u_i)

and within rounding error the MLE is the root of the equation $G(n) = 1$. Writing $P = \prod_{i=1}^{N} (1 - u_i)$, the condition $G(n) = 1$ gives $\left(\frac{n}{n-k}\right)^N = 1/P$, i.e. $\frac{n-k}{n} = P^{1/N}$, and solving for $n$:

\hat n = \frac{k}{1 - \left[ \prod_{i=1}^{N} (1 - u_i) \right]^{1/N}}

From a practical point of view, the last expression is easier to implement in the equivalent logarithmic form:

\begin{align*}
\hat n &= \frac{k}{1 - \exp(\overline{L}_u)}, \\
\text{where} \quad \overline{L}_u &= \frac{1}{N} \sum_{i=1}^{N} \ln(1 - u_i) \tag{3}
\end{align*}

Algorithmically, the estimator (3) performs the following steps:

  1. For each lookup $i \in \{1, \dots, N\}$, which consists of the $k$ smallest distances to a random target:
    • select the maximal distance $d_i$;
    • calculate the logarithmic metric $\ln(1-u_i)$ for the normalised distance $u_i = d_i/(2^L-1)$.
  2. Calculate the average $\overline{L}_u$ of the $N$ metrics.
  3. Estimate the network size by (3).

As expected, the estimator (3) always produces a result greater than $k$, because $\overline{L}_u$ is negative. The two edge cases, when the $u_i$ approach $0$ ($\hat n \to \infty$) or $1$ ($\hat n \to k$), are also intuitively understandable.
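A minimal Python sketch of these three steps (the function name is illustrative; the default $L = 160$ is the address length of the original Kademlia [1]):

```python
import math

def estimate_network_size(lookups, k, L=160):
    """MLE (3) of the network size from N parallel k-lookups.

    lookups -- list of N lookup results, each a list of the k smallest
               XOR distances (integers) to one random target ID.
    """
    max_dist = 2**L - 1
    # Step 1: keep only the maximal (k-th) distance of each lookup
    # and compute ln(1 - u_i) for its normalised value.
    metrics = [math.log1p(-max(d) / max_dist) for d in lookups]
    # Step 2: average the N logarithmic metrics.
    l_mean = sum(metrics) / len(metrics)
    # Step 3: estimate the network size by (3).
    return k / (1.0 - math.exp(l_mean))
```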

5. Efficiency of the estimate

MLEs are known for their strong theoretical properties. For large sample sizes, they are consistent (converge to the true parameter), asymptotically normal, and asymptotically efficient (achieve the lowest possible variance).

Treating $n$ as a continuous parameter, the variance of the efficient estimator [5] is given by the Cramér–Rao Lower Bound (CRLB):

\text{Var}(\hat{n}) \ge \frac{1}{I(n)}

where the Fisher information is obtained by averaging the second derivative of the log-likelihood (2):

I(n) = - E \left[ \frac{\partial^2}{\partial n^2} \ln L(n) \right]

The log-likelihood for $N$ observations is:

\begin{align*}
\ln L(n) = N \cdot \ln \frac{\Gamma(n+1)}{\Gamma(n-k+1)} + (n-k) \sum_{i=1}^{N} \ln(1 - u_i),
\end{align*}

where $\Gamma(x)$ is the gamma function. Differentiating this twice, we obtain:

I(n) = - N \left[ \psi'(n+1) - \psi'(n-k+1) \right]

where $\psi'(x) = \frac{\partial^2}{\partial x^2} \ln \Gamma(x)$ is the trigamma function [7].

So, the result for minimal variance is:

\begin{align*}
\text{Var}(\hat{n})_{min} &= \frac{1}{I(n)} \\
&= \frac{1}{N \left[ \psi'(n-k+1) - \psi'(n+1) \right]} \tag{4}
\end{align*}

Next, we approximate the difference of trigamma functions in (4) by:

\psi'(n-k+1) - \psi'(n+1) \approx \frac{1}{n-k} - \frac{1}{n} \tag{5}

The approximation (5) is quite good, especially when $n$ is large and $k$ is small relative to $n$. Indeed, we get (5) using a known series representation [7] and replacing the sum by an integral:

\begin{align*}
\psi'(n-k+1) - \psi'(n+1) &= \sum_{j=n-k+1}^{n} \frac{1}{j^2} \\
&\approx \int_{n-k}^{n} \frac{1}{x^2} \, dx = \frac{1}{n-k} - \frac{1}{n}
\end{align*}

Finally, in practice it is often convenient to normalise the estimation error by the estimated value. Substituting (5) into (4) and dividing the result by $n^2$, we obtain an expression for the normalised variance, which defines the potential precision of the MLE estimator:

\begin{align*}
\sigma^2_{min} = \text{Var}\left(\frac{\hat{n}}{n}\right)_{min} \approx \frac{1}{N} \left( \frac{1}{k} - \frac{1}{n} \right), \quad \text{where} \quad n > k \tag{6}
\end{align*}

It is worth noting that in practice, when the network size $n$ is much larger than $k$, the normalised variance of the efficient MLE is mainly determined by the values of $N$ and $k$, rather than by $n$. For example, with $k=8$ and $N=10$, the standard deviation $\sigma_{min}$ increases only from 10% to 11% as the network grows from 50 to $10^6$ nodes. Dropping the $1/n$ term in (6), the limiting standard deviation of the MLE estimator can be calculated as

\lim_{n\to\infty} \sigma_{min} = \max_{n} \sigma_{min} = \frac{1}{\sqrt{Nk}} \tag{7}
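A small sketch, assuming SciPy is available, comparing the exact bound (4) with the approximations (6) and (7); `scipy.special.polygamma(1, x)` evaluates the trigamma function:

```python
import math
from scipy.special import polygamma  # polygamma(1, x) = trigamma ψ'(x)

def crlb_norm_std(n, k, N):
    """Normalised standard deviation from the exact bound (4)."""
    fisher = N * (polygamma(1, n - k + 1) - polygamma(1, n + 1))
    return math.sqrt(1.0 / fisher) / n

def approx_norm_std(n, k, N):
    """Normalised standard deviation from the approximation (6)."""
    return math.sqrt((1.0 / k - 1.0 / n) / N)

k, N = 8, 10
for n in (50, 10**3, 10**6):
    print(n, crlb_norm_std(n, k, N), approx_norm_std(n, k, N))
print("limit (7):", 1.0 / math.sqrt(N * k))  # ~0.1118
```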

6. Results of statistical simulation

The simulation [8] was performed to compare the precision of the MLE estimator (3) with the theoretical lower bound (6). We process 10,000 estimations for each combination of the parameters $k \in \{8, 20\}$, $N \in \{10, 20, 40\}$ and network sizes $n \in \{25, 10^2, 10^3, 10^4, 10^5\}$. Each estimation receives $N$ statistics of order $k$ from $n$ independent random variables uniformly distributed on $(0, 1)$. The normalised errors are then used to calculate the sample variance $\sigma^2$, which is compared with the theoretical lower bound $\sigma^2_{min}$ (6).
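A condensed reimplementation sketch of this procedure (not the original simulation code [8]); instead of sorting $n$ uniforms per lookup, it draws the $k$-th order statistic directly from its Beta$(k,\, n-k+1)$ distribution, mentioned in the introduction:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_std(n, k, N, trials=10_000):
    """Sample standard deviation of the normalised MLE error."""
    # The k-th smallest of n uniforms follows Beta(k, n - k + 1).
    u = rng.beta(k, n - k + 1, size=(trials, N))
    log_mean = np.log1p(-u).mean(axis=-1)   # average of ln(1 - u_i)
    n_hat = k / (1.0 - np.exp(log_mean))    # estimator (3)
    return ((n_hat - n) / n).std()

print(simulate_std(n=10**3, k=8, N=10))  # cf. Table 1: 0.11468
```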

$(N,\ k)$     $n=25$    $n=10^2$   $n=10^3$   $n=10^4$   $n=10^5$
$(10,\ 8)$    0.09724   0.11158    0.11468    0.11453    0.11422
$(20,\ 8)$    0.06723   0.07773    0.07973    0.07961    0.08027
$(40,\ 8)$    0.04753   0.05410    0.05613    0.05646    0.05632
$(10,\ 20)$   0.03412   0.06502    0.07056    0.07106    0.07159
$(20,\ 20)$   0.02405   0.04552    0.04952    0.04966    0.05000
$(40,\ 20)$   0.01655   0.03180    0.03563    0.03526    0.03538

Table 1: Simulation results for the standard deviation $\sigma$ of the MLE.

Figure 1: Standard deviation of the MLE estimate in comparison with the theoretical lower bound.

The solid curves in Figure 1 show the theoretical lower bounds for the standard deviation $\sigma_{min}$. They are plotted according to (6) as a function of the network size. The simulation results for the sample deviation $\sigma$ are shown as points at fixed network sizes. The results confirm the efficiency of the MLE estimate. For the worst case ($k=8$, $N=10$, $n=25$), the MLE deviation (10%) differs by less than 1% from the lower bound. As $k$ and $N$ increase, the estimate precision and the lower bound become practically indistinguishable. For the combination ($k=20$, $N=40$, $n=25$) the MLE deviation (1.7%) differs by less than 0.01% from the lower bound.

7. Conclusion

The optimal MLE estimator differs from the estimator obtained in [2] in two significant aspects:

  1. Only the maximal distances (i.e., statistics of order $k$) are used from the results of $N$ $k$-lookups (sets of nodes closest to random targets). The statistics of lower orders are irrelevant to the estimate. We can observe the minimax principle here.
  2. The logarithms of normalised distances $\ln(1-u_i)$ (not the distances $u_i$ directly) are used in the calculation of the sample mean that aggregates the observations.

In practice, when the network size $n$ is much larger than $k$, the normalised variance of the efficient MLE is mainly determined by the values of $N$ and $k$ (it is inversely proportional to them), rather than by $n$. The maximum of the CRLB is reached at large $n$, i.e., the minimax principle is observed again.

The results of statistical simulation confirm the efficiency of the MLE estimate. For $k = 8, 20$, $N \ge 10$ and $n > 100$, the variance of the MLE and the theoretical lower bound are practically indistinguishable.

8. Appendix. Joint PDF of the first k order statistics from a sample of size n

We derive formula (1) for the joint PDF of the first $k$ order statistics, $U_{(1)}, U_{(2)}, \dots, U_{(k)}$, from a random sample $U_1, U_2, \dots, U_n$ of size $n$ drawn from the standard uniform distribution on the interval $(0, 1)$.

We aim to find the probability $P(u_1 < U_{(1)} < u_1 + du_1, \dots, u_k < U_{(k)} < u_k + du_k)$ for infinitesimal intervals. This event occurs if exactly one sample falls into each of the $k$ intervals $(u_i, u_i + du_i)$, and the remaining $n-k$ samples are greater than $u_k$.

The $n$ samples must be distributed into $k+1$ categories: $k$ specific small intervals and one large interval $(u_k, 1)$. The number of distinct ways to arrange the samples into these categories is given by the multinomial coefficient:

\binom{n}{1, \dots, 1, n-k} = \frac{n!}{1! \dots 1! \, (n-k)!} = \frac{n!}{(n-k)!}

Let $f(u) = 1$ for $u \in (0, 1)$ be the standard uniform PDF. Then the probability of a single observation falling into an interval $(u_i, u_i + du_i)$ is $f(u_i)\,du_i = 1 \cdot du_i$, and the probability of an observation falling into the interval $(u_k, 1)$ is $\int_{u_k}^{1} f(u)\,du = 1 - u_k$. So, the probability of one specific arrangement is the product of these probabilities:

du_1 \cdot du_2 \dots du_k \cdot (1 - u_k)^{n-k}

The joint probability is the total number of arrangements multiplied by the probability of a single arrangement. The joint PDF is the coefficient of the volume element $du_1 \dots du_k$:

\begin{gather*}
f_{U_{(1)}, \dots, U_{(k)}}(u_1, \dots, u_k) = \frac{n!}{(n-k)!} (1 - u_k)^{n-k}, \\
\text{for} \quad 0 < u_1 < \dots < u_k < 1
\end{gather*}
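As a numerical sanity check of this result: integrating (1) over $u_1 < \dots < u_{k-1}$ gives the Beta$(k,\, n-k+1)$ marginal density for $U_{(k)}$, whose mean $k/(n+1)$ is easy to compare against simulated order statistics. A sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 100, 8, 50_000

# Empirical k-th order statistic of n uniforms on (0, 1).
u_k = np.sort(rng.random((trials, n)), axis=1)[:, k - 1]

# Mean of the Beta(k, n - k + 1) marginal implied by (1).
print(u_k.mean(), k / (n + 1))  # both ~ 0.079
```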

9. References

1. Petar Maymounkov, David MaziΓ¨res. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. πŸ”—

2. Eli Sohl. A New Method for Estimating P2P Network Size. πŸ”—

3. Wikipedia. Method of moments (statistics). πŸ”—

4. Wikipedia. Maximum likelihood estimation. πŸ”—

5. Wikipedia. CramΓ©r–Rao bound. πŸ”—

6. Wikipedia. Order statistic. πŸ”—

7. Wikipedia. Trigamma function. πŸ”—

8. Iurii Kyrylenko. Statistical simulation of the MLE estimator for Kademlia network size. πŸ”—