Heuristic Algorithm for the Univariate Stratification Problem

In sampling theory, stratification is a technique used in surveys to segment a population into homogeneous subpopulations (strata) in order to produce statistics with a higher level of precision. This article proposes a heuristic for the univariate stratification problem, which is widely studied in the literature. One of its versions fixes the number of strata and the precision level and seeks the boundaries that define the strata so as to minimize the sample size allocated to them. To achieve this goal, a heuristic combining a stochastic optimization method and an exact optimization method was developed. Its performance was evaluated through computational experiments on several populations used in other works in the literature, across 20 scenarios that combine different numbers of strata and precision levels. The analysis of the results shows that the heuristic outperformed four algorithms from the literature in more than 94% of the cases, in particular the well-known Kozak and Lavallée-Hidiroglou algorithms.


Introduction
Official statistical institutes frequently make available a wide range of data and statistics of great relevance to governments, for example for the planning and implementation of public policies, as well as to citizens willing to follow them. For the most part, these data are produced from surveys based on probabilistic sampling [40].
According to Lohr [40] and Cochran [14], sampling techniques allow statistics of interest to be collected and produced in less time, at lower cost, and with greater accuracy. However, when administrative issues are considered and a minimum precision is required for the statistics produced, it is necessary to incorporate stratification into the planning/design phase of the sample in search of the least possible variability. Stratified sampling is often used together with other techniques such as simple random sampling and cluster sampling [40].
When solving the stratification problem (univariate in this article), the N observations of a variable X (the stratification variable), known for the entire population, are distributed among L population strata by determining L - 1 cutoff points, called strata boundaries. To determine such cutoff points, one of the following objective functions should be considered: (i) fixing a sample of size n, minimize a variance expression (or the coefficient of variation) associated with X; or (ii) fixing a level of precision (the target coefficient of variation) associated with X, minimize the total sample size n that will be allocated to the L strata.
Determining the cutoff points under one of these objectives (versions) is equivalent to solving a nonlinear integer optimization problem that is computationally difficult. Therefore, in recent decades, several methods have been proposed, basically classified as approximation or optimization methods [29], and most of them consider objective (i).
This article considers a hybrid approach that first identifies the cutoff points of the strata through a stochastic optimization method and then, with the obtained cutoff points, applies the exact method proposed by Brito et al. [9] to perform the optimal allocation of the sample to the strata. More specifically, we propose a heuristic algorithm based on the Biased Random-Key Genetic Algorithm (BRKGA) metaheuristic [25], which we call STRATMH, for the univariate stratification problem with objective function (ii) described above. It is essential to underline that the proposed heuristic differs from the classical BRKGA, particularly concerning the representation and construction of solutions, which are not encoded, dispensing with (de)coding procedures. More specifically, each chromosome produced in the initial generation, and by the crossover and mutation operators, corresponds directly to a solution of the stratification problem: the strata sizes. Another distinctive feature is the crossover operator, designed specifically for the problem, which has the effect of a local search.
Adopting the exact allocation method distinguishes this approach from the other approaches proposed in the literature for the stratification problem, in which the sample sizes allocated to the strata are calculated using classic allocation methods that do not consider an optimization criterion. As a result, such methods generally produce non-integer sample sizes, which are rounded a posteriori, yielding sample sizes that correspond to a local optimum or even an infeasible solution (one that does not meet the precision constraint on the coefficient of variation). The exact approach guarantees the global optimum for the allocation problem and thus the lowest cost for carrying out the survey, since there is a direct relationship between the number of units that must be interviewed in the sample and that cost.
The article is organized as follows: Section 2 presents the univariate stratification problem, considering the two possible objectives, as well as a literature review of the main methods applied to its resolution. Section 3 brings a discretization proposal for the problem and the proposed heuristic. Finally, Section 4 presents the results and analyses, considering a set of 31 populations and 20 different scenarios with respect to the number of strata and precision levels.

Univariate stratification problem
Consider a population with N units, defined from a set U = {1, 2, . . ., N}, and the stratification variable X. Then, to obtain a good estimate of the total of X, and of a set of variables investigated in the survey, the population of total size N is divided into L subpopulations with N_h units each (where h = 1, . . ., L and N_1 + ... + N_L = N), called population strata [40] and denoted by U_1, U_2, . . ., U_L.
When solving the univariate stratification problem, one must determine the cutoff points b_1, b_2, . . ., b_{L-1} and the values of n_h (where n_1 + ... + n_L = n and 2 <= n_h <= N_h for all h = 1, . . ., L), which correspond to the sample allocation, that is, the size of the sample to be independently selected in each stratum, either to obtain the minimum variance (or coefficient of variation) given a fixed sample size n, which corresponds to solving Problem I, or to obtain the minimum sample size for a fixed precision level cvt, which corresponds to solving Problem II. Given the cutoff points, stratum h is formed by the units i such that b_{h-1} < X_i <= b_h (with b_0 = min(X) and b_L = max(X)), and N_h, Xbar_h, and S^2_h denote, respectively, the size, mean, and variance of stratum h (equations (1)-(3)). Problems I and II are defined as follows:

Problem I

Minimize   V(That_X) = sum_{h=1..L} N_h^2 (1/n_h - 1/N_h) S^2_h    (4)

Subject to:

n_1 + n_2 + ... + n_L = n    (5)

2 <= n_h <= N_h,  h = 1, . . ., L    (6)

Problem II

Minimize   n = n_1 + n_2 + ... + n_L    (8)

Subject to:

cv(That_X) = sqrt(V(That_X)) / T_X <= cvt    (9)

2 <= n_h <= N_h,  h = 1, . . ., L    (10)

n_h integer,  h = 1, . . ., L    (11)

It is necessary to determine the cutoff points and the sample sizes n_h (decision variables) that minimize the respective objective functions and satisfy the constraints of these problems in order to obtain the optimum of Problem (I) or (II). The nonlinearity of Problem (I) is present in the objective function (4), and the nonlinearity of Problem (II) is associated with the precision-level constraint (9), which guarantees that the resulting coefficient of variation of the total estimator, cv(That_X), is equal to or less than the target coefficient of variation cvt.
In several approaches proposed in the literature for solving (I) and (II), after determining the strata it is common to apply the Neyman allocation (12) [40] to determine the sample sizes of Problem (I):

n_h = n * N_h S_h / sum_{l=1..L} N_l S_l    (12)

Equation (13), used in Problem (II), is obtained by substituting (12) into the left-hand side of equation (9) and performing algebraic manipulations:

n = (sum_{h=1..L} N_h S_h)^2 / (cvt^2 T_X^2 + sum_{h=1..L} N_h S^2_h)    (13)
When using this allocation or other allocation methods from the literature [4,14], non-integer sample sizes n_h are generally obtained, and a rounding procedure must be applied to ensure that the integrality constraint is satisfied. The issue is that such rounding can produce sample sizes that do not meet constraints (5) and (6), or (9) and (10), which are mandatory.
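To make the role of equations (12) and (13) concrete, the following is a minimal Python sketch (not the authors' R implementation) that computes the minimum sample size under a target cv and the Neyman shares, assuming simple random sampling without replacement within strata and hypothetical stratum data; the final ceiling illustrates the rounding that can break optimality or feasibility.

```python
import math

def neyman_min_n(N_h, S_h, total_X, cvt):
    """Minimum sample size n for a target cv (eq. 13, under Neyman
    allocation), then the per-stratum Neyman shares (eq. 12)."""
    a = sum(N * S for N, S in zip(N_h, S_h))        # sum of N_h * S_h
    b = sum(N * S**2 for N, S in zip(N_h, S_h))     # sum of N_h * S_h^2
    n = a**2 / (cvt**2 * total_X**2 + b)            # eq. (13)
    shares = [n * N * S / a for N, S in zip(N_h, S_h)]  # eq. (12), non-integer
    return n, [math.ceil(x) for x in shares]        # a posteriori rounding

# hypothetical strata: sizes, standard deviations, population total of X
n, n_h = neyman_min_n([200, 120, 80], [5.0, 12.0, 40.0], 60000.0, cvt=0.01)
```

Note that the rounded sizes sum to more than the continuous minimum n, which is exactly the kind of gap the exact allocation method of [9] closes.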
To address this issue, the heuristic proposed in Section 3 for solving Problem (II) applies, in the phase of allocating sample sizes to strata, the exact method proposed by Brito et al. [9]. This method guarantees the global optimum of the sample allocation: more specifically, it finds integer values n_h (h = 1, . . ., L) such that the objective function of (II) is minimized and constraints (9)-(11) are fulfilled. Adopting this method in the second stage of the heuristic guarantees that, for each stratification obtained, the sample size allocated to the strata is the smallest possible, i.e., the global optimum of the allocation stage. This is a differentiating point with respect to the various approaches proposed in the literature, which use classic allocation methods that do not consider an optimization criterion and produce non-integer sample sizes that need to be rounded, losing the global optimum. Thus, the heuristic solutions meet the precision constraint and ensure lower survey costs related to the number of units that must be interviewed in the sample.

Literature review
Due to the relevance of the univariate stratification problem and its computational complexity, several algorithms have been proposed in the literature, classified as approximate or optimization algorithms according to [29]. Approximate algorithms are simple to implement and require few computational operations, but they are not focused on searching for the optimal cutoff points that, together with the sample size, minimize the objective function of Problem (I) or (II), which implies low stratification efficiency.
Optimization-based algorithms, on the other hand, often make it possible to determine solutions of superior quality compared to those obtained by approximate methods, although they demand intensive computation. In [19], the authors review several methods from 1950 to the present. Next, we present other references associated with the main stratification methods.
Regarding the class of algorithms that deal with Problem (I), [16] presents an algorithm based on the Dalenius-Hodges rule. In [28], considering an extension of Ekman's rule [22], the strata are determined so that the variance of the stratification variable is minimal, with n and L pre-fixed and application of the Neyman allocation. In [27], a simple and easy-to-implement algorithm is proposed, which uses the general term formula of a geometric progression to determine the strata limits for asymmetric populations.
In [49], an algorithm based on the modified Newton method is proposed, which determines the strata limits and produces a local optimum of the variance expression, based on the Neyman allocation and the choice of a good starting point. In [37], the authors evaluate the algorithms proposed in [27,35,38]. Finally, in [32], under the hypothesis that X has a normal or triangular distribution and that sampling is done with replacement, an algorithm is proposed that determines the strata cutoff points using the Dynamic Programming technique. This algorithm can be applied to minimize a variance expression under uniform, proportional, or Neyman allocation.
In [8], an exact algorithm is proposed, based on concepts of graph theory, the minimization of the variance expression, and the application of proportional allocation. In [36], a comparative study between the algorithms proposed in [35,37] and the algorithm proposed in [30] is presented. In [29], the authors compare optimization and approximate methods, considering the stratification of asymmetric populations. Finally, in [33,34], algorithms based on Dynamic Programming are proposed that simultaneously perform stratification and sample allocation.
In [42], under the hypothesis that the stratification variable has a Weibull distribution, the stratification problem is solved using Dynamic Programming and Neyman allocation. In [30], a genetic algorithm that minimizes the variance is proposed, based on four possible types of sample allocation. In [17,18], an approach based on dynamic programming for the two-variable stratification problem uses proportional and Neyman allocations. Finally, in [20], the bivariate stratification problem is also solved via dynamic programming, evaluating different probability distribution functions.
In [7], an algorithm based on the ILS (Iterated Local Search) method and Neyman allocation is proposed. In [10,11], algorithms are proposed based, respectively, on the BRKGA (Biased Random-Key Genetic Algorithm) and GRASP (Greedy Randomized Adaptive Search Procedure) methods, using the exact method proposed in Brito et al. [9] in the allocation phase.
Regarding the works that address Problem (II), in [38] an algorithm for the stratification of asymmetric populations was proposed that uses power allocation [4] to determine the sample size. In [45], a generalization of the algorithm presented in [38] is proposed, which incorporates two models that account for the discrepancy between the stratification variable and a survey study variable. In [35], an algorithm called Random Search is proposed, which uses optimization concepts to determine the strata cutoff points along with the Neyman allocation. In [1], several methods based on Kozak's [35] and Sethi's [47] algorithms are compared, concluding that the former, with initial strata of equal size, is more reliable and efficient than the other approaches considered; and [2] introduces the R package stratification, which implements the methods used in [1], in addition to allowing stratum-specific anticipated non-response and a model for the relationship between the stratification and survey variables.
In [39], an algorithm is proposed that solves the multivariate stratification problem using a penalized objective function, which is optimized by applying the Simulated Annealing method. In [12], the authors propose a brute-force algorithm, limited to small populations (small N), and an algorithm based on the VNS (Variable Neighborhood Search) method; in the sample allocation step, both use the exact method proposed by Brito et al. [9]. As a result, the authors developed the R package stratvns. In [3], a genetic algorithm is proposed to solve the multivariate stratification problem, with possible application to the univariate case, which originated the R package SamplingStrata. In [43], the stratifyR package is presented, implementing a dynamic programming algorithm proposed by Khan et al. [31-33] to determine the optimal strata boundaries and optimal sample sizes for the univariate problem. The stratification package [2] implements a generalized Lavallée-Hidiroglou method [38] of strata construction using the Kozak or Sethi algorithm.
Combining metaheuristics with other heuristics or exact methods has become a widely used approach in optimization; a comprehensive survey on the topic is presented in [5]. More specifically, the strategy of using exact methods to improve genetic algorithms has proven successful in several applications, such as [6], where a branch-and-bound procedure is probabilistically integrated into a genetic algorithm to enhance the solutions obtained for a variant of the permutation flow shop scheduling problem. A hybrid metaheuristic combining a genetic algorithm and linear programming to address the simultaneous order acceptance and scheduling problem is proposed in [44], achieving near-optimal solutions. In [15], multi-objective genetic algorithms are combined with iterated linear programming to improve computational efficiency by dividing the search space into sub-spaces for discrete and continuous variables, demonstrated through an example of optimal control valve placement in water distribution networks. In [26], a BRKGA is combined with a novel placement strategy and linear programming to optimize the unequal-area facility layout problem, achieving improved solutions compared to other benchmark approaches. The time-dependent vehicle routing and scheduling problem with CO2 emissions optimization is addressed in [50], combining a genetic algorithm and dynamic programming to improve the efficiency and effectiveness of vehicle scheduling plans with lower CO2 emissions. Finally, in [21], a linear-programming-assisted genetic algorithm is applied to the flexible job shop lot streaming problem, where the algorithm searches over discrete and continuous variables and periodically uses linear programming to refine promising solutions, resulting in superior performance compared to embedded and parallel genetic algorithms.
It is worth mentioning that the R packages stratification [2], stratvns [12], and SamplingStrata [3] were used in the numerical experiments section, and their results were compared with those of the proposed BRKGA-based heuristic.

Discretization
Before detailing the proposed heuristic, we present the discretization also considered in [8,10], which is used as a basis for representing the solutions produced by the heuristic. The discretization consists of considering as feasible candidates for cutoff points only the values available in X, without repetition.
As defined in Section 2, the problem addressed in this article corresponds, in its original form, to an integer nonlinear optimization problem with a linear objective function and a nonlinear constraint associated with the coefficient of variation, which is a function of the (integer) sample sizes allocated to the strata and of the cut points used to define the L strata. Initially, without considering the discretization of the problem, any real values in the interval [min(X), max(X)] are, theoretically, candidates for the L - 1 cutoff points.
That is, in principle, there are infinitely many values in this interval that can be used for population stratification. However, depending on the distribution of the variable X, many subsets of real values chosen as cutoff points can produce strata with only one observation, or even empty ones, considering equations (1)-(3) (Sect. 2).
Additionally, due to the way strata are defined, it is natural to use as cutoff points a subset of the X values, which allows determining, as explained below, the exact maximum number of solutions of the problem that must be evaluated and, if all these solutions are enumerated, the global optimal solution, something that cannot be guaranteed, in principle, using arbitrary cutoff points in [min(X), max(X)]. Another advantage of this discretization is that it ensures the feasibility of the generated solutions. Finally, discretization allows the application of approaches based on metaheuristics, which are notably known for their ability to produce solutions of reasonable quality for combinatorial optimization problems at low computational cost.
From the definition of the stratification problem and equations (1)-(3) presented in Section 2, it is possible, initially, to take the values of the vector X and gather them into a set Q containing only the distinct values.
Then, from Q, it is possible to determine a possible partition of X into the L population strata, more specifically, how many and which values of Q are in each stratum. Under the basic and necessary assumption (associated with constraints (6) and (10)) that |Q| >= 2L, which guarantees at least one solution to the problem, this corresponds to determining a non-negative integer solution that satisfies constraints (14) and (15):

K_1 + K_2 + ... + K_L = |Q|    (14)

K_h >= 2,  h = 1, . . ., L    (15)

In these constraints, each K_h corresponds to the number of distinct observations of X that are in the h-th stratum. Constraint (14) guarantees that the totals K_h over the strata sum to |Q|, and constraint (15) implies that each stratum has at least two distinct observations.
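The number of solutions of (14)-(15) is finite, which is what makes full enumeration conceivable for small instances. As a sketch (with hypothetical small values of |Q| and L), the brute-force count below agrees with the closed form C(|Q| - L - 1, L - 1) for compositions with parts of size at least 2.

```python
from itertools import product
from math import comb

def count_partitions(q, L):
    """Brute-force count of integer vectors (K_1, ..., K_L) with
    K_1 + ... + K_L = q and every K_h >= 2 (constraints (14)-(15))."""
    return sum(1 for K in product(range(2, q + 1), repeat=L) if sum(K) == q)

# e.g. |Q| = 15 distinct values split into L = 4 strata
n_solutions = count_partitions(15, 4)
```

For |Q| = 15 and L = 4 this gives C(10, 3) = 120 candidate stratifications, a number that grows combinatorially with |Q| and L, motivating the heuristic search.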
From each solution (K_1, . . ., K_L) that satisfies (14) and (15), the corresponding values of N_h, Xbar_h, and S^2_h are obtained. Then, using equations (13) and (12), the (already rounded) values of n_h are determined. In the next step, the values of N_h, n_h, and S^2_h are used to calculate the objective function of Problem (II) and to assess whether constraints (9) and (10) are satisfied.
Due to the complexity of the stratification problem and the impossibility of applying the brute-force algorithm to stratify populations with a larger number of observations, in this section we present a heuristic for the stratification problem corresponding to the optimization model defined in Problem (II). The proposed heuristic is based on the discretization above and on the application of BRKGA principles.

BRKGA
The biased random-key genetic algorithm (BRKGA) is a stochastic optimization method, also known as a metaheuristic [25,41], that has been applied to several optimization problems, for example in [23,24]. It has many similarities with genetic algorithms (GAs), which are a particular class of evolutionary algorithms [41].
In each generation (iteration) of the BRKGA, p solutions (vectors) associated with the optimization problem at hand, corresponding to the BRKGA population, are produced, evaluated (via the objective function), and updated. Furthermore, to produce new solutions during the BRKGA generations while ensuring diversity and quality, procedures called selection, crossover, and mutation operators are applied in each generation of this method.
To apply the selection and crossover operators, it is necessary to calculate, at each generation, the value of the objective function for each of the p solutions (current population), ordering them increasingly (minimization problem) according to the objective function value. Then, considering this ordering, the best p_e solutions are allocated to a set of elite solutions denoted by P_E. The remaining (p - p_e) solutions are allocated to a set of non-elite solutions P_N.
The selection operator ensures that the best solutions (i.e., those with the smallest objective function values) are copied into the population across generations. More specifically, the p_e vectors of P_E are copied into the population considered in the next generation. The other solutions of this new population are obtained by applying the crossover and mutation operators.
The p_m solutions produced by mutation are also copied into the population evaluated in the following generation; these solutions are generated analogously to the first generation of the BRKGA. Finally, to complete the population evaluated in the next generation, (p - p_e - p_m) solutions are produced by applying the crossover operator between a solution of the elite set and a solution of the non-elite set, both chosen at random. Notice that this procedure tends to carry some good characteristics of the elite set over to the solutions of the next generation. Figure 1 illustrates two consecutive generations of the BRKGA with the application of these operators, which is carried out for a fixed number of generations (maxGEN). The best solution is returned at the end of all generations. For detailed information about this method, see [25,41].
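The population update just described can be sketched as a generic BRKGA-style skeleton. The code below is an illustration only, not the authors' implementation: the objective, the random-solution builder, and the position-wise gene pick (a stand-in for a biased crossover) are all hypothetical.

```python
import random

def next_generation(pop, objective, build_solution, pe, pm):
    """One BRKGA-style generation: keep the pe best solutions (elite),
    add pm fresh mutants, and fill the remaining slots by crossing a
    random elite solution with a random non-elite one."""
    ranked = sorted(pop, key=objective)            # minimization problem
    elite, non_elite = ranked[:pe], ranked[pe:]
    children = []
    while len(children) < len(pop) - pe - pm:
        a, b = random.choice(elite), random.choice(non_elite)
        # stand-in crossover: pick each gene from one of the two parents
        children.append([random.choice(genes) for genes in zip(a, b)])
    mutants = [build_solution() for _ in range(pm)]
    return elite + mutants + children

# toy usage: minimize the sum of a 5-element integer vector
def build():
    return [random.randint(0, 9) for _ in range(5)]

pop = [build() for _ in range(20)]
for _ in range(50):
    pop = next_generation(pop, sum, build, pe=4, pm=4)
best = min(pop, key=sum)
```

Because the elite set is copied intact, the best objective value never worsens from one generation to the next, which is the elitism property the text refers to.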

STRATMH heuristic
Aiming to find solutions of reasonable quality for the stratification problem associated with optimization Problem (II), we present a heuristic algorithm based on the BRKGA method and on the discretization presented in Section 3.1. The proposed heuristic, called STRATMH, differs from the BRKGA in the representation and construction of the solutions that make up the population evaluated in each generation and in the crossover operator.
It is worth mentioning that the way a population is carried from one generation to the next, as described in Section 3.2 and represented in Figure 1, is the main difference between the heuristic developed here and the genetic algorithm paradigm as implemented in Ballin and Barcaroli [3].

Representation and construction of solutions
Each solution built by the heuristic is represented by a vector s with L values K_h, corresponding to the number of distinct observations of X in the h-th stratum, such that constraints (14) and (15), presented in Section 3.1 and repeated here as (19) and (20), are fulfilled. In order to ensure that every solution built satisfies these constraints, the following procedure is applied: (i) fix the minimum of two observations per stratum; (ii) for h = 1, . . ., L - 1, randomly select the number of observations of the h-th stratum, that is, choose a random value K_h between 2 and the largest value that still leaves at least two observations for each remaining stratum; (iii) assign the remaining observations to the last stratum, K_L = |Q| - (K_1 + ... + K_{L-1}). In this procedure, except for the last stratum, the number chosen in each stratum is defined by a random selection considering the (common) minimum and the maximum value per stratum. It is possible to verify that steps (i), (ii), and (iii) guarantee the fulfillment of (19) and (20). Constraint (19) is satisfied by the definition of step (iii). As for the constraints in (20), by the definition of (i) and the recurrence relation in (ii), each randomly generated K_h (h = 1, . . ., L - 1) is greater than or equal to 2; it remains to verify that K_L >= 2 also holds. Since K_h >= 2 (h = 1, . . ., L - 1) and each K_h is bounded above so as to reserve two observations for every remaining stratum, considering the hypothesis that |Q| >= 2L (as per Sect. 3.1), it follows that K_L >= 2. Table 1 presents two feasible solutions with respect to (19) and (20), produced by applying the procedure above with L = 4 and |Q| = 15.
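Steps (i)-(iii) above can be sketched as follows; this is a minimal Python illustration (the package is implemented in R), assuming |Q| >= 2L, in which each K_h except the last is drawn uniformly between 2 and the largest value that still reserves two observations for every remaining stratum.

```python
import random

def build_solution(q, L):
    """Build a random vector (K_1, ..., K_L) with sum q and each K_h >= 2,
    following steps (i)-(iii) of the construction procedure."""
    assert q >= 2 * L, "need at least two distinct values per stratum"
    K, remaining = [], q
    for h in range(L - 1):
        strata_left = L - 1 - h                # strata still unfilled after this one
        upper = remaining - 2 * strata_left    # reserve >= 2 for each of them
        K.append(random.randint(2, upper))     # step (ii)
        remaining = q - sum(K)
    K.append(remaining)                        # step (iii): K_L >= 2 by construction
    return K

# one feasible solution for |Q| = 15 and L = 4, as in Table 1
s = build_solution(15, 4)
```

Every vector produced this way is feasible by construction, so no repair step is needed before evaluating the objective function.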
From this procedure, each vector s generated corresponds to a possible stratification of X. More specifically, for each vector s, the values of N_h and S^2_h are calculated and used in equation (13), together with the target cv (cvt), to obtain the value of n, the objective function of the optimization model defined in Problem II. This procedure is used in the first generation (iteration) of the heuristic to produce the p initial vectors s (stratifications) and, as a mutation operator, to produce in the other generations a quantity p_m of new solutions, that is, vectors with possible stratifications.

Crossover and mutation operators
In each generation, after calculating the objective function (sample size) for each of the p vectors s, these vectors are arranged in non-decreasing order. Then, an elite set denoted by P_E is defined, consisting of the vectors corresponding to the p_e smallest sample sizes, with the (p - p_e) remaining vectors associated with the non-elite set denoted by P_N. Finally, following the same principle as the BRKGA, the elite set vectors, corresponding to the best stratifications, are added to the next generation's population, together with p_m mutant vectors corresponding to new stratifications obtained by applying the construction procedure described in Section 3.3.1.
In order to obtain the (p - p_e - p_m) remaining solutions evaluated in the next generation, a crossover operator is applied to each pair (s_E, s_N) obtained by randomly selecting a vector s_E in P_E and a vector s_N in P_N. As each crossed pair produces two new vectors (stratifications), this operator must be applied (p - p_e - p_m)/2 times.
The crossover exchanges the values of each of the L corresponding positions of the chromosomes s_E and s_N, redistributing the absolute difference d_j = |s_E[j] - s_N[j]| (j = 1, . . ., L) between these values proportionally to the values of s_E and s_N associated with the other strata, adding or subtracting that difference over the other positions of the vector. Each exchange and redistribution produces two new stratifications, for which the value of the objective function is calculated. At the end of the L exchanges, the two best solutions produced (those with the smallest sample sizes), among the 2L solutions, are added to the set formed by the (p - p_e - p_m) crossover solutions.
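One exchange-and-redistribution step might look like the sketch below; this is an interpretation with hypothetical values, not the authors' exact update expressions. Position j is swapped between the parents, the resulting surplus or deficit is spread proportionally over the other positions, and a final adjustment absorbs rounding drift so each child still sums to |Q|.

```python
def exchange(parent_a, parent_b, j):
    """Swap position j between two stratifications and redistribute the
    resulting difference proportionally over the other positions, keeping
    each child's total equal to the parent's total (a sketch of one
    crossover iteration; the max(2, .) floor preserves constraint (20))."""
    def child(src, new_j):
        total = sum(src)
        d = new_j - src[j]                      # surplus (>0) or deficit (<0) at j
        others = sum(v for i, v in enumerate(src) if i != j)
        c = []
        for i, v in enumerate(src):
            if i == j:
                c.append(new_j)
            else:                               # spread -d proportionally to v
                c.append(max(2, round(v - d * v / others)))
        c[-1 if j != len(c) - 1 else 0] += total - sum(c)  # fix rounding drift
        return c
    return child(parent_a, parent_b[j]), child(parent_b, parent_a[j])

# hypothetical parents with |Q| = 300 and L = 3, exchanging position 0
c1, c2 = exchange([100, 170, 30], [87, 180, 33], 0)
```

Repeating this for every position j and keeping the two best of the 2L children reproduces the local-search effect described in the text.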
The crossover operator is carried out in L iterations; in the j-th iteration, the values of s_E and s_N in the j-th position are exchanged, and the values in the other positions of these vectors must be updated so that the sums of the elements of s_E and s_N remain equal to |Q|, guaranteeing the feasibility of the solutions. More specifically, each exchange and update produces new child chromosomes, denoted by c_1 and c_2. Such exchanges imply a change in the number of observations in the strata, that is, new stratifications of the population defined from the stratification variable X. Now, define sgn = +1 if s_E[j] - s_N[j] > 0, and sgn = -1 otherwise. In each change, always based on the elements of s_E and s_N in the j-th position, the remaining elements of the child vectors (all positions i different from j) are updated by rounding, adding or subtracting to each one its proportional share of the difference d_j according to the sign sgn.
Thus, in the example of Figure 2, the second exchange produces c_1 = {67, 200, 33} and c_2 = {120, 150, 30}, and analogous operations are performed in the third iteration. Figure 2 summarizes the application of the crossover operator to this example, and Algorithm 2 in Appendix A presents the pseudocode that implements all steps of this operator. Regarding Algorithm 1: in lines 1 and 2, the percentages of elite solutions and mutant solutions are defined. In line 3, the p initial solutions constructed are stored in a matrix M. In each generation, between lines 7 and 20, new solutions are produced by evaluating the objective function (calculation of the sample size) and applying the crossover and mutation operators. In line 9, the corresponding values of N_h and S^2_h are calculated for each row of M (vector s). Then, using equations (13) and (12), n and n_h are calculated and rounded. If the values of n_h do not satisfy constraints (9) and (10) of Problem (II), configuring an infeasibility, the exact method proposed in [9] is applied to obtain integer n and n_h; otherwise, the rounded values are kept.
Algorithm 1, STRATMH(L, cvt, p, PE, PM, maxGEN), calculates, for each solution of M, the values of N_h and S^2_h, as well as n_h and n (using (12), (13), and the exact method). The exact method takes as input the values of N_h, S^2_h, and cvt, producing sample sizes n_h (h = 1, . . ., L) that simultaneously minimize the objective function of Problem (II) and meet its constraints. It is applied only in the case of infeasibility and to the best solution produced in the last generation of the heuristic, because applying it to every solution would significantly increase the computational cost.
To define the sets P_E and P_N (lines 11 and 16), the rows of M are ordered by the values of n obtained for each stratification. Between lines 12 and 15, the best solution to the problem is updated. Then, to produce the new set of solutions evaluated in the population (matrix M) of the next generation, the crossover and mutation operators (lines 17 and 18) are used. Finally, considering the elite solutions (line 11) and the solutions M' (mutation) and M'' (crossover), the new population is defined (line 19).
The heuristic runs until the maximum number of generations (maxGEN) is reached or until there is no reduction in the value of the objective function (sample size) for a pre-defined number of generations (30% of the total generations, as identified in preliminary experiments). Finally, in line 21, the exact method proposed in [9] is applied to the best solution obtained at the end of all generations.

Numerical experiments
This section presents the computational results in which the STRATMH algorithm and four algorithms from the literature were evaluated: (i) the generalization of the Lavallée and Hidiroglou method [45] of strata construction, which can be used with Sethi's algorithm (LHSE) [47] or (ii) with Kozak's algorithm [35] (LHKO); (iii) the VNS algorithm [12]; and (iv) the genetic algorithm proposed by Ballin and Barcaroli (BB) [3]. These four algorithms are implemented and available in R packages, and the STRATMH algorithm, implemented in R, is available on GitHub (StratMH function, available in the StratMH package [13]), as shown in Table 2. In addition, Appendix B brings examples of the use of STRATMH.
The remainder of this section provides information about the populations used in the experiments, and Section 4.1 presents the results and analysis of the experiments. In the experiments reported in this section, the 31 populations described in Table 3 were used, 28 of them from the literature [3, 10-12, 37] and available in R packages: GA4Stratification (although removed from CRAN, it can be found at: https://cran.r-project.org/src/contrib/Archive/GA4Stratification/), stratification [2], stratifyR [43], and sampling (available on CRAN). In addition, the authors of this article generated three new populations using functions from the EnvStats package.
Table 3 presents information on the populations, and Table 4 presents some associated descriptive statistics, such as name, population size (N), the cardinality of the set of values of the variable X, the minimum and maximum values of X, and the value of its asymmetry (skewness) coefficient (AS).

Computational results
The experiments with the five algorithms were carried out in RStudio, on a computer with 16 GB of RAM and an AMD FX-6300 3.5 GHz processor. Regarding the functions presented in Table 2, associated with the four algorithms from the literature, the default values of their parameters were used.
For the parameters of the STRATMH heuristic, a preliminary calibration experiment was carried out, based on the analyses and recommendations made in [25] and taking as a reference a set of candidate values for each parameter, as shown in Table 5. Such an experiment is essential, since the values chosen for the parameters are a factor that contributes significantly to the good performance of any algorithm based on a stochastic optimization method.
To determine the best combination of the parameters of the proposed heuristic, the seven populations marked with an asterisk in Table 4 were used. The heuristic was run 10 times to stratify each of these populations, for each L ∈ {3, 4, 5, 6, 7}, for cv_t = 5% and cv_t = 10%, and for 320 combinations of the parameters p, maxGEN, p_E and p_M (Tab. 5), making 32 000 executions per population (number of strata × target cv × combinations of parameters × number of executions), with the final value of the objective function, that is, the smallest sample size, stored for each execution. In the first stage of the experiment, for each population, number of strata and target cv, the best solution was the one associated with the smallest sample size obtained in the 3200 runs (combinations of parameters × 10 trials). Then, for each combination, the ratio between the number of runs in which that combination produced a sample size equal to the best solution and the total number of runs of the heuristic using that combination (in this case, 10) was calculated. The closer this ratio is to 1, the more often the combination produced the best solution.
Then, these proportions were ordered decreasingly for each population × number of strata × target cv (7 × 5 × 2), and, for each population, number of strata (L) and target cv (cv_t), the combinations of parameters associated with the first five proportions in this ordering were taken, making a list of 350 combinations. Finally, the most frequent combination in this list was chosen as the winning combination: p = 50, maxGEN = 50, p_E = 0.3 and p_M = 15. Considering this combination for the STRATMH algorithm and the default parameters of the four algorithms from the literature, the five algorithms were applied to each of the 31 populations (Tab. 4), evaluating 20 different scenarios that combine numbers of strata (L) between 3 and 7 with target cv's (cv_t) of 3%, 5%, 7.5%, and 10%, making a total of 620 solutions. In addition, the sample sizes and processing times were recorded for comparison and analysis of the algorithms.
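The selection of the winning combination can be sketched as follows (a Python illustration, assuming that for each population × L × cv_t case the proportion of runs hitting the best solution is known per parameter combination; the names are hypothetical):

```python
from collections import Counter

def winning_combination(proportions, top_k=5):
    """proportions: dict mapping (population, L, cv_t) -> list of
    (combination, proportion) pairs. For each case, keep the top_k
    combinations by proportion; return the combination appearing
    most often across all cases."""
    tally = Counter()
    for combos in proportions.values():
        best = sorted(combos, key=lambda c: c[1], reverse=True)[:top_k]
        tally.update(combo for combo, _ in best)
    return tally.most_common(1)[0][0]
```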
When evaluating the solutions produced by the BB algorithm, it was observed that, in about 31% of the solutions, for some populations and scenarios, the sample size produced corresponded to a lower number of strata than previously established; this results from the algorithm favoring, in the stratification, the smallest possible sample size to the detriment of the number of strata. Additionally, in the case of the LHSE algorithm, non-convergence was observed in about 5% of the solutions produced for some populations and scenarios. More specifically, the algorithm produced a sample size whose associated coefficient of variation was greater than the target cv (cv_t).
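The feasibility screening described above amounts to two checks per solution; a minimal sketch (Python, with hypothetical names):

```python
def is_valid_solution(n_strata_produced, achieved_cv, L, cv_t, tol=1e-9):
    """A solution counts as valid when the stratification keeps the requested
    number of strata (the BB algorithm may collapse strata) and the achieved
    coefficient of variation does not exceed the target (the LHSE algorithm
    may fail to converge)."""
    return n_strata_produced == L and achieved_cv <= cv_t + tol
```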
From these observations, in order to enable a fair comparison between the five algorithms, four types of analysis were performed: (i) the first uses the 620 solutions produced by the LHKO, VNS, and STRATMH algorithms; (ii) the second considers the intersection of the set of 587 valid solutions produced by the LHSE algorithm with the 620 solutions produced by the LHKO, VNS and STRATMH algorithms; (iii) the third considers the five algorithms, based on the intersection of the sets of valid solutions (396 in total) produced by the BB and LHSE algorithms together with the other algorithms, that is, the cases in which the number of strata produced by the BB algorithm corresponds to the number previously fixed and the LHSE algorithm converged; (iv) the fourth consists of an experimental analysis of the BB and STRATMH algorithms, applying hypothesis testing to validate the results statistically.

Analysis I -LHKO, VNS and STRATMH algorithms
Table 6 brings the percentage of best solutions for the three algorithms by number of strata and target cv, and Table 7 brings the overall percentage of best solutions for each algorithm. Here, the best solution is the one corresponding to the smallest sample size. From these tables, it is observed that the STRATMH algorithm performed better than the LHKO and VNS algorithms, with percentages of best solutions greater than or equal (in bold) to those of these two algorithms in 19 of the 20 analyzed scenarios, with a lower percentage only than that of the VNS algorithm in the scenario associated with L = 7 and cv_t = 10%. Considering the overall percentage of best solutions (Tab. 7), STRATMH reached 96.0%, followed by the VNS algorithm with 91.0% and LHKO with 74.4%.
In addition to the percentages of best solutions, the relative efficiency (Eff) of the STRATMH algorithm was evaluated with respect to the VNS and LHKO algorithms, that is, the ratio between the sample sizes produced. For each of the 620 sample sizes (solutions) obtained, the value of expressions (21) and (22) was calculated. These values were used to build the boxplots presented in Figures 4 and 5, which show the distributions of the relative efficiencies of the STRATMH algorithm with respect to the LHKO and VNS algorithms by number of strata and target coefficient of variation. From the analysis of these figures, it is observed that the STRATMH algorithm's most significant gains were with respect to the LHKO algorithm. Furthermore, the STRATMH algorithm, in most cases, has a relative efficiency better than or equal to that of these algorithms. Finally, the smaller the value of cv_t, that is, the higher the required level of precision, the greater the gains of the STRATMH algorithm.

Analysis II - LHKO, LHSE, VNS and STRATMH algorithms
Table 8 brings, by number of strata and target cv, the percentage of best solutions with respect to the sample sizes produced by the LHKO, LHSE, VNS, and STRATMH algorithms, and Table 9 brings the overall percentage for each of the algorithms, considering all 587 valid solutions produced by the LHSE algorithm. From Table 8, it can be seen that, in 18 of the 20 scenarios (in bold), STRATMH produced percentages greater than or equal to those of the other algorithms, followed by the VNS and LHKO algorithms. The scenarios in which STRATMH performed worse than VNS correspond to L = 6, 7 and cv_t = 10%. Additionally, the algorithm produced, in 18 of the 20 scenarios, a percentage of best solutions above 93%, with only a percentage of about 79% (L = 7 and cv_t = 3%) and a percentage of about 90% (L = 6 and cv_t = 5%) being observed. Furthermore, compared to the second-best algorithm (VNS), STRATMH produced the highest percentages of best solutions in 12 of the 20 scenarios. Finally, the biggest gains of this algorithm over the other algorithms, in percentage terms, are observed for cv_t = 7.5%.
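The relative efficiency measure used in these analyses is the ratio between sample sizes; a minimal sketch mirroring expressions (21) and (22) (Python, with illustrative names):

```python
def relative_efficiency(n_other, n_stratmh):
    """Eff_i = n_i^other / n_i^STRATMH for each solution i; values above 1
    mean STRATMH reached the target cv with a smaller sample."""
    return [a / b for a, b in zip(n_other, n_stratmh)]
```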
In Table 9, note that STRATMH again stands out, with an overall percentage of about 96.0% of best solutions, followed by the VNS algorithm with about 92.0% and LHKO with 75.0%.

Analysis III -LHKO, LHSE, VNS, BB and STRATMH algorithms
In this third analysis, the five algorithms were evaluated considering only the feasible solutions produced by the BB and LHSE algorithms (396 in total). From the analysis of Table 10, which brings the percentage of best solutions by algorithm, number of strata and cv_t, and Table 11, which brings the overall percentage of best solutions for each algorithm, it is observed that the STRATMH algorithm again presented performance superior to the other algorithms. Table 10 shows, in 18 of the 20 scenarios (in bold), percentages of best solutions greater than or equal to those of the other algorithms, with STRATMH showing a lower percentage only in the scenario associated with L = 6 and cv_t = 5%, when compared to VNS, which had the second-best performance.

Analysis IV -Experimental Analysis -BB and STRATMH algorithms
An additional experiment was carried out to evaluate the STRATMH and BB (Ballin and Barcaroli) algorithms by applying hypothesis testing to validate the results statistically. More specifically, Student's t-test [48] was used to verify whether there is a significant difference between the STRATMH and BB algorithms regarding the quality of the solutions produced. In the t-test, the null hypothesis (H0) states that the means of the two populations are the same, while the alternative (H1) states that the means are different. Considering the mean and standard deviation of the objective function values of the stratification problem, that is, the sample sizes produced, the (two-sided) t-test can be applied to verify whether the difference between the STRATMH and BB algorithms is significant for a particular population, for the values of L and cv_t considered in the experiments, at a significance level of 5%. If the difference is significant and STRATMH presents a better (or worse) result (sample size), it is said to be significantly better (or worse) than the BB algorithm for that population; otherwise (p-value > 5%), the two algorithms are said to present equivalent (or similar) performances.
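The per-case comparison can be sketched as below (Python; the paper does not specify its t-test implementation, and the normal approximation of the two-sided p-value used here is an assumption that is reasonable for 30 runs per algorithm):

```python
import math
from statistics import mean, variance, NormalDist

def compare_algorithms(runs_a, runs_b, alpha=0.05):
    """Welch-type t statistic on the sample sizes from repeated runs of two
    algorithms; the two-sided p-value uses a normal approximation, which is
    close to the t distribution for ~30 runs per side."""
    t = (mean(runs_a) - mean(runs_b)) / math.sqrt(
        variance(runs_a) / len(runs_a) + variance(runs_b) / len(runs_b))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    if p > alpha:
        return "equivalent"
    return "better" if mean(runs_a) < mean(runs_b) else "worse"  # smaller n wins
```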
The STRATMH and BB algorithms were each executed 30 times for each of the 31 populations and each of the 20 scenarios of the previous experiments (combinations of L = 3, 4, 5, 6, 7 and cv_t = 3%, 5%, 7.5%, and 10%), making a total of 37 200 (2 × 30 × 31 × 20) executions. For each population and each scenario, the values of n obtained in the 30 executions of the two algorithms were stored.
In the application of this test, we would initially have, per scenario, 31 evaluated populations, making a total of 620 solutions (each based on 30 executions); discarding the cases (population × combination of L and cv_t) in which the BB algorithm did not converge (did not produce feasible results), 408 solutions were evaluated. Tables 12 and 13 bring the percentages of better, equivalent, and worse solutions of the STRATMH algorithm in aggregate form (general) and by scenario. From these tables, it is possible to observe a significant superiority of STRATMH over the BB algorithm regarding the percentage of significantly better solutions. In particular, the highest percentages of better solutions are observed for target cv's of 3% and 5%. Furthermore, among the 20 evaluated scenarios, only in one case (L = 3 and cv_t = 10%) did the BB algorithm perform better than the STRATMH algorithm.
Finally, Table 14 presents the average and median processing times of the STRATMH, BB, and VNS algorithms. This analysis was restricted to these three algorithms because, being based on stochastic optimization methods (metaheuristics), they demand intensive computation, implying more processing time to produce solutions of reasonable quality. The LHSE and LHKO algorithms are faster, with execution times of the order of a few seconds; however, in general, they produced solutions of lower quality than those achieved by STRATMH and VNS, as observed in the previous analyses.
From the analysis of Table 14, it can be observed that STRATMH is, in general, the algorithm that requires the least computational time to produce the smallest sample size. The median STRATMH time was less than 4 s, regardless of the scenario. Likewise, average times of less than 5 s were observed, except for cv_t = 3%.

Conclusions
This article presented a hybrid method, with a new heuristic algorithm, to solve the univariate stratification problem, an important problem in the field of statistics with high computational complexity. To solve this problem and produce solutions of reasonable quality, an algorithm based on BRKGA, called STRATMH, was proposed, which applies the method proposed by Brito et al. [9] in the allocation step.
The STRATMH algorithm was evaluated and compared with four algorithms from the literature through experiments considering 31 populations and 20 scenarios that combine different numbers of strata and target coefficients of variation.
From the analyses, it was possible to observe the superior performance of this algorithm compared to the four others, a statement corroborated by the percentage of best solutions (smaller sample sizes) produced by STRATMH in all evaluated scenarios and in the three comparative analyses performed.
In Analysis (I), which involved the LHKO and VNS algorithms, STRATMH produced 96% of the best solutions. In Analysis (II), which considered the same algorithms as Analysis (I) plus the LHSE algorithm, STRATMH also produced almost 96% of the best solutions. In Analysis (III), where the solutions of the five algorithms were considered, a percentage of 94% was observed for STRATMH. Finally, in Analysis (IV), there is statistical evidence that the proposed STRATMH algorithm obtains better results than the BB algorithm, although both are bio-inspired.
Regarding the analysis of efficiency, measured by the computational time required to produce the solutions, the comparison considered the computational times of STRATMH and of the VNS and BB algorithms, which are also based on metaheuristics; STRATMH proved to be the most efficient, taking into account the average and median times.
The results and analyses presented in this work indicate that the STRATMH algorithm is a good alternative for solving the univariate stratification problem of minimizing the sample size for a fixed number of strata and a pre-fixed level of precision.
Appendix B brings two examples of the application of the function that implements the STRATMH heuristic in two of the 31 populations used in this work.

Figure 2. Crossover operator applied to an example.

Figure 3. Illustration of the application of the STRATMH heuristic in three consecutive generations.
Expressions (21) and (22) were calculated for each population and scenario (L and cv_t), with i being the index corresponding to the ith solution, and n_i^LHKO, n_i^VNS and n_i^STRATMH corresponding to the minimum sample sizes needed to reach the predefined target cv, considering L strata and the LHKO, VNS, and STRATMH algorithms, respectively. The higher the Eff value, the greater the difference between the sample sizes produced by the STRATMH algorithm and those of these algorithms.

Notes. * = number of feasible solutions produced by the BB algorithm.

Table 1 .
Examples of solutions produced.

Algorithm 1 (excerpt):
10: Sort the solutions (lines) of P increasingly with respect to the n values
12-15: Assign n to n_best and store the associated N_h, S²_h and n_h values in x_best
16: Assign to P_R the (p − p_E) remaining solutions of P
17: Generate, through the construction procedure, p_M solutions and assign them to P′
18: Apply crossover (p − p_E − p_M)/2 times using P_E and P_R solutions (obtaining matrix P′′)
19: P ← P_E ∪ P′ ∪ P′′
20: end while
21: Apply the exact method using L, N_h and S²_h of x_best; update n and x_best
22: Return n and x_best

Table 2 .
Information about functions used in algorithms.

Table 3 .
Description of the 31 populations.
* Populations used in the experiments to calibrate the STRATMH in Section 4.1.

Table 4 .
Basic information of the 31 populations.

Table 6 .
Percentage of best solutions produced by algorithm, by number of strata and cv_t.

Table 7 .
Percentage of times that algorithm produced the best solution.

Table 8 .
Percentage of best solutions produced by algorithm, by number of strata and cv_t.
* Number of populations where LHSE algorithm converged.

Table 9 .
Percentage of times that algorithm produced the best solution.

Table 10 .
Percentage of best solutions produced by algorithm, by number of strata and cv_t.
* Number of populations where the BB algorithm produced a feasible solution, that is, meeting the constraint on the fixed number of strata, and where the LHSE algorithm converged; "-": the BB algorithm did not produce a feasible solution.

Table 11 .
Percentage of times that algorithm produced the best solution.

Table 12 .
Percentage of Better (B), Equivalent (E) and Worse (W) solutions of the STRATMH algorithm -General.

Table 13 .
Percentage of Better (B), Equivalent (E) and Worse (W) solutions of the STRATMH algorithm -20 Scenarios.

Table 14 .
Algorithms STRATMH, BB and VNS - mean and median time (seconds) by L and cv_t. Notes. "-": the BB algorithm did not produce a feasible solution.

Table B.3.
Examples of the application of the stratMH function.

Appendix C. Stratified sampling notation
Table C.1. Stratified sampling notation.