### Dataset

The proposed model uses the UCI CKD dataset^{46}, which is designed to support studies on CKD detection. It comprises 25 attributes: 11 numeric and 14 nominal, the latter including the class attribute. The dataset covers 400 individuals, of whom 250 are labeled as having CKD and 150 as not having CKD (NKD). Table 1 provides a complete summary of the dataset used in this investigation.

### Methodology

This study introduces a methodology for predicting patient disease from medical data. It aims to improve the precision of disease prediction by applying medical data classification methods, and our primary objective is to boost classification performance on the CKD dataset through a rigorous feature selection process. The following parts explain the approach in full.

Stage 1. Pre-processing:

In the first step, we use the CKD dataset provided by the UCI machine learning repository. Real-world medical datasets are widely recognized to contain noise, missing values, and inconsistencies, principally because of the different sources and sizes of these databases. A thorough pre-processing procedure is therefore applied to the CKD dataset to improve overall data quality. It involves transforming the data, handling missing values, and normalizing the data. This not only improves the accuracy of the predictions but also preserves the integrity of the dataset.

Stage 2. Feature Selection:

A thorough feature selection process is needed to achieve the best possible classification accuracy. This study employs the BGWO approach for this purpose. The BGWO algorithm identifies and selects the essential features in the CKD dataset, and the selected features form the input to the subsequent classification step.

Stage 3. Data Classification:

In the last stage, ELM was applied to classify the presence of CKD. The ELM model takes the selected features and runs its learning process to determine, from the medical dataset, whether a person has CKD. ELM has gained wide acceptance owing to its strong performance in machine learning and its ability to extract complicated patterns from data, which makes it well suited to classifying medical data and producing reliable, robust results. The main objective of the research was to improve the accuracy of CKD diagnosis. This objective is pursued through a comprehensive methodology that combines pre-processing, optimal feature selection, and ELM-based classification, with the goal of an accurate and effective model for detecting CKD from healthcare datasets. The full process, from data preparation through ELM-based classification, is shown in Fig. 1. This method can improve CKD detection accuracy and may change how CKD patient care is delivered.

### Data pre-processing

#### Missing value imputation

A major challenge in this study was that a considerable portion of the data was missing. To address this, we applied the Predictive Mean Matching (PMM) method within the multiple imputation framework, using the MICE package in R. The resulting complete dataset was stored in CSV format and used in the subsequent steps, in which we developed the models in Python. The procedure follows the sequential steps outlined in^{47}.

Sequential steps for PMM are as follows:

**Step 1:** Using a set of regression models, we estimated the predicted mean \(\widehat{y}_i\) for each subject \(i\) with an observed value of the variable \(y\), and the posterior predicted mean \(y_o^*\) for each subject \(o\) whose value of \(y\) is missing. Here, \(o\) indexes the cases with missing data.

**Step 2:** We then selected a set of \(K\) candidate donors for which the distance \(d\left(o,i\right)=|y_o^*-\widehat{y}_i|\) is minimal. In this context, a “donor” is a case with no missing value for the variable being imputed.

**Step 3:** A single donor was randomly picked from the list of candidate donors, and its observed value was used to impute the missing value of recipient \(o\).
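The three steps above can be sketched in Python as follows. This is a simplified illustration rather than the `mice` implementation used in the study: it assumes a single numeric predictor, uses ordinary least squares for both sets of predicted means (the posterior draw of the regression coefficients is omitted), and the function name `pmm_impute` and the toy data are hypothetical.

```python
import numpy as np

def pmm_impute(x, y, k=5, seed=0):
    """Impute NaNs in y by predictive mean matching on one predictor x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing

    # Step 1: fit y ~ x on the observed cases and predict for every case.
    # (Simplification: no posterior draw of the coefficients.)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ coef

    y_imputed = y.copy()
    donor_idx = np.flatnonzero(observed)
    for o in np.flatnonzero(missing):
        # Step 2: keep the k donors whose predicted means are closest.
        dist = np.abs(predicted[o] - predicted[donor_idx])
        nearest = donor_idx[np.argsort(dist)[:k]]
        # Step 3: pick one donor at random and copy its observed value.
        y_imputed[o] = y[rng.choice(nearest)]
    return y_imputed

# Toy data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 12.1])
y_filled = pmm_impute(x, y, k=2)
```

Because every imputed value is copied from a donor, the filled-in values are always values that were actually observed in the data.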

#### Data transformation

As part of the pre-processing pipeline, we converted the nominal features. These features consist of non-numeric values such as “yes/no,” “good/poor,” “present/not present,” or “normal/abnormal.” To integrate them into our analysis, we transformed them into a binary representation using the digits “1” and “0.”

For this conversion, we used the label encoding technique available in the scikit-learn (Sklearn) package in Python. It transforms nominal categorical variables into a numerical representation that is compatible with the subsequent data analysis and modeling stages.
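A minimal pure-Python sketch of what label encoding does is given below. It mirrors the behaviour of scikit-learn's `LabelEncoder`, which assigns integer codes to the class labels in sorted order. The helper `label_encode` and the sample columns are hypothetical and serve only as illustration.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Illustrative nominal columns (not taken from the dataset).
appetite = ["good", "poor", "good", "good"]
hypertension = ["yes", "no", "no", "yes"]

appetite_codes, appetite_map = label_encode(appetite)
hypertension_codes, hypertension_map = label_encode(hypertension)
```

Because the codes follow alphabetical order, which category becomes “1” depends on the label names (for example, “abnormal” sorts before “normal”), so the resulting mapping is worth checking for each column.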

#### Data normalization

Data normalization rescales feature values to a common range so that features with large magnitudes do not dominate those with small magnitudes.

In this stage, the data were rescaled to a consistent scale using min–max scaling. The min–max normalization formula, as stated in^{48}, is given in Eq. (1).

$$X_{norm}=\frac{X_0-X_{min}}{X_{max}-X_{min}}$$

(1)

The variable \(X_{norm}\) signifies the normalized value of variable \(X\) following transformation. The symbol \(X_0\) signifies the present value of variable \(X\), \(X_{max}\) signifies the maximum value within the dataset, and \(X_{min}\) signifies the minimum value within the dataset.
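Eq. (1) can be illustrated with a short sketch. The function name `min_max_scale` and the sample values are hypothetical. The guard for a constant column is an added assumption, since Eq. (1) is undefined when the maximum equals the minimum.

```python
def min_max_scale(values, x_min=None, x_max=None):
    """Rescale values to [0, 1] following Eq. (1)."""
    x_min = min(values) if x_min is None else x_min
    x_max = max(values) if x_max is None else x_max
    span = x_max - x_min
    if span == 0:
        # Constant column: Eq. (1) is undefined, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - x_min) / span for v in values]

ages = [48, 62, 51, 90, 2]  # illustrative values only
scaled = min_max_scale(ages)
```

Passing `x_min` and `x_max` explicitly allows the same bounds to be reused on another subset of the data.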

#### Dataset splitting

After normalizing the dataset, we partitioned it into two subsets: 80% of the data was used for training, and the remaining 20% was used for testing. The partition was made with the stratified split method, which preserves the proportion of CKD and non-CKD cases in both subsets. The model was therefore trained and tested on data with the actual class proportions, which supports a more reliable estimate of its accuracy.
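A stratified 80/20 split can be sketched as below, using the class counts stated in the Dataset section. This is an illustrative pure-Python version with hypothetical names (`stratified_split`, the label strings, the seed). In scikit-learn, `train_test_split` with the `stratify` argument provides the same behaviour.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 250 CKD and 150 non-CKD records, as in the dataset description.
labels = ["ckd"] * 250 + ["notckd"] * 150
train_idx, test_idx = stratified_split(labels)
```

With these counts, the test set holds 50 CKD and 30 non-CKD records, the same 5:3 ratio as the full dataset.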

### Feature selection

In this module, the BGWO method is used to select a discriminating subset of features according to a given criterion. The traditional feature selection process consists of four main steps: subset generation, subset evaluation, a stopping criterion, and validation of the results. First, a candidate feature subset is generated by the BGWO algorithm, which is based on the processes used by the wolves. Next, each candidate subset is evaluated against the current best subset using the chosen criterion, and it replaces the current best subset if it is better.

This generation and evaluation cycle is repeated until a stopping criterion is met. The best subset found is then validated using either prior knowledge or tests on past datasets.

#### Feature subset optimization

Feature selection is a critical aspect of ML that significantly affects dataset quality. Excluding unnecessary features shortens training time, simplifies model development, and improves understanding of the data^{49}. Adopting the BGWO algorithm, which represents candidate feature subsets as “grey wolves,” improves subset optimization, and with it both the interpretability and the performance of the model. Subset optimization matters in particular because classifier performance can peak: beyond a certain number of features, adding more features, many of which carry redundant information, increases overfitting rather than accuracy. The BGWO algorithm therefore helps identify the features that contribute most and improves the understanding of the data.

### Grey Wolf optimization algorithm

Grey Wolf Optimization (GWO) is a population-based computational optimization technique rooted in evolutionary computing and modeled on the predation behaviour of grey wolves^{25}. The technique was inspired by the social hierarchy of a grey wolf pack. Normally, the pack comprises 5–12 members with comparatively high intelligence. Within the pack, the grey wolves are classified into four groups, namely alpha (\(\alpha\)), beta (\(\beta\)), delta (\(\delta\)), and omega (\(\omega\)), based on the prevailing hierarchy. Alphas are responsible for decisions about hunting, rest, and activity, whereas betas provide assistance and support. Deltas are subordinate to alphas and betas but dominate the omegas, who must comply with the directives of all higher-ranking wolves.

The model of grey wolf predation has two distinct processes, as stated by^{25}. Initially, the wolves encircle the target, as seen by

$$\overrightarrow{X}(t+1)=\overrightarrow{X}_p(t)+\overrightarrow{A}\cdot \overrightarrow{D}$$

(2)

where the variable \(t\) signifies the iteration. The vector \(\overrightarrow{X}\) signifies the position of the wolf, while \(\overrightarrow{X}_p\) signifies the position of the target. Additionally, \(\overrightarrow{A}\) is a coefficient vector. The vector \(\overrightarrow{D}\) is specified by

$$\overrightarrow{D}=\left|\overrightarrow{C}\cdot \overrightarrow{X}_p(t)-\overrightarrow{X}(t)\right|$$

(3)

where \(\overrightarrow{C}\) represents a coefficient vector. The vectors \(\overrightarrow{A}\) and \(\overrightarrow{C}\) are defined by

$$\overrightarrow{A}=2a\cdot \overrightarrow{r}_1-a$$

(4)

and

$$\overrightarrow{C}=2\cdot \overrightarrow{r}_2$$

(5)

The value of \(a\) decreases linearly from two to zero as the number of iterations grows. The vectors \(\overrightarrow{r}_1\) and \(\overrightarrow{r}_2\) are randomly generated within the range [0, 1]. Within the GWO method, the designations “alpha,” “beta,” and “delta” are assigned to the candidate solutions based on their relative performance: the alpha is the best solution, the beta the second best, and the delta the third best. The alpha, beta, and delta wolves are assumed to have the most knowledge about the location of the prey. Once these best positions are determined, the other search agents, including the omegas, must update their positions accordingly:

$$\overrightarrow{X}(t+1)=\frac{\overrightarrow{X}_1+\overrightarrow{X}_2+\overrightarrow{X}_3}{3}$$

(6)

where the vectors \(\overrightarrow{X}_1\), \(\overrightarrow{X}_2\), and \(\overrightarrow{X}_3\) are computed by:

$$\overrightarrow{X}_1=\left|\overrightarrow{X}_\alpha -\overrightarrow{A}_1\cdot \overrightarrow{D}_\alpha \right|$$

$$\overrightarrow{X}_2=\left|\overrightarrow{X}_\beta -\overrightarrow{A}_2\cdot \overrightarrow{D}_\beta \right|$$

$$\overrightarrow{X}_3=\left|\overrightarrow{X}_\delta -\overrightarrow{A}_3\cdot \overrightarrow{D}_\delta \right|$$

(7)

The three best solutions at each iteration are denoted \(\overrightarrow{X}_\alpha\), \(\overrightarrow{X}_\beta\), and \(\overrightarrow{X}_\delta\). The values of \(\overrightarrow{A}_1\), \(\overrightarrow{A}_2\), and \(\overrightarrow{A}_3\) are determined by Eq. (4). The vectors \(\overrightarrow{D}_\alpha\), \(\overrightarrow{D}_\beta\), and \(\overrightarrow{D}_\delta\) are given by:

$$\overrightarrow{D}_\alpha =\left|\overrightarrow{C}_1\cdot \overrightarrow{X}_\alpha -\overrightarrow{X}\right|$$

$$\overrightarrow{D}_\beta =\left|\overrightarrow{C}_2\cdot \overrightarrow{X}_\beta -\overrightarrow{X}\right|$$

$$\overrightarrow{D}_\delta =\left|\overrightarrow{C}_3\cdot \overrightarrow{X}_\delta -\overrightarrow{X}\right|$$

(8)

The vectors \(\overrightarrow{C}_1\), \(\overrightarrow{C}_2\), and \(\overrightarrow{C}_3\) are computed using Eq. (5). The process is repeated until the wolves capture the prey.
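The update rules in Eqs. (4) to (8) can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function `gwo_step`, the `sphere` test objective, the pack size, and the iteration count are all assumptions, and the problem is treated as a minimization.

```python
import numpy as np

def gwo_step(positions, fitness, a, rng):
    """One GWO position update for all wolves (minimization)."""
    order = np.argsort([fitness(p) for p in positions])
    leaders = positions[order[:3]]             # alpha, beta, delta
    new_positions = np.empty_like(positions)
    for i, x in enumerate(positions):
        candidates = []
        for leader in leaders:
            A = 2 * a * rng.random(x.shape) - a         # Eq. (4)
            C = 2 * rng.random(x.shape)                 # Eq. (5)
            D = np.abs(C * leader - x)                  # Eq. (8)
            candidates.append(np.abs(leader - A * D))   # Eq. (7)
        new_positions[i] = np.mean(candidates, axis=0)  # Eq. (6)
    return new_positions

def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
wolves = rng.uniform(-5.0, 5.0, size=(8, 3))
n_iter = 50
for t in range(n_iter):
    a = 2.0 * (1 - t / n_iter)    # linear decrease from 2 towards 0
    wolves = gwo_step(wolves, sphere, a, rng)
```

As \(a\) shrinks, the coefficient \(\overrightarrow{A}\) shrinks with it, so the wolves move from wide exploration around the three leaders towards tight convergence on them.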

### Binary Grey Wolf optimization

In the GWO algorithm, wolves continuously alter their positions in a continuous search space to locate and capture prey. However, some tasks, such as feature selection, are posed in a binary space where each component of the solution is constrained to zero or one, which the conventional GWO algorithm cannot handle directly. The Binary GWO (BGWO) algorithm was therefore proposed for problems whose solutions are in binary form. This study employs two position update methods, namely Position Update Algorithm 1 (PUA1) and Position Update Algorithm 2 (PUA2), as proposed in^{25}.

$$x_d^{t+1}=\left\{\begin{array}{ll}x_1^d, & \text{if } rand<\frac{1}{3}\\ x_2^d, & \frac{1}{3}\le rand<\frac{2}{3}\\ x_3^d, & \text{otherwise}\end{array}\right.$$

(9)

and

$$x_d^{t+1}=\left\{\begin{array}{ll}1, & \text{if } sigmoid\left(\frac{x_1+x_2+x_3}{3}\right)\ge rand\\ 0, & \text{otherwise}\end{array}\right.$$

(10)

where \(rand\) is a random number drawn from a uniform distribution on [0, 1]. The variable \(x_d^{t+1}\) represents the updated binary position of the wolf in dimension \(d\) at iteration \(t\). The sigmoid function is defined as

$$sigmoid(x)=\frac{1}{1+e^{-10(x-0.5)}}$$

(11)

The variables \(x_1\), \(x_2\), and \(x_3\) are binary vectors that represent the outcome of the wolf moving towards the alpha, beta, and delta grey wolves, respectively. They are given by

$$x_1^d=\left\{\begin{array}{ll}1, & \text{if } \left(x_\alpha ^d+bstep_\alpha ^d\right)\ge 1\\ 0, & \text{otherwise}\end{array}\right.$$

(12)

$$x_2^d=\left\{\begin{array}{ll}1, & \text{if } \left(x_\beta ^d+bstep_\beta ^d\right)\ge 1\\ 0, & \text{otherwise}\end{array}\right.$$

(13)

$$x_3^d=\left\{\begin{array}{ll}1, & \text{if } \left(x_\delta ^d+bstep_\delta ^d\right)\ge 1\\ 0, & \text{otherwise}\end{array}\right.$$

(14)

The positions of the alpha, beta, and delta wolves in dimension \(d\) are denoted as \(x_\alpha ^d\), \(x_\beta ^d\), and \(x_\delta ^d\), respectively. Additionally, the values \(bstep_\alpha ^d\), \(bstep_\beta ^d\), and \(bstep_\delta ^d\) are specified by

$$bstep_\alpha ^d=\left\{\begin{array}{ll}1, & \text{if } cstep_\alpha ^d\ge rand\\ 0, & \text{otherwise}\end{array}\right.$$

(15)

$$bstep_\beta ^d=\left\{\begin{array}{ll}1, & \text{if } cstep_\beta ^d\ge rand\\ 0, & \text{otherwise}\end{array}\right.$$

(16)

$$bstep_\delta ^d=\left\{\begin{array}{ll}1, & \text{if } cstep_\delta ^d\ge rand\\ 0, & \text{otherwise}\end{array}\right.$$

(17)

The variables \(cstep_\alpha ^d\), \(cstep_\beta ^d\), and \(cstep_\delta ^d\) are defined as follows.

$$cstep_\alpha ^d=\frac{1}{1+e^{-10\left(A_1^dD_\alpha ^d-0.5\right)}}$$

(18)

$$cstep_\beta ^d=\frac{1}{1+e^{-10\left(A_1^dD_\beta ^d-0.5\right)}}$$

(19)

$$cstep_\delta ^d=\frac{1}{1+e^{-10\left(A_1^dD_\delta ^d-0.5\right)}}$$

(20)

The values of \(A_1^d\), \(D_\alpha ^d\), \(D_\beta ^d\), and \(D_\delta ^d\) are computed using Eqs. (4) and (8).

In this way, BGWO updates the binary position of every wolf at each iteration, moving the pack towards the optimal feature subset. Algorithm 1 delineates the BGWO pseudocode.
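The binary update rules in Eqs. (9) to (20) can be sketched with NumPy as below. This is an illustrative reading of the equations and not a reproduction of Algorithm 1: the function names, the toy leader vectors, and the value of \(a\) are assumptions, and both position updates are applied here to the binary vectors \(x_1\), \(x_2\), and \(x_3\) as the text defines them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))       # Eq. (11)

def move_towards(leader, x, a, rng):
    """Binary move of wolf x towards one leader, Eqs. (12)-(20)."""
    A = 2 * a * rng.random(x.shape) - a                  # Eq. (4)
    C = 2 * rng.random(x.shape)                          # Eq. (5)
    D = np.abs(C * leader - x)                           # Eq. (8)
    cstep = sigmoid(A * D)                               # Eqs. (18)-(20)
    bstep = (cstep >= rng.random(x.shape)).astype(int)   # Eqs. (15)-(17)
    return ((leader + bstep) >= 1).astype(int)           # Eqs. (12)-(14)

def update_crossover(x1, x2, x3, rng):
    """Eq. (9): each dimension is copied from x1, x2, or x3 at random."""
    r = rng.random(x1.shape)
    return np.where(r < 1 / 3, x1, np.where(r < 2 / 3, x2, x3))

def update_sigmoid(x1, x2, x3, rng):
    """Eq. (10): threshold the sigmoid of the mean against rand."""
    mean = (x1 + x2 + x3) / 3
    return (sigmoid(mean) >= rng.random(x1.shape)).astype(int)

rng = np.random.default_rng(0)
alpha = np.array([1, 0, 1, 1, 0])   # toy leader positions (illustrative)
beta = np.array([1, 1, 0, 1, 0])
delta = np.array([0, 0, 1, 1, 1])
wolf = np.array([0, 1, 0, 0, 1])

x1, x2, x3 = (move_towards(leader, wolf, 1.0, rng)
              for leader in (alpha, beta, delta))
new_wolf_eq9 = update_crossover(x1, x2, x3, rng)
new_wolf_eq10 = update_sigmoid(x1, x2, x3, rng)
```

Every returned vector stays binary, so each position can be read directly as a feature mask.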

The solution in this research is represented as a one-dimensional binary vector whose length corresponds to the number of features. In this vector, the values 0 and 1 have the following meanings:

0: The feature has not been chosen.

1: The feature has been chosen.

The process of feature selection inherently has a dual-objective aspect. One goal is to decrease the number of features, while the other is to improve classification precision. To achieve both goals simultaneously, the fitness function in Eq. (21) is used, applying the KNN classifier as described in^{25} and^{50}.

$$fitness=\alpha \rho _R\left(D\right)+\beta \frac{|S|}{|T|}$$

(21)

The parameters \(\alpha\) and \(\beta\), defined as \(\alpha \in [0,1]\) and \(\beta =(1-\alpha )\), are adopted from^{25}. The term \(\rho _R\left(D\right)\) designates the error rate of the KNN classifier. Furthermore, \(|S|\) represents the number of features in the selected subset, whereas \(|T|\) denotes the total number of features in the dataset.
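The trade-off expressed by the fitness function can be illustrated with a small sketch. The function name, the default `alpha=0.99`, and the example error rates and feature counts are illustrative assumptions, not values reported in this study.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Weighted sum of classifier error and the fraction of features kept."""
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)

# A binary wolf position acts as a feature mask: 1 = selected, 0 = dropped.
mask = [1, 0, 1, 1, 0, 0, 1, 0]
score = fitness(error_rate=0.05, n_selected=sum(mask), n_total=len(mask))

# With equal error rates, the smaller subset receives the lower (better) score.
small_subset = fitness(0.05, 10, 24)
large_subset = fitness(0.05, 20, 24)
```

Lower values are better: a large \(\alpha\) makes the error rate dominate, while the \(\beta\) term breaks ties in favour of smaller subsets.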

After successfully integrating the optimal feature selection segment, disease detection is conducted using the classifier. The ELM technique utilizes a classification methodology to ascertain the existence or non-existence of CKD by analyzing medical data.

### CKD data classification

The features selected through BGWO are employed in the CKD classification phase. In this phase, we emphasize training an ELM model to classify CKD.

#### Extreme learning machine (ELM)

The ELM^{51} is a highly adaptable feed-forward neural network employed for many computational tasks, including classification, regression, and clustering. An ELM can have a single hidden layer or multiple hidden layers. A single hidden layer can suffice for simpler problems, providing rapid training and reduced computational demands, but it may not perform adequately on more complex datasets, where multiple layers could capture deeper patterns and interactions within the data. The proposed model consists of input nodes that feed the hidden nodes, and further nodes that form the final output. As in other neural networks, the hidden nodes are activated by rectified linear units. The key feature of the algorithm is that the hidden-node parameters, namely the weights and biases, are fixed: they are either kept unaltered or carried over as they are, rather than tuned during training. This differs from back-propagation, a common approach to training neural networks. While effective, back-propagation is limited because the weights require continuous updates, the algorithm does not consider the weights’ magnitudes, and it tends to get stuck in local minima. In addition, we adjusted the magnitude of the weights and biases to prevent over-fitting.

Dropout is applied during the training phase so that the model does not over-fit the training vectors. During testing, all input-node weights are restored, and the inputs arriving at the hidden nodes are weighted accordingly. The number of weights connecting the input and hidden nodes therefore remains unchanged. The ELM is illustrated in Fig. 2. ELM also learns faster than networks trained with back-propagation. Finally, by using validation, we can monitor the learning process and ensure that the complexity of the model allows it to generalize to new data.

Consider a set of \(H\) random samples denoted as \((p_i,t_i)\), where \(p_i=[p_{i1},p_{i2},\dots ,p_{in}]^T\in Q^n\) and \(t_i=[t_{i1},t_{i2},\dots ,t_{im}]^T\in Q^m\). The basic single-hidden-layer feed-forward neural network (SLFN) with \(G\) hidden nodes and an activation function \(f(\cdot )\) may be mathematically stated as:

$$\sum_{i=1}^{G}w_if\left(a_i\times p_j+c_i\right)=o_j,\quad j=1,2,\dots ,H$$

(22)

Here, \(a_i=[a_{i1},a_{i2},\dots ,a_{in}]^T\) is the weight vector linking the \(i\)th hidden node with the input nodes, and \(w_i=[w_{i1},w_{i2},\dots ,w_{im}]^T\) is the weight vector linking the \(i\)th hidden node to the output nodes. The variable \(c_i\) signifies the threshold of the \(i\)th hidden node. Additionally, \(o_j=[o_{j1},o_{j2},\dots ,o_{jm}]^T\) signifies the output vector that the SLFN produces for the \(j\)th sample.

An SLFN with \(G\) hidden nodes and an activation function \(f(\cdot )\) can approximate a collection of \(H\) samples with zero error. The condition \(\sum_{j=1}^{H}\parallel o_j-t_j\parallel =0\) expresses this: the total discrepancy between the network outputs \(o_j\) and their respective target values \(t_j\) is equal to zero. This is possible when there exist output weight vectors \(w_i\), input weight vectors \(a_i\), and hidden-node thresholds \(c_i\) that satisfy

$$\sum_{i=1}^{G}w_if\left(a_i\times y_j+c_i\right)=t_j,\quad j=1,2,3,\dots ,H$$

(23)

The equation above may be stated concisely as:

$$Mw=T$$

(24)

where

$$M\left(a_1,\dots ,a_G,c_1,\dots ,c_G,y_1,\dots ,y_H\right)=\left[\begin{array}{ccc}f\left(a_1\times y_1+c_1\right)& \cdots & f\left(a_G\times y_1+c_G\right)\\ \vdots & \ddots & \vdots \\ f\left(a_1\times y_H+c_1\right)& \cdots & f\left(a_G\times y_H+c_G\right)\end{array}\right]_{H\times G}$$

(25)

$$w=\left[\begin{array}{c}w_1^T\\ \vdots \\ w_G^T\end{array}\right]_{G\times m}$$

(26)

$$T=\left[\begin{array}{c}t_1^T\\ \vdots \\ t_H^T\end{array}\right]_{H\times m}$$

(27)

The term \(M\) represents the output matrix of the hidden layer. The \(k\)th column of \(M\) corresponds to the output produced by the \(k\)th hidden node for the inputs \(y_1\), \(y_2\), and so forth up to \(y_H\). The solution of the linear system may be written as:

$$w=M^{-1}T$$

(28)

where \(M^{-1}\) denotes the Moore–Penrose generalized inverse of the matrix \(M\).

The ELM’s output function is defined as below:

$$g(y)=p(y)w=p(y)M^{-1}T$$

(29)

In ELM training, three parameters are of significance: the training set, signified as \(K=\left\{\left(y_j,t_j\right)\mid y_j\in Q^n, t_j\in Q^m, j=1,2,\dots ,H\right\}\); the output function of the hidden nodes, denoted as \(f(a_i,c_i,y_j)\); and the number of hidden nodes, referred to as \(G\). The ELM training procedure may commence once all parameters have been properly set.

The Extreme Learning Machine starts its training process by generating random values for the *G* pairs of hidden-node parameters \((a_i,c_i)\). The output matrix *M* is then computed using Eq. (25). Given the input data and these randomly generated parameters, the ELM computes the output weight vector *w* using Eq. (28). Once training is complete, the model can be applied to predict the results for the test data tuples using Eq. (29). The ELM training process can thus be summarized as follows:

The training set \(A\) is given by \(A=\left\{a_i\in X_n, d_i\in X_m, i=1,2,\dots ,N\right\}\), with activation function \(f(x)\) and number of hidden neurons \(N\):

**Step 1:** Random values are assigned to the input weights \(w_i\) and biases \(b_i\).

**Step 2:** The output matrix *M* of the hidden layer is computed.

**Step 3:** The output weight vector *w* is computed as \(w=M^{-1}T\), where \(M^{-1}\) is the Moore–Penrose generalized inverse of *M*.

This structured training process allows ELM to effectively analyze and classify the collected data, thereby allowing accurate predictions of the findings for any new samples.
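The training procedure can be sketched with NumPy as follows. This is a minimal single-hidden-layer illustration under stated assumptions, not the authors' model: the function names, the Gaussian initialization of the hidden parameters, the number of hidden nodes, and the synthetic two-class data are all hypothetical, and the dropout and weight-magnitude adjustments described earlier are omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def elm_train(Y, T, n_hidden, rng):
    """Train a single-hidden-layer ELM and return (a, c, w)."""
    n_features = Y.shape[1]
    a = rng.normal(size=(n_features, n_hidden))  # random input weights, kept fixed
    c = rng.normal(size=n_hidden)                # random hidden thresholds, kept fixed
    M = relu(Y @ a + c)                          # hidden-layer output matrix, Eq. (25)
    w = np.linalg.pinv(M) @ T                    # Moore-Penrose solution, Eq. (28)
    return a, c, w

def elm_predict(Y, a, c, w):
    return relu(Y @ a + c) @ w                   # output function, Eq. (29)

# Synthetic two-class data (illustrative only, not the CKD dataset).
rng = np.random.default_rng(0)
class0 = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))
class1 = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
Y = np.vstack([class0, class1])
T = np.concatenate([np.zeros(20), np.ones(20)]).reshape(-1, 1)

a, c, w = elm_train(Y, T, n_hidden=10, rng=rng)
scores = elm_predict(Y, a, c, w)
predicted = (scores >= 0.5).astype(int)
accuracy = float(np.mean(predicted == T))
```

Only the output weights `w` are learned, and they come from a single least-squares solve. This is why ELM training is fast compared with iterative back-propagation.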
