Scientific journal
European Journal of Natural History
ISSN 2073-4972
RSCI impact factor = 0.301

THE FORMATION OF A SET OF INFORMATIVE FEATURES BASED ON THE FUNCTIONAL RELATIONSHIPS BETWEEN THE DATA STRUCTURE FIELD OBSERVATIONS

Artemenko M.V. 1, Kalugina N.M. 1, Dobrovolsky I.I. 1
1 South-West State University
The paper considers methods of forming a set of informative features (a tuple of linguistic variables) for solving diagnostic tasks in medical decision support systems. It is proposed to use the parameters of approximating polynomials, algebraic and logical functions, correlations, and exploratory clustering criteria to form the set of features and to calculate informativeness on the basis of rank sorting. A paradigm is formulated in which a separate set of informative indicators is formed for each alternative node of a hierarchical differential-diagnostic decision tree.
informativeness
approximating polynomial
differential diagnosis
method of hierarchies
tuple linguistic variable

Modern medical care relies on information and computer technologies that support the various stages of the treatment and diagnostic process [6]. The development of the theoretical basis and software tools of artificial intelligence for classification, pattern recognition, and forecasting has led to the creation of various specialized automated systems of support for the acceptance of diagnostic solutions (ASSADS) for the tasks of clinical medicine and the training of health workers [1, 2, 6, 16].

The design of specialized ASSADS in medicine relies on building an adequate and effective knowledge base from decisive diagnostic rules that are synthesized and tested on clinically confirmed material, each element of which is characterized by a certain set of recorded, monitored, and controlled characteristics of the biological object or process. The problem of forming the set of informative features is important because the efficiency of subsequent diagnosis, with or without ASSADS, depends on how well it is solved.

From a medical point of view, forming the set of informative features carries a semantic load: it amounts to forming a tuple of linguistic variables that describes the symptoms of a particular disease or condition of the body.

A distinctive feature of building ASSADS for clinical medicine is that, under real conditions, only small training and examination (control) samples of observations of the state of the biological object or process are available. The necessary and sufficient conditions that classical evidence-based medicine imposes on the volume of the investigated material are almost unrealizable when analysing open systems (which biological objects are) with vague and inaccurate recorded data under uncertainty. In addition, the same system of features may be acceptably informative for one recognition task and completely unsuitable for another [13].

Forming the tuple of linguistic variables (the set of informative features) is the subject of many studies, the fundamental ones being the works of G.S. Lbov (e.g. [10]). We consider a number of methods of forming the tuple (both previously studied and proposed by the authors) based on the following methodologies: the analytic hierarchy process (ordering by the obtained ranks, i.e. weights); regression analysis and self-organizing structural-parametric identification of mathematical models by the group method of data handling (GMDH); and logical functions (identified, for example, by logical algorithms or artificial neural networks [5]).

At the beginning of the study the features are set in a non-formalized way: either with the help of experts (the Delphi technique or the fuzzy Delphi method [7], recommended for the analysis of biomedical information because of the way it is recorded) or directly by the researcher, taking into account personal experience and knowledge and an analysis of the specialized literature.

Then it is proposed to apply the following methods and algorithms, which have been proven in practice [8, 9]: Full – an exhaustive search over combinations of features until an acceptable diagnostic effect is achieved; Add – sequential addition of features; Del – sequential elimination of features until the previous diagnostic effect disappears; AddDel – simultaneous execution of the Add and Del procedures; Prob – a weight is determined for each feature and then the procedures of the above algorithms are applied; fractal analysis applied to tensor data (e.g., diagnosis of Parkinson’s disease); Grad – the same as AddDel, but features are included in and excluded from the resulting set not one at a time but in groups.
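The Add and Del procedures listed above can be sketched as greedy searches. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `add_selection`, `del_selection`, and `toy_score` are hypothetical, and `score` stands for any user-supplied measure of the diagnostic effect of a feature subset.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: diagnostic effect of a subset of feature indices.
Score = Callable[[Sequence[int]], float]


def add_selection(n_features: int, score: Score, target: float) -> List[int]:
    """Add: include, one at a time, the feature that raises the score most."""
    selected: List[int] = []
    remaining = list(range(n_features))
    while remaining and (not selected or score(selected) < target):
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def del_selection(n_features: int, score: Score, tolerance: float = 0.0) -> List[int]:
    """Del: drop features one at a time while the score of the full set is preserved."""
    selected = list(range(n_features))
    baseline = score(selected)
    while len(selected) > 1:
        # Feature whose removal hurts the score least.
        candidate = max(selected, key=lambda f: score([g for g in selected if g != f]))
        reduced = [g for g in selected if g != candidate]
        if score(reduced) < baseline - tolerance:
            break  # the previous diagnostic effect would disappear
        selected = reduced
    return selected


# Toy score (assumption): only features 0 and 2 carry diagnostic information.
def toy_score(subset: Sequence[int]) -> float:
    return 0.5 * (0 in subset) + 0.5 * (2 in subset)
```

With the toy score, both `add_selection(4, toy_score, 1.0)` and `del_selection(4, toy_score)` keep features 0 and 2. AddDel and Grad would alternate such steps, one feature or one group at a time.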

(Note that the features may be directly measured, latent, or integral; as the latter, indicators of systemic organization can be used, whose application is considered in [3, 4].)

These algorithms analyse the characteristics of the data structure, for which it is suggested to use the coefficients of pair correlation and/or the distances to the cluster centres. In this case it is recommended to apply the following criteria (quality indicators) [7]: the Dunn index, the density indices, the total hypervolume, and the Davies–Bouldin index. That is, these algorithms are applied to a small sample and, for a given value of the quality indicators, sets of linguistic variables consisting of specific symptoms are generated. The researcher specifies the “freedom of choice” of decision making: the number of sets from which the most informative ones are retained according to external criteria evaluated on the examination sample.
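As an illustration of one of the quality indicators named above, the Davies–Bouldin index can be computed from the distances to the cluster centres. The sketch below uses the standard textbook definition (mean over clusters of the worst ratio of summed scatters to centre distance); the function names are hypothetical, and the exact variant used in [7] is not reproduced here.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a list of points."""
    return [sum(coord) / len(points) for coord in zip(*points)]


def davies_bouldin(points: List[Point], labels: List[int]) -> float:
    """Davies-Bouldin index; lower means more compact, better separated clusters."""
    groups: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    centers = {lab: centroid(pts) for lab, pts in groups.items()}
    # Scatter: mean distance from the points of a cluster to its centre.
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in pts) / len(pts)
        for lab, pts in groups.items()
    }
    worst_ratios = []
    for a in groups:
        worst_ratios.append(max(
            (scatter[a] + scatter[b]) / math.dist(centers[a], centers[b])
            for b in groups if b != a
        ))
    return sum(worst_ratios) / len(worst_ratios)
```

Evaluating the index for each candidate feature set (restricting the points to those coordinates) gives one external criterion by which the sets can be compared.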

If exploratory cluster analysis cannot be carried out, a simple and semantically transparent method is proposed: the final set of linguistic variables retains those features that have the lowest correlation with the other retained features and the highest correlation with the discarded ones.
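One possible reading of this correlation rule is a greedy filter that repeatedly discards the feature most correlated with the features still retained. The sketch below is an assumption about how the rule could be operationalized; the names `pearson` and `retain_least_correlated` and the parameter `keep` are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def retain_least_correlated(columns: List[Sequence[float]], keep: int) -> List[int]:
    """Discard, one by one, the feature most correlated with the other retained ones."""
    retained = list(range(len(columns)))
    while len(retained) > keep:
        def redundancy(i: int) -> float:
            # Mean absolute correlation of feature i with the other retained features.
            others = [j for j in retained if j != i]
            return sum(abs(pearson(columns[i], columns[j])) for j in others) / len(others)
        retained.remove(max(retained, key=redundancy))
    return retained
```

Each discarded feature is, by construction, strongly correlated with something that remains, so little information is lost when it is dropped.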

To decide whether a feature should be included in the informative set, it is proposed to use the decision-making methodology of T.L. Saaty [14]. A preference matrix W is created whose elements wi,j take one of nine grades (feature i is preferred to feature j): wi,j = 1 – equal preference; wi,j = 2 – a low degree of preference; wi,j = 3 – medium preference; wi,j = 4 – above-average preference; wi,j = 5 – moderately strong preference; wi,j = 6 – strong preference; wi,j = 7 – very strong (obvious) preference; wi,j = 8 – very, very strong preference; wi,j = 9 – absolute preference.

The matrix can then be transformed so as to group the features into clusters of preference by means of the IJ-transformation: row I is swapped with row J of the modified preference matrix so that the elements with the highest values cluster around the main diagonal. The permutation process stops when the sum of the products of each element of the modified preference matrix W* and the distance of that element from the main diagonal reaches its minimum, according to the formula:

Artemenko01.wmf (1)

where Artemenko02.wmf N – the number of analysed characteristics before selection.
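The stopping criterion described in the text can be sketched as follows. Since formula (1) is not reproduced here, the sketch assumes that the distance of element (i, j) from the main diagonal is |i − j| and that swapping two features permutes both their rows and their columns; both choices, and all function names, are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def diagonal_cost(w: Matrix) -> float:
    """Sum of w[i][j] * |i - j|: small when large elements sit near the diagonal."""
    n = len(w)
    return sum(w[i][j] * abs(i - j) for i in range(n) for j in range(n))


def swap(w: Matrix, a: int, b: int) -> Matrix:
    """Swap features a and b: rows a, b and (assumption) the matching columns."""
    order = list(range(len(w)))
    order[a], order[b] = order[b], order[a]
    return [[w[i][j] for j in order] for i in order]


def cluster_by_swaps(w: Matrix) -> Matrix:
    """Greedy permutation: apply any swap that lowers the cost until none does."""
    improved = True
    while improved:
        improved = False
        for a in range(len(w)):
            for b in range(a + 1, len(w)):
                candidate = swap(w, a, b)
                if diagonal_cost(candidate) < diagonal_cost(w):
                    w, improved = candidate, True
    return w
```

The greedy loop terminates because every accepted swap strictly lowers the cost; it finds a local minimum, which need not be the global one for larger matrices.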

The degrees of preference are proposed to be determined by ordering the features by ranks of informativeness in descending order. The informativeness rank of an SPDR diagnostic feature is proposed to be determined in one of the following ways (or in all of them, using known algorithms of decision making over several alternatives).

Method 1. By the maximum gradient of functional differences (MGR), with or without taking into account the latent integral indicator of the systemic organization of functional states (proposed and validated by the school of A.V. Zavyalov [3]).

Method 2. By analysing the structure and the parameters of the approximating Gabor polynomial [15].

Method 3. By analysing the structure of Boolean functions obtained by applying the algorithms and software of logical artificial neural networks [5].

Method 4. By indicators of clustering quality [7].

In the first proposed method, a matrix of pairwise connectivity between the features (for example, the absolute value of the Pearson correlation coefficient) is built for each alternative class; its elements are set to zero if the calculated value is below a certain threshold. For the features that are candidates for inclusion in the informative tuple of linguistic variables, the numbers of links in the classes, Artemenko03.wmf and Artemenko04.wmf, are determined and the differences (gradients) Artemenko05.wmf are calculated; the features i are then arranged in descending order of Ksi. For the ordered set of indicators, the ranks Rni are found by the formula:

Artemenko06.wmf (2)
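The first steps of this method (thresholded pairwise correlations, link counts per class, and ordering by their difference) can be sketched as below. The absolute difference of link counts is an assumed form of the gradient, the threshold of 0.7 is arbitrary, and the function names are hypothetical; the rank formula (2) itself is not reproduced.

```python
import math
from typing import List, Sequence

Columns = List[Sequence[float]]  # one series of observations per feature


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def link_counts(columns: Columns, threshold: float) -> List[int]:
    """For each feature, the number of other features with |r| >= threshold."""
    n = len(columns)
    return [
        sum(1 for j in range(n)
            if j != i and abs(pearson(columns[i], columns[j])) >= threshold)
        for i in range(n)
    ]


def order_by_link_gradient(class0: Columns, class1: Columns,
                           threshold: float = 0.7) -> List[int]:
    """Feature indices in descending order of |K1_i - K0_i| (assumed gradient)."""
    k0 = link_counts(class0, threshold)
    k1 = link_counts(class1, threshold)
    gradient = [abs(a - b) for a, b in zip(k1, k0)]
    return sorted(range(len(gradient)), key=lambda i: gradient[i], reverse=True)
```

Features whose correlation structure changes most between the two classes come first in the resulting order and would receive the highest ranks.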

The vector {Rn} is used to build the preference matrix W, whose element values are determined either according to the gradation proposed by T.L. Saaty (presented earlier), or by a knowledge engineer, or automatically by the formula

Artemenko07.wmf (3)

where Artemenko08.wmf Artemenko09.wmf Artemenko10.wmf wi,i = 9.

The second method of forming the preference matrix of feature informativeness uses the approximating Gabor polynomial, formula (4). As the degree of the polynomial grows, the accuracy with which it approximates the function first increases and then decreases, which makes the polynomial suitable for the self-organizing algorithms of the group method of data handling (GMDH) [11, 12]. Note that GMDH can handle small samples and build the Gabor polynomial at interpolation nodes whose number is smaller than the maximum degree of the polynomial.

Artemenko11.wmf (4)

where Z = {z1, z2, …, zN} is the set of arguments; Y(Z) is the response function (approximant); L is the number of terms in the polynomial; Ak, pi,k are the identified model parameters; N is the number of arguments.
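Given the symbols just defined, a polynomial of this kind can be evaluated as a sum of L terms, each a coefficient Ak multiplied by powers pi,k of the arguments. The sketch below assumes exactly this sum-of-products form; it illustrates how the parameters enter the model and is not a reconstruction of formula (4). The function name and the example coefficients are hypothetical.

```python
from typing import Sequence


def gabor_polynomial(z: Sequence[float], coeffs: Sequence[float],
                     powers: Sequence[Sequence[int]]) -> float:
    """Y(Z) = sum_k A_k * prod_i z_i ** p_{i,k} (assumed form of the approximant)."""
    total = 0.0
    for a_k, p_k in zip(coeffs, powers):
        term = a_k
        for z_i, p_ik in zip(z, p_k):
            term *= z_i ** p_ik
        total += term
    return total


# Y = 2 + 3*z1*z2 - z2**2: L = 3 terms over N = 2 arguments.
y = gabor_polynomial([2.0, 3.0], [2.0, 3.0, -1.0], [[0, 0], [1, 1], [0, 2]])
```

In GMDH the structure (which exponent rows appear) and the coefficients are identified together, so `powers` and `coeffs` would be outputs of the identification rather than inputs chosen by hand.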

The informativeness of an indicator from the set {X} is proposed to be determined by the following methods:

The first method is based on nonlinear discriminant functions identified for the classes w1 and w0 (“ill” vs “not ill”, “condition 1” vs “condition 2”; that is, a binary hierarchical decision tree is assumed). Following the recommendations of [6], the response function for class w0 is assigned values that lie in the range (–1 ± e) and are uniformly distributed (Artemenko12.wmf, where N0, N1 are the sizes of the training samples for classes w0 and w1, respectively). The response for class w1 is formed similarly in the range (1 ± e). Structural-parametric identification of the polynomial (4) is then carried out with the orthogonal GMDH algorithm.

Next, the share of influence of each term of the formula is determined in each class:

Artemenko13.wmf (5)

where the operator Artemenko14.wmf denotes the modal value of ZZ.

Then, for each argument included in the k-th term, the weight of the multiplicand is calculated by the formula

Artemenko15.wmf (6)

Finally, the additive-multiplicative effect of indicator xi on the response function (according to the parameters of the discriminant approximant) is determined for each alternative class by the formula

Artemenko16.wmf (7)

A relative “difference” error ε < 0.5 is introduced (the recommended range is 0.01 ≤ ε < 0.1), and the values of the multiplicative effects are recalculated into Artemenko17.wmf by formula (8):

Artemenko18.wmf Artemenko19.wmf j ≠ i. (8)

Next, for each class (w1 and w0) the features (linguistic variables) are arranged in descending order of the values of Artemenko20.wmf. As a result, two tuples of features are formed for the classes: Artemenko21.wmf and Artemenko22.wmf. Applying formula (2) to the obtained tuples, with Gi replaced by Artemenko23.wmf, yields two sets of ranks, Artemenko24.wmf and Artemenko25.wmf.

From Artemenko26.wmf and Artemenko27.wmf the final set of informative features is formed, of a size NI ≤ N specified by the researcher. It consists of elements Artemenko28.wmf taken from the original set {X} according to Artemenko29.wmf and Artemenko30.wmf, which are merged sequentially in descending order of ranks. When alternative inclusions are possible, one of the following is applied: “hand control” (the knowledge and experience of the researcher), the Monte Carlo method, or a reduction of ε followed by a repetition of the ranking procedure.

The informativeness Inf(xj) of a feature is proposed to be determined by the formula

Artemenko31.wmf (9)

where Artemenko32.wmf are the rank values of feature xj in w0 and w1, respectively.

The second method of forming the set of informative indicators and calculating Inf(xj) is based on the preliminary identification of the approximating Gabor polynomial (4) for each indicator of the initial set {X}. In this case the identification procedure is repeated N times for each class w0 and w1, sequentially forming the set {Z} = {X} – xj and the response Y(Z) = xj.

As a result, sets of approximants are generated for the alternative classes:

Artemenko33.wmf and Artemenko34.wmf

(M0 ≤ N, M1 ≤ N, M0 ≠ 0, M1 ≠ 0).

It should be noted that approximants whose coefficient of determination is below a threshold set by the researcher are excluded from further analysis. If the selection yields an empty set of approximants, the approximants with the highest coefficients of determination are returned to it one by one. The minimum size of the set of approximants is the “freedom of choice” (the recommended value is 3 to 7).
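The selection rule for approximants can be sketched as a filter with a fallback. The sketch assumes that the coefficients of determination have already been computed and that the “freedom of choice” acts as a minimum size of the retained set; the function name, the dictionary keys, and the numbers in the example are hypothetical.

```python
from typing import Dict, List


def select_approximants(r2: Dict[str, float], threshold: float,
                        freedom_of_choice: int = 3) -> List[str]:
    """Keep approximants with R^2 >= threshold; if too few remain, return the best
    rejected ones one by one until `freedom_of_choice` are kept (or none are left)."""
    ranked = sorted(r2, key=r2.get, reverse=True)
    kept = [name for name in ranked if r2[name] >= threshold]
    for name in ranked:
        if len(kept) >= freedom_of_choice:
            break
        if name not in kept:
            kept.append(name)
    return kept


# Hypothetical determination coefficients of the approximants Y(Z) = xj in one class.
example = {"x1": 0.9, "x2": 0.4, "x3": 0.6, "x4": 0.2}
chosen = select_approximants(example, threshold=0.8, freedom_of_choice=3)
```

Here only `x1` passes the threshold, so `x3` and `x2` are returned to the set in order of decreasing determination coefficient.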

Next, for each alternative class the matrices Artemenko35.wmf and Artemenko36.wmf are formed. Their numbers of rows are M0 and M1, respectively, the number of columns equals the number of indicators in the set {X}, and the element values are calculated by formulas similar to (5)–(8). From the resulting matrices two vectors, Artemenko37.wmf and Artemenko38.wmf (one for each class), are formed, whose values are calculated by formulas (10):

Artemenko39.wmf

Artemenko40.wmf (10)

For each class (w1 and w0) the indicators xi are sorted in descending order of the values of Artemenko41.wmf. Thus two tuples of indicators are formed for the alternative classes: Artemenko42.wmf and Artemenko43.wmf.

Setting ε, forming the tuples, applying formula (2), forming the set of informative features Artemenko44.wmf, and calculating the informativeness then proceed as in the first method.

In method 3 the linguistic variables take the values “true” (1) or “false” (0). With a certain accuracy (diagnostic performance in medical applications), the approximant of the response is represented by formula (11) (the indices and variables have counterparts in (2)).

Artemenko45.wmf (11)

where Artemenko46.wmf zb ∈ {ZB} – logic exception.

To make the approaches described in method 1 and method 2 applicable, one proceeds from (11) to an analogue of the Gabor polynomial YB*(ZB*) for Boolean functions in the form of formula (12), based on arithmetic analogues of the logical functions.

Artemenko47.wmf

pk = {0, 1}, Artemenko48.wmf (12)

Formulas (5)–(10) and the conclusions that follow from them are then applied.
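The idea of replacing logical functions by arithmetic analogues can be illustrated with the common arithmetization of Boolean operations on 0/1 values. This is a well-known construction shown for orientation only; whether formula (12) uses exactly these analogues is an assumption, and the function names are hypothetical.

```python
from itertools import product


def b_not(a: int) -> int:
    """Arithmetic analogue of NOT for a in {0, 1}."""
    return 1 - a


def b_and(a: int, b: int) -> int:
    """Arithmetic analogue of AND: the product of the arguments."""
    return a * b


def b_or(a: int, b: int) -> int:
    """Arithmetic analogue of OR: a + b - a*b."""
    return a + b - a * b


# Check the analogues against Python's logical operators on every 0/1 combination.
for a, b in product((0, 1), repeat=2):
    assert b_and(a, b) == int(bool(a) and bool(b))
    assert b_or(a, b) == int(bool(a) or bool(b))
    assert b_not(a) == int(not a)


# Example rule: (z1 AND NOT z2) OR z3, written as a polynomial in 0/1 arguments.
def rule(z1: int, z2: int, z3: int) -> int:
    return b_or(b_and(z1, b_not(z2)), z3)
```

Expanding `rule` gives a polynomial in which every argument has exponent 0 or 1, in line with the condition pk = {0, 1}, so the term-weight calculations defined for the Gabor polynomial carry over.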

Method 4 proposes ordering the features (linguistic variables), followed by calculation of the ranks, inclusion in the informative tuple, and calculation of informativeness as in the previously discussed methods, on the basis of the hypervolume H (and/or the partition density index PD) considered in [7]. An exploratory clustering procedure is carried out, and the change in clustering quality Artemenko49.wmf caused by excluding the analysed feature from consideration is calculated by the formula

Artemenko50.wmf (13)

where Artemenko51.wmf, Artemenko52.wmf are the covariance matrices of the classes w0 and w1 for the initial set {X}; Artemenko53.wmf, Artemenko54.wmf are the correlation matrices of the classes w0 and w1 for the set {{X} – xj} (feature xj excluded); det( ) denotes the determinant of a matrix.

Here, covariance matrices are understood as the matrices calculated by the formulas

Artemenko55.wmf

Artemenko56.wmf (14)

where N0, N1 are the numbers of objects in classes w0 and w1, respectively; Artemenko57.wmf, Artemenko58.wmf are the coordinate vectors of the i-th object in the respective clusters; Artemenko59.wmf, Artemenko60.wmf are the coordinate vectors of the centres of the classes w0 and w1.

Note that Artemenko61.wmf can take both positive and negative values; the latter means that after the exclusion the quality of the classification according to the total hypervolume H has deteriorated.
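The calculation in method 4 can be sketched as follows, reading the quality index H of [7] as a total hypervolume, that is, the sum over classes of the square root of the determinant of the class covariance matrix. Because formulas (13) and (14) are not reproduced here, the covariance normalization by N, the form of the change (H with all features minus H without feature xj), and all function names are assumptions.

```python
import math
from typing import List, Sequence

Rows = List[Sequence[float]]  # one row per object, one column per feature


def covariance(rows: Rows) -> List[List[float]]:
    """Covariance matrix about the class centre, normalized by the class size."""
    n, m = len(rows), len(rows[0])
    center = [sum(r[k] for r in rows) / n for k in range(m)]
    return [[sum((r[a] - center[a]) * (r[b] - center[b]) for r in rows) / n
             for b in range(m)] for a in range(m)]


def det(matrix: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    result = 1.0
    for col in range(len(a)):
        pivot = max(range(col, len(a)), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, len(a)):
            factor = a[r][col] / a[col][col]
            for c in range(col, len(a)):
                a[r][c] -= factor * a[col][c]
    return result


def hypervolume(class0: Rows, class1: Rows) -> float:
    """H = sqrt(det(S0)) + sqrt(det(S1)); smaller means more compact classes."""
    return sum(math.sqrt(max(det(covariance(c)), 0.0)) for c in (class0, class1))


def quality_change(class0: Rows, class1: Rows, j: int) -> float:
    """Assumed change in quality: H with all features minus H without feature j."""
    def drop(rows: Rows) -> Rows:
        return [[v for k, v in enumerate(r) if k != j] for r in rows]
    return hypervolume(class0, class1) - hypervolume(drop(class0), drop(class1))
```

Under this sign convention a negative value means that H grew after the exclusion, that is, the classes became less compact. Hypervolumes of spaces with different numbers of features are not strictly comparable, so the sketch only illustrates the bookkeeping, not a calibrated criterion.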

The disadvantage of this method is that the exclusion of a feature is analysed for that feature alone rather than jointly with other tuples of features. A complete enumeration of the different variants would in this case require large computational resources, usually with a negligible change in diagnostic quality (or none at all) in the end.

In conclusion, we note that:

1. In the proposed methods, the informativeness of a feature is determined for each “branching” of the decision tree that assigns the object or process to alternative classes. Thus it is proposed to move from the paradigm of defining a single tuple of linguistic variables that is equally informative for the full set of alternative classes (with the subsequent synthesis of diagnostic rules) to the paradigm of determining an informative basis for each hierarchical differential division.

2. If in formula (2) one moves from Artemenko62.wmf to Artemenko63.wmf, then one passes from binary feature values to interval estimates of the feature values: membership functions in fuzzy sets or belief functions in decision theory.

Thus, in the course of the study, new nonparametric methods were developed for forming informative tuples that describe the observable and/or controllable features (linguistic variables) of a biological object (recorded, calculated, and latent, in numeric and logical metrics). Under conditions of semi-structured, imprecise data, these methods make it possible to synthesize the diagnostic decision rules needed for the knowledge bases of decision support systems in various segments of the automation of the intellectual activity of decision makers on the basis of modern computer and information technologies.