Optimal classification

From Wikipedia, the free encyclopedia

Optimal classification is an arrangement of attributes for a set of elements in an Attribute-value system which minimizes the number of attribute queries necessary to identify a particular element within an element set.

The algorithm used for this purpose first sorts the elements in ascending order according to the value of each attribute as determined by its current position in the sequence of attributes. An empirical separatory value is then calculated for each characteristic in the target set and compared to the highest separatory value calculated for all preceding sets. This process repeats until an optimal sequence is found, all permutations have been evaluated or a time limit expires.

The number of elements, which are classified as well as the rapidity with which they can be identified, is dependent upon both the radix and the exponent of the data. A larger radix will allow faster identification by means of excluding a greater percentage of elements per characteristic. A binary radix for instance excludes only fifty percent of the elements per characteristic whereas a five-valued radix excludes eighty percent of the elements per characteristic.

What follows is a rigorous explanation of the algorithm.

[edit] Truth Table Related Equations

G = V C

, where:

G is the group size or total number of elements in the group,
V is the highest value of logic in the group,
C is the highest number of characteristics in the group.

R = V K

, where:

R is the set size or total number of elements in the set,
V is the highest value of logic in the group,
K is the highest number of characteristics in the set.

[edit] Element Related Equations

[edit] Maximum Number of Pairs of Elements to Separate

Maximum number of pairs of elements to separate refers to a matrix in which each element is compared with every other element to determine the number of pairs that are separable or disjoint. Pairs are separable or disjoint whenever the logic values of the elements that make up a pair are different. In theory, therefore the maximum possible numbers of pairs that can be separated is determined by the following equation:

$p_{max} = \frac{\left[{G (G-1)}\right]}{2}$ , where:

p_max is the maximum number of pairs to separate, and
G is the total number of elements in the group.

[edit] Order of Elements

The elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.

$e_i = \left[\sum_{j=0}^C \left[v_{i,j} V^{(C-j)}\right]\right]$ , where:

e_i is the element truth table value in the group,
C is the highest number of characteristics in the group,
V is the highest value of logic in the group,
v is the value of logic of each characteristic in the group,
j is the jth characteristic index, where:

j = 0..K and where:

K is the number of characteristics in the set,

i is the ith element index, where:

i = 0..G and where:

G is the number of elements in the group.

[edit] Characteristic Related Equations

[edit] Theoretical Separation

[edit] The General Identification Equation

$S_j = \frac{1-{V^{-j}}}{1-V^{-C}}$ , where:

S_j is the theoretical separatory value per jth characteristic,
C is the highest number of characteristics in the group,
V is the highest value of logic in the group and
j is the jth characteristic index in the set, where:

j = 0..K and where:

K is the number of characteristics in the target set.

[edit] Minimal Number of Characteristics to Result in Theoretical Separation

$t_{min} = \frac{\log G}{\log V}$ , where:

t_min is the minimal number of characteristics to result in theoretical separation,
G is the number of elements in the group and
V is the highest value of logic in the group.

[edit] Empirical Separation

(Please note that application of these equations can not be expressed entirely using conventional mathematical notation or normal MathCAD expressions without including a MathCAD function or MathCAD user program. The equations have been fully implemented using the Zbasic and the Visual Basic programming languages.)

$S_j = \frac{\left[(G^{2})-\sum_{i=0}^{R} n_i^{2}\right]}{2}$ , where:

S_j is the empirical separatory value,
j is the jth characteristic index,
G is the number of elements in the group,
R is the number of elements in the set, where:

R = V^K, where:

V is the highest value of logic in the group and
K is the number of characteristics in the set.

i is the truth table index value of the set, where:

i = 0..R and

$n_i = \sum_{j=0}^K v_{i,j} V^{(K-j)}$ , where:

$n i$ is the truth table value of the ith element in the set
v is the logic value of the jth characteristic.

[edit] Flag Recognition - Application Example

(A flag identification example from Neural Network Identification example.)

Although this example lacks an intuitive sense of optimization it still serves as a good example of how optimization can reduce the number of queries which must be made.

[edit] Flag Overlay Grid

Designated areas (characteristics) for sampling background colors (states) of all flags (elements).

The overlay is used to determine the color of each area for each flag and the color is recorded in the table as the logical state of the area. The table data is then submitted to the optimization program and processed until an optimal empirical separatory value is obtained.

[edit] Original Order

FLAGS/LOC	A	B	C	D	E	F	G	H	I
Belgium	BLACK	YELLOW	ORANGE	BLACK	YELLOW	ORANGE	BLACK	YELLOW	ORANGE
France	BLUE	WHITE	RED	BLUE	WHITE	RED	BLUE	WHITE	RED
Germany	BLACK	BLACK	BLACK	RED	RED	RED	YELLOW	YELLOW	YELLOW
Ireland	GREEN	WHITE	ORANGE	GREEN	WHITE	ORANGE	GREEN	WHITE	ORANGE
Italy....	GREEN	WHITE	RED	GREEN	WHITE	RED	GREEN	WHITE	RED
Japan..	WHITE	WHITE	WHITE	WHITE	RED	WHITE	WHITE	WHITE	WHITE
Luxembourg	RED	RED	RED	WHITE	WHITE	WHITE	BABY	BABY	BABY
Netherlands	RED	RED	RED	WHITE	WHITE	WHITE	BLUE	BLUE	BLUE
Spain..	RED	RED	RED	YELLOW	YELLOW	YELLOW	RED	RED	RED

[edit] Systematic Query

Starting with area "A" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query in regard to area "A" is RED which would remove 2/3 of the flags from further consideration. The next query for the color in area "B" would be RED which would serve to eliminate none of the remaining flags. In fact, since the colors in columns "D", "E" and "F" are the same for each remaining flag, we would not be able to eliminate any remaining flags until column "G" where the color BLUE would provide a unique answer to the final necessary query. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries using the systematic query method to establish the identity of the flag in our possession as belonging to the Netherlands.

[edit] Optimized Order

FLAGS/LOC	G	F	C	B	I	D	E	H	A
Spain..	RED	YELLOW	RED	RED	RED	YELLOW	YELLOW	RED	RED
Luxembourg	BABY	WHITE	RED	RED	BABY	WHITE	WHITE	BABY	RED
Japan..	WHITE	WHITE	WHITE	WHITE	WHITE	WHITE	RED	WHITE	WHITE
Italy....	GREEN	RED	RED	WHITE	RED	GREEN	WHITE	WHITE	GREEN
Ireland	GREEN	ORANGE	ORANGE	WHITE	ORANGE	GREEN	WHITE	WHITE	GREEN
Germany	YELLOW	RED	BLACK	BLACK	YELLOW	RED	RED	YELLOW	BLACK
Netherlands	BLUE	WHITE	RED	RED	BLUE	WHITE	WHITE	BLUE	RED
France	BLUE	RED	RED	WHITE	RED	BLUE	WHITE	WHITE	BLUE
Belgium	BLACK	ORANGE	ORANGE	YELLOW	ORANGE	BLACK	YELLOW	YELLOW	BLACK
Theoretical	85.71	97.96	99.71	99.96	99.99	100	.	.	.
Empirical	94.44	100	.	.	.	.	.	.	.
Best Sequence	7	6	3	2	9	4	5	8	1

[edit] Minimized Query

The results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F" to which we would answer ORANGE. Since no other flags have this combination of GREEN and ORANGE in these areas our query can end here. The minimized query algorithm has optimized the order of characteristics and minimized the number of queries that are required to identify the flag. (Please note that there may be more than one optimal solution.)

[edit] References

[edit] Primary Reference

Biological Identification with Computers edited by R.J. Pankhurst, British museum (natural history) London, England proceedings of a meeting held at Kings College, Cambridge 27 and 28 September 1973 of the Systematics Association Special Volume Number 7 and published by the Academic Press 1975 noting the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification."

Eugene Weston Rypka passed away on April 27, 2006. Gene was born on May 6, 1925 in Owatonna, MN to Charles Frederick and Ethel Marie Rypka. He served in World War II as a paramedic in Iwo Jima and received several medals and commendations. In 1958, Gene received a Ph.D. in Medical Microbiology from Stanford University. He had a long and distinguished career, including work with Russian scientists at Lovelace Medical Center and the University of New Mexico. Bicycle racing was a lifetime love and occupation, and in later years, he also studied martial arts.

[edit] See also

[edit] Potential applications

Medical, legal and other discipline troubleshooting charts.

[edit] External links