Optimal classification

From Wikipedia, the free encyclopedia

Optimal classification is an arrangement of attributes for a set of elements in an Attribute-value system which minimizes the number of attribute queries necessary to identify a particular element within an element set.

The algorithm used for this purpose first sorts the elements in ascending order according to the value of each attribute as determined by its current position in the sequence of attributes. An empirical separatory value is then calculated for each characteristic in the target set and compared to the highest separatory value calculated for all preceding sets. This process repeats until an optimal sequence is found, all permutations have been evaluated or a time limit expires.

The number of elements, which are classified as well as the rapidity with which they can be identified, is dependent upon both the radix and the exponent of the data. A larger radix will allow faster identification by means of excluding a greater percentage of elements per characteristic. A binary radix for instance excludes only fifty percent of the elements per characteristic whereas a five-valued radix excludes eighty percent of the elements per characteristic.

What follows is a rigorous explanation of the algorithm.

Contents

[edit] Truth Table Related Equations

G = VC, where:
  • G is the group size or total number of elements in the group,
  • V is the highest value of logic in the group,
  • C is the highest number of characteristics in the group.
R = VK, where:
  • R is the set size or total number of elements in the set,
  • V is the highest value of logic in the group,
  • K is the highest number of characteristics in the set.

[edit] Element Related Equations

[edit] Maximum Number of Pairs of Elements to Separate

Maximum number of pairs of elements to separate refers to a matrix in which each element is compared with every other element to determine the number of pairs that are separable or disjoint. Pairs are separable or disjoint whenever the logic values of the elements that make up a pair are different. In theory, therefore the maximum possible numbers of pairs that can be separated is determined by the following equation:

 p_{max} = \frac{\left[{G (G-1)}\right]}{2} , where:
  • pmax is the maximum number of pairs to separate, and
  • G is the total number of elements in the group.

[edit] Order of Elements

The elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.

 e_i = \left[\sum_{j=0}^C \left[v_{i,j} V^{(C-j)}\right]\right], where:
  • ei is the element truth table value in the group,
  • C is the highest number of characteristics in the group,
  • V is the highest value of logic in the group,
  • v is the value of logic of each characteristic in the group,
  • j is the jth characteristic index, where:
j = 0..K and where:
  • K is the number of characteristics in the set,
  • i is the ith element index, where:
i = 0..G and where:
  • G is the number of elements in the group.

[edit] Characteristic Related Equations

[edit] Theoretical Separation

[edit] The General Identification Equation
 S_j = \frac{1-{V^{-j}}}{1-V^{-C}}, where:


  • Sj is the theoretical separatory value per jth characteristic,
  • C is the highest number of characteristics in the group,
  • V is the highest value of logic in the group and
  • j is the jth characteristic index in the set, where:
j = 0..K and where:
  • K is the number of characteristics in the target set.

[edit] Minimal Number of Characteristics to Result in Theoretical Separation
 t_{min} =  \frac{\log G}{\log V}, where:
  • tmin is the minimal number of characteristics to result in theoretical separation,
  • G is the number of elements in the group and
  • V is the highest value of logic in the group.

[edit] Empirical Separation

(Please note that application of these equations can not be expressed entirely using conventional mathematical notation or normal MathCAD expressions without including a MathCAD function or MathCAD user program. The equations have been fully implemented using the Zbasic and the Visual Basic programming languages.)


 S_j = \frac{\left[(G^{2})-\sum_{i=0}^{R} n_i^{2}\right]}{2}, where:
  • Sj is the empirical separatory value,
  • j is the jth characteristic index,
  • G is the number of elements in the group,
  • R is the number of elements in the set, where:
R = V^K, where:
  • V is the highest value of logic in the group and
  • K is the number of characteristics in the set.
  • i is the truth table index value of the set, where:
i = 0..R and
 n_i = \sum_{j=0}^K v_{i,j} V^{(K-j)}, where:
  • ni is the truth table value of the ith element in the set
  • v is the logic value of the jth characteristic.

[edit] Flag Recognition - Application Example

(A flag identification example from Neural Network Identification example.)

Although this example lacks an intuitive sense of optimization it still serves as a good example of how optimization can reduce the number of queries which must be made.

[edit] Flag Overlay Grid

Designated areas (characteristics) for sampling background colors (states) of all flags (elements).


The overlay is used to determine the color of each area for each flag and the color is recorded in the table as the logical state of the area. The table data is then submitted to the optimization program and processed until an optimal empirical separatory value is obtained.

[edit] Original Order

FLAGS/LOC A B C D E F G H I
Belgium BLACK YELLOW ORANGE BLACK YELLOW ORANGE BLACK YELLOW ORANGE
France BLUE WHITE RED BLUE WHITE RED BLUE WHITE RED
Germany BLACK BLACK BLACK RED RED RED YELLOW YELLOW YELLOW
Ireland GREEN WHITE ORANGE GREEN WHITE ORANGE GREEN WHITE ORANGE
Italy.... GREEN WHITE RED GREEN WHITE RED GREEN WHITE RED
Japan.. WHITE WHITE WHITE WHITE RED WHITE WHITE WHITE WHITE
Luxembourg RED RED RED WHITE WHITE WHITE BABY BABY BABY
Netherlands RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE
Spain.. RED RED RED YELLOW YELLOW YELLOW RED RED RED

[edit] Systematic Query

Starting with area "A" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query in regard to area "A" is RED which would remove 2/3 of the flags from further consideration. The next query for the color in area "B" would be RED which would serve to eliminate none of the remaining flags. In fact, since the colors in columns "D", "E" and "F" are the same for each remaining flag, we would not be able to eliminate any remaining flags until column "G" where the color BLUE would provide a unique answer to the final necessary query. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries using the systematic query method to establish the identity of the flag in our possession as belonging to the Netherlands.

[edit] Optimized Order

FLAGS/LOC G F C B I D E H A
Spain.. RED YELLOW RED RED RED YELLOW YELLOW RED RED
Luxembourg BABY WHITE RED RED BABY WHITE WHITE BABY RED
Japan.. WHITE WHITE WHITE WHITE WHITE WHITE RED WHITE WHITE
Italy.... GREEN RED RED WHITE RED GREEN WHITE WHITE GREEN
Ireland GREEN ORANGE ORANGE WHITE ORANGE GREEN WHITE WHITE GREEN
Germany YELLOW RED BLACK BLACK YELLOW RED RED YELLOW BLACK
Netherlands BLUE WHITE RED RED BLUE WHITE WHITE BLUE RED
France BLUE RED RED WHITE RED BLUE WHITE WHITE BLUE
Belgium BLACK ORANGE ORANGE YELLOW ORANGE BLACK YELLOW YELLOW BLACK
Theoretical 85.71 97.96 99.71 99.96 99.99 100 . . .
Empirical 94.44 100 . . . . . . .
Best Sequence 7 6 3 2 9 4 5 8 1

[edit] Minimized Query

The results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F" to which we would answer ORANGE. Since no other flags have this combination of GREEN and ORANGE in these areas our query can end here. The minimized query algorithm has optimized the order of characteristics and minimized the number of queries that are required to identify the flag. (Please note that there may be more than one optimal solution.)

[edit] References

[edit] Primary Reference

Biological Identification with Computers edited by R.J. Pankhurst, British museum (natural history) London, England proceedings of a meeting held at Kings College, Cambridge 27 and 28 September 1973 of the Systematics Association Special Volume Number 7 and published by the Academic Press 1975 noting the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification."

  • Eugene Weston Rypka passed away on April 27, 2006. Gene was born on May 6, 1925 in Owatonna, MN to Charles Frederick and Ethel Marie Rypka. He served in World War II as a paramedic in Iwo Jima and received several medals and commendations. In 1958, Gene received a Ph.D. in Medical Microbiology from Stanford University. He had a long and distinguished career, including work with Russian scientists at Lovelace Medical Center and the University of New Mexico. Bicycle racing was a lifetime love and occupation, and in later years, he also studied martial arts.

[edit] See also

[edit] Potential applications

Medical, legal and other discipline troubleshooting charts.

[edit] External links