We ask that comments and discussion on this series be made in the Yahoo groups forum, eventChemistry.
APC1 – On enumeration by the human of the cell values over an event class
APC2 – Minimal Voting Procedure (original paper)
APC3 – Using the MVP to route information
APC4 – Using eventChemistry to improve the framework specification
Action Perception Cycles 2
Minimal Voting Procedure
Directly from the Appendix, Foundations of Knowledge Science
Description of the Minimal Voting Procedure (MVP)
First Published, in Russia: 1997
(original notation)
Appendix: Description of the Minimal Voting Procedure (MVP)
To instantiate a voting procedure, we need the following triple < C, O1, O2 >:
· A set of categories C = { Cq } as defined by a training set O1.
· A means to produce a document representational set for members of O1.
· A means to produce a document representational set for members of a test set, O2.
We assume that we have a training collection O1 with m document passages,
O1 = { d1 , d2 , . . . , dm }
Documents that are not single
passages can be substituted here. The notion introduced above can be
generalized to replace documents with a more abstract notion of an
"object".
Objects
O = { O1
, O2 , . . . , Om }
can be documents, semantic passages
that are discontinuously expressed in the text of documents, or other classes
of objects, such as electromagnetic events, or the coefficients of spectral
transforms.
Some representational procedure is used to compute an "observation" Dr about the semantics of the passages. The subscript r is used to remind us that various types of observations are possible and that each of these may result in a different representational set. For linguistic analysis, each observation produces a set of theme phrases. We use the following notation to indicate this:
Dr : di → { t1 , t2 , . . . , tn }
This notation is read "the observation Dr of the passage di produces the representational set { t1 , t2 , . . . , tn }".
We now combine these passage level representations to form a category representation.
· Each "observation", Dr, of the passages in the training set O1 has a "set" of theme phrases
Dr : di → Tk = { t1 , t2 , . . . , tn }
· Let A be the union of the individual passage representational sets Tk,
A = ∪ Tk.
This set A is the representation set for the complete training collection O1.
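As a sketch of the construction so far (in Python): the `observe` function below is an invented stand-in for a real theme-phrase extractor (the observation Dr), and the sample passages are purely illustrative.

```python
# Minimal sketch of the representational sets.  The "observation" D_r is
# modeled here as a simple word-set extractor; any real theme-phrase
# extractor could be substituted without changing the structure.
def observe(passage):
    """D_r : d_i -> { t_1, ..., t_n } (here: the set of lowercased words)."""
    return set(passage.lower().split())

# Training collection O1 = { d_1, ..., d_m } (illustrative passages)
O1 = ["the cat sat", "the dog ran", "a cat ran"]

# Per-passage representational sets T_k
T = [observe(d) for d in O1]

# A = union of all T_k : the representation set for the whole collection
A = set().union(*T)
print(sorted(A))
```

Any richer extraction (noun phrases, themes) slots in by replacing `observe`; the union step is unchanged.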
· The overlap between category representations Tq and Ts is one statistical measure of the "cognitive entanglement" between category q and category s. This fact will lead to a method for identifying the minimal intersections of structural features of the category representational sets.
J. S. Mill’s logics rely on the discovery of meaningful subsets of representational elements. The first principles of J. S. Mill’s argumentation are:
1. that negative evidence should be acquired as well as positive evidence
2. that a bi-level argumentation should involve a decomposition of passages and categories into a set of representational phrases
3. that the comparison of passage and category representation should generalize (provide the grounding for computational induction) from the training set to the test set.
It is assumed that each "observation",
Dk, of the test set O2 is composed from a "set" of
basic elements, in this case the theme phrases in A. Subsets of the set
are composed, or aggregated, into wholes that are meaningful in a context that
depends only statistically on the characteristics of basic elements.
The general framework provides for
situational reasoning and computational argumentation about natural systems.
For the time being, it is assumed that the set of basic elements is the full phrase representational set
A = ∪ Tk
for the training collection O1.
We introduce the notation for the derived representational sets T*q for each Cq , q = 1, . . . , n.
Given the data:
T*q for each Cq , q = 1, . . . , n, and
the representational sets Tk , from the observations Dk, for each passage, dk, from the test set O2,
we generate the hypothesis that the observation Dk should be categorized into category q.
This hypothesis is voted on using each phrase in the representational set for Dk, by making the following inquiries for each element ti of the representational set Tk:
1. does an observation of a passage, Dk, have the property p, where p is the property that this specific representational element, ti , is a member of the representational set T*q for category q?
2. does an observation of a passage, Dk, have the property p, where p is the property that this specific representational element, ti , is not a member of the representational set T*q for category q?
Truth of the first inquiry produces
a positive vote, from the single passage level representational element, that the
passage is in the category. Truth of the second inquiry produces a negative
vote, from the single representational element, that the passage is not
in the category. These votes are tallied.
Data structure for recording the
votes
For each passage, dk , we define the matrix Ak as a rectangular matrix of size m x h, where m is the size of the specific passage representational set Tk, and h is the number of categories. The passages are indexed by k; each passage has its own matrix.
Each element ti of Tk,
will get to vote for or against the hypothesis that this kth passage should be
in the category having the category representational set T*q.
Thus Ak is defined by the rule (where column j corresponds to category representational set T*j):
ai,j = +1 if the phrase ti is in T*j
or
ai,j = -1 if the phrase ti is not in T*j
Matrix Ak is used to store the individual +/- votes placed by each agent (i.e., each representational element of the phrase representation of the passage).
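A minimal Python sketch of this data structure, with invented phrase and category sets standing in for Tk and the T*q:

```python
# Sketch: the vote matrix A_k for one test passage.  Tk is the passage's
# representational set; Tstar is the list of category representational
# sets T*_1 ... T*_h.  All concrete values below are illustrative.
def vote_matrix(Tk, Tstar):
    """Return A_k as a list of rows: a[i][j] = +1 if phrase t_i is in
    T*_j (a positive vote), otherwise -1 (a negative vote)."""
    return [[1 if t in Tq else -1 for Tq in Tstar] for t in Tk]

Tk = ["cat", "ran"]                       # phrases observed in passage d_k
Tstar = [{"cat", "sat"}, {"dog", "ran"}]  # two category sets T*_1, T*_2
Ak = vote_matrix(Tk, Tstar)
print(Ak)   # one row per phrase, one column per category
```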
This linear model produces ties for first
place, and places a semi-order (having ties for places) on the categories by
counting discrete votes for and against the hypothesis that the document is in
that category.
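To make the ties concrete, here is a small Python sketch with an invented four-phrase, two-category vote matrix whose column tallies come out equal:

```python
# Why the linear (unweighted) model produces ties: two categories that
# receive the same net vote count end in the same place.
Ak = [[1, 1], [-1, -1], [1, -1], [-1, 1]]   # 4 phrases, 2 categories
tallies = [sum(row[j] for row in Ak) for j in range(2)]
print(tallies)   # both categories tally 0: a tie for first place
```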
A second data structure to record
weighted votes
A non-linear (weighted) model uses
internal and external weighting to reduce the probability of ties to near zero
and to account for structural relationships between themes.
Matrix Bk is defined by the rule (where column j corresponds to category representational set T*j):
bi,j = ai,j * (weight of the phrase ti in Tk) if ti is not in T*j
or
bi,j = ai,j * (weight of the phrase ti in T*j) if ti is in T*j
This difference between the two
multipliers is necessary and sufficient to break ties resulting from the linear
model (matrix Ak).
Data structure to record the
results
For each passage representation and each
category, the tally is made from the matrix Bk and stored in a
matrix C having the same number of records as the size of the document
collection, and having h columns – one column for each category.
The information in matrix C is
transformed into a matrix D having the same dimension as C. The elements of
each row in C are reordered by the tally values. To illustrate, suppose we have
only 4 categories and passage 1 tallies {-1214,-835,451,1242} for categories 1,
2, 3 and 4 respectively. So
cat1 → -1214, cat2 → -835, cat3 → 451, and cat4 → 1242.
By holding these assignments constant and ordering the elements by size of tally we have the permutation of the ordering ( 1, 2, 3, 4) to the ordering ( 4, 3, 2, 1).
( 1, 2, 3, 4) → ( 4, 3, 2, 1).
This result shows that for passage 1, the first place placement is category 4, the second place is category 3, etc. The matrix D would then have (4, 3, 2, 1) as its first row.
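The tally-and-reorder step can be sketched in Python, reusing the four-category tallies from the example above:

```python
# Sketch: ranking the tally matrix C into the ordering matrix D.
# Each row of C holds one passage's per-category tallies; the matching
# row of D lists the (1-based) categories in descending order of tally.
C = [[-1214, -835, 451, 1242]]   # tallies for passage 1, categories 1..4

D = []
for row in C:
    ranked = sorted(range(1, len(row) + 1),
                    key=lambda j: row[j - 1], reverse=True)
    D.append(ranked)
print(D[0])   # [4, 3, 2, 1]: category 4 takes first place
```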