ARL Final Report
Section 4
Study of cluster iterations and visualization
December 31, 2000
In reviewing the first three sections of this report, I find the concepts difficult, in part because we must cover so much. First, we must cover what we have discovered about the knowledge-creating methodology that Alexander Zenkin applies to the knowledge domain of number theory. Then we have to address what we think are the philosophical issues related to the notions of induction and super-induction. Finally, we must develop a scientific foundation and a plan for applying the CCG techniques and concepts to natural language parsing.
The core issues are:
1) How do we replace the concept of number with a concept grounded in semantic tokens and the linkages between these tokens?
2) How do we replace the original notion of super-induction so that the replacement has the same rigor as the notion discovered by Alexander Zenkin?
We have only begun preliminary work on the first core issue. The second core issue remains largely a statement of work that needs to be done but has not yet been done. As we neared the end of our contract period (December 2000), it became clear that the generalization of the concept of number that we need is the cluster space itself.
The clustering process itself produces states, and each state is generated from the previous state by a specific set of rules that is applied repeatedly. These rules can collectively be thought of as an axiom which, like the Peano axiom, is iterated. The visualization of a sequence of cluster spaces requires considerable work with computer programmers. However, the literature has many examples of visualization of cluster spaces (specifically, systems like Pathfinder and the Spires).
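To make the analogy concrete, the following is a minimal sketch, in Python, of a cluster process viewed as one rule set iterated from an initial state, in the way repeated application of the successor rule generates the integers. The names iterate_states, apply_rules, and initial_state are illustrative and are not part of the original method.

    def iterate_states(initial_state, apply_rules, n_steps):
        """Generate a sequence of cluster states by repeatedly applying one rule set,
        much as repeated application of the successor rule generates the integers."""
        states = [initial_state]
        for _ in range(n_steps):
            states.append(apply_rules(states[-1]))   # state_{k+1} = rules(state_k)
        return states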
The first elements of the sequence should show a random scatter of the points. Figure 1 shows a random scatter of points on a circle.
Figure 1: Random scatter of 312 points on a circle
This is the scatter part of the scatter/gather method. The gather part is generally one of the following (a sketch of both variants follows this list):
1) take two points at random, evaluate the “nearness” between the two corresponding documents, and use this metric to move the two points a little closer together or further apart on the circle;
2) take a point and a cluster at random, compute the “nearness” between the document and the cluster, and move the point closer to or further from the cluster on the circle.
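A minimal sketch of both gather variants, assuming documents are represented as term-frequency dictionaries, points are angles on a circle, and nearness is taken to be the cosine measure. The function and variable names (nearness, gather_pairwise, gather_to_cluster) are our own illustration, not part of the original method, and wrapping of angles modulo 2*pi is omitted for brevity.

    import math
    import random

    def nearness(doc_a, doc_b):
        """Cosine measure between two term-frequency dictionaries (one common choice)."""
        common = set(doc_a) & set(doc_b)
        dot = sum(doc_a[t] * doc_b[t] for t in common)
        norm = (math.sqrt(sum(v * v for v in doc_a.values()))
                * math.sqrt(sum(v * v for v in doc_b.values())))
        return dot / norm if norm else 0.0

    def gather_pairwise(angles, docs, step=0.05):
        """Variant 1: pick two points at random and nudge them together or apart."""
        i, j = random.sample(range(len(docs)), 2)
        sim = nearness(docs[i], docs[j])            # value in [0, 1]
        pull = step * (sim - 0.5)                   # similar -> attract, dissimilar -> repel
        mid = (angles[i] + angles[j]) / 2.0
        angles[i] += pull * (mid - angles[i])
        angles[j] += pull * (mid - angles[j])

    def gather_to_cluster(angles, docs, cluster_members, step=0.05):
        """Variant 2: pick one point at random and nudge it toward or away from a cluster."""
        i = random.randrange(len(docs))
        centroid_angle = sum(angles[k] for k in cluster_members) / len(cluster_members)
        centroid_doc = {}
        for k in cluster_members:                   # sum term frequencies over the cluster
            for t, v in docs[k].items():
                centroid_doc[t] = centroid_doc.get(t, 0.0) + v
        sim = nearness(docs[i], centroid_doc)
        angles[i] += step * (sim - 0.5) * (centroid_angle - angles[i])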
The identification of a cluster is a critical issue, and perhaps this is where we might use some corollary to the CCG visualization. Again, it is difficult to be specific because there are many different ways to accomplish this. For example, whereas the first part of a gather process might be purely algorithmic (using the Prueitt voting procedure or the standard cosine measure), the second part might allow the use of visual acuity to select the boundaries of a cluster and to make a judgment about the topics and themes that are central to that cluster. Once clusters are identified, one might ask a user to review the cluster’s conceptual graph and the conceptual graph related to the point.
A question can be posed to the
user regarding the “nearness” of the two conceptual graphs. The answer can then drive the gather
process.
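A minimal sketch of how such an answer might drive the gather step, assuming the user returns a nearness rating in [0, 1]. The names ask_user_nearness and gather_with_user are illustrative, and the update mirrors the pairwise variant sketched above.

    import random

    def gather_with_user(angles, graphs, ask_user_nearness, step=0.05):
        """Replace the algorithmic nearness with a human judgment about two conceptual graphs."""
        i, j = random.sample(range(len(graphs)), 2)
        rating = ask_user_nearness(graphs[i], graphs[j])   # user-supplied value in [0, 1]
        pull = step * (rating - 0.5)
        mid = (angles[i] + angles[j]) / 2.0
        angles[i] += pull * (mid - angles[i])
        angles[j] += pull * (mid - angles[j])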
The question of nearness is problematic unless it is handled in a grounded fashion. The grounding that we suggest is Pospelov’s notion of an oppositional scale. Pospelov has made the conjecture that there are 117 categories of semantic linkage. Pospelov’s enumeration of a theory of semantic types is conjectured to be language independent, and to be related to the actual linkages that form as natural language systems go through their early formative process. The work of Stu Kauffman on autocatalytic sets establishes one way to justify the notion that semantic linkage in natural language does in fact have a specific structure. Regardless of how one justifies an enumerated set of dimensions, semantic linkage must be expressed through the degrees of freedom required for each of these dimensions.
The analysis of the previous paragraph implies that the gather part of a scatter/gather method be carried out on a 117-dimensional sphere. The notion of nearness is thus specific to those oppositional scales that are seen to be relevant to a particular comparison, such as one between two conceptual graphs.
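A minimal sketch of this restricted nearness, assuming each document or conceptual graph is summarized as a position in [-1, 1] on each of the 117 oppositional scales, and that a given comparison names the subset of scales it considers relevant. The representation and the names scale_nearness and relevant_scales are our own illustration, not Pospelov’s formalism.

    import math

    N_SCALES = 117   # Pospelov's conjectured number of categories of semantic linkage

    def scale_nearness(profile_a, profile_b, relevant_scales):
        """Cosine-style nearness computed only over the oppositional scales judged relevant.

        profile_a, profile_b: sequences of 117 values in [-1, 1], one per oppositional scale.
        relevant_scales:      indices of the scales relevant to this particular comparison.
        """
        a = [profile_a[k] for k in relevant_scales]
        b = [profile_b[k] for k in relevant_scales]
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0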
Such a comparison is both situational and relative to a point of view. Thus the indexing of semantic linkage in text collections needs to be vetted by human introspection, and the consequences of this vetting process need to be encoded into some type of compression. The gather process is exactly such a compression.
No matter how the limiting distribution arises, visual judgments made about distributions in cluster space can reinforce the human’s interpretations of the meaning of cluster patterns. The limiting distribution reveals clusters that need to be validated as relevant knowledge artifacts in context. Knowledge validation is where we expect to see a formal notion related to Zenkin’s notion of super-induction.
One requirement for CCG application to number theory is the existence of an infinite sequence of positive integers, generated by the Peano axiom and the imposed properties of addition and multiplication. This sequence serves as an index on sequences of theorems. The objective of both induction and super-induction is to find a way to demonstrate that the truth of an intricate relationship transfers. The transfer carries the succession (Peano axiom), addition, and multiplication properties of the integers onto an evaluation of truth.
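For reference, the ordinary induction schema that both notions build on can be written as follows; this states only classical induction from a base point n_0, and we do not attempt to restate Zenkin’s super-induction formally here.

\[
\Big( P(n_0) \;\wedge\; \forall n \ge n_0 \,\big( P(n) \Rightarrow P(n+1) \big) \Big)
\;\Longrightarrow\; \forall n \ge n_0 \; P(n)
\]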
This transfer of evaluation begins at some point in the sequence of theorems and continues from that point onward. The validated limiting distribution of a gather process has exactly the property we need: at some point the gather process no longer changes the clustered patterns, and from that point on the clusters remain invariant.
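A minimal sketch of detecting this invariance, assuming one gather pass is available as a function on states and that the clustered pattern can be summarized as an assignment of each point to a cluster label. The names run_until_invariant, gather_pass, and cluster_assignment are illustrative.

    def run_until_invariant(state, gather_pass, cluster_assignment, max_iters=10000):
        """Iterate the gather process until the clustered pattern stops changing.

        gather_pass:        one application of the iterated rule set, state -> state
        cluster_assignment: maps a state to a tuple of cluster labels, one per point
        Returns the index of the first pass after which the clusters were unchanged,
        together with that state.
        """
        previous = cluster_assignment(state)
        for step in range(1, max_iters + 1):
            state = gather_pass(state)
            current = cluster_assignment(state)
            if current == previous:          # the limiting distribution has been reached
                return step, state
            previous = current
        return max_iters, state              # no fixed point found within the budget

In practice one would likely require the assignment to remain unchanged over several consecutive passes before declaring the distribution to be limiting.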