ARL Final Report
Section 4
Study of cluster iterations and visualization
December 31, 2000
In reviewing the first three sections of this report, I find the concepts difficult, in part because we must cover so much. First, we must cover what we have discovered about the knowledge-creating methodology that Alexander Zenkin applies to the knowledge domain of number theory. Then we have to address what we think are the philosophical issues related to the notions of induction and super-induction. Finally, we must develop a scientific foundation and a plan for applying the CCG techniques and concepts to natural language parsing.
The core issues are:
1) How do we replace the concept of number with a concept grounded in semantic tokens and the linkages between these tokens?
2) How do we replace the original notion of super-induction so that the replacement has the same rigor as the notion discovered by Alexander Zenkin?
We have only begun preliminary work on the first core issue. The second core issue remains largely a statement of work that needs to be done but has not yet been done. As we neared the end of our contract period (December 2000), it became clear that the generalization of the concept of number that we need is the cluster space itself.
The clustering process itself produces states, and each state is generated from the previous state by a specific set of rules that is applied repeatedly. These rules can collectively be thought of as an axiom which, like the Peano axiom, is iterated. The visualization of a sequence of cluster spaces requires considerable work with computer programmers. However, the literature has many examples of visualization of cluster spaces (specifically, systems like Pathfinder and the Spires).
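To make the analogy concrete, the following is a minimal sketch, in Python, of a cluster process viewed as one rule set iterated from an initial state, in the way repeated application of the successor rule generates the integers. The names iterate_states, apply_rules, and initial_state are illustrative and are not part of the original method.

    def iterate_states(initial_state, apply_rules, n_steps):
        """Generate a sequence of cluster states by repeatedly applying one rule set,
        much as repeated application of the successor rule generates the integers."""
        states = [initial_state]
        for _ in range(n_steps):
            states.append(apply_rules(states[-1]))   # state_{k+1} = rules(state_k)
        return states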
The first elements of the sequence should show a random scatter of the points. Figure 1 shows a random scatter of points on a circle.
Figure 1: Random scatter of 312 points on a circle
This is the scatter part of the scatter/gather method. The gather part is generally one of the following (a sketch of both variants follows this list):
1) take two points at random, evaluate the “nearness” between the two corresponding documents, and use this metric to move the two points a little closer together or further apart on the circle;
2) take a point and a cluster at random, compute the “nearness” between the document and the cluster, and move the point closer to or further from the cluster on the circle.
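A minimal sketch of both gather variants, assuming documents are represented as term-frequency dictionaries, points are angles on a circle, and nearness is taken to be the cosine measure. The function and variable names (nearness, gather_pairwise, gather_to_cluster) are our own illustration, not part of the original method, and wrapping of angles modulo 2*pi is omitted for brevity.

    import math
    import random

    def nearness(doc_a, doc_b):
        """Cosine measure between two term-frequency dictionaries (one common choice)."""
        common = set(doc_a) & set(doc_b)
        dot = sum(doc_a[t] * doc_b[t] for t in common)
        norm = (math.sqrt(sum(v * v for v in doc_a.values()))
                * math.sqrt(sum(v * v for v in doc_b.values())))
        return dot / norm if norm else 0.0

    def gather_pairwise(angles, docs, step=0.05):
        """Variant 1: pick two points at random and nudge them together or apart."""
        i, j = random.sample(range(len(docs)), 2)
        sim = nearness(docs[i], docs[j])            # value in [0, 1]
        pull = step * (sim - 0.5)                   # similar -> attract, dissimilar -> repel
        mid = (angles[i] + angles[j]) / 2.0
        angles[i] += pull * (mid - angles[i])
        angles[j] += pull * (mid - angles[j])

    def gather_to_cluster(angles, docs, cluster_members, step=0.05):
        """Variant 2: pick one point at random and nudge it toward or away from a cluster."""
        i = random.randrange(len(docs))
        centroid_angle = sum(angles[k] for k in cluster_members) / len(cluster_members)
        centroid_doc = {}
        for k in cluster_members:                   # sum term frequencies over the cluster
            for t, v in docs[k].items():
                centroid_doc[t] = centroid_doc.get(t, 0.0) + v
        sim = nearness(docs[i], centroid_doc)
        angles[i] += step * (sim - 0.5) * (centroid_angle - angles[i])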
The identification of a cluster is a critical issue, and perhaps this is where we might use some corollary to the CCG visualization. Again, it is difficult to be specific because there are many different ways to accomplish this. For example, whereas the first part of a gather process might be purely algorithmic (using the Prueitt voting procedure or the standard cosine measure), the second part might allow the use of visual acuity to select the boundaries of a cluster and to make a judgment about the topics and themes that are central to that cluster. Once clusters are identified, one might ask a user to review the cluster’s conceptual graph and the conceptual graph related to the point.
A question can be posed to the
user regarding the “nearness” of the two conceptual graphs. The answer can then drive the gather
process.
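A minimal sketch of how such an answer might drive the gather step, assuming the user returns a nearness rating in [0, 1]. The names ask_user_nearness and gather_with_user are illustrative, and the update mirrors the pairwise variant sketched above.

    import random

    def gather_with_user(angles, graphs, ask_user_nearness, step=0.05):
        """Replace the algorithmic nearness with a human judgment about two conceptual graphs."""
        i, j = random.sample(range(len(graphs)), 2)
        rating = ask_user_nearness(graphs[i], graphs[j])   # user-supplied value in [0, 1]
        pull = step * (rating - 0.5)
        mid = (angles[i] + angles[j]) / 2.0
        angles[i] += pull * (mid - angles[i])
        angles[j] += pull * (mid - angles[j])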
The question of nearness is problematic unless it is handled in a grounded fashion. The grounding that we suggest is Pospelov’s notion of an oppositional scale. Pospelov has made the conjecture that there are 117 categories of semantic linkage. Pospelov’s enumeration of a theory of semantic types is conjectured to be language independent, and to be related to the actual linkages that form as natural language systems go through their early formative process. The work of Stu Kauffman on autocatalytic sets establishes one way to justify the notion that semantic linkage in natural language does in fact have a specific structure. Regardless of how one justifies an enumerated set of dimensions, semantic linkage must be expressed through the degrees of freedom required for each of these dimensions.
The analysis of the previous paragraph implies that the gather part of a scatter/gather method be carried out on a 117-dimensional sphere. The notion of nearness is thus specific to those oppositional scales that are seen to be relevant to a particular comparison, such as one between two conceptual graphs.
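A minimal sketch of this restricted nearness, assuming each document or conceptual graph is summarized as a position in [-1, 1] on each of the 117 oppositional scales, and that a given comparison names the subset of scales it considers relevant. The representation and the names scale_nearness and relevant_scales are our own illustration, not Pospelov’s formalism.

    import math

    N_SCALES = 117   # Pospelov's conjectured number of categories of semantic linkage

    def scale_nearness(profile_a, profile_b, relevant_scales):
        """Cosine-style nearness computed only over the oppositional scales judged relevant.

        profile_a, profile_b: sequences of 117 values in [-1, 1], one per oppositional scale.
        relevant_scales:      indices of the scales relevant to this particular comparison.
        """
        a = [profile_a[k] for k in relevant_scales]
        b = [profile_b[k] for k in relevant_scales]
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0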
Such a comparison is both situational and relative to a point of view. Thus the indexing of semantic linkage in text collections needs to be vetted by human introspection, and the consequences of this vetting process need to be encoded into some type of compression. The gather process is exactly such a compression.
No matter how the limiting distribution arises, visual judgments made about distributions in cluster space can reinforce the human’s interpretations of the meaning of cluster patterns. The limiting distribution reveals clusters that need to be validated as relevant knowledge artifacts in context. Knowledge validation is where we expect to see a formal notion related to Zenkin’s notion of super-induction.
One requirement for CCG application to number theory is the existence of an infinite sequence of positive integers, generated by the Peano axiom and the imposed properties of addition and multiplication. This sequence serves as an index on sequences of theorems. The objective of both induction and super-induction is to find a way to demonstrate that the truth of an intricate relationship transfers. The transfer carries the succession (Peano axiom), addition, and multiplication properties of the integers onto an evaluation of truth.
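For reference, the ordinary induction schema that both notions build on can be written as follows; this states only classical induction from a base point n_0, and we do not attempt to restate Zenkin’s super-induction formally here.

\[
\Big( P(n_0) \;\wedge\; \forall n \ge n_0 \,\big( P(n) \Rightarrow P(n+1) \big) \Big)
\;\Longrightarrow\; \forall n \ge n_0 \; P(n)
\]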
This transfer of evaluation begins at some point in the sequence of theorems and continues from that point onward. The validated limiting distribution of a gather process has exactly the property we need: at some point the gather process no longer changes the clustered patterns, and from that point on the clusters remain invariant.
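A minimal sketch of detecting this invariance, assuming one gather pass is available as a function on states and that the clustered pattern can be summarized as an assignment of each point to a cluster label. The names run_until_invariant, gather_pass, and cluster_assignment are illustrative.

    def run_until_invariant(state, gather_pass, cluster_assignment, max_iters=10000):
        """Iterate the gather process until the clustered pattern stops changing.

        gather_pass:        one application of the iterated rule set, state -> state
        cluster_assignment: maps a state to a tuple of cluster labels, one per point
        Returns the index of the first pass after which the clusters were unchanged,
        together with that state.
        """
        previous = cluster_assignment(state)
        for step in range(1, max_iters + 1):
            state = gather_pass(state)
            current = cluster_assignment(state)
            if current == previous:          # the limiting distribution has been reached
                return step, state
            previous = current
        return max_iters, state              # no fixed point found within the budget

In practice one would likely require the assignment to remain unchanged over several consecutive passes before declaring the distribution to be limiting.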