ARL Final Report
Section 3
Strategy for organizing text data using CCG plus other techniques
December 31, 2000
Computer Cognitive Graphics (CCG)
is a technique developed by Alexander Zenkin.
The technique uses a transformation of sequence values into a
rectangular grid. Visual acuity is then used to identify theorems in the domain
of number theory.
Our purpose has been to
generalize this work and to integrate the components of CCG technique with
knowledge discovery methods more widely used for automating the discovery of
semantically rich interpretations of text.
An application of knowledge discovery in text is underway. As a general capability, text is to be parsed in the context of a number of text-understanding tasks.
Test Corpus: We have several test collections; however, only one is non-proprietary. This corpus is a collection of 312 Aesop fables. We will demonstrate some work on identification of the semantic linkages between the 312 elements of this collection in the last section (section 4).
In a text parsing system, the
interpretation of text is oriented towards retrieval of similar text units,
the routing of text units into categories and the projection of inference about
the consequences of meaning.
As part of a commercial activity, TelArt Inc. is developing behavioral criteria for commercial clients regarding proper evaluation of text vectorization, Latent Semantic Indexing, CCG-derived, and other informational indexing systems. Most of the direct work on these criteria is proprietary in nature. However, movement from one domain of interest to another produces variations of the behavioral criteria. Thus client criteria can be generalized to fit many domains outside the domain of content evaluation. This moves us towards the notion of ultrastructure as developed by Jeff Long at the Department of Energy.
Thus we envision the development of generalized behavioral criteria for evaluation of indexing, routing, and retrieval technologies. Behavioral criteria for intelligence vetting are of interest to TelArt Inc.
TelArt is, of course, interested
in a full-scale program involving principals at New Mexico State University and
universities in Russia on situational logics and text parsing systems.
Use-case analysis of a text parsing process
A.1: Convert a word sequence into
a sequence of stemmed words.
A.1.1: Take
the words in each document and filter out words such as “the” and “a”.
A.1.2: Make
replacements of words into a stemmed class:
running → run
storage → store
etc.
A.1.3: The set
of stemmed classes will be referred to as C.
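As an illustration of step A.1, here is a minimal sketch in Python. The stopword list and suffix-stripping rules are toy assumptions standing in for a real stopword list and stemmer; any standard stemming algorithm (Porter, for example) could be substituted.

```python
# Sketch of use case A.1: filter stopwords (A.1.1), map words to stem
# classes (A.1.2), and collect the set of stem classes C (A.1.3).
# The stopword list and suffix rules below are illustrative assumptions.

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

# Toy suffix-stripping rules standing in for a real stemmer.
SUFFIX_RULES = [("ning", ""), ("ing", ""), ("age", "e"), ("s", "")]

def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def stem_classes(document):
    words = [w.lower() for w in document.split()]
    kept = [w for w in words if w not in STOPWORDS]
    stems = [stem(w) for w in kept]
    return stems, set(stems)   # stem sequence, and the class set C

stems, C = stem_classes("The running of the storage units")
print(stems)   # ['run', 'store', 'unit']
```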
A.2: Translate the stems to
positive integers randomly distributed on a line segment between 0 and
1,000,000,000.
A.2.1: This produces
a one-dimensional semantically pure representation of the word stem classes.
A.2.2: The
topology of the line segment is to be disregarded.
A.2.3: A new
topology is to be generated by adaptively specifying the distance d(x,y) for
all x, y in C.
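A minimal sketch of step A.2 follows. Each stem class receives a distinct random positive integer on the stated segment, so the initial positions carry no semantic content (A.2.1); the fixed seed is an illustrative assumption.

```python
import random

# Sketch of use case A.2: scatter the stem classes in C uniformly at
# random onto the integer segment [1, 1,000,000,000].  Nearness of the
# resulting tokens is to be disregarded (A.2.2); a semantic distance
# d(x, y) is learned afterwards (A.2.3).

def scatter(C, seed=0):
    rng = random.Random(seed)
    # sample without replacement so each stem gets a distinct token
    tokens = rng.sample(range(1, 1_000_000_001), len(C))
    return dict(zip(sorted(C), tokens))

token_of = scatter({"run", "store", "unit"})
print(token_of)   # e.g. {'run': ..., 'store': ..., 'unit': ...}
```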
A.3: Introduce a pairwise semantics.
A.3.1: Following a training algorithm, points are moved closer together or further apart based on either algorithmic computation or human inspection.
A.3.2:
Topologies can be produced by various methods.
A.3.2.1: Generalized Kohonen feature extraction, evolutionary computing techniques.
A.3.2.2: Form a one-point topological compactification of the line segment, and cluster using the Prueitt feature extraction algorithm.
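The following sketch illustrates the kind of training loop A.3.1 describes, with token positions normalized to the unit segment. The oracle, step size, and iteration count are assumptions on our part; the Prueitt feature extraction algorithm itself is not specified in this report.

```python
import random

# Sketch of use case A.3: adapt pairwise distances by moving token
# positions closer together or further apart according to a similarity
# judgment supplied either algorithmically or by human inspection.

def train_positions(pos, oracle, steps=10_000, lr=0.05, seed=0):
    """pos: dict stem -> position in [0, 1].
    oracle(a, b): +1 (similar, move closer) or -1 (dissimilar, move apart)."""
    rng = random.Random(seed)
    stems = list(pos)
    for _ in range(steps):
        a, b = rng.sample(stems, 2)
        gap = pos[b] - pos[a]
        shift = lr * gap * oracle(a, b)
        pos[a] += shift / 2                    # toward b when similar
        pos[b] -= shift / 2                    # toward a when similar
        pos[a] = min(max(pos[a], 0.0), 1.0)    # stay on the segment
        pos[b] = min(max(pos[b], 0.0), 1.0)
    return pos
```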
A.4: Pairwise semantics can drive a clustering process to produce a new semantically endowed topology on the circle.
A.4.1: Features of the new space are identified by the nearness of representational points.
A.4.2: Prueitt feature extraction is seen to be equivalent to Kohonen feature extraction.
A.4.3: Prueitt feature extraction produces a specific distribution of points on the unit circle.
A.4.3.1: The Prueitt distribution is defined as a limiting distribution which has semantically relevant clusters.
A.4.3.2: As in associative neural networks and other feature extraction algorithms, the Prueitt distribution is driven by a randomization of selection (of the order of pairing evaluation), and thus is not uniquely determined by the data set.
A.4.3.3: Iterated feature extraction produces a quasi-complete enumeration of features, encoded within a topology.
A.4.3.4: The limiting distribution has an endowed semantics.
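A sketch of the compactification and clustering in A.3.2.2 and A.4: the endpoints of the unit segment are glued together on the unit circle, and features are read off as groups of nearby angles (A.4.1). The simple gap-based grouping rule is an assumed stand-in for the actual feature extraction step.

```python
import math

# Sketch of use case A.4: one-point compactification of the unit
# segment onto the unit circle, followed by clustering of angles.

def to_circle(x):
    """Map x in [0, 1] to an angle; 0 and 1 are glued at angle 0."""
    return 2.0 * math.pi * x

def cluster_angles(positions, gap=0.2):
    """Sorted angles separated by more than `gap` radians start a new
    cluster, so features appear as groups of nearby points."""
    thetas = sorted(to_circle(x) for x in positions)
    clusters, current = [], [thetas[0]]
    for t in thetas[1:]:
        if t - current[-1] > gap:
            clusters.append(current)
            current = [t]
        else:
            current.append(t)
    clusters.append(current)
    # wrap-around: first and last clusters may touch through angle 0
    if len(clusters) > 1 and (2 * math.pi - clusters[-1][-1]) + clusters[0][0] <= gap:
        clusters[0] = clusters.pop() + clusters[0]
    return clusters
```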
A.5: The clustering process can be mediated by human perceptual acuity.
A.5.1: Phrase-to-phrase association by human introspection, as coded in a relational matrix.
A.5.2: Latent Semantic Indexing for co-occurrence, correlation, and associative matrices.
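For step A.5.2, here is a minimal Latent Semantic Indexing sketch: a term-document co-occurrence matrix is factored by singular value decomposition and truncated, so associations appear as proximity in the reduced space. The matrix values are fabricated for illustration only.

```python
import numpy as np

# Sketch of Latent Semantic Indexing (use case A.5.2): SVD of a small
# term-document matrix, truncated to k latent factors.

A = np.array([
    [2, 0, 1, 0],          # rows: term counts across four documents
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                               # keep the 2 strongest factors
terms_k = U[:, :k] * s[:k]          # term coordinates in latent space
docs_k = Vt[:k, :].T * s[:k]        # document coordinates in latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(docs_k[0], docs_k[1]))   # latent similarity of documents 0, 1
```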
A.6: The limiting distribution is viewed via the CCG methodology.
A.6.1: The CCG methodology is linear and assumes a regular limiting distribution.
A.6.1.1: The CCG technique finds a coloring pattern and modulus (related to the number of columns in the grid), such that the principle of super induction can be used to prove a specific theorem using the limiting distribution.
A.6.1.2: The coloring pattern and modulus are specific to the theorem in question.
A.6.1.3: Given a limiting distribution that is metastable (a pattern is repeated) or stable (one condition is shown to exist always beyond a certain point), then a theorem exists.
A.6.2: A
non-linear CCG technique might have a grid of dimension higher than 2, as well
as transformation operators that are statistical.
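To make the linear CCG transformation of A.6 concrete, the following sketch writes the integer sequence into a grid of a chosen modulus (number of columns) and marks cells having a property of interest. Here the property is "perfect square" and the coloring is plain text, whereas Zenkin's pythograms use actual color; the property and modulus are illustrative choices.

```python
# Sketch of the CCG grid transformation (use case A.6): arrange the
# sequence 0, 1, 2, ... row by row with `modulus` columns and mark the
# cells satisfying a property; regularities appear as visual patterns.

def is_square(n):
    r = int(n ** 0.5)
    return r * r == n

def pythogram(limit, modulus):
    for row_start in range(0, limit, modulus):
        row = range(row_start, min(row_start + modulus, limit))
        print("".join("#" if is_square(n) else "." for n in row))

pythogram(limit=120, modulus=10)
# With modulus 10 the squares fall into a few fixed columns (the
# quadratic residues mod 10); other moduli reveal or hide such
# regularities, which is the visual step of the method.
```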
Technical
Aside
Locally, the clustered tokens on a number line can be roughly partitioned using neighborhoods (see Figure 1).
Figure 1: Topology induced by clustering
The circles in Figure 1 indicate boundaries of categories. These categories overlap, producing areas of ambiguity. However, rough sets and the Prueitt voting procedure anticipate the ambiguous boundaries and address this issue in a non-fuzzy fashion.
We can map the categories to an n-dimensional space. In this case n = 8.

x --> (0,1,0,0,0,0,0,0) or (0,0.7,0,0,0,0,0,0)
y --> (0,0,0,0,1,0,0,0) or (0,0,0,0,0.6,0,0,0)
z --> (0,0,0,0,0,1,1,0) or (0,0,0,0,0,0.3,0.2,0)

The first representation is a Boolean representation of inclusion; the second shows a degree of closeness to the center of the distribution.
The question of how to assign
category memberships to new text elements is addressed in the tri-level routing
and retrieval architecture and voting procedure as well as in Russian Quasi
Axiomatic Theory.
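A sketch of the category mapping above, with n = 8: each token position is mapped either to a Boolean inclusion vector or to a graded vector of closeness to each category center. The centers and radius are invented so that neighboring categories overlap, producing exactly the ambiguity the rough-set treatment anticipates.

```python
import numpy as np

# Sketch of the Figure 1 category mapping: n = 8 category centers on
# the number line, with overlapping neighborhoods.

centers = np.linspace(0.05, 0.95, 8)    # assumed category centers
radius = 0.1                            # assumed (overlapping) radius

def memberships(x):
    closeness = np.clip(1.0 - np.abs(centers - x) / radius, 0.0, 1.0)
    boolean = (closeness > 0.0).astype(float)
    return boolean, np.round(closeness, 2)

b, g = memberships(0.21)
print(b)   # Boolean inclusion: 1 in each category whose region covers x
print(g)   # graded closeness to each category center
```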
Domains
There are some important and general problems in automated recognition of patterns in data streams. When the stream is a stream of text, the text can be converted or transformed into tokens that have been shaped by linguistic and semantic considerations. The discovery of a proper shaping methodology is what is at issue. Shaping is an induction.
For example, an illness might be recognized from a written description of symptoms. The text of the descriptions can be converted to number sequences. Shaping might come from relevance feedback, or from some type of feed-forward mechanism such as is thought to account for the shaping of action during motor control; see the neuropsychological and neural architecture work by James Houk and Andy Barto.
The development of an information
base from a text database is at least a three step process:
1)
the transformation from text to tokens,
2)
the shaping of token identification, and
3)
the development of relationships between tokens.
The first step, in text vectorization scatter/gather methods, is to randomize the locations of numbers by a uniform projection into a line segment. The projected numbers become tokens for the text units. The nearness of tokens on the number line is discarded. Some transformations might then be applied to produce clusters of values around prototypes of concepts that are expressed in the text (in theory).
Our hope is that CCG images can be involved in a visualization of these prototypes. The clusters, once identified, can be used in the definition of a tri-level category policy for placing new text elements within an assignment relative to these categories.
Some
Uncertainties
The objective of our study is to
see if we can find a representational schema, transformation of topologies, and
some relationship that becomes clear as a process of analysis moves our attention
deeper into a sequence of numbers or generalizations of the notion of
number. As we neared the completion of our contract time (December 2000), it became clear to Prueitt that the generalization of number we needed is the cluster space itself. The clustering process produces states, and each state is generated from the previous state using a specific iterated axiom (like the Peano axiom).
The cluster iteration is dependent on a scatter/gather process. This process is well known in the text vectorization community. We see no way (at least not now) to start with anything other than a scatter of representational points into a line segment, or perhaps an n-dimensional sphere (see the discussion regarding Pospelov oppositional scales).
Obtaining an even distribution of numbers is a critical process. However, there are degrees of uncertainty that actually play in favor of eventual positive utility. No matter how the distribution occurs, judgments of visual acuity about distributions in cluster space can reinforce the human's interpretations of the meaning of cluster patterns.
Reinforcement
learning within an image space?
At the beginning of our current project (February 2000), we did not know how we would handle the CCG visualization task. We felt that the visualization could come at any of at least three points:
1) In the
clustering process itself
2) In the
provision of semantic linkages between clusters
3) In the
reinforcement of positive and negative learning
Our thinking is that the identification of tokens should be followed by the identification of a relational linkage between tokens. However, how does one impose such a relational linkage given the nature of human concepts? One answer is a reinforcement architecture; a minimal sketch follows.
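The sketch below illustrates one such reinforcement architecture, under our own assumptions about the update rule: a relational linkage matrix over tokens is strengthened or weakened by positive and negative feedback signals.

```python
# Sketch of a reinforcement architecture for relational linkages:
# feedback pushes each linkage weight toward +1 (confirmed) or -1
# (rejected).  The update rule and rate are assumptions.

class RelationalLinkage:
    def __init__(self, tokens, rate=0.1):
        self.index = {t: i for i, t in enumerate(tokens)}
        n = len(tokens)
        self.w = [[0.0] * n for _ in range(n)]    # linkage strengths
        self.rate = rate

    def reinforce(self, a, b, signal):
        """signal = +1 when a human (or downstream task) confirms the
        a-b linkage, -1 when it is rejected."""
        i, j = self.index[a], self.index[b]
        for p, q in ((i, j), (j, i)):             # keep it symmetric
            self.w[p][q] += self.rate * (signal - self.w[p][q])

links = RelationalLinkage(["run", "store", "unit"])
links.reinforce("run", "store", +1)
print(links.w[0][1])   # moved from 0.0 toward +1.0
```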
A mapping could be developed between CCG images, derived in some way (see, for example, the use cases specified in the last section), and objective physiological, psychological, and other medical parameters. The mapping might be complex, in the sense envisioned by the tri-level voting procedure or Russian Quasi Axiomatic Theory, or it could depend simply on visual acuity. The CCG technique separates and then organizes regular patterns using a two-dimensional grid and color-coding. We looked for a separation and organization process that makes sense of non-regular patterns, such as those found in word distributions. These are visualized as point distributions on a circle.
Normal distance metrics on CCG-type images might produce an ability to automatically parse new text for concepts, simply by using a distance in image space. Any correspondence between image distances and conceptual linkages would be of extreme utility. In any case, we need a means to reinforce learning. We need an interface between the user and computational systems. This interface must be used in critical situations where judgments are made and consequences determined. The interface must preserve the notion that the human interpretant and judgment are an essential aspect of the experience of knowledge.
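A sketch of this image-space idea: treat two CCG-type images as matrices and compare them with an ordinary Euclidean distance, so that a new text's image can be routed to the nearest known concept image. Whether image distance tracks conceptual linkage is exactly the open question raised above.

```python
import numpy as np

# Sketch: ordinary (Euclidean) distance between CCG-type images, and
# nearest-image routing of a new text's image to a known concept.

def image_distance(grid_a, grid_b):
    a = np.asarray(grid_a, dtype=float).ravel()
    b = np.asarray(grid_b, dtype=float).ravel()
    return float(np.linalg.norm(a - b))

def nearest_concept(new_grid, labeled_grids):
    """labeled_grids: dict concept_name -> grid of the same shape."""
    return min(labeled_grids,
               key=lambda c: image_distance(new_grid, labeled_grids[c]))
```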
Other
Domains
An interpretation of invariance in data from astronomical sources might be made. In this case a relationship would be found between various types of invariance in the data, as identified by a CCG-type image, and the scientific interpretation of that invariance. A reference system could be developed that extrapolates between visualizations.
This domain appears simpler than the domain of text parsing, and a Russian application of CCG visualization to scientific data streams is expected.
EEG data has been considered; however, the problem has been the acquisition of data, as well as the development of a methodology for eliciting knowledge from knowledgeable persons in the field of EEG analysis.
Content
Evaluation
Content evaluation seems to be
the most difficult and to have the most potential for a true scientific
breakthrough and for aggressive commercialization. However, we are well aware of the considerations that come from
any proper understanding of the deep issues.
Content evaluation for full-text databases might follow a specific approach (a minimal sketch of step 3 follows the list):
1)
an inventory of CCG-type images is recognized by a human as having value,
2)
the images are used as retrieval profiles against CCG representations of text not yet recognized, and
3)
an associative neural network takes reinforcement into account in the modification of semantic interpretation.
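The following is a minimal Hopfield-style sketch of the associative network in step 3. Recognized profiles are stored as +/-1 vectors; a degraded or novel pattern is recalled toward the nearest stored profile, and reinforcement would re-store corrected patterns. This is an assumed stand-in for whatever associative network the project finally adopts.

```python
import numpy as np

# Sketch of step 3: a Hopfield-style associative memory over +/-1
# pattern vectors (e.g., flattened CCG image profiles).

def store(patterns):
    """Hebbian weights from a list of +/-1 pattern vectors."""
    p = np.array(patterns, dtype=float)
    w = p.T @ p / len(p)
    np.fill_diagonal(w, 0.0)   # no self-connections
    return w

def recall(w, probe, steps=10):
    """Iterate the network; the probe settles toward a stored pattern."""
    x = np.array(probe, dtype=float)
    for _ in range(steps):
        x = np.sign(w @ x)
        x[x == 0] = 1.0        # break ties deterministically
    return x

w = store([[1, -1, 1, -1, 1, -1],
           [1, 1, 1, -1, -1, -1]])
print(recall(w, [1, -1, 1, -1, 1, 1]))   # settles on the first pattern
```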
Several theorists have identified a problem. This problem is stated well by Alex Zenkin when he talks about expert knowledge and the role of well-defined goals in the CCG-based vetting process. We hold that this problem can be solved using ecological systems theory based in part on the work of Robert Shaw, J. J. Gibson, and A. N. Whitehead.
Comments
from Alex Zenkin
VISAD works with any set of
objects that is described by any set of tokens (signs, properties, etc.).
The set of tokens must include
semantics on the set of objects.
The required semantics is defined
by common goals of the problem under consideration (purchase, sale,
advertisement, production, market, etc.). It is obvious, that objects will have
different sets of tokens depending on the goal.
The requirement for goal-oriented information means that it is impossible to describe all objects of a large patent base by a unified set of tokens.
Not knowing terminal aims and semantic features of objects, we can construct only a set of formal syntactic, grammatical, and statistical tokens.
However, Zenkin sees the
following possibilities.
First of all, we may formulate the terminal goals:
1)
Why do we need to analyze the data?
2)
What kind of problems will the new knowledge help us solve?
We may extract from the patent
base a class of related objects (say, by means of key words).
A set of tokens related to
terminal goals must be constructed by professionals in the problem domain
defined by the terminal goals.
We may use VISAD in order to:
1)
Visualize
the given class of objects.
a. The class will have a distinctive CCG-type image that is used as an icon.
b. The icon will have possible dynamics. Zenkin has suggested that Lefebvre-type gestured images be used as a control language.
c. Prueitt has developed some notions on state/gesture response interfaces to contextualized text databases.
2)
Define
more accurately the initial set of tokens
a. This is an interactive process with experts on the problem, and of learning their professional knowledge and intuition.
b. Our experience shows that professionals frequently find new ideas and solutions during VISAD usage.
c. Vladimir Lefebvre has useful ideas on the reflexive dynamic interactions "expert – experts group" and "expert – visualized problem".
d. Vladimir Lefebvre's "faces representation" for qualitative analysis of situations might be useful here.
3)
Automate (with visual CCG support) classification of the given class of objects.
4)
Automate (with visual CCG support) creation of notions on classes.
5)
Create a mathematical (logical) model of the kind "What will be IF …" or "reason – consequence", "cause – effect", etc.
6)
Automate (with visual CCG support) recognition (purposeful search) of objects in an image-structured information base.
Additional
comments on VISAD by Alex Zenkin
VISAD usage allowed us to make clear well-known information on logical and mathematical paradoxes (about 20 objects described by about 20 tokens). In recent work in Russia, Zenkin discovered some new tokens, made a new classification of some paradoxes, and formulated notions on necessary and sufficient conditions of paradoxicality as a whole. These notions disclose the nature of paradoxicality.
Main stages of this work are described in Zenkin's paper "New Approach to Paradoxes Problem Analysis", published in "Voprosy Filosofii" (Problems of Philosophy), 2000, no. 10, pp. 81-93 (see Annotation; today so far only in Russian).
VISAD and its usage to analyze large bases of unstructured text data is a large independent project.
Note
from Paul Prueitt to Alex Zenkin November 17, 2000
CCG analysis of EEG data is a complex task. The data set that Prueitt received from Pribram's lab is in an unknown numeric format. Figuring out this format has stopped his attempt to put the EEG data together.
Even if this data format problem is figured out, we still have the difficult issue of how to make sense of the syntactic structure that might be imposed.
For this reason, Prueitt has turned to the unstructured text domain, since he has already solved the problem of converting text to numbers and categories.
Prueitt's hope was that Zenkin would be able to find both a source of data and the expertise to make a small experiment in Russia. We hope that Zenkin is able to solve both of these pragmatic problems and thus produce a prototype application of CCG to unstructured scientific data.
Criterion for knowledge creation
We have developed a metaphoric linkage between Zenkin's CCG application to number theory and parsing and interpretation applications to other data sets. The linkage is related to the notion of induction, Zeno's paradox, Cantor's diagonalization conjecture, and Gödel's complementarity / consistency arguments.
In our view, knowledge creation
occurs when a jump occurs from one level of organization to another level of
organization. This jump is referred to
as an “induction”. The scholarship on
this notion of induction is concentrated in the area called ecological physics
(J. J. Gibson, Robert Shaw) and perceptual measurement (Peter Kugler). We hold that inductive jumps require the
existence of scales of organization in physical systems (see the work of Howard
Pattee).
In the case of mathematical
finite induction, the jump occurs when one assumes an actual infinity, and uses
this assumption in the characterization of the positive integers. In the case of Zenkin’s principle of super induction, the
jump allows the truth of the presence of one condition to be transferred to the
proof of a theorem.
Zenkin's principle of super induction is still an induction that occurs over formal structures that are regular and linear, even though perhaps intricate.
One requirement for CCG application to number theory is the existence of an infinite sequence of positive integers, generated by the Peano axiom and the imposed properties of addition and multiplication. This sequence serves as an index on sequences of theorems. The objective of both induction and super induction is to find a way to demonstrate that the truth of an intricate relationship transfers. The transfer is between succession (the Peano axiom) and the addition and multiplication properties of integers. This transfer will exist beginning at some point and then continue being true from that point on in the sequence of theorems.
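Schematically, and only as our reading of the two principles (not a quotation of Zenkin), the contrast can be written as follows:

```latex
% Ordinary finite induction: truth transfers step by step along the
% succession generated by the Peano axiom.
\[
  P(1) \;\wedge\; \forall n\,\bigl[\,P(n) \rightarrow P(n+1)\,\bigr]
  \;\Longrightarrow\; \forall n\; P(n)
\]
% Super induction, as described above: a condition Q, verified at a
% single point (located with the help of the CCG image), transfers the
% truth of P from that point onward.
\[
  \exists n_0\, Q(n_0) \;\Longrightarrow\; \forall n \ge n_0\; P(n)
\]
```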
With any representation of
concepts in text, we have a different situation. There is no relevant truth-value that can be assigned to some
relationship between numbers, or so it seems at first glance.