OntologyStream Inc Briefing & Elementary Tutorial on
categoricalAbstraction (cA) August 2, 2002 |
|
This tutorial is on what some regard as a deconstruction of the mathematics and technology of text and data indexing. The result is deemed a type of Synthetic Perception about the invariance in data and the development of human annotation of what humans regard as meaningful. Synthetic Perception is seen as a by-pass of current data-mining and text-indexing methods
In the first step of a knowledge technology deployment one objectively evaluates proposed technology and methodology and develops an understanding of what is available in the marketplace. One must know what is available in computer systems and from innovations in various supporting communities.
A bottom line is that the human experience of knowledge be shared. At core this is the nature of human communication. A disinterested third party cannot have a vested economic interest, particularly one that is controlling of all other interests. The community at large loses. Society loses.
Only in an economically unencumbered environment can one expect that concept base indexing and the propagation of knowledge be accomplished. The compensation for knowledge sharing has to be accrued to the individuals who are sharing this knowledge.
The tutorial will provide a full disclosure of a fundamental discovery of categorical abstraction and the inventions of stratified theory in the context of knowledge technologies.
Due to the
2001 War on Terrorism, the government is prepared to
scale up IT support for homeland security.
The problem is in the capture and management of meaning and the
inability to extract meaning using purely computational means. In a knowledge war, our conventional IT
'weapons' are weak. In theory, they
could be much stronger.
One must look
to natural science and shift from word-based
indexing to concept-based indexing.
Natural scientists surmise that a concept can be identified and
processed directly, and with revolutionary effects. But
it is not a simple matter to bring this principle into full
application.
The situation
is analogous to World War II. Scientists
surmised that munitions are much
stronger if they were based on a completely different principle -- a
nuclear
reaction rather than a chemical reaction.
The Manhattan Project acted on that theory. Later
peaceful uses of nuclear power where developed.
We propose a
new Manhattan Project for Knowledge Technology
(KT) that will give the Nation a decisive tool in the knowledge war. As the War is won, we expect to see human
productivity increase while the National leads the World in addressing
the root
causes of this unfortunate war.
What we
propose is a project that allows leading scientists to
concentrate on the problem and allows engineers to bring the solution
to
fruition. It requires stripping away
some of the institutional impediments that have been retarding this
development.
Initial Work Outline: Any “conceptual space” associated a knowledge system can be mapped following these steps.
1) The map is produced by collecting together between 100 and 200 samples of the description of technologies, theories and frameworks (using web searches and in house materials).
2) Once this sample is available as a text collection, we write a simple parser to produce a categorical abstraction input file (datawh.txt) that is discussed in the tutorial.
3) The OSI Knowledge Operating System (OSI KOS) is used to record human perceptions about the structure in this input file and render topic maps (XML) for inspection and comment.
The mapping process, of course, has various levels of resolution. OntologyStream Inc develops a broad and definitive model of the IP (Intellectual Property) and conceptual space that surrounds” the basic sub concepts of all knowledge system technology. This work is preparatory for
1) The full mapping of all conceptual themes related to the knowledge system’s subconcepts in both the US PTO database and in the Internet public literature.
2) The production of dedicated patents covering the work of those developing the deployment plan and observing the progress in the deployment.
3) Informed decisions over technology acquisition during integration.
The BCNGroup Charter defines a science-based Intellectual Property management structure that serves the community of innovators supporting the knowledge technologies.
This management structure is designed to serve business partners as well as innovation processes, and is the basis for coordinated development of knowledge technology and knowledge science.
System description
Knowledge
Operating System (KOS) can be used to track global trends and
anticipate events.
The system brings together a variety of innovative components to
provide an
"intelligence-vetting engine" that globally tracks emerging events
related to any topic or event type. The
necessary condition for seeing the event is that a sensor must produce
data
structure indicative of the event.
Categorical Abstraction bypasses the complexity issue in massive
data
sources by replacing massive data with a very much smaller set of
categories. The categories can be used to
retrieve all
of the data that will produce individual categories, so we have a true
"compression" of data into information... and the compression may be
as much as a million to one. Novelty detection falls out of this
process since
a single occurrence of a data element will appear as a new category if
such a
data element is in fact novel. Event chemistry opens this cA
substructure into
a biologically feasible information routing and retrieval architecture
based on
the Prueitt tri-level architecture { memory, awareness, anticipation }.
Machine inference and situational simulation
as well as sense making is enabled
Collection – KOS will automatically monitor information flow in the Internet on a 24/7 basis, looking for new items that are related to the subject and events of interest. KOS software agents look for new and novel relationships not previously anticipated and send data structures into centralized information vetting areas staffed by humans.
For example, all Internet published scientific conferences in a specific area of interest might be mined to produce trend analysis of intellectual property. Concept mining in KOS is language-independent.
Annotation - Human annotation of trends can be done in one’s native language and incorporated into a single, multi-language knowledge base. (By avoiding translation of any inputs, meaning is preserved and the significant costs and delays of translation are avoided.) Topic maps in any language are constructed using the OASIS standard.
Storage - Our collected information is stored in formats that allow an object to be directly related (“structurally coupled”) to any other object. The relating structures are independent second order control mechanisms that support query, routing, and anticipatory responses.
A tri-level architecture is used that separates: (1) known events, (2) categorical abstractions (cA), and (3) aggregations of categorical abstractions (event chemistry or eC) under the constraints of a knowledge base.
Extraction - A number of mature technologies extract information from data sources. This information is diffused into a substructure layer (cA) of information that serves the agile reconstruction of information in context (eC). The agile technology allows us to look at an object of investigation from many different perspectives.
The extraction process is represented as categorical invariance in the patterns of word expressions and processed into a central, sparse knowledge base. Information compression, routing and inference processes are achieved using a technique (I-RIB) that renders visually as graph structures expressing structural relationship and in auditory form.
Interaction with analysts - A number of analytical tools automatically develop ontologies and taxonomies. KOS logical and structural relationships are rendered visually or optionally with sounds. The overall structure of a given area of knowledge is rendered visually. Various scenarios are generated and humans are assisted in exploring variations in strategies and in storing work product related to these investigations. The system itself can be used to analyze analyst performance.
Simulation and sense making – In KOS, the cognitive load is allocated to the human. The conventional approach, of formal deduction or computational expert systems, is replaced with an extension of quasi-axiomatic theory coupled with human experience within the KOS visual and auditory environment and within the community of KOS analysts.
What-if scenarios and conjectures about causes take a new form where humans in small communities develop meaning in dialog about KOS-defined event structures.
Display – KOS provides breakthrough capabilities for identifying and monitoring early indicators of significant change. Visualization of the categories of data and relationship among these data enhance the ability of analysts to anticipate emerging events.
The remainder of this briefing will provide a precise discussion of our methodology, including the OSI categoricalAbstraction (cA) engine and event Chemistry (eC) engines. We start with a download of software at
www.ontologystream.com/cA/tutorials/download/Elementary.zip
which is 298K in a WinZip file and 1 Meg when unzipped.
Figure 1: The unzipped file for this tutorial
The tutorial Data folder starts only with a small file called datawh.txt containing the following text:
IPaddress eventType
109 E1
110 E1
101 E2
101 E3
109 E4
105 E3
105 E2
106 E3
107 E3
102 E5
103 E6
104 E5
108 E1
109 E3
We will not use the Splitter.exe in this tutorial.
Figure 2: The SLIP conjecture with atoms = IPaddress
Open SLIPWH.exe and type in “a = 0” and then “b = 1” in the command line to produce the link conjecture seen in Figure 2. There are spaces around the “=” sign in these commands. Type “Help” to see the list of knowledge operating system (KOS) verbs available from here. Call 703-981-2676 to ask a question.
Now issue the KOS commands “pull” and “export”. Again, the command “Help” will give more information about the KOS commands.
Close the SLIPWH.exe and look into the Data folder. You will find:
You should open each one and see if you understand how each was created and what each of these data structures may be used for.
Links.txt contains
101 E2 E3
102 E5
103 E6
104 E5
105 E2 E3
106 E3
107 E3
108 E1
109 E1 E3 E4
110 E1
The first line in
Links.txt indicates that the address 101 has been associated with E2
and E3 by the SLIP conjecture.
It is important to note that two things have happened up to this point. First some process created the original datawh.txt. The second thing that has happened is that the SLIP conjecture is used as a conjectural convolution over this datawh.txt file. There are many ways to create the datawh.txt files and many ways to create conjectural convolutions. But we are taking a simple example. The process that produces the datawh.txt file is completely up to the user of the OSI browsers. However, at this point the OSI software only offers the simplest of the convolutions based on simple link analysis.
Paired.txt contains:
101**105
101**105
101**106
101**107
101**109
102**104
105**106
105**107
105**109
106**107
106**109
107**109
108**109
108**110
109**110
These are pairs of elements from the column that was selected to produce atoms. The nine unique elements that are either first or second position in these pairs is the set:
{ 101, 102, 104, 105, 106, 106, 108, 109, 110 }
103 is not in this list because E6 occurs only once in the eventType column and so there is never a fulfillment of the SLIP conjecture that involves 103. The generalization to the class of convolution operators might be anticipated by how the SLIP conjecture produces atoms based on finding all instances where the conjecture holds true.
One way to generalize this is to use Frank Schank type frames with slots and fillers. Atoms are then things that are found to occupy slots.
Now open the browser SLIPCore.exe and issue the KOS commands “import” and “extract” to produce the screen in Figure 3a. Click once (not twice) on the A1 node to produce the scattering of the atoms.
a b
Figure 3. The Core Browser (a) before clicking on the A1 node and (b) after
The nine atoms in the set of “all” atoms is scattered onto the surface of a two dimensional sphere. (This is generalized to allow “dimensions of information processing” to n dimensional spheres.)
The clustering on the sphere is similar to a Kohonen feature extraction algorithm (artificial neural network) but is simpler and may be more powerful. The clustering algorithm is a form of scatter-gather techniques that have been used in the text understanding research field for perhaps two decades. However, the way that the scatter-gather is done is unique to Prueitt’s work and is simpler and more powerful.
The technique itself has one additional innovation based on the conversion of the ASCII into a line in number base 64. This conversion allows one to very quickly answer a set membership question. This question is about whether or not a pair randomly selected from the set of all possible pairs of atoms is in the set of pairs in Pairs.txt. There are 9 atoms in this case, so there are a possible 81 pairs that can be produced.
The Pairs.txt file has 15 members. In this case the set membership problem is easy. A random pair has a probability of 15/81 of being in the Pair.txt file.
However we have data sets that contain 10,000 members. In this data the set of possible pairs is greater than 10,000,000. In this data set the atoms are ASCII stings that have between 10 and 40 characters. This means that the membership question requires that a string match be made between a text string of between 20 and 80 characters and a string of ASCII that is 10,000*80 = 800,000 characters long.
The critical computational requirement is this membership question, and database solutions are far to slow. We need to resolve 10 million membership questions to complete the average scatter gather task.
This membership question can be addressed in many ways, but perhaps the most optimal way is to pad out each of the 10,000 members to have a length equal to the maximum length over the entire set, say 80 characters and map this to memory.
One needs less than one Meg of memory to accomplish this task. Then the set membership question can be answered in less than 20 fetches (the contents of one of the 80 character data “cell”), compare (the contents of the cell to the pair that one is asking the membership question of.) steps! Why? (The answer is not proprietary; it is just best to figure this out for oneself. Hint: 2^20 > 800,000)
The scatter-gather requires that this single membership question be asked and answered perhaps 10,000,000 times. This required about four hours in FoxPro in our early studies. However, when we used the In-Memory map we find that this take less than 30 seconds.
It has been a few months since Prueitt saw this resolution of the membership question, and he has developed a few other approaches to this, but each of these other approaches are slower than the algorithm suggested above.
Figure 4: Creating subcategories
Moving
on to the clustered data.
By clustering the data in the tutorial set we find that there are three clusters. Type in “random” in the command line of the Core Browser, and then type in “cluster” to see the clusters form. The three clusters can be moved into subgroups by using a command
“a, b -> categoryName”
as done is Figure 4.
See the text in the Command Line in Figure 4, and perhaps type Help into the command line.
The clustering process in the scatter gather is fast and finds a unique decomposition of the elements from the bag of atoms into a set of categorical primitives, such as seen in the next figure where the size of the set of atoms is 2395.
Figure 5: From http://www.ontologystream.com/cA/papers/EventDetection.htm
However, in our case we have only three categorical primitives (called stratified primes).
Figure 6: We have 6 simple compounds each developed by a single link and 9 atoms
These
three primes are composed of five simple compounds (each
defined by one link) and the nine atoms.
On can see this by double clicking on the A1 node to produce the
screen
in Figure 6 (the KOS browsers have four re-sizable windows and you can
readjust
your sizes.
Then
close the browser and the software remember the previous
settings.)
Figure 7: The five simple compounds
Now we can alter the datawh.txt file so that one gains a better intuition about what is possible with cA and eventChemistry.
Close all of the KOS browsers. And put the folder (Figure 1) on the desktop. Make a copy and name it what ever you wish. Having both on the desktop and no programs running is helpful. Now go into the Data fold of the copy and delete all of the files and the folders in the data folder except the datawh.txt file. Open this file up and add a new line:
105 E4
Be sure to put one tab between “105” and “E4”. Now repeat the steps to produce a set of atoms. You will get the new categorical Abstraction seen in Figure 8:
Figure 8: The categorization with the new data element added to datawh.txt
The new compounds are given in Figure 9.
Figure 9: The updated compounds
Now make a third copy of the folder and delete all of the data except of the datawh.txt file again. Take all but the first line and copy in several times into the same file (not introducing blank lines.) Now redevelop the cA and eventChemistry. You will find that the Core Browser will have the exact same data.
Figure 10: The cA and eventChemistry after adding duplicate data
This is a property of the fact that all categories are already present before the additional data is added. The additional data is not introducing any new categories. One can compare Figure 10 to Figure 6 and see that everything is the same.
Root_KOS
loads and saves tabular files to/from memory, and responds to
scripted commands.
Loading
and saving tabular files is important because SLIP Browsers
work with data that is in a system file or in a special In-Memory
Referential
Information Base (I-RIB) data structure.
The SLIP Warehouse and SLIP Technology Browsers transform data
that
resides on text files. (See Figure 1).
Both depend on a Root_KOS kernel.
Figure 11: KOS Browsers work
together
The
Warehouse Browser takes any tab delineated audit
log and
quickly produces a data mart and a combinatoric expression of link
analysis
within the values of an event log. Both
of these files are produced using In-Memory Referential Information
Base
techniques. Once the files are produced
then the Technology Browser can use them.
The
Technology Browser uses In-Memory RIB techniques
to develop
event detection. The Event Browser
(currently underdevelopment) is launched from the Technology Browser
and some
data resources are exchanged. SLIP
atoms are identified in the Technology Browser and those atoms involved
in an
event are conveyed to the Event Browser.
The
I-RIB uses system files, memory maps and mathematical objects such
as lines and arrays. OSI’s I-RIB
technology is derivative of mathematical formalism called structural
holonomy.
Figure 12: Some early event
chemistry drawings
Structural
holonomy is the innovation of Paul Prueitt and is based on
his interpretation of the work of renowned neuroscientist Karl Pribram. This formalism is of the nature of
mathematics and yet has grounding in experimental literatures from
biology and
neuroscience. The basic work has been
privately peer reviewed from over a decade.
Figure 13: One
of the early event compounds
The
underlying concepts of structural holonomy is being made public
domain in order to facilitate the development of I-RIB systems in
support of
knowledge technologies that depend on very fast data aggregation
methods.
Figure 14: The Root_KOS User
Interface and data structures
OSI’s
I-RIB technology is a full data base management system that is
designed to minimally cover the requirements of many classes of data
aggregation, artificial neural network, evolutionary programming and
data
warehousing systems. The smallest
version of the I-RIB technology is part of the 172K SLIP Warehouse and
the 348K
SLIP Technology Browser. Both of these
systems are operational.
The
Root_KOS is marketing as a development environment, complete with a
minimal I-RIB system. Yearly site
license is $19,500 for the development environment. Root_KOS
is applicable to elementary research in Natural Language
Processing, full text mining and warehouses, and certain types of
eBusiness
functions. The development of
deployable applications are streamlined.
Any
KOS Browser can recognize scripted commands by (1) command line
input (2) certain mouse events on nodes of arcs in the graph window, or
(3)
mouse events on the node tabs at the top of the graph window. The exit
and help
functions are the most basic of all scripted commands.
Scripting
provides for user gestures as response to the presentation of
state information from an internal ontology or finite state machine. Algorithmic responses to a human gesture
produce changes in
1)
The ontology or
finite state machine,
2)
Changes in
auxiliary resources such as system files and
3)
A textual
response in the response line and/or a sound response.
KOS
Browsers have:
A.1:
Modal transforms that switch the view in display areas and
change the
system’s response to script commands
A.2:
Graphs that are generated from a process model. The
elements of the graph are active
objects.
A.3:
Allow the selection and movement of objects within the graph.
A.4:
Selection and movement will change the state of corresponding
ontologies
or finite state machines.
A.5:
Selected objects will refer to contents. So
for example, selecting a node and typing “Open Event Browser”
(or “oeb”) will take the contents of the node and move this content to
a new
instance of the Event Browser.
Selecting a node and typing cluster” (or “c”) will cluster the
atoms
corresponding to that node.
A.6: Grammatical
interface composed of a command/response line with history.
KOS
Browsers have a formal Program Interface (PI) that is completely
implemented through a categorized message stream interface. The machinery for the PI is complete in the
Root_KOS and extendable in any KOS Browser.
Project plumbing in
the root KOS
provides maximal notational affordance by using extensible graph based
formalism.
Notation on states and gestures extend the KOS foundational notion: The notation for states, gestures and locations is:
The
set
of world states S = { sj }.
The
set
of gestures G = { gi }.
The
set
of locations L = { Lk
}.
A formal theory of states and
gestures provide a computational means to discuss event paths in human
computer
interactions, particularly for telelearning, distance learning, and
virtual knowledge
collaboration. These states can be used as part of the mechanics of
what we
call a "knowledge processor".
The notion of location allows for
the development of enterprise KOS.
The formal theory of states and
gestures is built from a precise formalization of computer game and
user
interactions in Multiple User Domains (MUDs) (such as the Palace 2-D
chat
environment) and Role-Playing Games (RPG) (such as ChronoTrigger or
Zelda.)
Gestures and states can be modeled
as the fundamental units of a specific finite state machine. Thus,
under any
such designation, the related model of event sequences is computable. The KOS, however, must represent the
invariance in the data AND the mental states of humans.
This means that our effort to representation
the contents of mental events within the context of discourse, should
be split
into two parts. These two parts are
1)
The
aggregation of data invariance and
2)
The
representation of user response history in terms of the set of
aggregated
invariance.
The aggregation of invariance is
used to represent the computer processes, and the user response history
is used
to represent the natural world. The
communication between finite state machines is used to control and
allow user
(community) focus on a specific set of issues.
This control of focus provides universal “Informational
Transparency
with Selective Attention”.
We can assume that a set of states, S = { sj
},
are created before hand and the gestures G = { gi } are a
set of
stored responses. This would mean that S and G are actually the atoms
of finite
state machine. However, the existence of event states may be actually
implicit
unless actually instantiated as part of an event. In this case, it is
proper to
refer to S and G as virtual finite state machines. In
the course of working with the KOS Browsers, users select
gestures in response to a presented state from the computer. This
pairing of
state and gesture is a represented as a mutual decision event.
In the Event Chemistry being developed by OSI,
the finite
state machine is a virtual combinatoric expression of the aggregation
of
abstractions called “SLIP atoms”. This
abstraction plays a role similar to the abstractions we have in
physical
chemistry. Each atom has a specific set
of linkage potential. The manipulation
of the linkage potential (called “valence: in physical chemistry) leads
to the
formation of chemical compounds.
The notions of an “atom” and “chemical valence”
are useful
abstractions. In the Event Browser we
use a different set of abstractions to visualize the parts of events
and the
means that these parts have to be a compound in a larger event.
Figure 15:
event
chemistry
An abstraction from the signatures of real
occurrences leads
to event chemistry. We have developed an additional
capability. This capacity is to provide a unique situational analysis
based on
the relationship of part to whole as revealed in the formal literature
of
Russian applied semiotics and related literature in the West.
Definition: The term
formal system refers to a four-term expression:
M = < T, P, A, R>
where T = set of basic elements, P
= syntactic rules, A = set of axioms, and R = semantic rules. The interested reader can refer to Situational
Control for a more detailed treatment of formal systems.
For our purposes we need only
refer to a figure from page 37.
Figure 16: Taken from Pospelov Situational
Control
In figure 16, the set of base
elements T are combined in various ways to produce three types of sets,
axioms,
semantically correct aggregates, and syntactically correct aggregates.
We
should remember that mathematical logic is founded on a similar
construction
and therefore that most of the results of mathematical logic will
somehow apply
later on to the theory of knowledge processing that we are
constructing. For
example, the set of axioms can be specified to consist of independent,
non
contradictory and self evident statements about the set of base
elements T.
The axioms are our analog to the
quantum mechanical substrate that gives rise to the mechanisms that in
turn
produce some of the material and energetic constraints that are
involved in the
creation of awareness. Rules of inference can also be formulated to
maintain
notions about true or false inference from assignments of a measure of
truth to
the axioms.
The set of syntactically correct
aggregations of elements of base elements can be defined either by
listing or
by some implicit set of rules.
The semantically correct
aggregations could then be interpreted as those syntactically correct
aggregations that have an assignment of true as a consequence of the
inference
rules.