ARL Final Report
Section 2
The Application of CCG generalization to the problem
of text parsing
December 31, 2000
ARL has expressed an interest in text parsing. Thus it is important to specify how we see text parsing fitting into an architecture that deploys a generalization of Russian CCG technology.
The diagram for this architecture has the following parts:
1) a preprocessor for data streams
2) an image library
3) a visualization interface and control system
4) an indexing engine
5) routing and retrieval engines
6) a viewer or reader interface with a feedback loop to components 2, 3, 4, and 5
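The relationships among these components can be sketched in code. The following is a minimal, hypothetical sketch, not an existing implementation; all class and method names are illustrative assumptions.

```python
# Hypothetical sketch of the architecture described above. Component 1
# cleans raw input; components 2 and 4 stand in for the library and
# indexing engine; a Pipeline wires them together.

class Preprocessor:
    """Component 1: puts incoming information into a regular data structure."""
    def clean(self, raw):
        return raw.strip().lower()

class ImageLibrary:
    """Component 2: a library of images (or, more generally, tokens)."""
    def __init__(self):
        self.images = {}

class IndexingEngine:
    """Component 4: indexes library components and data streams."""
    def __init__(self):
        self.index = {}
    def add(self, key, value):
        self.index.setdefault(key, []).append(value)

class Pipeline:
    def __init__(self):
        self.pre = Preprocessor()
        self.library = ImageLibrary()
        self.indexer = IndexingEngine()
    def ingest(self, raw):
        doc = self.pre.clean(raw)
        for token in doc.split():
            self.indexer.add(token, doc)
        return doc
```

A viewer interface (component 6) would sit on top of such a pipeline and feed decisions back into the library and index.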
Component 6 is where we might place a decision support interface. Components 2, 3, 4, and 5 are really the components of a knowledge warehouse; these components are built up over time. In data mining terminology, component 1 is called a data-cleaning component.
The preprocessor simply must do what is necessary to put the incoming information into a regular data structure. The cleaned data structure can take many different forms; however, in the text-parsing task all of these forms have deficits. Again, this is due to the indirect relationship that textual information has to the experience of awareness or knowledge. We seek to transfer the interpretability of natural text into an image framework.
The viewer interface is used in real time on problems of some consequence. Thus there is a systems requirement that a feed-forward loop be established from one decision-making event to the next. The feed-forward loop must touch components 2, 3, 4, and 5.
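One way to realize such a feed-forward loop is to let each decision event at the viewer update the knowledge warehouse before the next event. The sketch below is a hypothetical illustration under that assumption; the names and the simple additive weighting scheme are not taken from any existing system.

```python
# Minimal sketch of the feed-forward loop: each decision event at the
# viewer (component 6) emits feedback that updates the knowledge
# warehouse (components 2-5) before the next decision-making event.

class KnowledgeWarehouse:
    """Stands in for components 2-5 (library, visualization, indexing, routing)."""
    def __init__(self):
        self.weights = {}  # relevance weights adjusted by human feedback
    def apply_feedback(self, term, delta):
        self.weights[term] = self.weights.get(term, 0.0) + delta

def decision_event(warehouse, accepted_terms, rejected_terms):
    # Human judgment at the viewer feeds forward into the warehouse,
    # so the next retrieval sees the adjusted weights.
    for t in accepted_terms:
        warehouse.apply_feedback(t, +1.0)
    for t in rejected_terms:
        warehouse.apply_feedback(t, -1.0)
```

The point of the sketch is only that the loop touches the warehouse components between decisions, which is the stated systems requirement.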
This architecture allows at least three places where human perception can make a difference in the system's computations. The first is in the development of the image library (perhaps a token library is a more general concept). The second is in the indexing of the library components and perhaps some data streams. The third is in the decision support system (component 6).
In Section 1, we made several claims. One of these is that the flow of information from an object of investigation can be vetted by a human-computer system if and only if all real aspects of the syntactic and semantic representational problems are addressed. The flow of information from an object of investigation is likely to suffer from the data source being somewhat indirect, as with EEG data and linguistic data. Data sources such as astronomical data are more direct and thus more like formal data sources such as number theory.
The partial success
of statistical methods on word frequencies attests to the fact that a partial
solution to these problems leads to an imperfect result. The glass is either half empty or half
full. We do not know how to make this
judgment, because today there is no completely satisfactory automated text
parsing system.
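The partial success, and the limit, of the word-frequency approach can be shown with a toy example. This is an illustration of the general technique, not of any specific TREC system; the two sample sentences are invented for the purpose.

```python
# Toy illustration of the statistical word-frequency approach: term
# counts are easy to compute and useful for routing and retrieval, but
# identical counts cannot distinguish word senses in context.

from collections import Counter

def term_frequencies(text):
    return Counter(text.lower().split())

doc_a = "the bank holds the deposit"   # financial sense of "bank"
doc_b = "the bank holds the river"     # geographic sense of "bank"
fa, fb = term_frequencies(doc_a), term_frequencies(doc_b)
# "bank" has the same frequency in both documents, though its sense
# differs; frequency alone cannot resolve the ambiguity.
```

This is the sense in which a partial solution yields an imperfect result: the counts capture something real about the documents while leaving the interpretive problem untouched.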
The indirect nature
of the data source would seem to imply that a human interpretant is necessary
before there can be really successful text parsing. Thus the notion of vetting is proper, since this notion implies causation
on a process that is mediated by a knowledgeable source and human
judgment. The goal of a CCG system for
text parsing is to transfer the interpretive degrees of freedom of text into an
image framework. Once in the framework, certain algorithmic paths can produce suggestive consequences in new contexts.
It is not yet known
if new methodology, entirely separate from the existing routing and retrieval
technologies, will give rise to new and more successful results. We have suggested that much of the statistical work on word frequencies is hard-limited by the nature of anticipation. The statistical sciences can tell perhaps everything about the past, but cannot always predict the future. Moreover, we have the problem of false sense-making. The meaning of words is imbued with ambiguity precisely for the purpose of predicting the meaning of words in contexts that are bound in a perceptual loop. This loop involves both memory and anticipation.
One can revisit the TREC and TIPSTER literatures, as we at TelArt will be doing over the next three months. In this review, we find not only statistical approaches, such as those of David Lewis at AT&T and Susan Dumais at Bellcore, but also a few linguistic and semantic methods. These methods are being reviewed as part of the Indexing, Routing, and Retrieval (IRR) evaluation conducted by TelArt for a commercial client.
An understanding of
routing and retrieval techniques might assist the generalization of the CCG
technology. This generalization was
done to establish some broad principles that might be formative to a proper
text parsing system. The CCG
technology can then be seen to have the following parts:
1) a representational schema
2) a visualization schema
3) an induction schema
New thinking on indexing, sorting, or the arrangement of data atoms may also provide value to our task. As we look for tools and methods, we will of course be somewhat hampered by the now-proprietary nature of new technologies. However, this is just part of the task we set for ourselves.
The application of
CCG technology to the problem of text parsing requires that our group have a
command of all existing IRR technologies and theory.