Taxonomy Note 3:
Use of ORBs to instantiate Stratified Taxonomy
These notes are a log of issues arising while developing a taxonomy methodology for the FCC
Friday, November 14, 2003
Notes on how emergent taxonomies are developed
Forming the atoms of the lower level of a two-level Upper Taxonomy
The FCC data initially consisted of 24 documents containing a
total of 3900 "good" sentences.
This is a small collection but was selectively gathered by a domain
expert.
The 3900 sentences were developed using a Visual
Text module that Amnon developed. The principle is that, with a low-cost
Visual Text development environment, one can hand-tune the text analysis
using a multi-pass parser that has both highly situational ontology services
and linguistic services.
The sentences were produced as a long list in a
single file. This file was sent to me.
I imported this list into a FoxPro environment, where I manually applied some techniques that are principled, based on a review of the text understanding literature. Some of these steps could have been done better with Amnon’s Visual Text API than by hand. About two days of work, or less, is required to tune a Visual Text module to the domain so that the highest quality selection of meaningful sentences is produced from a collection of any size.
It is my understanding that we will soon have all FCC public rules in a single ASCII text format and will repeat the process with this collection.
In a second process, we will provide a full-text
keyword index on the agency public rules collection.
For technical reasons related to how statistical
parameters are usually set in NdCore and in Stratify, though not in Entrieva
(Semio tagger), I started at the top of the sentence list and put ten
sentences into each of 390 ".txt" files.
This data is enclosed in the zip file sentBoundaries.
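As a minimal sketch of the ten-sentence split described above, the following Python shows one way to cut a sentence list into consecutive groups of ten and write each group to its own file. The function names, the file naming scheme, and the placeholder sentences are assumptions made for illustration; the original split was done by hand in FoxPro, and the layout of the files in sentBoundaries may differ.

```python
from pathlib import Path


def chunk_sentences(sentences, size=10):
    """Split a list of sentences into consecutive groups of `size`."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def write_chunks(chunks, out_dir):
    """Write each group to its own numbered .txt file, one sentence per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, chunk in enumerate(chunks, start=1):
        path = out / f"sent{n:03d}.txt"  # hypothetical file naming
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder data: 3900 stand-in sentences, matching the count in these notes.
sentences = [f"Sentence number {i}." for i in range(1, 3901)]
chunks = chunk_sentences(sentences)
```

Calling write_chunks(chunks, "some_directory") would then produce the 390 ".txt" files, starting from the top of the list as described.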
In NdCore ™ there is a process that occurs after
the Output Array of simple trees is developed using a simple convolution over
the end nodes of the bag of branches produced from word level 5 grams.
http://www.bcngroup.org/area2/KSF/Notation/notation.htm#_Section_2.4:_Simple
ORBs (Ontology Referential Bases) are based on
the NdCore ™ system and on work done in 2001 on categoricalAbstraction (cA) and
eventChemistry (eC):
http://www.ontologystream.com/cA/index.htm
Nathan’s new ORB, based on a Berkeley hash table management
system, takes the word level 5 grams and convolves them to produce a
set of simple trees, one for each significant word in the set of word
occurrences. However, at this step we convert
to set theory rather than use the trees as graphs.
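The step from word level 5 grams to simple trees can be sketched as follows. This is only an illustration under stated assumptions: these notes do not specify the NdCore convolution, so the sketch assumes that the middle word of each 5 gram is the root and that the four surrounding words are its branches. It also treats every word as significant, and the sample sentence is made up for the example.

```python
from collections import defaultdict


def word_5grams(words, n=5):
    """Return every window of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def simple_trees(grams):
    """Group windows by their middle word.

    Assumption: the middle word of a window is the root of a simple tree,
    and the remaining words of that window are collected as its branches.
    """
    trees = defaultdict(list)
    for gram in grams:
        mid = len(gram) // 2
        root = gram[mid]
        trees[root].extend(gram[:mid] + gram[mid + 1:])
    return dict(trees)


# Made-up sample text, not taken from the FCC collection.
words = "the licensee shall file the report with the commission".split()
grams = word_5grams(words)
trees = simple_trees(grams)
```

A dictionary keyed by root word is used here because it mirrors the idea of one tree per word; a hash table store such as the one the ORB is built on would play the same role at scale.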
The notion that words occurring next to each
other are relevant to mapping linguistic variation is weakened, so that we
translate each simple tree in the Output Array of simple trees into a bi-level
set-theoretic construction having a center, the root node, and a neighborhood,
the set of all nodes of the simple tree except for the center.
The center is then "regarded as a broad
term" in a broad term / narrow term subject matter taxonomy. (This is an intermediate result, not
the final subject matter indicator taxonomy for ORBs, which will have a
slightly different form.)
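The center and neighborhood construction can be sketched in a few lines of Python. The type name, the function names, and the sample tree are assumptions for illustration only; the sketch shows just the intermediate broad term / narrow term reading, not the final subject matter indicator taxonomy.

```python
from typing import FrozenSet, NamedTuple


class Neighborhood(NamedTuple):
    center: str                # root node of the simple tree
    neighbors: FrozenSet[str]  # all other nodes, with order and adjacency dropped


def to_set_form(root, branches):
    """Translate a simple tree (root plus branch words) into center and neighborhood."""
    return Neighborhood(root, frozenset(b for b in branches if b != root))


def broad_narrow_pairs(nb):
    """Read the center as a broad term and each neighbor as a narrow term."""
    return sorted((nb.center, term) for term in nb.neighbors)


# Made-up simple tree: root "report" with four branch words, one repeated.
nb = to_set_form("report", ["file", "the", "with", "the"])
pairs = broad_narrow_pairs(nb)
```

Using a frozenset makes the weakening explicit: repeated words collapse to one member and the word order of the original text is no longer recorded.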
For this discussion, one needs to review the
following research notes.
A manual process is used to identify potential
elements of the second layer of the two-layered Upper Taxonomy. These are called atoms of the Upper
Taxonomy. The elements of the top level
are called compounds of the Upper Taxonomy.
The manual process involved taking away the middle
spectrum of the connectivity graph, as well as the low frequency part of the
connectivity graph. This graph is the
same as the complete convolution over the NdCore Output Array, if that were ever
done. I first saw this subtractive
method in the original Autonomy classifier engine (1997), but the principles
are in a published literature that goes back at least to 1950. This can be seen in the Method of
Descriptive Enumeration.
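The subtraction can be sketched roughly as below. This is not the SLIP procedure, which was done by hand in the browsers. It assumes that each term has a single connectivity count, and that removing the middle and low bands leaves only the most highly connected terms as candidate atoms. The cutoff fraction and the counts are hypothetical values chosen for the example.

```python
def subtract_bands(connectivity, keep_fraction=0.2):
    """Keep only the top `keep_fraction` of terms ranked by connectivity.

    The middle and low bands are discarded together; ties are broken
    alphabetically so the result is deterministic.
    """
    ranked = sorted(connectivity, key=lambda term: (-connectivity[term], term))
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical connectivity counts, not measured from the FCC collection.
connectivity = {
    "license": 42, "spectrum": 37, "carrier": 12, "filing": 9, "notice": 8,
    "party": 3, "date": 2, "page": 1, "item": 1, "line": 1,
}
candidates = subtract_bands(connectivity)
```

In practice the cutoff would be chosen by inspecting the graph rather than fixed in advance, which is what makes the process a manual one.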
We use the SLIP browsers to do this subtraction process. This process took less than two minutes, because SLIP has a RIB backbone (one of the very few in the world).
For those readers who know how to use SLIP
browsers, the complete ORB is in center.zip (816K).
Feel free to call me (703-981-2676) if you have questions.