

Taxonomy Note 3:

Use of ORBs to instantiate Stratified Taxonomy

These notes are a log of issues arising while developing a taxonomy methodology for the FCC

Friday, November 14, 2003

Notes on how emergent taxonomies are developed

Forming the atoms of the lower level of a two level Upper Taxonomy

Notes on how emergent taxonomies are developed

The FCC data initially was 24 documents having a total of 3900 "good" sentences. This is a small collection but was selectively gathered by a domain expert.

The 3900 sentences were developed using a Visual Text module that Amnon developed. The principle is that, with a low-cost Visual Text development environment, one can hand-tune the text analysis using a multi-pass parser that has both highly situational ontology services and linguistic services.

The sentences were produced as a long list in a single file. This file was sent to me.

I imported this list into a FoxPro environment, where I manually applied some techniques that are grounded in the text understanding literature. Some of these steps could have been done better with Amnon’s Visual Text API than by hand. About two days of work, or less, is required to tune a Visual Text module to the domain so that the highest quality selection of meaningful sentences is produced from a collection of any size.

It is my understanding that we will soon have all FCC public rules in a single ASCII text format and will repeat the process with this collection.

In a second process, we will provide a full-text keyword index on the agency public rules collection.

For technical reasons related to how statistical parameters are usually set in NdCore and in Stratify, though not in Entrieva (Semio tagger), I started at the top of the sentence list and put ten sentences into each of 390 ".txt" files.
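The splitting step can be illustrated with a minimal sketch. This is not the FoxPro procedure that was actually used. The function names and the file naming scheme (`sent001.txt`, and so on) are assumptions made only for illustration.

```python
from pathlib import Path


def chunk_sentences(sentences, size=10):
    """Split a list of sentences into consecutive groups of `size`."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def write_chunks(sentences, out_dir, size=10):
    """Write each group of sentences to its own .txt file, one sentence per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, group in enumerate(chunk_sentences(sentences, size), start=1):
        path = out / f"sent{n:03d}.txt"  # hypothetical naming scheme
        path.write_text("\n".join(group) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

With 3900 sentences and ten per file, this yields the 390 files mentioned above.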

This data is enclosed in the zip file sentBoundaries.

In NdCore ™ there is a process that occurs after the Output Array of simple trees is developed. The Output Array is built using a simple convolution over the end nodes of the bag of branches produced from word-level 5-grams.

http://www.bcngroup.org/area2/KSF/Notation/notation.htm#_Section_2.4:_Simple

ORBs (Ontology Referential Bases) are based on the NdCore ™ system and on work done in 2001 on categoricalAbstraction (cA) and eventChemistry (eC).

http://www.ontologystream.com/cA/index.htm

Nathan’s new ORB, which is based on a Berkeley hash table management system, takes the word-level 5-grams and convolves them to produce a set of simple trees, one for each significant word in the set of word occurrences. However, at this step we convert to set theory rather than use the trees as graphs.

The notion that the adjacency of words is relevant to mapping linguistic variation is weakened. We translate each simple tree in the Output Array into a bi-level set-theoretic construction having a center (the root node) and a neighborhood (the set of all nodes of the simple tree except the center).

The center is then "regarded as a broad term" in a broad term / narrow term subject matter taxonomy. (This is an intermediate result and not the final subject matter indicator taxonomy for ORBs, which will have a slightly different form.)
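The center / neighborhood construction can be sketched roughly as follows. This is not the NdCore or ORB implementation. As a simplifying assumption, the middle word of each 5-gram is taken as the center, and the neighborhoods of all 5-grams sharing a center are merged by set union. The function names, the whitespace tokenization, and the stopword filter standing in for "significant word" are hypothetical.

```python
from collections import defaultdict


def word_5grams(tokens):
    """Return all consecutive word-level 5-grams in a token list."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]


def center_neighborhoods(sentences, stopwords=frozenset()):
    """Map each center word to the set of words seen around it.

    Assumption: the center is the middle word of a 5-gram, and word
    order inside the neighborhood is discarded (a set, not a graph).
    """
    neighborhoods = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for gram in word_5grams(tokens):
            center = gram[2]
            if center in stopwords:
                continue  # only "significant" words become centers
            neighborhoods[center].update(
                w for w in gram if w != center and w not in stopwords
            )
    return dict(neighborhoods)
```

In this reading, each key plays the role of a broad term and each member of its set plays the role of a narrow term.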

Forming the atoms of the lower level of a two level Upper Taxonomy

For this discussion, one needs to review the following research notes.

Note on taxonomy use

Taxonomy Development

A manual process is used to identify potential elements of the second layer of the two-layered Upper Taxonomy. These are called atoms of the Upper Taxonomy. The elements of the top level are called compounds of the Upper Taxonomy.

The manual process involved taking away the middle spectrum of the connectivity graph, as well as the low frequency part of the connectivity graph. This graph is the same as the complete convolution over the NdCore Output Array, if that was ever done. I first saw this subtractive method in the original Autonomy classifier engine (1997), but the principles are in a published literature that goes back at least to 1950. This can be seen in the Method of Descriptive Enumeration.
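The subtractive idea can be illustrated with a small sketch. It assumes the connectivity graph is represented as counts of unordered word-to-word links, and that a single cutoff separates the high frequency part from the middle and low parts. The function names and the cutoff value are hypothetical. This is not the SLIP browser procedure.

```python
from collections import Counter


def edge_counts(pairs):
    """Count how often each unordered (word, word) link occurs."""
    return Counter(tuple(sorted(pair)) for pair in pairs)


def subtract_low_and_middle(counts, high_cutoff):
    """Keep only links whose frequency is at or above `high_cutoff`.

    Everything below the cutoff (the middle and low frequency part of
    the connectivity graph) is removed.
    """
    return {edge: n for edge, n in counts.items() if n >= high_cutoff}
```

What remains after the subtraction is a small set of strongly connected terms that a person can review by hand as candidate atoms.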

We use the SLIP browsers to do this subtraction process. This process took less than 2 minutes, because SLIP has a RIB backbone (one of the very few in the world).

For those readers who know how to use SLIP browsers, the complete ORB is in center.zip (816K). Feel free to call me (703-981-2676) if you have questions.