Sunday, December 18, 2005
Lattice of ontologies
Communication
from Dick Ballard
Paul, John, and Colleagues:
I was deeply involved with natural language dialog systems from 1970 to 1991. All the early work with Bork for NSF at UC Irvine 1970-78, Apple 1978-79, Interactive Productions Mark 1 1982-93, Nichols SDI SETA ("Star Wars Battle Management") 1986-90, and Logitech/Media Share (the "BookReader Project") 1990-91 -- all involved creating, testing, and deploying four generations of natural-language-based simulation and teaching platforms.
"Context" is indeed an artifact of natural-language-based reading, writing, and listening. But it is badly compromised by conflating two (or more) issues: (1) sub-language, domain-based noun-phrase disambiguation and normalization versus (2) piece-wise topic specialization by context narrowing. ... [1]
ON SUBLANGUAGE NORMALIZATION
Sub-language work has a very old history, coming out of post-WWII computer-based efforts to translate Soviet and Western European technical journals. That work was led by military language labs and by centers like the one at New York University described, for example, by Naomi Sager. The definitive book was Ralph Grishman and Richard Kittredge's "Analyzing Language in Restricted Domains: Sublanguage Description and Processing" (Lawrence Erlbaum Associates, 1986). It was a primary source for the early Mark 1 & 2 Concept Finders and the BookReader Project. The Mark 2 Builder Tool, EF.exe, and the EditForm 2.0 integration process were all built originally using it for trial "multi-source concept integrations."
Generally, automated language-based integrations were abandoned once source reference collections exceeded 6-7 books (1991-92). Virtually every technology student or library crosses so many domains that no substantial progress has been made in systematizing domain-based assumptions. The last hope is to look experimentally at Google, to see whether they judge domain hits profitable enough to include such algorithms in their search-spider analyses and, if so, for which domains.
In the intelligence world the earliest engine was promoted commercially by Verity. BYTE magazine published an exposé on the "dark side of natural language processing," touting Verity's engine, in an issue devoted to Soviet computing in the early 1990s. At the same time (1990-91) we were commissioned to create BookReader to surpass Verity's processing capabilities, because publishers like Media Share and Golden Book did not like Verity's high licensing cost. When finished, we licensed our tool to Golden Book, then folded it into the Mark 2 Concept Finder and Mark 2 EF.EXE for advanced engineering tests on the TECAP knowledge base built for the NAWC.
That project sought to catalog the total "Test and Evaluation Capabilities" of all Joint Service range and test facilities, in order to begin drawing them down in search of a "Peace Dividend." Natural language flunked that integration test badly: technical noun-phrase construction is too open-ended an opportunity for idiosyncratic "word picture constructions." We went over instead to using the pattern of relationships tied to each concept, gathered by our EditForm transcriptions of source content into patterns of thought. [2]
Relationships are tied fundamentally to those theories and natural laws accepted as justified ontological commitments -- as noted famously by Quine. Their definition and validity are testable by science and are independent of language. Knowledge representations based on theory are then forced to evolve as new and better theories are proven, so theory-based semantics are compelled to "learn." By contrast, definitions based on logic, language, and convention are judged standards precisely for the property of never changing.
Even before our adoption of theory-based semantics, the naming of relationships was tied by language-based practice to binary concept endpoint-name concatenations, active verbs, prepositional relationships, or property-value names (concept/property/value). If these are used consistently, the number of relationship types seldom rises above 250 in most sources, and so never climbs anywhere near the hundreds of thousands to millions of noun-phrase permutations.
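The bounded-vocabulary point above can be sketched in a few lines. This is a hypothetical illustration, not code from the Mark 2 tools: relations are stored as concept/relationship/value triples, and the set of distinct relationship types (the controlled vocabulary) stays small even as concepts and triples multiply.

```python
# Hypothetical sketch of relationship naming by triples (not from the
# original Mark 2 systems). Each relation is a (concept, relation, value)
# triple; the relation names follow the practices named in the text:
# prepositional, active verb, property/value, and endpoint concatenation.

from collections import namedtuple

Triple = namedtuple("Triple", ["concept", "relation", "value"])

# Illustrative triples (all names here are invented examples).
triples = [
    Triple("turbine", "part_of", "engine"),              # prepositional
    Triple("engine", "drives", "generator"),             # active verb
    Triple("generator", "rated_power", "50 MW"),         # property/value
    Triple("shaft", "engine-generator", "coupling"),     # endpoint concatenation
]

# The controlled vocabulary of relationship types is what stays bounded
# (seldom above ~250 per the text), unlike free-form noun phrases.
relation_types = {t.relation for t in triples}
print(len(relation_types))  # 4 distinct relationship types here
```

The design point is that consistency of the relation vocabulary, not the size of the concept set, is what keeps such a representation tractable.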
ON TOPIC SPECIALIZATION BY CONTEXT NARROWING
It is difficult to give "Context" a firm conceptual meaning. This application assigns it the function of denoting any arbitrary level of granularity within a linguistic, geometric, ... composition of parts and wholes (look up "mereology," if you want the relevant theory). [3]
In linguistic description or argument, the structural problem is that propositions are asserted within sentence constructions whose conditional constraints need immediate local definition and elucidation. To define clearly a highly restricted or specialized concept within the reader's scope of uninformed interest, writers must lay out an understandable trail of contextual narrowing -- book to chapter to section to paragraph to sentence. Unfortunately there are typically many such paths, so virtually every possible subject or topic traversed in moving from generality to a specialized concept is arguably yet another seemingly unrelated context. If you find one context, you will likely find hundreds more. [4]
Our horror stories include the MilFind project for Tom Carroll's publishing company, then in Georgetown next to the canal (now in Bethesda, under Tom's son, I believe), and numerous projects involving Lord Thomson's British company Jane's. Both represent reference publishing empires: one encompasses the entire US government's organization, facilities, budgeting, contracting, and contractor org charts and phone numbers under Freedom of Information Act disclosure (Carroll); the other, worldwide strategic assessments of all the world's ships, aircraft, space vehicles, etc. (Jane's).
Both are extraordinarily authoritative sources we have worked with over the years, but we also learned that every year they construct hypothetical organizational units and taxonomic divisions unknown to the subjects they describe -- simply to fit all their charts and tables into comparable chapter and page sizes. Likewise, to provide seemingly up-to-date reports, Carroll advertises shipping updates every six weeks as a feature; in fact only a small fraction are actually updated, because of high artwork costs and long drafting-update cycle times. Similarly, the names they assign to organizational offices are unique only within the one-page context in which they are printed, so the published office names can be and are ambiguous and unsynchronized within every published snapshot. The exact same name may be assigned to 20-50 offices, with page context expected to resolve the ambiguity. Not uncommon. [5]
At the time (1993-94), Carroll's computer records maintained neither the contexts nor the chart structures in electronic form; their draftsmen's drawings were their only integrated storage record. To build a global form we had to disambiguate, systematize, and synchronize their electronic records, but at the time their much-abused drafting group refused to change those records to remove such global naming ambiguities.
CONTEXT-NARROWING FORMALISMS AND USE-LOGICS ARE MEDIUM-DEPENDENT, NOT SUBJECT-DEPENDENT
Things there and everywhere have certainly changed, but our lessons learned set us on the road to understanding that context narrowing produces form-dependent context choices that have almost nothing to do with subject associations. Dimensionality and media costs produce different micro-organizational charting formalisms, each with its own implicit, form-specific logic. That caused me to mirror John Sowa's use of the notion of "mediating structures" [6] in the broadest sense of those words -- in our world, independent of conventional linguistic or logical assumptions.
The n-ary bundles in our high-dimensional semantic spaces are turned by some into waterfall structures in two dimensions, with very different internal representations. Given these wildly different geometries, which contextual definition is demonstrably, provably, or conventionally right?
Dick
[1] Paul: You are not implying here that context is NOT also an artifact of situatedness -- something that occurs even in non-living systems, such as elementary-particle interactions?
[2] Paul: In this sense, the pattern of relationships (defined in some fashion) is the context for the elements of that pattern.
[3] I feel that I know you are talking about using a heuristic that asks the question: is this part of the pattern of relationships "part" of the same concept as this other part of the pattern of relationships?
[4] Precisely. The requirement is to separate the parts of a pattern so that there is a one-to-one correspondence between patterns and "concepts" or "taxa" (if one is working in bioinformatics). In bioinformatics, the unit for comparison and organization is called a "taxon."
[5] UDEF naming conventions recognize this contextualization issue.
[6] We need a definition for mediating structure.