Procedure for removing the central linkage from an ORB

ORB Visualization

First Tutorial On SM-Indicators (12/17/03)

Working document on

Subject Matter Indictor neighborhood

Process Patent Application

Friday, December 05, 2003

This invention uses an exact model of the co-occurrence of words within sentence boundaries and expresses this co-occurrence as a navigatable graph. The navigation of this graph then produces a retrieval profile, the use of which pinpoints exactly where co-occurrence patterns are located. Variation of parameters allows the co-occurrence model to change in scope, from sentence level to multiple sentences and document level. A change in scope is called a change in resolution.

This invention uses a number of well-known techniques, such as stop words, to produce a model of co-occurrence of words and word patterns. The uniqueness of this invention is in the use of a topological neighborhood defined on the co-occurrence graph. Each element of the neighborhood has a center and a radius, where radius is the integer that counts the number of co-occurrence links starting at the center. Standard mathematical topology notation is used to illustrate a well-defined mathematical construction. A topology is a set of sets having properties related to intersection and union.

The topological construction itself is not the disclosed invention, but rather the invention is how this construction is used to produce semantically valid retrieval and metadata elements.

The model is used to provide an exact description of the co-occurrence of words within text. This model captures the linguistic variation related to different ways of speaking about various subjects. For example, in Figure 1 we have one form of linguistic variation centered on the word occurrence “orders”.

Figure 1: Subject Matter Indicator for the word-occurrence “orders”

(from the FCC ORB taxonomy )

There are many well known mathematical techniques based on the use of co-occurrence of words within larger linguistic structures. Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSI) are the most well known. However in both cases, the exact model of the co-occurrence is not created. In the CCM model an exact co-occurrence is developed, but only at one level of resolution. Moreover, the CCM model is not used to produce “differential” retrieval profiles, those profiles being subject to partial, and differential, use by the user.

The process aspect of this invention describes a methodology for handling the SM-Indicator constructions both within a community and within a single user environment.

The mathematical construction, when its graphical representation is viewed by a human, supports cognitive priming. A cognitive reliance on the represented information develops from precise and exact information that is highly relevant to words, sentences or documents submitted by the user through a graphical user interface. The consequence of looking at the construction evokes, in a natural fashion, human decision-making dependant on the precise and exact information forming the elements of the construction.

The non-statistical nature of the SM-Indicators is unlike typical text understanding algorithms.

This present invention also involves a method for the distributed use of Subject-Matter Indicator (SM-Indicator) neighborhoods to assist communities in the development of fixed metadata constructions, based on the neighborhoods, and the use of emergent constructions called, within the context of the present invention, formative ontology. These knowledge management type uses for the present invention are available as the key element of a cyclic process model whereby human cognitive acuity plays an essential role in the production of the fixed meta data construction, having been reified by users or specialists, and in the identification of new patterns having potential significant to an individual or group.

SM-Indicators are refined and placed into a controlled vocabulary by editors and librarians. Once in this controlled vocabulary, the SM-Indicators are available for situational modification by editors, librarians and/or users.

As a further reduction of the mathematical construction the current invention provides a new type of machine inference and pattern completion. A type of pattern matching is possible that produces an inference about concepts that might be a subject within single documents or existing in a distributed fashion across many documents. Other inferential mechanisms can be based on the current invention, and are not regarded as part of the current disclosure. An example of how the current invention might be used is to produce a machine inference about how the authors of a document or documents might interpret a conjecture that was not explicitly addressed in published document repository.

Figure 1: A standard type interface using key words

The current invention uses a standard key word search engine in an unmodified form. The invention depends on having two components, a properly reified topology of SM-Indicator neighborhoods and a standard search engine.

The invention addresses the question of which words a user should use to obtain the best search results. Exact and precise co-occurrence information is provided within a easy to understand visual form.

Figure 2: a mock up of the advanced search interface

Key word search has well known limitations. Using only one word still has a precision recall failure when considered semantically. A single key word, “obstructive”, for example will return all documents that have one or more occurrences of this single word. In the illustrating case, more than 5,000 of the total of 36,118 documents in a repository (FCC ruling 1997 – 2003) are retrieved. Given the two words “purely” and “obstructive” the retrieval has nine documents.

The presence of “purely” in the SM-Indicator neighborhood with center “obstructive” informs the user that in fact one of more co-occurrences of these two words, within the same sentence, can be found. The use of exactly these two words in a key word search will produce all instances of documents having this property.

Figure 3: Retrieval of all documents

In Figure 3, nine documents are retrieved. These are all of the documents having both “purely” and “obstructive” occurring in the same sentence. The underlying construction guarantees that only 9 out of 37,188 documents (in this illustrative case) have precisely this property. So in this case the technical precision recall measure is 100%, as in any full text key word retrieval.

The current invention includes a simple and straightforward method in which thesaurus and/or machine ontology is used to disambiguate a single SM-Indictor neighborhood into two or more SM-Indicator neighborhoods – each having a more specific subject indication. A second method is used to produce an ambiguation of a SM-Indicator neighborhood, leading to the union of several SM-Indicators into one.

Figure 4: Two SM-Indicator neighborhoods having some relationship

The current invention goes not to the question of how key word retrieval is achieved. The current invention provides any underlying mathematical formulation and a process model for its more effective use. The formulation and the process model together provide a high fidelity between what the user is interested in and the documents retrieved.

The unanticipated aspect of the current invention occurs because so far machine learning and computational linguistics based test understanding systems have not relived on simple human cognitive acuity, depending rather on the false concept that the machine has recognition properties equal or surpassing human ability.

Prototype for Advanced I-ORB search

Thursday, December 04, 2003