Note 1: Taxonomy Development

Taxonomy Note 2: Taxonomy Development

Thursday, December 11, 2003

Purpose of Taxonomy

Purpose of Taxonomy: A fixed taxonomy is sought to drive long-term document metadata production.

The Community is the Origin of Taxonomy: The stakeholder community knows the subject matter. A comprehensive and commonly agreed-upon collection of subject matter indicators can be acquired only from the stakeholder community, and only through careful and complete enumeration.

Upper Taxonomy: A two-level Upper Taxonomy is required to be stable because it represents abstract classes of subject matter indicators. Rigid control over the Upper Taxonomy works culturally because its elements are not specific to any single instance.

Hidden Taxonomy: The Hidden Taxonomy, however, is often only implicit and is often not seen by humans, either when documents are placed into repositories or when repositories are searched and documents retrieved. Full-text indexing reflects the linguistic variation in the text. Measuring that linguistic variation is one technical means of implicitly capturing a functional Hidden Taxonomy.

Reification: The relationship between the Hidden Taxonomy and the Upper Taxonomy is reified by community participation, both in the creation of the Upper Taxonomy and in the guidance imposed on hidden processes by active user community feedback. The term “reification” means “making machine knowledge representation human-like”. Human-centered reification is efficient only when some part of the stakeholder community inspects and makes refinements to the elements in the Hidden Taxonomy.

Kinds of Taxonomy: A review of the literature on subject matter taxonomy indicates that there are three kinds of metadata:

{ reuse metadata, retrieval metadata, tracking metadata }.

Each of these kinds of metadata is developed separately. One possibility for the first level of the Upper Taxonomy is to use these three categories of metadata.
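As a minimal sketch, assuming the three kinds of metadata are taken as the first level, a two-level Upper Taxonomy could be held as a fixed mapping from each first-level class to its second-level elements. The second-level names below are hypothetical placeholders chosen for illustration; they do not come from any audit or from the stakeholder community.

```python
from types import MappingProxyType

# First level: the three kinds of metadata named above.
# Second level: hypothetical placeholder elements, for illustration only.
_UPPER = {
    "reuse": ("document type", "template"),
    "retrieval": ("subject area", "organizational unit"),
    "tracking": ("author", "routing step"),
}

# A read-only view models the requirement that the Upper Taxonomy stay fixed.
UPPER_TAXONOMY = MappingProxyType(_UPPER)


def second_level_paths(taxonomy):
    """Return every (first level, second level) pair in the taxonomy."""
    return [(top, leaf) for top, leaves in taxonomy.items() for leaf in leaves]
```

The read-only wrapper is one possible way to express "fixed" in code: attempts to assign into UPPER_TAXONOMY raise a TypeError, so changes have to go through a deliberate revision of the underlying definition.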

Value of a fixed Taxonomy: A fixed taxonomy is put into place to drive document metadata production. Fixing the taxonomy provides specific benefits to later retrieval, due to (1) standardization of terminology and (2) anticipatory responses from users. In other words, because the controlled vocabulary is standardized, the stakeholder community comes to anticipate how subject matter is classified.

Limitations of a fixed Taxonomy: However, there is a trade-off. Work and communication efficiency are reduced when the controlled vocabulary becomes too rigid, or when the taxonomy does not reflect the true subject matter content.

Drill down on types of Taxonomy

Evolution of Knowledge Flow models: If the fixed taxonomy includes some tracking metadata, then one is able to see patterns of information flow. These patterns can sometimes, but not always, be converted to classical pre-wired workflow (completely controlled by explicit rules and procedures). In many cases, the patterns of information flow can be captured in models. Knowledge management specialists call these models knowledge flow models.

If a fixed tracking metadata taxonomy is in use for a short period of time, then the knowledge flow model will become revealing and useful in realizing greater efficiencies in community communication and in realizing greater transparency over those processes where transparency is mandated by law.
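One simple way to picture a knowledge flow model is as a count of hand-offs observed in tracking metadata. The sketch below assumes each tracking record is a (document id, holder) pair listed in time order; the record layout, the holder names, and the function name are all hypothetical and are not drawn from any existing system.

```python
from collections import Counter


def knowledge_flow(tracking_records):
    """Count hand-offs between consecutive holders of each document.

    tracking_records: iterable of (document_id, holder) pairs, assumed to be
    in time order. Both fields are hypothetical tracking metadata.
    """
    last_holder = {}
    flows = Counter()
    for doc_id, holder in tracking_records:
        previous = last_holder.get(doc_id)
        if previous is not None and previous != holder:
            flows[(previous, holder)] += 1
        last_holder[doc_id] = holder
    return flows


# Hypothetical tracking records for two documents.
records = [
    ("doc-1", "intake"), ("doc-1", "legal"), ("doc-1", "records"),
    ("doc-2", "intake"), ("doc-2", "legal"),
]
flows = knowledge_flow(records)
```

Hand-offs that recur often are candidates for conversion into explicit workflow rules; those that occur rarely or irregularly are the patterns that remain in the knowledge flow model only.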

Development of a dependable Retrieval Index: A fixed taxonomy has subject matter indicators that point to concepts and processes. Members of a community of practice anticipate that these concepts and processes will be the subject of future queries. The community will develop usage practices that depend on this specific and fixed metadata organization of information. The more comfortable the community feels about the subject matter taxonomy, the more skill the community will develop in using it.

The notion of a controlled vocabulary is relevant here because the terminology expressed in free text will change over time and will include terminology elements that are managed by library staff or information intermediators. The controlled vocabulary is often much larger than the subject matter taxonomy. The subject matter taxonomy should have an interface with the controlled vocabulary used in managing the functionality of full-text search and retrieval. We recommend that the notions of an Upper Taxonomy and a Hidden Taxonomy be used.

The FCC taxonomy architecture will have a fixed "Upper Taxonomy" with two levels and a managed Hidden Taxonomy with a mediated interface over the implicit elements used by Verity and/or Autonomy. The Hidden Taxonomy will have a controlled vocabulary of 300 to 400 terms, and constructions using this controlled vocabulary will be mapped to elements in the second level of the Upper Taxonomy. This mapping can be put into place using the existing Autonomy and Verity products (already purchased by FCC for integration into iManage).
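The mapping from controlled vocabulary to the second level of the Upper Taxonomy can be illustrated with a small lookup. This is only a sketch under stated assumptions: the terms and the taxonomy elements are hypothetical, single terms stand in for the richer "constructions" described above, and nothing here uses or represents the Verity or Autonomy products.

```python
import re

# Hypothetical controlled-vocabulary terms, each mapped to a hypothetical
# (first level, second level) element of the Upper Taxonomy.
VOCABULARY_TO_UPPER = {
    "spectrum": ("retrieval", "subject area"),
    "license": ("retrieval", "subject area"),
    "memorandum": ("reuse", "document type"),
}


def classify(text, mapping=VOCABULARY_TO_UPPER):
    """Return the set of Upper Taxonomy elements whose terms occur in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {element for term, element in mapping.items() if term in words}
```

For example, classify("Memorandum on spectrum license renewal") yields both the hypothetical retrieval and reuse elements, while text containing no controlled-vocabulary term yields an empty set. Keeping the mapping as data rather than logic lets the managed vocabulary change without touching the fixed Upper Taxonomy.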

Document type = reuse taxonomy. Materials available from the BearingPoint information and content audit show the importance of document type. A standard taxonomy refined for document types is available from previous document management projects. These classifications are, taken as a whole, an example of a reuse taxonomy.