ORB Visualization
(soon)
Dr. Myers, at SAIC, asked
Chiliad Inc to look at the early 24-document collection from the FCC public
files.
The work is good.
We are thinking that a committee might be formed to discuss how the results are to be compared. Perhaps one of the agencies can spring for a small workshop in mid-December. There will be some growing pains as we figure out how to increase the complexity of this working group.
I have not yet met the Chiliad innovators, and hope to do so next week. It was Dr. Myers who brought this technology to the evaluation process. Several other companies have developed results using either the fable collection or the 24-document FCC collection.
http://www.bcngroup.org/zippedTestCollections/
The collection that Chiliad tested was developed from a very quick selection of 24 representative documents, so the collection size was initially "small". However, the documents were long.
This allowed an approach that will be repeated with the collection of all FCC rulings from 1997-2003. The approach involved identifying "ALL" good sentences with a text analyzer developed quickly using the Visual Text product (Amnon has provided us with an evaluation license).
Then all but the good sentences are thrown away.
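As a rough illustration only, here is a minimal Python sketch of this filtering step; the surface heuristics below (minimum word count, capitalization, terminal punctuation) are hypothetical stand-ins for the Visual Text analyzer's actual rules, which are not reproduced here.

    import re

    # Hypothetical stand-in for the Visual Text analyzer: the real "good sentence"
    # rules are not spelled out here, so simple surface heuristics are used instead.
    SENTENCE_SPLIT = re.compile(r'(?<=[.!?])\s+')

    def good_sentences(text, min_words=5):
        """Return only the sentences that pass a crude quality filter."""
        kept = []
        for sentence in SENTENCE_SPLIT.split(text):
            sentence = sentence.strip()
            words = sentence.split()
            # Long enough, starts capitalized, ends with terminal punctuation.
            if (len(words) >= min_words
                    and sentence[:1].isupper()
                    and sentence.endswith(('.', '!', '?'))):
                kept.append(sentence)
        return kept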
Most text understanding systems work at the document level when there are statistics to compute; at the least, one can say that most statistics need many "documents". The original Stratify examination of the same 24 documents was not very interesting because the number of documents was small.
The solution: take the sentences in order and create new documents with only ten sentences per document.
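A minimal sketch of that re-blocking step in Python (the chunk size of ten comes from the text; everything else is just illustrative):

    def ten_sentence_documents(sentences, size=10):
        """Group the good sentences, in their original order, into synthetic
        documents of ten sentences each; the final document may be shorter."""
        return [sentences[i:i + size] for i in range(0, len(sentences), size)]

    # Example: 38 good sentences would yield 4 synthetic documents (10, 10, 10, 8).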
The result had two qualities. One, all of the documents contain only good sentences. Two, the number of documents is now large enough for document-level statistics to be meaningful. For NdCore and the OntologyStream ORBs we start out with a word-level 5-gram analysis. Our effort is focused on structural relationships between individual words, and thus the statistical issues come up later on, when there is a type of summarization into broad categories of structures.
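For the 5-gram pass itself, a generic word-level 5-gram count might look like the sketch below; this is only an illustration of the n-gram step, not of the NdCore or ORB structural analysis that follows it.

    from collections import Counter

    def word_ngrams(sentence, n=5):
        """Yield word-level n-grams (5-grams by default) from one sentence."""
        words = sentence.split()
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

    def count_5grams(documents):
        """Tally 5-grams across the synthetic ten-sentence documents."""
        counts = Counter()
        for doc in documents:
            for sentence in doc:
                counts.update(word_ngrams(sentence, 5))
        return counts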
This original procedure is stated here
http://www.ontologystream.com/beads/nationalDebate/four.htm
and the new procedure is stated here
http://www.ontologystream.com/beads/nationalDebate/five.htm
The harvest of the FCC archive focused on only .txt files, and since in most cases the rulings were published in PDF, WordPerfect, HTML, and .txt formats, we got the vast majority of them.
Nathan has written a description of the process he took in the harvest. However, the exact number of files was not reported, a small oversight, so we still need the exact count. I think he told me on the OntologyStream Groove space, a collaborative environment, that the number was 38,108. So I will use this number until corrected.
So the first test collection is called "The Original 38,108 files".
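Pinning down the exact count should be straightforward once the harvested directory is at hand; a short Python check along these lines would settle it (the directory name here is hypothetical):

    from pathlib import Path

    def count_txt_files(archive_root):
        """Count the harvested .txt files so the collection size can be confirmed."""
        return sum(1 for _ in Path(archive_root).rglob('*.txt'))

    # count_txt_files('fcc_harvest')  # hypothetical path; should confirm (or correct) 38,108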
We expect that two "derived" collections will be created by Wednesday.
1) The Original 38,108 files with an XML metadata header that encodes the list of nouns, noun phrases, verbs and verb phrases contained within clearly defined sentence boundaries.
2) The Sentence List Collection
The sentence list will contain all of the clearly defined sentences in the 38,108 text files, with a metadata tag after each sentence. The tag will not contain the file name but only the data. A sketch of both derived formats follows below.
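Since neither the XML schema for (1) nor the tag layout for (2) has been fixed yet, the following Python sketch only illustrates the two derived forms; every element and tag name in it is a placeholder.

    from xml.sax.saxutils import escape

    def xml_metadata_header(nouns, noun_phrases, verbs, verb_phrases):
        """Placeholder XML header for derived collection (1)."""
        def block(tag, items):
            return "".join(f"  <{tag}>{escape(item)}</{tag}>\n" for item in items)
        return ("<metadata>\n"
                + block("noun", nouns)
                + block("nounPhrase", noun_phrases)
                + block("verb", verbs)
                + block("verbPhrase", verb_phrases)
                + "</metadata>\n")

    def sentence_list(sentences_with_tags):
        """Placeholder layout for derived collection (2): each sentence is followed
        by its metadata tag, which carries no file name."""
        lines = []
        for sentence, tag in sentences_with_tags:
            lines.append(sentence)
            lines.append(f"<meta>{escape(tag)}</meta>")
        return "\n".join(lines)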