MACCI stands for Multi-Agent Collaboration for Classification of Information. The official site of MACCI is at: http://lair.ils.unc.edu/macci/.
LUCAS stands for Library of User-Oriented Concepts for Access Services. In Lucas II, we developed and deployed three web service methods (operations): term extraction, text classification, and clustering. The system consists of three main components: the Lucas II web services, which are deployed on Tomcat and Apache Axis; the client, which passes the user-selected parameters to the web services and receives the results over SOAP; and data access modules for retrieving domain terms and document collections.
→ Details
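To illustrate the client side of the Lucas II exchange, here is a minimal sketch of how a SOAP 1.1 request for one of the three operations might be assembled. The operation name `extractTerms`, its parameters, and the namespace `urn:example:lucas` are hypothetical placeholders, not the actual Lucas II interface; the real names would come from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:lucas"  # hypothetical; the real namespace is defined by the WSDL


def build_request(operation, params):
    """Return a SOAP 1.1 envelope (as a string) that calls `operation` with `params`."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    # Each user-selected parameter becomes a child element of the operation element.
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter names, for illustration only.
request_xml = build_request("extractTerms", {"text": "heart failure in adults", "maxTerms": 5})
```

A client would POST this envelope to the service endpoint and parse the response body in the same way; the sketch stops short of the network call because the endpoint details are not given here.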
Among existing implementations of various browsing methods, Scatter/Gather browsing is well known for its ease of use and its effectiveness in situations where it is difficult to specify a query precisely (Cutting, Karger, Pedersen, and Tukey 1992; Hearst and Pedersen 1996). This project aims to implement a Scatter/Gather browser, a dynamic visualization for text navigation and search. Using visualization techniques, the browser will help users refine their search queries and narrow down search results interactively and visually. We will constrain ourselves to a small text corpus as a proof of concept, and we will modularize the browser so that it can be attached to any text collection in the future.
→ Details
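The Scatter/Gather loop can be sketched in a few lines: scatter the collection into clusters, let the user gather the clusters of interest into a smaller collection, then scatter again. The sketch below is a simplification under stated assumptions: it seeds clusters with the first k documents and assigns each document to its most similar seed by cosine similarity over raw term counts, rather than using the clustering algorithms of the original Scatter/Gather papers. The function names and sample documents are illustrative only.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def scatter(docs, k):
    """Group docs into at most k clusters around the first k documents (naive seeding)."""
    vectors = [vectorize(d) for d in docs]
    seeds = vectors[:k]
    clusters = [[] for _ in seeds]
    for doc, vec in zip(docs, vectors):
        best = max(range(len(seeds)), key=lambda i: cosine(vec, seeds[i]))
        clusters[best].append(doc)
    return clusters


def gather(clusters, chosen):
    """Merge the clusters the user selected into a new, smaller collection."""
    return [doc for i in chosen for doc in clusters[i]]


docs = [
    "heart attack treatment",
    "gene expression analysis",
    "heart surgery outcomes",
    "gene sequencing methods",
]
clusters = scatter(docs, 2)      # scatter: two topical groups
subset = gather(clusters, [0])   # gather: the user keeps the first group
```

Calling `scatter(subset, k)` again would re-cluster the narrowed collection, which is the step a visual browser would repeat on each user selection.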
MedSifter, or Medical Sifter, stands for Smart Information Filtering Technology for Electronic Resources. Here is the MedSifter demo: http://lair.ils.unc.edu:8080/medsifter/demo.html.
The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The National Library of Medicine has agreed to make the MEDLINE references in the test database available for experimentation, subject to the following conditions:
1. The data will not be used in any non-experimental clinical, library, or other setting.
2. Any human users of the data will explicitly be told that the data is incomplete and out-of-date.
The OHSUMED document collection was obtained by William Hersh (hersh@OHSU.EDU) and colleagues.
This is the document collection (including documents, topics, and relevance judgements) used in the TREC-9 Filtering Track.
The document collection for the TREC 2006 Genomics Track consists of full-text HTML documents from 49 journals that publish electronically via Highwire Press and that granted permission for research use of their articles in the Genomics Track. The track continues in 2007. The documents were obtained by a Web crawl of the Highwire site, with postprocessing to eliminate as much non-article material as we could. This material should be used only for research purposes and should not be posted on public Web sites.
The TREC 2007 Genomics Track protocol is available at http://ir.ohsu.edu/genomics/2007protocol.html.
The documents are in the directory. The data files themselves are located in the active user portion of the track Web site at http://ir.ohsu.edu/genomics/data/. This area is password-protected, with the password only available to those who have completed data usage agreements and/or are registered for TREC.
The GEDData repository has a number of datasets for bio-medical data analysis and gene expression applications, provided by the Institute for Infocomm Research in Singapore. The data collections are offered in three formats: C4.5 format, Weka's ARFF format, and comma-separated values (CSV) format.
The data repository is available at http://sdmc.lit.org.sg/GEDatasets/.
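To show how the CSV and ARFF formats relate, the following is a minimal sketch that converts CSV text into ARFF text. It assumes a header row, numeric feature columns, a nominal class label in the last column, and attribute names without spaces; the sample rows are invented for illustration and are not taken from the repository.

```python
import csv
import io


def csv_to_arff(csv_text, relation="dataset"):
    """Convert CSV text (header row, class label in the last column) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # The class attribute is declared as a nominal set of the labels seen in the data.
    labels = sorted({row[-1] for row in data})
    lines = [f"@relation {relation}", ""]
    for name in header[:-1]:
        # Assumption: every non-class column is numeric and its name needs no quoting.
        lines.append(f"@attribute {name} numeric")
    lines.append(f"@attribute {header[-1]} {{{','.join(labels)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Invented toy data, for illustration only.
sample = "gene1,gene2,class\n0.12,1.5,tumor\n0.80,0.3,normal\n"
arff_text = csv_to_arff(sample, relation="toy_expression")
```

The ARFF header carries the type information that CSV leaves implicit, which is why the conversion has to assume column types rather than read them.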