A thesaurus is a semantic tool used for information retrieval, query expansion and indexing, among other purposes. It is essentially a selection of the core vocabulary of a domain, supplemented with information about synonyms, homonyms, generic terms, part/whole terms, “associative terms” and other information (e.g. the frequency and history of terms in a given database).
Peter Mark Roget (1779-1869) produced the first edition of Thesaurus of English Words and Phrases (Roget, 1852/1992), which is recognized as the first thesaurus. The structure of this thesaurus was, according to Roget in his introduction, a "verbal classification ... the same as that which is employed in the various departments of natural history".
In the modern sense, the thesaurus is a child of information retrieval and information science. The year 1964 is important in the development of modern thesauri for information retrieval. Two thesauri were published: the "Euratom-Thesaurus", the first published thesaurus to apply the graphical method to display the paradigmatic relations between descriptors, and the "Thesaurus of Engineering Terms", which has been a model for later thesauri. The subsequent development of electronic bibliographic databases made thesauri very popular, and the thesaurus became a common companion to such databases, first in the sciences, then in the social sciences and also, to a certain degree, in the humanities (e.g. in architecture and music).
According to Sparck Jones (1992, p. 1609), the theory of semantic primitives was influential in early thesaurus construction: "A thesaurus was seen as providing a set of domain-independent semantic primitives." According to this theory, every word can be broken up into primitive kernels of meaning, semantemes (also called semantic features or semantic components). Semantemes are terms that are used to explain other terms or concepts but cannot themselves be explained by other terms. The process of breaking words down into semantemes is known as componential analysis and has most often been used to analyze kinship terms across languages. The components are often given in considerable detail.
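Componential analysis can be illustrated with a small sketch. The feature names and values below (sex, generation, lineal) are assumptions chosen for illustration and are not taken from the sources cited; the point is only that each term is described by a bundle of components, and that terms can be compared or selected by those components.

```python
# Hypothetical semantic components (semantemes) for a few English kinship terms.
# The features and their values are illustrative assumptions, not a published analysis.
KINSHIP = {
    "father":   {"sex": "male",   "generation": +1, "lineal": True},
    "mother":   {"sex": "female", "generation": +1, "lineal": True},
    "son":      {"sex": "male",   "generation": -1, "lineal": True},
    "daughter": {"sex": "female", "generation": -1, "lineal": True},
    "uncle":    {"sex": "male",   "generation": +1, "lineal": False},
    "aunt":     {"sex": "female", "generation": +1, "lineal": False},
}


def contrast(term_a, term_b, lexicon=KINSHIP):
    """Return the components on which two terms differ, as {feature: (value_a, value_b)}."""
    a, b = lexicon[term_a], lexicon[term_b]
    return {feature: (a[feature], b[feature]) for feature in a if a[feature] != b[feature]}


def terms_with(lexicon=KINSHIP, **features):
    """Return, in alphabetical order, all terms that carry the given component values."""
    return sorted(
        term
        for term, components in lexicon.items()
        if all(components.get(f) == v for f, v in features.items())
    )


print(contrast("father", "uncle"))            # {'lineal': (True, False)}
print(terms_with(sex="female", lineal=True))  # ['daughter', 'mother']
```

In this toy lexicon "father" and "uncle" differ in a single component, which is the kind of minimal contrast componential analysis is meant to expose.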
“Most thesauri establish a controlled vocabulary, a standardized terminology, in which each concept is represented by one term, a descriptor, that is used in indexing and can thus be used with confidence in searching; in such a system the thesaurus must support the indexer in identifying all descriptors that should be assigned to a document in light of the questions that are likely to be asked. . . .
A good thesaurus provides, through its hierarchy augmented by associative relationships between concepts, a semantic road map for searchers and indexers and anybody else interested in an orderly grasp of a subject field.” (Soergel, 2004).
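The idea of a controlled vocabulary, in which each concept is represented by one descriptor used both in indexing and in searching, can be sketched as follows. The vocabulary, the lead-in terms and the function names are hypothetical and serve only to illustrate the principle.

```python
# Hypothetical controlled vocabulary: descriptors plus lead-in terms that point to them.
# All terms are invented for illustration.
DESCRIPTORS = {"automobiles", "cats", "dogs"}
USE = {
    "cars": "automobiles",
    "motor cars": "automobiles",
    "autos": "automobiles",
    "felines": "cats",
}


def to_descriptor(term):
    """Map a free-text term to its descriptor; return None if the vocabulary lacks it."""
    key = term.strip().lower()
    if key in DESCRIPTORS:
        return key
    return USE.get(key)


def index_document(candidate_terms):
    """Return the sorted descriptors assigned to a document from its candidate terms."""
    found = (to_descriptor(t) for t in candidate_terms)
    return sorted({d for d in found if d is not None})


def search(query_term, index):
    """Return the ids of documents indexed with the descriptor for the query term."""
    descriptor = to_descriptor(query_term)
    return sorted(doc_id for doc_id, descs in index.items() if descriptor in descs)


index = {
    "doc1": index_document(["Cars", "dogs"]),
    "doc2": index_document(["Felines", "unicorns"]),
}
print(index["doc1"])           # ['automobiles', 'dogs']
print(search("autos", index))  # ['doc1']
```

Because indexer and searcher are both routed to the same descriptor, a search for "autos" finds a document whose text said "cars".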
Important semantic relations used in thesauri
Scope note: A definition of the term, or an explanation of its meaning and use in a specific database.
Non-descriptor: A synonym used as a lead-in term to a descriptor.
U: Use. Reference to a descriptor. (Sometimes, e.g. in the INSPEC-thesaurus (1993), termed "lead-in terms" or "cross-references".) The relation between lead-in term and descriptor is one of synonymy.
UF: Used For. Cross-reference from the preferred term to its lead-in terms (synonym relation).
Qualifiers: Device used to distinguish between different meanings of a word (homograph/homonym relationship).
Example: Letters (Alphabet);
BTG: Broader Term Generic
BTP: Broader Term Partitive
NT: Narrower Term. Sub-concept. Again, a distinction may be made between generic and partitive sub-concepts.
NTG: Narrower Term Generic
NTP: Narrower Term Partitive
TT: Top Term. Symbolizes the highest hierarchical level in the thesaurus (generic or partitive relation).
TT: Geographical areas
Rotated index: Alphabetical index in which each word in a phrase is an access point (syntagmatic relations).
Thesaurofacet: Facet applied in a thesaurus (paradigmatic relations).
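Several of the relations listed above can be sketched in code. The example below is a minimal illustration under stated assumptions: the small geographical hierarchy is invented (only the top term "Geographical areas" appears in the list above), the links are treated as partitive broader-term relations, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical partitive hierarchy: each descriptor maps to its broader term (BTP).
# A top term (TT) has no broader term.
BROADER = {
    "Denmark": "Scandinavia",
    "Sweden": "Scandinavia",
    "Scandinavia": "Europe",
    "Europe": "Geographical areas",
    "Geographical areas": None,
}


def narrower(term):
    """Direct narrower terms (NT), derived by inverting the broader-term links."""
    return sorted(t for t, bt in BROADER.items() if bt == term)


def top_term(term):
    """Follow broader-term links upward until the top term (TT) is reached."""
    while BROADER[term] is not None:
        term = BROADER[term]
    return term


def expand_narrower(term):
    """Return the term followed by all narrower terms at every level below it."""
    result = [term]
    for nt in narrower(term):
        result.extend(expand_narrower(nt))
    return result


def rotated_index(phrases):
    """Make every word of every phrase an access point to the phrases containing it."""
    index = defaultdict(list)
    for phrase in phrases:
        for word in phrase.lower().split():
            if phrase not in index[word]:
                index[word].append(phrase)
    return dict(sorted(index.items()))


print(top_term("Denmark"))             # Geographical areas
print(expand_narrower("Scandinavia"))  # ['Scandinavia', 'Denmark', 'Sweden']
print(rotated_index(["Information retrieval", "Information science"]))
```

Storing only the broader-term link and deriving narrower terms from it keeps the two directions of the hierarchy consistent; a real thesaurus would also need to handle terms with more than one broader term.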
"The explosive growth of Web search engines, with their primitive algorithms, has had some rather unfortunate effects, to my mind. Some of these engines appear to have been developed by people who saw a need, but who had not the vaguest idea that there was already a history of development of tools to fulfill similar needs. There is little evidence that some of these developers had ever used either Dialog or a library catalog.
We should distinguish kinds of tools for facilitating access to full text on the basis of the attention they give to semantics. Older, exact-match (Boolean) systems give no attention to semantics. The search terms must appear in the document for it to be retrieved; if a term appears at all the document will be retrieved regardless of whether the term is important to the meaning of the document or not. Another approach relies on statistical information -- co-occurrence of words in the document, frequency, etc. Boolean and statistically-based systems have been found to have comparable retrieval performance, but to produce very different retrieval sets. That is, searches of the same database using a Boolean engine and a statistically-based one often produce about the same number of relevant hits, but there may be little overlap between the two sets of hits." (Milstead, 1998).
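The contrast between exact-match and statistically based retrieval described in the quotation can be made concrete with a toy sketch. The three documents and both functions are hypothetical, and the frequency count stands in for the far richer statistics real systems use.

```python
from collections import Counter

# Hypothetical toy collection, invented for illustration.
DOCS = {
    "d1": "thesaurus construction manual for thesaurus editors",
    "d2": "library catalog with a note on the thesaurus",
    "d3": "search engines and ranking algorithms",
}


def boolean_and(query, docs=DOCS):
    """Exact match: retrieve a document only if every query word occurs in it."""
    words = query.lower().split()
    return sorted(d for d, text in docs.items() if all(w in text.split() for w in words))


def frequency_ranked(query, docs=DOCS):
    """Rank documents by total occurrences of the query words; drop zero scores."""
    words = query.lower().split()
    scores = {}
    for d, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[w] for w in words)
        if score:
            scores[d] = score
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(boolean_and("thesaurus catalog"))       # ['d2']
print(frequency_ranked("thesaurus catalog"))  # ['d1', 'd2']
```

For the same query the two approaches return different sets: the Boolean search excludes d1 because "catalog" is absent, while the frequency-based search retrieves it because "thesaurus" occurs twice.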
“It has come to be self-evident that a classification scheme is an indispensable tool when compiling a thesaurus. When the editor is forced to work solely within an alphabetical list of numerous descriptors, at the level of the individual term, there is a sense of working ‘blind’. In contrast, where a rigorous classification is developed, providing an overall picture of the subject area, the compiler has a better chance of building accurate and meaningful relationships between the terms.” (Aitchison & Dextre Clarke, 2004, p. 10).
Aitchison, J. (1986). A Classification as a Source for a Thesaurus: The Bibliographic Classification of H. E. Bliss as a Source of Thesaurus Terms and Structure. Journal of Documentation, 42(3), 160-181.
Aitchison, J. & Clarke, S. D. (2004). The thesaurus: A historical viewpoint, with a look to the future. Cataloging & Classification Quarterly, 37(3/4), 5-21. Co-published simultaneously as: The thesaurus: review, renaissance, and revision. Ed. by Sandra K. Roe & Alan R. Thomas. New York: Haworth Information Press. (Pp. 5-21).
Aitchison, J.; Gilchrist, A. & Bawden, D. (2002). Thesaurus Construction: a Practical Manual. 4. ed. London: ASLIB.
DIN 1463 (1987). Erstellung und Weiterentwicklung von Thesauri: Einsprachige Thesauri. 2. Ausg. Berlin: Deutsches Institut für Normung e.V. (DIN 1463, Teil 1).
relations in information retrieval. IN: Green, R., Bean, C. A. & Myaeng, S. H. (Eds.), The semantics of relationships: an interdisciplinary perspective. Dordrecht: Kluwer Academic Publishers. (Pp. 143-160).
Foskett, D. J. (1975). Thesaurus. IN: Kent, Allan (ed.): Encyclopedia of Library and Information Science, Vol. 30. New York: Marcel Dekker. (Pp. 416-463).
Gilchrist, A. (2003). Thesauri, taxonomies and ontologies - an etymological note. Journal of Documentation, 59(1), 7-18.
ISO 2788 (1986). Guidelines for the Establishment and Development of Monolingual Thesauri. 2. ed. International Organization for Standardization (ISO). (Also published as a Danish standard: DIS 2788, Retningslinier for opbygning og udvikling af ensprogede tesauruser. Hellerup: Dansk Standardiseringsråd, 1985).
Krooks, D. A. & Lancaster, F. W. (1993). The Evolution of Guidelines for Thesaurus Construction. Libri, 43(4), 326-342.
Miller, U. (2003a). Thesaurus construction. IN: Encyclopedia of Library and Information Science. New York: Marcel Dekker. (Pp. 2800-2810).
Miller, U. (2003b). Thesaurus and New Information Environment. IN: Encyclopedia of Library and Information Science. New York: Marcel Dekker. (Pp. 2811-2819).
Milstead, J. L. (1998). Use of Thesauri in the Full-Text Environment. Based on a paper presented at the 34th Clinic on Library Applications of Data Processing. (Cochrane, Pauline A., and Eric H. Johnson, eds. Visualizing Subject Access for 21st Century Information Resources; Proceedings of the 34th Annual Clinic on Library Applications of Data Processing, March 2-4, 1997. Champaign, IL: Graduate School of Library and Information Science, University of Illinois, 1998. Pp. 28-38.) http://www.bayside-indexing.com/Milstead/useof.htm
Milstead, J. (1995). Invisible thesauri: the year 2000. Online & CDROM Review, 19(2),
Rada, R. (1990). Maintaining Thesauri and Metathesauri. International Classification, 158-164.
Roberts, N. (1984). The Pre-History of the Information Retrieval Thesaurus. Journal of Documentation, 40(4), 271-285.
Roe, S. K. & Thomas, A. R. (Eds.). (2004). The Thesaurus: Review, Renaissance and Revision. New York: Haworth Information Press.
Roget, P. M. (1852/1992). Thesaurus of English words and phrases, classified and arranged so as to facilitate the expression of ideas and assist in literary composition. (Facsimile of the First Edition). London: Bloomsbury Books. Project Gutenberg's version: http://www.gutenberg.org/cache/plucker/10681/10681
Roget, P. M. (n.d.). Peter Roget’s classic structure coupled with Mawson’s modernization. http://www.bartleby.com/110/
Soergel, D. (2004). The Arts and Architecture Thesaurus (AAT). A critical appraisal. http://www.dsoergel.com/cv/B47_long.pdf
Sparck Jones, K. (1992). Thesaurus. IN: Encyclopedia of Artificial Intelligence. Vol. I-II. Ed. by Stuart C. Shapiro. New York: John Wiley & Sons. (Vol. 2, pp. 1605-1613).
Van Slype, G. (1976). Definition of the Essential Characteristics of Thesauri. Brussels: Bureau Marcel van Dijk.
Will, L. (2006). Glossary of terms relating to thesauri and other forms of structured vocabulary for information retrieval. http://www.willpowerinfo.co.uk/glossary.htm
HILT - High-Level Thesaurus. A-Z of thesauri. http://hilt.cdlr.strath.ac.uk/Sources/thesauri.html
See also: Metathesaurus; Search thesaurus; Thesaurofacet
Last edited: 10-08-2007