A thesaurus is a semantic tool used for information retrieval, query expansion and indexing, among other purposes. It is essentially a selection of the basic vocabulary in a domain, supplemented with information about synonyms, homonyms, generic terms, part/whole terms, “associative terms” and other information (e.g. frequency and history of terms in a given database).

Peter Mark Roget (1779-1869) produced the first edition of Thesaurus of English Words and Phrases (Roget, 1852/1992), which is recognized as the first thesaurus. The structure of this thesaurus was, according to Roget in his introduction, a "verbal classification ... the same as that which is employed in the various departments of natural history".

In its modern sense, the thesaurus is a child of information retrieval and information science. The year 1964 is important in the development of modern thesauri for information retrieval. Two thesauri were published: the "Euratom-Thesaurus", the first published thesaurus to apply the graphical method to display the paradigmatic relation between descriptors, and the "Thesaurus of Engineering Terms", which has been a model for later thesauri. The subsequent development of electronic bibliographic databases made thesauri very popular, and the thesaurus became a common companion to such databases, first in the sciences, then in the social sciences and, to a certain degree, in the humanities (e.g. in architecture and music).

According to Sparck Jones (1992, p. 1609), the theory of semantic primitives was influential in early thesaurus construction: "A thesaurus was seen as providing a set of domain-independent semantic primitives." According to this theory, every word can be broken up into primitive kernels of meaning, semantemes (also called semantic features or semantic components). Semantemes are terms that are used to explain other terms or concepts but cannot themselves be explained by other terms. The process of breaking words down into semantemes is known as componential analysis and has most often been used to analyze kinship terms across languages. The components are often given in considerable detail.
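
Componential analysis of kinship terms can be illustrated with a small sketch. The three components used here (sex, generation, lineality) and the feature values are a common textbook simplification chosen for illustration; they are assumptions, not taken from Sparck Jones or from any particular thesaurus.

```python
from typing import NamedTuple


class Kin(NamedTuple):
    """Illustrative semantic components for an English kinship term."""
    sex: str         # "male" or "female"
    generation: int  # +1 = parents' generation, -1 = children's generation
    lineal: bool     # True for direct ancestors or descendants


# Hypothetical feature assignments, for illustration only.
KINSHIP = {
    "father": Kin("male", 1, True),
    "mother": Kin("female", 1, True),
    "son": Kin("male", -1, True),
    "daughter": Kin("female", -1, True),
    "uncle": Kin("male", 1, False),
    "aunt": Kin("female", 1, False),
}


def differing_components(a, b):
    """Return the names of the components on which two terms differ."""
    return [f for f in Kin._fields
            if getattr(KINSHIP[a], f) != getattr(KINSHIP[b], f)]


print(differing_components("father", "mother"))  # ['sex']
print(differing_components("father", "uncle"))   # ['lineal']
```

In this sketch two terms are distinguished only by the components on which they differ, which is the basic idea behind describing vocabulary through a small set of primitives.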



“Most thesauri establish a controlled vocabulary, a standardized terminology, in which each concept is represented by one term, a descriptor, that is used in indexing and can thus be used with confidence in searching; in such a system the thesaurus must support the indexer in identifying all descriptors that should be assigned to a document in light of the questions that are likely to be asked. . . .

A good thesaurus provides, through its hierarchy augmented by associative relationships between concepts, a semantic road map for searchers and indexers and anybody else interested in an orderly grasp of a subject field.”  (Soergel, 2004).
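
The idea that each concept is represented by exactly one descriptor can be sketched as a simple lookup from synonyms to descriptors. The vocabulary below is hypothetical, and the function names `to_descriptor` and `used_for` are illustrative choices, not part of any thesaurus standard or software.

```python
# Hypothetical synonyms (non-descriptors), each pointing to its one descriptor.
USE = {
    "cars": "automobiles",
    "motorcars": "automobiles",
    "bikes": "bicycles",
}
DESCRIPTORS = set(USE.values())


def to_descriptor(term):
    """Return the descriptor for a term, or None if the term is unknown."""
    key = term.strip().lower()
    if key in DESCRIPTORS:
        return key
    return USE.get(key)


def used_for(descriptor):
    """Return the synonyms that lead to a descriptor, in alphabetical order."""
    return sorted(t for t, d in USE.items() if d == descriptor)


print(to_descriptor("Cars"))    # automobiles
print(used_for("automobiles"))  # ['cars', 'motorcars']
```

Because indexer and searcher both pass their terms through the same mapping, a document indexed under "automobiles" is found whether the searcher types "cars" or "motorcars".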


Important semantic relations used in thesauri

"Scope note"           A definition of the term or an explanation of the meaning of the term and its use in a specific database.


Non-descriptor        A synonym used as a lead-in term to a descriptor.


U: Use                   Reference to a descriptor. (Sometimes, e.g. INSPEC-thesaurus (1993), termed "lead-in terms" or "cross-references"). The relation between lead-in term and descriptor is one of synonymy.


UF: Used For           Cross-reference from the "preferred term" to its "lead-in" terms (synonym relation).



Qualifiers:              Device used to distinguish between different meanings of a word (homograph/homonym relationship).

                         Example:    Letters (Alphabet);
                                     Letters (Correspondence)


BT: Broader term    Sometimes a distinction is made between “generic broader terms” and “partitive broader terms”:

                   BTG: Broader Term Generic
                        Example:    Lion
                                        BTG: Mammals

                   BTP: Broader Term Partitive
                        Example:    Zealand
                                        BTP: Denmark


NT: Narrower term Sub-concept. Again a distinction may be made between generic and partitive sub-concepts.

                   NTG: Narrower Term Generic
                        Example:    Mammals
                                        NTG: Lions

                   NTP: Narrower Term Partitive
                        Example:    Denmark
                                        NTP: Zealand


RT: Related term  Relations other than generic/partitive and synonym/homonym relations.


TT: Top term       Symbolizes the highest hierarchical level in the thesaurus (generic or partitive relation).

                        Example:    Zealand
                                        BTP: Denmark
                                        TT: Geographical areas

Rotated index      Alphabetical index in which each word in a phrase is an access point (syntagmatic relations).

Thesaurofacet      Facet applied in a thesaurus. (Paradigmatic relations).
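
The hierarchical relations and the rotated index listed above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class `Thesaurus`, its methods, and the function `rotated_index` are hypothetical names, and the direct link from Denmark to "Geographical areas" is assumed for the example (the source only gives "Geographical areas" as the top term for Zealand).

```python
from collections import defaultdict


class Thesaurus:
    """Stores broader/narrower links; kind is "G" (generic) or "P" (partitive)."""

    def __init__(self):
        self.broader = defaultdict(dict)   # term -> {broader term: kind}
        self.narrower = defaultdict(dict)  # term -> {narrower term: kind}

    def add_broader(self, term, broader, kind):
        """Record BTG/BTP for `term` and the reciprocal NTG/NTP for `broader`."""
        self.broader[term][broader] = kind
        self.narrower[broader][term] = kind

    def entry(self, term):
        """Return display lines for one term, e.g. 'BTP: Denmark'."""
        lines = [f"BT{k}: {t}"
                 for t, k in sorted(self.broader.get(term, {}).items())]
        lines += [f"NT{k}: {t}"
                  for t, k in sorted(self.narrower.get(term, {}).items())]
        return lines

    def top_terms(self, term):
        """Follow broader-term links upward until terms without a BT remain."""
        tops, stack, seen = set(), [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            parents = self.broader.get(current, {})
            if parents:
                stack.extend(parents)
            else:
                tops.add(current)
        return sorted(tops)


def rotated_index(terms):
    """Make every word of every term an access point: (word, full term)."""
    return sorted((word.lower(), term) for term in terms for word in term.split())


t = Thesaurus()
t.add_broader("Lion", "Mammals", "G")
t.add_broader("Zealand", "Denmark", "P")
t.add_broader("Denmark", "Geographical areas", "P")  # assumed link

print(t.entry("Denmark"))      # ['BTP: Geographical areas', 'NTP: Zealand']
print(t.top_terms("Zealand"))  # ['Geographical areas']
print(rotated_index(["Broader term", "Narrower term"]))
```

Storing each link once and deriving the reciprocal keeps BT and NT entries consistent, which is one common way to maintain such relations; real thesaurus software may organize this differently.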


"The explosive growth of Web search engines, with their primitive algorithms, has had some rather unfortunate effects, to my mind. Some of these engines appear to have been developed by people who saw a need, but who had not the vaguest idea that there was already a history of development of tools to fulfill similar needs. There is little evidence that some of these developers had ever used either Dialog or a library catalog.

We should distinguish kinds of tools for facilitating access to full text on the basis of the attention they give to semantics. Older, exact-match (Boolean) systems give no attention to semantics. The search terms must appear in the document for it to be retrieved; if a term appears at all the document will be retrieved regardless of whether the term is important to the meaning of the document or not. Another approach relies on statistical information -- co-occurrence of words in the document, frequency, etc. Boolean and statistically-based systems have been found to have comparable retrieval performance, but to produce very different retrieval sets. That is, searches of the same database using a Boolean engine and a statistically-based one often produce about the same number of relevant hits, but there may be little overlap between the two sets of hits." (Milstead, 1998).
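
The contrast between exact-match and statistical retrieval can be illustrated with a toy example. The three documents are invented, and ranking by raw term frequency is only a simple stand-in for the statistical approaches mentioned in the quotation, not a description of any actual search engine.

```python
from collections import Counter

# Hypothetical miniature document collection.
DOCS = {
    "d1": "thesaurus construction and thesaurus maintenance",
    "d2": "a library catalog may mention a thesaurus once",
    "d3": "web search engines",
}


def tokenize(text):
    """Lowercase a text and split it on whitespace."""
    return text.lower().split()


def boolean_and(query, docs):
    """Exact-match retrieval: keep documents that contain every query term."""
    wanted = set(tokenize(query))
    return sorted(d for d, text in docs.items()
                  if wanted <= set(tokenize(text)))


def rank_by_frequency(query, docs):
    """Rank documents by total query-term frequency, dropping zero scores."""
    terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


print(boolean_and("thesaurus catalog", DOCS))        # ['d2']
print(rank_by_frequency("thesaurus catalog", DOCS))  # ['d1', 'd2']
```

For the same query the Boolean search returns only the document containing both words, while the frequency ranking also returns a document that uses one of the words repeatedly, so the two result sets differ.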



“It has come to be self-evident that a classification scheme is an indispensable tool when compiling a thesaurus. When the editor is forced to work solely within an alphabetical list of numerous descriptors, at the level of the individual term, there is a sense of working “blind”. In contrast, where a rigorous classification is developed, providing an overall picture of the subject area, the compiler has a better chance of building accurate and meaningful relationships between the terms.” (Aitchison & Dextre Clarke, 2004, p. 10).






Birger Hjørland

Last edited: 10-08-2007