Controlled vocabulary

A controlled vocabulary is a list of terms or other symbols used in indexing. Terms not belonging to a controlled vocabulary are called free text terms, natural language terms, and sometimes keywords.

The idea of a controlled vocabulary is to reduce the variability of the expressions used to characterize the document being indexed, e.g. by avoiding synonyms and removing ambiguity (homonyms). Lists of subject headings and thesauri are typical examples of controlled vocabularies. In principle, only terms from the controlled vocabulary may be used in the indexing process. If a relevant term is missing from the controlled vocabulary, the indexer may suggest that the term be added to the list (in major systems the term must be formally accepted before it may be used). Many databases have fields for indexing terms from both controlled vocabularies (“descriptors”) and non-controlled vocabulary (“identifiers”). In principle, classification codes are also to be regarded as a form of controlled vocabulary.
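The indexing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authority list, its entries, and the function name `index_document` are hypothetical and do not describe any real database. Candidate terms that map to an authorized term go into the descriptor field; all other terms go into the identifier field, where they could also serve as suggestions for the maintainers of the vocabulary.

```python
# Hypothetical authority list: variant term -> authorized (preferred) term.
# The entries are illustrative assumptions, not taken from a real thesaurus.
AUTHORITY = {
    "automobiles": "automobiles",
    "car": "automobiles",
    "cars": "automobiles",
    "motorcar": "automobiles",
}


def index_document(candidate_terms):
    """Split candidate terms into controlled descriptors and free identifiers."""
    descriptors = []
    identifiers = []
    for term in candidate_terms:
        key = term.strip().lower()
        preferred = AUTHORITY.get(key)
        if preferred is None:
            # Not in the controlled vocabulary: keep it as a free text term.
            if key not in identifiers:
                identifiers.append(key)
        elif preferred not in descriptors:
            # Synonyms collapse into one authorized term.
            descriptors.append(preferred)
    return {"descriptors": descriptors, "identifiers": identifiers}


record = index_document(["Cars", "Motorcar", "electric vehicles"])
print(record)
# {'descriptors': ['automobiles'], 'identifiers': ['electric vehicles']}
```

Here two synonyms are reduced to a single descriptor, while the unauthorized term remains searchable only as an identifier.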

"A vocabulary is said to be controlled if it consists of a restricted subset of possible terms. Such a subset, in that it contains only those terms “authorized” for use, is sometimes called an authority list [cf., authority file]. In addition to terminological restriction, most CVs articulate semantic relationships between terms in the vocabulary, the most common of these being the inclusion of hierarchical relationship." (Svenonius, 2003, p. 822).

In PsycINFO, “occupational stress” is a descriptor and thus a term from a controlled vocabulary. When a book titled “Burnout” is indexed, the word “burnout” must therefore not be used in the descriptor field; the book is assigned the descriptor “occupational stress” instead. This gives searches on occupational stress higher recall, but for users who differentiate between burnout and (other forms of) occupational stress, precision is lower. Such users might supplement the descriptor search with free text searching in the title and identifier fields, and perhaps in the abstracts or full text.
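The trade-off described above can be made concrete with a toy calculation. The sketch below is hypothetical throughout: the four records, the relevance judgments of a user interested specifically in burnout, and the function names are invented for illustration and are not PsycINFO data. Recall is the share of relevant documents that are retrieved; precision is the share of retrieved documents that are relevant.

```python
# Hypothetical records; every one carries the descriptor "occupational stress".
RECORDS = [
    {"id": 1, "title": "Burnout", "descriptors": ["occupational stress"]},
    {"id": 2, "title": "Burnout among nurses", "descriptors": ["occupational stress"]},
    {"id": 3, "title": "Noise and stress in factories", "descriptors": ["occupational stress"]},
    {"id": 4, "title": "Emotional exhaustion in teachers", "descriptors": ["occupational stress"]},
]

# Assumed judgments of a user who wants literature on burnout only.
RELEVANT = {1, 2, 4}


def search_descriptor(term):
    """Retrieve the ids of records indexed with the given descriptor."""
    return {r["id"] for r in RECORDS if term in r["descriptors"]}


def search_title(word):
    """Retrieve the ids of records whose title contains the given word."""
    return {r["id"] for r in RECORDS if word in r["title"].lower().split()}


def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved ids."""
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


controlled = recall_precision(search_descriptor("occupational stress"), RELEVANT)
free_text = recall_precision(search_title("burnout"), RELEVANT)
print(controlled)                          # (1.0, 0.75)
print(tuple(round(x, 2) for x in free_text))  # (0.67, 1.0)
```

In this invented collection the descriptor search finds every relevant record but also one about factory noise, while the title search is precise but misses the record that discusses burnout without using the word.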

Cleverdon (1984, p. 39) reports the data shown in the table below (originally from Cleverdon, 1970):

Index language	Recall (%)	Precision (%)
Natural language	74	64
Controlled language	58	59

Cleverdon is skeptical of the use of controlled vocabularies and writes (1984, p. 39):

"The problems caused by the use of a controlled language thesaurus and variations in indexing can be overcome by eliminating these two activities and using, as input, an extract such as title and abstract, in natural (or free-text) language. Basically, a controlled index language represents a reduction in the totality of the potentially available terms in the given subject area. The consolidation of terms that can occur ranges from the compounding of real synonyms or spelling variations, to the more complex situation, where, based on subjective decisions, one or more specific terms are subsumed to a general term. This contraction of the potential total of index terms is intended to improve the recall ratio, but inevitably has the effect of lowering the precision ratio. In a printed index, the use of a controlled index language can be justified because it facilitates human searching by reducing the problem of looking under several different but related entries. The flexibility of a computer makes this a trivial consideration and such combining of terms as may, in a given search, be considered necessary is better done at the search stage than at the input. This appears to be one of the reasons why, in every test which has compared the performance of searching on controlled-language index terms as against searching on abstracts in natural language, the results have been in favour of natural language". (Cleverdon, 1984, p. 39)

"Ignoring the completely changed conditions under which information retrieval activities are now taking place, forgetting all the accumulated evidence and test data, and acting as if we were stuck in the nineteenth century with controlled vocabularies, thesaurus control, and all the attendant miseries, will surely not contribute to a proper understanding and appreciation of the modern information science field." (Salton, 1996, p. 333).

In controlled vocabularies it is the designer of the vocabulary (the librarian or information specialist) who selects and controls the terms. The literature being indexed may, however, itself be considered more or less "controlled" by the language for special purposes in which the text is written. Myers' (1990, p. 275) comparison of professional and popular titles illustrates this point.

Natural languages are seen as the opposite of controlled languages. However, natural languages differ in this respect. For example, Danish is much more controlled than Norwegian in that it allows much less freedom in, for instance, spelling.

Controlled vocabularies have mostly been seen as neutral tools, but it is important, as Fast; Leise & Steckel (2002) do, to consider them as an interpretative layer between a text and a user:

“A controlled vocabulary is a way to insert an interpretive layer of semantics between the term entered by the user and the underlying database to better represent the original intention of the terms of the user. Consider what happens when you do not use a controlled vocabulary. An uncontrolled vocabulary simply uses the natural language of the documents and matches that with the natural language of the user. This is extremely specific, and it gives the user exactly what they ask for. Sounds great right? Consider, however, a site about chemistry, where many of the documents use the chemical name of the element (“iron”), and many use the chemical symbol of the element (“Fe”). Using an uncontrolled vocabulary, the results will only include the terms entered by the user. If the user entered “Fe” in the search box, he will not get any of the results for documents that use the term “iron.” There is a good chance the user is missing some documents he would like to have. Very few users will enter both terms, and many will be reviewing their results thinking they are seeing the results from all relevant documents.” (Fast; Leise & Steckel, 2002).
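The “Fe” and “iron” case in the quotation can be sketched as a small search routine. Everything here is an assumption made for illustration: the equivalence sets, the three documents, and the function names `expand` and `search` are hypothetical, and real systems implement this layer in many different ways. The sketch matches whole words only and shows how placing a vocabulary between the user's term and the documents changes what is retrieved.

```python
# Hypothetical equivalence sets: each set groups terms treated as one concept.
EQUIVALENTS = [{"iron", "fe"}, {"gold", "au"}]

# Hypothetical documents of a chemistry site.
DOCUMENTS = {
    "doc1": "Corrosion of iron in sea water",
    "doc2": "Fe content of spinach",
    "doc3": "Properties of gold alloys",
}


def expand(term):
    """Return all terms equivalent to the one entered by the user."""
    key = term.lower()
    for group in EQUIVALENTS:
        if key in group:
            return set(group)
    return {key}


def search(term, use_vocabulary):
    """Return ids of documents containing the term (or any equivalent term)."""
    wanted = expand(term) if use_vocabulary else {term.lower()}
    results = []
    for doc_id, text in DOCUMENTS.items():
        words = set(text.lower().split())
        if words & wanted:
            results.append(doc_id)
    return results


print(search("Fe", use_vocabulary=False))  # ['doc2']
print(search("Fe", use_vocabulary=True))   # ['doc1', 'doc2']
```

Without the vocabulary the user who enters “Fe” sees only the document that uses the symbol; with it, the document that uses the name “iron” is retrieved as well.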

Fast; Leise & Steckel (2002) do not, however, exemplify how this interpretative layer represents a dilemma for the user. To consider this, the above-mentioned example from PsycINFO (burnout) might be useful. To understand controlled vocabularies as interpretative layers that have consequences for users is to avoid a positivist understanding and to approach a pragmatic one.

The most ambitious attempt to establish a controlled language is probably the Unified Medical Language System (UMLS) at the US National Library of Medicine (cf., metathesauri).

"Collaborative tagging has emerged as a means of organising information resources on the Web and is contradictory to the ethos of controlled vocabularies". (Macgregor & McCulloch, 2006).

Calkins, M. L. (1980). Free text or controlled vocabulary - A case history step-by-step analysis ... plus other aspects of search strategy. Database-The Magazine of Database Reference and Review, 3(2), 53-67.

Chamis, A. Y. (1991). Vocabulary Control and Search Strategies in Online Searching. Westport, Connecticut: Greenwood Press.

Cleverdon, C. W. (1970). Effects of variations in relevance assessments in comparative experimental tests of index languages. Cranfield Library Report, 3.

Cleverdon, C. (1984). Optimizing convenient online access to bibliographic databases. Information Services and Use, 4(1), 37-47. Also reprinted in Willett, P. (ed). (1988). Document Retrieval Systems. Taylor Graham, London, UK., 32–41.

Dubois, C. P. R. (1987). Freetext vs controlled vocabulary – a reassessment. Online Review, 11(4), 243-253.

Fidel, R. (1991). Searchers' selection of search keys: II. Controlled vocabulary or free-text searching. Journal of the American Society for Information Science, 42(7), 501-514.

Fidel, R. (1992). Who needs controlled vocabulary. Special Libraries, 83(1), 1-9.

Lancaster, F. W. (1977). Vocabulary control in information retrieval systems. Advances in Librarianship, 7, 2-40.

Lancaster, F. W. (1986). Vocabulary Control for Information Retrieval. 2nd ed. Info Resources Press.

Lancaster, F. W. (1989). Natural language versus controlled language; a thirty year review of the literature of information science. In: Perspectives in information management 1. Ed. by C. Oppenheim et al. London: Butterworth.

Markey, K.; Atherton, P. & Newton, C. (1980). An analysis of controlled vocabulary and free text search statements in online searches. Online Review, 4(3), 225-236.

Myers, G. (1990). Writing Biology: Texts in the Social Construction of Scientific Knowledge. Madison, WI: University of Wisconsin Press.

Muddamalle, M. R. (1998). Natural language versus controlled vocabulary in information retrieval: A case study in soil mechanics. Journal of the American Society for Information Science, 49(10), 881-887.

Qin, J. (2000). Semantic similarities between a keyword database and a controlled vocabulary database: An investigation in the antibiotic resistance literature. Journal of the American Society for Information Science, 51(2), 166-180.

Rowley, J. (1994). The controlled versus natural indexing languages debate revisited: A perspective on information-retrieval practice and research. Journal of Information Science, 20(2), 108-119.

Salton, G. (1996). A New Horizon for Information Science. Journal of the American Society for Information Science, 47(4), 333. (Letter to the editor).

Savoy, J. (2004). Bibliographic database access using free-text and controlled vocabulary: an evaluation. Information Processing & Management, 41, 873–890.

Svenonius, E. (1986). Unanswered Questions in the Design of Controlled Vocabularies. Journal of the American Society for Information Science, 37(5), 331-340.

Svenonius, E. (1989/2003). Design of controlled vocabularies. Encyclopedia of Library and Information Science. 2nd ed. New York: Marcel Dekker, 2003 (pp. 822-838). (Reprinted from the first edition, 1989).

Table: Comparison of professional and popular titles (based on Myers, 1990, p. 275)