Indexing theory

Indexing depends both on the document to be indexed and on the indexer performing the process under specific conditions in a specific environment. Different documents are, of course, indexed differently by the same indexer. If they were not, the index would be non-discriminative and totally useless. Any theory of indexing has to deal with this fact and thus with how a document's attributes or properties should influence its representation.


The same document may be indexed differently by different indexers, by the same indexer at different times, by different indexing systems, in different libraries, for different target groups or for different ideal purposes (see Consistency in Knowledge Organization; Request oriented indexing).


The indexing is close to the document if it is constructed from a set of terms selected mechanically from the document (e.g. from titles, references or full text). This is the objective pole, because the document is the object of the indexing process. The rhetorical view of indexing (Andersen 2004) is also close to the objective pole, emphasizing what the author of the document is arguing.


The subjective pole of indexing theory emphasizes that the same document may be seen differently by different people or systems, and that the indexing should not aim at a purely objective representation but should also consider, for example, the collection to which the document belongs or the tasks for which the indexing is made. Automatic indexing usually represents the terms of a document relative to the terms' frequency in a collection of documents. In this way the representation is not just a function of the document itself, but also a function of a collection. Another example is that the same book may be indexed differently for a library of gender studies than for a library of historical studies. The indexing still has to be loyal to the document being indexed, but different aspects of the document may be emphasized, and the subject may be expressed in different controlled vocabularies constructed to support either collection.
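
The collection-relative weighting mentioned above can be sketched as follows. This is only a minimal illustration, assuming a plain TF-IDF scheme (term frequency multiplied by the logarithm of the inverse document frequency). The function name `tfidf_weights`, the toy collections and the whitespace tokenization are hypothetical, and actual automatic indexing systems differ in their weighting details.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, collection):
    """Weight each term of one document relative to a collection.

    doc_tokens: list of terms in the document being indexed.
    collection: list of token lists, one per document in the collection.
    """
    n_docs = len(collection)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    weights = {}
    for term, count in tf.items():
        # A term found in every document gets weight 0 (log 1 = 0).
        # A term absent from the collection also gets 0 here,
        # which is an arbitrary choice made for this sketch.
        idf = math.log(n_docs / df[term]) if df[term] else 0.0
        weights[term] = count * idf
    return weights


# The same document weighted against two hypothetical collections.
doc = "women labour history".split()
gender_library = [doc, "women gender theory".split(), "women gender politics".split()]
history_library = [doc, "medieval history sources".split(), "history of trade".split()]

in_gender = tfidf_weights(doc, gender_library)
in_history = tfidf_weights(doc, history_library)
```

In the toy gender studies collection the term "women" occurs in every document and therefore receives no weight, whereas in the toy history collection the same holds for "history". The representation of the identical document thus changes with the collection.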


The importance of indexing documents specifically for a given discipline, task or point of view may be illustrated by an example from the Royal Library in Copenhagen. First, the practice in this library is that a given book is circulated to different subject bibliographers. Each subject bibliographer then decides whether or not the book is relevant to his or her discipline. If it is relevant, it is indexed within that discipline. In this way a given document may be indexed from multiple points of view in the same catalog. Second, around 1972 a staff member, Nynne Koch, began to collect printed catalog cards which she regarded as important to a new field, which she defined and termed "feminology". This initiative later developed into an important independent library and research center, "KVINFO". The important point in relation to indexing theory is that this new library was started not with a special collection of books, but with a new way of indexing books belonging to other disciplines. This example demonstrates the importance of the subjectivity of indexing: indexing must be regarded in relation to the aim of the indexing system.


Indexing should not, of course, aim at an idiosyncratic understanding held by the individual indexer. It is not his or her special interests or points of view that should be emphasized. An indexer works in order to accomplish a goal which is implicit or explicit in a given library or information system. It is this goal, not the individual indexer's goal, which should form the basis for the indexing. This insight has led to an ideal of inter-indexer consistency. However, as pointed out by Cooper (1969), indexing may be consistently wrong, which is why studies of inter-indexer consistency do not necessarily provide a basis for judging indexing quality.
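
Inter-indexer consistency is typically quantified as the overlap between the term sets two indexers assign to the same document. The following is a minimal sketch, assuming a Jaccard-style measure (shared terms divided by all distinct terms used by either indexer); the function name `consistency` and the example terms are hypothetical and are not taken from any of the studies cited here.

```python
def consistency(terms_a, terms_b):
    """Share of index terms two indexers have in common (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        # Two empty term sets are treated as fully consistent.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical term assignments for the same document.
indexer_1 = {"psychoanalysis", "transference", "dreams"}
indexer_2 = {"psychoanalysis", "dreams", "repression"}

# Two shared terms out of four distinct terms.
score = consistency(indexer_1, indexer_2)
```

Note that such a score only measures agreement. In line with Cooper's point, two indexers who assign the same unsuitable terms would still score 1.0.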


The following quote demonstrates how difficult indexing often is:


“Anybody who has ever tried to index a psychoanalytic article or book knows how difficult it is to find the terms that accurately answer to our abstract vocabulary. And anybody who has tried to trace the definition or description of a psychoanalytic term or concept does not need to be told how difficult that can be. These problems led the Indexing Study Group of the American Psychoanalytic Association to an experiment. About a dozen seasoned analysts independently indexed a passage from the Standard Edition [of Sigmund Freud]. When they compared what they had done, all agreed that the failure to agree about which terms to index was humbling and impressive. The group did not even agree on which words required see or see also directives, or on the words that should follow those directives.” (Klumpner, 1993, p. 1)


While we know that indexers often disagree, we know very little about why they disagree and whether a discussion between them could produce some kind of consensus (or at least reveal systematic patterns in their disagreements). We have many quantitative studies measuring degrees of disagreement, but almost no qualitative studies discussing the nature of the disagreements. O'Connor (1967, 1969) demonstrated how relevance disagreements could be resolved by discussion with a colleague. This might also be the case with disagreements in indexing: we simply lack studies of this kind to inform us. Systematic patterns in disagreements among competent indexers are probably mostly related to different theoretical understandings. This is indirectly confirmed by citation studies (cf. Hjørland, 2002). Concerning indexing done by people without proper subject knowledge, the problem may be that indexers make descriptions that are too broad, so that users are overloaded with references without being able to make the necessary discriminations.


It is difficult to find comprehensive overviews and discussions of indexing theories in the literature. Andersen (2004) should be praised for providing a broad overview of these, which are presented and discussed in chapter 7 of his dissertation. He uses the following systematization of the theories:


7.3.1 The aboutness concept

Authors discussed: Fairthorne (1969), Maron (1977), Swift, Winn & Bramer (1977), Hutchins (1978)

7.3.2 The concept of subject and subject analysis

Authors discussed: Wilson (1968), Hjørland (1992, 1997), Langridge (1989), Fugmann (1993)

7.3.3 Request, user and cognitive-oriented indexing

Authors discussed: Soergel (1985), Fidel (1994), Pejtersen (1979, 1980, 1994), Pejtersen & Austin (1983, 1984), Farrow (1991, 1994, 1995)

7.3.4 Meaning, language and interpretation [and epistemology, cf. p. 153] [rhetorical view of indexing]

Authors discussed: Blair (1990, 1992, 2003), Frohmann (1990), Andersen & Christensen (2001), Campbell (2000b), Mai (2001), Blair & Kimbrough (2002)

7.3.5 Techniques of indexing

[Automatic indexing]. Authors discussed: Salton (1971), Salton & McGill (1983)

Pre-coordinate versus post-coordinate indexing. Authors discussed: None

Latent semantic indexing. Authors discussed: Deerwester et al. (1990), Letsche & Berry (1997)

Citation indexing. Authors discussed: Garfield (1979), Small (1978), Cozzens (1989), Nicolaisen (2003)


Although it is praiseworthy that he provides such a comprehensive overview of indexing theories, I do not find his classification of them fruitful. "The aboutness concept" is not a theory of indexing, and neither is "the concept of subject". Any theory of subject indexing has to relate to the concept of subject in one way or another. It may be of minor importance whether it is termed subject or aboutness and whether these two words are regarded as synonyms or not. Different theories of indexing relate to concepts such as aboutness or subject in different ways. Different theories of indexing may also imply different techniques of indexing and may relate differently to theories of meaning, language and interpretation. In my opinion, theories of indexing cut across Andersen's categories. In other publications (e.g. Hjørland, 1997) I have proposed a quite different classification of indexing theories, based on the theories' epistemological assumptions:


Rationalist theories of indexing (such as Ranganathan's theory) suggest that subjects are constructed logically from a fundamental set of categories. The basic method of subject analysis is then "analytic-synthetic": to isolate a set of basic categories (=analysis) and then to construct the subject of any given document by combining those categories according to some rules (=synthesis).


Empiricist theories of indexing are based on selecting similar documents based on their properties, in particular by applying numerical statistical techniques.


Historicist and hermeneutical theories of indexing suggest that the subject of a given document is relative to a given discourse or domain, which is why the indexing should reflect the needs of a particular discourse or domain. According to hermeneutics, a document is always written and interpreted from a particular horizon. The same is the case with systems of knowledge organization and with all users searching such systems. Any question put to such a system is put from a particular horizon. All those horizons may be more or less in consensus or in conflict. To index a document is to try to contribute to the retrieval of “relevant” documents by knowing about those different horizons.


Pragmatic and critical theories of indexing (such as Hjørland, 1997) are in agreement with the historicist point of view that subjects are relative to specific discourses, but emphasize that subject analysis should support given goals and values and should consider the consequences of indexing one way or another. These theories hold that indexing cannot be neutral and that it is a wrong goal to try to index in a neutral way. Indexing is an act (and computer-based indexing acts according to the programmer's intentions). Acts serve human goals. Libraries and information services also serve human goals, which is why their indexing should be done in a way that supports these goals as much as possible. At first glance this looks strange, because the goal of libraries and information services is to identify any document or piece of information. Nonetheless, any specific way of indexing always supports some kinds of use at the expense of others. The documents to be indexed are intended to serve some specific purposes in a community. Basically, the indexing should be intended to serve the same purposes. Primary and secondary documents and information services are parts of the same overall social system. In such a system different theories, epistemologies, worldviews etc. may be at play, and users need to be able to orient themselves and to navigate among those different views. This calls for a mapping of the different epistemologies in the field and a classification of the single document into such a map. Excellent examples of such different paradigms and their consequences for indexing and classification systems are provided in the domain of art by Ørom (2003) and in music by Abrahamsen (2003).
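
To make the analytic-synthetic method of the rationalist theories described above more concrete, the following minimal sketch combines already analysed facet values into one subject statement according to a fixed rule. The facet names, their order, the separator and the function name `synthesize` are all hypothetical choices made for illustration; they are not Ranganathan's actual categories or notation.

```python
# Hypothetical basic categories in a fixed combination order.
FACET_ORDER = ("thing", "action", "place", "time")


def synthesize(facets, separator=" : "):
    """Combine analysed facet values into one subject string.

    facets: dict mapping a facet name to the value found by analysis.
    Facets missing from the document are simply skipped.
    """
    unknown = set(facets) - set(FACET_ORDER)
    if unknown:
        raise ValueError(f"unknown facets: {sorted(unknown)}")
    # Synthesis: always combine the values in the same fixed order.
    return separator.join(facets[name] for name in FACET_ORDER if name in facets)


# Result of a hypothetical analysis of one document.
analysed = {"time": "19th century", "thing": "libraries", "place": "Denmark"}
subject = synthesize(analysed)
```

Here `subject` becomes "libraries : Denmark : 19th century" regardless of the order in which the facets were identified, which is the sense in which the subject is constructed by rule from a fixed set of categories.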


The core of indexing is, as stated by Rowley & Farrow (2000), to evaluate a paper's contribution to knowledge and index it accordingly. Or, in the words of Hjørland (1992, 1997), to index its informative potentials.


"In order to achieve good consistent indexing, the indexer must have a thorough appreciation of the structure of the subject and the nature of the contribution that the document is making to the advancement of knowledge." (Rowley & Farrow, 2000, p. 99).

But again, there may be different views of what a contribution to knowledge is, and of the way in which a given document contributes (or does not contribute).






Birger Hjørland

Last edited: 13-08-2010