Human indexing ("manual indexing")

Human indexing is often contrasted with automatic indexing. It is also termed "manual indexing" (cf. automation). Machine-aided indexing is an overlapping form that combines human skills with computer power.


Much human indexing is, however, rather mechanical. In such cases humans and computers function in a similar way. If, for example, a book has the title "Psychology" and there exists a class "psychology" in a classification scheme, then humans very often simply match the title with the class in a mechanical way. This is similar to having an algorithm make the match. In other cases, however, human indexers make interpretations that computers cannot make. In a way one could say that specific human capabilities are used only in the cases where decisions differ from such simple rules. If, in a few cases, the indexer decides that the title is misleading, and that the book is not about psychology but about physiology, then the human indexer does something that is much more difficult (though even this might sometimes be done by computers based on, for example, statistical measures of similarity or bibliographic coupling). The really important human element in indexing is connected to the judgment: What in this document is important for the (actual or potential) users of our service? That is, communicating the existence of the document to a potential user, and thus relatively exposing the valuable documents and hiding the noise.
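As a purely hypothetical illustration (the classification scheme, its labels, and the class codes below are invented for the example), the mechanical kind of title-to-class matching described above can be sketched in a few lines of Python:

```python
# Toy sketch of purely mechanical indexing: match words in a title
# against class labels in a small, invented classification scheme.
SCHEME = {"psychology": "BF", "physiology": "QP"}  # label -> class code

def mechanical_index(title: str) -> list[str]:
    """Return the class codes whose label appears as a word in the title."""
    words = title.lower().split()
    return [code for label, code in SCHEME.items() if label in words]

print(mechanical_index("Introduction to Psychology"))  # ['BF']
```

The point of the sketch is that nothing in it "understands" the book; a misleading title is matched just as confidently as an accurate one, which is exactly where human interpretation diverges from the rule.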


There is an inverse relation between the ease and the value of indexing. If schools of LIS teach simple, mechanical rules of indexing, then machines can learn to do the job. Formerly, most libraries in the world indexed their own books. This has now changed, and the market for simple, more mechanical ways of indexing books in libraries has diminished.


As pointed out by Anderson & Pérez-Carballo (2001, p. 237), we may know more about computer indexing than about human indexing because "machine methods must be rigorously described in detail for the computer to carry them out".


The following quote (in the frame) seems provocative to me because I believe that research and teaching may improve human indexing, and because I believe that an understanding of human indexing is also necessary in order to advance automatic indexing. Concepts and theories about the nature of language and knowledge should be much better explored in relation to both human indexing and automatic indexing. If we look, for example, at how living organisms are conceptualized, named and ordered, we see that computers have had only a superficial effect (cf. biology), which is why knowledge organization remains basically dependent on human cognition.


"The effectiveness of manual TC [text categorization] is not 100% anyway (Cleverdon 1984) and, more importantly, it is unlikely to be improved substantially by the progress of research."  (Sebastiani, 2002, p. 41).


The quote is not only provocative. It is also very poorly documented. What Cleverdon actually wrote was:


 "There has been the failure to realise and accept that retrieval of citations from a bibliographic database approximates to a random process. Support for this statement comes from investigations into various aspects of the storage and retrieval process, in which the results indicate that

1) if two people or groups of people construct a thesaurus in a given subject area, only 60 percent of the index terms may be common to both thesauri [source: "Private communication"]

2) if two experienced indexers index a given document using a given thesaurus, only 30 percent of the index terms may be common to the two sets of terms [source: Borko, 1979]

3) if two intermediaries search the same question in the same database on the same host, only 40 percent of the output may be common to both searches [source: Cleverdon, 1977]

4) if two scientists or engineers are asked to judge the relevance of a given set of documents to a given question, the area of agreement may not exceed 60 percent [two sources: Cleverdon, 1970 and Lesk & Salton, 1968]

Undoubtedly there will be many exceptions to the above generalised statements, but all are supported by results from experimental or operational tests, and under such circumstances it is surprising that retrieval systems can operate at performance levels of 60 percent recall and 50 percent precision. However, the apparent randomness of the process means that optimization is a more difficult problem than would be the case if it were a precise and rational activity" (Cleverdon, 1984, pp. 38-39).
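Agreement figures like Cleverdon's 30-60 percent are typically computed as the proportion of index terms (or search output, or relevance judgments) shared by two indexers. A minimal sketch of one such measure, intersection over union of the two term sets (the term sets below are invented for the example):

```python
def consistency(terms_a: set[str], terms_b: set[str]) -> float:
    """Inter-indexer consistency in percent: the terms both indexers
    assigned, divided by all terms either indexer assigned."""
    if not terms_a and not terms_b:
        return 100.0  # trivially consistent: neither assigned any term
    shared = terms_a & terms_b
    return 100.0 * len(shared) / len(terms_a | terms_b)

# Two hypothetical indexers describing the same document:
a = {"indexing", "thesaurus", "retrieval", "evaluation"}
b = {"indexing", "retrieval", "classification"}
print(round(consistency(a, b)))  # 40
```

Other overlap measures exist in the literature (Cleverdon does not say which his sources used), but they all share the property illustrated here: two competent indexers can each be individually reasonable and still agree on well under half of the terms.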


So, in the first round we observe that Sebastiani did not refer directly to empirical documentation but merely provided a very imprecise reference to second-hand claims. His source (Cleverdon, 1984) is itself based on sources of which at least one is useless ("private communication"). The value of the other sources used by Cleverdon will be discussed below (a discussion made difficult by the lack of page numbers and of precise quotes).


Lesk & Salton (1968).






Anderson, J. D. & Pérez-Carballo, J. (2001). The nature of indexing: How humans and machines analyze messages and texts for retrieval. Part I: Research, and the nature of human indexing. Information Processing & Management, 37(2), 231-254.


Anderson, J. D. & Pérez-Carballo, J. (2001). The nature of indexing: How humans and machines analyze messages and texts for retrieval. Part II: Machine indexing, and the allocation of human versus machine effort. Information Processing & Management, 37(2), 255-277.


Borko, H. (1979). Inter-indexer consistency. 7th Cranfield Conference.


Cleverdon, C. W. (1970). Effects of variations in relevance assessments in comparative experimental tests of index languages. Cranfield Library Report, 3.


Cleverdon, C. W. (1977). A comparative evaluation of searching by controlled language and natural language in an experimental NASA database. European Space Agency Contract Report 1/432.


Cleverdon, C. (1984). Optimizing convenient online access to bibliographic databases. Information Services and Use, 4(1), 37-47. Also reprinted in Willett, P. (Ed.). (1988). Document Retrieval Systems. London: Taylor Graham, 32-41.


Lesk, M. E. & Salton, G. (1968). Relevance assessment and retrieval system evaluation. Information Storage and Retrieval Report IRS-14, Cornell University.


Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.



Tulic, M. (2005). Book indexing > about indexing > automatic indexing.






Birger Hjørland

Last edited: 30-01-2007