Automatic Indexing

Automatic indexing is indexing performed by algorithmic procedures. The algorithm works on a database containing document representations (which may be full-text representations, bibliographical records or partial text representations, and in principle also value-added databases). Automatic indexing may also be performed on non-text databases, e.g. images or music.

In text databases the algorithm may perform string searching, but it is mostly based on searching for words in the single document representation as well as in the total database (via inverted files). The use of words is mostly based on stemming, i.e. on reducing word forms to a common stem. Algorithms may count co-occurrences of words (or references), they may consider levels of proximity between words, and so on.
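Counting co-occurrences within a given proximity can be illustrated with a minimal Python sketch. The function name `cooccurrences` and the `window` parameter are illustrative assumptions rather than part of any particular system, and tokenization and stemming are assumed to have been done beforehand.

```python
from collections import Counter


def cooccurrences(tokens, window=2):
    """Count unordered word pairs occurring within `window` positions of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens, so each pair of positions is counted once.
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


pairs = cooccurrences("automatic indexing and human indexing".split(), window=2)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

A smaller window demands closer proximity between the words; a window as long as the document reduces to plain co-occurrence within the document.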

Automatic indexing may be contrasted with human indexing. It should be considered, however, that if humans are taught strict rules on how to index, their indexing should also be considered mechanical or algorithmic. If, for example, a librarian mechanically matches words from titles with words from a controlled vocabulary, this corresponds to primitive forms of automatic indexing. It is also an open question whether the principles developed by the facet-analytic approach can be automated. For this reason manual indexing and machine indexing should not necessarily be considered two fundamentally different approaches to indexing; rather, the principles and assumptions underlying both kinds of indexing should be uncovered. Examples are assigned and derived indexing, approaches which may be applied, although differently, by both humans and machines. As pointed out by Anderson & Pérez-Carballo (2001, p. 237), we know more about computer indexing than about human indexing because "machine methods must be rigorously described in detail for the computer to carry them out". Automatic indexing may thus inspire us to ask more precise questions about human indexing as well.

The earliest and most primitive forms of automatic indexing were the KWIC / KWAC / KWOC systems, based just on simple, mechanical manipulations of terms derived from document titles. Related forms are the Permuterm Subject Index and the KeyWord Plus known from ISI's citation indexes (the latter system is based on assigning terms from cited titles).
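The mechanical character of a KWIC (keyword in context) index can be shown with a minimal Python sketch. The stop list, the function names and the column width are assumptions made for illustration; historical systems differed in their details.

```python
# Hypothetical stop list; real systems used much longer ones.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic_entries(titles):
    """Return (left context, keyword, right context) triples, sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue  # stop words never become index entries
            entries.append((" ".join(words[:i]), word, " ".join(words[i + 1:])))
    # Stable sort: entries with the same keyword keep their input order.
    return sorted(entries, key=lambda entry: entry[1].lower())


def format_kwic(entries, width=30):
    """Line up the keywords in one column, with the context on either side."""
    return [f"{left[-width:]:>{width}}  {keyword.upper()}  {right[:width]}"
            for left, keyword, right in entries]


for line in format_kwic(kwic_entries(["A theory of indexing",
                                      "The nature of indexing"])):
    print(line)
```

Every non-stop word of every title becomes an entry, which is exactly why such indexes require no judgment about what a document is about.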

When full-text documents are available in a digital medium, a simple kind of automatic indexing may of course be made by putting all words (except stop words) into a database and producing an index in alphabetical order. Such a primitive, mechanical index is easily made by computer, but is extremely time-consuming for human beings to produce. Although such an index is very primitive compared to other kinds of indexes, it has important merits for certain kinds of queries, and most of us expect today to be able to identify the documents and pages in which a certain word or phrase appears. We expect to do this kind of search in full-text documents on the Internet, and we may, for example, find books on Amazon in which the phrase "domain analysis" is mentioned on just one arbitrary page. Clearly such a technique is valuable in situations in which rare expressions are searched for.
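Such an alphabetical word index is essentially an inverted file. The following Python sketch shows one way it might be built; the stop list, the tokenization rule (lower-cased runs of the letters a-z) and the name `build_index` are assumptions chosen for brevity.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def build_index(pages):
    """Map each non-stop word to the sorted list of pages on which it occurs.

    `pages` is a dict from a page (or document) identifier to its text.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(page_id)
    # Alphabetical order of headings, as in a printed index.
    return {word: sorted(ids) for word, ids in sorted(index.items())}


index = build_index({
    1: "Domain analysis is a method",
    2: "The analysis of the domain",
})
for word, page_ids in index.items():
    print(word, page_ids)
```

Note that this index records words, not phrases: finding "domain analysis" as a phrase would additionally require word positions, which the sketch leaves out.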

The main problem with such simple indexes is that in many cases their precision is too low, because normally we are not searching for rare expressions but for common words or phrases. Recall may also be a problem because of synonymy. We may, for example, use a brand name in searching for a drug where only the chemical name appears in the document. Another problem is generic level: we may use too broad or too narrow terms. Basically, the problems in automatic indexing, as in other kinds of knowledge organization, are thus related to meanings and semantic relations (cf. Hjørland, 2007).

Research in automatic indexing is, like indexing and IR in general, intended to improve recall and precision in document retrieval, including providing clues for query refinement and related problems. For this purpose many different kinds of techniques are tested and otherwise explored.

A very influential way to cope with the lack of precision of common search terms is to provide some kind of weighting of terms, for example tf-idf (term frequency–inverse document frequency), which is used in many search engines without users having to know about the underlying technique. The intuitive philosophy behind tf-idf is that terms that occur in many documents are less suited to discriminating between documents, while terms that are frequent within a single document may indicate that this document has much information about the things the terms refer to. This is, however, just one among a long range of actually used or potentially useful strategies to cope with these problems (to be presented below).
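The intuition behind tf-idf can be made concrete with a minimal Python sketch. It uses one common textbook variant (term frequency relative to document length, multiplied by the logarithm of the inverse document frequency); many other weighting variants exist, and the function name `tf_idf` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    `docs` is a list of token lists. Weight = (count / document length)
    * log(number of documents / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["index", "index", "term"],
    ["term", "retrieval"],
    ["term", "index"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy collection "term" occurs in every document and therefore receives the weight 0 everywhere, while "retrieval", which occurs in only one document, receives the highest weight.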

Some techniques are fully automated, while others are semi-automatic or machine-aided. For example, the technique "text categorization" is based on manually predetermined categories, while another technique, "document clustering", is not.

Automatic indexing may be based on terms and structures in documents alone, or it may be based on information about user preferences, external semantic resources (e.g. thesauri) or other kinds of external information. (Relevance feedback is a technique that relies heavily on user preferences, although it is associated less with automatic indexing than with information retrieval.)

Some techniques, such as those based on vector space models, disregard structures in the texts, whereas other approaches utilize information about structures, for example recent approaches in XML-based retrieval.

"Natural language systems attempt to introduce a higher level of abstraction indexing on top of the statistical processes. Making use of rules associated with language assist in the disambiguation of terms and provide an additional layer of concepts that are not found in purely statistical systems. Use of natural language processing provides the additional data that could focus searches," (Kowalski, 2000, pp. 135-136).

Automatic indexing may be related to particular views on semantics and on systems evaluation that differ from the philosophies associated with “intellectual indexing”. Semantic relations such as synonymy may be understood as a strong degree of co-occurrence (cf. Sparck Jones, 1992, p. 1608).

"Throughout the history of automatic indexing, two major theoretical models have emerged: the "vector-space model" and the probabilistic model. Sparck Jones, Walker and Robertson (2000) have provided a thorough review of the development, versions, results, and current status of the probabilistic model. In comparing this model to others, they conclude that

"by far the best-developed non-probabilistic view of IR is the vector-space model (VSM), most famously embodied in the SMART system (Salton, 1975, Salton & McGill, 1983a). In some respect the basic logic of the VSM is common to many other approaches, including our own [i.e., the probabilistic model] . . . In practice the difference [between these two models] has become somewhat blurred. Each approach has borrowed ideas from the other, and to some extent the original motivations have become disguised by the process. . . . This mutual learning is reflected in the results of successive round[s] of TREC. . . . It may be argued that the performance differences that do appear have more to do with choices of the device set used, and detailed matters of implementation, than with foundational differences of approach" (part 2, pp. 829-830).

The focus of our discussion will be on the automatic indexing of language texts. The various tactics and strategies are emphasized, rather than the underlying theoretical models" (Anderson & Pérez-Carballo, 2001, p. 256).

Sparck Jones, Walker and Robertson (2000) compare their own probabilistic approach with other "approaches, models, methods and techniques":

As we have seen, different authors writing on approaches to automatic indexing seem to disagree on which approaches actually exist. One way to classify approaches would be according to the levels of language they consider (see also linguistic aspects of LIS):

                                                             Pragmatic
                                                   Discourse
                                        Semantic
                              Syntactic
                    Lexical
          Morphological
Phonetic

Liddy's (2003) model of natural language processing

Hjørland has in several writings (e.g. 1992, 1997, 2002) suggested that approaches to Library and Information Science (LIS) are basically epistemological approaches, which is why they may be classified according to epistemological positions, e.g. into empiricist, rationalist, historicist and pragmatist approaches. (For the application of these categories to indexing in general, see indexing theory.) Is this classification also possible and valid for automatic indexing?

In principle, this should be the case. However, as pointed out by Liddy (2003, p. 2130), it is the "lower levels" of language that have been thoroughly researched and implemented in natural language processing. Such lower levels (sounds, words, sentences) are more related to automatic indexing, while higher levels (meaning, semantics, pragmatics, discourses) are more related to human understanding and indexing. This may mean that research on automatic indexing has so far not considered historicist and pragmatic approaches very much. As claimed by Svenonius (2000, pp. 46-49), automating subject determination seems to belong to logical positivism: a subject is considered to be a string that occurs above a certain frequency, is not a stop word, and/or is found in a given location (e.g. the title); or, in clustering algorithms, inferences are made such as “if document A is on subject X, then if document B is sufficiently similar to document A (above a certain threshold), then document B is on that subject.”
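The kind of similarity-threshold inference described above can be sketched in a few lines of Python. Cosine similarity over raw word counts is used here only as one plausible similarity measure; the threshold value, the function names and the sample documents are assumptions for illustration and are not taken from Svenonius.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def infer_subjects(labelled, unlabelled, threshold=0.5):
    """Give each unlabelled document the subjects of sufficiently similar documents.

    labelled:   {doc_id: (tokens, subject)}
    unlabelled: {doc_id: tokens}
    Returns {doc_id: set of inferred subjects}.
    """
    inferred = {}
    for doc_id, tokens in unlabelled.items():
        vector = Counter(tokens)
        inferred[doc_id] = {
            subject
            for known_tokens, subject in labelled.values()
            if cosine(vector, Counter(known_tokens)) >= threshold
        }
    return inferred


labelled = {"A": ("automatic indexing of text databases".split(), "X")}
unlabelled = {
    "B": "automatic indexing of image databases".split(),
    "C": "romantic poems of england".split(),
}
print(infer_subjects(labelled, unlabelled))
```

Document B shares four of its five words with document A and inherits subject X; document C shares only "of" and inherits nothing. The sketch also makes the epistemological point visible: the "subject" of B is decided purely by string overlap and a numeric cut-off.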

A classification of approaches from the epistemological point of view might look like this:

Pragmatic approaches (approaches considering values, goals, interests, "paradigms", epistemologies).

"For the past ten years DRTC/ISI have had several projects on automatic indexing and automatic classification based on the conceptual principles of faceted classifications by Ranganathan and Bhattacharyya's theory of "deep structure of subject indexing languages". E.g. POPSI (knowledge representation model chosen to support inference rules for syntax synthesis), PROMETHEUS (parses expressive titles and extracts noun phrases within documents which are then processed through a knowledge representation model to generate meaningful strings) and VYASA (a knowledge representation system for automatic maintenance of analytico-synthetic scheme)" (Aida Slavic, 2006-09-03, message posted on isko-l@lists.gseis.ucla.edu).

"The primary reason computers cannot automatically generate usable indexes is that, in indexing, abstraction is more important than alphabetization. Abstractions result from intellectual processes based on judgments about what to include and what to exclude. Computers are good at algorithmic processes such as alphabetization, but not good at inexplicable processes such as abstraction. Another reason is that headings in an index do not depend solely on terms used in the document; they also depend on terminology employed by intended users of the index and on their familiarity with the document. For example: in medical indexing, separate entries may need to be provided for brand names of drugs, chemical names, popular names and names used in other countries, even when certain of the names are not mentioned in the text. A third reason is that indexes should not contain headings for topics for which there is no information in the document. A typical document includes many terms signifying topics about which it contains no information. Computer programs include those terms in their results because they lack the intelligence required to distinguish terms signifying topics about which information is presented from terms about which no information is presented. A fourth reason is that headings and subheadings should be tailored to the needs and viewpoints of anticipated users. Some are aimed at users who are very knowledgeable about topics addressed in the document; others at users with little knowledge. Some are reminders to those who read the document already; others are enticements to potential readers. To date, no one has found a way to provide computer programs with the judgment, expertise, intelligence or audience awareness that is needed to create usable indexes. Until they do, automatic indexing will remain a pipe dream." (Tulic, 2005, emphasis in original).

Anderson & Pérez-Carballo, on the other hand, find that human indexing has to be limited to specific kinds of tasks that can justify its high cost, and they conclude their discussion of automatic indexing:

"The bottom line is clear: automatic indexing works! And it appears to work just as well as human indexing, just differently. . . ." (Anderson & Pérez-Carballo, 2001, pp. 236-237).

An important aspect is, of course, the qualifications of the human indexer. Should the author, for example, be the indexer of his or her own works? (Cf., author supplied keywords).

Organize all words in a text and in a given database and perform statistical operations on them (e.g. tf-idf).

For example, consider these sentences from Bar-Hillel (1960): "Little John was looking for his toy box. Finally he found it. The box was in the pen." The word "pen" can have at least two meanings (a container for animals or children, and a writing implement). In the sentence "The box was in the pen" one knows that only the first meaning is plausible; the second meaning is excluded by one's knowledge of the normal sizes of (writing) pens and boxes. Bar-Hillel contended that no computer program could conceivably deal with such "real world" knowledge without recourse to a vast encyclopedic store.

Warner (x) expresses the view that only syntactic labor, not semantic labor, can be automated. Semantic and syntactic labor are defined in, for example, Warner (2002):

"Semantic labour is concerned with the content, meaning, or, in semiotic terms, the signified of messages. The intention of semantic labour may be the construction of further messages, for instance, a description of the original message or a dialogic response.

Syntactic labour is concerned with the form, expression, or signifier of the original message. Transformations operating on the form alone may produce further messages (classically, this would be exemplified in the logic formalised by Boole)."

Automatic indexing may at first look like a reasonably limited and well-defined research topic. Important developments have taken place, the practical implications of which most of us use almost every day. However, there seem to be no limits to how automatic indexing may be improved and how the theoretical outlook opens up. Nearly every aspect of human language may be involved in the improvement of machine processing of language (and each natural language may need special consideration). Language is in turn connected to human action and to cultural and social issues, and a given natural language is not just one well-defined thing, which is why forms of sublanguages also have to be considered. Research in automatic indexing is no longer primarily a question of better computers, but primarily a question of a better understanding of human language and the social actions that this language serves.

Assigned indexing that is not just a simple substitution of document terms with synonyms, but that represents independent conceptualizations of document contents, may turn out to be the most important area in which human indexing performs better than automatic indexing (for example, assigning "romantic poem" to a poem that does not describe itself as such).

Anderson, J. D. & Pérez-Carballo, J. (2001). The nature of indexing: How humans and machines analyze messages and texts for retrieval. Part I: Research, and the nature of human indexing. Information Processing & Management, 37(2), 231-254.

Anderson, J. D. & Pérez-Carballo, J. (2001). The nature of indexing: How humans and machines analyze messages and texts for retrieval. Part II: Machine indexing, and the allocation of human versus machine effort. Information Processing & Management, 37(2), 255-277.

Bar-Hillel, Y. (1960). The present status of automatic translation of languages. Advances in Computers, 1, 91-163.

Chen, H., Yim, T., Fye, D. & Schatz, B. (1995). Automatic thesaurus generation for an electronic community system. Journal of the American Society for Information Science, 46, 175-193.

Ellis, D. (1996). Progress and Problems in Information Retrieval. London: Library Association Publishing.

Faraj, N.; Godin, R. & Missaoui, R. (1996). Analysis of an automatic-indexing method based on syntactic analysis of text. Canadian Journal of Information and Library Science-Revue Canadienne des Sciences de l'Information et de Bibliotheconomie, 21(1), 1-21.

Gnoli, C. (2004). Is there a role for traditional knowledge organization systems in the Digital Age? The Barrington Report on Advanced Knowledge Organization and Retrieval (BRAKOR) 1(1). http://eprints.rclis.org/archive/00001415/01/kos-role.htm

Golub, K. (2005). Automated subject classification of textual web pages, for browsing. Lund: Lund University, Department of Information Technology. Available: http://www.it.lth.se/koraljka/Lund/publ/LicE.pdf

Hjørland, B. (1992). The Concept of "Subject" in Information Science. Journal of Documentation, 48(2), 172-200.

Hjørland, B. (1997). Information Seeking and Subject Representation. An Activity-theoretical approach to Information Science. Westport & London: Greenwood Press.

Hjørland, B. (2002). Epistemology and the Socio-Cognitive Perspective in Information Science. Journal of the American Society for Information Science and Technology, 53(4), 257-270.

Hjørland, B. (2007). Semantics and knowledge organization. Annual Review of Information Science and Technology, 41, 367-405.

Hodges, J. E. (2000). Automated systems for the generation of document indexes. IN: Encyclopedia of Library and Information Science. Ed. by A. Kent & C. M. Hall. New York: Marcel Dekker. (Vol. 66, supplement 29, pp. 1-19).

Kowalski, G. J. & Maybury, M. T. (2000). Information storage and retrieval systems: Theory and implementation. 2nd ed. Norwell, Mass.: Kluwer Academic Publishers. Chapter 6: Document and term clustering, pp. 139-163.

Lancaster, F. W. (1991/1998/2003). Indexing and abstracting in theory and practice. London: Library Association. (1st ed. 1991; 2nd ed. 1998; 3rd. ed. 2003).

Liddy, E. D. (2003). Natural Language Processing. IN: Encyclopedia of Library and Information Science. New York: Marcel Dekker.

Luckhardt, H.-D. (2006). Approaches to sense disambiguation with respect to automatic indexing and machine translation. Saarbrücken: Universität des Saarlandes, Philosophische Fakultäten, Informationswissenschaft. http://is.uni-sb.de/studium/handbuch/infoling/ambi/general

Rehm, G. (2002). Towards Automatic Web Genre Identification. A Corpus-Based Approach in the domain of Academia by Example of the Academic’s Personal Homepage. Published in the Proceedings of the Hawai’i International Conference on System Sciences, January 7–10, 2002 http://www.uni-giessen.de/~g91063/pdf/HICSS35-rehm.pdf

Salton, G. (1975). A theory of indexing. Philadelphia: Society for Industrial and Applied Mathematics.

Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.

Salton, G. (1989). Automatic Text Processing: Transformation, Analysis, and Retrieval of Information by Computer. Reading, Mass.: Addison-Wesley.

Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.

Salton, G. & McGill, M. J. (1983). Introduction to modern information retrieval. Englewood Cliffs, NJ: Prentice-Hall.

Sebastiani, F. (2003). Research in automated text classification: Trends and perspectives. Manuscript. 4th International Colloquium on Library and Information Science, Salamanca, 5-7 May 2003. (Invited speech).

Sparck Jones, K. (1980). Statistically-based document indexing. Skrifter om Anvendt og Matematisk Lingvistik, (SAML), No. 6, 79-93.

Sparck Jones, K. (1992). Thesaurus. IN: Encyclopedia of Artificial Intelligence, Vol. 1-2. Ed by S. C. Shapiro, New York: John Wiley & Sons. (Vol. 2, pp. 1605-1613).

Sparck Jones, K.; Walker, S. & Robertson, S. E. (2000). A probabilistic model of information retrieval: Development and comparative experiments. Information Processing & Management, 36(6), 779-840. Available: http://www.soi.city.ac.uk/~ser/blockbuster.html

Svenonius, E. (2000). The Intellectual Foundations of Information Organization. MIT Press, Cambridge, MA.

How do you see the role of human indexing in the future? If there is any future: What kind of knowledge should human indexers possess? What kind of work is made obsolete by computers?