Information retrieval (IR) as challenger to Knowledge Organization (KO)
Information retrieval (IR) and knowledge organization (KO) are normally considered two different, although strongly related, subdisciplines within Library and Information Science (LIS), related to search labor and description labor, respectively (cf. Warner, 2002). However, because they are trying to solve the same kinds of problems, we have to consider them competing approaches and thus try to evaluate their relative strengths and weaknesses. The question then becomes: Can IR be characterized as an approach relative to the other approaches discussed?
This is not simple, because IR researchers will of course pick up ideas they find useful and try to make them part of their own approach. This may, for example, be the case with a concept such as genre and genre analysis. One distinction in the literature has been between "physical paradigm" (or "system-driven") approaches on one side and "user-oriented" or "cognitive paradigm" approaches on the other. The value of this distinction has to be examined, and other distinctions may be proposed. My own view is that this distinction may represent a misinterpretation. The difference between the Cranfield experiments and user-oriented views is first and foremost that the former are based on expert evaluations of recall and precision, while the latter are based on users' evaluations. It is never the technology itself that decides what is relevant.
Information retrieval is mostly based on a fundamental assumption about likeness, or match, between a query and a document representation (a small code sketch below illustrates this matching principle). This assumption is of course often fruitful; we all benefit from search engines based on it. However, it is important to understand that it may sometimes be problematic. For example, it is sometimes relevant to identify papers that are co-cited whether or not these documents are "similar". To understand IR merely as "query transformation" is thus problematic, as Julian Warner (2002) has pointed out:
"The idea of query transformation, understood as the automatic transformation of a query into a set of relevant records, has been dominant in information retrieval theory. A contrasting principle of selection power has been valued in ordinary discourse, librarianship, and, to some extent, in practical system design and use" (Warner, 2002).
Warner's idea is thus that the IR approach may be characterized as "query transformation" and has some inbuilt weaknesses related to this concept.
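To make the matching assumption concrete, here is a minimal sketch in Python of ranking documents by query-document similarity in the vector-space tradition. It is an illustration only, not any particular system's implementation; the documents and query are invented for the example:

import math
from collections import Counter

def vector(text):
    # Represent a text as a "bag of words": term -> frequency.
    return Counter(text.lower().split())

def cosine(q, d):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

documents = {
    "doc1": "libraries organize knowledge with classification systems",
    "doc2": "search engines rank documents by term matching",
}
query = vector("term matching in search engines")
for name in sorted(documents, key=lambda n: cosine(query, vector(documents[n])), reverse=True):
    print(name, cosine(query, vector(documents[name])))

Note that a co-citation relation between two papers, as mentioned above, is invisible to such term matching: two co-cited documents may share few or no terms.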
Both the traditional classification systems (like UDC) and the facet-analytic method came under attack from the information retrieval (IR) tradition, which was founded in the 1950s with experimental traditions like Cranfield (later continued in the TREC experiments and in the development of Internet search engines).
The Cranfield experiments found that classification systems like UDC and facet-analytic systems were less efficient than free-text searches or low-level indexing systems ("UNITERM"). Although KOS such as thesauri and descriptors are children of the IR tradition, the main tendency has been to question the value of traditional classification, facet analysis and human indexing altogether. The tradition has more or less implicitly worked with the assumption that algorithms operating on textual representations (preferably full-text representations) may fully substitute both human indexing and algorithms constructed on the basis of human interpretations. (What is termed "text categorization" is a machine-learning approach in which a number of documents are manually assigned to predefined categories; this technique is an example in which human classification and machine classification are combined, as the sketch below shows.)
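A minimal sketch of text categorization in this machine-learning sense, assuming the scikit-learn library is available; the training documents, labels and test sentence are invented for the example. The manually assigned labels are the human contribution, the trained model the machine contribution:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Documents manually assigned to predefined categories (human classification).
train_texts = [
    "faceted classification and thesaurus construction",
    "colon classification and facet analysis",
    "ranking algorithms and query term weighting",
    "inverted files, retrieval evaluation, precision and recall",
]
train_labels = ["KO", "KO", "IR", "IR"]

# TF-IDF term weighting plus a naive Bayes classifier (machine classification).
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["relevance feedback modifies query weighting"]))

The machine's categorizations can be no better than the human categorizations it was trained on, which is precisely why this approach combines rather than eliminates human classification.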
If one does not question the results obtained in this approach, it implies the end of knowledge organization as a research field, to be substituted by IR. This is the reason why it is important to consider IR as one among other approaches to KO, in order to identify its relative strengths and weaknesses. Of course, traditional classification systems may still be needed for shelf arrangement, but this is a rather narrow issue, which cannot justify the existence of the larger research field of KO. Users are increasingly relying on electronic databases as well as Internet search engines to find information, including information from libraries, which is why library KO competes with other providers of subject access and descriptive access to documents.
According to Bell (2004), “Google has become the symbol of competition to the academic library”.
Sparck Jones concludes her answer to Hjørland & Nissen Pedersen (2005) in the following way:
"At the same time, one of the most important techniques developed in retrieval research and very prominent in recent work, namely relevance feedback, raises a more fundamental question. This is whether classification in the conventional, explicit sense, is really needed for retrieval in many, or most, cases, or whether classification in the general (i.e. default) retrieval context has a quite other interpretation. Relevance feedback simply exploits term distribution information along with relevance judgments on viewed documents in order to modify queries. In doing this it is forming and using an implicit term classification for a particular user situation. As classification the process is indirect and minimal. It indeed depends on what properties are chosen as the basic data features, e.g. simple terms and, through weighting, on the values they can take; but beyond that it assumes very little from the point of view of classification. It is possible to argue that for at least the core retrieval requirement, giving a user more of what they like, it is fine. Yet it is certainly not a big deal as classification per se: in fact most of the mileage comes from weighting. And how large that mileage can be is what retrieval research in the many experiments done in the last decade have demonstrated, and web engines have taken on board." (Sparck Jones, 2005, p. 601).
Let us consider Sparck Jones's suggestion that ". . . relevance feedback, raises a more fundamental question. This is whether classification in the conventional, explicit sense, is really needed for retrieval . . ." Suppose, for example, that a person is searching for information about "Sweden". Some references are retrieved by using search terms (or otherwise). The user indicates which references are relevant, and the system is supposed to find "more like this". In a traditional classification, all Swedish place names may be classified together (e.g., Borås, Lund, Malmö, Stockholm . . .). Can such a classification be replaced by mechanisms providing relevance feedback? One problem might be that the user does not know which place names are Swedish and which are not, and may therefore provide incorrect feedback (e.g., by stating that a reference about "Bagsværd", a Danish place name, is relevant). It may therefore be that users of systems based on relevance feedback are unable to retrieve the relevant documents and to avoid the non-relevant ones. In other words: classification in the traditional sense is still needed.
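The query-modification mechanism Sparck Jones describes is classically formulated as Rocchio's relevance feedback algorithm: move the query vector toward documents judged relevant and away from documents judged non-relevant. The following sketch (a minimal illustration with invented data and illustrative parameter values, not a description of any particular system) also shows why the Sweden example is a problem:

from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Rocchio query modification over term-frequency vectors.
    modified = Counter()
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    for t in terms:
        w = (alpha * query[t]
             + (beta / max(len(relevant), 1)) * sum(d[t] for d in relevant)
             - (gamma / max(len(nonrelevant), 1)) * sum(d[t] for d in nonrelevant))
        if w > 0:
            modified[t] = w
    return modified

query = Counter(["sweden"])
# The user wrongly judges a document about Bagsværd (in Denmark) relevant;
# its terms are pulled into the query, and nothing in the feedback loop can
# detect the error, because the system has no classification telling it
# which place names are Swedish.
relevant = [Counter(["malmö", "lund", "stockholm"]),
            Counter(["bagsværd", "copenhagen"])]
print(rocchio(query, relevant, []))

Because the erroneous judgment pulls "bagsværd" and "copenhagen" into the modified query, subsequent retrieval drifts toward Danish material; the feedback mechanism itself has no way to correct this.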
Gerard Salton wrote a letter to the editor of the Journal of the American Society for Information Science, criticizing views put forward by Hjørland & Albrechtsen, in which he dismissed traditional tools in knowledge organization, such as thesauri, as obsolete:
"Ignoring the completely changed conditions under which information retrieval activities are now taking place, forgetting all the accumulated evidence and test data, and acting as if we were stuck in the nineteenth century with controlled vocabularies, thesaurus control, and all the attendant miseries, will surely not contribute to a proper understanding and appreciation of the modern information science field" (Salton, 1996, 333)
The domain-analytic criticism of the IR approach is in particular related to the following argument: any database could be seen as a merging of different texts with different approaches to, and different conceptualizations of, a given problem. One should not be interested in statistical averages but in differences in meaning related to underlying views and interests. Given a mapping of such views and interests, better retrieval mechanisms may be developed in relation to a given view or interest.
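One way to operationalize this argument, sketched below under the assumption that documents carry an explicit "view" field assigned through domain analysis (the field name, the documents and the views are all invented for the example), is to let retrieval be restricted to a given conceptualization instead of averaging over all of them:

# Documents indexed with an explicit "view" field (a domain-analytic
# assumption for this sketch; a real system would need such metadata).
documents = [
    {"title": "Dyslexia as a neurological disorder", "view": "neurological"},
    {"title": "Dyslexia as a social construction", "view": "social"},
    {"title": "Cognitive models of dyslexia", "view": "cognitive"},
]

def search(term, view=None):
    # Plain term matching, optionally restricted to one view or interest.
    hits = [d for d in documents if term.lower() in d["title"].lower()]
    if view is not None:
        hits = [d for d in hits if d["view"] == view]
    return hits

print(search("dyslexia", view="social"))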
"Libraries' core skill is not delivering information. Libraries improve the quality of the question and the user experience" (Abram, 2007, slide 13). On slide 17, Abram adds that Google is most efficient at answering what, when and where questions, whereas libraries are better at answering why and how questions (compare also slides 103-104 and 143).
Literature:
Abram, S. (2007). The social library 2.0. Presentation given at the Royal School of Library and Information Science, Copenhagen, March 13.
Bell, S. (2004). The infodiet: How libraries can offer an appetizing alternative to Google. The Chronicle of Higher Education, 50(24), p. B15.
Brophy, J. & Bawden, D. (2005). Is Google enough? Comparison of an internet search engine with academic library resources. ASLIB Proceedings, 57(6), 498-512.
Broughton, V. (2006). The need for a faceted classification as the basis of all methods of information retrieval. ASLIB Proceedings, 58(1-2), 49-72.
Broughton, V., Hansson, J., Hjørland, B. & López-Huertas, M. J. (2005). Knowledge organisation: Report of working group 7. In Kajberg, L. & Lørring, L. (Eds.), European Curriculum Reflections on Education in Library and Information Science. Copenhagen: Royal School of Library and Information Science. Available at: http://biblis.db.dk/uhtbin/hyperion.exe/db.leikaj05
Ellis, D. (1996). Progress and Problems in Information Retrieval. London: Library Association Publishing. Chapter 1.
Gerhart, S. (2004). Do Web search engines suppress controversy? First Monday, 9(1), http://www.firstmonday.org/issues/issue9_1/gerhart/index.html
Gnoli, C. (2004). Is there a role for traditional knowledge organization systems in the Digital Age? The Barrington Report on Advanced Knowledge Organization and Retrieval (BRAKOR) 1(1). http://eprints.rclis.org/archive/00001415/01/kos-role.htm
Hjørland, B. (1996). Rejoinder: A new horizon for information science. Journal of the American Society for Information Science, 47(4), 333-335. (Answer to Salton, 1996).
Hjørland, B. & Nissen Pedersen, K. (2005). A substantive theory of classification for information retrieval. Journal of Documentation, 61(5), 582-597.
Salton, G. (1996). Letter to the editor: A new horizon for information science. Journal of the American Society for Information Science, 47(4), 333.
Sparck Jones, K. (2005). Revisiting classification for retrieval. Journal of Documentation, 61(5), 598-601. [Reply to Hjørland & Nissen Pedersen, 2005]. http://www.db.dk/bh/Core%20Concepts%20in%20LIS/Sparck%20Jones_reply%20to%20Hjorland%20&%20Nissen.pdf
Warner, J. (2002). Forms of labour in information systems. Information Research 7(4), http://informationr.net/ir/7-4/paper135.html
See also: Automatic Indexing; Bag of word approach; Cranfield experiments (Core Concepts in LIS); Internet and KO.
Birger Hjørland
Last updated: 18-03-2007
Questions:
Some arguments for "intellectual" as opposed to "automatic" KO are related to the use of controlled vocabulary, especially the control of homonymy and synonymy. Discuss the validity of this argument.
The networking of libraries, publishers, bibliographical services, etc. has changed the need for knowledge organization in individual libraries. Discuss the implications. (What would you do if you were the head of a library?)
After Google: what more is needed?
How do we evaluate progress? Are there neutral, objective ways to do this or are evaluation methods themselves influenced by theoretical issues and thus related to different approaches?
Are there limits in principle to what a computer can do with respect to knowledge organization? Discuss the relative merits of humans and computers.