Likeliness in Knowledge Organization (KO)
A fundamental principle in KO is that "like things" should be brought together, while "different things" should be separated. “Likeliness” is a concept that may also be expressed by other terms such as:
similarity
sameness (used by James, 1890)
resemblance
equivalence
Many writers in LIS have defined classification and KO by using the concept of likeliness. For example:
Ernest Cushing Richardson (1964, p. 1) defined classification as the “putting together of like things, or more fully described, it is the arranging of things according to likeness and unlikeness. It may also be described as the sorting and grouping of things”.
Henry E. Bliss (1935, p. 3) wrote: “In dealing with the multiplicity of particular things, actualities, and specific kinds, we find that some are alike, in general characters and in specific characteristics; and we may consequently relate them in a class, or classes, that is classify them”.
How do we decide what things are alike? How do we develop our criteria for likeness? The literature on KO in LIS seems to ignore this core problem. Is this because the problem is seen as obvious? Is this based on a kind of naïve realism: that things are what they look like, and that people’s immediate sense of likeness is adequate as the basis for KO?
When are two things alike? Naïve realism confuses seeming similarity with similarity in an objective sense. Some metals might look like gold without being precious metals. Chemical analysis (not common sense) has to make the distinction between “what looks like gold” and “what is gold”. Classifications made for children may be based on more superficial attributes, e.g., "the big book of trains", which covers most aspects of the railway system. Scientific classifications, on the other hand, reflect deeper properties, such as the classification of chemical substances into organic and inorganic compounds, precious metals, etc., based on atomic theory. When scientific principles of classification are applied, seemingly related objects may be separated and seemingly different objects may be grouped together. For example, whales were once classified as fish, but are today (influenced by evolutionary theory) classified as mammals. These examples demonstrate that naïve realism is not adequate as a method to classify documents.
The biological species concept is a good example of how naïve realism differs from scientific realism:
"a. Organisms may appear to be alike and be different species. For example, Western meadowlarks (Sturnella neglecta) and Eastern meadowlarks (Sturnella magna) look almost identical to one another, yet do not interbreed with each other—thus, they are separate species according to this definition.
b. Organisms may look different and yet be the same species. For example, look at these ants [click on link below]. You might think that they are distantly related species. In fact, they are sisters—two ants of the species Pheidole barbata, fulfilling different roles in the same colony. " Caldwell et al. (2006 at: http://evolution.berkeley.edu/evosite/evo101/VA1BioSpeciesConcept.shtml ).
Scientists have long recognized that modern-day birds and reptiles share a common ancestor. Both groups lay shelled eggs and have scales (in birds, confined to the legs), nucleated red blood cells, and a number of skeletal similarities.
Different methods and "paradigms" in biological taxonomy thus arrive at different results. Methods based on historical development, such as phylogenetics, consider birds and reptiles to be closely related groups (birds may be a kind of reptile), while methods based on structural similarities, such as phenetics (also known as numerical taxonomy), consider birds and reptiles to be less closely related (birds are not reptiles). DNA evidence (“molecular phylogeny”) again shows a different picture compared to traditional phylogeny. The question of what to consider "similar" is far from trivial.
If we consider different examples, we will see that conceptual relations have different kinds of motivations:
“Institutions for Information Science” are generically subordinate to “institutions”. This relationship seems to be motivated by purely logical relations (or by relations inherent in a given language).
“Copenhagen” is part of “Denmark”, whereas Malmö, in Sweden, is not (Denmark lost that part of present-day Sweden in 1658). The parts of a country are thus defined by social arrangements. This type of semantic relation is, in other words, based on human conventions.
Whales are today classified as mammals. This semantic (generic) relation is explained by evolutionary theory.
Psychology (and also psychopathology) may be classified as part of neuroscience (natural science), as part of the social sciences, or as part of the humanities (or otherwise). Such differences are visible, for example, in how such fields are placed in the organizational structures of universities. Such classifications (and semantic relations) often involve professional interests. It is, for example, partly a question of professional power whether a field (e.g., psychopharmacology) is monopolized by a profession. At the deepest level, the question of whether psychology is a human science or a natural science is a scientific question related to theoretical questions within psychology. It is well known that psychology is divided over this question, and different paradigms in psychology give different answers. For behaviorism, psychology is clearly a part of the natural sciences (cf. Watson, 1913). For humanistic psychology, it is clearly a part of the humanities. In classifying psychology as belonging to the natural sciences, the social sciences, or the humanities, one is actually involved in a theoretical battle between paradigms (which most people find rather uncomfortable).
Classifications and semantic relations may be established by empirical generalizations. One example is Berlin & Kay (1969), who established the empirical generalization that human languages reflect the classification of color sensations in essentially the same ways universally, regardless of historical and cultural differences. Such empirical generalizations avoid the uncomfortable theoretical issues addressed above. There are, however, other kinds of problems with this approach, which are closely connected with the basic assumptions in empiricism.
Classifications and semantic relations may also be purely accidental or ad hoc. Such classifications may serve some purposes very well. In general, however, classifications that reflect essential characteristics in the objects are the most valuable. (This “essentialism” should not be confused with an objectivism that ignores human activity).
Similarity in automatic indexing based on vector space models is determined by associative coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity. The most popular similarity measure is the cosine coefficient, which measures the cosine of the angle between a document vector and the query vector. Other measures include the Jaccard and Dice coefficients. Such measures have been applied, for example, to measure the relative similarity of Shakespeare's tragedies:
"Now we can answer our question "Which of Shakespeare's tragedies is most like Hamlet." Using the binary cosine measure, the play most like Hamlet in terms of its use of lemma is Othello, with a binary cosine value of 0.5198 . This isn't too surprising given that both plays feature revenge themes and more interior dialog then usual. The play least like Hamlet is Titus Andronicus, with a binary cosine value of 0.4622 . The Dice and Jaccard coefficients lead to the same conclusion. The binary overlap value places "Julius Caesar" as the tragedy most similar to Hamlet and King Lear as the least similar. The count-based cosine value places Macbeth as the most similar play to Hamlet with Titus Andronicus again as the least similar. " (Northwestern University. Wordhoard, 2006).
All such measures are, however, based on words in texts, and thus capture a kind of naïve similarity (empiricism). The same text in two different languages, for example, would not be measured as "similar".
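The coefficients named above can be sketched in a few lines of Python. This is a minimal illustration and not the WordHoard implementation: the crude tokenizer, the function names, and the three sample sentences are assumptions made only for this example. It also shows the limitation just mentioned, since a sentence and its translation share no word forms.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def binary_cosine(a, b):
    """Cosine over word presence/absence: |A & B| / sqrt(|A| * |B|)."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))


def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def count_cosine(a, b):
    """Cosine over word-frequency vectors rather than presence/absence."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample sentences, invented for this sketch.
english = tokens("the whale is a mammal")
other = tokens("the whale is a fish")
danish = tokens("hvalen er et pattedyr")  # Danish translation of `english`

# High word overlap, although the two sentences make opposite claims.
print(binary_cosine(english, other), jaccard(english, other), dice(english, other))
# No word overlap, although the two sentences say the same thing.
print(binary_cosine(english, danish), jaccard(english, danish), dice(english, danish))
```

In this toy case the word-based measures rate the two contradictory English sentences as highly similar and the translation pair as entirely dissimilar.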
Thomas Kuhn discusses similarity in relation to his concept of paradigms:
"The practice of normal science depends on the ability, acquired from exemplars, to group objects and situations into similarity sets which are primitive in the sense that the grouping is done without an answer to the question, "Similar with respect to what?" One central aspect of any revolution is, then, that some of the similarity relations change. Objects that were grouped in the same set before are grouped in different ones afterward and vice versa. Think of the sun, moon, Mars, and earth before and after Copernicus; of free fall, pendular, and planetary motion before and after Galileo; or of salts, alloys, and sulphur-iron filing mix before and after Dalton. Since most objects within even the altered sets continue to be grouped together, the names of the sets are usually preserved. Nevertheless, the transfer of a subset is ordinarily part of a critical change in the network of interrelations among them. Transferring the metals from the set of compounds to the set of elements played an essential role in the emergence of a new theory of combustion, of acidity, and of physical and chemical combination. In short order those changes had spread through all of chemistry. Not surprisingly, therefore, when such redistributions occur, two men whose discourse had previously proceeded with apparently full understanding may suddenly find themselves responding to the same stimulus with incompatible descriptions and generalizations." (Kuhn, 1996, pp. 200-201).
The things that we (science or culture) come to regard as functionally equivalent (and thus "similar" in a deeper, non-naïve way) are reflected in our concepts. What Kuhn described above was thus also a development of our concepts of, for example, planets and chemical elements.
The organization of like things in KO is thus based on the conceptualization of the person doing the organizing. A theory of KO also needs, however, to address two questions: which kind of conceptualization should guide KO, and what kind of teaching should be given to prepare for KO?
Literature:
Atlam, E. S.; Fuketa, M.; Morita, K. & Aoe, J. (2003). Documents similarity measurement using field association terms. Information Processing & Management, 39(6), 809-824.
Berlin, B. & Kay, P. (1969). Basic Color Terms: Their Universality and Evolution. Berkeley: University of California Press.
Bliss, H. E. (1935). A System of Bibliographical Classification. New York: The H. W. Wilson Company.
Calado, P.; Cristo, M.; Goncalves, M. A.; de Moura, E. S.; Ribeiro-Neto, B. & Ziviani, N. (2006). Link-based similarity measures for the classification of Web documents. Journal of the American Society for Information Science and Technology, 57(2), 208-221.
Caldwell, R. et al. (2006). Biological species concept. University of California Museum of Paleontology & the National Center for Science Education. http://evolution.berkeley.edu/evosite/evo101/VA1BioSpeciesConcept.shtml
Chang, C. C. & Wu, T. C. (1992). Retrieving the most similar symbolic pictures from pictorial databases. Information Processing & Management, 28(5), 581-588.
Chapman, S. (2006). Sam's string metrics. An open source library of similarity metrics ("SimMetrics"). http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Chiappe, D. L. (1998). Similarity, relevance, and the comparison process. Metaphor and Symbol, 13(1), 17-30.
Gombrich, E. H. (2002). Part one: The limits of likeness. IN: Gombrich, E. H.: Art and illusion. A study in the psychology of pictorial representation. 6th ed. Phaidon Press, Inc. (Pp. 27-78).
James, W. (1890). The Principles of Psychology. (2 vols.). New York: Henry Holt.
Knappe, R. (2005). Measures of Semantic Similarity and Relatedness for Use in Ontology-based Information Retrieval. Roskilde: Roskilde University, Department of Communication, Journalism and Computer Science. http://www.ruc.dk/upload/application/pdf/71822029/thesis.pdf
Kuhn, T. S. (1996). The structure of scientific revolutions. Third edition. Chicago: University of Chicago Press.
Mai, J. E. (2000). Likeness: A Pragmatic Approach. IN: Dynamism and Stability in Knowledge Organization. Proceedings of the Sixth International ISKO Conference. Advances in Knowledge Organization, 7, 23-27. Available at: http://www.ischool.washington.edu/mai/Papers/2000_LikenessAPragmaticApproach.pdf
Northwestern University. Wordhoard, 2006. http://wordhoard.northwestern.edu/userman/analysis-comparingtexts.html
Richardson, E. C. (1964). Classification. Theoretical and practical. Third edition. Hamden, Connecticut: The Shoe String Press, Inc. (Reprinted unaltered from 1930 edition).
Schneider, Jesper W. & Borlund, Pia (2007). Matrix Comparison, Part 2: Measuring the resemblance between proximity measures or ordination results by use of the mantel and procrustes statistics. Journal of the American Society for Information Science and Technology, 58(11), 1596-1609.
Watson, J. B. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158-177.
Birger Hjørland
Last edited: 05-02-2008
Example of a poor kind of "similarity":
Hjørland, B. (2004). Domain analysis in information science. IN: Encyclopedia of Library and Information Science. New York: Marcel Dekker. Pp. 1-7. Online (only available to subscribers): http://www.dekker.com/sdek/abstract~db=enc?content=10.1081/E-ELIS-120024990
The publisher associates this paper with the following links, which are bad examples of subject relations:
Articles like this: [linked articles not preserved]
The publisher has probably made a simple algorithm that identifies other papers of their own if they contain certain common words such as "domain", and this word is apparently used in another sense in pharmacology.
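How such a matcher could misfire can be sketched as follows. The publisher's actual algorithm is not known; the function names, the stop-word list, the threshold, and the pharmacology title are all hypothetical, invented only to show how a homonym such as "domain" can link unrelated papers.

```python
import re

# Assumed minimal stop-word list; a real system would use a longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def keywords(title):
    """Set of lowercased title words, excluding stop words."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS}


def shared_keywords(title_a, title_b):
    """Words the two titles have in common, regardless of their meaning."""
    return keywords(title_a) & keywords(title_b)


info_science = "Domain analysis in information science"
# Hypothetical title of a pharmacology paper (not a real reference).
pharmacology = "Analysis of the binding domain of a receptor protein"

overlap = shared_keywords(info_science, pharmacology)
related = len(overlap) >= 2  # assumed threshold for "articles like this"

print(sorted(overlap), related)
```

The two titles share "domain" and "analysis", so this word-matching rule links them even though the word "domain" means something different in each field.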
Thought experiment: 3⁵ (= 243) balls:
Colors: red, green, blue
Sizes: 1 cm, 2 cm, 3 cm
Material: plastic, wood, iron
Solidity: solid, semi-solid, hollow
Age: antique, used, new
Pick one ball at random. Then consider selecting one "more like this".
There are clearly no objective criteria for similarity: all the red balls are alike, all the 1 cm balls are alike, all the plastic balls are alike, etc. Whatever you choose, you disregard a criterion for likeness that might be relevant for some purpose.
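The thought experiment can be played through in code. This is a minimal sketch under the assumption that every combination of the five attributes occurs exactly once (five attributes with three values each give 3^5 = 243 balls); the names `ATTRIBUTES`, `balls`, and `more_like_this` are made up for the illustration.

```python
from itertools import product

ATTRIBUTES = {
    "color": ["red", "green", "blue"],
    "size": ["1 cm", "2 cm", "3 cm"],
    "material": ["plastic", "wood", "iron"],
    "solidity": ["solid", "semi-solid", "hollow"],
    "age": ["antique", "used", "new"],
}

# One ball per combination of attribute values (assumption of this sketch).
balls = [dict(zip(ATTRIBUTES, values)) for values in product(*ATTRIBUTES.values())]


def more_like_this(chosen, criterion):
    """Every other ball that matches `chosen` on the given attribute."""
    return [b for b in balls if b is not chosen and b[criterion] == chosen[criterion]]


chosen = balls[0]  # red, 1 cm, plastic, solid, antique
by_color = more_like_this(chosen, "color")
by_size = more_like_this(chosen, "size")
agree_on_both = [b for b in by_color if b in by_size]

print(len(balls), len(by_color), len(by_size), len(agree_on_both))
# prints: 243 80 80 26
```

Each single criterion returns 80 equally "similar" balls, and the two result sets overlap in only 26 of them, so the answer to "more like this" depends entirely on which criterion is chosen.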
In this example the qualities are independent. If other qualities, such as price or weight, were involved, they would not be: iron balls, for example, would probably be more expensive and heavier than plastic ones. In normal situations, properties are usually not independent.
A learning algorithm might find out that a user prefers red things, prefers small things, prefers new things, and so on. But can such knowledge be generalized from one situation to another?