Likeness in Knowledge Organization (KO)

A fundamental principle in KO is that "like things" should be brought together, while "different things" should be separated. “Likeness” is a concept that may also be expressed by other terms such as:

Many writers in LIS have defined classification and KO by using the concept of likeness. For example:

How do we decide what things are alike? How do we develop our criteria for likeness? The literature on KO in LIS seems to ignore this core problem. Is this because the problem is seen as obvious? Is this based on a kind of naïve realism: that things are what they look like, and that people’s immediate sense of likeness is adequate as the basis for KO?

 

When are two things alike? Naïve realism confuses apparent similarity with similarity in an objective sense. Some metals may look like gold without being precious metals. Chemical analysis (not common sense) has to make the distinction between “what looks like gold” and “what is gold”. Classifications made for children may be based on more superficial attributes, e.g., "the big book of trains", which covers most aspects of the railway system. Scientific classifications, on the other hand, reflect deeper properties, such as the classification of chemical substances into organic and inorganic compounds, precious metals, etc., based on atomic theory. When scientific principles of classification are applied, seemingly related objects may be separated and seemingly different objects may be grouped together. For example, whales were once classified as fish but are today, under the influence of evolutionary theory, classified as mammals. These examples demonstrate that naïve realism is not adequate as a method for classifying documents.

 

The biological species concept is a good example of how naïve realism differs from scientific realism:

"a. Organisms may appear to be alike and be different species. For example, Western meadowlarks (Sturnella neglecta) and Eastern meadowlarks (Sturnella magna) look almost identical to one another, yet do not interbreed with each other—thus, they are separate species according to this definition.

b. Organisms may look different and yet be the same species. For example, look at these ants [click on link below]. You might think that they are distantly related species. In fact, they are sisters—two ants of the species Pheidole barbata, fulfilling different roles in the same colony. " Caldwell et al. (2006 at: http://evolution.berkeley.edu/evosite/evo101/VA1BioSpeciesConcept.shtml ).

 

 

Scientists have long recognized that modern-day birds and reptiles share a common ancestor. Both groups lay shelled eggs and have scales (in birds, confined to the legs), nucleated red blood cells, and a number of skeletal similarities.

Different methods and "paradigms" in biological taxonomy thus arrive at different results. Methods based on historical development, such as phylogenetics, consider birds and reptiles to be closely related groups (birds may be a kind of reptile), while methods based on structural similarities, such as phenetics (also known as numerical taxonomy), consider birds and reptiles to be less closely related (birds are not reptiles).

DNA evidence (“molecular phylogeny”) again shows a different picture from traditional phylogeny. The question of what to consider "similar" is far from trivial.

 

If we consider different examples, we will see that conceptual relations have different kinds of motivations:

We may conclude that there are many different kinds of criteria for likeness. They may be conventional, logical, psychological and so on. Regarding natural kinds, however, they should especially be seen as domain-specific criteria which are discovered by science. They are not just something that can be extracted from users or from statistical investigations.

 

Similarity in automatic indexing based on vector space models is determined by association coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity. The most popular similarity measure is the cosine coefficient, which is the cosine of the angle between a document vector and the query vector. Other measures include the Jaccard and Dice coefficients. Such measures have been applied, for example, to measure the relative similarity of Shakespeare's tragedies:

"Now we can answer our question "Which of Shakespeare's tragedies is most like Hamlet." Using the binary cosine measure, the play most like Hamlet in terms of its use of lemma is Othello, with a binary cosine value of 0.5198 . This isn't too surprising given that both plays feature revenge themes and more interior dialog then usual. The play least like Hamlet is Titus Andronicus, with a binary cosine value of 0.4622 . The Dice and Jaccard coefficients lead to the same conclusion. The binary overlap value places "Julius Caesar" as the tragedy most similar to Hamlet and King Lear as the least similar. The count-based cosine value places Macbeth as the most similar play to Hamlet with Titus Andronicus again as the least similar. " (Northwestern University. Wordhoard, 2006).

 

All such measures are, however, based on the words in texts and thus capture only a naïve kind of similarity (empiricism). The same text in two different languages, for example, would not be measured as "similar".
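The point about word-based measures can be illustrated with a minimal sketch. It assumes binary term vectors (each distinct word counts once) and whitespace tokenization; the function names and the three sample sentences are illustrative choices and are not taken from the WordHoard study quoted above.

```python
import math

def tokens(text):
    """Lowercase a text and return its set of distinct words (a binary term vector)."""
    return set(text.lower().split())

def binary_cosine(a, b):
    """Cosine of the angle between two binary term vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def jaccard(a, b):
    """Shared words divided by all distinct words in either text."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """Twice the shared words divided by the sum of both vocabulary sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

# Illustrative sample texts (assumptions, not data from the source).
english = tokens("to be or not to be that is the question")
german = tokens("sein oder nichtsein das ist hier die frage")
paraphrase = tokens("the question is whether to be or not")

for name, other in [("German translation", german), ("English paraphrase", paraphrase)]:
    print(name, binary_cosine(english, other), jaccard(english, other), dice(english, other))
```

Under these assumptions the German rendering shares no word with the English line, so all three coefficients are 0, while the English paraphrase scores a binary cosine of 0.875. The measures register overlap in vocabulary, not sameness of content.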

 

 

Thomas Kuhn discusses similarity in relation to his concept of paradigms:

"The practice of normal science depends on the ability, acquired from exemplars, to group objects and situations into similarity sets which are primitive in the sense that the grouping is done without an answer to the question, "Similar with respect to what?" One central aspect of any revolution is, then, that some of the similarity relations change. Objects that were grouped in the same set before are grouped in different ones afterward and vice versa. Think of the sun, moon, Mars, and earth before and after Copernicus; of free fall, pendular, and planetary motion before and after Galileo; or of salts, alloys, and sulphur-iron filing mix before and after Dalton. Since most objects within even the altered sets continue to be grouped together, the names of the sets are usually preserved. Nevertheless, the transfer of a subset is ordinarily part of a critical change in the network of interrelations among them. Transferring the metals from the set of compounds to the set of elements played an essential role in the emergence of a new theory of combustion, of acidity, and of physical and chemical combination. In short order those changes had spread through all of chemistry. Not surprisingly, therefore, when such redistributions occur, two men whose discourse had previously proceeded with apparently full understanding may suddenly find themselves responding to the same stimulus with incompatible descriptions and generalizations." (Kuhn, 1996, pp. 200-201).

 

The things that we (science or culture) come to regard as functionally equivalent (and thus "similar" in a deeper, non-naïve way) are reflected in our concepts. What Kuhn described above was thus also a development of our concepts of, for example, planets and chemical elements.

 

The organization of like things in KO is thus based on the conceptualization of the person doing the organizing. A theory of KO also needs, however, to address two questions: which kind of conceptualization should guide KO, and what kind of teaching should prepare people for KO?

 

 

"According to Goodman (1972), two things are similar only if they possess relevant common properties. The relevance of properties, however, can vary with the context and with the goals of the person making the comparison. As a result, similarity is a highly unstable relation, and therefore difficult to use as a base from which to explain other processes, such as analogy, induction, categorization, and metaphor. A recent attempt by Gentner and her colleagues to explain the operations of the comparison process in analogy may run across some of these difficulties. In particular, her structure-mapping theory (Gentner, 1983, 1989; Gentner & Markman, 1997) does not explain how the relevant features are accessed during the early stages of the comparison process. Moreover, progress on this problem will not be made until we have a solution to a more general problem--the problem of explaining how people manage to bring relevant information to bear in the processing of new information. " (Chiappe, 1998).

 

 

 

Literature:

 

Atlam, E. S.; Fuketa, M.; Morita, K. & Aoe, J. (2003). Documents similarity measurement using field association terms. Information Processing & Management, 39(6), 809-824.

 

Berlin, B. & Kay, P. (1969). Basic Color Terms: Their Universality and Evolution. Berkeley: University of California Press.

 

Bliss, H. E. (1935). A System of Bibliographical Classification. New York: The H. W. Wilson Company.

 

Calado, P.; Cristo, M.; Goncalves, M. A.; de Moura, E. S.; Ribeiro-Neto, B. & Ziviani, N. (2006). Link-based similarity measures for the classification of Web documents. Journal of the American Society for Information Science and Technology, 57(2), 208-221.

 

Caldwell, R. et al. (2006). Biological species concept. University of California Museum of Paleontology & the National Center for Science Education.  http://evolution.berkeley.edu/evosite/evo101/VA1BioSpeciesConcept.shtml

 

Chang, C. C. & Wu, T. C. (1992). Retrieving the most similar symbolic pictures from pictorial databases. Information Processing & Management, 28(5), 581-588.

 

Chapman, S. (2006). Sam's string metrics. An open source library of similarity metrics ("SimMetrics"). http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

 

Chiappe, D. L. (1998). Similarity, relevance, and the comparison process. Metaphor and Symbol, 13(1), 17-30.

 

Gombrich, E. H. (2002). Part one: The limits of likeness. IN: Gombrich, E. H.: Art and illusion. A study in the psychology of pictorial representation. 6th ed.  Phaidon Press, Inc. (Pp. 27-78).

 

James, W. (1890). The Principles of Psychology. (2 vols.). New York: Henry Holt.

 

Knappe, R. (2005). Measures of Semantic Similarity and Relatedness for Use in Ontology-based Information Retrieval. Roskilde: Roskilde University, Department of Communication, Journalism and Computer Science.  http://www.ruc.dk/upload/application/pdf/71822029/thesis.pdf

 

Kuhn, T. S. (1996). The structure of scientific revolutions. Third edition. Chicago: University of Chicago Press.

 

Mai, J. E. (2000). Likeness: A pragmatic approach. In Dynamism and Stability in Knowledge Organization. Proceedings of the Sixth International ISKO Conference. Advances in Knowledge Organization, 7, 23-27. Available at: http://www.ischool.washington.edu/mai/Papers/2000_LikenessAPragmaticApproach.pdf

 

Northwestern University (2006). Wordhoard. http://wordhoard.northwestern.edu/userman/analysis-comparingtexts.html

 

Richardson, E. C. (1964). Classification. Theoretical and practical. Third edition. Hamden, Connecticut: The Shoe String Press, Inc. (Reprinted unaltered from 1930 edition).

 

Schneider, Jesper W. & Borlund, Pia (2007). Matrix Comparison, Part 1: Motivation and Important Issues for Measuring the Resemblance Between Proximity Measures or Ordination Results. Journal of the American Society for Information Science and Technology, 58(11), 1586-1595.

Schneider, Jesper W. & Borlund, Pia (2007). Matrix Comparison, Part 2: Measuring the resemblance between proximity measures or ordination results by use of the mantel and procrustes statistics. Journal of the American Society for Information Science and Technology, 58(11), 1596-1609.

 

 

Watson, J. B. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158-177.

 

 

 

 

 

Birger Hjørland

Last edited: 05-02-2008


 

 

 

Example of poor kinds of "similarity":

 

Hjørland, B. (2004). Domain analysis in information science. IN: Encyclopedia of Library and Information Science. New York: Marcel Dekker. Pp. 1-7. Online (only available to subscribers): http://www.dekker.com/sdek/abstract~db=enc?content=10.1081/E-ELIS-120024990

The publisher associates this paper with the following links, which are bad examples of subject relations:

 

 

Articles like this:

National Institutes of Health

 

Neurology Specialty Pharmacy Practice

 

Poison Information Pharmacy Practice

 

Pharmacist Managed Vaccination Programs

 

Disease Management

 

Evidence Based Practice

 

Health Status Assessment

 

Association of Faculties of Pharmacy of Canada

 

United States Pharmacopeia -- More Like These (824)

The publisher has probably made a simple algorithm that identifies other papers of its own if they contain certain common words such as "domain", and this word is apparently used in another sense in pharmacology.
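A guess at how such a shared-word rule could misfire is sketched here. This is not the publisher's actual system, which is unknown. The function name `more_like_this`, the stopword list and the sample texts are all hypothetical and invented for illustration.

```python
# Hypothetical stopword list (an assumption, kept tiny for the example).
STOPWORDS = frozenset({"a", "across", "each", "in", "of", "the"})

def content_words(text):
    """Return the distinct lowercase words of a text, minus stopwords."""
    return set(text.lower().split()) - STOPWORDS

def more_like_this(seed, candidates, min_shared=1):
    """Return (candidate, shared words) for candidates sharing enough words with the seed."""
    seed_words = content_words(seed)
    hits = []
    for candidate in candidates:
        shared = seed_words & content_words(candidate)
        if len(shared) >= min_shared:
            hits.append((candidate, sorted(shared)))
    return hits

# Invented sample texts, not real article titles or abstracts.
seed = "domain analysis in information science"
candidates = [
    "patient assessment across each domain of functioning",
    "subject indexing in information science",
    "vaccination programs in community pharmacy",
]

hits = more_like_this(seed, candidates)
for text, shared in hits:
    print(text, shared)
```

With these inputs the first candidate is returned solely because it contains "domain", although the word is used in a different sense there. A rule of this kind cannot tell a shared word from a shared subject.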

 

 

Thought experiment: 3^5 (= 243) balls:

Colors: red, green, blue

Sizes: 1 cm, 2 cm, 3 cm

Material: plastic, wood, iron

Solidity: solid, semi-solid, hollow

Age: antique, used, new

 

Pick one ball at random. Then try to select one that is "more like this".

There are clearly no objective criteria for similarity: all the red balls are alike, all the 1 cm balls are alike, all the plastic balls are alike, etc. Whatever you choose, you disregard a criterion for likeness that might be relevant for some purpose.

In this example the properties are independent. If other properties, such as price or weight, were involved, they would not be: iron balls, for example, would probably be more expensive and heavier than plastic ones. In normal situations, properties are usually not independent.

A learning algorithm might find out that a user prefers red things, prefers small things, prefers new things and so on. But can such knowledge be generalized from one situation to another?
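The thought experiment can be played through in a short sketch. It assumes one ball for every combination of the five properties listed above and represents a purpose as a set of weights on those properties. The weighting scheme and the function names are illustrative assumptions, not a method proposed in the text.

```python
from itertools import product

PROPERTIES = {
    "color": ["red", "green", "blue"],
    "size": ["1 cm", "2 cm", "3 cm"],
    "material": ["plastic", "wood", "iron"],
    "solidity": ["solid", "semi-solid", "hollow"],
    "age": ["antique", "used", "new"],
}

# One ball per combination of property values.
balls = [dict(zip(PROPERTIES, combo)) for combo in product(*PROPERTIES.values())]

def similarity(a, b, weights):
    """Sum the weights of the properties on which two balls agree."""
    return sum(w for prop, w in weights.items() if a[prop] == b[prop])

def more_like_this(seed, weights):
    """Return all other balls that reach the highest weighted similarity to the seed."""
    others = [b for b in balls if b != seed]
    best = max(similarity(seed, b, weights) for b in others)
    return [b for b in others if similarity(seed, b, weights) == best]

seed = balls[0]  # red, 1 cm, plastic, solid, antique
by_color = more_like_this(seed, {"color": 1})
by_material = more_like_this(seed, {"material": 1})
by_all = more_like_this(seed, {prop: 1 for prop in PROPERTIES})

print(len(balls), len(by_color), len(by_material), len(by_all))
```

Under these assumptions, weighting only color or only material each yields 80 "most similar" balls, but the two sets differ. Equal weights on all properties yield 10 balls, each differing from the seed in exactly one property. Which result counts as "more like this" depends entirely on the weights, that is, on the purpose.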

 

 

 

 

 
