A word is a unit of language. In linguistics, the concept of word is “notoriously difficult to define. For the sentence as well as for the word, many definitions have been proposed; but so far none have gained general acceptance” (Uhlenbeck, 2003, p. 377).


In computer science, "word" denotes a fixed-size group of bits (cf. Wikipedia, 2005).


In information retrieval, a sequence of characters delimited by blanks or punctuation is normally regarded as a word. In bibliographic records, a given field may be “word indexed” or “phrase indexed” (or both). When the descriptor “child custody” is word indexed, the expressions “child” and “custody” become separate index terms. When it is phrase indexed, the whole expression “child custody” becomes a single index term. In the latter case, the blanks are not treated as word separators when the expression is represented in the inverted file of the database.
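
The difference between word indexing and phrase indexing can be illustrated with a minimal Python sketch. It is not taken from any particular retrieval system: the function names tokenize and build_indexes, the three sample records, and the tokenization rule (maximal runs of letters or digits, lowercased) are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into words: maximal runs of letters or digits, lowercased.

    Blanks and punctuation act as delimiters (an illustrative rule only).
    """
    return re.findall(r"[^\W_]+", text.lower())


def build_indexes(records):
    """Build a word index and a phrase index over a descriptor field.

    records maps a record id to a list of descriptor strings.
    Both indexes are inverted files: index term -> set of record ids.
    """
    word_index = defaultdict(set)
    phrase_index = defaultdict(set)
    for rec_id, descriptors in records.items():
        for descriptor in descriptors:
            words = tokenize(descriptor)
            # Word indexing: every word becomes its own index term.
            for word in words:
                word_index[word].add(rec_id)
            # Phrase indexing: the whole descriptor is one index term,
            # so the blank inside it does not separate terms.
            phrase_index[" ".join(words)].add(rec_id)
    return word_index, phrase_index


# Hypothetical sample records, invented for this sketch.
records = {
    1: ["child custody"],
    2: ["child development"],
    3: ["custody disputes"],
}
word_index, phrase_index = build_indexes(records)
```

In this sketch a search for “child” in the word index finds records 1 and 2, whereas the phrase index contains no entry for “child” alone; only the full expression “child custody” retrieves record 1.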


In natural language processing (NLP), stemming techniques are used to create sets of words that derive from a common root and appear in a variety of forms, depending on their particular function in a sentence or on variations in meaning. Lemmatization is a form of linguistic processing that determines the lemma for each word form that occurs in a text. The lemma of a word encompasses its base form plus the inflected forms that share the same part of speech.


"Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement, and run faster, and the reduced accuracy may not matter for some applications." (Wikipedia, 2006a).

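The contrast between stemming and lemmatization described above can be shown with a toy Python sketch. Everything in it is an assumption made for illustration: the suffix list, the minimum stem length of three characters, and the small lookup table are hypothetical and far simpler than real stemmers (such as Porter's) or dictionary-based lemmatizers.

```python
# Suffixes tried in order by the toy stemmer (an illustrative list only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table: (word form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("meetings", "NOUN"): "meeting",
    ("met", "VERB"): "meet",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for a word form, given its part of speech.

    Unknown (word, pos) pairs are returned unchanged.
    """
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


# The stemmer gives one answer for "meeting" whatever its role in the sentence.
stem_of_meeting = toy_stem("meeting")
# The lemmatizer distinguishes the noun from the verb form.
noun_lemma = toy_lemmatize("meeting", "NOUN")
verb_lemma = toy_lemmatize("meeting", "VERB")
```

Here the stemmer reduces “meeting” to “meet” in every case, while the lemmatizer keeps the noun “meeting” apart from the verb form. The stemmer also leaves the irregular form “met” unchanged, whereas the table-based lemmatizer maps it to “meet”. The price is that the lemmatizer needs a part-of-speech label and a dictionary, which is why stemmers are usually simpler and faster.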

In linguistics, morphology is the study of grammatical and other variants of words that are derived from the same root or stem. The main branches of morphology are inflectional morphology, derivational morphology, and compounding. For an introduction see, for example, Wikipedia (2006b). Stemming techniques based on morphological analysis may improve retrieval performance as measured by recall and precision.


Words can be seen as conceptual accumulators that collect fragments of the history of human knowledge (cf. Miller, 1997).





Garfield, E. (1986). ISI's master list of title words provides a special perspective on science and scholarship activity. Part I: Lexicography of the Unique Word Dictionary. Current Contents, No. 27, July 7, pp. 3-8. http://www.garfield.library.upenn.edu/essays/v9p208y1986.pdf


Korenius, T.; Laurikkala, J.; Järvelin, K. & Juhola, M. (in press). Stemming and Lemmatization in the Clustering of Finnish Text Documents. http://www.info.uta.fi/tutkimus/fire/archive/KLJJ-CIKM04.pdf


Miller, U. (1997). Thesaurus Construction: Problems and their Roots. Information Processing & Management, 33(4), 481-493.


Uhlenbeck, E. M. (2003). Words. In: International encyclopedia of linguistics. 2nd ed. Edited by W. J. Frawley. (Vol. 4, pp. 377-378). Oxford: Oxford University Press.


Wikipedia. The free encyclopedia. (2006a). Lemmatisation. http://en.wikipedia.org/wiki/Lemmatisation


Wikipedia. The free encyclopedia. (2006b). Morphology (linguistics). http://en.wikipedia.org/wiki/Morphology_%28linguistics%29


Wikipedia. The free encyclopedia. (2005). Word (Computer science). http://en.wikipedia.org/wiki/Word_%28computer_science%29




See also: Lexicology; Lexicon (Lifeboat for KO); String search (Core Concepts in LIS); Term (Core Concepts in LIS).




Birger Hjørland

Last edited: 15-07-2007