Word
A word is a unit of language. In linguistics, the concept of word is “notoriously
difficult to define. For the sentence as well as for the word, many definitions
have been proposed; but so far none have gained general acceptance” (Uhlenbeck,
2003, p. 377).
In computer science, "word" denotes a fixed-size group of bits (cf. Wikipedia,
2005).
In information retrieval, a sequence of characters surrounded by blanks or
punctuation is normally regarded as a word. In bibliographical records a given
field may be “word indexed” or “phrase indexed” (or both). When the descriptor
“child custody” is word indexed, the expressions “child” and “custody” become
separate index terms. When it is phrase indexed, “child custody” becomes a
single index term. In the latter case the blanks are ignored when the
expression is represented in the inverted file of the database.
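The difference between word indexing and phrase indexing can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular database: the function names, the sample records, and the choice to keep the blank inside the phrase term (rather than deleting it) are all assumptions made for the example.

```python
import re
from collections import defaultdict


def word_index(field_value):
    """Split a field value into word terms at blanks and punctuation."""
    return [term for term in re.split(r"\W+", field_value.lower()) if term]


def phrase_index(field_value):
    """Treat the whole field value as one term; blanks do not split it."""
    return [" ".join(field_value.lower().split())]


def build_inverted_file(records, indexer):
    """Map each index term to the sorted list of record ids that contain it."""
    inverted = defaultdict(set)
    for record_id, descriptors in records.items():
        for descriptor in descriptors:
            for term in indexer(descriptor):
                inverted[term].add(record_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


# Hypothetical records: record id -> values of the descriptor field.
records = {
    1: ["Child custody"],
    2: ["Child welfare"],
    3: ["Custody disputes", "Child custody"],
}

word_file = build_inverted_file(records, word_index)
phrase_file = build_inverted_file(records, phrase_index)
```

In the word-indexed file a search for "child" reaches all three records, whereas the phrase-indexed file has no entry for "child" alone and only answers the full term "child custody".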
In natural language processing (NLP), stemming techniques are used to create
sets of words derived from a common root and appearing in a variety of forms,
depending on particular functions in a sentence or variations in meaning.
Lemmatization is a form of linguistic processing that determines the lemma for
each word form that occurs in text. The lemma of a word encompasses its base
form plus inflected forms that share the same part of speech.
"Lemmatisation is closely related to stemming. The difference is that a stemmer
operates on a single word without knowledge of the context, and therefore cannot
discriminate between words which have different meanings depending on part of
speech. However, stemmers are typically easier to implement, and run faster,
and the reduced accuracy may not matter for some applications." (Wikipedia,
2006a).
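The contrast drawn in the quotation can be illustrated with a toy example in Python. Both functions below are deliberately simplistic sketches: the suffix list and the small lemma table are invented for illustration and do not correspond to any published stemmer or lexicon.

```python
# Assumed toy suffix list; a real stemmer uses far more elaborate rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix. No context is consulted."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table keyed by (word form, part of speech).
LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("meetings", "NOUN"): "meeting",
    ("met", "VERB"): "meet",
}


def toy_lemmatize(word, pos):
    """Look up the lemma of a word form, given its part of speech."""
    return LEMMAS.get((word, pos), word)
```

Here `toy_stem("meeting")` always yields "meet", whether the word is used as a noun or as a verb, while `toy_lemmatize` keeps the noun "meeting" apart from the verb form. The stemmer also leaves the irregular form "met" untouched, which the table-based lemmatizer maps to "meet".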
In linguistics, morphology is the study of grammatical and other variants of
words that are derived from the same root or stem. The main branches of
morphology are inflectional morphology, derivational morphology, and
compounding. For an introduction see, for example, Wikipedia (2006b).
Stemming techniques based on morphological analysis may increase recall and
precision in retrieval.
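How stemming can affect these measures may be seen in a small worked example. The sketch below assumes the standard definitions (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that are retrieved); the documents, the relevance judgments, and the toy stemmer are invented for illustration.

```python
def toy_stem(word):
    """Strip one of a few assumed suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(docs, query, normalize):
    """Return ids of documents containing the query term after normalization."""
    target = normalize(query.lower())
    return {
        doc_id
        for doc_id, text in docs.items()
        if target in {normalize(word) for word in text.lower().split()}
    }


def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection and relevance judgments for the query "indexing".
docs = {
    1: "indexing of documents",
    2: "the index was rebuilt",
    3: "indexed fields in records",
    4: "stock market trends",
}
relevant = {1, 2, 3}

exact = search(docs, "indexing", lambda word: word)  # no normalization
stemmed = search(docs, "indexing", toy_stem)         # match on stems
```

In this toy collection exact matching retrieves only document 1, while stemmed matching retrieves documents 1, 2 and 3, so recall rises and precision is unchanged. In real collections a stemmer may also conflate unrelated words, in which case precision can fall.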
Words can be seen as conceptual accumulators that collect fragments of the
history of human knowledge (cf. Miller, 1997).
Literature:
Garfield, E. (1986). ISI's master list of title words provides a special
perspective on science and scholarship activity. Part I: Lexicography of the
Unique Word Dictionary. Current Contents, No. 27, pp. 3-8, July 7.
http://www.garfield.library.upenn.edu/essays/v9p208y1986.pdf
Korenius, T.; Laurikkala, J.; Järvelin, K. & Juhola, M. (in press).
Stemming and Lemmatization in the Clustering of Finnish Text Documents.
http://www.info.uta.fi/tutkimus/fire/archive/KLJJ-CIKM04.pdf
Uhlenbeck, E. M. (2003). Words. In: International encyclopedia of linguistics.
2nd ed. Edited by W. J. Frawley. (Vol. 4, pp. 377-378). Oxford: Oxford
University Press.
Wikipedia. The free encyclopedia. (2006a). Lemmatisation.
http://en.wikipedia.org/wiki/Lemmatisation
Wikipedia. The free encyclopedia. (2006b). Morphology (linguistics).
http://en.wikipedia.org/wiki/Morphology_%28linguistics%29
Wikipedia. The free encyclopedia. (2005). Word (Computer science).
http://en.wikipedia.org/wiki/Word_%28computer_science%29
See also:
Lexicology; Lexicon (Lifeboat for KO);
String search (Core Concepts in LIS);
Term (Core Concepts in LIS).
Birger Hjørland
Last edited: 15-07-2007