© 1972 by British Computer Society
The identification of variable-length, equifrequent character strings in a natural language data base
Postgraduate School of Librarianship and Information Science, University of Sheffield, Sheffield, UK
The words of natural language texts exhibit a hyperbolic (or Zipfian) rank-frequency relationship, i.e., a small number of common words accounts for a large proportion of word occurrences, while a large number of the words occur as singletons or only infrequently. Inverted-file retrieval systems using free text data bases commonly identify words as the keys or index terms about which the file is inverted, and through which access is provided. They therefore involve large and growing dictionaries and may entail inefficient utilisation of storage because of the distribution characteristics.
An alternative approach may be based on the analysis of text in terms of sets of variable-length character strings, the frequency distributions of which are much less disparate than those of words. This could lead to substantial reductions in dictionary size, and increased efficiency both in dictionary look-up times and storage utilisation.
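As a rough illustration of the idea, and not the authors' actual procedure, the sketch below builds a key set from every single character plus every longer character string that occurs at least `threshold` times, then segments text by greedy longest match against that set. The function names `select_keys` and `segment`, the parameters `threshold` and `max_len`, and the greedy matching rule are all assumptions made for this example.

```python
from collections import Counter


def select_keys(text, threshold, max_len=8):
    """Return a set of variable-length character strings to use as keys.

    Assumption for this sketch: a string of length n >= 2 becomes a key
    when it occurs at least `threshold` times (overlapping occurrences
    are counted). Growth stops at the first length with no frequent string.
    """
    # Every single character is a key, so any text over this alphabet
    # can always be segmented.
    keys = set(text)
    for n in range(2, max_len + 1):
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        frequent = {s for s, c in counts.items() if c >= threshold}
        if not frequent:
            break
        keys |= frequent
    return keys


def segment(text, keys):
    """Split text into keys by greedy longest match, left to right."""
    longest = max(map(len, keys), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always
        # matches when `keys` was built from the same text.
        for n in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in keys:
                pieces.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no key matches at position {i}")
    return pieces


sample = "the cat and the hat and the bat"
sample_keys = select_keys(sample, threshold=3)
sample_pieces = segment(sample, sample_keys)
key_usage = Counter(sample_pieces)  # how often each key is actually used
```

Raising `threshold` shrinks the key set towards the bare alphabet, while lowering it admits longer and rarer strings; comparing `key_usage` with a plain word count on a larger corpus is one way to see how much flatter the string distribution is.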
Received October 1971.
