This website helps users select books based on a learner's vocabulary size and on readability formulas.
It is intended for self-learners, teachers, librarians, literature lovers, and others.

  • What percentage of a text's vocabulary must be known to comprehend it?
  • Carver (1994) generalized that a text is difficult when the reader knows 98% of its words, appropriate at 99%, and easy at 100%. Schmitt et al. (2011) concluded that there is no threshold percentage and demonstrated an almost linear trend: knowing 90% to 100% of the words used in a text corresponds to 50% to 75% comprehension of the text.
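    The near-linear trend can be illustrated with a straight line drawn through the two endpoints quoted above. This is only a minimal sketch: the function name is hypothetical, and the line is a plain interpolation between the quoted figures, not the regression reported by Schmitt et al. (2011).

```python
def estimated_comprehension(coverage_pct: float) -> float:
    """Interpolate comprehension (%) from vocabulary coverage (%).

    Assumption: a straight line through (90, 50) and (100, 75),
    the endpoints quoted in the text above.
    """
    low_cov, high_cov = 90.0, 100.0    # share of known words, in percent
    low_comp, high_comp = 50.0, 75.0   # text comprehension, in percent
    if not low_cov <= coverage_pct <= high_cov:
        raise ValueError("coverage must lie within the quoted 90-100% range")
    # Each extra percentage point of coverage adds 2.5 points of comprehension.
    slope = (high_comp - low_comp) / (high_cov - low_cov)
    return low_comp + slope * (coverage_pct - low_cov)


print(estimated_comprehension(98.0))
```

    Under this assumption there is no cut-off point: every additional percent of known words raises the estimate by the same amount.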

  • Definition of vocabulary size
  • The definition of vocabulary size on this website is based on the concept of the word family. The word family of a base word consists of the base word and the words derived from it. Taking “learn” as an example, its word family includes “learned, learnedly, learner, learners, learning, learns, learnt, relearn, relearned, relearning, relearns, unlearn, unlearned”. For analysis, such derivatives found in texts were converted to their base words.
    When this website states 5000 words (word families), it means 5000 base words. A number followed by "word families" denotes a vocabulary size counted on a word family basis. Note that, in general, reading books or newspapers requires 8000-9000 word families (Nation 2006).
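    The conversion of derivatives to base words can be sketched as a table lookup. The table below holds only the “learn” family from the example above, and the names WORD_FAMILIES, MEMBER_TO_BASE, to_base_words and vocabulary_size are hypothetical; the sketch does not claim to reproduce how this website stores or processes its list.

```python
# Hypothetical miniature word family table; a real list is far larger.
WORD_FAMILIES = {
    "learn": ["learned", "learnedly", "learner", "learners", "learning",
              "learns", "learnt", "relearn", "relearned", "relearning",
              "relearns", "unlearn", "unlearned"],
}

# Invert the table so that every family member, and the base word itself,
# maps to its base word.
MEMBER_TO_BASE = {base: base for base in WORD_FAMILIES}
for base, members in WORD_FAMILIES.items():
    for member in members:
        MEMBER_TO_BASE[member] = base


def to_base_words(tokens):
    """Lowercase each token and replace known family members with their base word."""
    return [MEMBER_TO_BASE.get(t.lower(), t.lower()) for t in tokens]


def vocabulary_size(tokens):
    """Count distinct base words (word families) among tokens found in the table."""
    lowered = (t.lower() for t in tokens)
    return len({MEMBER_TO_BASE[t] for t in lowered if t in MEMBER_TO_BASE})


print(to_base_words(["Learners", "relearn", "unlearned"]))
print(vocabulary_size(["learns", "learnt", "learning"]))
```

    Because all three tokens in the last call belong to one family, they count as a single word family, which is how the vocabulary sizes on this website should be read.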

  • Readability
  • Many earlier studies have developed readability formulas to measure the difficulty of a text. Most conventional readability formulas use the average sentence length and the vocabulary of a text to estimate its difficulty.

  • Conventional readability formulas
  • These readability formulas determine vocabulary difficulty with indirect measures. Gunning-Fog, Flesch Reading Ease, Flesch-Kincaid, Fry, Powers-Sumner-Kearl and FORCAST rely on the correlation between the number of syllables and the vocabulary difficulty of a text. Coleman-Liau and ARI rely on the relationship between the number of characters in words and vocabulary difficulty. Dale-Chall and Spache measure the proportion of a text's words that are included in their prepared vocabulary lists. These formulas are simple and useful for their purpose; however, their results sometimes diverge from the true difficulty of a text. For example, Shakespeare's works are rated easy to read (~4 in Flesch-Kincaid), while Adventures of Huckleberry Finn by Mark Twain is rated very difficult to read (~17 in Flesch-Kincaid), ratings with which most learners would likely disagree.
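    Two of these formulas, in their commonly published forms, show how indirect the measures are. The sketch below takes pre-computed counts as input, so tokenization and syllable counting are deliberately left out; the function and parameter names are illustrative only.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (commonly published coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """ARI from raw counts (commonly published coefficients)."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Illustrative counts only, not measured from any real book:
# 100 words in 10 sentences, with 150 syllables and 450 characters.
print(round(flesch_kincaid_grade(100, 10, 150), 2))
print(round(automated_readability_index(450, 100, 10), 3))
```

    Neither function ever looks a word up: the scores depend only on sentence length and on syllables or characters per word, so a text of short sentences and short but rare words can still be rated easy.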

  • Vocabulary-based readability adopted by this website
  • This website determines vocabulary difficulty by a more direct measure than the conventional readability formulas. Earlier studies found that the vocabulary difficulty of a text correlates with comprehension of that text (e.g., Schmitt et al. 2011). This website examines the vocabulary of a text against a vocabulary frequency list. The process is similar to that described by Nation (2006), except that this website counts only words included in the word family list and omits words with diacritical marks.
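    A vocabulary-based measure of this kind can be sketched as follows. Everything here is an assumption made for illustration: the toy frequency-ranked list, the member-to-base table, and the rule that a learner with a vocabulary size of N knows the N most frequent base words. It is not this website's actual code. As described above, tokens with diacritical marks and tokens missing from the word family list are left out of the count.

```python
import re
import unicodedata

# Hypothetical toy data: base words ordered from most to least frequent,
# plus a family-member -> base-word table. Real lists are far larger.
FREQUENCY_RANKED_BASES = ["the", "be", "learn", "book", "voyage"]
MEMBER_TO_BASE = {
    "the": "the", "be": "be", "is": "be", "was": "be",
    "learn": "learn", "learner": "learn", "learning": "learn",
    "book": "book", "books": "book", "voyage": "voyage",
}
RANK = {base: i + 1 for i, base in enumerate(FREQUENCY_RANKED_BASES)}


def has_diacritic(word: str) -> bool:
    """True if any character decomposes into a letter plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)


def coverage(text: str, vocabulary_size: int) -> float:
    """Share (%) of counted tokens whose base word ranks within vocabulary_size."""
    normalized = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"[^\W\d_]+", normalized)  # runs of letters only
    counted = [MEMBER_TO_BASE[t] for t in tokens
               if not has_diacritic(t) and t in MEMBER_TO_BASE]
    if not counted:
        return 0.0
    known = sum(1 for base in counted if RANK[base] <= vocabulary_size)
    return 100.0 * known / len(counted)


sample = "The learner was learning the books. The naïve voyage café."
for size in (3, 4, 5):
    print(size, coverage(sample, size))
```

    In the sample, “naïve” and “café” are skipped, so eight tokens are counted; the coverage then rises as the assumed vocabulary size grows to include “book” and “voyage”.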

  • Acknowledgement
  • To classify texts, this website used the BNC (British National Corpus) word family list, courtesy of Prof. Paul Nation of Victoria University of Wellington, New Zealand.
    The bibliographic information in the readability catalog was extracted from catalog.rdf (downloaded 7 Oct 2012) of Project Gutenberg.

  • References
  • Carver R. P., "Percentage of unknown vocabulary words in text as a function of the relative difficulty of the text: implications for instruction", J. Reading Behavior, 26(4) (1994) 413-437.
    Nation I. S. P., "How Large a Vocabulary Is Needed For Reading and Listening?", Canadian Mod. Lang. Rev., 63(1) (2006) 59-82.
    Schmitt N., Jiang X., Grabe W., "The percentage of words known in a text and reading comprehension", Mod. Lang. J., 95(1) (2011) 26-43.

