MCWord: An Orthographic Wordform Database

Language Imaging Laboratory

Medical College of Wisconsin




Welcome to MCWord, an Orthographic Wordform Database.

The purpose of this program is to provide a convenient interface for researchers wishing to obtain lexical (word frequency and neighborhood counts) and sublexical (letter and letter combination) orthographic information about English words. The program also enables users to automatically generate nonword letter strings with specifiable degrees of approximation to English orthography.
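
As an illustration of the lexical and sublexical measures mentioned above, here is a minimal sketch of an orthographic neighborhood count (Coltheart's N: the number of same-length words that differ from a string by one letter substitution) and a position-independent bigram tally. The toy lexicon, the function names, and the choice to weight bigrams by word frequency are assumptions made for this example; they are not taken from MCWord itself.

```python
from collections import Counter

# Hypothetical toy lexicon: wordform -> corpus frequency (not MCWord data).
LEXICON = {"cat": 50, "cot": 10, "cut": 30, "car": 40, "cart": 5, "dog": 60}


def coltheart_n(string, lexicon):
    """Count same-length words that differ from `string` by exactly one letter."""
    return sum(
        1
        for word in lexicon
        if len(word) == len(string)
        and sum(a != b for a, b in zip(word, string)) == 1
    )


def bigram_frequencies(lexicon):
    """Sum word frequencies over every adjacent letter pair, ignoring position.

    Assumption: each bigram is weighted by the frequency of the word it
    occurs in; a type count would add 1 per word instead.
    """
    counts = Counter()
    for word, freq in lexicon.items():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += freq
    return counts
```

In this toy lexicon, "cat" has three neighbors (cot, cut, car), and the same function applies unchanged to nonword strings.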

The database used by MCWord is based on the CELEX efw.cd file. This file includes all the English word forms from a COBUILD corpus of both written and spoken text, which contains approximately 17,900,000 instances of word use. There are approximately 16,600,000 written examples, and 1,300,000 spoken examples.

To compute orthographic frequencies, we trimmed the CELEX database using the following criteria:

These constraints produced a list of 66,372 unique wordforms, with a total wordform count of 16,808,769. Individual wordform counts range from 0 (14,608 instances) to 1,168,607 (the word "a"). Word lengths range from 1 to 22 letters.

This database allows you to (1) retrieve orthographic characteristics of words and nonwords, (2) generate nonwords, and (3) retrieve words from the database using specific orthographic criteria. Click on any of the variable names in the Select Output Variables section to obtain a description of the variable and how it was computed.

A paper describing this database is currently in preparation. In the meantime, if you find this database useful for your research, we would appreciate it if you would use the following citation:

If you have any questions about this database, please email me at medlerd@gmail.com as I am no longer at MCW.

For issues concerning the website, please email trwilliams@mcw.edu.


Select Output Variables

Return All Statistics    
Number of Letters                                     Constrained Unigram Statistics    Unconstrained Unigram Statistics
Frequency of Orthographic Form                        Constrained Bigram Statistics     Unconstrained Bigram Statistics
Orthographic Neighborhood Statistics (Coltheart's N)  Constrained Trigram Statistics    Unconstrained Trigram Statistics


Select Task

(1) Get Word/Nonword Statistics (2) Generate Nonwords (3) Retrieve Words


Enter words into the textbox below:



Select the appropriate retrieval constraints
Constraint (min, max)             Constraint (min, max)
Word Length                       Word Frequency
Orthographic Neighborhood Size    Orthographic Neighborhood Frequency
Constrained Unigram Count         Constrained Unigram Frequency
Constrained Bigram Count          Constrained Bigram Frequency
Constrained Trigram Count         Constrained Trigram Frequency
Unconstrained Unigram Count       Unconstrained Unigram Frequency
Unconstrained Bigram Count        Unconstrained Bigram Frequency
Unconstrained Trigram Count       Unconstrained Trigram Frequency
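
The min/max retrieval constraints can be pictured as a range filter over database entries. The sketch below is only an illustration under stated assumptions: the `Entry` record, its field names, the sample values, and the convention that an empty bound (`None`) leaves that side open are all hypothetical and do not describe how MCWord stores or queries its data.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical record layout; MCWord's actual fields are not shown here.
    word: str
    frequency: int
    neighborhood_size: int


# Made-up sample entries for illustration only.
SAMPLE = [Entry("cat", 50, 3), Entry("cart", 5, 0), Entry("dog", 60, 0)]


def in_range(value, lo, hi):
    """True if value lies within [lo, hi]; None means that bound is open."""
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def retrieve_words(entries, length=(None, None), frequency=(None, None),
                   neighborhood_size=(None, None)):
    """Return the words whose properties satisfy every (min, max) pair."""
    return [
        e.word
        for e in entries
        if in_range(len(e.word), *length)
        and in_range(e.frequency, *frequency)
        and in_range(e.neighborhood_size, *neighborhood_size)
    ]
```

For example, `retrieve_words(SAMPLE, length=(3, 3), frequency=(55, None))` keeps only three-letter entries with a frequency of at least 55.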


Select the appropriate generation constraints
Consonant Strings        Constrained Unigram-Based Strings    Unconstrained Unigram-Based Strings
Random Letter Strings    Constrained Bigram-Based Strings     Unconstrained Bigram-Based Strings
                         Constrained Trigram-Based Strings    Unconstrained Trigram-Based Strings

String Length: Min = Max =
Number of Strings:
Exclude Words and Repeats (Note: this may slow generation).
Maximum Number of Iterations:
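
One way to picture bigram-based nonword generation with the options above is a letter-to-letter Markov chain: each letter is sampled in proportion to how often it follows the previous letter in the lexicon. This is a rough sketch under assumptions, not MCWord's documented procedure. The toy lexicon, the function names, the frequency weighting, and the way the iteration cap and the exclusion of words and repeats are applied are all hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy lexicon: wordform -> corpus frequency (not MCWord data).
WORDS = {"cat": 50, "cot": 10, "cut": 30, "car": 40, "cart": 5, "dog": 60}


def build_transitions(lexicon):
    """Tally first letters and letter-to-letter transitions, weighted by frequency."""
    first = Counter()
    following = defaultdict(Counter)
    for word, freq in lexicon.items():
        first[word[0]] += freq
        for a, b in zip(word, word[1:]):
            following[a][b] += freq
    return first, following


def weighted_choice(counter, rng):
    """Draw one key from a Counter with probability proportional to its count."""
    letters = list(counter)
    return rng.choices(letters, weights=[counter[x] for x in letters], k=1)[0]


def generate_nonwords(lexicon, length, n_strings, max_iterations=1000, seed=0):
    """Sample up to n_strings unique strings of the given length that are not words."""
    rng = random.Random(seed)
    first, following = build_transitions(lexicon)
    results, seen = [], set()
    for _ in range(max_iterations):
        if len(results) == n_strings:
            break
        letters = [weighted_choice(first, rng)]
        # Extend until the target length or a letter with no observed successor.
        while len(letters) < length and following[letters[-1]]:
            letters.append(weighted_choice(following[letters[-1]], rng))
        candidate = "".join(letters)
        # Reject short strings, real words, and repeats.
        if len(candidate) != length or candidate in lexicon or candidate in seen:
            continue
        seen.add(candidate)
        results.append(candidate)
    return results
```

Rejecting words and repeats means some iterations produce nothing, which is why a cap on iterations is useful: with a small lexicon or tight constraints the function may return fewer strings than requested.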
