MCWord: An Orthographic Wordform Database

The purpose of this program is to provide a convenient interface for researchers wishing to obtain lexical (word frequency and neighborhood counts) and sublexical (letter and letter combination) orthographic information about English words. The program also enables users to automatically generate nonword letter strings with specifiable degrees of approximation to English orthography.
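As a rough illustration of what "letter and letter combination" statistics involve, the sketch below counts letter n-grams over a small word list. The toy lexicon, the function name `ngram_counts`, and the `by_position` flag are hypothetical and are not part of MCWord. The position-specific variant reflects one possible reading of position-sensitive counts, not a definition taken from this page.

```python
from collections import Counter

# Hypothetical toy lexicon: wordform -> corpus frequency (not MCWord data).
TOY_LEXICON = {"cat": 10, "cot": 4, "act": 2}

def ngram_counts(lexicon, n, by_position=False):
    """Return (type_counts, token_counts) for letter n-grams.

    type_counts: number of n-gram occurrences across wordforms.
    token_counts: the same occurrences weighted by wordform frequency.
    With by_position=True, keys are (word_length, position, ngram)
    instead of the bare n-gram (an assumed, illustrative variant).
    """
    type_counts, token_counts = Counter(), Counter()
    for word, freq in lexicon.items():
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            key = (len(word), i, gram) if by_position else gram
            type_counts[key] += 1
            token_counts[key] += freq
    return type_counts, token_counts

types, tokens = ngram_counts(TOY_LEXICON, 2)
print(types["ca"], tokens["ca"])
```

Calling the function with `n=1` or `n=3` gives the unigram or trigram equivalents.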

The database used by MCWord is based on the CELEX efw.cd file. This file includes all the English word forms from a COBUILD corpus of both written and spoken text, which contains approximately 17,900,000 instances of word use. Of these, approximately 16,600,000 come from written text and 1,300,000 from spoken text.

To compute orthographic frequencies, we trimmed the CELEX database using the following criteria:

These constraints produced a list of 66,372 unique wordforms, with a total wordform count of 16,808,769. Individual wordform counts range from 0 (14,608 instances) to 1,168,607 (the word "a"). Word length ranges from 1 to 22 letters.

This database allows you to (1) retrieve orthographic characteristics of words and nonwords, (2) generate nonwords, and (3) retrieve words from the database using specific orthographic criteria. Click on any of the variable names in the Select Output Variables section to obtain a description of the variable and how it was computed.

A paper describing this database is currently in preparation. In the meantime, if you find this database useful for your research, we would appreciate it if you would use the following citation:

Select Output Variables

Return All Statistics
Number of Letters	Constrained Unigram Statistics	Unconstrained Unigram Statistics
Frequency of Orthographic Form	Constrained Bigram Statistics	Unconstrained Bigram Statistics
Orthographic Neighborhood Statistics (Coltheart's N)	Constrained Trigram Statistics	Unconstrained Trigram Statistics
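Coltheart's N is conventionally defined as the number of words of the same length that differ from a target string by exactly one letter substitution. The sketch below applies that definition to a hypothetical toy word list. The function name and the data are illustrative only and do not show how MCWord computes the statistic internally.

```python
# Hypothetical toy word list (not MCWord data).
TOY_WORDS = {"cat", "cot", "cut", "bat", "cart", "at", "dog"}

def coltheart_neighbors(target, lexicon):
    """Return same-length words that differ from target in exactly one letter."""
    neighbors = []
    for word in lexicon:
        if len(word) != len(target):
            continue  # substitution neighbors must have the same length
        mismatches = sum(a != b for a, b in zip(word, target))
        if mismatches == 1:
            neighbors.append(word)
    return sorted(neighbors)

word_neighbors = coltheart_neighbors("cat", TOY_WORDS)
print(len(word_neighbors), word_neighbors)

# The target need not be a word, so nonwords can be scored the same way.
nonword_neighbors = coltheart_neighbors("cet", TOY_WORDS)
print(len(nonword_neighbors), nonword_neighbors)
```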

Select Task

(1) Get Word/Nonword Statistics (2) Generate Nonwords (3) Retrieve Words

Enter words into the textbox below:

Select the appropriate retrieval constraints

	min	max		min	max
Word Length:			Word Frequency:
Orthographic Neighborhood Size:			Orthographic Neighborhood Frequency:
Constrained Unigram Count:			Constrained Unigram Frequency:
Constrained Bigram Count:			Constrained Bigram Frequency:
Constrained Trigram Count:			Constrained Trigram Frequency:
Unconstrained Unigram Count:			Unconstrained Unigram Frequency:
Unconstrained Bigram Count:			Unconstrained Bigram Frequency:
Unconstrained Trigram Count:			Unconstrained Trigram Frequency:
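The min/max fields above amount to a range filter over per-word statistics. A minimal sketch of that idea follows, assuming each word is stored as a record of named values. The field names, the records, and the `retrieve` function are hypothetical and do not reflect MCWord's actual storage or query code.

```python
# Hypothetical records; the values are made up for illustration.
TOY_RECORDS = [
    {"word": "cat", "length": 3, "frequency": 10, "n_size": 3},
    {"word": "cart", "length": 4, "frequency": 2, "n_size": 1},
    {"word": "dog", "length": 3, "frequency": 7, "n_size": 0},
]

def retrieve(records, **ranges):
    """Return words whose fields fall inside every (min, max) range.

    A bound of None leaves that side of the range open, which mirrors
    leaving a min or max box empty.
    """
    selected = []
    for rec in records:
        ok = True
        for field, (lo, hi) in ranges.items():
            value = rec[field]
            if (lo is not None and value < lo) or (hi is not None and value > hi):
                ok = False
                break
        if ok:
            selected.append(rec["word"])
    return selected

# Three-letter words with at least one orthographic neighbor.
print(retrieve(TOY_RECORDS, length=(3, 3), n_size=(1, None)))
```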

Select the appropriate generation constraints

Consonant Strings	Constrained Unigram-Based Strings	Unconstrained Unigram-Based Strings
Random Letter Strings	Constrained Bigram-Based Strings	Unconstrained Bigram-Based Strings
	Constrained Trigram-Based Strings	Unconstrained Trigram-Based Strings

String Length: Min = Max =
Number of Strings:
Exclude Words and Repeats (Note: this may slow generation).
Maximum Number of Iterations:
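This page does not describe the generation algorithm, so the following is only a sketch of one way bigram-based strings could be produced: each letter is sampled in proportion to how often it follows the previous letter in a frequency-weighted word list. The toy lexicon, the function names, the add-one weighting, and the `seed` parameter are all assumptions made for illustration. The exclusion check and the iteration cap correspond to the "Exclude Words and Repeats" and "Maximum Number of Iterations" options.

```python
import random
from collections import defaultdict

# Hypothetical toy lexicon: wordform -> corpus frequency (not MCWord data).
TOY_LEXICON = {"cat": 10, "cot": 4, "act": 2, "tack": 3, "coat": 5}

def build_bigram_model(lexicon):
    """Map each letter ('^' marks word start) to weighted successor letters."""
    model = defaultdict(lambda: defaultdict(int))
    for word, freq in lexicon.items():
        prev = "^"
        for letter in word:
            # Add 1 so zero-frequency wordforms still contribute (an assumption).
            model[prev][letter] += freq + 1
            prev = letter
    return model

def generate_strings(lexicon, length, count, max_iterations=1000, seed=0):
    """Sample up to `count` strings of `length` letters, skipping words and repeats."""
    rng = random.Random(seed)
    model = build_bigram_model(lexicon)
    results, seen = [], set()
    for _ in range(max_iterations):
        if len(results) >= count:
            break
        letters, prev = [], "^"
        # Stop early if the previous letter never precedes another letter.
        while len(letters) < length and model.get(prev):
            options = model[prev]
            prev = rng.choices(list(options), weights=list(options.values()))[0]
            letters.append(prev)
        candidate = "".join(letters)
        if len(candidate) == length and candidate not in lexicon and candidate not in seen:
            seen.add(candidate)
            results.append(candidate)
    return results

print(generate_strings(TOY_LEXICON, length=3, count=3))
```

Because candidates that are real words or repeats are rejected, the function can return fewer strings than requested once the iteration cap is reached, which is why excluding them can slow generation.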

Language Imaging Laboratory

Medical College of Wisconsin