Welcome to MCWord, an Orthographic Wordform Database.
The purpose of this program is to provide a convenient interface for researchers wishing to obtain lexical (word frequency and neighborhood counts) and sublexical (letter and letter combination) orthographic information about English words. The program also enables users to automatically generate nonword letter strings with specifiable degrees of approximation to English orthography.
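As a rough illustration of the two kinds of measures mentioned above, the following sketch computes one lexical measure (a neighborhood count) and one sublexical measure (frequency-weighted bigram totals) over a toy lexicon. The lexicon values, the function names, and the exact definitions are assumptions made for the example: a neighbor is taken here to be a same-length wordform differing in exactly one letter position, which may not match the definitions MCWord actually uses.

```python
from collections import Counter

# Toy lexicon: wordform -> corpus count (made-up values, not CELEX data).
TOY_LEXICON = {"cat": 50, "cot": 10, "cut": 30, "car": 40, "bat": 20, "coat": 5}


def neighborhood_count(target, lexicon):
    """Count wordforms of the same length that differ from `target`
    in exactly one letter position (an assumed definition of 'neighbor')."""
    return sum(
        1
        for word in lexicon
        if len(word) == len(target)
        and sum(a != b for a, b in zip(word, target)) == 1
    )


def bigram_frequencies(lexicon):
    """Sum wordform counts over every adjacent letter pair in the lexicon."""
    totals = Counter()
    for word, count in lexicon.items():
        for i in range(len(word) - 1):
            totals[word[i:i + 2]] += count
    return totals


bigrams = bigram_frequencies(TOY_LEXICON)
print(neighborhood_count("cat", TOY_LEXICON))  # neighbors of a word
print(neighborhood_count("cit", TOY_LEXICON))  # neighbors of a nonword
print(bigrams["ca"], bigrams["at"])
```

The same functions work for words and nonwords alike, since neither requires the target string to be in the lexicon.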
The database used by MCWord is based on the CELEX efw.cd file. This file includes all the English word forms from a COBUILD corpus of both written and spoken text, which contains approximately 17,900,000 instances of word use. Of these, approximately 16,600,000 come from written text and 1,300,000 from spoken text.
To compute orthographic frequencies, we trimmed the CELEX database using the following criteria:
These constraints produced a list of 66,372 unique wordforms, with a total wordform count of 16,808,769. Individual wordform counts range from 0 (14,608 wordforms) to 1,168,607 (the word a). Word length ranges from 1 to 22 letters.
This database allows you to (1) retrieve orthographic characteristics of words and nonwords, (2) generate nonwords, and (3) retrieve words from the database using specific orthographic criteria. Click on any of the variable names under Select Output Variables to obtain a description of the variable and how it was computed.
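To give a sense of what nonword generation in option (2) can involve, the sketch below builds letter strings from a bigram chain: each letter is sampled in proportion to how often it follows the previous letter in a toy lexicon, and real words are rejected. This is only one plausible approach under stated assumptions. The lexicon, the function name `generate_nonword`, and the choice of a bigram model are all hypothetical and are not taken from MCWord itself.

```python
import random
from collections import Counter, defaultdict

# Toy lexicon: wordform -> corpus count (made-up values, not CELEX data).
TOY_WORDS = {"cat": 50, "cot": 10, "cut": 30, "car": 40,
             "bat": 20, "coat": 5, "tar": 8, "rat": 12}


def generate_nonword(lexicon, length, rng, max_tries=1000):
    """Return a string of `length` letters that is not in `lexicon`,
    built from attested letter-to-letter transitions, or None on failure."""
    starts = Counter()            # how often each letter begins a wordform
    follow = defaultdict(Counter)  # follow[a][b]: how often b follows a
    for word, count in lexicon.items():
        starts[word[0]] += count
        for a, b in zip(word, word[1:]):
            follow[a][b] += count

    for _ in range(max_tries):
        letters = rng.choices(list(starts), weights=list(starts.values()))
        while len(letters) < length:
            options = follow.get(letters[-1])
            if not options:       # dead end: no attested continuation
                break
            letters += rng.choices(list(options), weights=list(options.values()))
        candidate = "".join(letters)
        if len(candidate) == length and candidate not in lexicon:
            return candidate
    return None


rng = random.Random(0)  # fixed seed so runs are repeatable
print(generate_nonword(TOY_WORDS, 4, rng))
```

Using longer letter sequences (trigrams, for instance) in place of bigrams would yield strings that approximate the source orthography more closely, at the cost of needing a larger lexicon to avoid dead ends.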
A paper describing this database is currently in preparation. In the meantime, if you find this database useful for your research, we would appreciate it if you would use the following citation:
For issues concerning the website, please email firstname.lastname@example.org.