Resources for Psycholinguists

Resources that I am involved with:

WebExp is a set of Java classes designed to run linguistic questionnaire-type experiments on the internet. A number of experiments using WebExp are generally accessible via
Gsearch was a software suite designed to extract sentences from text corpora by syntactic criteria, even where the corpora contain no syntactic markup. Gsearch is pretty old now but if you're interested, you can get the sources here.
Statistical Libraries
Summary is an object-oriented Perl library designed to make summarising data from experiments easy. You write the wrapper which interprets the raw data, and Summary creates tables of means, errors, etc. suitable for further statistical analysis. Summary includes methods for taking out outliers, estimating missing values, etc. It is a work in (constant) progress, but the current version is fully documented and can be downloaded here.
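As a rough illustration of the kind of table described above, here is a minimal Python sketch (not the Summary library itself, which is written in Perl, and not its API). The function names, the 2.5 standard deviation cutoff and the reaction times are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean, stdev

def trim_outliers(values, cutoff=2.5):
    """Drop values more than `cutoff` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= cutoff * s]

def summarise(data, cutoff=2.5):
    """Build a per-condition table of n, mean and standard error."""
    table = {}
    for condition, values in data.items():
        kept = trim_outliers(values, cutoff)
        table[condition] = {
            "n": len(kept),
            "mean": mean(kept),
            "se": stdev(kept) / sqrt(len(kept)),
        }
    return table

# Invented reaction times in ms, for illustration only.
raw = {"ambiguous": [400, 410, 390, 405, 395, 400, 410, 390, 400, 2000]}
table = summarise(raw)
```

Here the 2000 ms value lies more than 2.5 standard deviations from the condition mean, so it is dropped before the mean and standard error are computed.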
Eyenal is designed to help with the analysis of eyetracking experiments. It uses the Summary library above (included in the archive), together with the |stat statistical analysis tools, written by Gary Perlman, to perform instant ANOVA analyses on the output of eyedry. (If you have never heard of eyedry, part of Chuck Clifton's experimental and analysis software for eyetracking experiments, then you don't need this. If you have heard of eyedry, you really, really want it!) Eyenal is written in Perl and should run on any modern computer (although it was designed and tested using Linux).

Eyeplot visualises data created using eyewash, part of Chuck Clifton's eyetracking analysis software (sample eyeplot output here [pdf]). It is written in Perl and relies on the GNU plotutils library, together with its Perl interface Graph::Plotter.

Eyeplot is only known to work under UNIX-like systems, using X windows. If you download it and have any success in making it work on another operating system, please let me know!

Frequency Tools
I have written a couple of programs which are designed to help psycholinguists with material generation. freqdata annotates an input file with frequency (or other) information; freqmatch gives best matches in terms of frequency to target words. Both of these tools rely on the MRC psycholinguistic database (see here for more information and a more traditional interface). Source code for these utilities is available here (should compile on any UNIX-like system).
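The idea of frequency matching can be sketched in a few lines of Python. This is not the freqmatch source or its algorithm. The function name, the use of log frequency, the same-length restriction and the toy frequencies are all assumptions made for illustration.

```python
import math

def closest_by_frequency(target, freqs, k=3, same_length=True):
    """Return the k words whose log frequency is closest to the target's."""
    target_logf = math.log(freqs[target] + 1)
    candidates = [
        w for w in freqs
        if w != target and (not same_length or len(w) == len(target))
    ]
    # Sort by distance in log frequency, breaking ties alphabetically.
    candidates.sort(key=lambda w: (abs(math.log(freqs[w] + 1) - target_logf), w))
    return candidates[:k]

# Invented frequencies, for illustration only (not taken from the MRC database).
toy_freqs = {"cat": 120, "dog": 115, "cup": 60, "emu": 2, "horse": 118}
matches = closest_by_frequency("cat", toy_freqs, k=2)
```

In practice you would probably match on further properties (length, number of syllables, and so on) as well; the same sorting approach extends to a combined distance measure.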
Neighbourhood Lists
As part of a project with Jane Hinton I compiled exhaustive orthographic neighbourhood lists from the British National Corpus. These are available below (each file is approximately 300K, gzipped text): 3-letter words; 4-letter words; 5-letter words; 6-letter words.
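For readers who want to build similar lists from their own word lists, here is a minimal Python sketch. It assumes the usual definition of an orthographic neighbour (a word of the same length that differs in exactly one letter position), which may not be exactly the criterion used for the lists above. The function name and the toy word list are hypothetical.

```python
from collections import defaultdict

def orthographic_neighbours(words):
    """Map each word to the sorted list of same-length words that differ
    from it in exactly one letter position."""
    words = set(words)
    # Group words by "wildcard" patterns, e.g. cat -> *at, c*t, ca*.
    buckets = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + "*" + w[i + 1:]].add(w)
    neighbours = {w: set() for w in words}
    for group in buckets.values():
        for w in group:
            neighbours[w].update(group - {w})
    return {w: sorted(n) for w, n in neighbours.items()}

# Toy word list, for illustration only.
toy = orthographic_neighbours(["cat", "cot", "cut", "dog", "cog", "cart"])
```

Grouping by wildcard pattern avoids comparing every pair of words, which matters for corpus-sized vocabularies.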
Get rid of cruft!
UNIX/Linux user? Fed up with countless '~' and '.bak' backup files? This small utility is what I use to remove stuff that I can live without (specifically, backup files, and the byproducts of LaTeX runs). It's reasonably well-tested but I take no responsibility if it wipes your hard drive or deletes that one file you wanted to keep. I strongly recommend you look at the code before you run it. If it works for you (it does for me), it's a useful thing to have lying around.
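If you would rather roll your own, a cautious version can be sketched in Python as below. This is not the utility linked above. The suffix lists and function names are assumptions, and the sketch only reports what it would delete unless you explicitly pass delete=True.

```python
from pathlib import Path

# Assumed suffixes: editor backups plus a few common LaTeX byproducts.
BACKUP_SUFFIXES = ("~", ".bak")
LATEX_SUFFIXES = (".aux", ".toc", ".lof", ".lot")

def find_cruft(root, suffixes=BACKUP_SUFFIXES + LATEX_SUFFIXES):
    """Return a sorted list of files under root whose names end in a cruft suffix."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(suffixes)
    )

def remove_cruft(root, delete=False):
    """List cruft files under root; only delete them if delete=True."""
    found = find_cruft(root)
    for path in found:
        print(("deleting " if delete else "would delete ") + str(path))
        if delete:
            path.unlink()
    return found
```

Calling remove_cruft(".") performs a dry run over the current directory tree; check the output before running it again with delete=True.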

Other Resources:

Spreadsheets for calculating MinF' and Confidence Intervals
Struggling with the JML stats requirements? Rob Hartsuiker has released a couple of invaluable spreadsheets which take all the hard work out of the calculations.
link from Corey McMillan
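For reference, the MinF' calculation itself is short. The Python sketch below follows the formula in Clark (1973): minF' = F1*F2 / (F1 + F2), with denominator degrees of freedom (F1 + F2)^2 / (F1^2/n2 + F2^2/n1), where n1 and n2 are the error degrees of freedom of the by-subjects and by-items analyses. It is not taken from the spreadsheets above. The function name and the example values are invented for illustration.

```python
def min_f_prime(f1, df1_error, f2, df2_error):
    """Return (minF', denominator df) from by-subjects F1 and by-items F2.

    df1_error and df2_error are the error degrees of freedom of F1 and F2.
    The numerator df of minF' is the same as that of F1 and F2.
    """
    min_f = (f1 * f2) / (f1 + f2)
    denom_df = (f1 + f2) ** 2 / (f1 ** 2 / df2_error + f2 ** 2 / df1_error)
    return min_f, denom_df

# Invented example values: F1(1, 20) = 8.0 and F2(1, 30) = 8.0.
min_f, denom_df = min_f_prime(8.0, 20, 8.0, 30)
```

With these example values the result is minF'(1, 48) = 4.0. The denominator df is generally not a whole number and is usually rounded down before looking up the critical value.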
Nonword Database
The ARC nonword database at Macquarie University is an invaluable resource for lexical experiments.
A large collection of images, including "Snodgrass and Vanderwart-like" images
Made available by Michael Tarr at Brown University.
link from Lucy MacGregor
The Beckman Spoken Picture Naming Norms
Norms on picture name agreement, etc., from Griffin & Huitema. More norms and information are available at Zenzi Griffin's website.
Norms for timed picture naming
The UCSD Center for Research in Language is engaged in a large international study to provide norms for timed picture naming in seven different languages (American English, German, Mexican Spanish, Italian, Bulgarian, Hungarian, and the variant of Mandarin Chinese spoken in Taiwan). They currently have data for over 500 pictures.
link from Ciara Catchpole
Frequency Lists derived from the British National Corpus
These extremely useful lists are created and made available by Adam Kilgarriff. They appear to be collated from version 1 of the BNC.
The MRC Psycholinguistic Database
Hosted at the University of Western Australia, the MRC database contains information on linguistic and psychological properties for about 150,000 words (although by no means every property is listed for every word). A comprehensive resource for psycholinguistic experiments (see also the frequency tools above, which use the MRC database as a data source).
Semantic Space and Probability Models

Scott McDonald has provided a web interface to several semantic space models derived from the British National Corpus (and therefore British English: highly recommended).

The LSA model (Landauer & Dumais, 1997) has a mature web interface to several variants derived from American English hosted at CU Boulder.

Also from Scott, lexical probability with respect to a trained neural probabilistic language model (Bengio et al., 2003).

Memory Tests
A collection of semi-automated and automated E-Prime scripts for memory measurement, including Reading Span (see this article for more detail).
link from Michael Jin
Last modified: Mon Oct 22 12:25:59 BST 2007