
The 2008 Oxford MRI Corpus

This data (and the data it refers to) is copyright 2007, 2008 by Greg Kochanski, and is licensed in England under Creative Commons Noncommercial-Attribution License. You may copy and/or use this file (and referenced files) for noncommercial purposes so long as the author is properly acknowledged. For commercial licensing, contact Isis Innovation.


This corpus contains the Magnetic Resonance Imaging data and the associated audio recordings for the ESRC grant "Articulation and Coarticulation in the Lower Vocal Tract", with G. Kochanski and J. Coleman as principal investigators. Data is courtesy of the UK's Economic and Social Research Council, derived from project RES-000-23-1094, 7/2005 through 3/2008. (Personal web pages describing the project can be found here.)

This project was designed to study the lower vocal tract: the region from the velum down to the larynx. This region is hard to study with other techniques such as EMMA, palatography or ultrasound. Our goal was to look for mathematical models that can predict the shape of the airway, based on the dictionary pronunciation of the words in a phrase. We designed the experiment to emphasize contrasts of the phonological [ATR] feature, to try to settle an old dispute about whether or not it is useful in the description of English speech.

This corpus covers both Experiment 2 and Experiment 3 of the proposal. The goal of Experiment 2 was to decide whether real-time MRI imaging was better than "gated" MRI imaging. (Gated MRI sequences are typically used to image the heart; they collect data over a number of repetitions/heartbeats, and reconstruct images from fragments that were collected at the same phase in different heartbeats.) Experiment 3 involves building mathematical models of articulatory positions.
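The gated-reconstruction idea described above can be sketched in a few lines: data fragments collected over many repetitions are grouped by their phase within the repetition cycle, so that each reconstructed frame combines fragments from the same phase of different repetitions. This is an illustrative sketch only; the function name, the data shapes, and the uniform phase binning are assumptions, not the corpus's own reconstruction code.

```python
def bin_by_phase(fragments, n_phases):
    """Group (phase, data) fragments into n_phases bins.

    phase is a float in [0, 1) giving the position within one
    repetition (e.g. one heartbeat, or one spoken phrase).
    Fragments that land in the same bin would be reconstructed
    together into one image frame.
    """
    bins = [[] for _ in range(n_phases)]
    for phase, data in fragments:
        bins[int(phase * n_phases) % n_phases].append(data)
    return bins

# Fragments acquired at similar phases of different repetitions
# end up in the same bin:
frags = [(0.05, "a1"), (0.52, "b1"), (0.07, "a2"), (0.55, "b2")]
print(bin_by_phase(frags, 2))  # [['a1', 'a2'], ['b1', 'b2']]
```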

Experiment 2 is reported in detail in a paper: C. Alvey, C. Orphanidou, J. Coleman, A. McIntyre, S. Golding and G. Kochanski, "Image quality in non-gated versus gated reconstruction of tongue motion using Magnetic Resonance Imaging: A comparison using automated image processing", in press (2008) at the International Journal of Computer Assisted Radiology and Surgery (Springer, ISSN 1861-6410 [print version], ISSN 1861-6429 [electronic version]), doi:10.1007/s11548-008-0218-5. (A personal copy of that paper can be found at .) There is also an abstract and slides of a talk that was presented at Computer Assisted Radiology and Surgery: 22nd International Congress and Exhibition, Barcelona, Spain, 26 June 2008.

Data collection and processing for Experiment 3 is described in the attached draft.


All the metadata is currently truncated and hand-anonymized. This is only useful for getting a general understanding of the file formats.

Metadata files describe the recordings. The experimental data itself consists of speech recordings, stored in subdirectories, along with hand-checked files that mark the beginning and end of utterances, and hand-checked positions for finger taps and metronome ticks.

This corpus of data consists partly of short files of repetitive speech: phrases like "Nothing Matters. Nothing Matters. Nothing Matters. ..." (There are 75 different phrases.) The remainder consists of the same phrases (and a few others) spoken in a more standard laboratory phonology context: a randomized list of phrases.

It also includes some longer, rhythmic passages from Dr. Seuss.

The speakers are all speakers of Southern British English. The corpus contains 1308 audio files from 14 speakers, totalling 2.6 gigabytes of uncompressed audio.

The corpus contains a large number of directories, each of which contains several files of interest:

Data Files


raw.wav

The original recording, in Microsoft WAV format. It is a two-channel file. One channel contains the recorded speech, and the other channel contains either metronome ticks or an audio channel from a microphone positioned to pick up finger taps. (The subject's finger tapped on a hardcover book about 2 cm from the microphone.) The finger-tap channel will pick up some speech, but faintly, and the speech channel will pick up some finger-tap sounds. However, metronome ticks were coupled in electronically and are completely isolated from the speech channel.
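Separating the two channels described above is straightforward with the Python standard library; interleaved stereo frames just need de-interleaving. A minimal sketch, assuming 16-bit samples (the function name is ours, and which channel holds speech versus taps should be checked per file):

```python
import wave, array

def split_channels(path):
    """Read a 16-bit stereo WAV file and return (left, right)
    sample arrays.  In this corpus, one channel holds the speech
    and the other the metronome ticks or finger-tap microphone;
    which is which may need checking per recording."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 2, "expected a two-channel file"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        # Frames are interleaved L, R, L, R, ...
        samples = array.array("h", w.readframes(w.getnframes()))
    return samples[0::2], samples[1::2]
```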


ue.lbl

These are the start and end points of the speech in the utterance, automatically generated but checked for accuracy by a human. A small amount of silence (probably <100 ms) is included within the marked endpoints on either side of the utterance. See the above publication for details. The data files are in a format suitable for reading by the ESPS package Xwaves, and can be read by Wavesurfer. Python 2.5 code for reading these files is available on Sourceforge, in the speechresearch project, in file gmisclib/ . In brief, the format consists of a number of header lines of largely useless information, then a line consisting of a single hash mark ('#'), then two relevant lines. The line containing an asterisk in the third field marks the utterance start (the time is in the first field); likewise, the line containing '%' marks the end. Times are relative to the beginning of the raw.wav files.
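The parsing rules just described (skip the header up to '#', then read the times off the '*' and '%' lines) fit in a few lines of Python. This is a minimal sketch following the description above, not the project's own reader (which lives in gmisclib on Sourceforge):

```python
def read_endpoints(path):
    """Parse an ESPS/Xwaves-style label file.

    Skips header lines up to the '#' separator; then the line whose
    third field is '*' gives the utterance start time and the line
    whose third field is '%' gives the end time (first field, in
    seconds relative to the start of raw.wav).
    """
    start = end = None
    with open(path) as f:
        # Skip the header, which ends with a lone '#'.
        for line in f:
            if line.strip() == "#":
                break
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                if fields[2] == "*":
                    start = float(fields[0])
                elif fields[2] == "%":
                    end = float(fields[0])
    return start, end
```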


This file contains experimental tick or tap events. For the metronome data, it contains the times at which metronome ticks occur. For the "tick" data, if it exists, it lists the times at which the subject's finger tapped to mark a stressed syllable. This is computed from one of the channels of the raw.wav file, but manually checked. This file is in the Xwaves label format, same as ue.lbl.


This file contains computed tick or tap locations. It is meaningful only for metronome data, where it simply marks the metronome ticks.

This file (like other files with the ".dat" extension) is stored in the GPK ASCII Image format. This can be read by code available on Sourceforge, in the speechresearch project, in file gpkio/ascii_read.c and gpkio/read.c . (Note: the gpklib library is required for this code; it can be found in the gpklib subdirectory of the same project.) A Python interface to these libraries is available in the gpk_img_python subdirectory of the same project, and is documented at .

This data format consists of a header, followed by data. The header consists of lines in the form attribute = value, and the data section is a two-dimensional array of values, either in ASCII, in IEEE-754 binary format for floating-point values, or in binary integer formats. The header information loosely follows NASA's FITS standard (Flexible Image Transport System). (Incidentally, the same software will read and write FITS-format images, too.)
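For the ASCII variant, a reader following the description above is short. This is a sketch under stated assumptions (the exact header/data delimiter and any quoting rules are not specified here); the real readers are gpkio/ascii_read.c and the gpk_img_python wrapper mentioned above:

```python
def read_gpk_ascii(path):
    """Minimal reader for the ASCII variant of the GPK image format
    as sketched above: 'attribute = value' header lines, followed by
    a two-dimensional array of ASCII numbers, one row per line.
    Binary (IEEE-754 / integer) data sections are not handled."""
    header, data = {}, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if "=" in line and not data:
                # Header line: attribute = value
                key, _, value = line.partition("=")
                header[key.strip()] = value.strip()
            elif line:
                # Data row: whitespace-separated numbers
                data.append([float(x) for x in line.split()])
    return header, data
```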

Other Files

Other files are computed from the raw data, and are preserved for convenience. These were used in the "What marks the beat of speech?" paper.


An irregularity measure that separates voiced speech from unvoiced. It quantifies speech that is not fully voiced.


The perceptual loudness.


A measure of duration for the current syllable. Essentially, it measures how far one can go (in time) before the spectrum changes substantially.


The RMS (intensity or power).


A standard computation of the speech fundamental frequency.


A measurement of the average slope of the speech spectrum.


A small subset of the corpus (one subject's worth of data, about 7% of the corpus) is available in ZIP archive format or in Tar format (Linux), compressed with gzip. For the full data set, please contact Greg Kochanski, greg.kochanski (at) . Eventually, the data set is expected to become available from the Economic and Social Data Service and/or from another web server.


When using the data with "rep*" in the "text" field, the appropriate publication to reference is DOI: 10.1121/1.2890742, "What Marks the Beat of Speech?" G. Kochanski and C. Orphanidou, Journal of the Acoustical Society of America, ISSN 0001-4966, Volume 123(5), pages 2780-2791.

Files whose text field is in the form "sent" are long lists of randomized sentences. These "sent" files were used, along with the "rep*" files, in another publication: "Testing the Ecological Validity of Repetitive Speech", Greg Kochanski and Christina Orphanidou, presented at the 2007 International Congress of Phonetic Sciences (ICPhS2007), 6-10 August 2007. It is available on the web.


Utterances with "rep*" in the text field are repetitive speech; each phrase is repeated 10-15 times in succession. Files where the text field equals "fox", "king", or "lucky" are longer texts that were not used. They are from three books by Dr. Seuss (Theodor Geisel).

More detailed documentation is in the file that contains the bulk of the metadata.

Last Modified Mon Jun 30 01:06:13 2008. Greg Kochanski.