Script l_classifier

A classifier that assumes that P is linear in position. This is known as (linear) logistic discriminant analysis.
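As a rough illustration of what a linear logistic discriminant computes: in the usual textbook formulation (see the Webb reference under Notes), each class gets a score that is a linear function of the feature vector, and the scores are normalized into probabilities. The sketch below assumes that formulation. The function name, weights, and offsets are hypothetical and are not taken from l_classifier itself.

```python
import math

def class_probabilities(x, weights, offsets):
    """Return one probability per class from linear scores (hypothetical helper).

    Each class k has score_k = dot(weights[k], x) + offsets[k]; the scores are
    turned into probabilities with a numerically stable softmax.
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, offsets)]
    top = max(scores)                      # subtract the max to avoid overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two classes, two-dimensional feature vector (made-up numbers).
probs = class_probabilities([2.0, 1.0],
                            weights=[[0.5, -0.25], [-0.5, 0.25]],
                            offsets=[0.0, 0.0])
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The class with the largest probability is taken as the prediction; because the scores are linear in x, the boundary between any two classes is a hyperplane.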

This is a script that can be run from the Linux command line.

Usage: l_classifier [flags] < input_data > log_file

The script also produces two files: classes.chunk and classified.fiat.

Flags:

-test Run some tests.

-D Print extra debug information. Repeated -D flags increase verbosity.

-quiet Print less.

-c 0.NN Ignore the specified fraction (0 <= 0.NN < 1) of the worst classifications. See q_classifier_r.evaluate_Bayes for details. When building the classifiers, if 0.NN > 0, this essentially says that "nothing is extremely improbable, because there's a 0.NN chance that it is just a mistake." This makes the classifier boundaries less sensitive to points on the outskirts of regions.
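One possible reading of the "chance that it is just a mistake" idea is a mixture: with probability 0.NN the label is treated as a random error, so no datum can contribute an arbitrarily large penalty. The sketch below is only that reading, under the assumption that mistakes are spread uniformly over the classes. The function name is hypothetical, and the actual behaviour is whatever q_classifier_r.evaluate_Bayes implements.

```python
import math

def robust_log_likelihood(p_correct, c, n_classes):
    """Log-likelihood of one datum with a uniform 'mistake' floor (assumed form).

    p_correct : probability the classifier assigns to the true class
    c         : assumed fraction of labels that are simply mistakes (0 <= c < 1)
    n_classes : number of distinct classes
    """
    return math.log((1.0 - c) * p_correct + c / n_classes)

# An outlying point that the model finds almost impossible:
harsh = robust_log_likelihood(1e-12, c=0.0, n_classes=2)    # very large penalty
gentle = robust_log_likelihood(1e-12, c=0.05, n_classes=2)  # bounded penalty
```

With c = 0 the outlier dominates the fit; with c > 0 its penalty is bounded below by log(c / n_classes), which is why boundaries become less sensitive to points on the outskirts.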

-ftest 0.NN Use a fraction (0 < 0.NN < 1) of the data for the test set; the remainder is used for training the classifiers.

-coverage N.N This script generates a group of classifiers for a particular test-set/training-set pair, then samples a new test set and repeats until the average datum has appeared in a test set N.N times.
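The interaction of -ftest and -coverage can be pictured as a resampling loop. The sketch below is an assumption about the general scheme, not the script's actual code: it draws random test sets of a fixed fraction and stops once the mean number of test-set appearances per datum reaches the requested coverage. The names sample_splits, n_data, and seed are hypothetical.

```python
import random

def sample_splits(n_data, ftest, coverage, seed=0):
    """Draw (train, test) index splits until the average coverage is reached."""
    rng = random.Random(seed)
    n_test = max(1, round(ftest * n_data))   # size of each test set
    counts = [0] * n_data                    # test-set appearances per datum
    splits = []
    while sum(counts) / n_data < coverage:
        test = set(rng.sample(range(n_data), n_test))
        for i in test:
            counts[i] += 1
        train = [i for i in range(n_data) if i not in test]
        splits.append((train, sorted(test)))
    return splits

splits = sample_splits(n_data=100, ftest=0.2, coverage=3.0)
```

Under these assumptions the number of test-set/training-set pairs is roughly coverage divided by ftest, so a small -ftest combined with a large -coverage means many rounds of classifier building.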

-nperdim N.N This controls how many classifiers are generated per test-set/training-set pair. The number is N.N times the number of dimensions in the feature vector.

-i filename Take input from the specified file instead of the standard input.

The input data is a multicolumn ASCII file with one line per measurement to be classified. Columns are separated by whitespace and are:

1: The correct class (i.e. the 'gold standard'). Obviously, the classification problem gets more difficult as the number of distinct classes gets larger.

2 - N+1: The various components of the feature vector to be used for classification. All lines must have the same number of components.

*: An optional hash mark followed by an arbitrary identifier for that measurement. (If no identifier is supplied, it will be called "Line:NN" based on the line number in this input file.) Identifiers don't affect the computation, but they do let you connect values in the output files to feature vectors in the input file.
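The column layout above can be read with a few lines of Python. This is a minimal sketch rather than the script's own parser: it assumes the class label can be kept as a string, that line numbers in the default identifier start at 1, and that blank lines are skipped. The function names are hypothetical.

```python
def parse_line(line, line_number):
    """Split one input line into (true_class, feature_vector, identifier)."""
    data, _, ident = line.partition("#")       # optional '#' starts the identifier
    fields = data.split()                      # whitespace-separated columns
    true_class = fields[0]                     # column 1: the gold-standard class
    features = [float(f) for f in fields[1:]]  # columns 2..N+1: feature vector
    ident = ident.strip() or "Line:%d" % line_number
    return true_class, features, ident

def parse_input(lines):
    """Parse all lines and check that every feature vector has the same length."""
    rows = [parse_line(ln, n)
            for n, ln in enumerate(lines, start=1) if ln.strip()]
    if len({len(features) for _, features, _ in rows}) > 1:
        raise ValueError("all lines must have the same number of components")
    return rows

# Made-up sample data: two measurements with two-component feature vectors.
rows = parse_input(["A 1.0 2.5 # utt_17", "B 0.3 4.1"])
```

The second sample line has no identifier, so it receives the default "Line:2".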

The standard output contains miscellaneous progress information, plus lines prefixed with "WRONG" that list incorrect classifications. Comprehensive classification information is in classes.chunk, which lists all the classifiers that were generated and contains enough information to reconstruct them so that they can be applied to another set of data. (classes.chunk is in a format readable by chunkio.py.) Reading it with read_classified.read_classes_header is recommended if you just want the top few lines of header information, or with read_classified.read_classes for the full description.

The header contains information on classifier performance. It contains attribute/value pairs as follows:

Pcorrect Average fraction of correct classifications.

PSigma The standard deviation (among the classifiers) of fraction of correct classification.

total Total number of classifications (i.e. number of data times number of classifiers), e.g. 3660.

nok 8 K 0.994231955051 KSigma

Chance The overall probability of accidentally making a correct classification. (This is an average over all classifiers.)

ChSigma The standard deviation (among the classifiers) of Chance.

Perfection How well a perfect classifier could perform on that test set (normally 1.0).

PerfectionSigma The standard deviation of Perfection across all classifiers (normally 0).

N_per_dim, Ftest, Coverage Parameters set by command line flags

classifier_type "linear_discriminant_classifier" for this program.
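To make the relationship between the per-classifier results and the header values concrete, here is a minimal sketch of how Pcorrect, PSigma, and total could be derived. It assumes every classifier is scored on the same number of data and uses the population standard deviation; the source does not say which estimator the script uses, and the function name and numbers are hypothetical.

```python
import statistics

def summarize(correct_counts, n_data):
    """Aggregate per-classifier correct counts into header-style statistics."""
    fractions = [c / n_data for c in correct_counts]   # fraction correct per classifier
    return {
        "Pcorrect": statistics.mean(fractions),        # average fraction correct
        "PSigma": statistics.pstdev(fractions),        # spread among classifiers
        "total": n_data * len(correct_counts),         # data times classifiers
    }

# Three classifiers, each scored on 10 data (made-up counts).
header = summarize([8, 9, 7], n_data=10)
```

Comparing Pcorrect against Chance, and PSigma against their difference, gives a quick sense of whether the classifiers perform reliably better than guessing.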

Notes:

A useful general reference is:
@inbook{webb:spr:logistic,
    author = {Andrew Webb},
    title = {Statistical Pattern Recognition},
    pages = {124--132},
    year = {1999},
    publisher = {Arnold},
    address = {London, New York},
    note = {ISBN 0 340 74164 3}
}
This code was described in an appendix to "Dimensions of durational variation in speech", by Anastassia Loukina, Greg Kochanski, Burton Rosner, Chilin Shih, and Elinor Keane, submitted 2010 to J. Acoustical Society of America.

Imports: die, Q, QC