Package classifiers :: Module q_classifier_r

Module q_classifier_r

source code

This is a support module, used by many types of classifiers.

Classes
  datum_c
This is an unclassified datum, either in the test or training set.
  datum_tr
This is a datum where we know the true class, presumably in the training set.
  grouper_c
A 'grouper' function takes a DUID (a unique i.d.) and maps it to a group name.
  model_template
This class describes how to compute the relative probability that a datum is a member of a particular class.
  qmodel
  lmodel
  classifier_desc
This is a thing that describes and generates classifiers.
  classifier
This is the base class for all kinds of classifiers.
  evaluate_match_w_rare
This is called in the same way as evaluate_match or evaluate_Bayes.
Functions
str
Hash(dl)
Returns: a hash of the UIDs of data items.
source code
str
Hash1(l)
Returns: a hash of a vector.
source code
 
prior(training)
This computes the probability of correct classification, assuming you can't see the feature vector.
source code
 
max_correct(training, testing)
This is a hard, conservative upper limit for the probability of correct classification.
source code
list(datum_tr)
read_data(fd, commentarray=None)
Reads in feature vectors where the first element is the true class.
source code
int
get_dim(fd)
This function takes a list of data (type datum_tr) and makes sure that they all have the same length feature vector.
source code
 
compute_cross_class(training, testing, modelchoice=None, n_per_dim=None, builder=None, classout=None, trainingset_name=None, modify_class=None, verbose=True)
Build classifiers based on the training set, and test them on the testing set.
source code
 
compute_self_class(d, coverage=None, ftest=None, modelchoice=None, n_per_dim=None, modify_class=None, builder=None, classout=None, verbose=True)
Modelchoice here is expected to take one argument: the data.
source code
 
list_groups(d, gr) source code
 
compute_group_class(dg, modelchoice=None, n_per_dim=None, builder=None, classout=None, ftest=None, grouper=None, coverage=None, modify_class=None, verbose=True)
This function makes sure that the training set and testing set come from different groups.
source code
 
qzmodel(ndim) source code
 
lzmodel(ndim) source code
 
forest_build(data, N, modelchoice=None, trainingset_name=None)
Build a forest of classifiers.
source code
int
evaluate_match(cl, data)
This can be passed into a classifier descriptor as the evaluate argument.
source code
float
evaluate_Bayes(cl, data, constrain=0.0)
Evaluates the negative log of the probability that the classifier would assign to the datum being in the observed class (i.e. whatever class is specified in the datum_tr).
source code
 
default_writer(summary, out, classout, wrong, fname='classes.chunk')
This writes out classifiers to a data file.
source code
 
count_classes(data)
Count how many instances there are of each class.
source code
 
list_classes(data)
List the names of the classes in a dataset, with the most populous classes first.
source code
str
name_of_evaluator(e)
Used to get the name of an evaluator, to write it to a file header.
source code
 
evaluator_from_name(nm)
Maps a name to a function that will evaluate how well a classifier performs.
source code
 
default_modify_class(qc, training_counts, testing_counts)
Modifies a classifier so it isn't so dominated by the most frequent classes.
source code
Variables
  ERGCOVER = 4.0
  D = False
  CONSTRAIN = 1e-06
  __package__ = 'classifiers'

Imports: re, math, zlib, numpy, chunkio, DS, die, mcmc, mcmc_helper, g_implements, fiatio, dictops, gpkmisc, DV, gpkavg


Function Details

Hash(dl)

source code 
Parameters:
  • dl (list(datum_c)) - A list of data
Returns: str
a hash of the UIDs of data items.

Hash1(l)

source code 
Parameters:
  • l - A list of data
Returns: str
a hash of a vector.

prior(training)

source code 

This computes the probability of correct classification, assuming you can't see the feature vector. It is used to compute P(chance). It assumes that you choose class C with probability 1 if P(C) is the biggest among all the classes.
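The rule described above (choose the class with the largest prior, with probability 1) reduces to the frequency of the most common class. A minimal sketch, taking a plain list of true class labels in place of the module's training-set objects:

```python
from collections import Counter

def prior_probability(true_classes):
    """P(chance): the accuracy of always guessing the most frequent
    class, ignoring the feature vector entirely."""
    counts = Counter(true_classes)
    return max(counts.values()) / len(true_classes)
```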

read_data(fd, commentarray=None)

source code 

Reads in feature vectors where the first element is the true class. This is the main data input for l_classifier, qdg_classifier and qd_classifier.

Parameters:
  • fd (file)
  • commentarray (a list or None)
Returns: list(datum_tr)
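The input format implied above is one record per line: the true class first, then the feature vector. A sketch of parsing a single such line (the whitespace-separated layout is inferred from the description, not confirmed by it):

```python
def parse_line(line):
    """Split one record into (true_class, feature_vector).
    Assumes whitespace-separated fields with the class label first."""
    tokens = line.split()
    return tokens[0], [float(x) for x in tokens[1:]]
```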

get_dim(fd)

source code 

This function takes a list of data (type datum_tr) and makes sure that they all have the same length feature vector. If so, it reports the length (dimension) of the feature vector.

Parameters:
  • fd (list(datum_tr)) - the list of data to check
Returns: int
length of vectors
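The check described above is simple enough to sketch directly: collect the distinct vector lengths, insist there is exactly one, and return it. This is an illustrative reimplementation over plain lists, not the module's own code:

```python
def get_dim_sketch(vectors):
    """Verify every feature vector has the same length; return it.
    Raises ValueError if the lengths disagree."""
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError("inconsistent feature vector lengths: %s"
                         % sorted(dims))
    return dims.pop()
```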

compute_cross_class(training, testing, modelchoice=None, n_per_dim=None, builder=None, classout=None, trainingset_name=None, modify_class=None, verbose=True)

source code 

Build classifiers based on the training set, and test them on the testing set. Modelchoice here is the completed class object, not a closure.

compute_group_class(dg, modelchoice=None, n_per_dim=None, builder=None, classout=None, ftest=None, grouper=None, coverage=None, modify_class=None, verbose=True)

source code 

This function makes sure that the training set and testing set come from different groups. The 'grouper' returns a group name, when given a datum. Modelchoice is expected to take one argument, the training set.

Parameters:
  • grouper (function from datum_tr to str) - function returning a group name for each datum
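The guarantee above (training and testing sets never share a group) amounts to a leave-one-group-out split. A sketch of that partitioning, with the grouper applied to plain tuples rather than datum_tr objects:

```python
from collections import defaultdict

def split_by_group(data, grouper):
    """Yield (group_name, train, test) triples where the test set is
    one whole group and the training set is every other group, so the
    two sets never share a group."""
    groups = defaultdict(list)
    for d in data:
        groups[grouper(d)].append(d)
    for name in groups:
        train = [d for g, members in groups.items() if g != name
                 for d in members]
        yield name, train, groups[name]
```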

forest_build(data, N, modelchoice=None, trainingset_name=None)

source code 

Build a forest of classifiers.

Parameters:
  • data (datum_c) - data to train the classifiers on.
  • N (int) - how many classifiers to build.
  • modelchoice (subclass of model_template) - what kind of classifier to build
  • trainingset_name (str) - (stored for later use).

evaluate_match(cl, data)

source code 

This can be passed into a classifier descriptor as the evaluate argument. It returns the number of exact matches between the classified data and the input, true classification.

Parameters:
  • cl (typically a subclass of classifier) - some classifier that has a bestc() method.
  • data (typically a list of datum_c or a subclass) - the data points, each carrying its true class.
Returns: int
the number of exact matches between the predicted and true classes

evaluate_Bayes(cl, data, constrain=0.0)

source code 

Evaluates the negative log of the probability that the classifier would assign to the datum being in the observed class (i.e. whatever class is specified in the datum_tr). Obviously, you want this to be a relatively small number.

Returns: float
the negative log of the probability of being in the observed class.

If cl.cdesc.ftrim is not None, we assume that some of the data in each class are dubious and should be ignored if they are sufficiently improbable. We modify the probability scores of the data that are among the worst (the cl.cdesc.ftrim[0] fraction), limiting those scores to be no more than cl.cdesc.ftrim[1] above the best score. This lets you limit by score, limit by fraction, or any mixture in between. If cl.cdesc.ftrim is None, no limiting or trimming is done.
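The ftrim rule above can be sketched on a plain list of scores. Scores here are negative log-probabilities (smaller is better), and the worst `ftrim[0]` fraction is capped at the best score plus `ftrim[1]`; the rounding of the fraction to a count is an assumption of this sketch:

```python
def trim_scores(scores, ftrim):
    """Cap the worst ftrim[0] fraction of scores at best + ftrim[1].
    With ftrim=None the scores pass through unchanged."""
    if ftrim is None:
        return list(scores)
    frac, delta = ftrim
    cap = min(scores) + delta
    n_trim = int(frac * len(scores))
    # Indices of the n_trim largest (worst) scores.
    worst = sorted(range(len(scores)),
                   key=lambda i: scores[i])[len(scores) - n_trim:]
    out = list(scores)
    for i in worst:
        out[i] = min(out[i], cap)
    return out
```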

default_writer(summary, out, classout, wrong, fname='classes.chunk')

source code 

This writes out classifiers to a data file.

Attention: out needs to be a list, not an iterator, because we use it twice.

count_classes(data)

source code 

Count how many instances there are of each class.

Parameters:
  • data (datum_c) - the data to count
Returns: map from str to int

list_classes(data)

source code 

List the names of the classes in a dataset, with the most populous classes first.

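Ordering class names by frequency is a one-liner with the standard library; a sketch over a plain list of labels rather than the module's datum objects:

```python
from collections import Counter

def list_classes_sketch(true_classes):
    """Class names ordered by frequency, most populous first."""
    return [c for c, _ in Counter(true_classes).most_common()]
```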

name_of_evaluator(e)

source code 

Used to get the name of an evaluator, to write it to a file header.

Parameters:
  • e (function, preferably with a __name__ attribute)
Returns: str

evaluator_from_name(nm)

source code 

Maps a name to a function that will evaluate how well a classifier performs.

Parameters:
  • nm (str) - a printable name
Returns:
a function
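name_of_evaluator and evaluator_from_name are inverses of a sort; a sketch of the likely pattern, using an explicit registry (the real module presumably knows evaluate_match, evaluate_Bayes, and evaluate_match_w_rare; the registry argument here is this sketch's own device):

```python
def name_of_evaluator_sketch(e):
    """Prefer the callable's __name__; fall back to str() for
    unnamed callables."""
    return getattr(e, "__name__", str(e))

def evaluator_from_name_sketch(nm, registry):
    """Inverse lookup through an explicit registry of evaluators."""
    try:
        return registry[nm]
    except KeyError:
        raise KeyError("unknown evaluator: %r" % nm)
```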

default_modify_class(qc, training_counts, testing_counts)

source code 

Modifies a classifier so it isn't so dominated by the most frequent classes.

Parameters:
  • training_counts (map str to int) - how many data are there in each class in the training set
  • testing_counts (map str to int) - how many data are there in each class in the testing set
  • qc (classifier)
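The docs do not say how default_modify_class rebalances the classifier, so the following is purely illustrative of one common choice: compute per-class additive score offsets that cancel the training-set log-priors, so a class cannot win merely by being frequent.

```python
import math

def prior_offsets(training_counts):
    """Additive offsets (one per class) that cancel each class's
    training log-prior. Illustrative only; the rule actually used by
    default_modify_class is not documented here."""
    total = sum(training_counts.values())
    return {c: -math.log(n / total) for c, n in training_counts.items()}
```

Rarer classes get larger offsets, boosting them relative to the dominant ones.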