parent
c20ff6ce19
commit
c6593d02a6
12 changed files with 513 additions and 1483 deletions
@ -1,15 +0,0 @@ |
||||
Extremely randomized trees |
||||
========================== |
||||
|
||||
Extremely randomized trees were introduced by Pierre Geurts, Damien Ernst and Louis Wehenkel in the article "Extremely randomized trees", 2006 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.7485&rep=rep1&type=pdf]. The algorithm for growing Extremely randomized trees is similar to :ref:`Random Trees` (Random Forest), but there are two differences: |
||||
|
||||
#. Extremely randomized trees don't apply the bagging procedure to construct a set of the training samples for each tree. The same input training set is used to train all trees. |
||||
|
||||
#. Extremely randomized trees pick a node split at random (both the variable index and the variable splitting value are chosen randomly), whereas Random Forest finds the best split (the optimal one by variable index and variable splitting value) among a random subset of variables. |
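
The second difference can be sketched in standard C++. This is only an illustration of the description above, not the ``CvERTrees`` implementation: the ``Split`` struct and the ``randomSplit`` function are hypothetical names, and drawing the threshold uniformly between the observed minimum and maximum of the chosen variable is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container for a node split: variable index and threshold.
struct Split {
    std::size_t var;
    double threshold;
};

// samples[i][j] is the value of variable j in sample i.
// Assumes a non-empty, rectangular sample matrix.
Split randomSplit(const std::vector<std::vector<double> >& samples,
                  std::mt19937& rng) {
    // Choose the variable index uniformly at random.
    std::uniform_int_distribution<std::size_t> pickVar(0, samples[0].size() - 1);
    Split split;
    split.var = pickVar(rng);

    // Find the value range of the chosen variable over all samples.
    double lo = samples[0][split.var];
    double hi = lo;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        lo = std::min(lo, samples[i][split.var]);
        hi = std::max(hi, samples[i][split.var]);
    }

    // Choose the splitting value at random inside that range
    // (assumption: uniform draw; a constant variable keeps its only value).
    split.threshold = lo;
    if (lo < hi) {
        std::uniform_real_distribution<double> pickValue(lo, hi);
        split.threshold = pickValue(rng);
    }
    return split;
}
```

A Random Forest node would instead evaluate candidate thresholds on a random subset of variables and keep the one with the best split quality.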
||||
|
||||
|
||||
CvERTrees |
||||
---------- |
||||
.. ocv:class:: CvERTrees : public CvRTrees |
||||
|
||||
The class implements the Extremely randomized trees algorithm. ``CvERTrees`` is inherited from :ocv:class:`CvRTrees` and has the same interface, so see the description of the :ocv:class:`CvRTrees` class for details. To set the training parameters of Extremely randomized trees, the same :ocv:struct:`CvRTParams` structure is used.
@ -1,272 +0,0 @@ |
||||
.. _Gradient Boosted Trees: |
||||
|
||||
Gradient Boosted Trees |
||||
====================== |
||||
|
||||
.. highlight:: cpp |
||||
|
||||
Gradient Boosted Trees (GBT) is a generalized boosting algorithm introduced by |
||||
Jerome Friedman: http://www.salfordsystems.com/doc/GreedyFuncApproxSS.pdf . |
||||
In contrast to the AdaBoost.M1 algorithm, GBT can deal with both multiclass |
||||
classification and regression problems. Moreover, it can use any |
||||
differentiable loss function; some popular ones are implemented. |
||||
Using decision trees (:ocv:class:`CvDTree`) as base learners makes it possible to process ordered |
||||
and categorical variables. |
||||
|
||||
.. _Training GBT: |
||||
|
||||
Training the GBT model |
||||
---------------------- |
||||
|
||||
The Gradient Boosted Trees model represents an ensemble of single regression trees |
||||
built in a greedy fashion. The training procedure is an iterative process |
||||
similar to numerical optimization via the gradient descent method. The total loss |
||||
on the training set depends only on the current model predictions for the |
||||
training samples, in other words |
||||
:math:`\sum^N_{i=1}L(y_i, F(x_i)) \equiv \mathcal{L}(F(x_1), F(x_2), ... , F(x_N)) |
||||
\equiv \mathcal{L}(F)`. And the :math:`\mathcal{L}(F)` |
||||
gradient can be computed as follows: |
||||
|
||||
.. math:: |
||||
grad(\mathcal{L}(F)) = \left( \dfrac{\partial{L(y_1, F(x_1))}}{\partial{F(x_1)}}, |
||||
\dfrac{\partial{L(y_2, F(x_2))}}{\partial{F(x_2)}}, ... , |
||||
\dfrac{\partial{L(y_N, F(x_N))}}{\partial{F(x_N)}} \right) . |
||||
|
||||
At every training step, a single regression tree is built to predict the |
||||
components of the antigradient vector. The step length is computed according to the |
||||
loss function, separately for every region determined by a tree leaf. It |
||||
can be eliminated by changing the values of the leaves directly. |
||||
|
||||
See below the main scheme of the training process: |
||||
|
||||
#. |
||||
Find the best constant model. |
||||
#. |
||||
For :math:`i` in :math:`[1,M]`: |
||||
|
||||
#. |
||||
Compute the antigradient. |
||||
#. |
||||
Grow a regression tree to predict antigradient components. |
||||
#. |
||||
Change values in the tree leaves. |
||||
#. |
||||
Add the tree to the model. |
||||
|
||||
|
||||
The following loss functions are implemented for regression problems: |
||||
|
||||
* |
||||
Squared loss (``CvGBTrees::SQUARED_LOSS``): |
||||
:math:`L(y,f(x))=\dfrac{1}{2}(y-f(x))^2` |
||||
* |
||||
Absolute loss (``CvGBTrees::ABSOLUTE_LOSS``): |
||||
:math:`L(y,f(x))=|y-f(x)|` |
||||
* |
||||
Huber loss (``CvGBTrees::HUBER_LOSS``): |
||||
:math:`L(y,f(x)) = \left\{ \begin{array}{lr} |
||||
\delta\cdot\left(|y-f(x)|-\dfrac{\delta}{2}\right) & : |y-f(x)|>\delta\\ |
||||
\dfrac{1}{2}\cdot(y-f(x))^2 & : |y-f(x)|\leq\delta \end{array} \right.`, |
||||
|
||||
where :math:`\delta` is the :math:`\alpha`-quantile estimation of the |
||||
:math:`|y-f(x)|`. In the current implementation :math:`\alpha=0.2`. |
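
The three regression losses can be written directly from the formulas above. The function names below are hypothetical and are not part of the ``CvGBTrees`` interface; ``delta`` is passed in explicitly rather than estimated from a quantile.

```cpp
#include <cmath>

// Squared loss: L(y, f) = 1/2 * (y - f)^2
double squaredLoss(double y, double f) {
    double r = y - f;
    return 0.5 * r * r;
}

// Absolute loss: L(y, f) = |y - f|
double absoluteLoss(double y, double f) {
    return std::fabs(y - f);
}

// Huber loss: quadratic for small residuals, linear for residuals above delta.
double huberLoss(double y, double f, double delta) {
    double r = std::fabs(y - f);
    if (r > delta)
        return delta * (r - delta / 2.0);
    return 0.5 * r * r;
}
```

With ``y = 3``, ``f = 1`` and ``delta = 1``, the residual is 2, so the squared loss is 2, the absolute loss is 2, and the Huber loss takes the linear branch and equals 1.5.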
||||
|
||||
|
||||
The following loss functions are implemented for classification problems: |
||||
|
||||
* |
||||
Deviance or cross-entropy loss (``CvGBTrees::DEVIANCE_LOSS``): |
||||
:math:`K` functions are built, one function for each output class, and |
||||
:math:`L(y,f_1(x),...,f_K(x)) = -\sum^K_{k=1}1(y=k)\ln{p_k(x)}`, |
||||
where :math:`p_k(x)=\dfrac{\exp{f_k(x)}}{\sum^K_{i=1}\exp{f_i(x)}}` |
||||
is the estimation of the probability of :math:`y=k`. |
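
As a sketch of the deviance loss, the class probabilities and the loss for a known class can be computed as follows. The names ``softmax`` and ``devianceLoss`` are hypothetical, class indices are 0-based in the code, and subtracting the maximum score before exponentiation is a numerical-stability choice added here that does not change the result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Converts the per-class scores f_1..f_K into probabilities p_1..p_K.
std::vector<double> softmax(const std::vector<double>& f) {
    std::vector<double> p(f.size());
    if (f.empty()) return p;
    // Shift by the maximum score to avoid overflow in exp().
    double maxScore = *std::max_element(f.begin(), f.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        p[i] = std::exp(f[i] - maxScore);
        sum += p[i];
    }
    for (std::size_t i = 0; i < f.size(); ++i) p[i] /= sum;
    return p;
}

// Deviance loss for a sample whose true class has the 0-based index y:
// only the term with 1(y = k) survives, giving -ln p_y(x).
// Assumes y < f.size().
double devianceLoss(std::size_t y, const std::vector<double>& f) {
    return -std::log(softmax(f)[y]);
}
```

With two classes and equal scores, both probabilities are 0.5 and the loss is :math:`\ln 2`.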
||||
|
||||
As a result, you get the following model: |
||||
|
||||
.. math:: f(x) = f_0 + \nu\cdot\sum^M_{i=1}T_i(x) , |
||||
|
||||
where :math:`f_0` is the initial guess (the best constant model) and :math:`\nu` |
||||
is a regularization parameter from the interval :math:`(0,1]`, further called |
||||
*shrinkage*. |
||||
|
||||
.. _Predicting with GBT: |
||||
|
||||
Predicting with the GBT Model |
||||
----------------------------- |
||||
|
||||
To get the GBT model prediction, you need to compute the sum of responses of |
||||
all the trees in the ensemble. For regression problems, it is the answer. |
||||
For classification problems, the result is :math:`\arg\max_{i=1..K}(f_i(x))`. |
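
The prediction rule can be sketched as two small helpers. The names are hypothetical, the shrinkage factor follows the model formula :math:`f(x) = f_0 + \nu\cdot\sum^M_{i=1}T_i(x)`, and the class index returned is 0-based.

```cpp
#include <cstddef>
#include <vector>

// Regression: initial guess plus the shrunken sum of all tree responses.
double predictRegression(double f0, double shrinkage,
                         const std::vector<double>& treeResponses) {
    double sum = 0.0;
    for (std::size_t i = 0; i < treeResponses.size(); ++i)
        sum += treeResponses[i];
    return f0 + shrinkage * sum;
}

// Classification: index of the class whose function f_i(x) is largest.
// Assumes classScores is non-empty; ties resolve to the lowest index.
std::size_t predictClass(const std::vector<double>& classScores) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < classScores.size(); ++i)
        if (classScores[i] > classScores[best]) best = i;
    return best;
}
```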
||||
|
||||
|
||||
.. highlight:: cpp |
||||
|
||||
|
||||
CvGBTreesParams |
||||
--------------- |
||||
.. ocv:struct:: CvGBTreesParams : public CvDTreeParams |
||||
|
||||
GBT training parameters. |
||||
|
||||
The structure contains parameters for each single decision tree in the ensemble, |
||||
as well as the whole model characteristics. The structure is derived from |
||||
:ocv:class:`CvDTreeParams` but not all of the decision tree parameters are supported: |
||||
cross-validation, pruning, and class priorities are not used. |
||||
|
||||
CvGBTreesParams::CvGBTreesParams |
||||
-------------------------------- |
||||
.. ocv:function:: CvGBTreesParams::CvGBTreesParams() |
||||
|
||||
.. ocv:function:: CvGBTreesParams::CvGBTreesParams( int loss_function_type, int weak_count, float shrinkage, float subsample_portion, int max_depth, bool use_surrogates ) |
||||
|
||||
:param loss_function_type: Type of the loss function used for training |
||||
(see :ref:`Training GBT`). It must be one of the |
||||
following types: ``CvGBTrees::SQUARED_LOSS``, ``CvGBTrees::ABSOLUTE_LOSS``, |
||||
``CvGBTrees::HUBER_LOSS``, ``CvGBTrees::DEVIANCE_LOSS``. The first three |
||||
types are used for regression problems, and the last one for |
||||
classification. |
||||
|
||||
:param weak_count: Count of boosting algorithm iterations. ``weak_count*K`` is the total |
||||
count of trees in the GBT model, where ``K`` is the output classes count |
||||
(equal to one in case of a regression). |
||||
|
||||
:param shrinkage: Regularization parameter (see :ref:`Training GBT`). |
||||
|
||||
:param subsample_portion: Portion of the whole training set used for each algorithm iteration. |
||||
Subset is generated randomly. For more information see |
||||
http://www.salfordsystems.com/doc/StochasticBoostingSS.pdf. |
||||
|
||||
:param max_depth: Maximal depth of each decision tree in the ensemble (see :ocv:class:`CvDTree`). |
||||
|
||||
:param use_surrogates: If ``true``, surrogate splits are built (see :ocv:class:`CvDTree`). |
||||
|
||||
By default the following constructor is used: |
||||
|
||||
.. code-block:: cpp |
||||
|
||||
CvGBTreesParams(CvGBTrees::SQUARED_LOSS, 200, 0.01f, 0.8f, 3, false) |
||||
: CvDTreeParams( 3, 10, 0, false, 10, 0, false, false, 0 ) |
||||
|
||||
CvGBTrees |
||||
--------- |
||||
.. ocv:class:: CvGBTrees : public CvStatModel |
||||
|
||||
The class implements the Gradient boosted tree model as described in the beginning of this section. |
||||
|
||||
CvGBTrees::CvGBTrees |
||||
-------------------- |
||||
Default and training constructors. |
||||
|
||||
.. ocv:function:: CvGBTrees::CvGBTrees() |
||||
|
||||
.. ocv:function:: CvGBTrees::CvGBTrees( const Mat& trainData, int tflag, const Mat& responses, const Mat& varIdx=Mat(), const Mat& sampleIdx=Mat(), const Mat& varType=Mat(), const Mat& missingDataMask=Mat(), CvGBTreesParams params=CvGBTreesParams() ) |
||||
|
||||
.. ocv:function:: CvGBTrees::CvGBTrees( const CvMat* trainData, int tflag, const CvMat* responses, const CvMat* varIdx=0, const CvMat* sampleIdx=0, const CvMat* varType=0, const CvMat* missingDataMask=0, CvGBTreesParams params=CvGBTreesParams() ) |
||||
|
||||
.. ocv:pyfunction:: cv2.GBTrees([trainData, tflag, responses[, varIdx[, sampleIdx[, varType[, missingDataMask[, params]]]]]]) -> <GBTrees object> |
||||
|
||||
The constructors follow conventions of :ocv:func:`CvStatModel::CvStatModel`. See :ocv:func:`CvStatModel::train` for parameters descriptions. |
||||
|
||||
CvGBTrees::train |
||||
---------------- |
||||
Trains a Gradient boosted tree model. |
||||
|
||||
.. ocv:function:: bool CvGBTrees::train(const Mat& trainData, int tflag, const Mat& responses, const Mat& varIdx=Mat(), const Mat& sampleIdx=Mat(), const Mat& varType=Mat(), const Mat& missingDataMask=Mat(), CvGBTreesParams params=CvGBTreesParams(), bool update=false) |
||||
|
||||
.. ocv:function:: bool CvGBTrees::train( const CvMat* trainData, int tflag, const CvMat* responses, const CvMat* varIdx=0, const CvMat* sampleIdx=0, const CvMat* varType=0, const CvMat* missingDataMask=0, CvGBTreesParams params=CvGBTreesParams(), bool update=false ) |
||||
|
||||
.. ocv:function:: bool CvGBTrees::train(CvMLData* data, CvGBTreesParams params=CvGBTreesParams(), bool update=false) |
||||
|
||||
.. ocv:pyfunction:: cv2.GBTrees.train(trainData, tflag, responses[, varIdx[, sampleIdx[, varType[, missingDataMask[, params[, update]]]]]]) -> retval |
||||
|
||||
The first train method follows the common template (see :ocv:func:`CvStatModel::train`). |
||||
Both ``tflag`` values (``CV_ROW_SAMPLE``, ``CV_COL_SAMPLE``) are supported. |
||||
``trainData`` must be of the ``CV_32F`` type. ``responses`` must be a matrix of type |
||||
``CV_32S`` or ``CV_32F``. In both cases it is converted into the ``CV_32F`` |
||||
matrix inside the training procedure. ``varIdx`` and ``sampleIdx`` must be a |
||||
list of indices (``CV_32S``) or a mask (``CV_8U`` or ``CV_8S``). ``update`` is |
||||
a dummy parameter. |
||||
|
||||
The second form of :ocv:func:`CvGBTrees::train` function uses :ocv:class:`CvMLData` as a |
||||
data set container. ``update`` is still a dummy parameter. |
||||
|
||||
All parameters specific to the GBT model are passed into the training function |
||||
as a :ocv:class:`CvGBTreesParams` structure. |
||||
|
||||
|
||||
CvGBTrees::predict |
||||
------------------ |
||||
Predicts a response for an input sample. |
||||
|
||||
.. ocv:function:: float CvGBTrees::predict(const Mat& sample, const Mat& missing=Mat(), const Range& slice = Range::all(), int k=-1) const |
||||
|
||||
.. ocv:function:: float CvGBTrees::predict( const CvMat* sample, const CvMat* missing=0, CvMat* weakResponses=0, CvSlice slice = CV_WHOLE_SEQ, int k=-1 ) const |
||||
|
||||
.. ocv:pyfunction:: cv2.GBTrees.predict(sample[, missing[, slice[, k]]]) -> retval |
||||
|
||||
:param sample: Input feature vector that has the same format as every training set |
||||
element. If not all the variables were actually used during training, |
||||
``sample`` contains forged values at the appropriate places. |
||||
|
||||
:param missing: Missing values mask, which is a matrix of the same size as |
||||
``sample`` having the ``CV_8U`` type. ``1`` corresponds to the missing value |
||||
in the same position in the ``sample`` vector. If there are no missing values |
||||
in the feature vector, an empty matrix can be passed instead of the missing mask. |
||||
|
||||
:param weakResponses: Matrix used to obtain predictions of all the trees. |
||||
The matrix has :math:`K` rows, |
||||
where :math:`K` is the count of output classes (1 for the regression case). |
||||
The matrix has as many columns as the ``slice`` length. |
||||
|
||||
:param slice: Parameter defining the part of the ensemble used for prediction. |
||||
If ``slice = Range::all()``, all trees are used. Use this parameter to |
||||
get predictions of the GBT models with different ensemble sizes learning |
||||
only one model. |
||||
|
||||
:param k: Number of tree ensembles built in case of the classification problem |
||||
(see :ref:`Training GBT`). Use this |
||||
parameter to change the output to sum of the trees' predictions in the |
||||
``k``-th ensemble only. To get the total GBT model prediction, ``k`` value |
||||
must be -1. For regression problems, ``k`` is also equal to -1. |
||||
|
||||
The method predicts the response corresponding to the given sample |
||||
(see :ref:`Predicting with GBT`). |
||||
The result is either the class label or the estimated function value. The |
||||
:ocv:func:`CvGBTrees::predict` method enables using the parallel version of the GBT model |
||||
prediction if OpenCV is built with the TBB library. In this case, predictions |
||||
of single trees are computed in a parallel fashion. |
||||
|
||||
|
||||
CvGBTrees::clear |
||||
---------------- |
||||
Clears the model. |
||||
|
||||
.. ocv:function:: void CvGBTrees::clear() |
||||
|
||||
.. ocv:pyfunction:: cv2.GBTrees.clear() -> None |
||||
|
||||
The function deletes the data set information and all the weak models and sets all internal |
||||
variables to the initial state. The function is called in :ocv:func:`CvGBTrees::train` and in the |
||||
destructor. |
||||
|
||||
|
||||
CvGBTrees::calc_error |
||||
--------------------- |
||||
Calculates a training or testing error. |
||||
|
||||
.. ocv:function:: float CvGBTrees::calc_error( CvMLData* _data, int type, std::vector<float> *resp = 0 ) |
||||
|
||||
:param _data: Data set. |
||||
|
||||
:param type: Parameter defining the error that should be computed: train (``CV_TRAIN_ERROR``) or test |
||||
(``CV_TEST_ERROR``). |
||||
|
||||
:param resp: If non-zero, a vector of predictions on the corresponding data set is |
||||
returned. |
||||
|
||||
If the :ocv:class:`CvMLData` data is used to store the data set, :ocv:func:`CvGBTrees::calc_error` can be |
||||
used to get a training/testing error easily and (optionally) all predictions |
||||
on the training/testing set. If the Intel* TBB* library is used, the error is computed in a |
||||
parallel way, namely, predictions for different samples are computed at the same time. |
||||
In case of a regression problem, a mean squared error is returned. For |
||||
classifications, the result is a misclassification error in percent. |
@ -1,279 +1,126 @@ |
||||
MLData |
||||
Training Data |
||||
=================== |
||||
|
||||
.. highlight:: cpp |
||||
|
||||
For the machine learning algorithms, the data set is often stored in a file of the ``.csv``-like format. The file contains a table of predictor and response values where each row of the table corresponds to a sample. Missing values are supported. The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/) provides many data sets stored in such a format to the machine learning community. The class ``MLData`` is implemented to easily load the data for training one of the OpenCV machine learning algorithms. For float values, only the ``'.'`` separator is supported. The table can have a header, and in such a case the user has to set the number of header lines to skip them during the file reading. |
||||
In machine learning algorithms there is the notion of training data. Training data includes several components: |
||||
|
||||
CvMLData |
||||
-------- |
||||
.. ocv:class:: CvMLData |
||||
|
||||
Class for loading the data from a ``.csv`` file. |
||||
:: |
||||
|
||||
class CV_EXPORTS CvMLData |
||||
{ |
||||
public: |
||||
CvMLData(); |
||||
virtual ~CvMLData(); |
||||
|
||||
int read_csv(const char* filename); |
||||
|
||||
const CvMat* get_values() const; |
||||
const CvMat* get_responses(); |
||||
const CvMat* get_missing() const; |
||||
* A set of training samples. Each training sample is a vector of values (in Computer Vision it's sometimes referred to as feature vector). Usually all the vectors have the same number of components (features); OpenCV ml module assumes that. Each feature can be ordered (i.e. its values are floating-point numbers that can be compared with each other and strictly ordered, i.e. sorted) or categorical (i.e. its value belongs to a fixed set of values that can be integers, strings etc.). |
||||
|
||||
void set_response_idx( int idx ); |
||||
int get_response_idx() const; |
||||
* Optional set of responses corresponding to the samples. Training data with no responses is used in unsupervised learning algorithms that learn the structure of the supplied data based on distances between different samples. Training data with responses is used in supervised learning algorithms, which learn the function mapping samples to responses. Usually the responses are scalar values, ordered (when we deal with a regression problem) or categorical (when we deal with a classification problem; in this case the responses are often called "labels"). Some algorithms, most notably neural networks, can handle not only scalar, but also multi-dimensional or vector responses. |
||||
|
||||
* Another optional component is the mask of missing measurements. Most algorithms require all the components in all the training samples to be valid, but some other algorithms, such as decision trees, can handle the cases of missing measurements. |
||||
|
||||
void set_train_test_split( const CvTrainTestSplit * spl); |
||||
const CvMat* get_train_sample_idx() const; |
||||
const CvMat* get_test_sample_idx() const; |
||||
void mix_train_and_test_idx(); |
||||
* In the case of a classification problem, the user may want to give different weights to different classes. This is useful, for example, when |
||||
* the user wants to shift prediction accuracy towards a lower false-alarm rate or a higher hit-rate. |
||||
* the user wants to compensate for significantly different amounts of training samples from different classes. |
||||
|
||||
const CvMat* get_var_idx(); |
||||
void change_var_idx( int vi, bool state ); |
||||
* In addition to that, each training sample may be given a weight, if the user wants the algorithm to pay special attention to certain training samples and adjust the training model accordingly. |
||||
|
||||
const CvMat* get_var_types(); |
||||
void set_var_types( const char* str ); |
||||
* Also, the user may wish not to use the whole training data at once, but rather use parts of it, e.g. to do parameter optimization via a cross-validation procedure. |
||||
|
||||
int get_var_type( int var_idx ) const; |
||||
void change_var_type( int var_idx, int type); |
||||
As you can see, training data can have a rather complex structure; besides, it may be very big and/or not entirely available, so there is a need for an abstraction of this concept. In OpenCV ml, the ``cv::ml::TrainData`` class serves this purpose. |
||||
|
||||
void set_delimiter( char ch ); |
||||
char get_delimiter() const; |
||||
|
||||
void set_miss_ch( char ch ); |
||||
char get_miss_ch() const; |
||||
|
||||
const std::map<String, int>& get_class_labels_map() const; |
||||
TrainData |
||||
--------- |
||||
.. ocv:class:: TrainData |
||||
|
||||
protected: |
||||
... |
||||
}; |
||||
Class encapsulating training data. Please note that the class only specifies the interface of training data, but not the implementation. All the statistical model classes in ml take ``Ptr<TrainData>``. In other words, you can create your own class derived from ``TrainData`` and pass a smart pointer to an instance of this class into ``StatModel::train``. |
||||
|
||||
CvMLData::read_csv |
||||
------------------ |
||||
Reads the data set from a ``.csv``-like ``filename`` file and stores all read values in a matrix. |
||||
TrainData::loadFromCSV |
||||
---------------------- |
||||
Reads the dataset from a .csv file and returns the ready-to-use training data. |
||||
|
||||
.. ocv:function:: int CvMLData::read_csv(const char* filename) |
||||
.. ocv:function:: Ptr<TrainData> loadFromCSV(const String& filename, int headerLineCount, int responseStartIdx=-1, int responseEndIdx=-1, const String& varTypeSpec=String(), char delimiter=',', char missch='?'); |
||||
|
||||
:param filename: The input file name |
||||
|
||||
While reading the data, the method tries to define the type of variables (predictors and responses): ordered or categorical. If a value of the variable is not numerical (except for the label for a missing value), the type of the variable is set to ``CV_VAR_CATEGORICAL``. If all existing values of the variable are numerical, the type of the variable is set to ``CV_VAR_ORDERED``. So, the default definition of variables types works correctly for all cases except the case of a categorical variable with numerical class labels. In this case, the type ``CV_VAR_ORDERED`` is set. You should change the type to ``CV_VAR_CATEGORICAL`` using the method :ocv:func:`CvMLData::change_var_type`. For categorical variables, a common map is built to convert a string class label to the numerical class label. Use :ocv:func:`CvMLData::get_class_labels_map` to obtain this map. |
||||
|
||||
Also, when reading the data, the method constructs the mask of missing values, for example, values equal to ``'?'``. |
||||
|
||||
CvMLData::get_values |
||||
-------------------- |
||||
Returns a pointer to the matrix of predictors and response values |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_values() const |
||||
|
||||
The method returns a pointer to the matrix of predictor and response ``values`` or ``0`` if the data has not been loaded from the file yet. |
||||
|
||||
The row count of this matrix equals the sample count. The column count equals the predictor count ``+ 1`` for the response (if it exists). This means that each row of the matrix contains the predictor and response values of one sample. The matrix type is ``CV_32FC1``. |
||||
|
||||
CvMLData::get_responses |
||||
----------------------- |
||||
Returns a pointer to the matrix of response values |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_responses() |
||||
|
||||
The method returns a pointer to the matrix of response values or throws an exception if the data has not been loaded from the file yet. |
||||
|
||||
This is a single-column matrix of the type ``CV_32FC1``. Its row count is equal to the sample count. |
||||
|
||||
CvMLData::get_missing |
||||
--------------------- |
||||
Returns a pointer to the mask matrix of missing values |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_missing() const |
||||
|
||||
The method returns a pointer to the mask matrix of missing values or throws an exception if the data has not been loaded from the file yet. |
||||
|
||||
This matrix has the same size as the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) and the type ``CV_8UC1``. |
||||
|
||||
CvMLData::set_response_idx |
||||
-------------------------- |
||||
Specifies index of response column in the data matrix |
||||
|
||||
.. ocv:function:: void CvMLData::set_response_idx( int idx ) |
||||
|
||||
The method sets the index of a response column in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) or throws an exception if the data has not been loaded from the file yet. |
||||
|
||||
The old response columns become predictors. If ``idx < 0``, there is no response. |
||||
|
||||
CvMLData::get_response_idx |
||||
:param headerLineCount: The number of lines at the beginning to skip; besides the header, the function also skips empty lines and lines starting with '#' |
||||
|
||||
:param responseStartIdx: Index of the first output variable. If -1, the function considers the last variable as the response |
||||
|
||||
:param responseEndIdx: Index of the last output variable + 1. If -1, then there is single response variable at ``responseStartIdx``. |
||||
|
||||
:param varTypeSpec: The optional text string that specifies the variables' types. It has the format ``ord[n1-n2,n3,n4-n5,...]cat[n6,n7-n8,...]``. That is, variables from n1 to n2 (inclusive range), n3, n4 to n5 ... are considered ordered and n6, n7 to n8 ... are considered as categorical. The range [n1..n2] + [n3] + [n4..n5] + ... + [n6] + [n7..n8] should cover all the variables. If varTypeSpec is not specified, then algorithm uses the following rules: |
||||
1. all input variables are considered ordered by default. If some column contains non-numerical values, e.g. 'apple', 'pear', 'apple', 'apple', 'mango', the corresponding variable is considered categorical. |
||||
2. if there are several output variables, they are all considered as ordered. Error is reported when non-numerical values are used. |
||||
3. if there is a single output variable, then if its values are non-numerical or are all integers, then it's considered categorical. Otherwise, it's considered ordered. |
||||
|
||||
:param delimiter: The character used to separate values in each line. |
||||
|
||||
:param missch: The character used to specify missing measurements. It should not be a digit. Although it's a non-numerical value, it does not affect the decision of whether the variable is ordered or categorical. |
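
The first default rule for input variables, together with the ``missch`` exception, can be sketched in standard C++. This illustrates the rule as described above and is not the ``loadFromCSV`` implementation: ``VarKind``, ``isNumeric`` and ``inferInputType`` are hypothetical names, and using ``std::strtod`` to decide what counts as numerical is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-ins for the ordered/categorical variable types.
enum VarKind { KIND_ORDERED, KIND_CATEGORICAL };

// Returns true if the whole string parses as a floating-point number.
bool isNumeric(const std::string& s) {
    if (s.empty()) return false;
    char* end = 0;
    std::strtod(s.c_str(), &end);
    return end != s.c_str() && *end == '\0';
}

// An input column is ordered unless it contains a non-numerical value;
// the missing-value marker is ignored when making that decision.
VarKind inferInputType(const std::vector<std::string>& column, char missch = '?') {
    for (std::size_t i = 0; i < column.size(); ++i) {
        const std::string& v = column[i];
        if (v.size() == 1 && v[0] == missch) continue;  // missing measurement
        if (!isNumeric(v)) return KIND_CATEGORICAL;
    }
    return KIND_ORDERED;
}
```

A column such as ``1.5, ?, 2`` stays ordered, while ``apple, pear, apple`` becomes categorical.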
||||
|
||||
TrainData::create |
||||
----------------- |
||||
Creates training data from in-memory arrays. |
||||
|
||||
.. ocv:function:: Ptr<TrainData> create(InputArray samples, int layout, InputArray responses, InputArray varIdx=noArray(), InputArray sampleIdx=noArray(), InputArray sampleWeights=noArray(), InputArray varType=noArray()) |
||||
|
||||
:param samples: matrix of samples. It should have ``CV_32F`` type. |
||||
|
||||
:param layout: it's either ``ROW_SAMPLE``, which means that each training sample is a row of ``samples``, or ``COL_SAMPLE``, which means that each training sample occupies a column of ``samples``. |
||||
|
||||
:param responses: matrix of responses. If the responses are scalar, they should be stored as a single row or as a single column. The matrix should have type ``CV_32F`` or ``CV_32S`` (in the former case the responses are considered as ordered by default; in the latter case - as categorical) |
||||
|
||||
:param varIdx: vector specifying which variables to use for training. It can be an integer vector (``CV_32S``) containing 0-based variable indices or byte vector (``CV_8U``) containing a mask of active variables. |
||||
|
||||
:param sampleIdx: vector specifying which samples to use for training. It can be an integer vector (``CV_32S``) containing 0-based sample indices or byte vector (``CV_8U``) containing a mask of training samples. |
||||
|
||||
:param sampleWeights: optional vector with weights for each sample. It should have ``CV_32F`` type. |
||||
|
||||
:param varType: optional vector of type ``CV_8U`` and size <number_of_variables_in_samples> + <number_of_variables_in_responses>, containing types of each input and output variable. The ordered variables are denoted by value ``VAR_ORDERED``, and categorical - by ``VAR_CATEGORICAL``. |
||||
|
||||
|
||||
TrainData::getTrainSamples |
||||
-------------------------- |
||||
Returns index of the response column in the loaded data matrix |
||||
|
||||
.. ocv:function:: int CvMLData::get_response_idx() const |
||||
|
||||
The method returns the index of a response column in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) or throws an exception if the data has not been loaded from the file yet. |
||||
|
||||
If ``idx < 0``, there is no response. |
||||
|
||||
|
||||
CvMLData::set_train_test_split |
||||
------------------------------ |
||||
Divides the read data set into two disjoint training and test subsets. |
||||
|
||||
.. ocv:function:: void CvMLData::set_train_test_split( const CvTrainTestSplit * spl ) |
||||
|
||||
This method sets parameters for such a split using ``spl`` (see :ocv:class:`CvTrainTestSplit`) or throws an exception if the data has not been loaded from the file yet. |
||||
|
||||
CvMLData::get_train_sample_idx |
||||
------------------------------ |
||||
Returns the matrix of sample indices for a training subset |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_train_sample_idx() const |
||||
|
||||
The method returns the matrix of sample indices for a training subset. This is a single-row matrix of the type ``CV_32SC1``. If data split is not set, the method returns ``0``. If the data has not been loaded from the file yet, an exception is thrown. |
||||
|
||||
CvMLData::get_test_sample_idx |
||||
----------------------------- |
||||
Returns the matrix of sample indices for a testing subset |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_test_sample_idx() const |
||||
|
||||
|
||||
CvMLData::mix_train_and_test_idx |
||||
-------------------------------- |
||||
Mixes the indices of training and test samples |
||||
|
||||
.. ocv:function:: void CvMLData::mix_train_and_test_idx() |
||||
|
||||
The method shuffles the indices of training and test samples, preserving the sizes of the training and test subsets, if the data split is set by :ocv:func:`CvMLData::set_train_test_split`. If the data has not been loaded from the file yet, an exception is thrown. |
||||
|
||||
CvMLData::get_var_idx |
||||
--------------------- |
||||
Returns the indices of the active variables in the data matrix |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_var_idx() |
||||
|
||||
The method returns the indices of variables (columns) used in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`). |
||||
|
||||
It returns ``0`` if the used subset is not set. It throws an exception if the data has not been loaded from the file yet. Returned matrix is a single-row matrix of the type ``CV_32SC1``. Its column count is equal to the size of the used variable subset. |
||||
|
||||
CvMLData::change_var_idx |
||||
------------------------ |
||||
Enables or disables particular variable in the loaded data |
||||
|
||||
.. ocv:function:: void CvMLData::change_var_idx( int vi, bool state ) |
||||
|
||||
By default, after reading the data set all variables in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) are used. But you may want to use only a subset of variables and include/exclude (depending on ``state`` value) a variable with the ``vi`` index from the used subset. If the data has not been loaded from the file yet, an exception is thrown. |
||||
|
||||
CvMLData::get_var_types |
||||
----------------------- |
||||
Returns a matrix of the variable types. |
||||
|
||||
.. ocv:function:: const CvMat* CvMLData::get_var_types() |
||||
|
||||
The function returns a single-row matrix of the type ``CV_8UC1``, where each element is set to either ``CV_VAR_ORDERED`` or ``CV_VAR_CATEGORICAL``. The number of columns is equal to the number of variables. If data has not been loaded from file yet an exception is thrown. |
||||
|
||||
CvMLData::set_var_types |
||||
----------------------- |
||||
Sets the variables types in the loaded data. |
||||
|
||||
.. ocv:function:: void CvMLData::set_var_types( const char* str ) |
||||
|
||||
In the string, a variable type is followed by a list of variables indices. For example: ``"ord[0-17],cat[18]"``, ``"ord[0,2,4,10-12], cat[1,3,5-9,13,14]"``, ``"cat"`` (all variables are categorical), ``"ord"`` (all variables are ordered). |
||||
|
||||
CvMLData::get_header_lines_number |
||||
--------------------------------- |
||||
Returns a number of the table header lines. |
||||
|
||||
.. ocv:function:: int CvMLData::get_header_lines_number() const |
||||
|
||||
CvMLData::set_header_lines_number |
||||
--------------------------------- |
||||
Sets the number of table header lines. |
||||
|
||||
.. ocv:function:: void CvMLData::set_header_lines_number( int n ) |
||||
|
||||
By default, the table is assumed to have no header, i.e. it contains only the data. |
||||
|
||||
CvMLData::get_var_type |
||||
---------------------- |
||||
Returns the type of the specified variable. |
||||
|
||||
.. ocv:function:: int CvMLData::get_var_type( int var_idx ) const |
||||
|
||||
The method returns the type of the variable with index ``var_idx`` (``CV_VAR_ORDERED`` or ``CV_VAR_CATEGORICAL``). |
||||
|
||||
CvMLData::change_var_type |
||||
------------------------- |
||||
Changes the type of the specified variable. |
||||
|
||||
.. ocv:function:: void CvMLData::change_var_type( int var_idx, int type) |
||||
|
||||
The method changes the type of the variable with index ``var_idx`` from its existing type to ``type`` (``CV_VAR_ORDERED`` or ``CV_VAR_CATEGORICAL``). |
||||
|
||||
CvMLData::set_delimiter |
||||
----------------------- |
||||
Sets the delimiter used in the file to separate input numbers. |
||||
|
||||
.. ocv:function:: void CvMLData::set_delimiter( char ch ) |
||||
|
||||
The method sets the delimiter for variables in a file. For example: ``','`` (default), ``';'``, ``' '`` (space), or other characters. The floating-point separator ``'.'`` is not allowed. |
||||
Returns the matrix of training samples. |
||||
|
||||
CvMLData::get_delimiter |
||||
----------------------- |
||||
Returns the currently used delimiter character. |
||||
.. ocv:function:: Mat TrainData::getTrainSamples(int layout=ROW_SAMPLE, bool compressSamples=true, bool compressVars=true) const |
||||
|
||||
.. ocv:function:: char CvMLData::get_delimiter() const |
||||
:param layout: The requested layout. If it's different from the initial one, the matrix is transposed. |
||||
|
||||
:param compressSamples: If true, the function returns only the training samples (specified by ``sampleIdx``). |
||||
|
||||
:param compressVars: If true, the function returns shortened training samples that contain only the active variables. |
||||
|
||||
In the current implementation, the function tries to avoid physical data copying and returns the matrix stored inside ``TrainData`` (unless transposition or compression is needed). |
||||
|
||||
|
||||
CvMLData::set_miss_ch |
||||
--------------------- |
||||
Sets the character used to specify missing values. |
||||
TrainData::getTrainResponses |
||||
---------------------------- |
||||
Returns the vector of responses. |
||||
|
||||
.. ocv:function:: void CvMLData::set_miss_ch( char ch ) |
||||
.. ocv:function:: Mat TrainData::getTrainResponses() const |
||||
|
||||
The method sets the character used to specify missing values. For example: ``'?'`` (default), ``'-'``. The floating-point separator ``'.'`` is not allowed. |
||||
The function returns the ordered responses or the original categorical responses. It is typically used in regression algorithms. |
||||
|
||||
CvMLData::get_miss_ch |
||||
--------------------- |
||||
Returns the currently used missing value character. |
||||
|
||||
.. ocv:function:: char CvMLData::get_miss_ch() const |
||||
TrainData::getClassLabels |
||||
---------------------------- |
||||
Returns the vector of class labels. |
||||
|
||||
CvMLData::get_class_labels_map |
||||
------------------------------- |
||||
Returns a map that converts strings to labels. |
||||
.. ocv:function:: Mat TrainData::getClassLabels() const |
||||
|
||||
.. ocv:function:: const std::map<String, int>& CvMLData::get_class_labels_map() const |
||||
The function returns a vector of the unique labels that occur in the responses. |
||||
|
||||
The method returns a map that converts string class labels to numerical class labels. It can be used to recover the original class label as it appears in the file. |
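
A map of this kind can be sketched in standalone C++ as follows. The helper names ``encodeLabels`` and ``originalLabel`` are hypothetical, and numbering the labels in order of first appearance is an assumption of this example; the library may assign numerical labels differently.

```cpp
#include <map>
#include <string>
#include <vector>

// Converts string class labels to numerical labels, filling labelsMap
// as new names are encountered. Assumption of this sketch: labels are
// numbered 0, 1, 2, ... in order of first appearance.
std::vector<int> encodeLabels(const std::vector<std::string>& names,
                              std::map<std::string, int>& labelsMap)
{
    std::vector<int> codes;
    codes.reserve(names.size());
    for (const std::string& name : names) {
        auto it = labelsMap.find(name);
        if (it == labelsMap.end()) {
            const int next = static_cast<int>(labelsMap.size());
            it = labelsMap.emplace(name, next).first;
        }
        codes.push_back(it->second);
    }
    return codes;
}

// Recovers the original string label for a numerical label by a linear
// search through the map. Returns an empty string if the label is unknown.
std::string originalLabel(const std::map<std::string, int>& labelsMap,
                          int label)
{
    for (const auto& entry : labelsMap)
        if (entry.second == label)
            return entry.first;
    return std::string();
}
```

The reverse lookup is linear in the number of classes, which is usually small; for many classes a second map from numbers to strings would be the more natural choice.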
||||
|
||||
CvTrainTestSplit |
||||
---------------- |
||||
.. ocv:struct:: CvTrainTestSplit |
||||
TrainData::getTrainNormCatResponses |
||||
----------------------------------- |
||||
Returns the vector of normalized categorical responses. |
||||
|
||||
Structure that specifies the train/test split of a data set read by :ocv:class:`CvMLData`. |
||||
:: |
||||
.. ocv:function:: Mat TrainData::getTrainNormCatResponses() const |
||||
|
||||
struct CvTrainTestSplit |
||||
{ |
||||
CvTrainTestSplit(); |
||||
CvTrainTestSplit( int train_sample_count, bool mix = true); |
||||
CvTrainTestSplit( float train_sample_portion, bool mix = true); |
||||
The function returns a vector of responses. Each response is an integer from 0 to <number of classes>-1. The actual label value can then be retrieved from the class label vector, see ``TrainData::getClassLabels``. |
||||
|
||||
union |
||||
{ |
||||
int count; |
||||
float portion; |
||||
} train_sample_part; |
||||
int train_sample_part_mode; |
||||
TrainData::setTrainTestSplitRatio |
||||
----------------------------------- |
||||
Splits the training data into training and test parts. |
||||
|
||||
bool mix; |
||||
}; |
||||
.. ocv:function:: void TrainData::setTrainTestSplitRatio(double ratio, bool shuffle=true) |
||||
|
||||
There are two ways to construct a split: |
||||
The function selects a subset of the specified relative size and then returns it as the training set. If the function is not called, all the data is used for training. Note that for each ``TrainData::getTrain*`` method there is a corresponding ``TrainData::getTest*`` method, so the test subset can be retrieved and processed as well. |
||||
|
||||
* Set the training sample count (subset size) ``train_sample_count``. The remaining samples form the test subset. |
||||
|
||||
* Set a training sample portion in ``[0, 1]``. The flag ``mix`` is used to mix the training and test sample indices when the split is set. Otherwise, the data set is split in storage order: the first part of the samples, of the given size, is the training subset, and the second part is the test subset. |
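
The two ways of constructing a split can be sketched with plain index vectors. This is a minimal illustration, not the library's code: ``SplitIndices``, ``splitByCount`` and ``splitByPortion`` are hypothetical names, the explicit ``seed`` parameter is added only to make the shuffle reproducible, and rounding the portion down to a whole sample count is an assumption of this example.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical result type: sample indices of the two subsets.
struct SplitIndices {
    std::vector<int> train;
    std::vector<int> test;
};

// Splits sample indices 0..sampleCount-1 so that the first trainCount
// of them (after optional shuffling) become the training subset.
SplitIndices splitByCount(int sampleCount, int trainCount, bool mix,
                          unsigned seed = 0)
{
    if (sampleCount < 0 || trainCount < 0 || trainCount > sampleCount)
        throw std::invalid_argument("invalid sample counts");

    std::vector<int> idx(sampleCount);
    std::iota(idx.begin(), idx.end(), 0);  // storage order: 0, 1, 2, ...
    if (mix) {
        // Mix training and test indices before cutting the two parts.
        std::mt19937 rng(seed);
        std::shuffle(idx.begin(), idx.end(), rng);
    }

    SplitIndices split;
    split.train.assign(idx.begin(), idx.begin() + trainCount);
    split.test.assign(idx.begin() + trainCount, idx.end());
    return split;
}

// Same split, with the training subset size given as a portion in [0, 1].
// Assumption of this sketch: the sample count is rounded down.
SplitIndices splitByPortion(int sampleCount, float portion, bool mix,
                            unsigned seed = 0)
{
    if (portion < 0.f || portion > 1.f)
        throw std::invalid_argument("portion must be in [0, 1]");
    const int trainCount =
        static_cast<int>(std::floor(portion * sampleCount));
    return splitByCount(sampleCount, trainCount, mix, seed);
}
```

With ``mix`` set to ``false``, ``splitByCount(10, 7, false)`` puts samples 0 to 6 into the training subset and samples 7 to 9 into the test subset, which matches the storage-order behaviour described above.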
||||
Other methods |
||||
------------- |
||||
The class includes many other methods that can be used, for example, to access normalized categorical input variables or to access the training data in parts, so that it does not have to fit into memory. |
||||
|