Model

Cubist

class cubist.cubist.Cubist(n_rules: int = 500, *, n_committees: int = 1, neighbors: int | None = None, unbiased: bool = False, auto: bool = False, extrapolation: float = 0.05, sample: float | None = None, cv: int | None = None, random_state: int | None = None, target_label: str = 'outcome', verbose: int = 0)

Cubist Regression Model (Public v2.07) developed by Ross Quinlan.

Parameters:
n_rules : int, default=500

Sets the maximum number of rules Cubist will generate for a model. The recommended (and default) value is 500.

n_committees : int, default=1

Number of models (called committees) Cubist will construct. Each committee is a rule-based model, and each committee after the first tries to correct the prediction errors of the previously constructed model. The recommended value is 5.

neighbors : int, default=None

Number between 1 and 9 specifying how many training instances are used to correct the rule-based prediction. If this value is set, Cubist creates a composite model with the given number of neighbors. If no value is given, Cubist builds a rule-based model only. If auto=True, Cubist may override this input and choose a different number of neighbors. Note that a composite model may improve accuracy at the expense of interpretability, since the linear models won’t be followed exactly. The training dataset is also stored along with the trained model, which matters if disk space is a concern.

unbiased : bool, default=False

Determines whether Cubist generates unbiased rules, that is, whether to allow the mean predicted value for the training cases covered by a rule to differ from their mean value. The default behavior is to minimize the average absolute error. This option is recommended when a dataset has frequent occurrences of the same value. Experiment with it during model development to assess its influence on predictive performance.

auto : bool, default=False

A value of True allows the algorithm to choose whether to use nearest-neighbor corrections and how many neighbors to use. In this case, neighbors must be left as None since the input has no bearing on the model’s behavior. A value of False leaves the choice of whether to use a composite model to the value passed to the neighbors parameter.

extrapolation : float, default=0.05

Sets how far beyond the output values seen in the training dataset Cubist may extrapolate, expressed as a decimal fraction. The recommended value is 5% (0.05).

sample : float, default=None

Percentage of the training dataset to be randomly selected for model building, with a disjoint set of cases held out for model testing. When this parameter is used, Cubist reports evaluation results on the testing set in addition to the training set results. Note that this is inadvisable for small datasets, as Cubist may not have enough samples to produce a representative model.

cv : int, default=None

Enables cross-validation and sets the number of folds, which must be an integer greater than 1 (the recommended value is 10, per Ross Quinlan). Note that cross-validation only produces a report for the user and doesn’t save a model, so it is only useful for assessing model performance.

random_state : int, default=None

An integer to set the random seed for Cubist to enable repeatable cross-validation and sampling.

target_label : str, default="outcome"

A label for the target/outcome (y) variable. This is only used when printing the model summary and can be changed to show something other than the default of “outcome”.

verbose : int, default=0

Indicates whether Cubist should print the model report, summary, and training performance to the console. Either an integer or Python boolean is accepted.
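
The neighbors parameter above describes a composite model in which nearby training instances correct the rule-based prediction. The following is a minimal sketch of that idea, not the C Cubist library's implementation: it assumes Manhattan distance and an unweighted average over the neighbors, and `rule_model` is a hypothetical callable standing in for the fitted rules.

```python
def composite_predict(rule_model, train_X, train_y, x, neighbors):
    """Adjust a rule-based prediction using the nearest training instances.

    Illustrative only: `rule_model` is a hypothetical callable standing in
    for the fitted rule-based model. The distance measure and the equal
    weighting of neighbors are assumptions made for this sketch.
    """
    if not 1 <= neighbors <= 9:
        raise ValueError("neighbors must be between 1 and 9")

    def distance(a, b):
        # Manhattan distance between two feature rows (an assumption).
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    # Indices of the training rows closest to x.
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], x))
    nearest = order[:neighbors]

    # Each neighbor's known target is shifted by the difference between the
    # rule-based predictions for x and for that neighbor.
    base = rule_model(x)
    adjusted = [train_y[i] + (base - rule_model(train_X[i])) for i in nearest]
    return sum(adjusted) / len(adjusted)


def toy_rule_model(row):
    # Hypothetical stand-in: one linear formula, y = 2 * x0.
    return 2.0 * row[0]


train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.5, 2.5, 4.5, 20.5]  # every target sits 0.5 above the formula

# The formula alone gives 2.4 for x0 = 1.2; the two nearest neighbors pull
# the prediction up to approximately 2.9.
corrected = composite_predict(toy_rule_model, train_X, train_y, [1.2], neighbors=2)
```

The sketch also shows why a composite model needs the training data at prediction time, which is the storage cost mentioned in the neighbors description.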

Attributes:
model_ : str

The raw model generated by the C Cubist library.

output_ : str

The model summary generated by the C Cubist library.

Added in version 1.0.0.

feature_importances_ : pd.DataFrame

Table of how training dataset variables are used in the Cubist model. The first column for “Conditions” shows the approximate percentage of cases for which the named attribute appears in a condition of an applicable rule, while the second column “Attributes” gives the percentage of cases for which the attribute appears in the linear formula of an applicable rule.

n_features_in_ : int

Number of training features (columns) seen during fit.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of the training features (columns) seen during fit. Defined only when X has feature names that are all strings.

splits_ : pd.DataFrame

Table of the split conditions created by rule (and committee if used).

coeff_ : pd.DataFrame

Table of the regression coefficients determined by rule (and committee if used).

version_ : str

The Cubist version with the time of local build/install.

Added in version 1.0.0.

feature_statistics_ : pd.DataFrame

Table of statistics generated for each input attribute.

Added in version 1.0.0.

committee_error_reduction_ : float | None

The reduction in error by using committees.

Added in version 1.0.0.

n_committees_used_ : int

Number of committees actually used in the model.

Added in version 1.0.0.

global_mean_ : float

Mean of entire training set.

Added in version 1.1.0.

ceiling_ : float

Maximum allowable global prediction.

Added in version 1.1.0.

floor_ : float

Minimum allowable global prediction. If all target values are greater than or equal to zero, the minimum allowable prediction will be zero.

Added in version 1.1.0.
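
The extrapolation parameter and the ceiling_ and floor_ attributes together bound what the model may predict. The sketch below shows one plausible way such bounds could be derived. The margin formula (extrapolation times the range of the training targets) is an assumption, not something this page states; the zero floor for non-negative targets follows the floor_ description. `prediction_bounds` and `clip_prediction` are hypothetical helpers, not part of the cubist API.

```python
def prediction_bounds(y_train, extrapolation=0.05):
    """Return (floor, ceiling) for predictions. Illustrative sketch only."""
    y_min, y_max = min(y_train), max(y_train)

    # ASSUMPTION: the allowed margin is a fraction of the target range.
    margin = extrapolation * (y_max - y_min)
    ceiling = y_max + margin

    # Per the floor_ description: non-negative targets give a floor of zero.
    floor = 0.0 if y_min >= 0 else y_min - margin
    return floor, ceiling


def clip_prediction(value, floor, ceiling):
    """Keep a raw prediction inside the allowed interval."""
    return min(max(value, floor), ceiling)


floor, ceiling = prediction_bounds([1.0, 3.0], extrapolation=0.05)
clipped = clip_prediction(5.0, floor, ceiling)  # capped at the ceiling
```

On a fitted model, the ceiling_ and floor_ attributes hold the actual values, so a helper like this is only useful for building intuition.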

Examples

>>> from cubist import Cubist
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.model_selection import train_test_split
>>> X, y = fetch_california_housing(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42,
...                                                     test_size=0.2)
>>> model = Cubist()
>>> model.fit(X_train, y_train)
Cubist()
>>> model.predict(X_test)
array([0.50073832, 0.86456549, 5.14631033, ..., 4.76159859, 0.76238906,
...    1.9493351 ], shape=(4128,))
>>> model.score(X_test, y_test)
0.81783547...

fit(X, y, sample_weight=None)

Build a Cubist regression model from training set (X, y).

Parameters:
X : array-like of shape (n_samples, n_features)

The training input samples. Column names must either be provided for every column or omitted entirely (NumPy arrays are given names by column index).

y : array-like of shape (n_samples,)

The target values (real numbers in regression).

sample_weight : array-like of shape (n_samples,), default=None

Optional vector of sample weights, the same length as y, specifying how much each instance should contribute to the model fit.

Returns:
self : object

Fitted estimator.
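
The column-name requirement on X can be checked before calling fit. The helper below is a hypothetical pre-check, not part of the cubist API. It assumes that missing names are replaced by the column index rendered as a string; the exact default names the library assigns are not stated on this page.

```python
def resolve_feature_names(columns, n_features):
    """Return usable feature names or raise if they are only partly given.

    Illustrative only: `columns` is None when the input carries no names
    (for example a plain NumPy array), otherwise an iterable of names.
    """
    if columns is None:
        # ASSUMPTION: unnamed columns are labeled "0", "1", ... by index.
        return [str(i) for i in range(n_features)]

    names = list(columns)
    complete = len(names) == n_features and all(
        isinstance(name, str) and name for name in names
    )
    if not complete:
        raise ValueError("provide a string name for every column, or none at all")
    return names


unnamed = resolve_feature_names(None, 3)
named = resolve_feature_names(["MedInc", "HouseAge"], 2)
```

With a pandas DataFrame, `columns` would typically be `X.columns`; a frame whose columns mix strings and integers would fail this check.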

predict(X)

Predict Cubist regression target for X.

Parameters:
X : array-like of shape (n_samples, n_features)

The input samples. Column names must either be provided for every column or omitted entirely (NumPy arrays are given names by column index).

Returns:
y : ndarray of shape (n_samples,)

The predicted values.

score(X, y, sample_weight=None)

Return coefficient of determination on test data.

The coefficient of determination, \(R^2\), is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
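
The definition above can be worked through by hand. The function below is a plain-Python sketch of the same formula, with optional sample weights applied to both sums and to the mean of y_true. `r2_manual` is a name chosen for this illustration; the score method itself should be preferred in practice. The sketch does not handle the degenerate case where y_true is constant (v = 0).

```python
def r2_manual(y_true, y_pred, sample_weight=None):
    """Compute R^2 = 1 - u / v for a single-output regression."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)

    # Weighted mean of the true values.
    total_weight = sum(sample_weight)
    mean_true = sum(w * t for w, t in zip(sample_weight, y_true)) / total_weight

    # u: residual sum of squares, v: total sum of squares.
    u = sum(w * (t - p) ** 2 for w, t, p in zip(sample_weight, y_true, y_pred))
    v = sum(w * (t - mean_true) ** 2 for w, t in zip(sample_weight, y_true))
    return 1.0 - u / v


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

good = r2_manual(y_true, y_pred)              # u = 1.5, v = 29.1875
baseline = r2_manual(y_true, [2.875] * 4)     # always predict the mean
worse = r2_manual(y_true, [100.0] * 4)        # far-off constant prediction
```

Here `good` is about 0.949, `baseline` is exactly 0.0 because u equals v, and `worse` is strongly negative, matching the three cases described in the paragraph above.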

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

\(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).