Model

Cubist

class cubist.cubist.Cubist(n_rules: int = 500, *, n_committees: int = 1, neighbors: int = None, unbiased: bool = False, auto: bool = False, extrapolation: float = 0.05, sample: float = None, cv: int = None, random_state: int = None, target_label: str = 'outcome', verbose: int = 0)

Cubist Regression Model (Public v2.07) developed by Ross Quinlan.

Parameters:

n_rules : int, default=500
    Upper limit on the number of rules Cubist will build. The recommended and default value is 500.
n_committees : int, default=1
    Number of committees to construct. Each committee is a rule-based model, and each committee after the first tries to correct the prediction errors of the previously constructed model. The recommended value is 5.
neighbors : int, default=None
    Number between 1 and 9 for how many instances should be used to correct the rule-based prediction. If no value is given, Cubist will build a rule-based model only. If this value is set, Cubist will create a composite model with the given number of neighbors. Regardless of the value set, if auto=True, Cubist may override this input and choose a different number of neighbors. Inspect the fitted model to see how many neighbors were actually selected.
unbiased : bool, default=False
    Whether to use unbiased rules. Since Cubist minimizes the mean absolute error (MAE) of the predicted values, the rules may be biased and the mean predicted value may differ from the actual mean. Unbiased rules are recommended when the same target value occurs frequently. Note that the MAE may be slightly higher when this is enabled.
auto : bool, default=False
    If True, the algorithm chooses whether to use nearest-neighbor corrections and how many neighbors to use. In this case, neighbors must be left as None since the input has no bearing on the model’s behavior. If False, the choice of whether to use a composite model is left to the value passed to the neighbors parameter.
extrapolation : float, default=0.05
    Controls how much rule predictions are adjusted to be consistent with the training dataset. The recommended value is 5%, expressed as a decimal (0.05).
sample : float, default=None
    Percentage of the data set to be randomly selected for model building, with the remainder held out for model testing. When using this parameter, Cubist will report evaluation results on the testing set in addition to the training set results.
cv : int, default=None
    Whether to carry out cross-validation and, if so, how many folds to use. The recommended value is 10, per Quinlan.
random_state : int, default=None
    An integer to set the random seed for the C Cubist code when performing cross-validation or sampling.
target_label : str, default="outcome"
    A label for the outcome variable. This is only used for printing rules.
verbose : int, default=0
    Whether to print the Cubist output.
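
The description of unbiased above rests on a general property of absolute-error loss: the constant prediction that minimizes MAE is the median, which can sit well away from the mean when one target value occurs frequently. The following is a minimal pure-Python sketch of that property only. It does not call Cubist, and the data and the helper name `mae` are made up for illustration.

```python
from statistics import mean, median


def mae(targets, prediction):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(t - prediction) for t in targets) / len(targets)


# Hypothetical targets: the value 1.0 occurs frequently, plus a few large values.
targets = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 9.0, 15.0]

best_for_mae = median(targets)  # constant that minimizes MAE
unbiased = mean(targets)        # constant whose value equals the target mean

print("median:", best_for_mae, "MAE:", mae(targets, best_for_mae))
print("mean:  ", unbiased, "MAE:", mae(targets, unbiased))
```

On this data the median gives the lower MAE but underestimates the mean, while the mean is unbiased at the cost of a higher MAE. This is the same trade-off the unbiased parameter describes.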

Attributes:

model_ : str
    The Cubist model string generated by the C code.
output_ : str
    The Cubist model summary generated by the C code. Added in version 1.0.0.
feature_importances_ : pd.DataFrame
    Table of how training data variables are used in the Cubist model. The first column, “Conditions”, shows the approximate percentage of cases for which the named attribute appears in a condition of an applicable rule, while the second column, “Attributes”, gives the percentage of cases for which the attribute appears in the linear formula of an applicable rule.
n_features_in_ : int
    Number of features seen during fit.
feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Defined only when X has feature names that are all strings.
splits_ : pd.DataFrame
    Table of the splits built by the Cubist model for each rule.
coeff_ : pd.DataFrame
    Table of the regression coefficients found by the Cubist model for each rule.
version_ : str
    The Cubist version with the time of local build/install. Added in version 1.0.0.
feature_statistics_ : pd.DataFrame
    Table of statistics Cubist generated for each input attribute. Added in version 1.0.0.
committee_error_reduction_ : float
    The reduction in error by using committees. Added in version 1.0.0.
n_committees_used_ : int
    Number of committees actually used by Cubist. Added in version 1.0.0.

Examples

>>> from cubist import Cubist
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.model_selection import train_test_split
>>> X, y = fetch_california_housing(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42,
...                                                     test_size=0.2)
>>> model = Cubist()
>>> model.fit(X_train, y_train)
Cubist()
>>> model.predict(X_test)
array([0.50073832, 0.86456549, 5.14631033, ..., 4.76159859, 0.76238906,
...    1.9493351 ], shape=(4128,))
>>> model.score(X_test, y_test)
0.81783547...
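
The constructor parameters described earlier (n_committees, neighbors, auto) can be compared on the same split with a small loop. The sketch below is one possible way to do that, not part of the cubist package: the helper name `compare_settings` and the labels in `settings` are hypothetical, and the helper only assumes an estimator class that exposes `fit` and `score` as documented on this page.

```python
def compare_settings(make_model, settings, X_train, y_train, X_test, y_test):
    """Fit one model per keyword-argument set and return its R^2 on test data.

    make_model is any estimator class exposing fit and score, for example Cubist.
    """
    scores = {}
    for label, kwargs in settings.items():
        model = make_model(**kwargs)
        model.fit(X_train, y_train)
        scores[label] = model.score(X_test, y_test)
    return scores


# Illustrative settings drawn from the parameter descriptions on this page.
settings = {
    "rules only": {},
    "5 committees": {"n_committees": 5},
    "5 committees + 5 neighbors": {"n_committees": 5, "neighbors": 5},
    "auto neighbors": {"auto": True},
}

# With the cubist package installed and the train/test split from the example:
# from cubist import Cubist
# scores = compare_settings(Cubist, settings, X_train, y_train, X_test, y_test)
```

Passing the class rather than an instance keeps each fit independent, so one setting cannot leak fitted state into the next.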

fit(X, y, sample_weight=None)

Build a Cubist regression model from training set (X, y).

Parameters:

X : array-like of shape (n_samples, n_features)
    The training input samples. Column names must either be provided for every column or omitted entirely (NumPy arrays will be given names by column index).
y : array-like of shape (n_samples,)
    The target values (real numbers in regression).
sample_weight : array-like of shape (n_samples,), default=None
    Optional vector of sample weights, the same length as y, specifying how much each instance should contribute to the model fit.

Returns:

self : object
    Fitted estimator.

predict(X)

Predict Cubist regression target for X.

Parameters:

X : array-like of shape (n_samples, n_features)
    The input samples. Column names must either be provided for every column or omitted entirely (NumPy arrays will be given names by column index).

Returns:

y : ndarray of shape (n_samples,)
    The predicted values.

score(X, y, sample_weight=None)

Return coefficient of determination on test data.

The coefficient of determination, \(R^2\), is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
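
The definition can be checked by hand with a few lines of plain Python. This is a minimal sketch of the formula only, not the implementation used by score; the function name `r2` and the sample values are made up for illustration, and the sketch assumes y_true is not constant (otherwise \(v\) is zero).

```python
def r2(y_true, y_pred):
    """Coefficient of determination computed as 1 - u / v."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - u / v


y_true = [3.0, 5.0, 7.0, 9.0]

print(r2(y_true, y_true))                # perfect predictions
print(r2(y_true, [6.0] * 4))             # constant prediction at the mean of y_true
print(r2(y_true, [9.0, 7.0, 5.0, 3.0]))  # predictions worse than the constant model
```

The three calls cover the cases named in the text: a perfect fit, the constant model at the mean of y_true, and a model that does worse than that constant and therefore scores below zero.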

Parameters:

X : array-like of shape (n_samples, n_features)
    Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True values for X.
sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.

Returns:

score : float
    \(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from scikit-learn version 0.23 onward, to stay consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).