Model
Cubist
- class cubist.cubist.Cubist(n_rules: int = 500, *, n_committees: int = 1, neighbors: int = None, unbiased: bool = False, auto: bool = False, extrapolation: float = 0.05, sample: float = None, cv: int = None, random_state: int = None, target_label: str = 'outcome', verbose: int = 0)
Cubist Regression Model (Public v2.07) developed by Ross Quinlan.
- Parameters:
- n_rules : int, default=500
Upper limit on the number of rules Cubist will build. The recommended and default value is 500.
- n_committees : int, default=1
Number of committees to construct. Each committee is a rule-based model, and every committee after the first tries to correct the prediction errors of the one built before it. Recommended value is 5.
- neighbors : int, default=None
Number of instances, between 1 and 9, used to correct the rule-based prediction. If no value is given, Cubist builds a rule-based model only. If a value is set, Cubist creates a composite model with the given number of neighbors. Regardless of the value set, if auto=True, Cubist may override this input and choose a different number of neighbors. Inspect the fitted model to see how many neighbors were actually used.
- unbiased : bool, default=False
Whether to use unbiased rules. Because Cubist minimizes the MAE of the predicted values, the rules may be biased and the mean predicted value may differ from the actual mean. Setting this to True is recommended when the same target value occurs frequently. Note that the MAE may be slightly higher.
- auto : bool, default=False
True lets the algorithm choose whether to use nearest-neighbor corrections and how many neighbors to use. In this case, neighbors must be left as None since the input has no bearing on the model’s behavior. False leaves the choice of whether to use a composite model to the value passed to the neighbors parameter.
- extrapolation : float, default=0.05
Controls how much rule predictions are adjusted to be consistent with the training dataset. Recommended value is 5%, expressed as a decimal (0.05).
- sample : float, default=None
Percentage of the data set to be randomly selected for model building, with the rest held out for model testing. When this parameter is set, Cubist reports evaluation results on the testing set in addition to the training set results.
- cv : int, default=None
Number of cross-validation folds to use. If None, no cross-validation is carried out. Quinlan recommends 10 folds.
- random_state : int, default=None
An integer to set the random seed for the C Cubist code when performing cross-validation or sampling.
- target_label : str, default='outcome'
A label for the outcome variable. This is only used for printing rules.
- verbose : int, default=0
Should the Cubist output be printed?
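The reasoning behind the unbiased parameter can be illustrated with a small sketch. It does not reproduce Cubist's rule fitting; it only shows the general fact that the constant minimizing MAE is the median, which can sit far from the mean when one target value is repeated often. The `targets` list and the `mae` helper are hypothetical names made up for this illustration.

```python
from statistics import mean, median


def mae(targets, prediction):
    """Mean absolute error of one constant prediction (hypothetical helper)."""
    return sum(abs(t - prediction) for t in targets) / len(targets)


# Hypothetical skewed targets in which the value 1.0 occurs frequently.
targets = [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 12.0]

mae_optimal = median(targets)  # the constant that minimizes MAE
unbiased = mean(targets)       # the constant whose mean matches the target mean

# The MAE-optimal constant has the lower MAE but misses the target mean,
# while the unbiased constant matches the mean at a slightly higher MAE.
print(mae_optimal, mae(targets, mae_optimal))
print(unbiased, mae(targets, unbiased))
```

In this toy case the MAE-optimal prediction underestimates the mean target, which is the kind of bias that unbiased=True is described as addressing, at the cost of a somewhat higher MAE.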
- Attributes:
- model_ : str
The Cubist model string generated by the C code.
- output_ : str
The Cubist model summary generated by the C code.
Added in version 1.0.0.
- feature_importances_ : pd.DataFrame
Table of how training data variables are used in the Cubist model. The first column for “Conditions” shows the approximate percentage of cases for which the named attribute appears in a condition of an applicable rule, while the second column “Attributes” gives the percentage of cases for which the attribute appears in the linear formula of an applicable rule.
- n_features_in_ : int
Number of features seen during fit.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
- splits_ : pd.DataFrame
Table of the splits built by the Cubist model for each rule.
- coeff_ : pd.DataFrame
Table of the regression coefficients found by the Cubist model for each rule.
- version_ : str
The Cubist version with the time of local build/install.
Added in version 1.0.0.
- feature_statistics_ : pd.DataFrame
Table of statistics Cubist generated for each input attribute.
Added in version 1.0.0.
- committee_error_reduction_ : float
The reduction in error by using committees.
Added in version 1.0.0.
- n_committees_used_ : int
Number of committees actually used by Cubist.
Added in version 1.0.0.
Examples
>>> from cubist import Cubist
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.model_selection import train_test_split
>>> X, y = fetch_california_housing(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42,
...                                                     test_size=0.2)
>>> model = Cubist()
>>> model.fit(X_train, y_train)
Cubist()
>>> model.predict(X_test)
array([0.50073832, 0.86456549, 5.14631033, ..., 4.76159859, 0.76238906,
...    1.9493351 ], shape=(4128,))
>>> model.score(X_test, y_test)
0.81783547...
- fit(X, y, sample_weight=None)
Build a Cubist regression model from training set (X, y).
- Parameters:
- X : array-like of shape (n_samples, n_features)
The training input samples. Either every column must have a name or none may (NumPy arrays are given names by column index).
- y : array-like of shape (n_samples,)
The target values (real numbers in regression).
- sample_weight : array-like of shape (n_samples,), default=None
Optional vector of sample weights, the same length as y, giving how much each instance should contribute to the model fit.
- Returns:
- self : object
Fitted estimator.
- predict(X)
Predict Cubist regression target for X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input samples. Either every column must have a name or none may (NumPy arrays are given names by column index).
- Returns:
- y : ndarray of shape (n_samples,)
The predicted values.
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True values for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- Returns:
- score : float
\(R^2\) of self.predict(X) w.r.t. y.
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
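The definition of \(R^2\) given for score can be checked by hand with a minimal sketch. The function name `r_squared` is hypothetical, the sketch ignores sample_weight, and it assumes y_true is not constant (otherwise \(v\) would be zero).

```python
def r_squared(y_true, y_pred):
    """Unweighted R^2 = 1 - u / v, following the definition given for score."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between true and predicted values.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares of the true values around their mean.
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                # exact predictions
constant = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])    # worse than the mean

print(perfect, constant, worse)
```

The three cases correspond to the statements above: a perfect fit scores 1.0, a constant model predicting the mean of y scores 0.0, and a model worse than that constant gets a negative score.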