Testing the impact of the features generated in `preprocessing` on model performance
%%time
loading.N_TRAIN = 10_000
loading.N_TEST = 10_000
%%time
ashrae_data = loading.load_all()
%%time
tfms_config = {
'add_time_features':{},
'add_weather_features':{'fix_time_offset':True,
'add_na_indicators':True,
'impute_nas':True},
'add_building_features':{},
}
df, df_test, var_names = preprocessing.preprocess_all(ashrae_data, tfms_config)
%%time
#df = load_df(data_path/'X.parquet')
%%time
#df_test_p = load_df(data_path/'X_test.parquet')
%%time
#var_names = load_var_names(data_path/'var_names.pckl')
Training a quick model to get a basic idea of whether the added features provide any benefit
split_params = dict(
split_kind = 'time_split_day',
t_train = None,
train_frac = .8,
)
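A time-based split keeps whole days together so that validation data comes strictly after the training data. The sketch below shows one plausible reading of `split_kind = 'time_split_day'` with `train_frac = .8`. The helper `time_split_by_day` and the column name `timestamp` are assumptions made for illustration; the actual `preprocessing.split_dataset` may behave differently.

```python
import numpy as np
import pandas as pd

def time_split_by_day(df, train_frac=0.8, time_col="timestamp"):
    """Hypothetical helper: the earliest `train_frac` of calendar days go to train."""
    days = df[time_col].dt.normalize()           # truncate timestamps to midnight
    unique_days = np.sort(days.unique())         # distinct days in chronological order
    n_train_days = int(len(unique_days) * train_frac)
    cutoff = unique_days[n_train_days]           # first day of the validation set
    is_train = (days < cutoff).to_numpy()
    # return positional indices for the two subsets
    return np.flatnonzero(is_train), np.flatnonzero(~is_train)

# toy data: 10 days of hourly readings
toy = pd.DataFrame({
    "timestamp": pd.date_range("2016-01-01", periods=240, freq=pd.Timedelta(hours=1))
})
train_idx, valid_idx = time_split_by_day(toy, train_frac=0.8)
```

Splitting on day boundaries rather than on individual rows avoids having hours of the same day on both sides of the split.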
model_params = {'n_estimators': 40, 'max_features': 'sqrt',
'n_jobs': -1, 'min_samples_leaf':25}
model = ensemble.RandomForestRegressor
# params = {}
# model = linear_model.LinearRegression
n_rep = 3
n_samples_train = 10_000
n_samples_valid = 10_000
The following step is not always necessary. It is sensible for a linear model, where categorical variables that are not one-hot encoded should be removed.
# for k in ['cats', 'conts']:
# var_names[k] = [_v for _v in var_names[k] if _v not in to_remove[k]]
# var_names
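As an alternative to dropping categorical columns for a linear model, they can be one-hot encoded. The sketch below uses `pandas.get_dummies` on toy data; the column names and values are made up for illustration and are not taken from the ASHRAE data.

```python
import pandas as pd
from sklearn import linear_model

# toy data: `meter` is categorical, `hour` is treated as continuous
toy = pd.DataFrame({
    "meter": [0, 1, 2, 0, 1, 2],
    "hour": [0, 6, 12, 18, 0, 6],
    "meter_reading": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})

# one indicator column per meter type, so the integer codes are not read as magnitudes
X = pd.get_dummies(toy[["meter", "hour"]], columns=["meter"], dtype=float)
y = toy["meter_reading"]

lin = linear_model.LinearRegression().fit(X, y)
```

Without the encoding, a linear model would treat the integer category codes as an ordered numeric quantity, which is rarely meaningful.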
%%time
procs = [Categorify, FillMissing, Normalize]
df_rep = train_predict(df.copy(), var_names, model, model_params=model_params,
n_rep=n_rep, n_samples_train=n_samples_train,
n_samples_valid=n_samples_valid, procs=procs,
split_params=split_params)
df_rep['rmse loss'].describe()
px.box(df_rep, y='rmse loss', range_y=(0, 2.5))
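For reference, a root mean squared error can be computed as in the sketch below. It is an assumption that the `rmse loss` column of `df_rep` is computed this way; `train_predict` may use a different implementation or a transformed target.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```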
Baseline model: a random forest with 20 estimators and `max_features='sqrt'`, trained on 100k samples and evaluated on 1k
| input | model | rmse loss | time [s/it] |
|---|---|---|---|
| meter and building id only | random forest | 1.2 - 1.21 | 10.2 |
| using dep_var stats | random forest | 1.16 - 1.18 | 17.3 |
| using time stats | random forest | 1.2 - 1.21 | 13.2 - 13.7 |
| using building info | random forest | 1.19 | 17 - 18 |
| using weather (+ building) info | random forest | 1.13 - 1.139 | 14.6 - 15 |
| using all above | random forest | 1.19 - 1.21 | 20 - 26 |
| removing leading 0s in `dep_var` | random forest | 0.36 - 0.37 | 4 |
| removing trailing 0s in `dep_var` | random forest | 1.2 | 4 |
| removing empty weeks before the first full week | random forest | 1.16 | 4 |
| meter only | linear model | 2.2 | 5 |
| meter + hour | linear model | 2.1 | 5 |
| meter reading stats only (meter, building_id, hour) | linear model | 1.23 - 1.24 / 1.68 - 1.7 | 5 |
| meter + meter reading stats (meter, building_id, hour) | linear model | 1.51 - 1.52 | 5 |
| meter reading stats (meter, building_id, hour) | random forest | 0.58 - 0.6 | 5 |
| meter + meter reading stats (meter, building_id, hour) | random forest | 1.21 - 1.22 | 5 |
%%time
df, df_test, var_names = preprocessing.preprocess_all(ashrae_data, tfms_config)
%%time
splits = preprocessing.split_dataset(df, **split_params)
%%time
to = get_tabular_object(df, var_names, splits=splits, procs=procs)
%%time
params = {'n_estimators': 100, 'max_features': 'sqrt',
'min_samples_leaf':5}
model = ensemble.RandomForestRegressor
m = model(**params)
%%time
# Sample rows once so that features and targets stay aligned (assumes a unique index)
_xs = to.train.xs.sample(n_samples_train, replace=True)
_X = _xs.values
_y = to.train.ys.loc[_xs.index].values.ravel()
m.fit(_X, _y)
%%time
# Sample rows once so that features and targets stay aligned (assumes a unique index)
_xs = to.valid.xs.sample(n_samples_valid, replace=True)
_X = _xs.values
_y = to.valid.ys.loc[_xs.index].values.ravel()
pred = m.predict(_X)
hist_plot_preds(_y, pred, label0='truth (valid)', label1='prediction (valid)')
%%time
_X = to.valid.xs.values
_y = to.valid.ys.values.ravel()
pred = m.predict(_X)
assert _y.shape == pred.shape
%%time
bwt = BoldlyWrongTimeseries(to.valid.xs.join(df.loc[:,['building_id', 'meter','timestamp']], lsuffix='to_'),
_y, pred)
Adding plotting capability based on the loss or meter/building id
bwt.plot_boldly_wrong(nth_last=-1)
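To decide which time series to inspect first, it can help to rank building/meter pairs by their validation error. The sketch below is a minimal illustration on toy data. The helper `rank_groups_by_rmse` is hypothetical and does not reflect how `BoldlyWrongTimeseries` is implemented.

```python
import numpy as np
import pandas as pd

def rank_groups_by_rmse(df, y_true, y_pred, keys=("building_id", "meter")):
    """Hypothetical helper: RMSE per group, worst group first."""
    out = df.loc[:, list(keys)].copy()
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    out["sq_err"] = err ** 2
    return (
        out.groupby(list(keys))["sq_err"]
        .mean()                       # mean squared error per group
        .pow(0.5)                     # square root gives the RMSE
        .rename("rmse")
        .sort_values(ascending=False)
        .reset_index()
    )

# toy data: building 1 is predicted perfectly, building 2 is badly off
toy = pd.DataFrame({"building_id": [1, 1, 2, 2], "meter": [0, 0, 0, 0]})
ranking = rank_groups_by_rmse(
    toy, y_true=[1.0, 2.0, 3.0, 4.0], y_pred=[1.0, 2.0, 0.0, 0.0]
)
```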
Adding widgets for interactive exploration
bwt.run_boldly()