Testing the impact of features generated in `preprocessing` on model performance

Loading

%%time
loading.N_TRAIN = 10_000
loading.N_TEST = 10_000
%%time
ashrae_data = loading.load_all()
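
For orientation, a quick look at the loaded data; a minimal inspection cell, assuming `ashrae_data` is a mapping of named `DataFrame`s (as its use in `preprocess_all` below suggests):

# assumption: load_all returns a dict-like mapping of names to DataFrames
for name, data in ashrae_data.items():
    print(name, data.shape)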

Executing the pre-processing

%%time
tfms_config = {
    'add_time_features':{},
    'add_weather_features':{'fix_time_offset':True,
                            'add_na_indicators':True,
                            'impute_nas':True},
    'add_building_features':{},
}

df, df_test, var_names = preprocessing.preprocess_all(ashrae_data, tfms_config)
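
The config maps transform names to their keyword arguments, so `preprocess_all` can presumably look up each step by name and apply it with its kwargs. A minimal sketch of that dispatch pattern (the `registry` of transform functions is hypothetical):

import pandas as pd

def apply_tfms(df: pd.DataFrame, tfms_config: dict, registry: dict) -> pd.DataFrame:
    # apply each configured transform in order, passing its kwargs through
    for name, kwargs in tfms_config.items():
        df = registry[name](df, **kwargs)
    return df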
Alternatively, the pre-processed data can be loaded from disk:

%%time
# df = load_df(data_path/'X.parquet')
# df_test_p = load_df(data_path/'X_test.parquet')
# var_names = load_var_names(data_path/'var_names.pckl')

Testing if certain features improve the score beyond the baseline

Training to get a basic idea of whether the added features provide any benefit

get_tabular_object[source]

get_tabular_object(df:DataFrame, var_names:dict, splits=None, procs:list=None)
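
Given the fastai `procs` passed below, `get_tabular_object` is presumably a thin wrapper around fastai's `TabularPandas`; a sketch under that assumption (the `dep_var` key in `var_names` is an assumption, the `cats`/`conts` keys appear further down):

from fastai.tabular.all import TabularPandas

def get_tabular_object(df, var_names: dict, splits=None, procs: list = None):
    # wrap the frame in a TabularPandas object so the procs
    # (Categorify, FillMissing, Normalize) are applied consistently
    return TabularPandas(df, procs=procs if procs is not None else [],
                         cat_names=var_names['cats'],
                         cont_names=var_names['conts'],
                         y_names=var_names['dep_var'],
                         splits=splits)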

train_predict[source]

train_predict(df:DataFrame, var_names:dict, model, model_params:dict=None, n_rep:int=3, n_samples_train:int=10000, n_samples_valid:int=10000, procs:list=[Categorify, FillMissing, Normalize], split_params:dict=None)
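
The signature suggests a repeated bootstrap-train/evaluate loop; a sketch of that flow, assuming `split_dataset` and the rmse computation behave as they are used further down (not the notebook's actual implementation):

import numpy as np
import pandas as pd

def train_predict(df, var_names, model, model_params=None, n_rep=3,
                  n_samples_train=10_000, n_samples_valid=10_000,
                  procs=None, split_params=None):
    # split once, then repeat: bootstrap a training sample, fit, score
    splits = preprocessing.split_dataset(df, **(split_params or {}))
    to = get_tabular_object(df, var_names, splits=splits, procs=procs)
    losses = []
    for _ in range(n_rep):
        m = model(**(model_params or {}))
        xs = to.train.xs.sample(n_samples_train, replace=True)
        m.fit(xs.values, to.train.ys.loc[xs.index].values.ravel())
        xv = to.valid.xs.sample(n_samples_valid, replace=True)
        y_true = to.valid.ys.loc[xv.index].values.ravel()
        losses.append(np.sqrt(np.mean((y_true - m.predict(xv.values)) ** 2)))
    return pd.DataFrame({'rmse loss': losses})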

split_params = dict(
    split_kind = 'time_split_day',
    t_train = None,
    train_frac = .8,
)
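
`'time_split_day'` presumably splits on whole calendar days, so that validation rows come from days the model never saw during training; a minimal sketch of such a split, assuming a `timestamp` column and fastai-style positional index lists:

import numpy as np
import pandas as pd

def time_split_day(df: pd.DataFrame, train_frac: float = .8):
    # earliest train_frac of the distinct days -> train, the rest -> valid
    day = df['timestamp'].dt.floor('D')
    cutoff = np.sort(day.unique())[int(day.nunique() * train_frac)]
    mask = (day < cutoff).values
    return list(np.where(mask)[0]), list(np.where(~mask)[0])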
from sklearn import ensemble, linear_model

model_params = {'n_estimators': 40, 'max_features': 'sqrt',
                'n_jobs': -1, 'min_samples_leaf': 25}
model = ensemble.RandomForestRegressor
# model_params = {}
# model = linear_model.LinearRegression
n_rep = 3
n_samples_train = 10_000
n_samples_valid = 10_000

The following is not always necessary, but it is sensible in the case of a linear model to remove categorical variables that are not one-hot encoded (a hypothetical `to_remove` is sketched below).

# for k in ['cats', 'conts']:
#     var_names[k] = [_v for _v in var_names[k] if _v not in to_remove[k]]
# var_names
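
A hypothetical `to_remove` for the linear-model case could look as follows; the column names are illustrative assumptions, not the notebook's actual variables:

# hypothetical: drop high-cardinality categoricals a linear model cannot
# use without one-hot encoding
to_remove = {
    'cats': ['building_id', 'primary_use'],  # assumed column names
    'conts': [],
}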
%%time
from fastai.tabular.all import Categorify, FillMissing, Normalize

procs = [Categorify, FillMissing, Normalize]
df_rep = train_predict(df.copy(), var_names, model, model_params=model_params, 
                       n_rep=n_rep, n_samples_train=n_samples_train,
                       n_samples_valid=n_samples_valid, procs=procs,
                       split_params=split_params)
df_rep['rmse loss'].describe()
import plotly.express as px

px.box(df_rep, y='rmse loss', range_y=(0, 2.5))

Baseline model: a RandomForest with 20 estimators and `sqrt` max features, trained on 100k samples and evaluated on 1k samples.

| input | model | rmse loss | time [s/it] |
|---|---|---|---|
| meter and building id only | random forest | 1.2 - 1.21 | 10.2 |
| using dep_var stats | random forest | 1.16 - 1.18 | 17.3 |
| using time stats | random forest | 1.2 - 1.21 | 13.2 - 13.7 |
| using building info | random forest | 1.19 | 17 - 18 |
| using weather (+ building) info | random forest | 1.13 - 1.139 | 14.6 - 15 |
| using all above | random forest | 1.19 - 1.21 | 20 - 26 |
| removing leading 0s in `dep_var` | random forest | .36 - .37 | 4 |
| removing trailing 0s in `dep_var` | random forest | 1.2 | 4 |
| removing empty weeks before the first full week | random forest | 1.16 | 4 |
| meter only | linear model | 2.2 | 5 |
| meter + hour | linear model | 2.1 | 5 |
| meter reading stats only (meter, building_id, hour) | linear model | 1.23 - 1.24 / 1.68 - 1.7 | 5 |
| meter + meter reading stats (meter, building_id, hour) | linear model | 1.51 - 1.52 | 5 |
| meter reading stats (meter, building_id, hour) | random forest | 0.58 - 0.6 | 5 |
| meter + meter reading stats (meter, building_id, hour) | random forest | 1.21 - 1.22 | 5 |

Plotting predictions with a single model

Comparing dep_var distributions

%%time
df, df_test, var_names = preprocessing.preprocess_all(ashrae_data, tfms_config)
%%time
splits = preprocessing.split_dataset(df, **split_params)
%%time
to = get_tabular_object(df, var_names, splits=splits, procs=procs)
%%time
params = {'n_estimators': 100, 'max_features': 'sqrt',
          'min_samples_leaf':5}
model = ensemble.RandomForestRegressor
m = model(**params)
%%time
# sample one set of rows so features and targets stay aligned
_Xs = to.train.xs.sample(n_samples_train, replace=True)
_X = _Xs.values
_y = to.train.ys.loc[_Xs.index].values.ravel()
m.fit(_X, _y)
%%time
# again sample a single aligned set of validation rows
_Xs = to.valid.xs.sample(n_samples_valid, replace=True)
_X = _Xs.values
_y = to.valid.ys.loc[_Xs.index].values.ravel()
pred = m.predict(_X)
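
As a quick sanity check before plotting, the RMSE on this validation sample can be compared against the table above:

import numpy as np
from sklearn.metrics import mean_squared_error

# rmse on the bootstrapped validation sample
np.sqrt(mean_squared_error(_y, pred))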

hist_plot_preds[source]

hist_plot_preds(y0:ndarray, y1:ndarray, label0:str='y0', label1:str='y1')

hist_plot_preds(_y, pred, label0='truth (valid)', label1='prediction (valid)')
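
The function presumably overlays the two distributions; a minimal plotly sketch under that assumption (not the notebook's actual implementation):

import plotly.graph_objects as go

def hist_plot_preds(y0, y1, label0: str = 'y0', label1: str = 'y1'):
    # overlay two histograms so truth and prediction can be compared directly
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=y0, name=label0, opacity=.5))
    fig.add_trace(go.Histogram(x=y1, name=label1, opacity=.5))
    fig.update_layout(barmode='overlay')
    return fig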

Inspecting confidently wrong predictions

class BoldlyWrongTimeseries[source]

BoldlyWrongTimeseries(xs, y_true, y_pred, info:DataFrame=None)

%%time
_X = to.valid.xs
_y = to.valid.ys.values.ravel()
pred = m.predict(_X.values)
assert _y.shape == pred.shape
%%time
bwt = BoldlyWrongTimeseries(to.valid.xs.join(df.loc[:,['building_id', 'meter','timestamp']], lsuffix='to_'), 
                            _y, pred)
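
Internally the class presumably aggregates the per-row loss by `building_id` and `meter` so the worst series can be ranked; a sketch of that core idea (attribute names are assumptions):

import pandas as pd

class BoldlyWrongTimeseries:
    def __init__(self, xs: pd.DataFrame, y_true, y_pred, info: pd.DataFrame = None):
        self.df = xs.copy()
        self.df['y_true'], self.df['y_pred'] = y_true, y_pred
        # per-row squared error, aggregated to an rmse per (building, meter)
        self.df['loss'] = (self.df['y_true'] - self.df['y_pred']) ** 2
        self.loss_by_series = (self.df.groupby(['building_id', 'meter'])['loss']
                               .mean().pow(.5).sort_values())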

Adding plotting capability based on the loss or meter/building id

BoldlyWrongTimeseries.plot_boldly_wrong[source]

BoldlyWrongTimeseries.plot_boldly_wrong(nth_last:int=None, meter:int=None, bid:int=None)

bwt.plot_boldly_wrong(nth_last=-1)

Adding widgets for interactive exploration

BoldlyWrongTimeseries.init_widgets[source]

BoldlyWrongTimeseries.init_widgets()

BoldlyWrongTimeseries.run_boldly[source]

BoldlyWrongTimeseries.run_boldly()

BoldlyWrongTimeseries.click_boldly_wrong[source]

BoldlyWrongTimeseries.click_boldly_wrong(change)
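
The `change` argument points to an ipywidgets observer; a sketch of how the wiring presumably looks (widget and attribute names are assumptions):

import ipywidgets as widgets
from IPython.display import display

def run_boldly(self):
    # two dropdowns whose value changes trigger a re-plot
    self.meter_widget = widgets.Dropdown(options=sorted(self.df['meter'].unique()),
                                         description='meter')
    self.bid_widget = widgets.Dropdown(options=sorted(self.df['building_id'].unique()),
                                       description='building_id')
    for w in (self.meter_widget, self.bid_widget):
        w.observe(self.click_boldly_wrong, names='value')
    display(self.meter_widget, self.bid_widget)

def click_boldly_wrong(self, change):
    # observer callback: re-draw the series for the current selection
    self.plot_boldly_wrong(meter=self.meter_widget.value,
                           bid=self.bid_widget.value)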

bwt.run_boldly()