Testing the impact of features generated in `preprocessing` on model performance

Loading

%%time
loading.N_TRAIN = 10_000
loading.N_TEST = 10_000
%%time
ashrae_data = loading.load_all()
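
For orientation, a quick look at the loaded data; a minimal inspection cell, assuming `ashrae_data` is a mapping of named `DataFrame`s (as its use in `preprocess_all` below suggests):

# assumption: load_all returns a dict-like mapping of names to DataFrames
for name, data in ashrae_data.items():
    print(name, data.shape)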

Executing the pre-processing

%%time
tfms_config = {
    'add_time_features':{},
    'add_weather_features':{'fix_time_offset':True,
                            'add_na_indicators':True,
                            'impute_nas':True},
    'add_building_features':{},
}

df, df_test, var_names = preprocessing.preprocess_all(ashrae_data, tfms_config)
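
The config maps transform names to their keyword arguments, so `preprocess_all` can presumably look up each step by name and apply it with its kwargs. A minimal sketch of that dispatch pattern (the `registry` of transform functions is hypothetical):

import pandas as pd

def apply_tfms(df: pd.DataFrame, tfms_config: dict, registry: dict) -> pd.DataFrame:
    # apply each configured transform in order, passing its kwargs through
    for name, kwargs in tfms_config.items():
        df = registry[name](df, **kwargs)
    return df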
Alternatively, the pre-processed data can be loaded from disk:

%%time
# df = load_df(data_path/'X.parquet')
# df_test_p = load_df(data_path/'X_test.parquet')
# var_names = load_var_names(data_path/'var_names.pckl')

Testing if certain features improve the score beyond the baseline

Training to get a basic idea of whether the added features provide any benefit

get_tabular_object[source]

get_tabular_object(df:DataFrame, var_names:dict, splits=None, procs:list=None)
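
Given the fastai `procs` passed below, `get_tabular_object` is presumably a thin wrapper around fastai's `TabularPandas`; a sketch under that assumption (the `dep_var` key in `var_names` is an assumption, the `cats`/`conts` keys appear further down):

from fastai.tabular.all import TabularPandas

def get_tabular_object(df, var_names: dict, splits=None, procs: list = None):
    # wrap the frame in a TabularPandas object so the procs
    # (Categorify, FillMissing, Normalize) are applied consistently
    return TabularPandas(df, procs=procs if procs is not None else [],
                         cat_names=var_names['cats'],
                         cont_names=var_names['conts'],
                         y_names=var_names['dep_var'],
                         splits=splits)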

train_predict[source]

train_predict(df:DataFrame, var_names:dict, model, model_params:dict=None, n_rep:int=3, n_samples_train:int=10000, n_samples_valid:int=10000, procs:list=[Categorify, FillMissing, Normalize], split_params:dict=None)
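
The signature suggests a repeated bootstrap-train/evaluate loop; a sketch of that flow, assuming `split_dataset` and the rmse computation behave as they are used further down (not the notebook's actual implementation):

import numpy as np
import pandas as pd

def train_predict(df, var_names, model, model_params=None, n_rep=3,
                  n_samples_train=10_000, n_samples_valid=10_000,
                  procs=None, split_params=None):
    # split once, then repeat: bootstrap a training sample, fit, score
    splits = preprocessing.split_dataset(df, **(split_params or {}))
    to = get_tabular_object(df, var_names, splits=splits, procs=procs)
    losses = []
    for _ in range(n_rep):
        m = model(**(model_params or {}))
        xs = to.train.xs.sample(n_samples_train, replace=True)
        m.fit(xs.values, to.train.ys.loc[xs.index].values.ravel())
        xv = to.valid.xs.sample(n_samples_valid, replace=True)
        y_true = to.valid.ys.loc[xv.index].values.ravel()
        losses.append(np.sqrt(np.mean((y_true - m.predict(xv.values)) ** 2)))
    return pd.DataFrame({'rmse loss': losses})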

split_params = dict(
    split_kind = 'time_split_day',
    t_train = None,
    train_frac = .8,
)
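
`'time_split_day'` presumably splits on whole calendar days, so that validation rows come from days the model never saw during training; a minimal sketch of such a split, assuming a `timestamp` column and fastai-style positional index lists:

import numpy as np
import pandas as pd

def time_split_day(df: pd.DataFrame, train_frac: float = .8):
    # earliest train_frac of the distinct days -> train, the rest -> valid
    day = df['timestamp'].dt.floor('D')
    cutoff = np.sort(day.unique())[int(day.nunique() * train_frac)]
    mask = (day < cutoff).values
    return list(np.where(mask)[0]), list(np.where(~mask)[0])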
from sklearn import ensemble, linear_model

model_params = {'n_estimators': 40, 'max_features': 'sqrt',
                'n_jobs': -1, 'min_samples_leaf': 25}
model = ensemble.RandomForestRegressor
# model_params = {}
# model = linear_model.LinearRegression
n_rep = 3
n_samples_train = 10_000
n_samples_valid = 10_000

The following is not always necessary, but it is sensible in the case of a linear model to remove categorical variables that are not one-hot encoded (a hypothetical `to_remove` is sketched below).

# for k in ['cats', 'conts']:
#     var_names[k] = [_v for _v in var_names[k] if _v not in to_remove[k]]
# var_names
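
A hypothetical `to_remove` for the linear-model case could look as follows; the column names are illustrative assumptions, not the notebook's actual variables:

# hypothetical: drop high-cardinality categoricals a linear model cannot
# use without one-hot encoding
to_remove = {
    'cats': ['building_id', 'primary_use'],  # assumed column names
    'conts': [],
}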
%%time
from fastai.tabular.all import Categorify, FillMissing, Normalize

procs = [Categorify, FillMissing, Normalize]
df_rep = train_predict(df.copy(), var_names, model, model_params=model_params, 
                       n_rep=n_rep, n_samples_train=n_samples_train,
                       n_samples_valid=n_samples_valid, procs=procs,
                       split_params=split_params)
df_rep['rmse loss'].describe()
import plotly.express as px

px.box(df_rep, y='rmse loss', range_y=(0, 2.5))

Baseline model: a RandomForest with 20 estimators and `sqrt` max features, trained on 100k samples and evaluated on 1k samples.

| input | model | rmse loss | time [s/it] |
|---|---|---|---|
| meter and building id only | random forest | 1.2 - 1.21 | 10.2 |
| using dep_var stats | random forest | 1.16 - 1.18 | 17.3 |
| using time stats | random forest | 1.2 - 1.21 | 13.2 - 13.7 |
| using building info | random forest | 1.19 | 17 - 18 |
| using weather (+ building) info | random forest | 1.13 - 1.139 | 14.6 - 15 |
| using all above | random forest | 1.19 - 1.21 | 20 - 26 |
| removing leading 0s in `dep_var` | random forest | .36 - .37 | 4 |
| removing trailing 0s in `dep_var` | random forest | 1.2 | 4 |
| removing empty weeks before the first full week | random forest | 1.16 | 4 |
| meter only | linear model | 2.2 | 5 |
| meter + hour | linear model | 2.1 | 5 |
| meter reading stats only (meter, building_id, hour) | linear model | 1.23 - 1.24 / 1.68 - 1.7 | 5 |
| meter + meter reading stats (meter, building_id, hour) | linear model | 1.51 - 1.52 | 5 |
| meter reading stats (meter, building_id, hour) | random forest | 0.58 - 0.6 | 5 |
| meter + meter reading stats (meter, building_id, hour) | random forest | 1.21 - 1.22 | 5 |

Plotting predictions with a single model

Comparing dep_var distributions

%%time
df, df_test, var_names = preprocessing.preprocess_all(ashrae_data, tfms_config)
%%time
splits = preprocessing.split_dataset(df, **split_params)
%%time
to = get_tabular_object(df, var_names, splits=splits, procs=procs)
%%time
params = {'n_estimators': 100, 'max_features': 'sqrt',
          'min_samples_leaf':5}
model = ensemble.RandomForestRegressor
m = model(**params)
%%time
# sample one set of rows so features and targets stay aligned
_Xs = to.train.xs.sample(n_samples_train, replace=True)
_X = _Xs.values
_y = to.train.ys.loc[_Xs.index].values.ravel()
m.fit(_X, _y)
%%time
# again sample a single aligned set of validation rows
_Xs = to.valid.xs.sample(n_samples_valid, replace=True)
_X = _Xs.values
_y = to.valid.ys.loc[_Xs.index].values.ravel()
pred = m.predict(_X)
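
As a quick sanity check before plotting, the RMSE on this validation sample can be compared against the table above:

import numpy as np
from sklearn.metrics import mean_squared_error

# rmse on the bootstrapped validation sample
np.sqrt(mean_squared_error(_y, pred))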

hist_plot_preds[source]

hist_plot_preds(y0:ndarray, y1:ndarray, label0:str='y0', label1:str='y1')

hist_plot_preds(_y, pred, label0='truth (valid)', label1='prediction (valid)')
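
The function presumably overlays the two distributions; a minimal plotly sketch under that assumption (not the notebook's actual implementation):

import plotly.graph_objects as go

def hist_plot_preds(y0, y1, label0: str = 'y0', label1: str = 'y1'):
    # overlay two histograms so truth and prediction can be compared directly
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=y0, name=label0, opacity=.5))
    fig.add_trace(go.Histogram(x=y1, name=label1, opacity=.5))
    fig.update_layout(barmode='overlay')
    return fig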

Inspecting confidently wrong predictions

class BoldlyWrongTimeseries[source]

BoldlyWrongTimeseries(xs, y_true, y_pred, info:DataFrame=None)

%%time
_X = to.valid.xs
_y = to.valid.ys.values.ravel()
pred = m.predict(_X.values)
assert _y.shape == pred.shape
%%time
bwt = BoldlyWrongTimeseries(to.valid.xs.join(df.loc[:,['building_id', 'meter','timestamp']], lsuffix='to_'), 
                            _y, pred)
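
Internally the class presumably aggregates the per-row loss by `building_id` and `meter` so the worst series can be ranked; a sketch of that core idea (attribute names are assumptions):

import pandas as pd

class BoldlyWrongTimeseries:
    def __init__(self, xs: pd.DataFrame, y_true, y_pred, info: pd.DataFrame = None):
        self.df = xs.copy()
        self.df['y_true'], self.df['y_pred'] = y_true, y_pred
        # per-row squared error, aggregated to an rmse per (building, meter)
        self.df['loss'] = (self.df['y_true'] - self.df['y_pred']) ** 2
        self.loss_by_series = (self.df.groupby(['building_id', 'meter'])['loss']
                               .mean().pow(.5).sort_values())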

Adding plotting capability based on the loss or meter/building id

BoldlyWrongTimeseries.plot_boldly_wrong[source]

BoldlyWrongTimeseries.plot_boldly_wrong(nth_last:int=None, meter:int=None, bid:int=None)

bwt.plot_boldly_wrong(nth_last=-1)

Adding widgets for interactive exploration

BoldlyWrongTimeseries.init_widgets[source]

BoldlyWrongTimeseries.init_widgets()

BoldlyWrongTimeseries.run_boldly[source]

BoldlyWrongTimeseries.run_boldly()

BoldlyWrongTimeseries.click_boldly_wrong[source]

BoldlyWrongTimeseries.click_boldly_wrong(change)
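
The `change` argument points to an ipywidgets observer; a sketch of how the wiring presumably looks (widget and attribute names are assumptions):

import ipywidgets as widgets
from IPython.display import display

def run_boldly(self):
    # two dropdowns whose value changes trigger a re-plot
    self.meter_widget = widgets.Dropdown(options=sorted(self.df['meter'].unique()),
                                         description='meter')
    self.bid_widget = widgets.Dropdown(options=sorted(self.df['building_id'].unique()),
                                       description='building_id')
    for w in (self.meter_widget, self.bid_widget):
        w.observe(self.click_boldly_wrong, names='value')
    display(self.meter_widget, self.bid_widget)

def click_boldly_wrong(self, change):
    # observer callback: re-draw the series for the current selection
    self.plot_boldly_wrong(meter=self.meter_widget.value,
                           bid=self.bid_widget.value)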

bwt.run_boldly()