# Tree-based models 

## Overview 

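This notebook is an initial exploration of tree-based regression models for predicting monthly ED demand. 

The variables population, people, places and lives vary only annually, so each value is repeated across the monthly rows of a year. Because cross-validation splits the rows at random, those repeated values would appear in both the training and test sets and leak information between them. These variables are therefore excluded from the models.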
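The leakage described above can be illustrated on a toy layout. This is a minimal sketch under stated assumptions: the 10 areas with 12 monthly rows each are hypothetical, and `GroupKFold` is shown only as one possible way to keep repeated values on one side of a split, not as the method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical layout: 10 areas, 12 monthly rows each.
# An annual variable is constant within an area, so the area id stands in for it.
area = np.repeat(np.arange(10), 12)
X_demo = area.reshape(-1, 1).astype(float)

def shared_areas(splitter, **split_kwargs):
    """For each fold, count the areas that appear in both train and test rows."""
    return [len(set(area[train]) & set(area[test]))
            for train, test in splitter.split(X_demo, **split_kwargs)]

# Random splitting puts rows from the same area on both sides of the split
random_overlap = shared_areas(KFold(n_splits=5, shuffle=True, random_state=0))

# Grouped splitting keeps all rows of an area in the same fold
grouped_overlap = shared_areas(GroupKFold(n_splits=5), groups=area)

print('random split, shared areas per fold: ', random_overlap)
print('grouped split, shared areas per fold:', grouped_overlap)
```

With a random split, a tree can match a test row to training rows that carry the same annual value, which inflates the test score.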

For all models, variables used include:

- Service capacity (111, GP, Ambulance)
- Service utility (111, Ambulance)

In [1]:
#turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings('ignore')

## Import libraries 

In [2]:
import os
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold

## Import data 

In [3]:
dta = pd.read_csv('https://raw.githubusercontent.com/CharlotteJames/ed-forecast/main/data/master_scaled_new.csv',
                  index_col=0)

In [4]:
dta.columns = ['_'.join([c.split('/')[0],c.split('/')[-1]]) 
               if '/' in c else c for c in dta.columns]

In [5]:
dta.ccg.unique().shape

(71,)

## Add random feature

In [6]:
# Add one random feature: uniform noise in [0, 1), seeded for reproducibility

rng = np.random.RandomState(0)
rand_var = rng.rand(dta.shape[0])
dta['rand1'] = rand_var

In [7]:
dta.shape

(1425, 13)
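A random column such as `rand1` is often used as a noise baseline: a feature whose importance is no higher than that of pure noise is unlikely to be informative. The sketch below illustrates the idea on synthetic data. The columns `x1` and `x2`, the target `y` and the model settings are assumptions for illustration and are not part of this notebook's data. To apply the same check here, `rand1` would need to be added to the `features` lists used below.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical): y depends strongly on x1, weakly on x2
rng = np.random.RandomState(0)
demo = pd.DataFrame({'x1': rng.rand(300), 'x2': rng.rand(300)})
demo['y'] = 3 * demo['x1'] + 0.5 * demo['x2'] + 0.1 * rng.randn(300)

# Pure-noise column, built the same way as rand1 above
demo['rand1'] = rng.rand(300)

demo_features = ['x1', 'x2', 'rand1']
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(demo[demo_features], demo['y'])

# Importances sum to 1; the value for rand1 acts as a rough noise floor
importances = pd.Series(forest.feature_importances_, index=demo_features)
print(importances.sort_values(ascending=False))
```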

## Fitting function 

In [8]:
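def fit_model(dta, model, features):
    """Cross-validate `model` on `features` and return the cross_validate results."""
    y = dta['ae_attendances_attendances']
    X = dta[features]

    # Repeated 5-fold CV (5 splits x 5 repeats = 25 fits) gives a spread for
    # the scores and keeps each fitted estimator for the feature importances
    cv_model = cross_validate(model, X, y,
                              cv=RepeatedKFold(n_splits=5, n_repeats=5,
                                               random_state=0),
                              return_estimator=True,
                              return_train_score=True, n_jobs=2)

    # Also fit the model that was passed in on the full data set
    model.fit(X, y)

    return cv_model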
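To make the shape of the returned object concrete, the sketch below runs the same `cross_validate` call on a small synthetic data set. The arrays `X_demo` and `y_demo` and the reduced `n_estimators` are assumptions chosen so the example runs quickly. With 5 splits and 5 repeats there are 25 fits, which is why every summary table below reports a count of 25.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in data (hypothetical): 200 rows, 3 features
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

cv_demo = cross_validate(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    X_demo, y_demo,
    cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=0),
    return_estimator=True, return_train_score=True)

# One entry per fit: scores (R^2 by default for regressors) and fitted estimators
n_fits = len(cv_demo['test_score'])
print(sorted(cv_demo.keys()))
print(n_fits, len(cv_demo['estimator']))
```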

## Model Comparison

### Random Forest 

In [9]:
model = RandomForestRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [10]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| statistic | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.333201 | 0.906589 |
| std | 0.069171 | 0.004628 |
| min | 0.251717 | 0.897753 |
| 25% | 0.279984 | 0.90241 |
| 50% | 0.304903 | 0.907451 |
| 75% | 0.362969 | 0.909603 |
| max | 0.498345 | 0.914369 |


#### Feature importances

In [11]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| statistic | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.366641 | 0.165856 | 0.201054 | 0.090782 | 0.175667 |
| std | 0.007121 | 0.021493 | 0.017359 | 0.004463 | 0.022596 |
| min | 0.351474 | 0.1323 | 0.17509 | 0.084022 | 0.135952 |
| 25% | 0.36521 | 0.151593 | 0.185831 | 0.087954 | 0.154818 |
| 50% | 0.36832 | 0.159192 | 0.199516 | 0.090297 | 0.175823 |
| 75% | 0.370201 | 0.181076 | 0.215841 | 0.092903 | 0.189647 |
| max | 0.380198 | 0.211102 | 0.231699 | 0.100936 | 0.23417 |


### Extra Trees

In [12]:
model = ExtraTreesRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [13]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| statistic | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.193561 | 1.0 |
| std | 0.096864 | 0.0 |
| min | 0.071355 | 1.0 |
| 25% | 0.12484 | 1.0 |
| 50% | 0.181068 | 1.0 |
| 75% | 0.219083 | 1.0 |
| max | 0.443345 | 1.0 |


#### Feature importances

In [14]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| statistic | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.402054 | 0.109594 | 0.207259 | 0.086423 | 0.194669 |
| std | 0.01099 | 0.004853 | 0.008885 | 0.00363 | 0.008952 |
| min | 0.378114 | 0.099784 | 0.189729 | 0.077573 | 0.173157 |
| 25% | 0.397852 | 0.106351 | 0.202674 | 0.084358 | 0.190694 |
| 50% | 0.404817 | 0.108629 | 0.207431 | 0.087038 | 0.194526 |
| 75% | 0.407286 | 0.113269 | 0.213277 | 0.088099 | 0.201837 |
| max | 0.424098 | 0.120241 | 0.223125 | 0.095986 | 0.208935 |


### Gradient Boosted Trees

In [15]:
model = GradientBoostingRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [16]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| statistic | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.418383 | 0.575885 |
| std | 0.045061 | 0.013257 |
| min | 0.343425 | 0.553899 |
| 25% | 0.381125 | 0.566043 |
| 50% | 0.4218 | 0.578035 |
| 75% | 0.443099 | 0.580513 |
| max | 0.499701 | 0.598746 |


#### Feature importances

In [17]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| statistic | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.195654 | 0.238214 | 0.352163 | 0.056986 | 0.156983 |
| std | 0.016058 | 0.055671 | 0.04215 | 0.00925 | 0.047319 |
| min | 0.166057 | 0.12561 | 0.292792 | 0.040693 | 0.094238 |
| 25% | 0.185457 | 0.198476 | 0.321992 | 0.052005 | 0.12816 |
| 50% | 0.193253 | 0.246512 | 0.352053 | 0.055817 | 0.141809 |
| 75% | 0.206808 | 0.289712 | 0.375758 | 0.063965 | 0.187729 |
| max | 0.224541 | 0.313029 | 0.442558 | 0.08112 | 0.300498 |


### AdaBoost

In [18]:
model = AdaBoostRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [19]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| statistic | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.363098 | 0.398103 |
| std | 0.050498 | 0.021394 |
| min | 0.257391 | 0.357158 |
| 25% | 0.325664 | 0.379429 |
| 50% | 0.355694 | 0.401322 |
| 75% | 0.403047 | 0.4126 |
| max | 0.458281 | 0.440541 |


#### Feature importances

In [20]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| statistic | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.13932 | 0.188261 | 0.447331 | 0.033578 | 0.19151 |
| std | 0.021558 | 0.065087 | 0.105463 | 0.023787 | 0.115929 |
| min | 0.098634 | 0.057043 | 0.221313 | 0.003519 | 0.020152 |
| 25% | 0.125532 | 0.146272 | 0.409148 | 0.017522 | 0.110426 |
| 50% | 0.144176 | 0.191572 | 0.446521 | 0.029078 | 0.16965 |
| 75% | 0.154744 | 0.223349 | 0.530729 | 0.042222 | 0.238708 |
| max | 0.183413 | 0.327782 | 0.644891 | 0.096912 | 0.464397 |


## Summary 

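- Extra Trees does not perform well: it fits the training data perfectly (train score 1.0) but has the lowest mean test score (0.19)
- Random Forest with default parameters overfits the training data: the mean train score is 0.91 against a mean test score of 0.33
- Gradient Boosted Trees perform best, with the highest mean test score (0.42)
- AdaBoost has the smallest gap between mean train and test scores (0.40 and 0.36) but a lower mean test score than Gradient Boosted Trees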
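
The comparison can be made explicit by collecting the mean scores from the Performance tables above into one table. This is a minimal sketch: the numbers are copied from those tables, and the `gap` column (mean train score minus mean test score) is an illustrative measure of overfitting introduced here, not something computed in the notebook.

```python
import pandas as pd

# Mean cross-validated scores copied from the Performance tables above
summary = pd.DataFrame(
    {'test_mean': [0.333201, 0.193561, 0.418383, 0.363098],
     'train_mean': [0.906589, 1.0, 0.575885, 0.398103]},
    index=['Random Forest', 'Extra Trees',
           'Gradient Boosted Trees', 'AdaBoost'])

# A large train-test gap suggests the model is overfitting
summary['gap'] = summary['train_mean'] - summary['test_mean']

print(summary.sort_values('test_mean', ascending=False))
```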