{ "cells": [ { "cell_type": "markdown", "id": "5891f4b6", "metadata": {}, "source": [ "# Notebook 08: Training more than one ML model\n", "\n", "### Goal: Training a ML using all features/predictors/inputs and all ML methods\n", "\n", "#### Background\n", "\n", "So far in Notebooks 4-7 we have shown you how to train a single ML model for each task. But one really nice thing about ```sklearn``` is that they have coded up all the models we discussed in the paper to adopt the same syntax. This will make more sense in a little bit, but what this enables us to do is train many different ML methods to find which method performs best for our specific task. This generally a good method for designing your own ML projects.\n", "\n", "### Classification, simple \n", "We will start off the same as Notebook 4. " ] }, { "cell_type": "code", "execution_count": 1, "id": "e71637dd", "metadata": {}, "outputs": [], "source": [ "#needed packages \n", "import xarray as xr\n", "import matplotlib.pyplot as plt \n", "import numpy as np\n", "import pandas as pd\n", "\n", "#plot parameters that I personally like, feel free to make these your own.\n", "import matplotlib\n", "matplotlib.rcParams['axes.facecolor'] = [0.9,0.9,0.9] #makes a grey background to the axis face\n", "matplotlib.rcParams['axes.labelsize'] = 14 #fontsize in pts\n", "matplotlib.rcParams['axes.titlesize'] = 14 \n", "matplotlib.rcParams['xtick.labelsize'] = 12 \n", "matplotlib.rcParams['ytick.labelsize'] = 12 \n", "matplotlib.rcParams['legend.fontsize'] = 12 \n", "matplotlib.rcParams['legend.facecolor'] = 'w' \n", "matplotlib.rcParams['savefig.transparent'] = False\n", "\n", "#make default resolution of figures much higher (i.e., High definition)\n", "%config InlineBackend.figure_format = 'retina'\n", "\n", "#import some helper functions for our other directory.\n", "import sys\n", "sys.path.insert(1, '../scripts/')\n", "from aux_functions import load_n_combine_df\n", 
"(X_train,y_train),(X_validate,y_validate),(X_test,y_test) = load_n_combine_df(path_to_data='../datasets/sevir/',features_to_keep=np.arange(0,1,1),class_labels=True)" ] }, { "cell_type": "markdown", "id": "96f9c993", "metadata": {}, "source": [ "But now we will initalize a list of models!" ] }, { "cell_type": "code", "execution_count": 2, "id": "841acc02", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[LogisticRegression(), GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), LinearSVC(dual=False)]\n" ] } ], "source": [ "#load ML code from sklearn\n", "from sklearn.linear_model import LogisticRegression,SGDClassifier\n", "from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "#initialize\n", "model_list = [LogisticRegression(),GaussianNB(),DecisionTreeClassifier(),RandomForestClassifier(),GradientBoostingClassifier(),LinearSVC(dual=False)]\n", "\n", "print(model_list)" ] }, { "cell_type": "markdown", "id": "0771d776", "metadata": {}, "source": [ "since the syntax is identical, we can loop over this list and train all the methods" ] }, { "cell_type": "code", "execution_count": 3, "id": "1779e9d5", "metadata": {}, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'tqdm'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/scratch/local/u0035056/5884459/ipykernel_359337/1348316319.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m#Import a progress bar so we know how long it is taking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m 
\u001b[0;32mimport\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtqdm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel_list\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0my_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'tqdm'" ] } ], "source": [ "#Import a progress bar so we know how long it is taking\n", "import tqdm \n", "\n", "for model in tqdm.tqdm(model_list):\n", " model.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "id": "64efd006", "metadata": {}, "source": [ "Wait for the star to finish up above, then that means it is done training! Way to go, you just trained 6 **different** ML models. Nice right? But lets evaluate them now. 
\n", "\n", "Let's start by looking at the performance diagram first " ] }, { "cell_type": "code", "execution_count": null, "id": "8f68d37d", "metadata": {}, "outputs": [], "source": [ "#load contingency_table func\n", "from gewitter_functions import get_contingency_table,make_performance_diagram_axis,get_acc,get_pod,get_sr,csi_from_sr_and_pod\n", "\n", "#make axis to plot on \n", "ax = make_performance_diagram_axis()\n", "\n", "#make list of colors so each method shows up as a different color\n", "colors=['b','r','g','y','LightGreen','k']\n", "legend_labels = ['LgR','NB','DT','RF','GBT','SVM']\n", "\n", "#loop over all trained models \n", "for idx,model in enumerate(model_list):\n", " #get predictions \n", " yhat = model.predict(X_validate)\n", " #the contingency table calculator expects y_true,y_pred\n", " cont_table = get_contingency_table(y_validate,yhat)\n", " \n", " #get metrics\n", " accuracy = get_acc(cont_table)\n", " pod = get_pod(cont_table)\n", " sr = get_sr(cont_table)\n", " csi = csi_from_sr_and_pod(sr,pod)\n", " \n", " ax.plot(sr,pod,'o',color=colors[idx],markerfacecolor='w',label=legend_labels[idx])\n", " \n", " print('{} accuracy: {}%'.format(legend_labels[idx],np.round(accuracy,0)))\n", " \n", "ax.legend()" ] }, { "cell_type": "markdown", "id": "bae30ce5", "metadata": {}, "source": [ "As shown in the paper, all of the methods basically have the same results with some minor differences. If we look at the AUC of the ROC curve we will see something similar. Just one note though, the SVM method we used here (LinearSVC) does not support ```model.predict_proba```, so we will leave it off here. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "5cd2b3a5", "metadata": {}, "outputs": [], "source": [ "#load contingency_table func\n", "from gewitter_functions import get_points_in_roc_curve,get_area_under_roc_curve\n", "\n", "#something to help with annotating the figure\n", "import matplotlib.patheffects as path_effects\n", "pe = [path_effects.withStroke(linewidth=2,\n", " foreground=\"k\")]\n", "pe2 = [path_effects.withStroke(linewidth=2,\n", " foreground=\"w\")]\n", " \n", "#make figure\n", "fig = plt.figure(figsize=(4.1,5))\n", "#set facecolor to white so you can copy/paste the image somewhere \n", "fig.set_facecolor('w')\n", "\n", "#make list of colors so each method shows up as a different color\n", "colors=['b','r','g','y','LightGreen','k']\n", "legend_labels = ['LgR','NB','DT','RF','GBT','SVM']\n", "\n", "ax = plt.gca()\n", "\n", "#loop over all trained models \n", "for idx,model in enumerate(model_list[:-1]):\n", " #get predictions \n", " yhat_proba = model.predict_proba(X_validate)\n", " \n", " #lets just focus on the output from class 1 (note, the sum of these two columns should be 1)\n", " y_preds = yhat_proba[:,1]\n", " \n", " #get the roc curve\n", " pofds, pods = get_points_in_roc_curve(forecast_probabilities=y_preds, observed_labels=y_validate, threshold_arg=np.linspace(0,1,100))\n", " \n", " #get AUC \n", " auc = get_area_under_roc_curve(pofds,pods)\n", " \n", " ax.plot(pofds,pods,'-',color=colors[idx],label=legend_labels[idx])\n", " \n", " print('{} AUC: {}'.format(legend_labels[idx],np.round(auc,2)))\n", " \n", "ax.legend()\n", "\n", "#set some limits\n", "ax.set_xlim([0,1])\n", "ax.set_ylim([0,1])\n", "\n", "#set the no-skill line\n", "ax.plot([0,1],[0,1],'--',color='Grey')\n", "\n", "#label things\n", "ax.set_title(\"AUC of ROC Curve\")\n", "ax.set_xlabel('POFD')\n", "ax.set_ylabel('POD')\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "b571030d", "metadata": {}, "source": [ "Congrats, you have now trained 
and evaluated multiple models with the same data! I encourage you to go ahead and code up how to do it with all the predictors as inputs. Please note, it does take a bit longer to train, but shouldn't take more than 5-10 mins.\n", "\n", "### Regression, simple \n", "We will start off the same as Notebook 5." ] }, { "cell_type": "code", "execution_count": null, "id": "53cf6864", "metadata": {}, "outputs": [], "source": [ "#needed packages \n", "import xarray as xr\n", "import matplotlib.pyplot as plt \n", "import numpy as np\n", "import pandas as pd\n", "\n", "#plot parameters that I personally like, feel free to make these your own.\n", "import matplotlib\n", "matplotlib.rcParams['axes.facecolor'] = [0.9,0.9,0.9] #makes a grey background to the axis face\n", "matplotlib.rcParams['axes.labelsize'] = 14 #fontsize in pts\n", "matplotlib.rcParams['axes.titlesize'] = 14 \n", "matplotlib.rcParams['xtick.labelsize'] = 12 \n", "matplotlib.rcParams['ytick.labelsize'] = 12 \n", "matplotlib.rcParams['legend.fontsize'] = 12 \n", "matplotlib.rcParams['legend.facecolor'] = 'w' \n", "matplotlib.rcParams['savefig.transparent'] = False\n", "\n", "#make default resolution of figures much higher (i.e., High definition)\n", "%config InlineBackend.figure_format = 'retina'\n", "\n", "#import some helper functions from our other directory.\n", "import sys\n", "sys.path.insert(1, '../scripts/')\n", "from aux_functions import load_n_combine_df\n", "(X_train,y_train),(X_validate,y_validate),(X_test,y_test) = load_n_combine_df(path_to_data='../datasets/sevir/',features_to_keep=np.arange(0,1,1),class_labels=False,dropzeros=True)" ] }, { "cell_type": "markdown", "id": "467dca24", "metadata": {}, "source": [ "Now we will initialize a list of *Regression* models" ] }, { "cell_type": "code", "execution_count": null, "id": "b028d912", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import 
RandomForestRegressor,GradientBoostingRegressor\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.svm import LinearSVR\n", "\n", "#initialize\n", "model_list = [LinearRegression(),DecisionTreeRegressor(),RandomForestRegressor(),GradientBoostingRegressor(),LinearSVR()]\n", "\n", "print(model_list)" ] }, { "cell_type": "markdown", "id": "98a67c80", "metadata": {}, "source": [ "Now go train your ML regressors!" ] }, { "cell_type": "code", "execution_count": null, "id": "7abab05e", "metadata": {}, "outputs": [], "source": [ "#Import a progress bar so we know how long it is taking\n", "import tqdm \n", "\n", "for model in tqdm.tqdm(model_list):\n", "    model.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "id": "7da1566f", "metadata": {}, "source": [ "As before, let's evaluate them. First we will make the one-to-one scatter plot like Figures 14 and 16." ] }, { "cell_type": "code", "execution_count": null, "id": "b9682b76", "metadata": {}, "outputs": [], "source": [ "from aux_functions import boxbin,make_colorbar\n", "#make figure with 2 rows and 3 columns with size 7.5\" by 5\"\n", "fig,axes = plt.subplots(2,3,figsize=(7.5,5))\n", "#set facecolor to white so you can copy/paste it somewhere else\n", "fig.set_facecolor('w')\n", "\n", "#the number of bins for the boxbin method \n", "n = 33\n", "#the bins we want in x and y \n", "xbins = np.logspace(0,3.5,n)\n", "ybins = np.logspace(0,3.5,n)\n", "\n", "#colors I like \n", "r = [255/255,127/255,127/255]\n", "b = [126/255,131/255,248/255]\n", "\n", "#labels\n", "labels= ['LnR','DT','RF','GBT','SVM']\n", "#color list, one for each model \n", "colors= [r,b,'orange','purple','dimgrey']\n", "#colormaps to match the colors in the list above\n", "cmaps=['Reds_r','Blues_r','Oranges_r','Purples_r','Greys_r']\n", "\n", "#force ticks to show up where I want them \n", "locmin = matplotlib.ticker.LogLocator(base=10.0, subs=(0.1,0.2,0.4,0.6,0.8,1,2,4,6,8,10 )) \n", "\n", "#axes is shape [2,3], it is easier to 
loop if we flatten this, which is what ravel does \n", "axes = axes.ravel()\n", "\n", "#some parameters to make it pretty \n", "c_scale = 0.575\n", "fs3 = 11\n", "fs4 = 18\n", "props = dict(boxstyle='square', facecolor='White', alpha=0.75)\n", "annotate_list = ['a)','b)','c)','d)','e)',]\n", "\n", "#draw a new axis for a new colorbar to go on \n", "ax_cbar = fig.add_axes([0.75, 0.15, 0.015,0.33])\n", "#draw that colorbar \n", "cbar = make_colorbar(ax_cbar,0,2,plt.cm.Greys_r)\n", "#label that colorbar (raw string so the backslash reaches matplotlib untouched)\n", "cbar.set_label(r'$\\%$ of total points')\n", "\n", "#loop over axes and draw scatters \n", "for i,ax in enumerate(axes):\n", "    #we have 1 too many subplots, so turn off the last one [5]\n", "    if i==5:\n", "        ax.axis('off')\n", "        break\n", "    #make axes log-log \n", "    ax.semilogy()\n", "    ax.semilogx()\n", "    \n", "    #grab model\n", "    model = model_list[i]\n", "    #get predictions \n", "    yhat = model.predict(X_validate)\n", "    #make scatter plot \n", "    ax.scatter(yhat,y_validate,color=colors[i],s=1,marker='+')\n", "    \n", "    #box and bin up data to show density of points \n", "    ax,cbar,C = boxbin(yhat,y_validate,xbins,ybins,ax=ax,mincnt=100,normed=True,cmap=cmaps[i],vmin=0,vmax=2,cbar=False)\n", "    \n", "    #set some axis limits and ticks \n", "    ax.set_xlim([1,4000])\n", "    ax.set_xticks([1,10,100,1000])\n", "    ax.set_yticks([1,10,100,1000])\n", "    ax.set_ylim([1,4000])\n", "    \n", "    #add diagonal line \n", "    ax.plot([1,4000],[1,4000],'--k',alpha=0.5)\n", "    \n", "    #add a subplot label \n", "    ax.text(0.075, 0.25, annotate_list[i], transform=ax.transAxes,fontsize=fs4,\n", "            verticalalignment='top', bbox=props)\n", "    \n", "    #only label certain x and y axes to save space \n", "    if (i == 0) or (i==3):\n", "        ax.set_ylabel('$y$, [# of flashes]')\n", "    if i==4:\n", "        ax.set_xlabel(r'$\\hat{y}$, [# of flashes]')\n", "    \n", "    #label each subplot title as the method used \n", "    ax.set_title(labels[i])\n", "    \n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "d809d183", 
"metadata": {}, "source": [ "And their quantitative metrics" ] }, { "cell_type": "code", "execution_count": null, "id": "3840d150", "metadata": {}, "outputs": [], "source": [ "from gewitter_functions import get_mae,get_rmse,get_bias,get_r2\n", "\n", "#loop over all trained models \n", "for idx,model in enumerate(model_list):\n", " #get predictions \n", " yhat = model.predict(X_validate)\n", " \n", " mae = get_mae(y_validate,yhat)\n", " rmse = get_rmse(y_validate,yhat)\n", " bias = get_bias(y_validate,yhat)\n", " r2 = get_r2(y_validate,yhat)\n", "\n", " #print them out so we can see them \n", " print('Method: {} .. MAE:{} flashes, RMSE:{} flashes, Bias:{} flashes, Rsquared:{}'.format(labels[idx],np.round(mae,2),np.round(rmse,2),np.round(bias,2),np.round(r2,2)))" ] }, { "cell_type": "markdown", "id": "50962ba1", "metadata": {}, "source": [ "While some are a big fan of tables, I perfer a bar chart (Figures 15 and 17)" ] }, { "cell_type": "code", "execution_count": null, "id": "542dbd19", "metadata": {}, "outputs": [], "source": [ "#some annotation helpers\n", "import matplotlib.patheffects as path_effects\n", "pe = [path_effects.withStroke(linewidth=2,\n", " foreground=\"k\")]\n", "pe2 = [path_effects.withStroke(linewidth=2,\n", " foreground=\"w\")]\n", "#make a 2 row, 2 column figure of size 5\" by 5\"\n", "fig,axes = plt.subplots(2,2,figsize=(5,5))\n", "#set facecolor to white so we can copy/paste it whereever\n", "fig.set_facecolor('w')\n", "\n", "#list of labels for the x-axis \n", "labels= ['LnR','DT','RF','GBT','SVM']\n", "\n", "#loop over all trained models \n", "for i,model in enumerate(model_list):\n", " #get predictions \n", " yhat = model.predict(X_validate)\n", " mae = get_mae(y_validate,yhat)\n", " rmse = get_rmse(y_validate,yhat)\n", " bias = get_bias(y_validate,yhat)\n", " r2 = get_r2(y_validate,yhat)\n", " \n", " ############### subplot 0,0: Bias ########################\n", " ax = axes[0,0]\n", " #put a bar at position i (from our loop)\n", " 
ax.bar(i,bias,width=0.95,color=colors[i])\n", " #make the annotation so we can see the numerical data on the plot \n", " annotate = str(int(np.round(bias))).rjust(3, ' ')\n", " ax.text(i-0.4,bias+5,annotate,color=colors[i],path_effects=pe2)\n", " ##########################################################\n", "\n", " ####### subplot 0,1: Mean Absolute Error #################\n", " ax = axes[0,1]\n", " #put a bar at position i (from our loop)\n", " ax.bar(i,mae,width=0.95,color=colors[i])\n", " #make the annotation so we can see the numerical data on the plot \n", " annotate = str(int(np.round(mae))).rjust(3, ' ')\n", " ax.text(i-0.4,mae+5,annotate,color=colors[i],path_effects=pe2)\n", " ##########################################################\n", " \n", " ####### subplot 1,0: Root Mean Squared Error #############\n", " ax = axes[1,0]\n", " ax.bar(i,rmse,width=0.95,color=colors[i])\n", " annotate = str(int(np.round(rmse))).rjust(3, ' ')\n", " ax.text(i-0.4,rmse+5,annotate,color=colors[i],path_effects=pe2)\n", " ##########################################################\n", " \n", " ####### subplot 1,1: Rsquared ###########################\n", " ax = axes[1,1]\n", " ax.bar(i,r2,width=0.95,color=colors[i])\n", " annotate = str(np.round(r2,2)).ljust(4, '0')\n", " ax.text(i-0.5,r2+0.05,annotate,color=colors[i],path_effects=pe2)\n", " ##########################################################\n", " \n", "\n", " \n", "\n", "#cosmetic things: \n", "ax = axes[0,0]\n", "ax.xaxis.set_ticks(np.arange(0,5))\n", "ax.xaxis.set_ticklabels(labels,rotation=45)\n", "ax.set_ylim([-130,130])\n", "ax.set_title(\"Bias\")\n", "ax.text(0.075, 0.25, annotate_list[0], transform=ax.transAxes,fontsize=fs4,\n", "verticalalignment='top', bbox=props)\n", "\n", "ax = axes[0,1]\n", "ax.set_ylim([0,200])\n", "ax.xaxis.set_ticks(np.arange(0,5))\n", "ax.xaxis.set_ticklabels(labels,rotation=45)\n", "ax.set_title(\"Mean Abs. 
Error\")\n", "ax.text(0.075, 0.25, annotate_list[1], transform=ax.transAxes,fontsize=fs4,\n", "verticalalignment='top', bbox=props)\n", "\n", "ax = axes[1,0]\n", "ax.set_ylim([0,300])\n", "ax.xaxis.set_ticks(np.arange(0,5))\n", "ax.xaxis.set_ticklabels(labels,rotation=45)\n", "ax.set_title(\"Root Mean Sq. Error\")\n", "ax.text(0.075, 0.25, annotate_list[2], transform=ax.transAxes,fontsize=fs4,\n", "verticalalignment='top', bbox=props)\n", "\n", "ax = axes[1,1]\n", "ax.set_ylim([-1,1])\n", "ax.xaxis.set_ticks(np.arange(0,5))\n", "ax.xaxis.set_ticklabels(labels,rotation=45)\n", "ax.set_title(\"$R^{2}$\")\n", "ax.text(0.075, 0.25, annotate_list[3], transform=ax.transAxes,fontsize=fs4,\n", "verticalalignment='top', bbox=props)\n", " \n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "63c511a9", "metadata": {}, "source": [ "There ya go! You have successfully made a metric bar chart to compare the 5 ML regression models you trained. Like earlier in this notebook (the end of the classification task), I encourage you to extend this notebook to now include all predictors!" ] }, { "cell_type": "code", "execution_count": null, "id": "198f3af9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "atmos6010", "language": "python", "name": "u0035056" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }