diff --git a/posts/meta-cm-sql_and_ml_xai-20250126/image.jpg b/posts/meta-cm-sql_and_ml_xai-20250126/image.jpg new file mode 100644 index 00000000..3ec04c8c Binary files /dev/null and b/posts/meta-cm-sql_and_ml_xai-20250126/image.jpg differ diff --git a/posts/meta-cm-sql_and_ml_xai-20250126/index.ipynb b/posts/meta-cm-sql_and_ml_xai-20250126/index.ipynb new file mode 100644 index 00000000..d23203c4 --- /dev/null +++ b/posts/meta-cm-sql_and_ml_xai-20250126/index.ipynb @@ -0,0 +1,1473 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "de6ec63f", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "title: '[DA스터디/6주차] optuna, Autogluon'\n", + "author: 'Kibok Park'\n", + "date: '2025-01-26'\n", + "categories: [Python, optuna, Autogluon, 202412Study_DataAnalysis]\n", + "execute:\n", + " freeze: auto\n", + "toc: true\n", + "draft: false\n", + "format:\n", + " html:\n", + " code-fold: false\n", + "comments:\n", + " giscus:\n", + " repo: kr9268/giscus_for_blog\n", + "---\n", + "금융권 데이터를 활용한 분석 스터디 - 6주차" + ] + }, + { + "cell_type": "markdown", + "id": "5e861fae", + "metadata": {}, + "source": [ + "# 개요\n", + "\n", + "* 아래의 목적/이유로 참가한 스터디에 대한 기록\n", + " * SQLD취득 후 장기 미사용 & GPT를 통한 SQL사용 등으로 많이 잊은 SQL을 복기\n", + " * 기존에 사용해 본 Optuna가 아닌 Autogluon이 커리큘럼에 있어 익혀보고자 함\n", + " * 기존에 관심있던 XAI(설명가능한 AI)를 익히고자 함\n", + "\n", + "* 6주차 요약\n", + " * 모델별 주요 하이퍼 파라미터\n", + " * optuna\n", + " * Autogluon" + ] + }, + { + "cell_type": "markdown", + "id": "6ef59b99", + "metadata": {}, + "source": [ + "# 5주차 과제 내용정리\n", + "\n", + "* SHAP Force plot 여러개 비교 = Row(표본)별 비교\n", + "\n", + "# (추가)Multi label에 대한 Catboost실습" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c707081", + "metadata": {}, + "outputs": [], + "source": [ + "# importing Libraries\n", + "import pandas as pd\n", + "from catboost import CatBoostClassifier\n", + "from sklearn.metrics import accuracy_score, classification_report\n", + "from sklearn.model_selection 
import train_test_split\n", + "import ipywidgets as widgets\n", + "from IPython.display import display\n", + "import joblib\n", + "import numpy as np\n", + "from sklearn.datasets import load_iris\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9eb347b9", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAq4AAAGJCAYAAABLvrEVAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAALfpJREFUeJzt3Qt0VNW9x/F/QkjCK+ERIUTCG0GeesPDFEt5SQBLpdBW1ErgIl4soJBWaa4ghuoN1isgGEC7kEgV8VFBpQqVQIKPRHkYwQfUUBQUCIomgSgJDXPXf681czN5kYSQM5v5ftY6JnPOmTP7nJnIb/b5n30CXC6XSwAAAAAfF+h0AwAAAIDqILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIwHnzwQQkICKiX1xo6dKiZ3NLT081rv/zyy/Xy+lOmTJGOHTuKLztz5ozccccdEhkZaY7NnDlznG6SNWx4fwHUDsEVuAylpqaasOOeQkNDJSoqSuLi4mT58uVy+vTpOnmdY8eOmcCbnZ0tvsaX21Yd//M//2Pex7vuukv++te/yu23317pl40LTaW/JPjS/m3atKlGzykoKJCkpCTp16+fNG3aVBo1aiS9e/eWefPmmfcbwOUvyOkGALh0Fi1aJJ06dZJz587JiRMnTM+m9twtWbJEXnvtNenbt69n3fnz58sf//jHGm1fw4IGCe3duuaaa6r9vH/84x9yqVXVtr/85S9y/vx58WXbt2+X6667ThYuXFjpOhMmTJCuXbt69dJq0P3lL39plrm1adNGfDG4/upXv5Lx48dXa/1//etfMnLkSDly5Ij8+te/ljvvvFOCg4Nl3759smbNGtm4caP885//vOTtBuAsgitwGRszZoz079/f8zgxMdEEop///Ofyi1/8Qj777DPTa6WCgoLMdCn98MMP0rhxYxM4nNSwYUPxdSdPnpSePXtWuY5+8Sj95ePbb781wVXn/fa3v73oNhQWFkqTJk3Eaf/+979NEM/NzTVfvq6//nqv5Q8//LA88sgjjrUPQP2hVADwM8OHD5cFCxbIl19+Kc8++2yVNa5vvfWWCQnNmzc3p2a7d+8u//3f/22WaYAYMGCA+X3q1Kme09J6elvp6Wk9jbtnzx4ZMmSICazu55atcXUrKSkx62hdpwYmDddHjx71Wkd7ULWGsazS27xQ2yqqgdSQ9vvf/16io6MlJCTE7Ov//u//isvl8lpPtzNr1ixzmlv3T9ft1auXbNmypdqBdNq0aaYXVEs49LT3M888U67e9/Dhw/L3v//d0/YvvvhCakPf59/97ndmf/RLSqtWrUyPZdntuctLMjIyzPqtW7eWdu3aeZanpKRI586dzTYGDhwob7/9doXvY1FRkekl1p5gPTZ6PO+77z4zv/Qx1OOt++3ev4reU7e//e1v8tFHH8n9999fLrSqsLAwE16rou/lT37yE7P/ug8xMTEV1lRX9Zl3W7FihXnP9TPdokUL8+Vw/fr1Xut8/fXX8p//+Z/mfXZ/Rp5++ulyr1edbQH4f/S4An5
I6yX1H2M9ZT99+vQK1/nkk09Mz6z23mnJgf7jm5OTI++++65ZfvXVV5v5DzzwgDlt+9Of/tTM13DgdurUKdPrO2nSJNMDeKFT1ho+NMRozaIGvGXLlpnTw1qn6u4Zro7qtK00Dacaknfs2GFCpZYWbN26Ve69914TQJYuXeq1/jvvvCOvvPKKCXjNmjUzdcMTJ040p7E1GFXmxx9/NEFPj6OGXy3jeOmll0xoy8vLk3vuuce0XWta586da4Kjhml1xRVXSG3s2rVL3nvvPfMe6PY0sK5atcq049NPPzWBqTTdJ30tPXYaLpWur+3V46jt0m3oKX4NWqXDrZZf6HHU46PHXfdl//795vjpaXx3Tavun154pgFY11NdunSpdB+0rEVVVOdbXY8//rhp22233SbFxcWyYcMGE+A3b94sN954Y7U+8+4yk7vvvtuUOej7dfbsWVOu8P7778utt95q1tGeYS3zcH/J0eP55ptvms+W1um6L7SrzrYAlOECcNlZu3atdhO6du3aVek64eHhrmuvvdbzeOHCheY5bkuXLjWPv/nmm0q3odvXdfT1yvrZz35mlq1evbrCZTq57dixw6x75ZVXugoKCjzzX3zxRTP/8ccf98zr0KGDKz4+/oLbrKpt+nzdjtumTZvMug899JDXer/61a9cAQEBrpycHM88XS84ONhr3kcffWTmr1ixwlWVZcuWmfWeffZZz7zi4mJXbGysq2nTpl77ru278cYbXTWh75VuX99Ltx9++KHcepmZmWa9devWlfvMXH/99a5///vfnvlFRUWuVq1auQYMGOA6d+6cZ35qaqpZv/Qx/+tf/+oKDAx0vf32216vp58BXffdd9/1zGvSpEmF72NF9HOqn9fqKvv+VnQc9Lj37t3bNXz48Bp95m+66SZXr169qnz9adOmudq2bev69ttvveZPmjTJ7Ie7LdXZFgBvlAoAfkpPg1Y1uoCeKlWvvvpqrS9k0h4rPVVfXZMnTzY9mG7aE9W2bVt544035FLS7Tdo0MD0fpWmvZ2aVbW3rDTtBS7dQ6g9dHq6Wi8gutDraBnELbfc4lVvq6+rF1bpafq6VrqnWi/S015wPY2v7+/evXvLra898Hos3Hbv3m2eo/NL10Brz6X2uJamvcfay9qjRw9Tb+uetDxFaY92bWgvZenPxcUeh++//17y8/NND3LpY1Cdz7yu89VXX5me7Iro50VLG8aNG2d+L30cdFQPfV33a15oWwDKI7gCfkqDUlVh4Oabb5bBgwebU7p6il9PNb/44os1CrFXXnlljS7E6tatm9djPdWqIau29Z01qQPV4cLKHg8NYe7lpbVv377cNjTEaSC60OvoPgYGBlbrdeqClifoaX937W5ERIQ5da2lCRqiytLyhbJtVqVHL1AaYsvWCX/++efmdLtuv/R01VVXmeVa/lEb+qXgYodw05IAPX2vdcUtW7Y07dISiNLHoDqfeS1j0S99Wuag7+XMmTO9Sgm++eYbc2yfeuqpcsfB/SXOfRwutC0A5VHjCvgh7eXRf7DLhpGyPVQ7d+40vWR6kZBefPTCCy+Y3jOtjS3dK1fVNupaZTdJ0Au7qtOmulDZ65S9kMsXzJ49W9auXWvqKmNjYyU8PNwcQw1lFX0JuZj3TLfXp08fM9xaRTQ814b24H744YfmQr3abEMvJNP6Vr1IcOXKlaYXX3u69biUvhCqOp95/ZJx8OBBE4R1ufau6jb1y4EOv+Y+plrTHR8fX2F73CNBXGhbAMojuAJ+SC+OUXrqsiraMzhixAgzaRjRsTf1ym79h11Pl9f1nba0x65sENSLY0oP+aQ9m9qjVZb2DOpV7241aVuHDh1k27ZtplevdK/rgQMHPMvrgm5HL77RcFO617WuX6c0vXJeA9Rjjz3mmacXAVV0DCtrs9L3YdiwYV5DVGlPeOn3Rssn9Op//bxc6PjX5P3R0+7PP/+8GQVDh3SrKQ2E2tOqF9xpr7ObBte
afuaVjnihvbM66YVeOlSXXliobdOeVf0M6Rcp9/pVqWpb2mYA3igVAPyMjuP6pz/9yZwS1jrFynz33Xfl5rkH8ncPbeQe47O6IehC1q1b53VKWEPX8ePHzcgEpcNRVlaW+UfeTXusyg6bVZO2jR071gSNJ554wmu+Xg2vAav0618MfR29EYT24pUOgDokkp4y/tnPfiZ1TXsJy/YE6+vp/laHDs+kIyXoFfDaVrfnnnuuXGnEb37zGzMKg65bUcmCe5QC9/tT3c+N1jprT64GuszMzHLL9TOj4bKqY6DvY+l91tBd9s5d1fnMa71vaVoKo+Pt6jHWGmJ9LR1hQsPyxx9/XG57WkrgdqFtASiPHlfgMqYXFWlvngYOHaJHQ6uOU6m9aDrEUFU9OjockJ421aGCdH2ty9PTmDr8kXssTQ2ReoHJ6tWrTS+ThpFBgwaVq5OsLq091G1rLaC2V4fD0nKG0kN2af2hBtrRo0eboHTo0CHTE1d2OKWatE179LQ3UcOPBhodW1VPDetFOnqKvaqhmmpCh3568sknzfBXOr6t1ojqvmhdo+7rxV6AVBEd3kl72LVEQEORBj/tXa5q2K6yYUrH+NWSAz1lrsdcj5GO+6rHpXTPqQ5XpTWhM2bMMD2UWi+qYVE/gzpfezzdN8TQcVS1HdqrqfXF+r7o+1MRPa2vw49pD6ae7tc26LZ1vtbU6ul+7YmvbCxX/Qzr6+hnRoeZ0s+yjkurny3tAa/JZ37UqFHmAjt9fa2D1Zt46BcefY77/Vu8eLHZf90f/ezqcddQrBdl6T67A3J1tgWgjDKjDAC4DLiHNnJPOnxTZGSk64YbbjBDS5Uedqmy4bDS0tLMcD1RUVHm+frzlltucf3zn//0et6rr77q6tmzpysoKMhr+CkdJqmyoX4qGw7r+eefdyUmJrpat27tatSokRkO6ssvvyz3/Mcee8wMnRUSEuIaPHiwa/fu3eW2WVXbKhou6fTp0665c+ea/WzYsKGrW7durkcffdR1/vx5r/V0OzNnzizXpsqG6SorNzfXNXXqVFdERIQ5rn369KlwyK66Gg7r+++/97yeDrkVFxfnOnDgQLn2XmgIteXLl5vn6DEfOHCgGdoqJibGNXr06HLDTD3yyCPmvdd1W7RoYdZLSkpy5efne9bTNgwZMsS8z/q61Tl2ui8PPPCAOWaNGzd2hYaGmiGt9DNz/Phxz3oVvb9r1qwx76m2qUePHmZ/a/OZf/LJJ027dYgw3VaXLl1c9957r9e+ud9n/ZxER0ebz5P+/Y0YMcL11FNP1XhbAP5fgP6nbJgFAKAqWqer9Zxak1lRaQAAXArUuAIAqqQXc5Xt49B6ZD3lXdGtewHgUqHHFQBQpfT0dHOrV71FqtbGaq3mmjVrzHBOWqtbk7F6AeBicHEWAKBKehGZjp+6fPly08uqF9HpXc70IiRCK4D6RI8rAAAArECNKwAAAKxAcAUAAIAVgvxhyJZjx46ZwZzr+vaUAAAAuHhauap3wdMbkpS+JbbfBVcNrXpRAQAAAHyb3r5b71bnt8HVfds8PRBhYWFONwcAAABlFBQUmI7GC93u+LIPru7yAA2tBFcAAADfdaGyTi7OAgAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYIcroBAPxHzL3rnG4CUM6eRyc73QQA1USPKwAAAKxAcAUAAIAVCK4AAACwgs8E18WLF0tAQIDMmTPHM+/s2bMyc+ZMadWqlTRt2lQmTpwoubm5jrYTAAAAfhxcd+3aJU8++aT07dvXa/7cuXPl9ddfl5deekkyMjLk2LFjMmHCBMfaCQAAAD8OrmfOnJHbbrtN/vKXv0iLFi088/Pz82XNmjWyZMkSGT58uMTExMjatWvlvffek6ysLEfbDAAAAD8MrloKcOONN8rIkSO95u/Zs0fOnTvnNb9
Hjx7Svn17yczMrHR7RUVFUlBQ4DUBAADAfo6O47phwwbZu3evKRUo68SJExIcHCzNmzf3mt+mTRuzrDLJycmSlJQklxrjUcLXMBYlAOBy51iP69GjR+Wee+6R5557TkJDQ+tsu4mJiabMwD3p6wAAAMB+jgVXLQU4efKk/Md//IcEBQWZSS/AWr58uflde1aLi4slLy/P63k6qkBkZGSl2w0JCZGwsDCvCQAAAPZzrFRgxIgRsn//fq95U6dONXWs8+bNk+joaGnYsKGkpaWZYbDUwYMH5ciRIxIbG+tQqwEAAOB3wbVZs2bSu3dvr3lNmjQxY7a650+bNk0SEhKkZcuWpud09uzZJrRed911DrUaAAAAfnlx1oUsXbpUAgMDTY+rjhYQFxcnK1eudLpZAAAA8Pfgmp6e7vVYL9pKSUkxEwAAAPybTwVXAABQHkMwwtfscWgIRsdvQAAAAABUB8EVAAAAViC4AgAAwAoEVwAAAFiB4AoAAAArEFwBAABgBYIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYgeAKAAAAKxBcAQAAYAWCKwAAAKzgaHBdtWqV9O3bV8LCwswUGxsrb775pmf50KFDJSAgwGuaMWOGk00GAACAQ4LEQe3atZPFixdLt27dxOVyyTPPPCM33XSTfPjhh9KrVy+zzvTp02XRokWe5zRu3NjBFgMAAMAvg+u4ceO8Hj/88MOmFzYrK8sTXDWoRkZGOtRCAAAA+AqfqXEtKSmRDRs2SGFhoSkZcHvuueckIiJCevfuLYmJifLDDz9UuZ2ioiIpKCjwmgAAAGA/R3tc1f79+01QPXv2rDRt2lQ2btwoPXv2NMtuvfVW6dChg0RFRcm+fftk3rx5cvDgQXnllVcq3V5ycrIkJSXV4x4AAADAL4Jr9+7dJTs7W/Lz8+Xll1+W+Ph4ycjIMOH1zjvv9KzXp08fadu2rYwYMUIOHTokXbp0qXB72iubkJDgeaw9rtHR0fWyLwAAALiMg2twcLB07drV/B4TEyO7du2Sxx9/XJ588sly6w4aNMj8zMnJqTS4hoSEmAkAAACXF5+pcXU7f/68qVOtiPbMKu15BQAAgH9xtMdVT+uPGTNG2rdvL6dPn5b169dLenq6bN261ZQD6OOxY8dKq1atTI3r3LlzZciQIWbsVwAAAPgXR4PryZMnZfLkyXL8+HEJDw83gVRD6w033CBHjx6Vbdu2ybJly8xIA1qnOnHiRJk/f76TTQYAAIA/Btc1a9ZUukyDql6kBQAAAPhkjSsAAABQEYIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYgeAKAAAAKxBcAQAAYAWCKwAAAKxAcAUAAIAVCK4AAACwAsEVAAAAViC4AgAAwAoEVwAAAFiB4AoAAAArEFwBAABgBUeD66pVq6Rv374SFhZmptjYWHnzzTc9y8+ePSszZ86UVq1aSdOmTWXixImSm5vrZJMBAADgj8G1Xbt2snjxYtmzZ4/s3r1bhg8fLjfddJN88sknZvncuXPl9ddfl5deekkyMjLk2LFjMmHCBCebDAAAAIcEiYPGjRvn9fjhhx82vbBZWVkm1K5Zs0bWr19vAq1au3atXH311Wb5dddd51CrAQAA4Nc1riUlJbJhwwYpLCw0JQPaC3vu3DkZOXKkZ50ePXpI+/btJTMzs9LtFBUVSUFBgdcEAAAA+zkeXPfv32/qV0NCQmTGjBmyceNG6dmzp5w4cUKCg4OlefPmXuu3adPGLKtMcnKyhIeHe6bo6Oh62AsAAABc9sG1e/fukp2dLe+//77cddddEh8fL59
++mmtt5eYmCj5+fme6ejRo3XaXgAAAPhhjavSXtWuXbua32NiYmTXrl3y+OOPy8033yzFxcWSl5fn1euqowpERkZWuj3tudUJAAAAlxfHe1zLOn/+vKlT1RDbsGFDSUtL8yw7ePCgHDlyxNTAAgAAwL842uOqp/XHjBljLrg6ffq0GUEgPT1dtm7daupTp02bJgkJCdKyZUszzuvs2bNNaGVEAQAAAP/jaHA9efKkTJ48WY4fP26Cqt6MQEPrDTfcYJYvXbpUAgMDzY0HtBc2Li5OVq5c6WSTAQAA4I/BVcdprUpoaKikpKSYCQAAAP7N52pcAQAAgIoQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYgeAKAAAAKxBcAQAAYAWCKwAAAKxAcAUAAIAVCK4AAACwAsEVAAAAViC4AgAAwAoEVwAAAFiB4AoAAAArEFwBAABgBYIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACs4GlyTk5NlwIAB0qxZM2ndurWMHz9eDh486LXO0KFDJSAgwGuaMWOGY20GAACAHwbXjIwMmTlzpmRlZclbb70l586dk1GjRklhYaHXetOnT5fjx497pj//+c+OtRkAAADOCBIHbdmyxetxamqq6Xnds2ePDBkyxDO/cePGEhkZ6UALAQAA4Ct8qsY1Pz/f/GzZsqXX/Oeee04iIiKkd+/ekpiYKD/88EOl2ygqKpKCggKvCQAAAPZztMe1tPPnz8ucOXNk8ODBJqC63XrrrdKhQweJioqSffv2ybx580wd7CuvvFJp3WxSUlI9thwAAAB+FVy11vXjjz+Wd955x2v+nXfe6fm9T58+0rZtWxkxYoQcOnRIunTpUm472iObkJDgeaw9rtHR0Ze49QAAAPCL4Dpr1izZvHmz7Ny5U9q1a1fluoMGDTI/c3JyKgyuISEhZgIAAMDlxdHg6nK5ZPbs2bJx40ZJT0+XTp06XfA52dnZ5qf2vAIAAMB/BDldHrB+/Xp59dVXzViuJ06cMPPDw8OlUaNGphxAl48dO1ZatWplalznzp1rRhzo27evk00HAACAPwXXVatWeW4yUNratWtlypQpEhwcLNu2bZNly5aZsV21VnXixIkyf/58h1oMAAAAvy0VqIoGVb1JAQAAAFCrcVw7d+4sp06dKjc/Ly/PLAMAAAB8Irh+8cUXUlJSUuHg/19//XVdtAsAAACofanAa6+95vl969at5iIqNw2yaWlp0rFjx5psEgAAAKj74Dp+/HjzMyAgQOLj472WNWzY0ITWxx57rCabBAAAAOo+uOptWZWOt7pr1y6JiIioydMBAACA+h1V4PDhw7V/RQAAAKA+h8PSeladTp486emJdXv66adru1kAAACg7oJrUlKSLFq0SPr3729uvao1rwAAAIDPBdfVq1dLamqq3H777XXfIgAAAKCuxnEtLi6Wn/zkJ7V5KgAAAFB/wfWOO+6Q9evX1+4VAQAAgPoqFTh79qw89dRTsm3bNunbt68Zw7W0JUuW1FX7AAAAgNoH13379sk111xjfv/444+9lnGhFgAAAHwmuO7YsaPuWwIAAADUdY0rAAAAYEWP67Bhw6osCdi+ffvFtAkAAACom+Dqrm91O3funGRnZ5t61/j4+NpsEgAAAKj74Lp06dIK5z/44INy5syZ2mwSAAAAqL8a19/+9rfy9NNP1+UmAQAAgLoPrpmZmRIaGlqXmwQAAABqXyowYcIEr8cul0uOHz8uu3fvlgULFtRmkwAAAEDdB9fw8HCvx4GBgdK9e3dZtGiRjBo1qjabBAAAAOo+uK5du7Y2TwMAAADqN7i67dmzRz777DPze69eveTaa6+tq3YBAAAAFx9cT548KZMmTZL09HRp3ry5mZeXl2duTLBhwwa54oorarNZAAAAoG5HFZg9e7acPn1aPvnkE/nuu+/MpDc
fKCgokLvvvrva20lOTpYBAwZIs2bNpHXr1jJ+/Hg5ePCg1zpnz56VmTNnSqtWraRp06YyceJEyc3NrU2zAQAA4G/BdcuWLbJy5Uq5+uqrPfN69uwpKSkp8uabb1Z7OxkZGSaUZmVlyVtvvWXuwKUXdxUWFnrWmTt3rrz++uvy0ksvmfWPHTtWblQDAAAAXP5qVSpw/vx5adiwYbn5Ok+X1SQAl5aammp6XrV2dsiQIZKfny9r1qyR9evXy/Dhwz0Xhmlg1rB73XXX1ab5AAAA8JceVw2R99xzj+n9dPv6669N7+iIESNq3RgNqqply5bmpwZY7YUdOXKkZ50ePXpI+/btzc0OKlJUVGRKFkpPAAAA8NPg+sQTT5hA2LFjR+nSpYuZOnXqZOatWLGiVg3Rnto5c+bI4MGDpXfv3mbeiRMnJDg42HMBmFubNm3MssrqZnWcWfcUHR1dq/YAAADgMigV0DC4d+9e2bZtmxw4cMDM09P3pXtGa0prXfUCr3feeUcuRmJioiQkJHgea5gmvAIAAPhZj+v27dvNRVgaBgMCAuSGG24wIwzopKMD6Fiub7/9do0bMWvWLNm8ebPs2LFD2rVr55kfGRkpxcXFZqit0nRUAV1WkZCQEAkLC/OaAAAA4GfBddmyZTJ9+vQKw6Celv+v//ovWbJkSbW353K5TGjduHGjCcVablBaTEyMueArLS3NM0+Hyzpy5IjExsbWpOkAAADwp+D60UcfyejRoytdrkNZ6QVVNSkPePbZZ82oATqWq9at6vTjjz96wvC0adPMqX/tjdVtT5061YRWRhQAAADwLzWqcdVT9BUNg+XZWFCQfPPNN9Xe3qpVq8zPoUOHes3XIa+mTJlifl+6dKkEBgaaGw/oiAFxcXFmDFkAAAD4lxoF1yuvvNJcQNW1a9cKl+/bt0/atm1bo1KBCwkNDTU3NtAJAAAA/qtGpQJjx46VBQsWmNuwlqWn9xcuXCg///nP67J9AAAAQM17XOfPny+vvPKKXHXVVeaiqu7du5v5OiSW9oiWlJTI/fffX5NNAgAAAHUfXHXg//fee0/uuusuM16q+1S/Do2ltacaXnUdAAAAwPEbEHTo0EHeeOMN+f777yUnJ8eE127dukmLFi3qvHEAAADARd05S2lQ1ZsOAAAAAD53cRYAAADgFIIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYgeAKAAAAKxBcAQAAYAWCKwAAAKxAcAUAAIAVCK4AAACwgqPBdefOnTJu3DiJioqSgIAA2bRpk9fyKVOmmPmlp9GjRzvWXgAAAPhpcC0sLJR+/fpJSkpKpetoUD1+/Lhnev755+u1jQAAAPANQU6++JgxY8xUlZCQEImMjKz2NouKiszkVlBQcFFtBAAAgG/w+RrX9PR0ad26tXTv3l3uuusuOXXqVJXrJycnS3h4uGeKjo6ut7YCAADAT4OrlgmsW7dO0tLS5JFHHpGMjAzTQ1tSUlLpcxITEyU/P98zHT16tF7bDAAAgMuwVOBCJk2a5Pm9T58+0rdvX+nSpYvphR0xYkSlpQU6AQAA4PLi0z2uZXXu3FkiIiIkJyfH6aYAAACgnlkVXL/66itT49q2bVunmwIAAAB/KhU4c+aMV+/p4cOHJTs7W1q2bGmmpKQkmThxohlV4NChQ3LfffdJ165dJS4uzslmAwAAwN+C6+7du2XYsGGexwkJCeZnfHy8rFq1Svbt2yfPPPOM5OXlmZsUjBo1Sv70pz9RwwoAAOCHHA2uQ4cOFZfLVenyrVu31mt7AAAA4LusqnEFAACA/yK4AgAAwAoEVwAAAFiB4AoAAAArEFwBAABgBYIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAE
AAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYgeAKAAAAKxBcAQAAYAWCKwAAAKxAcAUAAIAVCK4AAACwgqPBdefOnTJu3DiJioqSgIAA2bRpk9dyl8slDzzwgLRt21YaNWokI0eOlM8//9yx9gIAAMBPg2thYaH069dPUlJSKlz+5z//WZYvXy6rV6+W999/X5o0aSJxcXFy9uzZem8rAAAAnBXk5IuPGTPGTBXR3tZly5bJ/Pnz5aabbjLz1q1bJ23atDE9s5MmTarn1gIAAMBJPlvjevjwYTlx4oQpD3ALDw+XQYMGSWZmZqXPKyoqkoKCAq8JAAAA9vPZ4KqhVWkPa2n62L2sIsnJySbguqfo6OhL3lYAAAD4cXCtrcTERMnPz/dMR48edbpJAAAAuJyDa2RkpPmZm5vrNV8fu5dVJCQkRMLCwrwmAAAA2M9ng2unTp1MQE1LS/PM03pVHV0gNjbW0bYBAADAz0YVOHPmjOTk5HhdkJWdnS0tW7aU9u3by5w5c+Shhx6Sbt26mSC7YMECM+br+PHjnWw2AAAA/C247t69W4YNG+Z5nJCQYH7Gx8dLamqq3HfffWas1zvvvFPy8vLk+uuvly1btkhoaKiDrQYAAIDfBdehQ4ea8Voro3fTWrRokZkAAADg33y2xhUAAAAojeAKAAAAKxBcAQAAYAWCKwAAAKxAcAUAAIAVCK4AAACwAsEVAAAAViC4AgAAwAoEVwAAAFiB4AoAAAArEFwBAABgBYIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYwaeD64MPPigBAQFeU48ePZxuFgAAABwQJD6uV69esm3bNs/joCCfbzIAAAAuAZ9PgRpUIyMjnW4GAAAAHObTpQLq888/l6ioKOncubPcdtttcuTIkSrXLyoqkoKCAq8JAAAA9vPp4Dpo0CBJTU2VLVu2yKpVq+Tw4cPy05/+VE6fPl3pc5KTkyU8PNwzRUdH12ubAQAA4IfBdcyYMfLrX/9a+vbtK3FxcfLGG29IXl6evPjii5U+JzExUfLz8z3T0aNH67XNAAAA8NMa19KaN28uV111leTk5FS6TkhIiJkAAABwefHpHteyzpw5I4cOHZK2bds63RQAAADUM58Orn/4wx8kIyNDvvjiC3nvvffkl7/8pTRo0EBuueUWp5sGAACAeubTpQJfffWVCamnTp2SK664Qq6//nrJysoyvwMAAMC/+HRw3bBhg9NNAAAAgI/w6VIBAAAAwI3gCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAABYgeAKAAAAKxBcAQAAYAWCKwAAAKxAcAUAAIAVCK4AAACwAsEVAAAAViC4AgAAwAoEVwAAAFiB4AoAAAArEFwBAABgBYIrAAAArEBwBQAAgBUIrgAAALACwRUAAABWILgCAADACgRXAAAAWMGK4JqSkiIdO3aU0NBQGTRokHzwwQdONwkAAAD1zOeD6wsvvCAJCQmycOFC2bt3r/Tr10/i4uLk5MmTTjcNAAAA9cjng+uSJUtk+vTpMnXqVOnZs6esXr1aGjduLE8//bTTTQMAAEA9ChIfVlxcLHv27JHExETPvMDAQBk5cqRkZmZW+JyioiIzueXn55ufBQUFddq2kqIf63R7wMWq68/4pcDfDXwRfzuA83837u25XC57g+u3334rJSUl0qZNG6/5+vjAgQMVPic5OVmSkpLKzY+Ojr5k7QR8QfiKGU43AbASfzuA7/zdnD59WsLDw+0MrrWhvbNaE+t2/vx5+e6776RVq1YSEBDgaNtQ/tuVfqE4evSohIWFOd0cwBr87QC1w9+O79KeVg2tUVFRVa7n08E1IiJCGjRoILm
5uV7z9XFkZGSFzwkJCTFTac2bN7+k7cTF0f958D8QoOb42wFqh78d31RVT6sVF2cFBwdLTEyMpKWlefWg6uPY2FhH2wYAAID65dM9rkpP+8fHx0v//v1l4MCBsmzZMiksLDSjDAAAAMB/+Hxwvfnmm+Wbb76RBx54QE6cOCHXXHONbNmypdwFW7CPlnTo+LxlSzsAVI2/HaB2+NuxX4DrQuMOAAAAAD7Ap2tcAQAAADeCKwAAAKxAcAUAAIAVCK4AAACwAsEVjklJSZGOHTtKaGioDBo0SD744AOnmwT4tJ07d8q4cePMnWX0ToCbNm1yukmAz9NbwQ8YMECaNWsmrVu3lvHjx8vBgwedbhZqieAKR7zwwgtmjF4dlmTv3r3Sr18/iYuLk5MnTzrdNMBn6RjW+reiX/oAVE9GRobMnDlTsrKy5K233pJz587JqFGjzN8T7MNwWHCE9rDqN+AnnnjCc0c0vX/07Nmz5Y9//KPTzQN8nva4bty40fQeAag+HRtee1410A4ZMsTp5qCG6HFFvSsuLpY9e/bIyJEjPfMCAwPN48zMTEfbBgC4vOXn55ufLVu2dLopqAWCK+rdt99+KyUlJeXufqaP9e5oAABcCnp2b86cOTJ48GDp3bu3083B5XjLVwAAgLqgta4ff/yxvPPOO043BbVEcEW9i4iIkAYNGkhubq7XfH0cGRnpWLsAAJevWbNmyebNm83oHO3atXO6OaglSgVQ74KDgyUmJkbS0tK8Tt/o49jYWEfbBgC4vOg16Bpa9WLG7du3S6dOnZxuEi4CPa5whA6FFR8fL/3795eBAwfKsmXLzNAkU6dOdbppgM86c+aM5OTkeB4fPnxYsrOzzUUm7du3d7RtgC+XB6xfv15effVVM5ar+1qK8PBwadSokdPNQw0xHBYco0NhPfroo+Z/Itdcc40sX77cDJMFoGLp6ekybNiwcvP1S2BqaqojbQJsGDquImvXrpUpU6bUe3twcQiuAAAAsAI1rgAAALACwRUAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAOCjd/vZtGmT080AAJ9CcAUAB+itjmfPni2dO3eWkJAQiY6OlnHjxklaWprTTQMAnxXkdAMAwN988cUXMnjwYGnevLk8+uij0qdPHzl37pxs3bpVZs6cKQcOHHC6iQDgk+hxBYB69rvf/c6UAnzwwQcyceJEueqqq6RXr16SkJAgWVlZFT5n3rx5Zr3GjRubXtoFCxaYsOv20UcfybBhw6RZs2YSFhYmMTExsnv3brPsyy+/NL25LVq0kCZNmpjXeuONN+ptfwGgrtDjCgD16LvvvpMtW7bIww8/bEJkWdoLWxENpKmpqRIVFSX79++X6dOnm3n33XefWX7bbbfJtddeK6tWrZIGDRpIdna2NGzY0CzTXtzi4mLZuXOnec1PP/1UmjZteon3FADqHsEVAOpRTk6OuFwu6dGjR42eN3/+fM/vHTt2lD/84Q+yYcMGT3A9cuSI3HvvvZ7tduvWzbO+LtOeXS1JUNpjCwA2olQAAOqRhtbaeOGFF0xdbGRkpOkt1SCrgdRNywzuuOMOGTlypCxevFgOHTrkWXb33XfLQw89ZJ6/cOFC2bdvX53sCwDUN4IrANQj7QnV+taaXICVmZlpSgHGjh0rmzdvlg8//FDuv/9+c/rf7cEHH5RPPvlEbrzxRtm+fbv07NlTNm7caJZpoP3Xv/4lt99+uykz6N+/v6xYseKS7B8AXEoBrtp+/QcA1MqYMWNMgDx48GC5Ote8vDxT56rhVoPn+PHj5bHHHpOVK1d69aJqGH355ZfN+hW55ZZbpLCwUF577bVyyxITE+Xvf/87Pa8ArEOPKwDUs5SUFCkpKZGBAwfK3/72N/n888/ls88+k+XLl0tsbGyFvbRaFqA1rRpedT13b6r68ccfZdasWZKenm5GEHj33Xdl165dcvXVV5vlc+b
MMUNtHT58WPbu3Ss7duzwLAMAm3BxFgDUM704SgOkjizw+9//Xo4fPy5XXHGFGcJKRwUo6xe/+IXMnTvXhNOioiJTDqDDYWl5gNJRBE6dOiWTJ0+W3NxciYiIkAkTJkhSUpJZriFZRxb46quvzFBZo0ePlqVLl9b7fgPAxaJUAAAAAFagVAAAAABWILgCAADACgRXAAAAWIHgCgAAACsQXAEAAGAFgisAAACsQHAFAACAFQiuAAAAsALBFQAAAFYguAIAAMAKBFcAAACIDf4PY8iCI9rMXX0AAAAASUVORK5CYII=", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Load the Iris dataset\n", + "iris_df = load_iris()\n", + "X = iris_df.data\n", + "y = iris_df.target\n", + "\n", + "# Split the data into training and testing sets\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "\n", + "# Create a figure with specified dimensions (8x4)\n", + "plt.figure(figsize=(8, 4))\n", + "\n", + "# Generate a countplot to display the distribution of target classes (y_train)\n", + "sns.countplot(x=y_train)\n", + "\n", + "# Set the title, x-axis label, and y-axis label for the plot\n", + "plt.title('Distribution of Target Classes')\n", + "plt.xlabel('Class')\n", + "plt.ylabel('Count')\n", + "\n", + "# Display the plot\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "567f18e0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy: 1.0\n", + "Classification Report:\n", + " precision recall f1-score support\n", + "\n", + " 0 1.00 1.00 1.00 10\n", + " 1 1.00 1.00 1.00 9\n", + " 2 1.00 1.00 1.00 11\n", + "\n", + " accuracy 1.00 30\n", + " macro avg 1.00 1.00 1.00 30\n", + "weighted avg 1.00 1.00 1.00 30\n", + "\n" + ] + } + ], + "source": [ + "# 모델 설정 및 학습\n", + "model = CatBoostClassifier(\n", + " iterations=100, depth=6, learning_rate=0.1,\n", + " loss_function='MultiClass', verbose=False)\n", + "model.fit(X_train, y_train)\n", + "\n", + "# Make predictions on the test set\n", + "y_pred = model.predict(X_test)\n", + "\n", + "# Evaluate the model\n", + "accuracy = accuracy_score(y_test, y_pred)\n", + "report = classification_report(y_test, y_pred)\n", + "# printing metrics\n", + "print(f\"Accuracy: {accuracy}\")\n", + "print(\"Classification Report:\\n\", report)\n" + ] + }, + { + "cell_type": "markdown", + "id": "87231608", + "metadata": {}, + "source": [ + "# (추가)VS CODE 팝업이 떠서 찾아보는 Tensorboard\n", + "\n", + "* 
Catboost는 TensorBoard를 직접 지원하지 않아, 타 모델로 향후 실험 예정" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99dbcdf3", + "metadata": {}, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "\n", + "# TensorBoard 로그를 저장할 디렉토리 설정\n", + "log_dir = \"logs/fit\"\n", + "\n", + "# SummaryWriter 설정\n", + "tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)\n", + "\n", + "# 모델 학습 시 callback 추가\n", + "model.fit(X_train, y_train, callbacks=[tensorboard_callback])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a202d44", + "metadata": {}, + "outputs": [], + "source": [ + "!tensorboard --logdir=logs/fit" + ] + }, + { + "cell_type": "markdown", + "id": "09976f5f", + "metadata": {}, + "source": [ + "# 6주차 수업정리\n", + "\n", + "## 하이퍼 파라미터\n", + "\n", + "* 하이퍼 파라미터 튜닝 : 모델의 초기 설정값을 최적의 값으로 구하는 것\n", + "\n", + "### 하이퍼 파라미터 - RF(Random Forest, Bagging)\n", + "\n", + "* 주요 설정값\n", + "  * `n_estimators` : 트리의 수 (Default=100)\n", + "    * 증가할수록 계산비용 & 성능 증가 (일정 수준부터는 크게 상승하지 않음)\n", + "    * ↔ 감소할수록 계산비용 감소. 그러나 과소적합 발생 가능성\n", + "    * (경험적으로) 천 단위에서 마감하는 것이 좋음(만 단위에서 유의미한 성능향상 X)\n", + "  * `max_depth` : 트리의 최대 깊이 (Default=None)\n", + "    * 증가할수록 모델의 복잡도 & 과적합 가능성 증대(복잡한 패턴을 익힐 수 있음)\n", + "    * ↔ 감소할수록 과소적합 발생할 수 있음\n", + "  * `min_samples_split` : 노드를 분할하기 위한 최소 샘플 수 (Default=2)\n", + "    * 증가할수록 과적합 방지 & 성능 감소\n", + "    * ↔ 감소할수록 과적합 위험이 커짐(트리가 깊어짐)\n", + "  * `min_samples_leaf` : 분할이 모두 끝난 노드(리프노드)의 최소 샘플 크기 (Default=1)\n", + "    * 증가할수록 과적합 방지 & 성능 감소\n", + "    * ↔ 감소할수록 과적합 위험이 커짐\n", + "  * `max_features` : 각 분할에서 고려할 feature의 수/비율 (Default: 분류는 'sqrt', 회귀는 1.0)\n", + "    * 증가할수록 성능/계산비용 향상 & 과적합 확률 증가\n", + "\n", + "### 하이퍼 파라미터 - XGB(XGBoost, Boosting)\n", + "\n", + "* 주요 설정값\n", + "  * `n_estimators` : 부스팅 단계의 수 (Default=100)\n", + "    * RF와 뜻은 다르지만 양상은 비슷함\n", + "    * 클수록 계산비용/성능 향상(↔과소적합). 
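위에서 정리한 RF 하이퍼 파라미터의 효과는 sklearn으로 간단히 확인해볼 수 있다. 아래는 이해를 돕기 위한 가상의 스케치로, 데이터(`make_classification`)와 설정값은 본문 실습 데이터와 무관한 임의의 가정이다.

```python
# 가정 : scikit-learn 설치 환경. 본문과 무관한 임의 생성 데이터로 max_depth 효과를 확인
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 얕은 트리(과소적합 가능) vs 깊이 제한 없는 트리(과적합 가능) 비교
shallow = RandomForestClassifier(n_estimators=100, max_depth=2,
                                 random_state=42).fit(X_tr, y_tr)
deep = RandomForestClassifier(n_estimators=100, max_depth=None,
                              min_samples_leaf=1, random_state=42).fit(X_tr, y_tr)

acc_shallow = accuracy_score(y_te, shallow.predict(X_te))
acc_deep = accuracy_score(y_te, deep.predict(X_te))
print(f"max_depth=2 : {acc_shallow:.3f} / max_depth=None : {acc_deep:.3f}")
```

보통 `max_depth`를 크게 제한한 쪽이 과소적합으로 더 낮은 점수를 보이는 경향이 있다(데이터에 따라 항상 그렇지는 않음).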
천 단위에서 마감하는 것을 권장\n", + "  * `max_depth` : 트리의 최대 깊이 (Default=6)\n", + "    * 증가할수록 과적합 위험 증대\n", + "  * `learning_rate(eta)` : 부스팅 단계에서 학습률을 조절 (Default=0.3)\n", + "    * 배깅이 아닌 부스팅이므로, 학습률 개념이 있음\n", + "    * 증가할수록 학습속도가 빨라지고, 초반 데이터에 가중치(+최적해를 놓칠 위험 있음)\n", + "    * ↔ 낮을수록 느리지만 최적해에 안정적으로 수렴\n", + "  * `subsample` : 각 단계에서 사용할 데이터 샘플[row]의 비율 (Default=1.0)\n", + "    * 감소할수록 무작위성이 커져 과적합 방지(너무 낮으면 성능 저하)\n", + "  * `colsample_bytree` : 각 트리를 학습할 때 사용할 feature[column]의 비율 (Default=1.0)\n", + "    * 증가할수록 성능/비용 증가 & 과적합 위험\n", + "  * `gamma` : 노드 분할 시 필요한 최소 손실 감소량 (Default=0)\n", + "    * 0인 경우, 손실(Loss) 감소가 없어도 성능에 부정적이지 않다면 분할\n", + "    * 증가할수록 노드 분할을 엄격히 수행(과적합 방지, 성능 하향)\n", + "\n", + "### 하이퍼 파라미터 - LGBM(Light GBM)\n", + "\n", + "* 주요 설정값\n", + "  * 타 모델과 유사\n", + "    * `n_estimators`[Default=100], `learning_rate`[Default=0.1], `subsample`(=`bagging_fraction`, bagging)[Default=1.0],\n", + "    * `feature_fraction`(=`colsample_bytree`)[Default=1.0], `max_depth`[Default=-1]\n", + "  * `num_leaves` : 하나의 트리가 가질 수 있는 최대의 (분할이 끝난) 리프 노드 수 (Default=31)\n", + "    * 값이 증가할수록 트리가 복잡 & 과적합 & 메모리 사용 증가\n", + "  * `min_data_in_leaf`(=`min_child_samples`) : 리프 노드의 최소 샘플 수 (Default=20)\n", + "\n", + "### 하이퍼 파라미터 - Catboost\n", + "\n", + "* 주요 설정값\n", + "  * 타 모델과 유사\n", + "    * `n_estimators`(iterations)[Default=1000], `learning_rate`[Default=약 0.03, 자동 조정], `subsample`[Default=None], `colsample_bylevel`[Default=None]\n", + "  * `depth` : 트리의 최대 깊이 (Default=6)\n", + "  * `bagging_temperature` : 샘플링의 무작위성을 제어 (Default=1.0)\n", + "    * 높을수록 샘플링이 다양해지며 일반화 성능 향상 (속도 느려짐)\n", + "  * `l2_leaf_reg` : 리프 노드의 가중치에 부여하는 패널티의 정도 (Default=3)\n", + "    * 클수록 과적합 방지\n", + "  " + ] + }, + { + "cell_type": "markdown", + "id": "c7545351", + "metadata": {}, + "source": [ + "## optuna\n", + "\n", + "### optuna란\n", + "\n", + "* 기존의 Tuning Tools\n", + "  * Grid Search : 오래 걸리고, Grid에 적절한 값이 없으면 최적값을 찾지 못함\n", + "  * Random Search : Grid Search의 문제를 일부 보완하지만, 최적값을 반드시 찾는다는 보장은 없음\n", + "* optuna vs hyperopt\n", + "  * 25년 1월 기준, github star 기준으로 optuna 11.3k > hyperopt 7.3k\n", + "* optuna\n", + "  * 정의한 목적함수를 기반으로 최적화를 쉽게 
진행\n", + "  * 베이지안 최적화와 유사한 TPE 알고리즘을 사용한 탐색을 진행(그 외의 다양한 전략도 커스터마이징 가능)\n", + "\n", + "### optuna 구성요소와 작동방식\n", + "\n", + "* 구성요소\n", + "  * Study : 최적화 과정 전체를 관리. 각 하이퍼 파라미터 탐색(Trial)의 결과를 저장\n", + "  * Search Space : 하이퍼 파라미터의 범위 지정\n", + "  * Objective function : 목적함수. 모델을 학습시키고 평가지표를 반환\n", + "  * Trial : 하이퍼파라미터의 조합을 나타내는 단위 (1번의 Trial로 목적함수를 한 번 실행)\n", + "* 작동방식\n", + "  * Search Space(탐색공간)에서 int/float/categorical 등을 정의\n", + "  * Objective function(목적함수)을 실행해 모델을 학습하고 성능지표 반환\n", + "  * 정의된 조합을 효율적으로 선택해, 반복적으로(Trial) 성능 개선\n", + "  * 최적의 하이퍼파라미터와 정보를 Study 객체에 저장\n", + "* 참고\n", + "  * Search Space에서 정의한 범위 안에서만 조합하므로, 그 밖의 범위에 최적해가 있으면 찾을 수 없음" + ] + }, + { + "cell_type": "markdown", + "id": "bbc98d08", + "metadata": {}, + "source": [ + "## optuna 실습\n", + "\n", + "### 데이터 로딩" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e39f0450", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, log_loss\n", + "import xgboost as xgb\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import time\n", + "\n", + "data = pd.read_csv(\"data_preprocessed.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "8316ab3d", + "metadata": {}, + "source": [ + "### LGBM Vanilla 모델" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41a19793", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "정확도 : 0.703\n", + "AUC : 0.7578\n", + "Vanilla LGBM의 CF :\n", + "[[39914 16640]\n", + " [ 1629  3320]]\n" + ] + } + ], + "source": [ + "from lightgbm import LGBMClassifier\n", + "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score\n", + "\n", + "X = data.drop(columns=[\"TARGET\"])\n", + "y = data[\"TARGET\"]\n", + "X_train, X_test, 
y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "\n", + "# Full Model\n", + "full_model = LGBMClassifier(class_weight=\"balanced\", random_state=42)\n", + "full_model.fit(X_train, y_train)\n", + "\n", + "# Full Pred\n", + "y_pred_full = full_model.predict(X_test)\n", + "y_proba_full = full_model.predict_proba(X_test)\n", + "\n", + "# Full Results\n", + "accuracy_full = accuracy_score(y_test, y_pred_full)\n", + "auc_full = roc_auc_score(y_test, y_proba_full[:, 1])\n", + "cf_full = confusion_matrix(y_test, y_pred_full)\n", + "\n", + "print(f'정확도 : {round(accuracy_full,4)}')\n", + "print(f'AUC : {round(auc_full,4)}')\n", + "print(f'''Vanilla LGBM의 CF :\n", + "{cf_full}''')" + ] + }, + { + "cell_type": "markdown", + "id": "a7d8ae93", + "metadata": {}, + "source": [ + "### LGBM with optuna\n", + "\n", + "* 최적의 하이퍼파라미터 조합 찾기" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6be8bc0c", + "metadata": {}, + "outputs": [], + "source": [ + "import optuna\n", + "from lightgbm import LGBMClassifier, early_stopping\n", + "\n", + "# 데이터 분할\n", + "X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ccd9dd6", + "metadata": {}, + "outputs": [], + "source": [ + "# 목적 함수 정의\n", + "def objective(trial):\n", + " # LGBMClassifier 하이퍼파라미터 설정\n", + " param = {\n", + " 'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log = True), # default = 0.1\n", + " 'num_leaves': trial.suggest_int('num_leaves', 20, 100), # default = 31\n", + " # 'max_depth': trial.suggest_int('max_depth', -1, 50), # default = -1\n", + " 'n_estimators' : trial.suggest_int('n_estimators', 100, 2000), # default = 100\n", + " 'subsample': trial.suggest_float('subsample', 0.8, 1.0), # default = 1.0\n", + " 'min_child_samples' : trial.suggest_int('min_data_in_leaf', 1, 100), # default = 20\n", + " 'colsample_bytree': 
trial.suggest_float('colsample_bytree', 0.8, 1.0), # default = 1.0\n", + "    }\n", + "\n", + "    # LGBMClassifier 모델 생성 및 학습\n", + "    model = LGBMClassifier(**param, n_jobs=-1,\n", + "                           class_weight='balanced',\n", + "                           random_state=42,\n", + "                           # device_type='gpu', colab환경에서 정상적으로 gpu가 활용이 안되네요...이거 해결만 온종일 할 거 같아서 일단 cpu기준으로 학습합니다.\n", + "                           # https://stackoverflow.com/questions/75981883/can-not-use-lightgbm-gpu-in-colab-lightgbmerror-no-opencl-device-found\n", + "                           )\n", + "    model.fit(X_train, y_train,\n", + "              eval_set=[(X_val, y_val)],\n", + "              eval_metric='auc',\n", + "              callbacks=[early_stopping(stopping_rounds=50, verbose=False)]\n", + "              )\n", + "\n", + "    # 검증 데이터에 대한 양성 클래스 확률 예측 (AUC 계산에는 predict()의 클래스 예측이 아닌 확률이 필요)\n", + "    proba = model.predict_proba(X_val)[:, 1]\n", + "\n", + "    # AUC 계산\n", + "    roc_auc = roc_auc_score(y_val, proba)\n", + "\n", + "    return roc_auc\n", + "\n", + "# Optuna 튜닝 실행\n", + "study = optuna.create_study(direction='maximize') # 높아질수록 좋은 roc_auc_score에 대해 maximize\n", + "study.optimize(objective, n_trials=50)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ef20e0d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'learning_rate': 0.02642230824686883, 'num_leaves': 34, 'n_estimators': 864, 'subsample': 0.8342595493628282, 'min_data_in_leaf': 60, 'colsample_bytree': 0.8360285193709883}\n" + ] + } + ], + "source": [ + "# 최적 파라미터 출력\n", + "best_params = study.best_trial.params\n", + "print(best_params)" + ] + }, + { + "cell_type": "markdown", + "id": "f539575a", + "metadata": {}, + "source": [ + "* 찾은 조합을 활용한 final model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55b5a6aa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60\n", + "[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. 
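참고로 위 LightGBM 경고는 objective에서 optuna 파라미터 이름을 `min_data_in_leaf`로 지정했기 때문에 발생한다. `min_data_in_leaf`는 `min_child_samples`의 별칭(alias)이라, `best_params`를 그대로 넘기면 기본값 `min_child_samples=20`과의 충돌 경고가 뜬다. 아래는 키 이름만 표준 이름으로 바꿔 경고를 없애는 간단한 스케치다(값은 위 출력을 반올림한 가정).

```python
# 가정 : study.best_trial.params 형태의 dict (값은 위 출력 기준으로 반올림)
best_params = {'learning_rate': 0.0264, 'num_leaves': 34, 'n_estimators': 864,
               'subsample': 0.8343, 'min_data_in_leaf': 60, 'colsample_bytree': 0.8360}

# 별칭 키를 LGBMClassifier의 표준 키워드로 바꿔주면 동일하게 동작하면서 경고가 사라진다
if 'min_data_in_leaf' in best_params:
    best_params['min_child_samples'] = best_params.pop('min_data_in_leaf')

print(best_params)
```

애초에 objective에서 `trial.suggest_int('min_child_samples', 1, 100)`처럼 dict 키와 같은 이름으로 suggest하면 이 변환 자체가 필요 없다.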
Current value: min_data_in_leaf=60\n", + "정확도 : 0.7229\n", + "AUC : 0.7602\n", + "Optuna LGBM의 CF :\n", + "[[41245 15309]\n", + " [ 1735  3214]]\n" + ] + } + ], + "source": [ + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "\n", + "final_model = LGBMClassifier(**best_params,\n", + "                             n_jobs=-1,\n", + "                             class_weight='balanced',\n", + "                             random_state=42,)\n", + "final_model.fit(X_train, y_train)\n", + "\n", + "# optuna Pred\n", + "y_pred_final = final_model.predict(X_test)\n", + "y_proba_final = final_model.predict_proba(X_test)\n", + "\n", + "# optuna Results\n", + "accuracy_final = accuracy_score(y_test, y_pred_final)\n", + "auc_final = roc_auc_score(y_test, y_proba_final[:, 1])\n", + "cf_final = confusion_matrix(y_test, y_pred_final)\n", + "\n", + "print(f'정확도 : {round(accuracy_final,4)}')\n", + "print(f'AUC : {round(auc_final,4)}')\n", + "print(f'''Optuna LGBM의 CF :\n", + "{cf_final}''')" + ] + }, + { + "cell_type": "markdown", + "id": "495aee0c", + "metadata": {}, + "source": [ + "### LGBM 결과비교 (Vanilla 모델 vs optuna)\n", + "\n", + "* LGBM Vanilla 모델\n", + "  * 정확도 : 0.703\n", + "  * AUC : 0.7578\n", + "  * Vanilla LGBM의 CF :\n", + "[[39914 16640]\n", + " [ 1629  3320]]\n", + "\n", + "* LGBM with optuna\n", + "  * 정확도 : 0.7229\n", + "  * AUC : 0.7602\n", + "  * Optuna LGBM의 CF :\n", + "[[41245 15309]\n", + " [ 1735  3214]]" + ] + }, + { + "cell_type": "markdown", + "id": "8d617c93", + "metadata": {}, + "source": [ + "## Autogluon\n", + "\n", + "### Autogluon이란\n", + "\n", + "* Autogluon : 최소한의 코드로 여러 모델을 학습/앙상블하여 자동으로 최적화하는 AutoML 라이브러리 (목적은 optuna와 유사)\n", + "\n", + "### Autogluon 구성요소와 설정값\n", + "\n", + "* 구성요소\n", + "  * `TabularPredictor` : (분류/회귀에 사용) Tabular 데이터 처리\n", + "  * `TimeSeriesPredictor` : 시계열 데이터 예측\n", + "  * `TextPredictor` : (연관성 분석 등) 자연어 처리\n", + "  * `ImagePredictor` : 이미지 처리\n", + "* 설정값\n", + "  * `time_limit` : 학습 제한시간(Default=None)\n", + "  * `presets` : 사전 설정된 학습 전략 (best/high/good/medium quality. 
medium이 기본값)\n", + " * `hyperparameters` : (dict) 사용할 모델의 하이퍼파라미터들\n", + " * `auto_stack` : 배깅 및 스택 앙상블링을 자동으로 활용할지 여부(Default=False)\n", + " * True(더 오래/정확히 학습)인 경우, `num_bag_folds` & `num_stack_levels` 자동 설정\n", + " * `num_bag_folds` : 배깅에 사용되는 폴드 수. 성능을 높이고 싶다면 5~10 사이를 권장(10이하의 값을 가짐)\n", + " * `num_stack_levels` : 스택에 사용되는 스태킹 레벨 수. 성능을 높이고 싶다면 2~3 사이를 권장(3이하의 값을 가짐)\n", + "* Leaderboard" ] }, { "cell_type": "markdown", "id": "8e87ed90", "metadata": {}, "source": [ "## Autogluon실습\n", + "\n", + "### 하단 코드의 실행내역 이해하기\n", + "\n", + "* 실행내역 중 일부만 발췌함\n", + "```\n", + "# 0,1로 추정되는 값 bool 변환\n", + "Stage 1 Generators:\n", + " Fitting AsTypeFeatureGenerator...\n", + " Note: Converting 48 features to boolean dtype as they only contain 2 unique values.\n", + "\n", + "# Null값 처리\n", + "Stage 2 Generators:\n", + " Fitting FillNaFeatureGenerator...\n", + "\n", + "# 그대로 사용할 값 처리(변환X)\n", + "Stage 3 Generators:\n", + " Fitting IdentityFeatureGenerator...\n", + "\n", + "# Unique값 처리\n", + "Stage 4 Generators:\n", + " Fitting DropUniqueFeatureGenerator...\n", + "\n", + "# 중복값 처리\n", + "Stage 5 Generators:\n", + " Fitting DropDuplicatesFeatureGenerator...\n", + "\n", + "# Feature 처리(앞서 전처리한 데이터이기는 하나, autogluon이 판단하여 추가 처리)\n", + "Unused Original Features (Count: 4): ['FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_12']\n", + " These features were not used to generate any of the output features. 
Add a feature generator compatible with these features to utilize them.\n", + " Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.\n", + " These features do not need to be present at inference time.\n", + " ('int', []) : 4 | ['FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_12']\n", + "\n", + "# 사용할 모델\n", + "User-specified model hyperparameters to be fit:\n", + "{\n", + "\t'NN_TORCH': [{}],\n", + "\t'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],\n", + "\t'CAT': [{}],\n", + "\t'XGB': [{}],\n", + "\t'FASTAI': [{}],\n", + "\t'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],\n", + "\t'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],\n", + "\t'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],\n", + "}\n", + "\n", + "# sequential하게 trial 진행\n", + "Fitting 13 L1 models, fit_strategy=\"sequential\" ...\n", + "Fitting model: KNeighborsUnif ... 
Training model for up to 3595.93s of the 3595.92s of remaining time.\n", + "\t0.5271\t = Validation score (roc_auc)\n", + "\t1.62s\t = Training runtime\n", + "\t50.83s\t = Validation runtime\n", + "Fitting model: KNeighborsDist ... Training model for up to 3543.23s of the 3543.23s of remaining time.\n", + "\t0.5298\t = Validation score (roc_auc)\n", + "\t0.63s\t = Training runtime\n", + "\t49.39s\t = Validation runtime\n", + "...\n", + "\n", + "# 실행시간 및 best model\n", + "AutoGluon training complete, total runtime = 1561.52s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 4907.4 rows/s (24601 batch size)\n", + "```" ] }, { "cell_type": "code", "execution_count": null, "id": "36c13813", "metadata": {}, "outputs": [], "source": [ "from autogluon.tabular import TabularPredictor\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import roc_auc_score\n", + "import pandas as pd\n", + "\n", + "path = 'autogluon_results' # 저장할 경로\n", + "\n", + "# 데이터 분할 (이전 코드에서 X_train, y_train, X_val, y_val, X_test, y_test를 준비했다고 가정)\n", + "\n", + "# 데이터프레임으로 변환 (원본 X를 건드리지 않도록 copy 사용)\n", + "train_data = X_train.copy()\n", + "train_data['target'] = y_train\n", + "\n", + "val_data = X_val.copy()\n", + "val_data['target'] = y_val\n", + "\n", + "# AutoGluon 학습\n", + "predictor = TabularPredictor(label='target', eval_metric='roc_auc', verbosity=2, path=path)\n", + "predictor.fit(train_data, tuning_data=val_data, time_limit=3600) # 시간 제한 설정 (1시간)" ] }, { "cell_type": "code", "execution_count": null, "id": "1a0754be", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "정확도 : 0.9199\n", + "AUC : 0.761\n", + "autogluon LGBM의 CF : \n", + "[[56492 62]\n", + " [ 4864 85]]\n" ] } ], "source": [ "# 예측 및 평가\n", + "test_data = X_test.copy()\n", + "test_data['target'] = y_test\n", + "\n", + "y_pred_proba_ag = predictor.predict_proba(test_data)\n", + "y_pred_ag = predictor.predict(test_data)\n", + 
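"# 참고: AutoGluon의 predict_proba는 클래스별 확률 컬럼을 가진 DataFrame을 반환\n", + "# (binary의 경우 0/1 두 컬럼. ROC AUC 계산에는 아래에서 양성 클래스 확률 컬럼만 사용)\n", + 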
"\n", + "# optuna Results\n", + "accuracy_ag = accuracy_score(y_test, y_pred_ag)\n", + "auc_ag = roc_auc_score(y_test, y_pred_proba_ag.iloc[:, 1])\n", + "cf_ag = confusion_matrix(y_test, y_pred_ag)\n", + "\n", + "print(f'정확도 : {round(accuracy_ag,4)}')\n", + "print(f'AUC : {round(auc_ag,4)}')\n", + "print(f'''autogluon LGBM의 CF :\n", + "{cf_ag}''')" + ] + }, + { + "cell_type": "markdown", + "id": "31c97ed3", + "metadata": {}, + "source": [ + "### 결과 비교하기 (optuna vs autogluon)\n", + "\n", + "* LGBM with optuna\n", + " * 정확도 : 0.7229\n", + " * AUC : 0.7602\n", + " * Optuna LGBM의 CF :\n", + " [[41245 15309]\n", + " [ 1735 3214]]\n", + "\n", + "* Autogluon\n", + " * 정확도 : 0.9199\n", + " * AUC : 0.761\n", + " * autogluon LGBM의 CF : \n", + "[[56492 62]\n", + " [ 4864 85]]" + ] + }, + { + "cell_type": "markdown", + "id": "5779a4c3", + "metadata": {}, + "source": [ + "### Autogluon leaderboard\n", + "\n", + "* Leaderboard : Autogluon이 학습했던 모델들의 Score확인 가능\n", + " * 각 모델의 score나 time 등을 확인 가능" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dfc43063", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " model score_val eval_metric pred_time_val fit_time \\\n", + "0 WeightedEnsemble_L2 0.754022 roc_auc 5.013031 663.924621 \n", + "1 CatBoost 0.753185 roc_auc 0.050006 108.363011 \n", + "2 LightGBM 0.750591 roc_auc 0.188981 20.328846 \n", + "3 XGBoost 0.749953 roc_auc 0.238468 74.874593 \n", + "4 LightGBMXT 0.749184 roc_auc 0.669894 49.791667 \n", + "5 LightGBMLarge 0.748378 roc_auc 0.653803 39.147023 \n", + "6 NeuralNetTorch 0.740312 roc_auc 0.203650 140.524567 \n", + "7 NeuralNetFastAI 0.737314 roc_auc 0.325043 136.662297 \n", + "8 RandomForestEntr 0.731284 roc_auc 1.649255 304.078007 \n", + "9 ExtraTreesEntr 0.722843 roc_auc 1.512995 135.927222 \n", + "10 ExtraTreesGini 0.722248 roc_auc 1.357874 120.171119 \n", + "11 RandomForestGini 0.720812 roc_auc 2.502675 288.912408 \n", + "12 KNeighborsDist 0.529842 
roc_auc 49.391568 0.634750 \n", + "13 KNeighborsUnif 0.527099 roc_auc 50.827020 1.620968 \n", + "\n", + " pred_time_val_marginal fit_time_marginal stack_level can_infer \\\n", + "0 0.005133 1.482976 2 True \n", + "1 0.050006 108.363011 1 True \n", + "2 0.188981 20.328846 1 True \n", + "3 0.238468 74.874593 1 True \n", + "4 0.669894 49.791667 1 True \n", + "5 0.653803 39.147023 1 True \n", + "6 0.203650 140.524567 1 True \n", + "7 0.325043 136.662297 1 True \n", + "8 1.649255 304.078007 1 True \n", + "9 1.512995 135.927222 1 True \n", + "10 1.357874 120.171119 1 True \n", + "11 2.502675 288.912408 1 True \n", + "12 49.391568 0.634750 1 True \n", + "13 50.827020 1.620968 1 True \n", + "\n", + " fit_order \n", + "0 14 \n", + "1 7 \n", + "2 4 \n", + "3 11 \n", + "4 3 \n", + "5 13 \n", + "6 12 \n", + "7 10 \n", + "8 6 \n", + "9 9 \n", + "10 8 \n", + "11 5 \n", + "12 2 \n", + "13 1 \n" + ] + } + ], + "source": [ + "# 최적 모델 요약 출력\n", + "print(predictor.leaderboard())" + ] + }, + { + "cell_type": "markdown", + "id": "9f8703b1", + "metadata": {}, + "source": [ + "## paramater를 직접 지정하기\n", + "- Autogluon은 이렇게 기본으로 사용해도 훌륭한 성능을 보여줍니다. 자동으로 파라미터를 지정하고, 앙상블까지 해주죠.\n", + "- 이런 베이스 모델을 생성한 뒤에는, 쓸모 없거나 너무 무거운 모델은 제거하고 다시 학습시킬 수도 있습니다.\n", + "- LGBM, XGB, CAT 세 개가 성능도 괜찮고 학습시간도 짧네요. 이거 세 개만 써봅시다.\n", + "- 위에서 Optuna에서 찾은 파라미터를 여기서 사용하실 수도 있겠죠?\n", + "\n", + "### Autogluon 하이퍼파라미터 설정\n", + "\n", + "* 베이스 모델을 생성한 뒤, 쓸모 없거나 너무 무거운 모델을 제거하고 학습 가능(Leaderboard로 확인)\n", + "* 위의 Leaderboard 결과에서, LGBM/XGB/CAT을 가지고 하이퍼파라미터를 설정" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d133882c", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Warning: path already exists! This predictor may overwrite an existing predictor! 
path=\"drive/MyDrive/Metacode/Week6/selected_model\"\n", + "Verbosity: 2 (Standard Logging)\n", + "=================== System Info ===================\n", + "AutoGluon Version: 1.2\n", + "Python Version: 3.11.11\n", + "Operating System: Linux\n", + "Platform Machine: x86_64\n", + "Platform Version: #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024\n", + "CPU Count: 2\n", + "Memory Avail: 8.70 GB / 12.67 GB (68.6%)\n", + "Disk Space Avail: 0.50 GB / 15.00 GB (3.3%)\n", + "\tWARNING: Available disk space is low and there is a risk that AutoGluon will run out of disk during fit, causing an exception. \n", + "\tWe recommend a minimum available disk space of 10 GB, and large datasets may require more.\n", + "===================================================\n", + "No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...\n", + "\tRecommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):\n", + "\tpresets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.\n", + "\tpresets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.\n", + "\tpresets='high' : Strong accuracy with fast inference speed.\n", + "\tpresets='good' : Good accuracy with very fast inference speed.\n", + "\tpresets='medium' : Fast training time, ideal for initial prototyping.\n", + "Beginning AutoGluon training ... 
Time limit = 3600s\n", + "AutoGluon will save models to \"/content/drive/MyDrive/Metacode/Week6/selected_model\"\n", + "Train Data Rows: 246008\n", + "Train Data Columns: 131\n", + "Tuning Data Rows: 24601\n", + "Tuning Data Columns: 131\n", + "Label Column: target\n", + "AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n", + "\t2 unique label values: [0, 1]\n", + "\tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])\n", + "Problem Type: binary\n", + "Preprocessing data ...\n", + "Selected class <--> label mapping: class 1 = 1, class 0 = 0\n", + "Using Feature Generators to preprocess the data ...\n", + "Fitting AutoMLPipelineFeatureGenerator...\n", + "\tAvailable Memory: 8776.27 MB\n", + "\tTrain Data (Original) Memory Usage: 270.46 MB (3.1% of available memory)\n", + "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", + "\tStage 1 Generators:\n", + "\t\tFitting AsTypeFeatureGenerator...\n", + "\t\t\tNote: Converting 48 features to boolean dtype as they only contain 2 unique values.\n", + "\tStage 2 Generators:\n", + "\t\tFitting FillNaFeatureGenerator...\n", + "\tStage 3 Generators:\n", + "\t\tFitting IdentityFeatureGenerator...\n", + "\tStage 4 Generators:\n", + "\t\tFitting DropUniqueFeatureGenerator...\n", + "\tStage 5 Generators:\n", + "\t\tFitting DropDuplicatesFeatureGenerator...\n", + "\tUnused Original Features (Count: 6): ['FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_17', 'CODE_GENDER_XNA']\n", + "\t\tThese features were not used to generate any of the output features. 
Add a feature generator compatible with these features to utilize them.\n", + "\t\tFeatures can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.\n", + "\t\tThese features do not need to be present at inference time.\n", + "\t\t('float', []) : 1 | ['CODE_GENDER_XNA']\n", + "\t\t('int', []) : 5 | ['FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_17']\n", + "\tTypes of features in original data (raw dtype, special dtypes):\n", + "\t\t('float', []) : 88 | ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', ...]\n", + "\t\t('int', []) : 37 | ['SK_ID_CURR', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', ...]\n", + "\tTypes of features in processed data (raw dtype, special dtypes):\n", + "\t\t('float', []) : 69 | ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', ...]\n", + "\t\t('int', []) : 14 | ['SK_ID_CURR', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START', ...]\n", + "\t\t('int', ['bool']) : 42 | ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', ...]\n", + "\t4.1s = Fit runtime\n", + "\t125 features in original data used to generate 125 features in processed data.\n", + "\tTrain Data (Processed) Memory Usage: 182.20 MB (2.1% of available memory)\n", + "Data preprocessing and feature engineering runtime = 4.62s ...\n", + "AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'\n", + "\tThis metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()\n", + "\tTo change this, specify the eval_metric parameter of Predictor()\n", + "User-specified model hyperparameters to be fit:\n", + "{\n", + "\t'GBM': [{'learning_rate': 0.02642230824686883, 
'num_leaves': 34, 'n_estimators': 864, 'subsample': 0.8342595493628282, 'min_data_in_leaf': 60, 'colsample_bytree': 0.8360285193709883, 'n_jobs': -1, 'class_weight': 'balanced'}],\n", + "\t'CAT': [{'scale_pos_weight': 11.377138257194607}],\n", + "\t'XGB': [{'scale_pos_weight': 11.377138257194607}],\n", + "}\n", + "Fitting 3 L1 models, fit_strategy=\"sequential\" ...\n", + "Fitting model: LightGBM ... Training model for up to 3595.38s of the 3595.38s of remaining time.\n", + "/usr/local/lib/python3.11/dist-packages/lightgbm/engine.py:204: UserWarning: Found `n_estimators` in params. Will use it instead of argument\n", + " _log_warning(f\"Found `{alias}` in params. Will use it instead of argument\")\n", + "\t0.8437\t = Validation score (roc_auc)\n", + "\t93.76s\t = Training runtime\n", + "\t2.52s\t = Validation runtime\n", + "Fitting model: CatBoost ... Training model for up to 3498.78s of the 3498.77s of remaining time.\n", + "\t0.9877\t = Validation score (roc_auc)\n", + "\t1941.71s\t = Training runtime\n", + "\t0.38s\t = Validation runtime\n", + "Fitting model: XGBoost ... Training model for up to 1556.46s of the 1556.46s of remaining time.\n", + "\t0.9982\t = Validation score (roc_auc)\n", + "\t1556.76s\t = Training runtime\n", + "\t11.44s\t = Validation runtime\n", + "Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.00s of the -11.97s of remaining time.\n", + "\tEnsemble Weights: {'XGBoost': 1.0}\n", + "\t0.9982\t = Validation score (roc_auc)\n", + "\t0.35s\t = Training runtime\n", + "\t0.0s\t = Validation runtime\n", + "AutoGluon training complete, total runtime = 3614.24s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 2149.4 rows/s (24601 batch size)\n", + "TabularPredictor saved. 
To load, use: predictor = TabularPredictor.load(\"/content/drive/MyDrive/Metacode/Week6/selected_model\")\n" ] }, { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from autogluon.tabular import TabularPredictor\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import roc_auc_score\n", + "import pandas as pd\n", + "import os\n", + "\n", + "path = 'drive/MyDrive/Metacode/Week6/selected_model'\n", + "os.makedirs(path, exist_ok=True)\n", + "\n", + "train_data = X_train.copy()\n", + "train_data['target'] = y_train\n", + "\n", + "val_data = X_val.copy()\n", + "val_data['target'] = y_val\n", + "\n", + "scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1]) # 불균형 데이터 가중치 설정\n", + "\n", + "# 모델별 하이퍼파라미터 지정\n", + "custom_hyperparameters = {\n", + " 'GBM': {\n", + " 'learning_rate': 0.02642230824686883,\n", + " 'num_leaves': 34,\n", + " 'n_estimators': 864,\n", + " 'subsample': 0.8342595493628282,\n", + " 'min_data_in_leaf': 60,\n", + " 'colsample_bytree': 0.8360285193709883,\n", + " 'n_jobs': -1,\n", + " 'class_weight': 'balanced'\n", + " },\n", + " 'CAT': {\n", + " 'scale_pos_weight': scale_pos_weight\n", + " },\n", + " 'XGB': {\n", + " 'scale_pos_weight': scale_pos_weight\n", + " }\n", + "}\n", + "\n", + "# AutoGluon 학습\n", + "predictor = TabularPredictor(label='target', eval_metric='roc_auc', verbosity=2, path=path)\n", + "predictor.fit(\n", + " train_data,\n", + " tuning_data=val_data,\n", + " time_limit=3600,\n", + " hyperparameters=custom_hyperparameters\n", + ")" ] }, { "cell_type": "code", "execution_count": null, "id": "58b942ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "정확도 : 
0.8675\n", + "AUC : 0.7107\n", + "autogluon CF :\n", + "[[51892 4662]\n", + " [ 3486 1463]]\n" ] } ], "source": [ "# 예측 및 평가\n", + "test_data = X_test.copy()\n", + "test_data['target'] = y_test\n", + "\n", + "y_pred_proba_ag = predictor.predict_proba(test_data)\n", + "y_pred_ag = predictor.predict(test_data)\n", + "\n", + "# autogluon Results\n", + "accuracy_ag = accuracy_score(y_test, y_pred_ag)\n", + "auc_ag = roc_auc_score(y_test, y_pred_proba_ag.iloc[:, 1])\n", + "cf_ag = confusion_matrix(y_test, y_pred_ag)\n", + "\n", + "print(f'정확도 : {round(accuracy_ag,4)}')\n", + "print(f'AUC : {round(auc_ag,4)}')\n", + "print(f'''autogluon CF :\n", + "{cf_ag}''')" ] }, { "cell_type": "code", "execution_count": null, "id": "8025e543", "metadata": {}, "outputs": [ { "data": { "text/html": [ + "<div>leaderboard</div>
\n" + ], + "text/plain": [ + " model score_val eval_metric pred_time_val fit_time \\\n", + "0 XGBoost 0.998176 roc_auc 11.441149 1556.763843 \n", + "1 WeightedEnsemble_L2 0.998176 roc_auc 11.445685 1557.111241 \n", + "2 CatBoost 0.987722 roc_auc 0.384950 1941.707015 \n", + "3 LightGBM 0.843690 roc_auc 2.519189 93.759019 \n", + "\n", + " pred_time_val_marginal fit_time_marginal stack_level can_infer \\\n", + "0 11.441149 1556.763843 1 True \n", + "1 0.004536 0.347398 2 True \n", + "2 0.384950 1941.707015 1 True \n", + "3 2.519189 93.759019 1 True \n", + "\n", + " fit_order \n", + "0 3 \n", + "1 4 \n", + "2 2 \n", + "3 1 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "predictor.leaderboard()" + ] + }, + { + "cell_type": "markdown", + "id": "75c19626", + "metadata": {}, + "source": [ + "### 결과 비교하기 (optuna vs autogluon)\n", + "\n", + "* Autogluon기본\n", + " * 정확도 : 0.9199\n", + " * AUC : 0.761\n", + " * autogluon LGBM의 CF : \n", + "[[56492 62]\n", + " [ 4864 85]]\n", + "\n", + "* Autogluon 일부 모델만 추려낸 것 (낮아짐)\n", + " * 정확도 : 0.8675\n", + " * AUC : 0.7107\n", + " * autogluon CF :\n", + "[[51892 4662]\n", + " [ 3486 1463]]" + ] + }, + { + "cell_type": "markdown", + "id": "23309b50", + "metadata": {}, + "source": [ + "### 기타\n", + "\n", + "* 다른 AutoML로 pycaret도 있으나, Autogluon이 더 좋음\n", + " * 코드 한줄정도로 구현이 가능하지만, Autogluon성능이 더 좋았음\n", + "* 실무적으로는, **medium quality세팅으로 리더보드를 보면서, feature engineering(파생변수 생성 등)을 진행하게 됨**\n", + " * 성능의 영향은 데이터(GIGO). 
하이퍼파라미터 튜닝 등 보다는 **데이터에 집중**\n", + " * EDA를 하다보면, 추가 할법한 (파생)변수가 보이기도 함\n", + " * SHAP를 확인하다보면 넣으면 좋을 것 같은 변수가 보이기도 함" + ] + }, + { + "cell_type": "markdown", + "id": "b8ea13f6", + "metadata": {}, + "source": [ + "# cuML 도커에서 설치하기\n", + "- https://rapids.ai/#quick-start" + ] + }, + { + "cell_type": "markdown", + "id": "7fee6a38", + "metadata": {}, + "source": [ + "# 과제 : 최종과제 준비하기\n", + "\n", + "* 구성 샘플\n", + " * 과제 목표\n", + " * 학습데이터\n", + " * EDA\n", + " * SHAP\n", + " * 학습모델\n", + " * 메트릭\n", + " * 리더보드\n", + " * appendix(hyper)\n", + "* 인당 15분 발표 + 5분 질문\n", + "* 데이콘에 결과 제출해보기 (+private score공유)\n", + "* 데이콘 우승자 발표 참고하면 좋음" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}