Wrote section 6 and conclusion, also added random number variable instead of writing it out 3 times

91e7d44e · Jonathan Poalses · d71b418f · 91e7d44e
Commit 91e7d44e authored May 26, 2023 by Jonathan Poalses
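
Section 6 of the notebook, added in this commit, describes collecting accuracy scores into dictionaries, averaging them per algorithm and per split ratio, and reporting the best of each. A minimal sketch of that bookkeeping, not the notebook's actual code, might look as follows. The function name `summarise_scores`, the `(algorithm, test_fraction)` key layout, and the placeholder scores are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise_scores(scores):
    """Pick the best model, algorithm, and split from accuracy scores.

    `scores` maps (algorithm_name, test_fraction) -> accuracy.
    This key layout is an assumption; the notebook uses its own dictionaries.
    """
    by_algorithm = defaultdict(list)
    by_split = defaultdict(list)
    for (algorithm, test_fraction), accuracy in scores.items():
        by_algorithm[algorithm].append(accuracy)
        by_split[test_fraction].append(accuracy)

    # Average accuracy per algorithm and per test/training split.
    algorithm_means = {name: mean(vals) for name, vals in by_algorithm.items()}
    split_means = {frac: mean(vals) for frac, vals in by_split.items()}

    best_model = max(scores, key=scores.get)
    best_algorithm = max(algorithm_means, key=algorithm_means.get)
    best_split = max(split_means, key=split_means.get)
    return best_model, best_algorithm, best_split


# Placeholder accuracies, invented purely to exercise the function.
# They are not results from the notebook.
example_scores = {
    ("A", 0.25): 0.90, ("A", 0.50): 0.80, ("A", 0.75): 0.70,
    ("B", 0.25): 0.88, ("B", 0.50): 0.86, ("B", 0.75): 0.84,
}
best_model, best_algorithm, best_split = summarise_scores(example_scores)
print(best_model, best_algorithm, best_split)
```

With these placeholder numbers the single best model belongs to algorithm "A", while algorithm "B" has the higher average. This is the same kind of split between "best model" and "best algorithm" that the conclusion reports.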

Showing with 77 additions and 61 deletions

numbers_ml.ipynb +77 -61

--- a/numbers_ml.ipynb
+++ b/numbers_ml.ipynb
@@ -24,7 +24,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 748,
+   "execution_count": 838,
   "outputs": [],
   "source": [
    "# Importing pyplot so we can visualize things\n",
@@ -78,8 +78,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.485Z",
+     "end_time": "2023-05-26T11:53:03.732178Z",
-     "start_time": "2023-05-26T11:20:46.377856Z"
+     "start_time": "2023-05-26T11:53:03.620468Z"
    }
   }
  },
@@ -95,13 +95,13 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 749,
+   "execution_count": 839,
   "outputs": [
    {
     "data": {
      "text/plain": "array([0, 1, 2, ..., 8, 9, 8])"
     },
-     "execution_count": 749,
+     "execution_count": 839,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -116,8 +116,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.486348Z",
+     "end_time": "2023-05-26T11:53:03.732946Z",
-     "start_time": "2023-05-26T11:20:46.392825Z"
+     "start_time": "2023-05-26T11:53:03.627219Z"
    }
   }
  },
@@ -132,13 +132,13 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 750,
+   "execution_count": 840,
   "outputs": [
    {
     "data": {
      "text/plain": "array([[[ 0.,  0.,  5., ...,  1.,  0.,  0.],\n        [ 0.,  0., 13., ..., 15.,  5.,  0.],\n        [ 0.,  3., 15., ..., 11.,  8.,  0.],\n        ...,\n        [ 0.,  4., 11., ..., 12.,  7.,  0.],\n        [ 0.,  2., 14., ..., 12.,  0.,  0.],\n        [ 0.,  0.,  6., ...,  0.,  0.,  0.]],\n\n       [[ 0.,  0.,  0., ...,  5.,  0.,  0.],\n        [ 0.,  0.,  0., ...,  9.,  0.,  0.],\n        [ 0.,  0.,  3., ...,  6.,  0.,  0.],\n        ...,\n        [ 0.,  0.,  1., ...,  6.,  0.,  0.],\n        [ 0.,  0.,  1., ...,  6.,  0.,  0.],\n        [ 0.,  0.,  0., ..., 10.,  0.,  0.]],\n\n       [[ 0.,  0.,  0., ..., 12.,  0.,  0.],\n        [ 0.,  0.,  3., ..., 14.,  0.,  0.],\n        [ 0.,  0.,  8., ..., 16.,  0.,  0.],\n        ...,\n        [ 0.,  9., 16., ...,  0.,  0.,  0.],\n        [ 0.,  3., 13., ..., 11.,  5.,  0.],\n        [ 0.,  0.,  0., ..., 16.,  9.,  0.]],\n\n       ...,\n\n       [[ 0.,  0.,  1., ...,  1.,  0.,  0.],\n        [ 0.,  0., 13., ...,  2.,  1.,  0.],\n        [ 0.,  0., 16., ..., 16.,  5.,  0.],\n        ...,\n        [ 0.,  0., 16., ..., 15.,  0.,  0.],\n        [ 0.,  0., 15., ..., 16.,  0.,  0.],\n        [ 0.,  0.,  2., ...,  6.,  0.,  0.]],\n\n       [[ 0.,  0.,  2., ...,  0.,  0.,  0.],\n        [ 0.,  0., 14., ..., 15.,  1.,  0.],\n        [ 0.,  4., 16., ..., 16.,  7.,  0.],\n        ...,\n        [ 0.,  0.,  0., ..., 16.,  2.,  0.],\n        [ 0.,  0.,  4., ..., 16.,  2.,  0.],\n        [ 0.,  0.,  5., ..., 12.,  0.,  0.]],\n\n       [[ 0.,  0., 10., ...,  1.,  0.,  0.],\n        [ 0.,  2., 16., ...,  1.,  0.,  0.],\n        [ 0.,  0., 15., ..., 15.,  0.,  0.],\n        ...,\n        [ 0.,  4., 16., ..., 16.,  6.,  0.],\n        [ 0.,  8., 16., ..., 16.,  8.,  0.],\n        [ 0.,  1.,  8., ..., 12.,  1.,  0.]]])"
     },
-     "execution_count": 750,
+     "execution_count": 840,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -150,8 +150,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.486460Z",
+     "end_time": "2023-05-26T11:53:03.733270Z",
-     "start_time": "2023-05-26T11:20:46.411001Z"
+     "start_time": "2023-05-26T11:53:03.642899Z"
    }
   }
  },
@@ -176,7 +176,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 751,
+   "execution_count": 841,
   "outputs": [
    {
     "data": {
@@ -203,8 +203,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.640352Z",
+     "end_time": "2023-05-26T11:53:03.883296Z",
-     "start_time": "2023-05-26T11:20:46.430010Z"
+     "start_time": "2023-05-26T11:53:03.653496Z"
    }
   }
  },
@@ -219,13 +219,13 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 752,
+   "execution_count": 842,
   "outputs": [
    {
     "data": {
      "text/plain": "array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],\n       [ 0.,  0.,  0., ..., 10.,  0.,  0.],\n       [ 0.,  0.,  0., ..., 16.,  9.,  0.],\n       ...,\n       [ 0.,  0.,  1., ...,  6.,  0.,  0.],\n       [ 0.,  0.,  2., ..., 12.,  0.,  0.],\n       [ 0.,  0., 10., ..., 12.,  1.,  0.]])"
     },
-     "execution_count": 752,
+     "execution_count": 842,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -238,20 +238,20 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.645409Z",
+     "end_time": "2023-05-26T11:53:03.889159Z",
-     "start_time": "2023-05-26T11:20:46.640033Z"
+     "start_time": "2023-05-26T11:53:03.880239Z"
    }
   }
  },
  {
   "cell_type": "code",
-   "execution_count": 753,
+   "execution_count": 843,
   "outputs": [
    {
     "data": {
      "text/plain": "True"
     },
-     "execution_count": 753,
+     "execution_count": 843,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -263,8 +263,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.650714Z",
+     "end_time": "2023-05-26T11:53:03.903956Z",
-     "start_time": "2023-05-26T11:20:46.646514Z"
+     "start_time": "2023-05-26T11:53:03.888898Z"
    }
   }
  },
@@ -294,6 +294,7 @@
   "source": [
    "## 5: Training and Testing\n",
    "Each algorithm was trained and tested in three variations, each using a different split of the data. The first uses a 25% test / 75% training split, which will likely give the best test results but could potentially result in overfitting; the second uses a 50/50 split, likely giving a good balance; the third uses a 75% test / 25% training split, which could potentially result in underfitting.\n",
+    "\n",
    "Each algorithm was tested using all three training and testing sets, creating three different models per algorithm, resulting in 18 models total."
   ],
   "metadata": {
@@ -302,19 +303,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 754,
+   "execution_count": 844,
   "outputs": [],
   "source": [
    "# We'll start by splitting the data into training and testing, going with a 75% train, 25% test split, a 50/50 split, and a 25% train 75% test split.\n",
-    "X_train, X_test, y_train, y_test = train_test_split(flat_images, data.target, test_size=0.25, random_state=2023)\n",
+    "random_number = 2023\n",
-    "X_train2, X_test2, y_train2, y_test2 = train_test_split(flat_images, data.target, test_size=0.50, random_state=2023)\n",
+    "X_train, X_test, y_train, y_test = train_test_split(flat_images, data.target, test_size=0.25, random_state=random_number)\n",
-    "X_train3, X_test3, y_train3, y_test3 = train_test_split(flat_images, data.target, test_size=0.75, random_state=2023)"
+    "X_train2, X_test2, y_train2, y_test2 = train_test_split(flat_images, data.target, test_size=0.50, random_state=random_number)\n",
+    "X_train3, X_test3, y_train3, y_test3 = train_test_split(flat_images, data.target, test_size=0.75, random_state=random_number)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:46.697098Z",
+     "end_time": "2023-05-26T11:53:03.945085Z",
-     "start_time": "2023-05-26T11:20:46.654924Z"
+     "start_time": "2023-05-26T11:53:03.903708Z"
    }
   }
  },
@@ -329,7 +331,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 755,
+   "execution_count": 845,
   "outputs": [],
   "source": [
    "# First the Gaussian Bayes\n",
@@ -356,14 +358,14 @@
    "lr.fit(X_train, y_train)\n",
    "lr2.fit(X_train2, y_train2)\n",
    "lr3.fit(X_train3, y_train3)\n",
-    "# Nonsense code to prevent SVC object being output below\n",
+    "# Nonsense code to prevent LR object being output below\n",
    "nothing = 1"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:49.728828Z",
+     "end_time": "2023-05-26T11:53:06.851081Z",
-     "start_time": "2023-05-26T11:20:46.663509Z"
+     "start_time": "2023-05-26T11:53:03.904257Z"
    }
   }
  },
@@ -378,7 +380,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 756,
+   "execution_count": 846,
   "outputs": [
    {
     "name": "stdout",
@@ -407,14 +409,14 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:49.741728Z",
+     "end_time": "2023-05-26T11:53:06.864130Z",
-     "start_time": "2023-05-26T11:20:49.730504Z"
+     "start_time": "2023-05-26T11:53:06.853426Z"
    }
   }
  },
  {
   "cell_type": "code",
-   "execution_count": 757,
+   "execution_count": 847,
   "outputs": [
    {
     "name": "stdout",
@@ -443,14 +445,14 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:49.826269Z",
+     "end_time": "2023-05-26T11:53:06.947095Z",
-     "start_time": "2023-05-26T11:20:49.744184Z"
+     "start_time": "2023-05-26T11:53:06.866645Z"
    }
   }
  },
  {
   "cell_type": "code",
-   "execution_count": 758,
+   "execution_count": 848,
   "outputs": [
    {
     "name": "stdout",
@@ -479,22 +481,22 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:50.068746Z",
+     "end_time": "2023-05-26T11:53:07.165868Z",
-     "start_time": "2023-05-26T11:20:49.828344Z"
+     "start_time": "2023-05-26T11:53:06.947952Z"
    }
   }
  },
  {
   "cell_type": "code",
-   "execution_count": 759,
+   "execution_count": 849,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "0.8666666666666667\n",
+      "0.8466666666666667\n",
-      "0.8353726362625139\n",
+      "0.8331479421579533\n",
-      "0.7781899109792285\n"
+      "0.7418397626112759\n"
     ]
    }
   ],
@@ -515,22 +517,22 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:50.084031Z",
+     "end_time": "2023-05-26T11:53:07.171868Z",
-     "start_time": "2023-05-26T11:20:50.066978Z"
+     "start_time": "2023-05-26T11:53:07.168020Z"
    }
   }
  },
  {
   "cell_type": "code",
-   "execution_count": 760,
+   "execution_count": 850,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "0.98\n",
+      "0.9777777777777777\n",
-      "0.9632925472747497\n",
+      "0.9610678531701891\n",
-      "0.9458456973293768\n"
+      "0.9399109792284867\n"
     ]
    }
   ],
@@ -551,14 +553,14 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:50.118233Z",
+     "end_time": "2023-05-26T11:53:07.215842Z",
-     "start_time": "2023-05-26T11:20:50.083636Z"
+     "start_time": "2023-05-26T11:53:07.175125Z"
    }
   }
  },
  {
   "cell_type": "code",
-   "execution_count": 761,
+   "execution_count": 851,
   "outputs": [
    {
     "name": "stdout",
@@ -587,8 +589,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:50.124837Z",
+     "end_time": "2023-05-26T11:53:07.223374Z",
-     "start_time": "2023-05-26T11:20:50.120256Z"
+     "start_time": "2023-05-26T11:53:07.218152Z"
    }
   }
  },
@@ -596,7 +598,15 @@
   "cell_type": "markdown",
   "source": [
    "## 6: Evaluation and Model Selection\n",
-    "blah blah"
+    "Now that we've trained and tested the models, we can evaluate them and select the best one.\n",
+    "\n",
+    "The evaluation is based on each model's accuracy when predicting the true labels of the test data, with higher accuracy being better. The scores were collected into a dictionary linking each one to a human-readable string that identifies its model; further dictionaries link the scores to the predicted values and test values, so the best model can be identified and shown programmatically.\n",
+    "\n",
+    "Finally, an average score was computed for each algorithm, and for each test/training split ratio.\n",
+    "\n",
+    "The best model, algorithm, and split ratio are then printed, and a confusion matrix is generated for the best model. This makes it easy to see the best single model, the best algorithm, or the best test/training split ratio, whichever you are after.\n",
+    "\n",
+    "The best algorithm indicates what would be ideal if you are unsure how much data you will have to train with: the less data you have, the less of it can be spared for testing, because it is needed for training. The best split can be used if you are unsure which algorithm you want to use, or if you intend to use an algorithm that wasn't tested here. The best overall model can be used if you don't intend to do further research and would rather just use the results of this project."
   ],
   "metadata": {
    "collapsed": false
@@ -604,7 +614,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 762,
+   "execution_count": 852,
   "outputs": [
    {
     "name": "stdout",
@@ -726,8 +736,8 @@
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
-     "end_time": "2023-05-26T11:20:50.371133Z",
+     "end_time": "2023-05-26T11:53:07.506847Z",
-     "start_time": "2023-05-26T11:20:50.134150Z"
+     "start_time": "2023-05-26T11:53:07.222988Z"
    }
   }
  },
@@ -735,7 +745,13 @@
   "cell_type": "markdown",
   "source": [
    "## Conclusion\n",
-    "blah blah"
+    "As can be seen, the best model was the K Nearest Neighbour model trained with the 25% test / 75% training split, so if you were to choose a model directly from these results, that would be the one.\n",
+    "\n",
+    "The best algorithm was the Support Vector Classification algorithm, so if you wanted to ensure reliable results with a data set of indeterminate size, that would likely be your best bet.\n",
+    "\n",
+    "Finally, the best split ratio is, rather expectedly, the 25% test and 75% training split; if you simply wanted the best results for any algorithm, choosing that ratio would be best.\n",
+    "\n",
+    "So in summary, it appears that if you wanted to train an ML model to recognise written text, at least if that text is numeric, then using the K Nearest Neighbour algorithm and training it with a 25% test and 75% training ratio would be the best choice."
   ],
   "metadata": {
    "collapsed": false