"Six different algorithms will be used, each with three different train/test splits of the dataset, resulting in 18 different models to be evaluated.\n",
"\n",
"The algorithms will be **Gaussian Naive Bayes** (GNB), **K Nearest Neighbour** (KNN), **Support Vector Classification** (SVC), **Decision Tree Classifier** (DTC), **Random Forest Classifier** (RFC) and **Logistic Regression** (LR). These are six of the most widely used classification algorithms, so by testing all of them we should be able to determine which is best suited to the problem of recognising written text.\n",
"\n",
"GNB is based on Bayes' theorem, which calculates the probability of each class from prior knowledge, while assuming that the features are independent of one another given the class and that each feature follows a Gaussian distribution.\n",
"\n",
"KNN represents each sample as a point in a space whose dimensionality equals the number of features. It classifies new data by a majority vote among the classes of its nearest training points.\n",
"\n",
"SVC classifies data based on which side of a decision boundary a given value falls; this boundary is chosen to maximise the margin between values from different classes.\n",
"\n",
"DTC constructs a tree of logical branches, repeatedly splitting the given data until each path ends in a conclusion about the class of the data: the so-called leaves.\n",
"\n",
"RFC is an ensemble of decision trees: it aggregates the predictions of multiple DTC models, with the majority vote becoming the predicted class. It should therefore generally be more accurate than a single DTC.\n",
"\n",
"LR uses a sigmoid function to estimate the probability that a given sample belongs to a given class. It is most often used for binary problems, but it extends to multi-class problems such as this one (e.g. via one-vs-rest or a softmax).\n",
"\n",
"Any of these algorithms could work for this problem, and so they must be tested to determine which is the most suitable."
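The comparison described above can be sketched in a few lines; a minimal version, assuming scikit-learn and using its bundled `digits` dataset as a stand-in for the handwritten-text data (model names and the fixed `random_state` are illustrative choices, not the notebook's exact settings):

```python
# Fit all six classifiers on one train/test split and compare accuracy.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "DTC": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=2000),
}

# Accuracy on the held-out test set for each model.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Repeating this loop for each of the three split ratios yields the 18 models mentioned above.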
],
"metadata": {
"collapsed": false
...
...
},
{
"cell_type": "code",
"execution_count": 934,
"outputs": [],
"source": [
"# We'll start by splitting the data into training and testing sets, using a 75% train / 25% test split, a 50/50 split, and a 25% train / 75% test split.\n",
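A minimal sketch of the three splits described in the comment above, assuming scikit-learn's `train_test_split`; the arrays `X` and `y` here are small placeholders, not the notebook's actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 2 features, 10 classes.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 10

# The three ratios: 75/25, 50/50, and 25/75 (train/test).
splits = {}
for test_size in (0.25, 0.50, 0.75):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    splits[test_size] = (len(X_tr), len(X_te))

print(splits)  # {0.25: (75, 25), 0.5: (50, 50), 0.75: (25, 75)}
```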
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:35.834391Z",
"start_time": "2023-05-26T14:17:35.801032Z"
}
}
},
...
...
},
{
"cell_type": "code",
"execution_count": 935,
"outputs": [],
"source": [
"# First the Gaussian Naive Bayes\n",
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:38.890650Z",
"start_time": "2023-05-26T14:17:35.810495Z"
}
}
},
...
...
},
{
"cell_type": "code",
"execution_count": 936,
"outputs": [
{
"name": "stdout",
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:38.902252Z",
"start_time": "2023-05-26T14:17:38.891565Z"
}
}
},
{
"cell_type": "code",
"execution_count": 937,
"outputs": [
{
"name": "stdout",
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:38.996083Z",
"start_time": "2023-05-26T14:17:38.913608Z"
}
}
},
{
"cell_type": "code",
"execution_count": 938,
"outputs": [
{
"name": "stdout",
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:39.209822Z",
"start_time": "2023-05-26T14:17:38.996267Z"
}
}
},
{
"cell_type": "code",
"execution_count": 939,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8377777777777777\n",
"0.8487208008898777\n",
"0.7789317507418397\n"
]
}
],
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:39.216072Z",
"start_time": "2023-05-26T14:17:39.212717Z"
}
}
},
{
"cell_type": "code",
"execution_count": 940,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9844444444444445\n",
"0.9599555061179088\n",
"0.9473293768545994\n"
]
}
],
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:39.259071Z",
"start_time": "2023-05-26T14:17:39.217697Z"
}
}
},
{
"cell_type": "code",
"execution_count": 941,
"outputs": [
{
"name": "stdout",
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:39.267734Z",
"start_time": "2023-05-26T14:17:39.263262Z"
}
}
},
...
...
},
{
"cell_type": "code",
"execution_count": 942,
"outputs": [
{
"name": "stdout",
...
...
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T14:17:39.516773Z",
"start_time": "2023-05-26T14:17:39.267995Z"
}
}
},
...
...
"\n",
"Finally, the best split ratio is, rather expectedly, the 25% test and 75% training split; if you simply wished to get the best results from any algorithm, that train/test ratio would be the best choice.\n",
"\n",
"So in summary, it appears that if you wanted to train an ML model to recognise written text, at least when that text is numeric, then using the K Nearest Neighbour algorithm with a 75% training and 25% test split would be the best choice."