"For training and testing these algorithms, there were three variations of each algorithm, each using a different distribution of training and testing data. One has a 25% test 75% training split, which will likely result in the best test results, and could potentially result in overfitting, the next has a 50/50 split, likely resulting in a good balance, and the last has a 75% test 25% training split, which could potentially result in underfitting.\n",
"For training and testing these algorithms, there were three variations of each algorithm, each using a different distribution of training and testing data. One has a 25% test 75% training split, which will likely result in the best test results, and could potentially result in overfitting, the next has a 50/50 split, likely resulting in a good balance, and the last has a 75% test 25% training split, which could potentially result in underfitting.\n",
"\n",
"Each algorithm was tested using all three training and testing sets, creating three different models per algorithm, resulting in 18 models total."
"Each algorithm was tested using all three training and testing sets, creating three different models per algorithm, resulting in 18 models total."
],
],
"metadata": {
"metadata": {
...
@@ -302,19 +303,20 @@
...
@@ -302,19 +303,20 @@
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 754,
"execution_count": 844,
"outputs": [],
"outputs": [],
"source": [
"source": [
"# We'll start by splitting the data into training and testing, going with a 75% train, 25% test split, a 50/50 split, and a 25% train 75% test split.\n",
"# We'll start by splitting the data into training and testing, going with a 75% train, 25% test split, a 50/50 split, and a 25% train 75% test split.\n",
"# Nonsense code to prevent SVC object being output below\n",
"# Nonsense code to prevent LR object being output below\n",
"nothing = 1"
"nothing = 1"
],
],
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:49.728828Z",
"end_time": "2023-05-26T11:53:06.851081Z",
"start_time": "2023-05-26T11:20:46.663509Z"
"start_time": "2023-05-26T11:53:03.904257Z"
}
}
}
}
},
},
...
@@ -378,7 +380,7 @@
...
@@ -378,7 +380,7 @@
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 756,
"execution_count": 846,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
...
@@ -407,14 +409,14 @@
...
@@ -407,14 +409,14 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:49.741728Z",
"end_time": "2023-05-26T11:53:06.864130Z",
"start_time": "2023-05-26T11:20:49.730504Z"
"start_time": "2023-05-26T11:53:06.853426Z"
}
}
}
}
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 757,
"execution_count": 847,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
...
@@ -443,14 +445,14 @@
...
@@ -443,14 +445,14 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:49.826269Z",
"end_time": "2023-05-26T11:53:06.947095Z",
"start_time": "2023-05-26T11:20:49.744184Z"
"start_time": "2023-05-26T11:53:06.866645Z"
}
}
}
}
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 758,
"execution_count": 848,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
...
@@ -479,22 +481,22 @@
...
@@ -479,22 +481,22 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.068746Z",
"end_time": "2023-05-26T11:53:07.165868Z",
"start_time": "2023-05-26T11:20:49.828344Z"
"start_time": "2023-05-26T11:53:06.947952Z"
}
}
}
}
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 759,
"execution_count": 849,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"0.8666666666666667\n",
"0.8466666666666667\n",
"0.8353726362625139\n",
"0.8331479421579533\n",
"0.7781899109792285\n"
"0.7418397626112759\n"
]
]
}
}
],
],
...
@@ -515,22 +517,22 @@
...
@@ -515,22 +517,22 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.084031Z",
"end_time": "2023-05-26T11:53:07.171868Z",
"start_time": "2023-05-26T11:20:50.066978Z"
"start_time": "2023-05-26T11:53:07.168020Z"
}
}
}
}
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 760,
"execution_count": 850,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"0.98\n",
"0.9777777777777777\n",
"0.9632925472747497\n",
"0.9610678531701891\n",
"0.9458456973293768\n"
"0.9399109792284867\n"
]
]
}
}
],
],
...
@@ -551,14 +553,14 @@
...
@@ -551,14 +553,14 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.118233Z",
"end_time": "2023-05-26T11:53:07.215842Z",
"start_time": "2023-05-26T11:20:50.083636Z"
"start_time": "2023-05-26T11:53:07.175125Z"
}
}
}
}
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 761,
"execution_count": 851,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
...
@@ -587,8 +589,8 @@
...
@@ -587,8 +589,8 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.124837Z",
"end_time": "2023-05-26T11:53:07.223374Z",
"start_time": "2023-05-26T11:20:50.120256Z"
"start_time": "2023-05-26T11:53:07.218152Z"
}
}
}
}
},
},
...
@@ -596,7 +598,15 @@
...
@@ -596,7 +598,15 @@
"cell_type": "markdown",
"cell_type": "markdown",
"source": [
"source": [
"## 6: Evaluation and Model Selection\n",
"## 6: Evaluation and Model Selection\n",
"blah blah"
"Now that we've trained and tested the models, we can evaluate them and select the best one.\n",
"\n",
"The evaluation is based off of the models accuracy when predicted the test data's true label, with a higher accuracy being better. The scores were all collected into a dictionary linking them to a String which identifies which model it was in a human-readable way, then further dictionaries were made linking the scores to the predicted values and test values, so it can be easily and programmatically identified and shown which model was best.\n",
"\n",
"Finally, an average was found for each algorthm using basic statistical analysis, and an average for each test and training data split ratio was also found.\n",
"\n",
"The best model, algorithm, and split ratio are then printed, and a confusion matrix is generated for the best model. With this, whether you want the best single model, the best algorithm, or the best testing training data split ratio, you can easily see it.\n",
"\n",
"The best algorithm indicated what would be ideal if you were unsure of the size of the data you're training with, as the less data you have the less of the data can be used for testing, as it's needed for training. The best split can be used if you're unsure of which algorithm you want to use, or if you intend to use an algorithm that wasn't tested here. The best overall model can be used if you don't intend to do further research, and would rather just use the results of this project."
],
],
"metadata": {
"metadata": {
"collapsed": false
"collapsed": false
...
@@ -604,7 +614,7 @@
...
@@ -604,7 +614,7 @@
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 762,
"execution_count": 852,
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
...
@@ -726,8 +736,8 @@
...
@@ -726,8 +736,8 @@
"metadata": {
"metadata": {
"collapsed": false,
"collapsed": false,
"ExecuteTime": {
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.371133Z",
"end_time": "2023-05-26T11:53:07.506847Z",
"start_time": "2023-05-26T11:20:50.134150Z"
"start_time": "2023-05-26T11:53:07.222988Z"
}
}
}
}
},
},
...
@@ -735,7 +745,13 @@
...
@@ -735,7 +745,13 @@
"cell_type": "markdown",
"cell_type": "markdown",
"source": [
"source": [
"## Conclusion\n",
"## Conclusion\n",
"blah blah"
"As can be seen, the best model was the K Nearest Neighbour model, trained on a 25% test data split, so if you were to directly choose a model based off of this, that would be the one.\n",
"\n",
"The best algorithm was the Support Vector Classification algorithm, so if you wanted to ensure reliable results with a data set of indeterminate size, that would likely be your best bet.\n",
"\n",
"Finally, the best split ratio is rather expectedly the 25% Test and 75% training split; If you wished to simply get the best results for any algorithm, choosing that test training ratio would be best.\n",
"\n",
"So in summary, it appears that if you wanted to train a ML model to recognise written text, at least if that text is numeric, then using the K Nearest Neighbour algorithm, and training it with a 25% test and 75% training ration, would be the best choice."