"Being able to recognise the written word through the use of **Artificial Intelligence** (AI) is incredibly beneficial, as it allows menial jobs that brought nothing to peoples lives to be taken by AI instead, freeing those people to pursue something more meaningful. To be able to recognise handwriting specific to a single person is a difficult task, and to be able to recognise handwriting written by anyone is even more so, as such one should know that the algorithm and settings used to train the **Machine Learning** (ML) model are the best for the task.\n",
"\n",
"To solve this problem, three different ML classifier algorithms will be tested, each with 3 different splits of the data used for training and testing, resulting in nine different models. The results of this training can then be compared and contrasted, evaluating the whole process to not only determine which is the best single model, but also the best algorithm, and the best split of training and testing data."
"To solve this problem, six different ML classifier algorithms will be tested, each with 3 different splits of the data used for training and testing, resulting in eighteen different models. The results of this training can then be compared and contrasted, evaluating the whole process to not only determine which is the best single model, but also the best algorithm, and the best split of training and testing data."
],
"metadata": {
"collapsed": false
...
...
@@ -24,7 +24,7 @@
},
{
"cell_type": "code",
"execution_count": 585,
"execution_count": 748,
"outputs": [],
"source": [
"# Importing pyplot so we can visualize things\n",
"# Create the classifiers, 3 of each type to determine which is best\n",
"gnb = GaussianNB()\n",
...
...
@@ -58,13 +61,25 @@
"\n",
"svc = SVC()\n",
"svc2 = SVC()\n",
"svc3 = SVC()"
"svc3 = SVC()\n",
"\n",
"dtc = DecisionTreeClassifier()\n",
"dtc2 = DecisionTreeClassifier()\n",
"dtc3 = DecisionTreeClassifier()\n",
"\n",
"rfc = RandomForestClassifier()\n",
"rfc2 = RandomForestClassifier()\n",
"rfc3 = RandomForestClassifier()\n",
"\n",
"lr = LogisticRegression(max_iter=20000)\n",
"lr2 = LogisticRegression(max_iter=20000)\n",
"lr3 = LogisticRegression(max_iter=20000)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.126180Z",
"start_time": "2023-05-26T10:45:04.046288Z"
"end_time": "2023-05-26T11:20:46.485Z",
"start_time": "2023-05-26T11:20:46.377856Z"
}
}
},
...
...
@@ -72,7 +87,7 @@
"cell_type": "markdown",
"source": [
"## 2: Dataset selection\n",
"blah blah"
"To determine which algorithm is best for the task of recognising the written word, it's necessary to test a dataset that creates an equivalent challenge, while not being so large and complete that it would simply be performing the task itself. As such, the digits dataset was chosen, as it is both readily available, and therefore can be easily tested with other algorithms or settings if further investigation is desired, while also being an approximate representation of recognising the written word, albeit limited to numbers."
"As can be seen, the features contain arrays of integers ranging from 0 through 16, representing the pixels in the images."
"As can be seen, the features contain 8x8 2D arrays of integers ranging from 0 through 16, representing the pixels in the images."
],
"metadata": {
"collapsed": false
...
...
@@ -153,7 +168,7 @@
"cell_type": "markdown",
"source": [
"## 3: Exploring the Data\n",
"blah blah"
"Now we should look through the data, confirm it contains what we expect, transform it if needed such that it can be used, and test to confirm it's viable."
"Six different algorithms will be used, with three different distributions of testing and training data from the dataset, resulting in 18 different models to be evaluated.\n",
"\n",
"The algorithms will be the **Gaussian Naive Bayes** (GNB), **K Nearest Neighbour** (KNN), **Support Vector Classification** (SVC) algorithms, **Decision Tree Classifier** (DTC), **Random Forest Classifier** (RFC) and **Linear Regression** (LR). This is a collection of the six top machine learning algorithms for classification, and as such testing all of them we should be able to determine which should be the best for the problem of recognising written text."
],
"metadata": {
"collapsed": false
...
...
@@ -276,7 +293,8 @@
"cell_type": "markdown",
"source": [
"## 5: Training and Testing\n",
"blah blah"
"For training and testing these algorithms, there were three variations of each algorithm, each using a different distribution of training and testing data. One has a 25% test 75% training split, which will likely result in the best test results, and could potentially result in overfitting, the next has a 50/50 split, likely resulting in a good balance, and the last has a 75% test 25% training split, which could potentially result in underfitting.\n",
"Each algorithm was tested using all three training and testing sets, creating three different models per algorithm, resulting in 18 models total."
],
"metadata": {
"collapsed": false
...
...
@@ -284,7 +302,7 @@
},
{
"cell_type": "code",
"execution_count": 591,
"execution_count": 754,
"outputs": [],
"source": [
"# We'll start by splitting the data into training and testing, going with a 75% train, 25% test split, a 50/50 split, and a 25% train 75% test split.\n",
...
...
@@ -295,8 +313,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.299839Z",
"start_time": "2023-05-26T10:45:04.296199Z"
"end_time": "2023-05-26T11:20:46.697098Z",
"start_time": "2023-05-26T11:20:46.654924Z"
}
}
},
...
...
@@ -311,7 +329,7 @@
},
{
"cell_type": "code",
"execution_count": 592,
"execution_count": 755,
"outputs": [],
"source": [
"# First the Gaussian Bayes\n",
...
...
@@ -322,18 +340,30 @@
"knc.fit(X_train, y_train)\n",
"knc2.fit(X_train2, y_train2)\n",
"knc3.fit(X_train3, y_train3)\n",
"# And finally the support vector classifier\n",
"# And then the support vector classifier\n",
"svc.fit(X_train, y_train)\n",
"svc2.fit(X_train2, y_train2)\n",
"svc3.fit(X_train3, y_train3)\n",
"# And then the decision tree classifier\n",
"dtc.fit(X_train, y_train)\n",
"dtc2.fit(X_train2, y_train2)\n",
"dtc3.fit(X_train3, y_train3)\n",
"# And then the random forest classifier\n",
"rfc.fit(X_train, y_train)\n",
"rfc2.fit(X_train2, y_train2)\n",
"rfc3.fit(X_train3, y_train3)\n",
"# And finally the logistic regression\n",
"lr.fit(X_train, y_train)\n",
"lr2.fit(X_train2, y_train2)\n",
"lr3.fit(X_train3, y_train3)\n",
"# Nonsense code to prevent SVC object being output below\n",