Commit d71b418f authored by Jonathan Poalses

Added three more algorithms and wrote sections 2, 3, 4, and 5.

parent c7de502c
......@@ -16,7 +16,7 @@
"## 1: Overview\n",
"Being able to recognise the written word through the use of **Artificial Intelligence** (AI) is incredibly beneficial, as it allows menial jobs that brought nothing to peoples lives to be taken by AI instead, freeing those people to pursue something more meaningful. To be able to recognise handwriting specific to a single person is a difficult task, and to be able to recognise handwriting written by anyone is even more so, as such one should know that the algorithm and settings used to train the **Machine Learning** (ML) model are the best for the task.\n",
"\n",
"To solve this problem, three different ML classifier algorithms will be tested, each with 3 different splits of the data used for training and testing, resulting in nine different models. The results of this training can then be compared and contrasted, evaluating the whole process to not only determine which is the best single model, but also the best algorithm, and the best split of training and testing data."
"To solve this problem, six different ML classifier algorithms will be tested, each with 3 different splits of the data used for training and testing, resulting in eighteen different models. The results of this training can then be compared and contrasted, evaluating the whole process to not only determine which is the best single model, but also the best algorithm, and the best split of training and testing data."
],
"metadata": {
"collapsed": false
......@@ -24,7 +24,7 @@
},
{
"cell_type": "code",
"execution_count": 585,
"execution_count": 748,
"outputs": [],
"source": [
"# Importing pyplot so we can visualize things\n",
......@@ -46,6 +46,9 @@
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.svm import SVC\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Create the classifiers, 3 of each type to determine which is best\n",
"gnb = GaussianNB()\n",
......@@ -58,13 +61,25 @@
"\n",
"svc = SVC()\n",
"svc2 = SVC()\n",
"svc3 = SVC()"
"svc3 = SVC()\n",
"\n",
"dtc = DecisionTreeClassifier()\n",
"dtc2 = DecisionTreeClassifier()\n",
"dtc3 = DecisionTreeClassifier()\n",
"\n",
"rfc = RandomForestClassifier()\n",
"rfc2 = RandomForestClassifier()\n",
"rfc3 = RandomForestClassifier()\n",
"\n",
"lr = LogisticRegression(max_iter=20000)\n",
"lr2 = LogisticRegression(max_iter=20000)\n",
"lr3 = LogisticRegression(max_iter=20000)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.126180Z",
"start_time": "2023-05-26T10:45:04.046288Z"
"end_time": "2023-05-26T11:20:46.485Z",
"start_time": "2023-05-26T11:20:46.377856Z"
}
}
},
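{
"cell_type": "markdown",
"source": [
"Creating eighteen estimators by hand works but is repetitive. As a minimal sketch of an alternative (not used in the rest of this notebook), `sklearn.base.clone` can stamp out independent copies of one prototype per algorithm:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from sklearn.base import clone\n",
"\n",
"# Hypothetical alternative: one prototype per algorithm, cloned once per split\n",
"prototypes = {\"gnb\": GaussianNB(), \"knc\": KNeighborsClassifier(), \"svc\": SVC(),\n",
"              \"dtc\": DecisionTreeClassifier(), \"rfc\": RandomForestClassifier(),\n",
"              \"lr\": LogisticRegression(max_iter=20000)}\n",
"# Three independent, unfitted copies of each estimator, one per train/test split\n",
"models = {name: [clone(proto) for _ in range(3)] for name, proto in prototypes.items()}"
],
"metadata": {
"collapsed": false
}
},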
......@@ -72,7 +87,7 @@
"cell_type": "markdown",
"source": [
"## 2: Dataset selection\n",
"blah blah"
"To determine which algorithm is best for the task of recognising the written word, it's necessary to test a dataset that creates an equivalent challenge, while not being so large and complete that it would simply be performing the task itself. As such, the digits dataset was chosen, as it is both readily available, and therefore can be easily tested with other algorithms or settings if further investigation is desired, while also being an approximate representation of recognising the written word, albeit limited to numbers."
],
"metadata": {
"collapsed": false
......@@ -80,13 +95,13 @@
},
{
"cell_type": "code",
"execution_count": 586,
"execution_count": 749,
"outputs": [
{
"data": {
"text/plain": "array([0, 1, 2, ..., 8, 9, 8])"
},
"execution_count": 586,
"execution_count": 749,
"metadata": {},
"output_type": "execute_result"
}
......@@ -101,8 +116,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.144643Z",
"start_time": "2023-05-26T10:45:04.052117Z"
"end_time": "2023-05-26T11:20:46.486348Z",
"start_time": "2023-05-26T11:20:46.392825Z"
}
}
},
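{
"cell_type": "markdown",
"source": [
"The digits dataset ships with scikit-learn, so its suitability is easy to sanity-check. A small sketch of such checks, assuming the dataset bunch loaded above is named `digits`:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Hedged sanity checks; assumes the bunch loaded above is named `digits`\n",
"print(digits.images.shape)   # expected: (1797, 8, 8) pixel grids\n",
"print(digits.data.shape)     # expected: (1797, 64) flattened features\n",
"print(digits.target.min(), digits.target.max())  # expected: 0 9"
],
"metadata": {
"collapsed": false
}
},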
......@@ -117,13 +132,13 @@
},
{
"cell_type": "code",
"execution_count": 587,
"execution_count": 750,
"outputs": [
{
"data": {
"text/plain": "array([[[ 0., 0., 5., ..., 1., 0., 0.],\n [ 0., 0., 13., ..., 15., 5., 0.],\n [ 0., 3., 15., ..., 11., 8., 0.],\n ...,\n [ 0., 4., 11., ..., 12., 7., 0.],\n [ 0., 2., 14., ..., 12., 0., 0.],\n [ 0., 0., 6., ..., 0., 0., 0.]],\n\n [[ 0., 0., 0., ..., 5., 0., 0.],\n [ 0., 0., 0., ..., 9., 0., 0.],\n [ 0., 0., 3., ..., 6., 0., 0.],\n ...,\n [ 0., 0., 1., ..., 6., 0., 0.],\n [ 0., 0., 1., ..., 6., 0., 0.],\n [ 0., 0., 0., ..., 10., 0., 0.]],\n\n [[ 0., 0., 0., ..., 12., 0., 0.],\n [ 0., 0., 3., ..., 14., 0., 0.],\n [ 0., 0., 8., ..., 16., 0., 0.],\n ...,\n [ 0., 9., 16., ..., 0., 0., 0.],\n [ 0., 3., 13., ..., 11., 5., 0.],\n [ 0., 0., 0., ..., 16., 9., 0.]],\n\n ...,\n\n [[ 0., 0., 1., ..., 1., 0., 0.],\n [ 0., 0., 13., ..., 2., 1., 0.],\n [ 0., 0., 16., ..., 16., 5., 0.],\n ...,\n [ 0., 0., 16., ..., 15., 0., 0.],\n [ 0., 0., 15., ..., 16., 0., 0.],\n [ 0., 0., 2., ..., 6., 0., 0.]],\n\n [[ 0., 0., 2., ..., 0., 0., 0.],\n [ 0., 0., 14., ..., 15., 1., 0.],\n [ 0., 4., 16., ..., 16., 7., 0.],\n ...,\n [ 0., 0., 0., ..., 16., 2., 0.],\n [ 0., 0., 4., ..., 16., 2., 0.],\n [ 0., 0., 5., ..., 12., 0., 0.]],\n\n [[ 0., 0., 10., ..., 1., 0., 0.],\n [ 0., 2., 16., ..., 1., 0., 0.],\n [ 0., 0., 15., ..., 15., 0., 0.],\n ...,\n [ 0., 4., 16., ..., 16., 6., 0.],\n [ 0., 8., 16., ..., 16., 8., 0.],\n [ 0., 1., 8., ..., 12., 1., 0.]]])"
},
"execution_count": 587,
"execution_count": 750,
"metadata": {},
"output_type": "execute_result"
}
......@@ -135,15 +150,15 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.145084Z",
"start_time": "2023-05-26T10:45:04.066397Z"
"end_time": "2023-05-26T11:20:46.486460Z",
"start_time": "2023-05-26T11:20:46.411001Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"As can be seen, the features contain arrays of integers ranging from 0 through 16, representing the pixels in the images."
"As can be seen, the features contain 8x8 2D arrays of integers ranging from 0 through 16, representing the pixels in the images."
],
"metadata": {
"collapsed": false
......@@ -153,7 +168,7 @@
"cell_type": "markdown",
"source": [
"## 3: Exploring the Data\n",
"blah blah"
"Now we should look through the data, confirm it contains what we expect, transform it if needed such that it can be used, and test to confirm it's viable."
],
"metadata": {
"collapsed": false
......@@ -161,7 +176,7 @@
},
{
"cell_type": "code",
"execution_count": 588,
"execution_count": 751,
"outputs": [
{
"data": {
......@@ -188,8 +203,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.283331Z",
"start_time": "2023-05-26T10:45:04.075539Z"
"end_time": "2023-05-26T11:20:46.640352Z",
"start_time": "2023-05-26T11:20:46.430010Z"
}
}
},
......@@ -204,13 +219,13 @@
},
{
"cell_type": "code",
"execution_count": 589,
"execution_count": 752,
"outputs": [
{
"data": {
"text/plain": "array([[ 0., 0., 5., ..., 0., 0., 0.],\n [ 0., 0., 0., ..., 10., 0., 0.],\n [ 0., 0., 0., ..., 16., 9., 0.],\n ...,\n [ 0., 0., 1., ..., 6., 0., 0.],\n [ 0., 0., 2., ..., 12., 0., 0.],\n [ 0., 0., 10., ..., 12., 1., 0.]])"
},
"execution_count": 589,
"execution_count": 752,
"metadata": {},
"output_type": "execute_result"
}
......@@ -223,20 +238,20 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.288267Z",
"start_time": "2023-05-26T10:45:04.285586Z"
"end_time": "2023-05-26T11:20:46.645409Z",
"start_time": "2023-05-26T11:20:46.640033Z"
}
}
},
{
"cell_type": "code",
"execution_count": 590,
"execution_count": 753,
"outputs": [
{
"data": {
"text/plain": "True"
},
"execution_count": 590,
"execution_count": 753,
"metadata": {},
"output_type": "execute_result"
}
......@@ -248,8 +263,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.293597Z",
"start_time": "2023-05-26T10:45:04.290103Z"
"end_time": "2023-05-26T11:20:46.650714Z",
"start_time": "2023-05-26T11:20:46.646514Z"
}
}
},
......@@ -266,7 +281,9 @@
"cell_type": "markdown",
"source": [
"## 4: Model Explanation\n",
"blah blah"
"Six different algorithms will be used, with three different distributions of testing and training data from the dataset, resulting in 18 different models to be evaluated.\n",
"\n",
"The algorithms will be the **Gaussian Naive Bayes** (GNB), **K Nearest Neighbour** (KNN), **Support Vector Classification** (SVC) algorithms, **Decision Tree Classifier** (DTC), **Random Forest Classifier** (RFC) and **Linear Regression** (LR). This is a collection of the six top machine learning algorithms for classification, and as such testing all of them we should be able to determine which should be the best for the problem of recognising written text."
],
"metadata": {
"collapsed": false
......@@ -276,7 +293,8 @@
"cell_type": "markdown",
"source": [
"## 5: Training and Testing\n",
"blah blah"
"For training and testing these algorithms, there were three variations of each algorithm, each using a different distribution of training and testing data. One has a 25% test 75% training split, which will likely result in the best test results, and could potentially result in overfitting, the next has a 50/50 split, likely resulting in a good balance, and the last has a 75% test 25% training split, which could potentially result in underfitting.\n",
"Each algorithm was tested using all three training and testing sets, creating three different models per algorithm, resulting in 18 models total."
],
"metadata": {
"collapsed": false
......@@ -284,7 +302,7 @@
},
{
"cell_type": "code",
"execution_count": 591,
"execution_count": 754,
"outputs": [],
"source": [
"# We'll start by splitting the data into training and testing, going with a 75% train, 25% test split, a 50/50 split, and a 25% train 75% test split.\n",
......@@ -295,8 +313,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.299839Z",
"start_time": "2023-05-26T10:45:04.296199Z"
"end_time": "2023-05-26T11:20:46.697098Z",
"start_time": "2023-05-26T11:20:46.654924Z"
}
}
},
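{
"cell_type": "markdown",
"source": [
"One design note on the splits above: `train_test_split` samples rows at random, so the digit classes can end up slightly unbalanced, especially in the smallest training set. A hedged sketch of a stratified alternative (the names `data` and `digits.target` are assumed from the cells above):"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical alternative: stratify keeps all ten digits represented\n",
"# proportionally even in the 25% training split\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    data, digits.target, test_size=0.75, random_state=0,\n",
"    stratify=digits.target)"
],
"metadata": {
"collapsed": false
}
},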
......@@ -311,7 +329,7 @@
},
{
"cell_type": "code",
"execution_count": 592,
"execution_count": 755,
"outputs": [],
"source": [
"# First the Gaussian Bayes\n",
......@@ -322,18 +340,30 @@
"knc.fit(X_train, y_train)\n",
"knc2.fit(X_train2, y_train2)\n",
"knc3.fit(X_train3, y_train3)\n",
"# And finally the support vector classifier\n",
"# And then the support vector classifier\n",
"svc.fit(X_train, y_train)\n",
"svc2.fit(X_train2, y_train2)\n",
"svc3.fit(X_train3, y_train3)\n",
"# And then the decision tree classifier\n",
"dtc.fit(X_train, y_train)\n",
"dtc2.fit(X_train2, y_train2)\n",
"dtc3.fit(X_train3, y_train3)\n",
"# And then the random forest classifier\n",
"rfc.fit(X_train, y_train)\n",
"rfc2.fit(X_train2, y_train2)\n",
"rfc3.fit(X_train3, y_train3)\n",
"# And finally the logistic regression\n",
"lr.fit(X_train, y_train)\n",
"lr2.fit(X_train2, y_train2)\n",
"lr3.fit(X_train3, y_train3)\n",
"# Nonsense code to prevent SVC object being output below\n",
"nothing = 1"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.363452Z",
"start_time": "2023-05-26T10:45:04.301313Z"
"end_time": "2023-05-26T11:20:49.728828Z",
"start_time": "2023-05-26T11:20:46.663509Z"
}
}
},
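{
"cell_type": "markdown",
"source": [
"A single fixed split gives each model one accuracy number; k-fold cross-validation would average over several at extra training cost. A minimal sketch of that alternative for one classifier, again assuming `data` and `digits.target` from above (a side check, not part of the eighteen-model comparison):"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Hypothetical check: 5-fold cross-validated accuracy for a fresh SVC,\n",
"# averaging out the luck of any single train/test split\n",
"cv_scores = cross_val_score(SVC(), data, digits.target, cv=5)\n",
"print(cv_scores.mean(), cv_scores.std())"
],
"metadata": {
"collapsed": false
}
},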
......@@ -348,7 +378,7 @@
},
{
"cell_type": "code",
"execution_count": 593,
"execution_count": 756,
"outputs": [
{
"name": "stdout",
......@@ -377,14 +407,14 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.372713Z",
"start_time": "2023-05-26T10:45:04.364786Z"
"end_time": "2023-05-26T11:20:49.741728Z",
"start_time": "2023-05-26T11:20:49.730504Z"
}
}
},
{
"cell_type": "code",
"execution_count": 594,
"execution_count": 757,
"outputs": [
{
"name": "stdout",
......@@ -413,14 +443,14 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.443910Z",
"start_time": "2023-05-26T10:45:04.374509Z"
"end_time": "2023-05-26T11:20:49.826269Z",
"start_time": "2023-05-26T11:20:49.744184Z"
}
}
},
{
"cell_type": "code",
"execution_count": 595,
"execution_count": 758,
"outputs": [
{
"name": "stdout",
......@@ -433,7 +463,7 @@
}
],
"source": [
"# Finally the Support Vector Classification\n",
"# Now the Support Vector Classification\n",
"svc_predicted = svc.predict(X_test)\n",
"svc2_predicted = svc2.predict(X_test2)\n",
"svc3_predicted = svc3.predict(X_test3)\n",
......@@ -449,8 +479,116 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.681772Z",
"start_time": "2023-05-26T10:45:04.446851Z"
"end_time": "2023-05-26T11:20:50.068746Z",
"start_time": "2023-05-26T11:20:49.828344Z"
}
}
},
{
"cell_type": "code",
"execution_count": 759,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8666666666666667\n",
"0.8353726362625139\n",
"0.7781899109792285\n"
]
}
],
"source": [
"# Now the decision tree classifier\n",
"dtc_predicted = dtc.predict(X_test)\n",
"dtc2_predicted = dtc2.predict(X_test2)\n",
"dtc3_predicted = dtc3.predict(X_test3)\n",
"\n",
"# And let's see how they did\n",
"dtc_score = metrics.accuracy_score(y_test, dtc_predicted)\n",
"dtc2_score = metrics.accuracy_score(y_test2, dtc2_predicted)\n",
"dtc3_score = metrics.accuracy_score(y_test3, dtc3_predicted)\n",
"print(dtc_score)\n",
"print(dtc2_score)\n",
"print(dtc3_score)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.084031Z",
"start_time": "2023-05-26T11:20:50.066978Z"
}
}
},
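{
"cell_type": "markdown",
"source": [
"A caveat on the decision tree scores: `DecisionTreeClassifier` breaks ties between equally good splits at random, so re-running the notebook can shift these numbers slightly. A hedged sketch pinning that randomness down for reproducibility:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Hypothetical reproducibility tweak: fixing random_state makes the tree,\n",
"# and therefore its accuracy, identical on every run\n",
"dtc_fixed = DecisionTreeClassifier(random_state=0)\n",
"dtc_fixed.fit(X_train, y_train)\n",
"print(metrics.accuracy_score(y_test, dtc_fixed.predict(X_test)))"
],
"metadata": {
"collapsed": false
}
},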
{
"cell_type": "code",
"execution_count": 760,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.98\n",
"0.9632925472747497\n",
"0.9458456973293768\n"
]
}
],
"source": [
"# Now the random forest classifier\n",
"rfc_predicted = rfc.predict(X_test)\n",
"rfc2_predicted = rfc2.predict(X_test2)\n",
"rfc3_predicted = rfc3.predict(X_test3)\n",
"\n",
"# And let's see how they did\n",
"rfc_score = metrics.accuracy_score(y_test, rfc_predicted)\n",
"rfc2_score = metrics.accuracy_score(y_test2, rfc2_predicted)\n",
"rfc3_score = metrics.accuracy_score(y_test3, rfc3_predicted)\n",
"print(rfc_score)\n",
"print(rfc2_score)\n",
"print(rfc3_score)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.118233Z",
"start_time": "2023-05-26T11:20:50.083636Z"
}
}
},
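{
"cell_type": "markdown",
"source": [
"Beyond raw accuracy, a fitted random forest also reports how much each of the 64 pixels contributed to its decisions, which can be drawn back onto the 8x8 grid. A small exploratory sketch, assuming `plt` is the pyplot import from the first cell:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Exploratory sketch: visualise which pixels the first forest relied on\n",
"importances = rfc.feature_importances_.reshape(8, 8)\n",
"plt.imshow(importances, cmap='hot')\n",
"plt.title('Pixel importance (RFC 1st model)')\n",
"plt.colorbar()\n",
"plt.show()"
],
"metadata": {
"collapsed": false
}
},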
{
"cell_type": "code",
"execution_count": 761,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.98\n",
"0.9599555061179088\n",
"0.9525222551928784\n"
]
}
],
"source": [
"# Finally the logistic regression\n",
"lr_predicted = lr.predict(X_test)\n",
"lr2_predicted = lr2.predict(X_test2)\n",
"lr3_predicted = lr3.predict(X_test3)\n",
"\n",
"# And let's see how they did\n",
"lr_score = metrics.accuracy_score(y_test, lr_predicted)\n",
"lr2_score = metrics.accuracy_score(y_test2, lr2_predicted)\n",
"lr3_score = metrics.accuracy_score(y_test3, lr3_predicted)\n",
"print(lr_score)\n",
"print(lr2_score)\n",
"print(lr3_score)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T11:20:50.124837Z",
"start_time": "2023-05-26T11:20:50.120256Z"
}
}
},
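{
"cell_type": "markdown",
"source": [
"The very high `max_iter=20000` hints that the solver struggles to converge on the raw 0-16 pixel values. A hedged sketch of the usual remedy, standardising the features inside a pipeline, with which the default iteration budget is typically enough:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical alternative: scaling lets lbfgs converge without a huge max_iter\n",
"lr_scaled = make_pipeline(StandardScaler(), LogisticRegression())\n",
"lr_scaled.fit(X_train, y_train)\n",
"print(lr_scaled.score(X_test, y_test))"
],
"metadata": {
"collapsed": false
}
},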
......@@ -466,7 +604,7 @@
},
{
"cell_type": "code",
"execution_count": 596,
"execution_count": 762,
"outputs": [
{
"name": "stdout",
......@@ -496,7 +634,16 @@
" knc3_score : \"K Nearest Neighbour 3rd Model\",\n",
" svc_score : \"Support Vector Classification 1st Model\",\n",
" svc2_score : \"Support Vector Classification 2nd Model\",\n",
" svc3_score : \"Support Vector Classification 3rd Model\"\n",
" svc3_score : \"Support Vector Classification 3rd Model\",\n",
" dtc_score : \"Decision Tree Classifier 1st Model\",\n",
" dtc2_score : \"Decision Tree Classifier 2nd Model\",\n",
" dtc3_score : \"Decision Tree Classifier 3rd Model\",\n",
" rfc_score : \"Random Forest Classifier 1st Model\",\n",
" rfc2_score : \"Random Forest Classifier 2nd Model\",\n",
" rfc3_score : \"Random Forest Classifier 3rd Model\",\n",
" lr_score : \"Linear Regression 1st Model\",\n",
" lr2_score : \"Linear Regression 2nd Model\",\n",
" lr3_score : \"Linear Regression 3rd Model\"\n",
" }\n",
"# Prepare a dictionary to get the predicted values\n",
"prediction_dictionary = {gnb_score : gnb_predicted,\n",
......@@ -507,7 +654,16 @@
" knc3_score : knc3_predicted,\n",
" svc_score : svc_predicted,\n",
" svc2_score : svc2_predicted,\n",
" svc3_score : svc3_predicted\n",
" svc3_score : svc3_predicted,\n",
" dtc_score : dtc_predicted,\n",
" dtc2_score : dtc2_predicted,\n",
" dtc3_score : dtc3_predicted,\n",
" rfc_score : rfc_predicted,\n",
" rfc2_score : rfc2_predicted,\n",
" rfc3_score : rfc3_predicted,\n",
" lr_score : lr_predicted,\n",
" lr2_score : lr2_predicted,\n",
" lr3_score : lr3_predicted\n",
" }\n",
"# And finally a dictionary for the test values\n",
"test_dictionary = {gnb_score : y_test,\n",
......@@ -518,19 +674,34 @@
" knc3_score : y_test3,\n",
" svc_score : y_test,\n",
" svc2_score : y_test2,\n",
" svc3_score : y_test3\n",
" svc3_score : y_test3,\n",
" dtc_score : y_test,\n",
" dtc2_score : y_test2,\n",
" dtc3_score : y_test3,\n",
" rfc_score : y_test,\n",
" rfc2_score : y_test2,\n",
" rfc3_score : y_test3,\n",
" lr_score : y_test,\n",
" lr2_score : y_test2,\n",
" lr3_score : y_test3\n",
" }\n",
"# Get the average scores and put those in a dictionary\n",
"gnb_average = statistics.fmean((gnb_score, gnb2_score, gnb3_score))\n",
"knc_average = statistics.fmean((knc_score, knc2_score, knc3_score))\n",
"svc_average = statistics.fmean((svc_score, svc2_score, svc3_score))\n",
"dtc_average = statistics.fmean((dtc_score, dtc2_score, dtc3_score))\n",
"rfc_average = statistics.fmean((rfc_score, rfc2_score, rfc3_score))\n",
"lr_average = statistics.fmean((lr_score, lr2_score, lr3_score))\n",
"average_dictionary = {gnb_average : \"Gaussian Naive Bayes Algorithm\",\n",
" knc_average : \"K Nearest Neighbour Algorithm\",\n",
" svc_average : \"Support Vector Classification Algorithm\"}\n",
" svc_average : \"Support Vector Classification Algorithm\",\n",
" dtc_average : \"Decision Tree Classifier Algorithm\",\n",
" rfc_average : \"Random Forest Classifier Algorithm\",\n",
" lr_average : \"Linear Regression Algorithm\",}\n",
"# And get the average scores for each of the different train_test_split settings used\n",
"first_settings_average = statistics.fmean((gnb_score, knc_score, svc_score))\n",
"second_settings_average = statistics.fmean((gnb2_score, knc2_score, svc2_score))\n",
"third_settings_average = statistics.fmean((gnb3_score, knc3_score, svc3_score))\n",
"first_settings_average = statistics.fmean((gnb_score, knc_score, svc_score, dtc_score, rfc_score, lr_score))\n",
"second_settings_average = statistics.fmean((gnb2_score, knc2_score, svc2_score, dtc2_score, rfc2_score, lr2_score))\n",
"third_settings_average = statistics.fmean((gnb3_score, knc3_score, svc3_score, dtc3_score, rfc3_score, lr3_score))\n",
"average_settings_dictionary = {first_settings_average : \"25% Test Split\",\n",
" second_settings_average : \"50% Test Split\",\n",
" third_settings_average : \"75% Test Split\"}\n",
......@@ -555,8 +726,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-05-26T10:45:04.947827Z",
"start_time": "2023-05-26T10:45:04.681929Z"
"end_time": "2023-05-26T11:20:50.371133Z",
"start_time": "2023-05-26T11:20:50.134150Z"
}
}
},
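{
"cell_type": "markdown",
"source": [
"One fragility worth noting in the bookkeeping above: the dictionaries are keyed by accuracy score, so if two models ever tie exactly, as the first Random Forest and Logistic Regression models do at 0.98, one entry silently overwrites the other. A minimal sketch of a collision-safe layout keyed by model name instead (only two of the eighteen entries shown):"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Hypothetical collision-safe layout: model names as keys, scores as values\n",
"results = {\"Random Forest Classifier 1st Model\": rfc_score,\n",
"           \"Logistic Regression 1st Model\": lr_score}\n",
"# max over values still finds the best model, and ties no longer collide\n",
"best_name = max(results, key=results.get)\n",
"print(best_name, results[best_name])"
],
"metadata": {
"collapsed": false
}
},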
......