Understanding 21St Century Bordeaux Wines from Wine Reviews Through Natural Language Processing and Classifications
Total Page:16
File Type:pdf, Size:1020Kb
UNDERSTANDING 21ST CENTURY BORDEAUX WINES FROM WINE REVIEWS THROUGH NATURAL LANGUAGE PROCESSING AND CLASSIFICATIONS By Zeqing Dong A thesis presented to the Department of Computer Science and the Graduate School of the University of Central Arkansas in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Conway, Arkansas May 2020 ProQuest Number:27834599 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent on the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. ProQuest 27834599 Published by ProQuest LLC (2020). Copyright of the Dissertation is held by the Author. All Rights Reserved. This work is protected against unauthorized copying under Title 17, United States Code Microform Edition © ProQuest LLC. ProQuest LLC 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, MI 48106 - 1346 TO THE OFFICE OF GRADUATE STUDIES: The members of the Committee approve the thesis of ______________________________________ presented on ___________________________________________________________. _______________________________________ Committee Chairperson ______________________________________ Committee Member ______________________________________ Committee Member ______________________________________ Committee Member ______________________________________ Committee Member PERMISSION Title Understanding 21st Century Bordeaux Wines from Wine Reviews through Natural Language Processing and Classification Department Computer Science Degree Master of Science In presenting this thesis/dissertation in partial fulfillment of the requirements for a graduate degree from the University of Central Arkansas, I agree that the Library of this University shall make it freely available for inspections. I further agree that permission for extensive copying for scholarly purposes may be granted by the professor who supervised my thesis/dissertation work, or, in the professor’s absence, by the Chair of the Department or the Dean of the Graduate School. It is understood that due recognition shall be given to me and to the University of Central Arkansas in any scholarly use which may be made of any material in my thesis/dissertation. Zeqing Dong April 03, 2020 ABSTRACT Wine has been popular with the public for centuries; in the market, there are a variety of wines to choose from. Among all, Bordeaux, France, is considered as the most famous wine region in the world. In this research, we try to understand Bordeaux wines made in the 21st century through Wineinformatics study. We developed and studied two datasets: the first dataset is all the Bordeaux wine in 21st century (from 2000 to 2016); and the second one is all wines listed in a famous collection of Bordeaux wines, 1855 Bordeaux Wine Official Classification, in 21st century (from 2000 to 2016). A total of 14,349 wine reviews are collected in the first dataset, and 1359 wine reviews in the second dataset. In order to understand the relation between wine quality and characteristics, Naïve Bayes classifier, a white box classification classifier, is applied to predict the qualities (90+/89−) of wines; Support Vector Machine (SVM), a black box classification classifier, is also applied as a comparison. In the first dataset, SVM classifier achieves the best accuracy of 86.97%; in the second dataset, Naïve Bayes classifier achieves the best accuracy of 84.62%. Precision, recall, and f-score are also used as our measures to describe the performance of our models. Meaningful characteristics associate with high quality 21st century Bordeaux wines are able to be presented through this research. Also, a novel voting system based on advanced NLP is designed in the research to further improve the model performance. iv TABLE OF CONTENTS ABSTRACT ...................................................................................................................... iv LIST OF TABLES ......................................................................................................... viii LIST OF FIGURES .......................................................................................................... x CHAPTER 1 INTRODUCTION ..................................................................................... 1 1.1 Data Science................................................................................................................ 1 1.2 Wineinformatics .......................................................................................................... 3 1.3 Bordeaux Wine ........................................................................................................... 4 1.4 Goal of the Research ................................................................................................... 6 CHAPTER 2 WINE DATA.............................................................................................. 8 2.1 Data Source: Wine Spectator ...................................................................................... 8 2.2 Dataset Creation and Cleaning.................................................................................. 10 2.3 The Computational Wine Wheel .............................................................................. 14 2.4 Datasets ..................................................................................................................... 16 2.4.1 All Bordeaux Wine Dataset ........................................................................... 17 2.4.2 Bordeaux Wine Official Classification in 1855 ............................................. 18 CHAPTER 3 PREDICTION METHODOLOGY AND EVALUATIONS ............... 20 3.1 Classification Algorithms ......................................................................................... 20 3.1.1 Naïve Bayes Classifier ................................................................................... 21 3.1.1.1 Gaussian Naive Bayes classifier ....................................................... 22 v 3.1.1.2 Bernoulli Naive Bayes Classifier ..................................................... 22 3.1.2 Support Vector Machines .............................................................................. 23 3.2 Cross-validation ........................................................................................................ 25 3.3 Evaluation Metrics .................................................................................................... 26 CHAPTER 4 UNDERSTANDING 21ST CENTURY BORDEAUX WINES ............ 28 4.1 ALL Bordeaux Wine ................................................................................................. 28 4.2 Bordeaux Wine Official Classification in 1855 ........................................................ 29 4.3 Visualization of Bordeaux Wine Official Classification in 1855 ............................. 30 4.4 Top 20 keywords....................................................................................................... 32 CHAPTER 5 NATURAL LANGUAGE PROCESSING ............................................ 36 5.1 Data Preprocessing.................................................................................................... 36 5.2 Voting System .......................................................................................................... 40 5.3 Classifications and Results ...................................................................................... 42 5.3.1 ALL Bordeaux Wine..................................................................................... 43 5.3.2 Bordeaux Wine Official Classification in 1855 ............................................ 45 CHAPTER 6 CONCLUSION AND FUTURE WORK ............................................... 48 6.1 Conclusion ................................................................................................................ 48 6.1 Future Work .............................................................................................................. 48 REFERENCES ................................................................................................................ 51 APPENDIX A. THE 1855 CLASSIFICATION, REVISED IN 1973 ......................... 57 vi APPENDIX A.1. RED WINES ............................................................................ 57 APPENDIX A.2. WHITE WINES ...................................................................... 60 APPENDIX B. THE LIST OF WINE AND VINTAGES WE CANNOT FIND ....... 62 vii LIST OF TABLES Table 1. Wine Spectator's 100-Point Scale. …………………………………………….. 10 Table 2. Simplified computational wine wheel. ………………………………………... 15 Table 3. Accuracy, precision, recall, and f-score in different classifiers. ………………. 29 Table 4. Accuracy, precision, recall, and f-score in different classifiers. ………………. 30 Table 5. Common keywords between 90+ and 89− wines from ALL Bordeaux Wine dataset. …………………………………………………………………………………...33 Table 6. Distinct keywords between 90+ and 89− wines from ALL Bordeaux Wine dataset in 90+ wines. ……………………………………………………………………. 33 Table 7. Distinct keywords between 90+ and 89− wines from ALL Bordeaux Wine dataset in 89− wines. ……………………………………………………………………. 33 Table 8. Common keywords between 90+ and 89− wines from 1855 Bordeaux Wine Official classification dataset. …………………………………………………………... 34 Table 9. Distinct keywords between 90+ and 89− wines from