MSIS 2534/2634: NATURAL LANGUAGE PROCESSING (NLP)
Syllabus, Fall 2020
Professor Sanjiv Das (http://srdas.github.io)
2-unit course (10 classes, 10 weeks, 16 hours, ~1 hr 35 mins/class)

OVERVIEW

Natural Language Processing (NLP) has recently found several applications in business. There is now a foundation of content that students who wish to work in this field need to know. This course aims to give students a conceptual understanding of the field and its business applications, and a technical toolkit to implement NLP models. The course will include using AWS SageMaker, a cloud platform for machine learning. Free credits are made available on AWS Educate.

LEARNING OBJECTIVES

1. Explain the business areas in which NLP may be applied.
2. Describe the important concepts and mathematical models for NLP.
3. Use programming languages and toolkits to implement NLP models for business applications.
4. Learn how to build and deploy NLP models on cloud infrastructure.

PREREQUISITES

Students will need an understanding of Linear Algebra (familiarity with matrix multiplication, inverses, determinants, eigensystems, etc.) plus some calculus-based probability, and multivariate calculus (e.g., FNCE 2502, OMIS 3200). Students should also be able to program in Python and R (see related MSBA/MSIS courses). These requirements may be waived with the permission of the instructor if the student has taken this coursework in another program.

BOOKS/VIDEOS

● Dive into Deep Learning (http://d2l.ai/), NLP chapters:
  ○ https://d2l.ai/chapter_natural-language-processing-pretraining/index.html
  ○ https://d2l.ai/chapter_natural-language-processing-applications/index.html
● NLTK: http://www.nltk.org/book/
● My book on Text Analytics for Finance: http://srdas.github.io/Papers/Das_TextAnalyticsInFinance.pdf
● ML Class Notes: https://srdas.github.io/MLBook2/
● AWS Accelerated NLP course on YouTube

EVALUATION

Group HW and research paper presentation (40%); mid-term (20%); project (40%). HW and the project are done in groups; the mid-term is individual. Paper presentations: in every class, one group will present a summary of a seminal research paper in the last 10 minutes of class (including Q&A discussion).

COURSE VIDEO https://drive.google.com/file/d/1tAjxQ1IFGqhcGL4Rs89mPAJ3MzHxbDKh/view

DATE (MM/DD), TOPIC
DATA for all classes; IMAGES for all classes [Keep these links for all classes]

Class 1 (week of 09/21): Introduction to Text Analytics and basic string handling; regular expressions; NLP_Tools; Regex, Ch7&8 in ATBSWP; 01-IntroTextAnalytics.ipynb; Introduction to AWS SageMaker; Instructions; Get Started Slides; SageMaker EDU info slides; SageMaker workshop; SageMaker examples; more broadly, see AWS Tutorials; AWS usage information. HW: Answer all questions (including practice projects) at the end of Ch 7 (in a notebook) in ATBSWP.

Class 2 (week of 09/28): Basic Linear Algebra for text analytics; Using NLTK: Ch0, Ch1, Corpora; entity extraction and identification with AWS Comprehend. NLTK Ch2; 02-EntityExtraction_moreTextHandling.ipynb; 021-LinearAlgebra_Gradients_Optimization.ipynb; 022-LinearAlgebra_Eigensystems_Decompositions.ipynb; HW: create a Jupyter notebook with answers to Ch01, sec 8. How to Create an AWS IAM User; Using SageMaker with IAM for Comprehend, 02_Comprehend.ipynb

Class 3 (week of 10/05): Reading in URLs and Beautiful Soup, Dictionaries, Lexicons, Negation Tagging, Scoring using a dictionary. Ch3; Using Twitter, Sentiment Analysis with NLTK Vader, using Selector Gadget, using "rvest", AWS Textract; 03-WebScraping_Dictionaries_Sentiment.ipynb; 03_R_Code.ipynb

Class 4 (week of 10/12): Text transformations: punctuation, numbers, stemming, stopwords, making a corpus, term-document matrix, Classification with (i) GLMNet (text2vec) and (ii) FastText from Facebook (GluonNLP); using SpaCy; 04-TextTransformations_Classification.ipynb; 04_R_Code.ipynb; Textract.ipynb; HW on Comprehend and Textract. Review ML and AUC.

Class 5 (week of 10/19): Term-Document Matrix (TDM) applied: TF-IDF, WordClouds, Cosine distance, Readability, Text Summarization, using the Reuters news corpus (Ch1.4). 05-TDM_Summarization.ipynb; Discussion paper: “Textual Analysis in Finance” (2020, Loughran & McDonald); HW on SEC Filings. 05_R_Code.ipynb

Class 6 (week of 10/26): Topic modeling with Latent Dirichlet Allocation (LDA), text2vec for LDA, dimension reduction with LDA; Text_Cleaning_Functions.ipynb; 06-TopicModeling_Classification.ipynb; 06_R_Code.ipynb; The Structure of Economic News; Discussion paper: FOMC Transcripts Analysis; Project proposal presentations

Class 7a: MID-TERM EXAM (on-screen, virtually in class); separate from class. Date: 11/02, 4:00-5:15 pm (in office hours).

Class 7b (week of 11/02): Word Embeddings (bow, skip-gram, GloVe), t-SNE, Doc2Vec, document clustering for business applications; 07-Embeddings_Clustering.ipynb; Discussion paper: t-SNE, van der Maaten & Hinton (2008); semantic similarity for text-based portfolio construction.

Class 8 (week of 11/09): Deep learning introduction (for using word embeddings with NLP). 11-DeepLearning_Introduction.ipynb; Deep Learning course notes.

Class 9 (week of 11/16): Transformers and BERT; Text classification with deep learning; 08-LanguageModels_BERT.ipynb; Slides; Transformers paper; BERT paper; RoBERTa paper; Hugging Face; Gluon BERT pre-training; SAS Note on BERT

Class 10 (week of 11/30): Explainability of NLP classifiers. Shapley values. SHAP. Graph Theory; a brief introduction to Knowledge graphs, text generation, text completion, translation.

Exam Week (12/7): FINAL PROJECT PRESENTATIONS: Monday, December 7 (we can start at 4 pm and keep going till all student groups from M and W are completed). Each presentation is 15 minutes, including 3 minutes of Q&A.

Course Resources

Resources that are useful for the finance side of NLP

● Text and Context: Language Analytics in Finance (2014)
● Loughran & McDonald word lists
● Text as Data (Gentzkow, Kelly, Taddy, 2019)
● Loughran & McDonald: Textual Analysis in Finance (2020)

Some more focused papers

● Do Actions Speak Louder than Words? Evidence from Microblogs
● 10 Ways News Sentiment is Providing Value to Investors
● Media Sentiment and International Asset Prices
● Text Sentiment from Earnings Calls

Papers on 10-K, 10-Q

● https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1331573
● https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2331544

Text Corpora

Sources of (publicly available and free) labeled text corpora:

1. Reuters newswire in 1987, indexed by category, aka Reuters-21578, contains 21,578 news articles, though only about 12 thousand are manually indexed across 135 categories; best for training classification algorithms.
2. Reuters Corpus Volume 1. Tokenized version of the Reuters RCV1 corpus by David D. Lewis et al. All Reuters Corpora. RCV1 paper.
3. Scikit-learn version of Reuters dataset. Vectors are cosine-normalized, log TF-IDF vectors. Various Reuters data from MIT.
4. The 20 Newsgroups dataset contains close to 20 thousand documents categorized across 20 groups; best for training on classification and clustering.
5. MPQA Opinion Corpus contains under one thousand news articles and other documents that are annotated manually for opinions, beliefs, emotions, and speculations.
6. This corpus contains about 16 thousand annotated Wikipedia tables to study fact verification.
7. Stanford labeled Rotten Tomatoes dataset for sentiment analysis; includes paper and code.
8. Stanford 25 thousand labeled and 25 thousand test datasets with IMDB movie reviews for sentiment analysis.
9. The training data for Sentiment140 is a collection of just under 200 thousand labeled tweets for sentiment analysis.
10. An aggregated corpus of more than 10 different sources, including tweets, news articles, blogs, and dialogues, mapped to a unified tagging schema for emotion classification, resulting in more than 20 thousand statements for 6 different emotions.
11. SMS Spam Collection contains just over 5 thousand English mobile text messages labelled according to whether they are spam or not.
12. Dataturks: a set of 405 mostly Spanish reviews for academic papers submitted to an international computing conference, with the reviewers’ scores, and another set of scores labeled by readers of the reviews.
13. Huffington Post articles. ~200k news headlines from the year 2012 to 2018 obtained from HuffPost. 31 topics (labels). Data contains category and headline. Paper.
14. Other datasets (not just NLP) on AWS OpenData.
15. Financial Phrasebank: news articles with sentiment tags (negative, neutral, positive).
16. BBC Datasets. Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Class labels: 5 (business, entertainment, politics, sport, tech).
17. Kaggle text datasets.
18. TREC Washington Post Corpus. 671,947 news articles and blog posts (unlabeled) from January 2012 through December 2019.
19. CNN Daily Mail dataset.
20. Airlines tweets data.
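
Item 3 in the list above describes the Reuters vectors as cosine-normalized, log TF-IDF. As a rough illustration of what that phrase means, here is a minimal sketch in plain Python. The weighting formula (1 + ln tf) * ln(N / df), the function names, and the toy documents are assumptions made for illustration; they are not necessarily the exact recipe used to build that dataset.

```python
import math
from collections import Counter


def log_tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, scaled to unit length."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: (1 + ln tf) * ln(N / df).
        weights = {
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Cosine normalization: divide by the L2 norm (all-zero vectors are left as-is).
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine_similarity(u, v):
    """Dot product of two sparse vectors; equals the cosine when both have unit length."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


# Hypothetical toy corpus, already tokenized.
docs = [
    ["oil", "prices", "rise"],
    ["oil", "output", "falls"],
    ["bank", "rates", "rise"],
]
vecs = log_tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # documents sharing "oil"
print(cosine_similarity(vecs[1], vecs[2]))  # documents with no shared terms
```

Because each vector is scaled to unit length, the cosine similarity between two documents reduces to a plain dot product, which is one reason corpora are often distributed in this form.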

Using SageMaker Studio

Create an IAM User using these step-by-step slides.

Please see this video: https://www.youtube.com/watch?v=7QSsysGX14w
Deep dive: https://pages.awscloud.com/Amazon-SageMaker-Studio-Deep-Dive_2020_0226-MCL_OD.html

Did you know that if you use SageMaker Studio instead of SageMaker Notebooks, you can share notebooks in the same way as Colab? To get into Studio, just click on the Studio link in the SageMaker console.

SageMaker Studio is much more versatile than Notebooks and has a nicer UX. Please use it and give me feedback on how it works. To share notebooks, see this video: https://www.youtube.com/watch?v=9LbZHcGN38U

You can now use the following machines in increasing order of speed and memory: ml.t3.medium, ml.t3.large, ml.c5.large, and ml.m5.large.

Here are two hands-on tutorials:
[1] Build a fraud detection model
[2] Studio onboarding tour: build a churn prediction model
Both come with sample notebooks, screenshots, and step-by-step guides, from initial set-up to final model deployment. Here are a few video tutorials too: SageMaker Experiments | SageMaker Autopilot | SageMaker Studio

AWS Educate Support: https://aws.amazon.com/education/awseducate/contact-us/

The Fine Print!

Academic Integrity

The Academic Integrity pledge is an expression of the University’s commitment to fostering an understanding of -- and commitment to -- a culture of integrity at Santa Clara University. The Academic Integrity pledge, which applies to all students, states:

“I am committed to being a person of integrity. I pledge, as a member of the Santa Clara University community, to abide by and uphold the standards of academic integrity contained in the Student Conduct Code.”

Students are expected to uphold the principles of this pledge for all work in this class. For more information about Santa Clara University's academic integrity pledge and resources about ensuring academic integrity in your work, see www.scu.edu/academic-integrity.​

A student who is guilty of a dishonest act in an examination, homework, or other work required for a course, or who assists others in such an act, may, at the discretion of the instructor, receive a grade of “F” for the course. We have had to give students F's for the course because of this in the past. It's a painful process for all those involved. Please note that collaboration within a homework group is expected and encouraged, but collaboration across homework groups is not permitted.

Disabilities Resources

If you have a disability for which accommodations may be required in this class, please contact Disabilities Resources, Benson 216, http://www.scu.edu/disabilities as soon as possible to discuss your needs and register for accommodations with the University. If you have already arranged accommodations through Disabilities Resources, please discuss them with us during office hours. Students who have medical needs related to pregnancy or parenting may be eligible for accommodations.

While we are happy to assist you, we are unable to provide accommodations until we have received verification from Disabilities Resources. The Disabilities Resources office will work with students and faculty to arrange proctored exams for students whose accommodations include double time for exams and/or assistive technology. (Students with approved accommodations of time-and-a-half should talk with us as soon as possible.) Disabilities Resources must be contacted in advance to schedule proctored examinations or to arrange other accommodations. The Disabilities Resources office would be grateful for advance notice of at least two weeks. For more information, you may contact Disabilities Resources at 408-554-4109.

Accommodations for Pregnancy and Parenting

In alignment with Title IX of the Education Amendments of 1972, and with the California Education Code, Section 66281.7, Santa Clara University provides reasonable accommodations to students who are pregnant, have recently experienced childbirth, and/or have medically related needs. Pregnant and parenting students can often arrange accommodations by working directly with their instructors, supervisors, or departments. Alternatively, a pregnant or parenting student experiencing related medical conditions may request accommodations through Disabilities Resources.