
LOW COST DATA TOOLS & RESOURCES

A PRESENTATION FOR MCCDEC

REUBEN TERNES, OAKLAND UNIVERSITY

PATRICK HERTA, LANSING COMMUNITY COLLEGE

AUGUST 2020

Presentation Goals

1. Our goals are simple
   a. Provide an overview of low-cost or free tools for those that use data
   b. Provide enough details so that you can figure out which tools are worthy of further exploration
2. Save you money
3. Save you time

Overview

1. Excel and Power BI
2. SPSS/PSPP
3. Python
4. R
5. Google Suite
6. Summaries

Section 1

Excel and PowerBI

Excel - Data Cleaning

● Sorting, filtering, and simple data changes are easy with the point-and-click interface
● Other useful functions
○ Pivot tables
○ Adding data to the “data model” allows for finding distinct counts
○ VLOOKUP formula for joining data
○ Often the easiest tool to format dates
○ Excel as an “in-between” from raw data to the final software you’d like to use

Excel - Data Analysis

● Excel comes with built-in analytic tools that most don’t know about.
● You need to enable them. They are not turned on by default.
● File -> Options -> Add-Ins -> Hit ‘Go’ next to the bottom button ‘Manage Add-Ins’ -> Check ‘Analysis ToolPak’
● You can run t-tests, ANOVAs, Chi-Square, and a few other things.
● There’s also a ton of third-party add-ons that undoubtedly have more options.
○ They vary in both cost and quality. Some are cheap. Some are expensive.
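For a sense of what the ToolPak’s t-test computes behind the dialog box, here is a minimal Python sketch of Welch’s two-sample t statistic. The sample GPA data is made up for illustration.

```python
import statistics

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # standard error of the difference between the two means
    standard_error = (var_a / len(sample_a) + var_b / len(sample_b)) ** 0.5
    return (mean_a - mean_b) / standard_error

# made-up GPA samples for two groups of students
group_a = [3.1, 3.4, 2.9, 3.8, 3.2]
group_b = [2.5, 2.8, 2.4, 2.9, 2.6]
t = welch_t(group_a, group_b)
```

The ToolPak also reports a p-value; a statistics library (or Excel itself) handles that lookup for you.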

Excel Dashboards

● Excel has the capacity to host interactive dashboards
● The most complex dashboards require knowledge of VBA and Excel macros
● But you can do a lot with pivot tables and slicers
○ Create the pivot tables and pivot charts you want to see
○ Use slicers to interactively filter the charts
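To make the pivot-plus-slicer idea concrete, here is what those two steps do, sketched in plain Python. The enrollment rows and field names are made up.

```python
from collections import defaultdict

# hypothetical rows of enrollment data
rows = [
    {"college": "Arts", "term": "Fall", "credits": 12},
    {"college": "Arts", "term": "Winter", "credits": 9},
    {"college": "Business", "term": "Fall", "credits": 15},
]

# "pivot table": sum credits by college
pivot = defaultdict(int)
for row in rows:
    pivot[row["college"]] += row["credits"]

# "slicer": restrict to Fall first, like clicking a slicer button,
# then build the same pivot on the filtered rows
fall_pivot = defaultdict(int)
for row in rows:
    if row["term"] == "Fall":
        fall_pivot[row["college"]] += row["credits"]
```

A slicer is just an interactive filter applied before the pivot recalculates.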

PowerBI

● PowerBI is Microsoft’s answer to Tableau.
● It extends Excel’s slicer functionality into a full-blown tool.
● (It is not an analysis tool - only good for data visualization.)
● It is not free, but it is fairly low cost. About $100 or so per user.
○ A user is someone that makes the visualizations.
○ You can ‘publish’ the visualizations to your own website for free.
○ Data lives on Microsoft servers though
○ OU aggregates their data beforehand
○ Live version from OU can be found here:
● Additional Resources and Guides Are Available

Section 2

SPSS/PSPP

SPSS

● SPSS is very efficient for ad hoc style data requests and cleaning.
● Excellent and fast GUI.
● Simple backend (good for students!)
● SPSS is not a low cost tool. It’s anywhere between $600 and $1,000 per person per year.
● But you don’t need the Cadillac version to clean data. In fact, you mostly just need the standard version (though the custom tables package is useful).

PSPP

● PSPP is the 100% free version of SPSS. ● Install it from here (Windows): HTTP://PSPP.AWARDSPACE.INFO/

● Fewer bells and whistles than SPSS. Still pretty good.
● As an added bonus, it can do quite a bit of statistical work (e.g., K-means clustering, linear/logistic regression) as well as some basic graphing (histograms, scatterplots, bar charts, etc.)
● It doesn’t appear to have a GUI interface for file merging (something I do A LOT), but there is a syntax command for it, so it can still be done fairly easily.
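For a sense of what one of those techniques is actually doing, here is a minimal one-dimensional K-means sketch in Python. PSPP runs this from its menus; the GPA values and starting centers below are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D K-means: alternate assignment and center-update steps."""
    for _ in range(iterations):
        # assign each value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

gpas = [1.9, 2.0, 2.1, 3.6, 3.7, 3.8]
final_centers = kmeans_1d(gpas, centers=[1.0, 4.0])
# the two centers converge near 2.0 and 3.7, splitting low and high GPAs
```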

● When you close a file, it tells you how many seconds it has been since you last saved the file!

PSPP

● PSPP can do quite a bit of analytical work.
○ Not as much as SPSS can do though.
○ And you do not get some of the integrated tools (for example, the ability to run R packages within SPSS).
○ But many of your basic tools are still there.
○ Regression, Logistic Regression, ANOVA, K-Means Clustering, Factor Analysis, Reliability, and a few more.

Section 3

Python

Python Programming

● Our next two solutions will be about Python (and then R)
● These are programming languages.
● They are free.
● Both require the user to choose a ‘development environment’, or a programming ‘shell’, to utilize.
● For Python I recommend Jupyter Notebooks or Google’s Colaboratory.
○ Both are free!
● Both Python and R have a high learning curve but a large payoff for learning.

Python - Data Cleaning

● Python is not the best tool to explore and clean ad hoc data sets.
● But it still has a large number of robust tools for data cleaning and exploration.
● It IS really good at automation.
● Python (and R) will shine when:
○ You can save and reuse code
○ Or when you need to automate the cleaning of data that is predictable

Python Example - Recoding Variables

● This example shows code on how to ‘recode’ a categorical variable (Gender) into a binary variable.
● Most Python packages do not deal well with categorical data, and they need to be converted to numerical.
● ‘df’ stands for dataframe, and is previously assigned (it’s your dataset).

# Recode Gender (this is a comment line, not actual code)
def Gender_recode(series):
    if series == 'F':
        return 1
    else:
        return 0

df['Gender_recode'] = df['GENDER'].apply(Gender_recode)
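The same recode can also be done in one vectorized pandas line, without a helper function. This sketch assumes `df` is a pandas DataFrame; the toy frame below stands in for your dataset.

```python
import pandas as pd

# toy stand-in for your dataset
df = pd.DataFrame({'GENDER': ['F', 'M', 'F']})

# same result as the Gender_recode function, in one line
df['Gender_recode'] = (df['GENDER'] == 'F').astype(int)
```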

● Python offers a large number of analytical and statistical packages
○ Almost any statistical technique is available
○ This includes most machine learning techniques
● Python has the added benefit of being the ‘base’ language of Google’s TensorFlow, a free and relatively simple to use machine learning system specifically designed to optimize machine learning tasks.
○ You can actually use several different languages to program in TensorFlow, including R.
○ Click here to access an IR friendly beginner’s guide to TensorFlow

Python - Data Analysis

● You can also do more advanced things in both Python and R.
● Below is example code for creating a Neural Net using Keras (a high level API in TensorFlow) that uses the Python language.
● Assuming you have your data already loaded, this is basically the entire code (for training).
● The dataset here is called ‘FakeData’.

import tensorflow as tf
from tensorflow.keras.layers import Dense

# split into input (X) and output (y) variables
x = FakeData.iloc[1:7000, 6:11]
y = FakeData.iloc[1:7000, 11]

# define the keras model
model = tf.keras.Sequential()
model.add(Dense(12, input_dim=5, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit the keras model on the dataset
model.fit(x, y, epochs=10, batch_size=10)

Section 4

R

R Overview

● Is coding/script based

● Steep learning curve, but can use the same code for multiple projects ○ Write code once, copy and paste a thousand times

● A few packages cover 90% of data cleaning work I ever need to do

● Good for jobs that are tedious/impossible in Excel

● All steps of your work are saved to a script and reproducible -- no more finding yourself looking at your data and asking ‘How did I get here?’

● RStudio is a free “development environment” that makes using R easier
○ Basically just a nice user-interface for R
○ A definite must-have

R/RStudio Interface

R - Data Cleaning

● Can do a lot with a few lines of code
● Creating a part-time/full-time flag based off of total credit hours
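The deck’s R code for this flag was a screenshot that did not survive extraction; here is the same idea sketched in Python. The records and the 12-credit full-time cutoff are assumptions for illustration.

```python
# toy student records; the 12-credit full-time cutoff is an assumption
students = [
    {"id": 1, "credits": 15},
    {"id": 2, "credits": 6},
    {"id": 3, "credits": 12},
]

# derive a PT/FT flag from total credit hours
for s in students:
    s["enroll_status"] = "FT" if s["credits"] >= 12 else "PT"
```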

● Merging two data files
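The merge step, sketched in Python with two made-up record sets keyed on a shared student id (the original slide showed the R code):

```python
# two small files' worth of made-up records keyed on student id
demographics = {1: {"age": 19}, 2: {"age": 32}}
enrollments = [{"id": 1, "major": "Biology"}, {"id": 2, "major": "Nursing"}]

# join each enrollment row to its matching demographic row
merged = [{**row, **demographics[row["id"]]} for row in enrollments]
```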

● Grouping student age and labeling the new groups

R - Data Cleaning - One more example

● You have: CIP code is duplicated for each associated major
● You want: One line per CIP code, with all associated majors rolled up into a single cell

R - Data Analysis

● Simple statistical techniques available without extra packages

○ Find the average GPA

○ Find the average GPA by semester

○ Find the average GPA by semester and declared major, for only female students, and save the results to a CSV
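The first of those grouped averages, sketched in Python (the records below are made up; the deck performs this in R):

```python
from collections import defaultdict
import statistics

# made-up student records
records = [
    {"semester": "Fall 2019", "gpa": 3.2},
    {"semester": "Fall 2019", "gpa": 2.8},
    {"semester": "Winter 2020", "gpa": 3.5},
]

# average GPA by semester
by_semester = defaultdict(list)
for r in records:
    by_semester[r["semester"]].append(r["gpa"])
avg_gpa = {sem: statistics.mean(gpas) for sem, gpas in by_semester.items()}
```

The finer groupings (by major, filtered to one gender) are the same pattern with a longer grouping key and a filter before the loop.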

● Lots of online advice and help for any statistical or machine learning technique

R - Data Visualizations

● Three main categories
○ Typical charts/graphs to be copy-pasted into reports and documents
○ Interactive dashboarding
○ Professional-quality reports and articles
● For the type of person that reads phone books and dictionaries for fun
○ Anything is possible, from simple bar charts to intricate, motion-enabled data visualizations hosted on a website, to plotting GIS data with Google Maps imagery
○ Just like Rome, the chart you want will not be built in a day
○ But, if a professional-quality, annotated, one-of-a-kind graph is desired, R can do it
○ Disclaimer: I do use R for some data visualizations

R - Charts to Copy/Paste

● Lots of code to create, and this data was already cleaned!

R Shiny/RMarkdown

● R Shiny is an R package that allows you to make interactive web apps and dashboards
● Requires strong R skills and some knowledge of HTML
● See examples here

● RMarkdown is a free RStudio add-on that allows you to typeset and format your reports entirely within R. Charts and tables generated from R code can be intermixed with text and the whole report generated with a push of a button.
● Can embed LaTeX code for mathematical formulas and symbols
● Steep learning curve -- recommended only for reports that need to be generated regularly or for professional-quality articles intended for publication

Section 5

Google Suite

Google Sheets

● Google’s alternative to Excel
○ Google Sheets can do many of the same things Excel can do
○ Unfortunately, Google does not provide its own analytic/statistics add-on.
○ Third parties do, but I have not explored any of them.

Google Studio

● Google Studio operates much like PowerBI
● Currently I find it harder to work with, but that might just be me.
● Expect a moderate to mild learning curve.
● Some potentially nice integration between things like Google Forms, Sheets, and Studio.

Section 6

Tool Summaries

Excel & PowerBI

● Been around forever (Excel)
● Is not going away
● Is ubiquitous in almost every organization
● Worth investing in as it will almost always transfer to other organizations
● PowerBI is trying to position itself to be the ‘Excel’ of visualizations (cheap, used by everyone, essential, etc.)

R

● If you do a lot of data cleaning, strongly consider R
● Script/code means you can re-trace your work step-by-step
● Can automate routine data cleaning tasks
● Can perform complicated data restructuring that would otherwise require knowledge of Excel macros
● Data cleaning and analysis done in the same tool/software
● An overabundance of help available
○ The code syntax is fairly simple, but choosing the solution you want among the hundred other, equally valid solutions can be overwhelming
○ Example: You want to open a spreadsheet. R’s response: “Do you want to open a CSV, or maybe an XLSX? Do you want to hard-code the file path in, or just the file name? Do you want to eschew all file names entirely and open a GUI window that lets you search through your PC to find the file? Do you want to change all blanks to NAs, or ‘X’s, or something else? Hey, where are you going….?”
● Ignore data visualization capability until you are more familiar

Python

● Many of the same pros/cons of R.
● Perhaps more of a general-purpose language than R (less specific to data analysis).
● Plays well with servers (‘seamless’ integration).
○ Can both pull data from and add data to a server

PSPP

● A fantastic tool for ad-hoc data cleaning projects
● A fair GUI with programming capabilities and easy-to-learn syntax
● Great for student employees!
● Easy to pick up quickly
● Can perform some automation tasks, but can be difficult to integrate into other systems.

Google Suite

● Very similar to the analogous Microsoft tools
● Less developed
● Instant version controlling!
● Seamless simultaneous collaboration!

Which Tools Should I Explore More?

● Choose one
○ Pick a tool that gives you new abilities
○ Or pick a tool that improves your efficiency
● Reuben spends a lot of time cleaning data, so if you can’t do that task quickly, definitely check out PSPP and learn to use the GUI (which is usually faster than coding).
● PSPP will have serious limitations. With R or Python, the sky is the limit.

Questions?