When Should I Use Python vs. R?

Python and R are both great programming languages for data science and analytics. Since they're open-source, they're free for everyone to download, unlike commercial tools like SAS and SPSS. Find out their strengths and weaknesses and figure out which is better for your specific use cases.

Purpose

Either language is suitable for almost any data science task, from data manipulation and automation to ad-hoc analysis and exploring datasets. Users may leverage both languages for different purposes, e.g., conducting early-stage data analysis and exploration in R, then switching to Python when it's time to ship some data products.

Choosing Python vs. R

It's up to the individual data scientist or data analyst to choose the language that best fits their unique needs. The following questions may help with that decision.

1. Which language do your colleagues use? The benefits of being able to share code with your colleagues and maintaining a simpler software stack outweigh any benefits of one language over another.
2. What problems do you want to solve and what tasks do you need to accomplish?
3. What are the net costs of learning a language? It will take time to learn a new system that is better aligned with the problem you want to solve, but staying with the system you know may not be a fit for that problem.
4. What are the commonly used tool(s) in your field?

Who It's Used By

Python: Python is used by programmers who want to delve into data analysis or apply statistical techniques, and by developers and programmers who turn to data science. Python is a production-ready language, meaning it has the capacity to be a single tool that integrates with every part of your workflow.

R: R has been used primarily in academia and research and is great for exploratory data analysis. In recent years, enterprise usage has rapidly expanded. It is used by statisticians, engineers, and scientists without computer programming skills. It's popular in academia, finance, pharmaceuticals, media, and marketing.

Usability

Python:
- People with a software engineering background may find Python comes more naturally to them than R.
- Coding and debugging are easy because of the simple syntax.
- The indentation of code affects its meaning.
- A given piece of functionality is generally written the same way in Python.

R:
- If you have no coding experience, then R may be easier to learn.
- Statistical models can be written with only a few lines.
- The same piece of functionality can be written in several ways with R.

Ecosystem

Python: Python has a robust ecosystem and is commonly considered one of the easier programming languages to read and learn. Its programming syntax is simple and its commands mimic the English language, e.g. print("Hello world!"). Python code is syntactically clear and elegant, easily interpretable, and easy to type. It's great for building data science pipelines and machine learning products integrated with web frameworks at scale. But watch out for dependencies and installing Python libraries! The Python Package Index (PyPI) and Anaconda are repositories of Python software and libraries.

R: R has a rich ecosystem of cutting-edge interface packages available to communicate between open-source languages. This allows users to string their workflows together, which is especially useful for data analysis. Packages are collections of R functions, data, and compiled code. They can be installed in R with one line. Packages are available at:
- Comprehensive R Archive Network (CRAN): the "Task Views" page lists a wide range of tasks for which R packages are available, and users can easily contribute packages.
- Bioconductor: open source software for bioinformatics.
- GitHub: web-based Git repository hosting service.
Search through these sources easily with Rdocumentation.
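As a small illustration of the Usability point that indentation affects the meaning of Python code, the sketch below defines two functions that contain the same statements and differ only in how far one return line is indented. The function names and sample data are hypothetical and chosen only for this example.

```python
def count_positive(values):
    """Count the positive numbers in values."""
    count = 0
    for v in values:
        if v > 0:
            count += 1
    return count  # outside the loop: runs once, after every value is seen


def count_positive_misindented(values):
    """Same statements, but one return is indented into the loop body."""
    count = 0
    for v in values:
        if v > 0:
            count += 1
        return count  # inside the loop: runs at the end of the first iteration
    return count  # only reached when values is empty


data = [3, -1, 4, -1, 5]  # hypothetical sample data

print("Hello world!")
print(count_positive(data))              # counts 3, 4 and 5
print(count_positive_misindented(data))  # stops after looking at 3 only
```

Both versions are valid Python, so the interpreter reports no error for the second one. This is one reason the testing caveat raised elsewhere in this comparison matters in practice.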
Users can also contribute to the Python repositories (PyPI and Anaconda), but it's a bit complicated in practice to do so.

Flexibility

Python: Python is flexible for creating something that has never been done before. Developers can also use it for scripting websites or other applications.

R: It's easy to use complex functions in R. All kinds of statistical tests and models are readily available and easily used.

Ease of Learning

Python: Python's focus on readability and simplicity means its learning curve is relatively linear and smooth. Python is considered a good language for beginner programmers.

R: R is easier to learn when you start out, but the intricacies of its advanced functionality make it more difficult to develop expertise. R is not hard for experienced programmers to learn.

Advantages

Python:
- As a general-purpose programming language, it is useful beyond just data analysis.
- Has gained popularity for its code readability, speed, and many functionalities.
- Great for mathematical computation and learning how algorithms work.
- Has high ease of deployment and reproducibility.

R:
- Widely considered the best tool for making beautiful graphs and visualizations.
- Has many functionalities for data analysis.
- Great for statistical analysis.
- Built around a command line, but the majority of R users work inside RStudio, an environment that includes a data editor, debugging support, and a window to hold graphics.

Disadvantages

Python:
- Python doesn't have as many libraries for data science as R.
- Python requires rigorous testing, as errors show up at runtime.
- Visualizations are more convoluted in Python than in R, and results are not as eye-pleasing or informative.
- Python packages for data visualization:
  seaborn: library based on Matplotlib
  Bokeh: interactive visualization library
  Pygal: creates dynamic SVG charts

R:
- For people with no software engineering experience, base R can be more difficult to learn because it was developed by statisticians, not to make coding easier. But R has a set of packages known as the Tidyverse, which provides powerful yet easy-to-learn tools for importing, manipulating, visualizing, and reporting on data.
- Finding the right packages to use in R may be time consuming.
- There are many dependencies between R libraries.
- R can be considered slow if code is written poorly.
- Not as popular as Python for deep learning and NLP.

Use Case: Data Analysis

Usage

Python: Python is generally used when the data analysis tasks need to be integrated with web apps or if statistics code needs to be incorporated into a production database. Since it's a full-fledged programming language, Python is a good tool to implement algorithms for use in production.

R: R is mainly used when the data analysis tasks require standalone computing or analysis on individual servers. For exploratory work, R is easier for beginners. Statistical models can be written with a few lines of code.

Data Handling Capabilities

Python: Python requires users to install packages for data analysis, and these packages have greatly improved in recent years. NumPy and pandas, among others, are popular for data analysis.

R: R is great for data analysis because of its huge number of packages, readily usable tests, and the advantage of using formulas. It can handle basic data analysis without needing to install packages. Big datasets require the use of packages such as data.table and dplyr.

Getting Started

IDE

Python: There are many Python IDEs to choose from, which drastically reduce the overhead of organizing code, output, and notes files. Jupyter Notebooks and Spyder are popular, and JupyterLab is gaining traction.

R: RStudio is the most popular R IDE. It's available in two formats: RStudio Desktop for running locally as a regular desktop application, and RStudio Server for access via web browser while running on a remote Linux server.
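To make the Data Handling Capabilities comparison a little more concrete, here is a minimal sketch of a very basic analysis (summary statistics plus a one-predictor least-squares fit) written with only Python's standard library. The data, the variable names, and the `summarize` helper are hypothetical and exist only for this illustration. For anything beyond this scale, packages such as NumPy and pandas, named above, are the usual choice.

```python
from statistics import mean, median, stdev


def summarize(xs, ys):
    """Return summary statistics for ys and a least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    # Ordinary least squares for a single predictor:
    # slope = sum((x - mx) * (y - my)) / sum((x - mx) ** 2)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return {
        "mean_y": my,
        "median_y": median(ys),
        "stdev_y": stdev(ys),  # sample standard deviation
        "slope": slope,
        "intercept": intercept,
    }


# Hypothetical data: hours of practice vs. test score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 73]

result = summarize(hours, score)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The helper spells out by hand what a dedicated statistics package would normally provide in a single call, which is the trade-off the section describes.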
Tip for Python users: Also try Rodeo, the "data science IDE for Python."

Popular Libraries and Packages

Python:
- pandas to easily manipulate data
- SciPy and NumPy for scientific computing
- scikit-learn for machine learning
- Matplotlib and seaborn to make graphics
- statsmodels to explore data, estimate statistical models, and perform statistical tests and unit tests

R:
- dplyr, tidyr and data.table to easily manipulate data
- stringr to manipulate strings
- zoo to work with regular and irregular time series
- ggplot2 to visualize data
- caret for machine learning

"R is currently head-and-shoulders above Python for data analysis, but I remain convinced that Python can catch up, easily and quickly."
JAN GALKOWSKI, COMPUTATIONAL ENGINEER

Support and Communities

Python:
- DataCamp Slack Community
- Stack Overflow
- Reddit Python
- PyLadies
- pydata
- pystatsmodels
- numpy-discussion and sci-py-user

R:
- DataCamp Slack Community
- Stack Overflow
- Reddit rstats
- Rdocumentation
- R-help
- ROpenSci
- Jumping Rivers list of local R User Groups

Trends and Highlights

Popularity Rankings, Python vs. R

Year    Python    R
2016    #3        #5
2017    #1        #6
2018    #1        #7
2019    #1        #5

Source: IEEE Spectrum, Sep 2019

Popularity on Stack Overflow, Python vs. R
[Chart. Series: Python, R. Y-axis: % of Stack Overflow questions that month (0% to 9%). X-axis: year, 2009 to 2017.]
Source: DZone

Python, R or both or other platforms for data science
[Chart. Series: Python, R, Both, Other. Groups: share in 2016, share in 2017.]
Source: DZone

SAS, R or Python preference by years of experience
[Chart. Series: SAS, R, Python. Groups: 0-5 yrs, 6-15 yrs, 16+ yrs.]
Source: Burtch Works

SAS, R or Python preference by industry
[Chart. Series: SAS, R, Python. Groups: Tech/Telecom, Consulting, Other Corporations, Adv/Marketing Services, Financial Services, Retail/CPG, Healthcare/Pharma.]
Source: Burtch Works

User loyalty, Python vs.