
Most Useful R Functions You Might Not Know

Almost every R user knows popular packages like dplyr and ggplot2. But with 10,000+ packages on CRAN and yet more on GitHub, it’s not always easy to unearth libraries with great R functions. Here are the ten most useful R functions you might not know that make my life easier when working in R. If you already know them all, sorry for wasting your reading time, and please consider adding a comment with something else you find useful, for the benefit of other readers.


1. RStudio shortcut keys

This is less an R hack and more about the RStudio IDE, but the shortcut keys available for common commands are super useful and can save a lot of typing time. My two favourites are Ctrl+Shift+M for the pipe operator %>% and Alt+- for the assignment operator <-. If you want to see the full set of these awesome shortcuts, just type Alt+Shift+K in RStudio.

2. Automate tidyverse styling with styler

It’s been a tough day and you’ve had a lot on your plate. Your code isn’t as neat as you’d like and you don’t have time to line edit it. Fear not. The styler package has numerous functions that allow automatic restyling of your code to match tidyverse style. It’s as simple as running styler::style_file() on your messy script, and it will do a lot (though not all) of the work for you.
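For example, here is a minimal sketch (style_text() restyles code supplied as strings; the commented style_file() call assumes a script with that name exists):

# restyle code supplied as strings to tidyverse style
styler::style_text(c(
  "my_var<-c( 1,2,3 )",
  "if(TRUE){print(my_var)}"
))

# or restyle a whole script in place
# styler::style_file("my_messy_script.R")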

3. The Switch function

I LOVE switch(). It’s basically a convenient shortening of an if statement that chooses its value according to the value of another variable. I find it particularly useful when writing code that needs to load a different dataset according to a prior choice. For example, if you have a variable called animal and you want to load a different set of data according to whether animal is a dog, cat or rabbit, you might write this:

# choose which file to read based on the value of `animal`
data <- read.csv(
  switch(animal,
         "dog" = "dogdata.csv",
         "cat" = "catdata.csv",
         "rabbit" = "rabbitdata.csv")
)

4. k-means on long data

k-means is an increasingly popular statistical method for clustering observations in data, often to simplify a large number of data points into a smaller number of clusters or archetypes. The kml package now allows k-means clustering to take place on longitudinal data, where the ‘data points’ are actually data series. This is super useful when the data points you are studying are readings over time, such as the clinical observation of weight gain or loss in hospital patients, or the compensation trajectories of employees.

kml works by first transforming data into an object of the class ClusterLongData using the cld function. Then it partitions the data using a ‘hill climbing’ algorithm, testing several values of k 20 times each. Finally, the choice() function allows you to view the results of the algorithm for each k graphically and decide what you believe to be an optimal clustering.
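A minimal sketch of that workflow on simulated weight trajectories (the data are made up, and the defaults follow the kml documentation, so check them against your installed version):

library(kml)

# 20 patients, each weighed at 5 time points
traj <- matrix(rnorm(20 * 5, mean = 70, sd = 5), nrow = 20)

cld_obj <- cld(traj)   # build a ClusterLongData object
kml(cld_obj)           # partition for several values of k, 20 redraws each by default
choice(cld_obj)        # view the results for each k graphically and pick one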

5. Text searching

If you’ve been using regular expressions to search for text that starts or ends with a certain character string, there’s an easier way. “startsWith() and endsWith() — did I really not know these?” tweeted data scientist Jonathan Carroll. “That’s it, I’m sitting down and reading through dox for every #rstats function.”
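Both are vectorised base R functions, so a quick sketch looks like this:

# prefix and suffix checks without regular expressions
startsWith(c("dogdata.csv", "catdata.csv"), "dog")   # TRUE FALSE
endsWith("report_2019.pdf", ".pdf")                  # TRUE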

6. The req and validate functions in R Shiny

R Shiny development can be frustrating, especially when you get generic error messages that don’t help you understand what is going wrong under the hood. As Shiny develops, more and more validation and testing functions are being added to help better diagnose and alert when specific errors occur. The req() function allows you to prevent an action from occurring unless a required input is available, and it does so silently, without displaying an error. This means you can make the display of UI elements conditional on previous actions. For example:


output$go_button <- shiny::renderUI({
  # only display button if an animal input has been chosen
  
  shiny::req(input$animal)
  # display button
  shiny::actionButton("go", 
                      paste("Conduct", input$animal, "analysis!") 
  )
})

validate() checks before rendering output and enables you to return a tailored error message should a certain condition not be fulfilled, for example, if the user uploaded the wrong file:

# get the csv input file and read it in
inFile <- input$file1
data <- read.csv(inFile$datapath)

# render table only if it is dogs
shiny::renderTable({
  # check that it is the dog file, not cats or rabbits
  shiny::validate(
    shiny::need("Dog Name" %in% colnames(data),
                "Dog Name column not found - did you load the right file?")
  )
  data
})

7. revealjs

revealjs is a package which allows you to create beautiful presentations in HTML, with an intuitive slide navigation menu and embedded R code. It can be used inside R Markdown and has very intuitive HTML shortcuts to allow you to create a nested, logical structure of pretty slides with a variety of styling options. The fact that the presentation is in HTML means that people can follow along on their tablets or phones as they listen to you speak, which is really handy. You can set up a revealjs presentation by installing the package and then calling it in your YAML header. Here’s an example YAML header from a talk I gave recently using revealjs:

---
title: "Exporing the Edge of the People Analytics Universe"
author: "Keith McNulty"
output:
  revealjs::revealjs_presentation:
    center: yes
    template: starwars.html
    theme: black
date: "HR Analytics Meetup London - 18 March, 2019"
resource_files:
- darth.png
- deathstar.png
- hanchewy.png
- millenium.png
- r2d2-threepio.png
- starwars.html
- starwars.png
- stormtrooper.png
---

8. Datatables in RMarkdown or Shiny using DT

The DT package is an interface from R to the DataTables JavaScript library. It allows a very easy display of tables within a Shiny app or R Markdown document, with a lot of built-in functionality and responsiveness. This saves you from having to code separate data download functions, gives the user flexibility around the presentation and ordering of the data, and has a built-in data search capability. For example, a simple command such as:

DT::datatable(
  head(iris),
  caption = 'Table 1: This is a simple caption for the table.'
)

9. Pimp your RMarkdown with prettydoc

prettydoc is a package by Yixuan Qiu which offers a simple set of themes to create a different, prettier look and feel for your RMarkdown documents. This is super helpful when you just want to jazz up your documents a little but don’t have time to get into the styling of them yourself. It’s really easy to use. Simple edits to the YAML header of your document can invoke a specific style theme throughout the document, with numerous themes available. For example, this will invoke a lovely clean blue colouring and style across titles, tables, embedded code and graphics:

---
title: "My doc"
author: "Me"
date: June 3, 2019
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: github
---

10. Get minimum and maximum values with a single command. 

Speaking of useful R functions you might not know, how could I leave out finding the minimum and maximum values of a vector in a single call? Base R’s range() function does just that, returning a 2-value vector with the lowest and highest values. The help file says range() works on numeric and character values, but I’ve also had success using it with date objects.
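For example:

x <- c(3, 1, 4, 1, 5, 9)
range(x)   # returns 1 9

# it also works on dates
range(as.Date(c("2019-06-03", "2018-01-15")))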

Solving a System of Equations in R With Examples

In this article, we will discuss solving a system of equations in the R programming language. The solve() function in R is used to solve equations of the form a %*% x = b, where a is the coefficient matrix, b is a vector or matrix, and x holds the unknowns to be calculated.

Syntax: solve(a, b)

Parameters:

  • a: the coefficient matrix of the equations
  • b: the right-hand-side vector or matrix

Example 1: Solving a system of three equations

Given Equations:
x + 2y + 3z = 20  
2x + 2y + 3z = 100  
3x + 2y + 8z = 200

Matrix A and vector B built from the coefficients of the equations:
A->
1   2   3
2   2   3
3   2   8
B->
20
100
200

To solve this using two matrices in R we use the following code:

# create matrix A and vector B from the given equations
A <- rbind(c(1, 2, 3),
           c(2, 2, 3),
           c(3, 2, 8))
B <- c(20, 100, 200)

# solve the system using the solve function in R
solve(A, B)

Output:

80 -36 3.99999999999999

Example 2: Solving a system of three equations with fractional solutions

To get the solutions in the form of fractions, we load the MASS package and wrap the solve() call in fractions().

Given Equations:
19x + 32y + 31z = 1110  
22x + 28y + 31z = 1406  
31x + 12y + 81z = 3040

Matrix A and vector B built from the coefficients of the equations:
A->
19   32   31
22   28   31
31   12   81
B->
1110
1406
3040

To solve this using two matrices in R we use the following code:

# load package MASS
library(MASS)

# create matrix A and vector B from the given equations
A <- rbind(c(19, 32, 31),
           c(22, 28, 31),
           c(31, 12, 81))
B <- c(1110, 1406, 3040)

# solve the system, wrapping solve() in fractions()
fractions(solve(A, B))

Output:

[1] 159950/2243 -92039/4486  29784/2243

This means that x = 159950/2243, y = -92039/4486 and z = 29784/2243 is the solution of the above system of linear equations.

Example 3: Finding the inverse of a matrix

When solve() is called with a single matrix argument, it returns the inverse of that matrix (equivalent to solving a %*% x = I):

# create matrix A from the given values
A <- matrix(c(4, 7, 3, 6), ncol = 2)
print(A)

print("Inverse matrix")

# invert A using the solve function in R
print(solve(A))

Output:

     [,1] [,2]
[1,]    4    3
[2,]    7    6
[1] "Inverse matrix"
          [,1]      [,2]
[1,]  2.000000 -1.000000
[2,] -2.333333  1.333333

Related post: Solving a System of Equations in Pure Python without Numpy or Scipy

An Introduction to Statistical Learning with Applications in R

An Introduction to Statistical Learning with Applications in R is intended for anyone who is interested in using modern statistical methods for modelling and prediction from data. This group includes scientists, engineers, data analysts, data scientists, and quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business.

The authors expect that the reader will have had at least one elementary course in statistics. A background in linear regression is also useful, though not required, since the key concepts behind linear regression are reviewed in Chapter 3. The mathematical level of the book is modest, and detailed knowledge of matrix operations is not required. The book also provides an introduction to the statistical programming language R. Previous exposure to a programming language, such as MATLAB or Python, is useful but not required.

The first edition of this book has been used to teach master’s and PhD students in business, economics, computer science, biology, earth sciences, psychology, and many other areas of the physical and social sciences. It has also been used to teach advanced undergraduates who have already taken a course on linear regression. In the context of a more mathematically rigorous course in which ESL (The Elements of Statistical Learning) serves as the primary textbook, ISL could be used as a supplementary text for teaching the computational aspects of the various approaches.

Power BI for Beginners: A Step-by-Step Training Guide

Power BI is a Business Intelligence tool developed by Microsoft. It helps you interactively visualise your data and make intelligence-based business decisions as a result.

Key features of Power BI:

  • Quick set-up compared to traditional BI
  • Interactive visualisations
  • Supports different data sources (Microsoft or otherwise)
  • The ability to publish to the web (app.powerbi.com)
  • Cloud-based, with no on-premise infrastructure needed
  • Scalable
  • Accessibility: view the dashboards/reports on iPad, iPhone, Android, and Windows devices
  • Scheduled data refresh

In this how-to guide, the writer gives you an overview of Power BI and how it can be used to load, manipulate, model, and report on data to assist with your reporting requirements. The scenario we’ll run through is how to report on internet sales for the fictitious AdventureWorks bicycle company and add some common time intelligence measures, for example period-to-date and period-against-previous-period reporting. We will take you through the typical steps of loading the data, modelling the data, and then visualising it.

Power BI Desktop is used to access data sources, shape, analyse and visualise data, and publish reports. Once installed on your local computer, it lets you connect to data from different sources, and transform and visualise your data. Power BI Desktop is available for free via a direct download link here.

Contents

  1. Introduction
  2. Overview of Power BI
  3. Getting Started
  4. Connecting to Data Sources
  5. Modelling the Data – Creating Relationships
  6. Reporting on the Data – Creating Visualisations
  7. Conclusion
  8. Appendix

COMMON STATISTICAL DISTRIBUTIONS

Statistical Distributions are an important tool in data science. A distribution helps us to understand a variable by giving us an idea of the values that the variable is most likely to obtain.

Moreover, once we know the distribution of a variable, we can perform all sorts of probability calculations to work out how likely certain situations are to occur.

In this article, I share 6 Statistical Distributions with intuitive examples that often occur in real-life data.


1. Normal or Gaussian distribution


The Normal or Gaussian distribution is arguably the most famous distribution, as it occurs in many natural situations. A normal distribution shows the probability density for a population of continuous data (for example, height in cm for all NBA players).

In other words, it shows how likely it is that any player from the NBA is of a certain height. Most players are around the mean/average height, and fewer are much taller or much shorter. A normal distribution is symmetrical on both sides of the mean.
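A quick sketch in base R (the mean and standard deviation are illustrative assumptions, not real NBA figures):

# plot a normal density curve for player heights
heights <- seq(160, 230, by = 0.5)
plot(heights, dnorm(heights, mean = 200, sd = 8), type = "l",
     xlab = "Height (cm)", ylab = "Density")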

2. T-Distribution


Just like a normal distribution, a t-distribution is symmetrical around the mean, and its breadth is based on the deviation within the data. While a normal distribution works with a population, a t-distribution is designed for situations where the sample size is small. The shape of the t-distribution becomes broader as the sample size decreases, to take into account the extra uncertainty we are faced with.

The shape of a t-distribution relates to the number of degrees of freedom, which is calculated as the sample size minus one. As the sample size, and thus the degrees of freedom, gets larger, the distribution tends towards a normal distribution, since with a larger sample we’re more certain about estimating the true population statistics.
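You can see this in base R by overlaying t-densities on a normal curve:

# the t-distribution broadens as degrees of freedom shrink
x <- seq(-4, 4, by = 0.05)
plot(x, dnorm(x), type = "l", ylab = "Density")  # normal, for reference
lines(x, dt(x, df = 30), lty = 2)                # close to normal
lines(x, dt(x, df = 3), lty = 3)                 # broader, heavier tails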

3. Binomial Distribution


A Binomial Distribution can end up looking a lot like the shape of a normal distribution. The main difference is that instead of plotting continuous data, it plots a distribution of two possible discrete outcomes, for example the results from flipping a coin.

Imagine flipping a coin 10 times and noting down how many of those flips were “Heads”. It could be any number between 0 and 10. Now imagine repeating that task 1,000 times…

If the coin we are using is indeed fair (not biased towards heads or tails), then the distribution of outcomes should start to take on the classic binomial shape. In the vast majority of cases we get 4, 5, or 6 “heads” from each set of 10 flips, and more extreme results are much rarer!
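A quick simulation of exactly this experiment in base R:

# 1,000 repetitions of 10 fair coin flips
heads <- rbinom(1000, size = 10, prob = 0.5)
barplot(table(factor(heads, levels = 0:10)),
        xlab = "Heads in 10 flips", ylab = "Frequency")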

4. Bernoulli Distribution


The Bernoulli Distribution is a special case of the Binomial Distribution. It considers only two possible outcomes, success or failure, true or false.

It’s a really simple distribution, but worth knowing! In the example below we’re looking at the probability of rolling a 6 with a standard die.

If we roll a die many, many times, we should end up rolling a 6 about 1 time in every 6 (or 16.7% of the time), and therefore not rolling a 6 (in other words, rolling a 1, 2, 3, 4 or 5) about 5 times in every 6 (or 83.3% of the time)!
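In base R, a Bernoulli trial is just a binomial with size = 1, so a sketch of the die example looks like this:

# success = rolling a 6 with a fair die
rolls <- rbinom(10000, size = 1, prob = 1/6)
mean(rolls)   # should be close to 1/6, about 0.167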

5. Uniform Distribution


A Uniform Distribution is a distribution in which all events are equally likely to occur. Below, we’re looking at the results from rolling a die many, many times.

We’re looking at which number we got on each roll and tallying these up. If we roll the die enough times (and the die is fair) we should end up with a completely uniform probability where the chance of getting any outcome is exactly the same.
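A quick simulation in base R:

# many rolls of a fair six-sided die
rolls <- sample(1:6, 10000, replace = TRUE)
barplot(table(rolls) / length(rolls), ylab = "Relative frequency")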

6. Poisson Distribution


A Poisson Distribution is a discrete distribution, similar to the Binomial Distribution in that we’re plotting the probability of whole-numbered outcomes. Unlike the other distributions we have seen, however, it is not symmetrical: it is instead bounded between 0 and infinity.

The Poisson distribution describes the number of events or outcomes that occur during some fixed interval. Most commonly this is a time interval like in our example below where we are plotting the distribution of sales per hour in a shop.
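A sketch of that example in base R, assuming an average of 4 sales per hour (an illustrative figure):

# Poisson probabilities for 0 to 15 sales in an hour
sales <- 0:15
barplot(dpois(sales, lambda = 4), names.arg = sales,
        xlab = "Sales per hour", ylab = "Probability")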

Download Data Analytics By Arthur Zhang

Data is important because you need information about certain aspects of your business to determine the state of that aspect and how it affects overall business operations. For example, if you don’t keep track of how many units you sell per month, there is no way to determine how well your business is doing. There are many other kinds of data that are important in determining business success, and they are discussed throughout Data Analytics by Arthur Zhang.


Collecting the data isn’t enough, though. The data needs to be analyzed and applied to be useful. If losing a customer isn’t important to you, or you feel it isn’t critical to your business, then there’s no need to analyze data. However, a continual lack of appreciation for customer numbers can impact the ability of your business to grow, because the number of competitors who do focus on customer satisfaction is growing. This is where predictive analytics becomes important, and how you employ this data will distinguish your business from competitors. Predictive analytics can create strategic opportunities for you in the business market, giving you an edge over the competition.

Download Data Analytics By Arthur Zhang

Table of Contents

CHAPTER 1: WHY DATA IS IMPORTANT TO YOUR BUSINESS
CHAPTER 2: BIG DATA
CHAPTER 3: DEVELOPMENT OF BIG DATA
CHAPTER 4: CONSIDERING THE PROS AND CONS OF BIG DATA
CHAPTER 5: BIG DATA FOR SMALL BUSINESSES? WHY NOT?
CHAPTER 6: IMPORTANT TRAINING FOR THE MANAGEMENT OF BIG DATA
CHAPTER 7: STEPS TAKEN IN DATA ANALYSIS
CHAPTER 8: DESCRIPTIVE ANALYTICS
CHAPTER 9: PREDICTIVE ANALYTICS
CHAPTER 10: PREDICTIVE ANALYSIS METHODS
CHAPTER 11: R – THE FUTURE IN DATA ANALYSIS SOFTWARE
CHAPTER 12: PREDICTIVE ANALYTICS & WHO USES IT
CHAPTER 13: DESCRIPTIVE AND PREDICTIVE ANALYSIS
CHAPTER 14: CRUCIAL FACTORS FOR DATA ANALYSIS
CHAPTER 15: EXPECTATIONS OF BUSINESS INTELLIGENCE
CHAPTER 16: WHAT IS DATA SCIENCE?
CHAPTER 17: DEEPER INSIGHTS ABOUT A DATA SCIENTIST’S SKILLS
CHAPTER 18: BIG DATA AND THE FUTURE
CHAPTER 19: FINANCE AND BIG DATA
CHAPTER 20: MARKETERS PROFIT BY USING DATA SCIENCE
CHAPTER 21: USE OF BIG DATA BENEFITS IN MARKETING
CHAPTER 22: THE WAY THAT DATA SCIENCE IMPROVES TRAVEL
CHAPTER 23: HOW BIG DATA AND AGRICULTURE FEED PEOPLE
CHAPTER 25: THE USE OF BIG DATA IN THE PUBLIC SECTOR
CHAPTER 26: BIG DATA AND GAMING
CHAPTER 27: PRESCRIPTIVE ANALYTICS

Download Introduction to Probability For Data Science

This book is an introductory textbook in undergraduate probability. Its mission is to spell out the motivation, intuition, and implication of the probabilistic tools we use in science and engineering. The writer has distilled what he believes to be the core of probabilistic methods, and puts the book in the context of data science to emphasize the inseparability between data (computing) and probability (theory) in our time.

Probability is one of the most interesting subjects in electrical engineering and computer science. It bridges our favourite engineering principles to practical reality, a world that is full of uncertainty. However, because probability is such a mature subject, the undergraduate textbooks alone might fill several rows of shelves in a library. When the literature is so rich, the challenge becomes how one can pierce through to the insight while diving into the details. For example, many of you have used a normal random variable before, but have you ever wondered where the “bell shape” comes from? Every probability class will teach you about flipping a coin, but how can “flipping a coin” ever be useful in machine learning today? Data scientists use Poisson random variables to model internet traffic, but where does the gorgeous Poisson equation come from?

This book is designed to fill these gaps with the knowledge that is essential to all data science students. This leads to the three goals of the book.

(i) Motivation: In the ocean of mathematical definitions, theorems, and equations, why should we spend our time on this particular topic but not another?

(ii) Intuition: When going through the derivations, is there a geometric interpretation or physics behind those equations?

(iii) Implication: After we have learned a topic, what new problems can we solve?

The book’s intended audience is undergraduate juniors/seniors and first-year graduate students majoring in electrical engineering and computer science. The prerequisites are standard undergraduate linear algebra and calculus, except for the section about characteristic functions, where Fourier transforms are needed. An undergraduate course in signals and systems would suffice, even taken concurrently while studying this book. The length of the book is suitable for a two-semester course. Instructors are encouraged to use the set of chapters that best fits their classes. For example, a basic probability course can use Chapters 1-5 as its backbone.

Chapter 6 on sample statistics is suitable for students who wish to gain theoretical insights into probabilistic convergence. Chapter 7 on regression and Chapter 8 on estimation best suit students who want to pursue machine learning and signal processing. Chapter 9 discusses confidence intervals and hypothesis testing, which are critical to modern data analysis. Chapter 10 introduces random processes. My approach for random processes is more tailored to information processing and communication systems, which are usually more relevant to electrical engineering students.


Additional teaching resources can be found on the book’s website, where you can find lecture videos and homework videos. Throughout the book, you will see many “practice exercises”, which are easy problems with worked-out solutions. They can be skipped without losing the flow of the book.

Download Introduction to Probability For Data Science

Talent vs Luck: The role of randomness in success and failure

The distribution of wealth follows a well-known pattern sometimes called the 80:20 rule: 80 percent of the wealth is owned by 20 percent of the people. A report last year showed that just eight men had a total wealth equivalent to that of the world’s poorest 3.8 billion people. The distribution of wealth is among the most controversial topics because of the issues it raises about the role of randomness in success and failure.

Why should so few people have so much wealth? The most common explanation is that the wealthy have earned it, whether through intelligence, talent, virtuous hard work, or sheer rapacity. Or all of the above, though it’s kind of tough to be both virtuous and rapacious.

But what about good old dumb luck? Luckily we have an answer thanks to the work of Alessandro Pluchino at the University of Catania in Italy and a couple of colleagues. These guys have created a computer model of human talent and the way people use it to exploit opportunities in life. The model allows the team to study the role of randomness in success and failure.

Some findings of the Study: The role of randomness in success and failure

  • The chance of becoming a CEO is influenced by your name or month of birth. The number of CEOs born in June and July is much smaller than the number of CEOs born in other months.
  • Those with last names earlier in the alphabet are more likely to receive tenure at top university departments.
  • The display of middle initials increases positive evaluations of people’s intellectual capacities and achievements.
  • People with easy-to-pronounce names are judged more positively than those with difficult-to-pronounce names.
  • Females with masculine-sounding names are more successful in legal careers.

A number of studies and books – including those by risk analyst Nassim Taleb, investment strategist Michael Mauboussin, and economist Robert Frank – have suggested that luck and opportunity may play a far greater role than we ever realized, across a number of fields including financial trading, business, sports, art, music, literature, and science. Their argument is not that luck is everything; of course, talent matters.

Data Science For Business By Foster Provost and Tom Fawcett

The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every part of the business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behaviour, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors’ movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data—the realm of data science.

The writers of this book, Foster Provost and Tom Fawcett, go beyond data analytics. This book is an essential guide for all of us who want to use data science in our business and daily life. The authors’ deep knowledge of data and data science makes this book a must-read.


Praise for Data Science For Business By Foster Provost and Tom Fawcett

“A must-read resource for anyone who is serious about embracing the opportunity of big data.”
— Craig Vaughan, Global Vice President at SAP

“This timely book says out loud what has finally become apparent: in the modern world, Data is Business, and you can no longer think business without thinking data. Read this book and you will understand the Science behind thinking data.”
— Ron Bekkerman, Chief Data Officer at Carmel Ventures

“A great book for business managers who lead or interact with data scientists, who wish to better understand the principles and algorithms available without the technical details of single-disciplinary books.”
— Ronny Kohavi, Partner Architect at Microsoft Online Services Division

“Provost and Fawcett have distilled their mastery of both the art and science of real-world data analysis into an unrivalled introduction to the field.”
— Geoff Webb, Editor-in-Chief of Data Mining and Knowledge Discovery Journal

“I would love it if everyone I had to work with had read this book.”
— Claudia Perlich

R Libraries Every Data Scientist Should Know

Having used R for most of my professional life, I have realized that R outclasses Python in several use cases, particularly for statistical analyses. R also has some powerful packages that were built by the world’s biggest tech companies, and they aren’t available in Python!

And so, in this article, I want to go over three R packages that I highly recommend you take the time to learn and add to your arsenal, because they are seriously powerful. Without further ado, here are three R packages that every data scientist should know:

1. Causal Impact (Google)


The package is designed to make counterfactual inference as easy as fitting a regression model, but much more powerful, provided its assumptions are met. The package has a single entry point, the function CausalImpact(). Given a response time series and a set of control time series, the function constructs a time-series model, performs posterior inference on the counterfactual, and returns a CausalImpact object. The results can be summarized in terms of a table, a verbal description, or a plot.
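Here is a minimal sketch adapted from the package’s documented simulated example (the series are artificial):

library(CausalImpact)

# a control series x and a response y with an effect added after time 70
set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x)

impact <- CausalImpact(data, pre.period = c(1, 70), post.period = c(71, 100))
summary(impact)   # tabular summary of the estimated causal effect
plot(impact)      # observed vs. counterfactual, pointwise and cumulative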

2. Robyn (Meta / Facebook)


Robyn is an automated Marketing Mix Modeling (MMM) code. It aims to reduce human bias by means of ridge regression and evolutionary algorithms, enables actionable decision-making by providing a budget allocator and diminishing-returns curves, and allows ground-truth calibration to account for causation.

3. Anomaly Detection (Twitter)


AnomalyDetection is an open-source R package to detect anomalies that is robust, from a statistical standpoint, in the presence of seasonality and an underlying trend. The package can be used in a wide variety of contexts: for example, detecting anomalies in system metrics after a new software release, in user engagement after an A/B test, or in problems in econometrics, financial engineering, and the political and social sciences.
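A minimal sketch using the example data bundled with the package, following its README (the package is installed from GitHub rather than CRAN):

# install via devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

data(raw_data)
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = "both", plot = TRUE)
res$plot   # time series with the detected anomalies highlighted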