This post aims to share with you how to create a Pareto chart by using the ggplot2 package of R.
A Pareto chart is a popular data visualisation used to reveal patterns such as the 80-20 rule.
The purpose of a Pareto chart is to highlight the most important factors among a (typically large) set of factors. In a market share study, Pareto charts are useful for revealing whether a business is dominated by a handful of key products or customers. In general, businesses try to avoid being over-dependent on a few products or customers, because if something happens to those key products or business partners, the business will be badly affected.
In a nutshell, a Pareto chart is a dual-axis chart. The y-axis on the left (also known as the primary y-axis) shows the absolute frequency of the observations, presented as bars. The y-axis on the right shows the cumulative frequency of the observed values, presented as dots. These dots are then joined by a line.
In this exercise, we are going to create a Pareto chart showing the distribution of returns by product sub-category, using a data set from a retail store group.
The figure below shows the workflow of a typical data science and analytics process and the tidyverse R packages that are specially designed to support the specific tasks in this process.
Before we get started, it is important for us to ensure that the required R packages have been installed. For the purpose of this exercise, the following R packages will be used: tidyverse, readxl and knitr.
The code chunk below is used to check if the necessary R packages are installed in R. If they have not been installed yet, the missing R package(s) will be installed. If they are already installed, they will be loaded into the R environment.
packages = c('tidyverse', 'readxl', 'knitr')
for (p in packages) {
  # require() returns FALSE if the package is not yet installed
  if (!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}
It is important to note that the readxl package has to be listed separately in the packages list because it is not part of the core tidyverse packages.
In this exercise, the Superstore-2021 data set will be used. It is an MS Excel workbook consisting of three worksheets, namely: Orders, People and Returns.
The code chunk below imports Superstore-2021.xls into the R environment by using the read_xls() function of the readxl package.
# Import the Orders and Returns worksheets from the Excel workbook
orders <- read_xls("data/Superstore-2021.xls",
                   sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls",
                    sheet = "Returns")
After parsing the worksheets into R, it is good practice to check the structure and content of the newly created tibble data frames in RStudio.
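For example, glimpse() of dplyr (loaded as part of the tidyverse) is one way to do this. It reveals the column names, data types and the first few values of each field:

# Display the structure of each tibble: column names, types and sample values
glimpse(orders)
glimpse(returns)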
In this step, left_join() of dplyr is used to join the returns data frame and the orders data frame, using Order ID as the unique identifier.
joined_tab <- left_join(returns, orders,
                        by = c('Order ID' = 'Order ID'))
To learn more about the join functions of dplyr, click on this link.
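As a quick illustration of how left_join() behaves, the hypothetical toy example below joins two small tibbles. All rows of the left table are kept, and rows without a match in the right table are filled with NA:

# Toy data frames (hypothetical, for illustration only)
x <- tibble(id = c(1, 2), val_x = c("a", "b"))
y <- tibble(id = c(1, 3), val_y = c("c", "d"))
# left_join() keeps all rows of x; id = 2 has no match in y, so val_y is NA
left_join(x, y, by = "id")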
Next, we are going to compute the frequency count of returns by sub-category. There are two ways to complete this task: the first uses the group_by() method and the second uses the count() method of dplyr.
In the code chunk below, group_by() of the dplyr package is used to group the returned orders by Sub-Category. Then, summarise() of dplyr is used to count (i.e. n()) the number of returned orders.
freq_returned <- joined_tab %>%
  group_by(`Sub-Category`) %>%
  summarise(Returns = n()) %>%
  ungroup()
The code chunk below shows an alternative way to derive the frequency of returns by sub-category. In this case, count() of the dplyr package is used.
freq_returned <- joined_tab %>%
  count(`Sub-Category`) %>%
  rename(Returns = n)
By default, count() will return a new field called n to store the result. In view of this, rename() of dplyr is used to rename n to Returns.
By default, the rows of the tibble data frame are sorted according to the values of the first column. In this case, the rows are sorted alphabetically by the Sub-Category field, as shown in the figure below.
Before we can compute the cumulative frequency, we need to sort the rows by the values in the Returns field in descending order. To accomplish this task, arrange() of the dplyr package is used, as shown in the code chunk below.
freq_sorted <- freq_returned %>%
  arrange(desc(Returns))
The sorted data frame should look similar to the screenshot below.
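If you prefer to display the result as a table in the rendered document, kable() of the knitr package (installed earlier) is one option:

# Show the first few rows of the sorted frequency table
kable(head(freq_sorted))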
Our next task is to compute the cumulative frequency of returns by product sub-category. This task will be performed by using mutate() of the dplyr package and cumsum() of Base R.
freq_cum <- freq_sorted %>%
  mutate(cumfreq = cumsum(Returns))
The newly computed cumulative frequency values will be stored in a new field called cumfreq as shown in the screenshot below.
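With both the absolute and cumulative frequencies in place, the chart itself can be assembled. The code below is a minimal sketch (not necessarily the final chart of this post): it draws the bars on the primary y-axis, overlays the cumulative frequency as dots joined by a line, and uses sec_axis() to express the cumulative values as percentages on the secondary y-axis. The axis labels and bar colour are assumptions.

# A minimal Pareto chart sketch using ggplot2 (loaded as part of the tidyverse)
total <- sum(freq_cum$Returns)
ggplot(freq_cum,
       aes(x = reorder(`Sub-Category`, -Returns))) +
  # Absolute frequency as bars on the primary y-axis
  geom_col(aes(y = Returns), fill = "lightblue") +
  # Cumulative frequency as dots joined by a line
  geom_point(aes(y = cumfreq)) +
  geom_line(aes(y = cumfreq, group = 1)) +
  scale_y_continuous(
    name = "Absolute frequency",
    # Secondary y-axis: cumulative frequency as a percentage of total returns
    sec.axis = sec_axis(~ . / total * 100,
                        name = "Cumulative frequency (%)")) +
  xlab("Sub-Category") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note that reorder() arranges the bars in descending order of Returns so that they decrease from left to right, which is the defining feature of a Pareto chart.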