ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
1 1 46 Low 26 Low 3009.60
2 2 40 Low 56 Medium 57347.28
3 3 35 Low 293 High 14496.16
4 4 50 Low 18 Low 1416.20
5 5 77 Medium 14 Low 523.72
6 6 55 Medium 39 Low 8830.32
Monetary_Value_Level Observation_Period Churn
1 Medium 742 0
2 High 2301 0
3 High 2411 0
4 Medium 813 0
5 Low 1 0
6 Medium 2077 0
3 Importing Data
In this chapter, we discuss the main topics regarding data imports. Understanding how to effectively import data is a crucial skill for data analysis, as it forms the foundation upon which all subsequent analysis is built. As such, the principles we will cover are broadly applicable across various software environments.
3.1 Spreadsheets and File Types
Data sets are stored in many different formats. Probably the most common is the table form or electronic spreadsheet (e.g., Excel format). A spreadsheet is similar to a data frame or matrix because it consists of rows and columns. The type of file determines how we import it into R. Common file types include Excel workbooks, CSV files, and text files with specific delimiters such as tabs or semicolons. A CSV (Comma-Separated Values) file, for example, uses commas to separate the values within each row. Knowing the file type matters because it determines which function or package we use to read the data into R. For instance, the file customer_churn below is shown as it appears in a text editor:
The first row contains headers, which might appear wrapped due to length but, in terms of structure, they are still a single row. By understanding the file type and structure, we can accurately import our data in R.
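To make the role of the delimiter concrete, here is a minimal sketch. It assumes a small made-up table (not the real customer_churn file), writes it to temporary files with three different delimiters, and reads each one back with a base R function that matches the delimiter.

```r
# Illustrative data, made up for this sketch (not the real customer_churn file)
demo <- data.frame(
  ID = 1:3,
  Recency = c(46, 40, 35),
  Recency_Level = c("Low", "Low", "Low")
)

# Write the same table with three different delimiters to temporary files
csv_file  <- tempfile(fileext = ".csv")
tsv_file  <- tempfile(fileext = ".txt")
semi_file <- tempfile(fileext = ".csv")
write.table(demo, csv_file,  sep = ",",  row.names = FALSE)
write.table(demo, tsv_file,  sep = "\t", row.names = FALSE)
write.table(demo, semi_file, sep = ";",  row.names = FALSE)

# Look at the raw text, similar to what a text editor would show
readLines(csv_file)

# Read each file back with a function that matches its delimiter
from_csv  <- read.csv(csv_file)                               # comma-separated
from_tsv  <- read.delim(tsv_file)                             # tab-separated
from_semi <- read.table(semi_file, header = TRUE, sep = ";")  # semicolon-separated
```

All three objects should hold the same table; only the function (or its sep argument) changes with the file type.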
3.2 Paths and the Working Directory
Besides the file type, we need to know the path of a file. The path denotes where the file is stored. Files are usually kept in organized folders, which are called directories. Although the names may not be intuitive, the important thing to remember is that, to import a file in RStudio, we need to know its type and where it is. To understand the terminology, suppose we have a CSV file called customer_churn in a folder called Data Sets. A possible path in that case would be C:/Users/User/Desktop/Data Sets/customer_churn.csv. Let’s break it down:
Full path: C:/Users/User/Desktop/Data Sets/customer_churn.csv
Directory Path: C:/Users/User/Desktop/Data Sets
Directory: Data Sets
File: customer_churn.csv
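The same breakdown can be reproduced in R. The following sketch uses the base functions dirname(), basename() and file.path(), plus file_ext() from the tools package that ships with R; the object names are chosen only for illustration. These functions work on the text of the path, so the file does not need to exist.

```r
# The example path from the text (the file itself does not need to exist)
full_path <- "C:/Users/User/Desktop/Data Sets/customer_churn.csv"

directory_path <- dirname(full_path)          # "C:/Users/User/Desktop/Data Sets"
directory      <- basename(directory_path)    # "Data Sets"
file_name      <- basename(full_path)         # "customer_churn.csv"
file_type      <- tools::file_ext(full_path)  # "csv"

# file.path() glues the pieces back together with "/" as separator
rebuilt_path <- file.path(directory_path, file_name)
```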
So, by using the full path, we can import a data set into R. To see how this works in practice, we implement what we just described. The built-in function to import a CSV file in R is read.csv(). Inside the function we specify the full path, and we can store the data in an object directly. In our example, we give the name churn_data to the object in which we store the imported data.
# Import the data
churn_data <- read.csv("C:/Users/User/Desktop/Data Sets/customer_churn.csv")
# Print the first 6 rows
head(churn_data)
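If the path is mistyped, read.csv() stops with an error. As a minimal sketch, we could check the path first with file.exists(). The helper import_csv_checked() below is hypothetical (it is not part of R), and a small temporary file stands in for customer_churn.csv so that the example runs on any computer.

```r
# Hypothetical helper: import a CSV file only if the path points to an existing file
import_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# A temporary file with made-up values stands in for customer_churn.csv
demo_path <- file.path(tempdir(), "customer_churn_demo.csv")
write.csv(data.frame(ID = 1:2, Churn = c(0, 0)), demo_path, row.names = FALSE)

# Import the data and print the first rows
demo_data <- import_csv_checked(demo_path)
head(demo_data)
```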
When we work with R, though, we are always located “somewhere” on the computer. In other words, R assumes a specific path from which we work. This is called our working directory. Once the working directory is the folder that contains our files, there is no need to specify the full path every time we import a data set; we can simply use the file name inside the function. Before we see how this works, let’s check our current working directory. For this, we can use the function getwd():
# Get Working Directory
getwd()
[1] "C:/Users/User/Document"
We see that our working directory is C:/Users/User/Document. To change the working directory, we can use the function setwd(). For instance, suppose we want to change the working directory from C:/Users/User/Document to C:/Users/User/Desktop/Data Sets. To do this, we enter the desired directory path inside the parentheses.
# Change Working Directory
setwd("C:/Users/User/Desktop/Data Sets")
Now, if we call getwd() again, we see that our working directory has changed.
# Get Working Directory
getwd()
[1] "C:/Users/User/Desktop/Data Sets"
As our working directory is now the Data Sets directory, we can use the read.csv() function with only the name of the file inside the parentheses:
# Import the data
churn_data <- read.csv("customer_churn.csv")
# Print the first 6 rows
head(churn_data)
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
1 1 46 Low 26 Low 3009.60
2 2 40 Low 56 Medium 57347.28
3 3 35 Low 293 High 14496.16
4 4 50 Low 18 Low 1416.20
5 5 77 Medium 14 Low 523.72
6 6 55 Medium 39 Low 8830.32
Monetary_Value_Level Observation_Period Churn
1 Medium 742 0
2 High 2301 0
3 High 2411 0
4 Medium 813 0
5 Low 1 0
6 Medium 2077 0
We see that the data import succeeds. In this way, we can import different data sets quite efficiently. Another advantage is that, when we share our R script, our code is more readable and other people can easily run the script, provided that their working directory contains the same data set. Note that when a file is located in the working directory, we can still use the full path if we want; the result is exactly the same.
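The whole workflow can be tried safely with a temporary folder. The sketch below is an illustration under assumptions: the folder and the small table are made up, and the code relies on the fact that setwd() invisibly returns the previous working directory, which lets us restore it at the end.

```r
# Create a temporary folder with a small CSV file (made-up values)
demo_dir <- file.path(tempdir(), "Data Sets")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1:3, Churn = c(0, 0, 1)),
          file.path(demo_dir, "customer_churn.csv"),
          row.names = FALSE)

# Change the working directory and remember the previous one
old_wd <- setwd(demo_dir)

# File name only: R looks for the file in the working directory
by_name <- read.csv("customer_churn.csv")

# Full path: points to the same file
by_full_path <- read.csv(file.path(demo_dir, "customer_churn.csv"))

# Compare the two imports
same_result <- identical(by_name, by_full_path)

# Restore the original working directory
setwd(old_wd)
```

Restoring the previous working directory is a design choice that keeps the rest of a script unaffected by the temporary change.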
3.3 Importing Data in RStudio
Now that we understand “path” and “directory”, we can examine in practice how to import a data set in R by using the RStudio functionality. To keep things simple, we import the same data set with the same full path as described. So, the CSV file that we import is called customer_churn.csv and the directory path is C:/Users/User/Desktop/Data Sets. As shown earlier, we can use the read.csv() function to import this data set.
However, RStudio also provides a user-friendly functionality that can help us import our data sets in a relatively straightforward manner. By choosing File -> Import Dataset, we can see that RStudio provides us with different options, such as “From Text (readr)…”. By clicking this option, we will see the following output:
To find the file on our computer, we click the Browse button in the top right corner and navigate to the file. Once we select it, we see a preview of the data in the table that appears:
Generally, we can see a number of options to change how RStudio imports the file. In this way, we see exactly what RStudio will import and can make adjustments before the import takes place. Lastly, we see the exact R code that performs the import in the bottom right corner. This is very valuable because not only can we use this code later, but we can also learn how to import a data set by coding directly in the console. In this example we note that RStudio used the readr package and the read_csv() function to make this import possible. This function can be thought of as an advanced version of the read.csv() function. The details are not important at this point; our goal is to capture the intuition about paths, directories and the overall functionality of RStudio regarding data import.
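For readers who want to try the two functions side by side, here is a minimal sketch. It assumes that the readr package may or may not be installed, so the read_csv() call is skipped when it is missing, and it uses a small made-up file instead of the real data set. The code that RStudio generates may include additional lines.

```r
# A small temporary CSV file with made-up values
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, Recency_Level = c("Low", "Low", "Medium")),
          tmp, row.names = FALSE)

# Base R import: returns a data frame
base_data <- read.csv(tmp)
class(base_data)

# readr import: run only if the readr package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  # read_csv() returns a tibble, which is a special kind of data frame
  readr_data <- readr::read_csv(tmp)
  class(readr_data)
}
```

Note that read_csv() may print a message describing the column types it guessed; this is informational and not an error.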
3.4 Importing Data from Other Sources
It is possible to import data into R from various sources, including relational database platforms such as MySQL, as well as directly from web pages via URLs. Additionally, R can be used for web scraping, which involves extracting data from the HTML of web pages. Given the variety of data sources, it is impractical to cover every possible method in detail. However, the core idea remains the same: we need to guide R to the location of the data and specify the appropriate function for importing it, as different sources and file types require different functions.
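The idea of guiding R to a location also holds when the location is a URL. The sketch below is an illustration under assumptions: since no real web address is given here, it builds a file:// URL from a temporary file to stand in for a remote source. With a real web source, the same read.csv() call would receive an http:// or https:// address instead.

```r
# A small temporary CSV file (made-up values) stands in for a remote resource
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, Churn = c(0, 1, 0)), tmp, row.names = FALSE)

# Build a file:// URL from the path; forward slashes are used on every platform
path <- normalizePath(tmp, winslash = "/")
file_url <- if (startsWith(path, "/")) {
  paste0("file://", path)    # e.g. a path that starts with "/" on Linux or macOS
} else {
  paste0("file:///", path)   # e.g. a path that starts with a drive letter on Windows
}

# read.csv() accepts a URL in place of a file path
url_data <- read.csv(file_url)
head(url_data)
```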