Importing Data

Introduction

In this chapter, we discuss the main topics regarding data imports. Understanding how to effectively import data is a crucial skill for data analysis, as it forms the foundation upon which all subsequent analysis is built. As such, the principles we will cover are broadly applicable across various software environments.

Spreadsheets and File Types

Datasets are typically stored in all kinds of formats. Probably the most common type is the table form or electronic spreadsheet (e.g., Excel format), meaning it consists of rows and columns. The type of file determines how we import it into Python. Common file types include Excel workbooks, CSV files, or text files with specific delimiters such as tabs or semicolons. For instance, a CSV file (Comma-Separated Values) uses commas to separate values within each row. Understanding these file types is crucial because it influences how data is read into Python using appropriate functions or packages. For instance, the file Customer_Churn below is seen with a text editor:

Figure 3.1: Opening a CSV in a text editor.

The first row contains headers, which might appear wrapped due to length but, in terms of structure, they are still a single row. By understanding the file type and structure, we can accurately import our data in Python.

Paths and the Working Directory

Except for the file type, we need to know the path of a file. The path of a file essentially denotes where the file is stored. Usually, we can have these files in organized folders, which are called directories. Although the names may not be so intuitive, the important thing to remember is that, to import a file in VS Code, we need to know its type and where it is. To understand the terminology, suppose we have a csv file called Customer_Churn in a folder called Data Sets. A possible path in that case would be C:/Users/User/Desktop/Data Set/Customer_Churn.csv. Let’s break it down:

  • Full path: C:/Users/User/Desktop/Data Sets/Customer_Churn.csv

  • Directory Path: C:/Users/User/Desktop/Data Sets

  • Directory: Data Sets

  • File: Customer_Churn.csv

So, by using the full path, we can import a dataset in Python. To see how this works in practice, we implement what we just described. Even though it is possible to import a csv file using the built-in functions, it is much easier to use the function read_csv() from Pandas.

# Importing pandas using the alias pd
import pandas as pd

As discussed in Chapter Introduction to Python and Visual Studio Code, the as statement allows us to create a shorter and more convenient name for a package or module when importing it. Using aliases in this way is very common in Python, especially for widely used packages such as Pandas. It makes the code shorter, easier to read, and more convenient to write repeatedly.

Therefore, to use the read_csv() function from Pandas, we use the format pd.read_csv(). This way of calling functions functions from specific libraries or modules applies to most Python libraries: we first specify the package or module (or its alias), followed by a dot (.), and then the function name. After using the read_csv() function, we can use the method head() to print the first few rows of the dataset:

# Importing Customer_Churn
customer_churn = pd.read_csv("C:/Users/User/Desktop/Data Sets/Customer_Churn.csv")

# Printing the first few rows 
customer_churn.head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

When we work with Python though, we are always located “somewhere” in the computer in which we work. In other words, Python assumes that we have a specific path, from which we work. This is called our working directory. With working directory, there is no need to specify the full path every time when we import a dataset; we can use the file name instead of the full path inside the function. To check the current working directory, we can either check the path that is written on the terminal, or using the function getcwd() from the module os, which is part of Python’s standard library:

import os

# Printing working directory 
os.getcwd()
C:/Users/User/Document

We see that our working directory is C:/Users/User/Document (your directory will probably be different). To change the working directory, we can use the function chdir() in Python or use the command cd on the terminal, followed by the directory of our choice. For instance, suppose we want to change the working directory from C:/Users/User/Document to C:/Users/User/Desktop/Data Sets. To do this, we pass the desired directory path to chdir():

# Changing working directory 
os.chdir("C:/Users/User/Desktop/Data Sets")

Now, if we use the getcwd() again, we see that our working directory is different:

# Printing working directory 
os.getcwd()
C:/Users/User/Desktop/Data Sets

As our working directory is the Data Sets directory, we can now use the read_csv() function by only filling the name of the file in the parenthesis:

# Importing Customer_Churn
customer_churn = pd.read_csv("Customer_Churn.csv") 

# Printing the first few rows 
customer_churn.head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

We see that the data import occurs successfully. In this way, we can import different datasets quite efficiently. Another advantage is that, when we share our Python script, our code is more readable and other people can easily run the script under the assumption that their working directory contains the same dataset. Note that when a file is located in the working directory, we can still use the full path if we want; the result would be exactly the same.

Slash vs Backslash

The difference between a slash (/) and a backslash (\) in the context of file paths primarily relates to their usage in different operating systems and how they denote locations in a file system:

Slash (/) is commonly used in Unix-like operating systems (Linux, macOS) and URLs.

Backslash (\) is primarily used in Windows operating systems.

However, in Python, as in many other programming languages, the backslash (\) is used as an escape character. This means that when Python sees a backslash, it expects it to be followed by another character or sequence that represents a special character or command (e.g., \n for a new line or \t for a tab).

Example:

• Incorrect (using a single backslash): C:\Users\YourName\Documents

• Correct (using double backslashes): C:\\Users\\YourName\\Documents

• Preferred (using forward slashes): C:/Users/YourName/Documents

For ease of use and to avoid problems with escape characters, using forward slashes (/) in file paths is often the simplest option when working in Python, regardless of the operating system.

Importing Data from GitHub

In addition to importing files from local directories, Python also allows direct data import from online sources. GitHub is a web-based platform for hosting and sharing code, datasets, and collaborative projects. It is one of the most widely used platforms for collaboration and storing code. Although GitHub offers many features, we don’t need to explore them in detail here. Our goal is to see how we can import a dataset from such platforms. Many of these datasets can be accessed directly via their URLs. For instance, suppose we want to import the publicly available customer_churn dataset from GitHub.

Figure 3.2: Viewing the customer_churn dataset on GitHub.

By clicking the Raw tab on GitHub, we access the raw CSV file that can be directly imported into Python:

Figure 3.3: Viewing the raw customer_churn dataset.

The format is still CSV (values separated by commas), which allows us to import the dataset using the read_csv() function by providing the raw file URL:

# Importing customer_churn from GitHub
customer_churn = pd.read_csv("https://raw.githubusercontent.com/Datakortex/Datasets/refs/heads/main/customer_churn.csv")

# Printing the first few rows 
customer_churn.head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

Notice that the local file path does not matter in this case, as we are importing the dataset directly from the web. This is particularly useful for reproducible research and collaborative projects. It allows all users to access the same dataset from a shared online location without manually downloading files. As long as the URL remains valid, the dataset can be imported consistently across different systems and environments.

Importing Data from Other Sources

It is possible to import data in Python from various sources, including relational database platforms such as MySQL, as well as directly from web pages via URLs (just like we did earlier). Additionally, Python can be used for web scraping, which involves extracting data from HTML or directly from web pages. Given the variety of data sources, it’s impractical to cover every possible method in detail in this introductory chapter. However, the core idea remains the same: we need to guide Python to the location of the data and specify the appropriate function for importing it, as different file types require different functions.