In this chapter, we explore various text manipulation techniques in
R. More specifically, we start by discussing how to handle strings,
focusing on printing and combining (pasting) them. Next, we
introduce the stringr package, part of the
tidyverse family. This package includes functions that
make handling text data much easier compared to base R. Lastly, we
cover how to identify and work with patterns in text data, a concept
known as regular expressions.
When working with numeric data, it is quite intuitive to perform operations like addition or multiplication on vectors. However, manipulating strings (or character data), one of the core data types in R, requires specific functions. String manipulation can become complex, especially when combining strings from a single vector or different columns of a data frame. With text data, we can perform tasks such as adding or replacing text, finding matches, counting letters, locating positions of specific text characters, and much more.
We can use single quotes ('') or double quotes ("") to specify a
value (any value) as a string. For instance, suppose we want to
print one of the best-known phrases in the computer science,
data science, and data engineering world: “Hello world!”. We can print
this phrase with the print() function, enclosing the
text in single or double quotes:
# Print with single quotes
print('Hello world!')
[1] "Hello world!"
# Print with double quotes
print("Hello world!")
[1] "Hello world!"
In both cases, we get exactly the same result. However,
what happens if we need to have double or single quotes
within a string? When a quote inside the text is of the same
type as the quotes that enclose the string, R cannot tell which
quotes are part of the text itself and which quotes are used to
indicate a string. To clarify this, we use what is called an
escape sequence: we place the special
character backslash (\) before each single or double quote that we
want to include in the string. To display the result as plain text,
we then use the cat() function.
The cat() function is used to concatenate and display
text in a way that is more suitable for string formatting,
especially when working with escape sequences or when we want to
print text exactly as it appears, without additional characters like
quotes or backslashes. Unlike the print() function,
which shows the internal representation of objects (including quotes
around strings), the cat() function outputs the string
as plain text.
# Print the text: I want to print "Hello World!" with print()
print("I want to print \"Hello World!\"")
[1] "I want to print \"Hello World!\""
# Print the text: I want to print "Hello World!" with cat()
cat("I want to print \"Hello World!\"")
I want to print "Hello World!"
The print() function displays the string as
"I want to print \"Hello World!\"" because it shows both the
enclosing double quotes and the backslashes. However, the
cat() function displays
I want to print "Hello World!" without the additional
characters, making the output cleaner and more readable.
Using cat() is particularly useful when formatting
strings that include special characters such as quotes, as it
provides more control over how the output appears.
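As a small sketch of the same idea, the base R snippet below shows that a double quote inside a single-quoted string needs no backslash, that both spellings produce the same string, and that cat() also interprets other escape sequences such as \n (new line) and \t (tab). The variable names s1 and s2 are our own and are used only for illustration.

```r
# A double quote inside single quotes does not need to be escaped
s1 <- 'I want to print "Hello World!"'

# The same text inside double quotes needs a backslash before each inner quote
s2 <- "I want to print \"Hello World!\""

# Both spellings create exactly the same string
identical(s1, s2)

# The backslashes are not stored in the string, so the lengths are equal
nchar(s1) == nchar(s2)

# cat() interprets escape sequences such as \t (tab) and \n (new line)
cat("Hello\tworld!\nHello again!\n")
```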
The paste() function can be used when we want to
combine two or more string values into a single string. For
instance, we can use the paste() function to print
“Hello World!” when the two words are separate strings, as the
example below shows.
# Print "Hello World" with paste
paste("Hello", "World!")
[1] "Hello World!"
We see that the paste() function combines the two string
values into one string, separating them by a space. This occurs
because the default separator of paste() is a space. We can
use a different separator, though, by changing the argument
sep. For example, suppose we want to print the text
“Data-Science”.
# Print "Data-Science"
paste("Data", "Science", sep = "-")
[1] "Data-Science"
Things can become complicated when we start including whole
character vectors instead of a single string value inside the
paste() function. For instance, suppose we have a
scalar (one-element vector or just a single value as before) and a
vector of two elements (“Science” and “Analytics”).
# Print with a scalar and a vector
scalar <- "Data"
vector <- c("Science", "Analytics")
paste(scalar, vector, sep = "-")
[1] "Data-Science" "Data-Analytics"
When we have two vectors of the same length, vectorization takes place, as we would expect with vector inputs. This means that each element of one vector is combined with the corresponding element of the other vector.
# Print with two vectors
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics")
paste(vector1, vector2, sep = "-")
[1] "Data-Data" "Science-Analytics"
With vectors that are not of the same length, R will recycle the shorter vector to match the length of the longer one. This means that the shorter vector is repeated until it matches the length of the longer vector, which can sometimes lead to unexpected or undesired results if not used carefully.
# Print with two vectors of different lengths
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics", "Engineering")
paste(vector1, vector2, sep = "-")
[1] "Data-Data" "Science-Analytics" "Data-Engineering"
If we want to combine all elements into a single string, we can use the
collapse argument, which specifies the text placed between
the combined elements. For instance, suppose we add to
the previous examples the argument collapse with the value
" and " (notice that we included a space on each side).
# Print with a scalar and a vector
scalar <- "Data"
vector <- c("Science", "Analytics")
paste(scalar, vector, sep = "-", collapse = " and ")
[1] "Data-Science and Data-Analytics"
# Print with two vectors
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics")
paste(vector1, vector2, sep = "-", collapse = " and ")
[1] "Data-Data and Science-Analytics"
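The collapse argument also works on a single vector, which is a common way to turn several values into one string. The short sketch below uses a hypothetical vector called fields, introduced only for this illustration, and shows that sep is applied element by element before collapse joins the results.

```r
# A hypothetical character vector used only for this illustration
fields <- c("Science", "Analytics", "Engineering")

# With one vector, collapse joins all of its elements into a single string
joined <- paste(fields, collapse = ", ")
joined

# sep is applied first (element by element), collapse afterwards
labelled <- paste("Data", fields, sep = " ", collapse = "; ")
labelled

# A collapsed result is always a character vector of length one
length(joined)
```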
Lastly, a variation of the function paste() is the
function paste0(). The difference between these two
functions is that the first separates the pieces of text with a
space by default, while the second uses no separator at all.
# Print "Hello World" with paste
paste("Hello", "World!")
[1] "Hello World!"
# Print "Hello World" with paste0
paste0("Hello", "World!")
[1] "HelloWorld!"
When we use the paste() or the
paste0() function, it is a good idea to try some prints
just to make sure that the output is the expected one. We saw that
when we have vectors inside the function, things can become
complicated.
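One way to follow this advice is to compare the lengths of the inputs before pasting them. The sketch below, which uses two hypothetical vectors named first and second, also checks that paste0() behaves like paste() with an empty separator.

```r
# Hypothetical input vectors of different lengths
first  <- c("Data", "Science")
second <- c("Data", "Analytics", "Engineering")

# paste0() gives the same result as paste() with sep = ""
same_result <- identical(paste0(first, "-"), paste(first, "-", sep = ""))
same_result

# Different lengths mean that the shorter vector will be recycled
same_length <- length(first) == length(second)
same_length

# The output is as long as the longest input
combined <- paste(first, second, sep = "-")
length(combined)
```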
stringr Package
As mentioned at the beginning, the package we can use for text
manipulation is the stringr package. Although base R
already provides many alternative functions to manipulate strings,
it is better to use the stringr package because of its
consistent basic syntax: its functions start with the prefix
str_ and take the string as their first argument.
With RStudio, this is very handy as we do not need to remember all these string manipulation functions: when we type “str_”, RStudio automatically shows us all available alternatives.
Let’s start by loading the stringr package and creating
a vector of character values.
# Library
library(stringr)
# Quotes
quotes <- c("Become a Master in Data Science.",
"The best way to learn data science is to do data science.",
"Text mining is an essential skill.")
With this character vector, we can experiment using the many
functions of the stringr package, all with different
purposes. For instance, suppose we want to check whether the word
“is” exists within each value of the vector. We can do this by using
the str_detect() function, since we try to “detect”
whether a specific pattern exists within a string.
# Is the pattern in the string?
str_detect(quotes, pattern = "is")
[1] FALSE TRUE TRUE
As expected, we get the values FALSE, TRUE and TRUE because the word “is” is found within the second and the third element of the vector but not in the first one.
Another similar function is str_which(). This
function is similar to the which() function from base R
and returns the indexes of the elements in which the specified
pattern exists.
# Return the indexes of entries that contain the pattern
str_which(quotes, pattern = "is")
[1] 2 3
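The relationship between the two functions can be checked directly. The sketch below also illustrates a point that is easy to miss: a plain pattern such as "is" matches anywhere in a string, including inside a longer word. The \\b word-boundary syntax used in the last call belongs to regular expressions, which are covered later in this chapter, and the strings "This" and "It is" are our own test values.

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# str_which() returns the same indexes as which() applied to str_detect()
same_index <- identical(str_which(quotes, "is"),
                        which(str_detect(quotes, "is")))
same_index

# A plain pattern matches anywhere, even inside a longer word
inside_word <- str_detect("This", "is")
inside_word

# Word boundaries (\\b) restrict the match to the standalone word "is"
whole_word <- str_detect(c("This", "It is"), "\\bis\\b")
whole_word
```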
Regarding sub-setting strings, the functions
str_sub() and str_subset() can be used.
The first extracts part of each string based on specified positions,
while the second keeps only the elements of a vector that match a
specified pattern. The example below shows how they work and helps
us understand the difference between the two.
# Extract the first 6 characters
str_sub(quotes, start = 1, end = 6)
[1] "Become" "The be" "Text m"
# Return the subset of the strings that contains the word "Master"
str_subset(quotes, pattern = "Master")
[1] "Become a Master in Data Science."
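Two further details may help here, shown as a minimal sketch: str_sub() accepts negative positions, which count from the end of the string, and str_subset() gives the same result as indexing the vector with str_detect(). The string "Data Science" is our own test value.

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# Negative positions count from the end: extract the last 7 characters
last_word <- str_sub("Data Science", start = -7, end = -1)
last_word

# str_subset() is equivalent to indexing with the result of str_detect()
same_subset <- identical(str_subset(quotes, "Master"),
                         quotes[str_detect(quotes, "Master")])
same_subset
```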
If we want to see where a specific pattern occurs within a
string, the str_view() function highlights each match and
shows only the elements that contain the pattern.
# Emphasize word "is"
str_view(quotes, pattern = "is")
[2] │ The best way to learn data science <is> to do data science.
[3] │ Text mining <is> an essential skill.
With the str_split() function, we can split each string
into pieces at every occurrence of a specified pattern. The result
is a list with one element per string. In the example below, the
first string is left intact because it does not contain the pattern,
while the second and third strings are each split into two pieces.
# Split the quotes to create a list
str_split(quotes, pattern = "is")
[[1]]
[1] "Become a Master in Data Science."
[[2]]
[1] "The best way to learn data science " " to do data science."
[[3]]
[1] "Text mining " " an essential skill."
We see that, in practice, the functions found in
stringr are simple and effective. There are many other
functions in stringr, but the table below provides an
overview of the ones most commonly used. More specifically,
the table lists the name of each function together with a short
description. It is advisable to
come back and check this table when we want to solve a task that
involves strings.
| Function | Description |
|---|---|
| str_detect() | Is the pattern in the string? |
| str_which() | Return the indexes of entries that contain the pattern |
| str_sub() | Extract the characters based on specified positions (e.g. from 1 to 6) |
| str_subset() | Return the subset of the strings that contain the pattern (e.g. “Master”) |
| str_replace() | Replace the first match of a pattern in a string with another text |
| str_replace_all() | Replace all matches of a pattern in a string with another text |
| str_locate() | Return positions of the first occurrence of the specified pattern |
| str_locate_all() | Return positions of all occurrences of the specified pattern |
| str_to_upper() | Change all characters to upper case letters |
| str_to_lower() | Change all characters to lower case letters |
| str_to_title() | Change the first character of each word to upper case and the rest to lower case |
| str_length() | Number of characters in a string |
| str_count() | Count number of times a pattern appears in a string |
| str_replace_na() | Replace all NAs with a new specified value |
| str_trim() | Remove white space at the start and at the end of a string |
| str_sort() | Sort the vector in alphabetical order |
| str_order() | Indexes to order the vector in alphabetical order |
| str_trunc() | Truncate a string to a fixed width (the three dots of the ellipsis count toward the width) |
| str_c() | Join strings |
| str_view_all() | Emphasize all the parts of a string that match the pattern |
| str_split() | Split a string into a list, with its parts separated by the pattern |
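To complement the table, the sketch below applies a few of the listed functions to small inputs. The vector x and the other strings are hypothetical values chosen so that each result is easy to verify by eye; they are not part of the quotes vector used earlier.

```r
library(stringr)

# A small hypothetical vector with extra white space and a missing value
x <- c("  data science ", "Data", NA)

# Remove white space at both ends (NA stays NA)
trimmed <- str_trim(x)
trimmed

# Replace missing values with a label of our choice
no_na <- str_replace_na(x, "missing")
no_na

# Case conversion
upper <- str_to_upper("data")
title <- str_to_title("data science")

# Count matches and measure length
n_data  <- str_count("data science data", "data")
n_chars <- str_length("Data")

# Replace only the first match versus all matches
first_only <- str_replace("a-b-c", "-", "+")
all_match  <- str_replace_all("a-b-c", "-", "+")

# Join strings and truncate to a width of 7 (the ellipsis counts as 3)
joined    <- str_c("Data", "Science", sep = " ")
truncated <- str_trunc("Data Science", 7)
truncated
```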
In R, regular expressions are pattern-matching
tools that enable the concise and flexible manipulation of text data
by providing a syntax for specifying search patterns and
facilitating string matching and manipulation operations. Put
simply, we use regular expressions to describe patterns in strings.
To understand what this means and how we can use regular
expressions, we will use the string “Data!” with the function
str_detect() that we discussed previously.
# Check Regular Expression for "Data!"
str_detect("Data!", pattern = "^....!")
[1] TRUE
What exactly is this pattern? As we see, we matched “Data!” using a sequence of special characters. The special character caret (^) anchors the pattern to the start of the string; it does not itself match any character. Then, we used the special character dot (.) 4 times because a dot matches any single character (a letter, a digit, a space, and so on). Since the word “Data” contains 4 characters, we used the dot 4 times to capture it. Lastly, we included the exclamation mark (!), which is not a special character and simply matches itself. As a result, we described the string “Data!” fully, and that is why we got TRUE as an output. It is important to understand that the exact same regular expression would describe similar strings such as “Math!” or “Stat!”, as the pattern is exactly the same (4 characters followed by an exclamation mark).
# Check Regular Expression for "Math!"
str_detect("Math!", pattern = "^....!")
[1] TRUE
# Check Regular Expression for "Stat!"
str_detect("Stat!", pattern = "^....!")
[1] TRUE
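The same pattern can be probed with a few more inputs to see what the dot and the caret actually do. In the sketch below, the test strings ("1234!", "ab-d!", "Da!", "Python!") are our own; the point is that the dot is not limited to letters and that the caret forces the match to begin at the first character.

```r
library(stringr)

# The dot matches any character, so digits and symbols also qualify
any_char <- str_detect(c("1234!", "ab-d!"), pattern = "^....!")
any_char

# Too few characters before "!": no match
too_short <- str_detect("Da!", pattern = "^....!")
too_short

# With the caret, the fifth character of the string must be "!"
anchored <- str_detect("Python!", pattern = "^....!")
anchored

# Without the caret, the pattern may match further inside the string ("thon!")
unanchored <- str_detect("Python!", pattern = "....!")
unanchored
```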
That is actually the difference between regular expressions and
using the exact value of a string as a pattern. Had we used the
value “Data!” in the argument pattern, we would of course
get the output TRUE in the first example, but we would get
FALSE in the other two examples. Because our purpose is usually to
describe the general pattern shared by the values in a vector, it is
very useful to be able to describe such patterns.
# Check Regular Expression for "Data!" with the pattern "Data!"
str_detect("Data!", pattern = "Data!")
[1] TRUE
# Check Regular Expression for "Math!" with the pattern "Data!"
str_detect("Math!", pattern = "Data!")
[1] FALSE
# Check Regular Expression for "Stat!" with the pattern "Data!"
str_detect("Stat!", pattern = "Data!")
[1] FALSE
The main question that arises of course is “what is the point of learning regular expressions?”. Regular expressions are very useful when it comes to text data manipulation. For instance, suppose we have a vector that describes the body weight of 5 people.
# Create a vector
body_weight <- c("75 KG", "82 KG", "85 KG", "68 KG", "79 KG")
# Print body_weight
body_weight
[1] "75 KG" "82 KG" "85 KG" "68 KG" "79 KG"
# Print the class of the bodyweight
class(body_weight)
[1] "character"
Since we have text in this vector, the data type of our created
vector is “character”. In practice though, we would probably want to
separate the numeric values from the character values within each
element of the vector “body_weight” in order to perform analysis on
the numeric values. In other words, there is no need to have the
value “KG” in our vector. By describing the pattern in the
function str_remove() from the
stringr package, we can remove those characters using
regular expressions.
# Remove the unit: a space followed by any two characters at the end
body_weight <- str_remove(string = body_weight, pattern = " ..$")
# Print body_weight
body_weight
[1] "75" "82" "85" "68" "79"
# Print the class of the body_weight
class(body_weight)
[1] "character"
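Note that the class is still “character” after the unit is removed, because str_remove() always returns strings. A minimal sketch of the remaining step, assuming we want numbers for further analysis, converts the cleaned values with as.numeric(). The name weights is our own.

```r
library(stringr)

body_weight <- c("75 KG", "82 KG", "85 KG", "68 KG", "79 KG")

# Remove the unit, then convert the remaining text to numbers
weights <- as.numeric(str_remove(body_weight, pattern = " ..$"))

# The vector is now numeric, so arithmetic is possible
class(weights)
mean(weights)
```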
This simple example clearly illustrates the value of regular
expressions. However, regular expressions can be very confusing,
especially when we are working with more complicated strings. For
this reason, we will learn how to create regular expressions with
the rebus package, which contains functions that help us
construct regular expressions more easily, in a way closer to human
language. Although not part of tidyverse, this package
combines well with the stringr package. To see how
this works, let’s load the rebus package and
use its syntax to describe the same string “Data!” (with the
str_detect() function).
# Library
library(rebus)
# Check Regular Expression for "Data!" with a plain regular expression
str_detect("Data!", pattern = "^....!")
[1] TRUE
# Check Regular Expression for "Data!" with rebus
str_detect("Data!",
pattern = START %R% ANY_CHAR %R%
ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% "!")
[1] TRUE
We see that the pattern is much more understandable the way we wrote
it using the rebus package. We start
(START) the pattern, then we use
ANY_CHAR 4 times because the word “Data” consists of 4 characters,
and finally we use an exclamation mark in double quotes. The special
operator %R% can be read as “followed by” or “then”.
With this syntax, though, we describe the
whole pattern of the value “Data!”. We could describe this
word in many different ways, such as the ones in the following
example.
# "Data!" - the pattern is that the character value starts with any character
str_detect("Data!", pattern = START %R% ANY_CHAR)
[1] TRUE
# "Data!" - the pattern is that the character value ends with exclamation mark
str_detect("Data!", pattern = "!" %R% END)
[1] TRUE
In both cases, we see that we get the value of TRUE. It is therefore important to understand that there is no need to describe the whole pattern every time. How we will describe a pattern though really depends on the underlying data. Especially with regular expressions, it is important to practice and try to understand the output that we get when we use a specific pattern. In our previous example, if we set the pattern to the value of “START %R% one_or_more(DGT)”, we get the value of FALSE. Clearly, the reason would be that the value “Data!” does not start with one or more digits, but if we had the value “4-Data!”, we would get the value TRUE.
# Describe "Data!"
str_detect("Data!", pattern = START %R% one_or_more(DGT))
[1] FALSE
# Describe "4-Data!"
str_detect("4-Data!", pattern = START %R% one_or_more(DGT))
[1] TRUE
Now that we have discussed the intuition behind regular expressions
and the rebus package, we can focus on the more
technical details. The table below summarizes the syntax for
different regular expressions using the corresponding syntax of the
rebus package. It is important to note that the syntax
that we use with rebus is NOT considered a regular
expression; we use rebus to construct a regular
expression in an easier way.
| Regular_Expression | Rebus | Description |
|---|---|---|
| ^ | START | Start of a string |
| $ | END | End of a string |
| . | ANY_CHAR | Any single character |
| ? | optional() | Optional pattern |
| * | zero_or_more() | Zero or more occurrences |
| + | one_or_more() | One or more occurrences |
| {} | repeated() | Repeated pattern |
| \| | or() | Choice among alternatives |
| [] | char_class() | Any character within a specified set |
| [^] | negated_char_class() | Any character NOT in a specified set |
| \^ | CARET | Caret sign |
| \$ | DOLLAR | Dollar sign |
| \. | DOT | Dot sign |
| \d | DGT | Any digit |
| \w | WRD | Any word character (letter, digit, or underscore) |
| \s | SPC | Any whitespace |
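The left-hand column of the table can be tried out directly with str_detect(). The sketch below uses plain regular expressions only (no rebus) and a handful of test strings of our own. Note that, inside an R string, a backslash in a regular expression has to be written twice (for example "\\d").

```r
library(stringr)

# ? makes the preceding character optional
optional_u <- str_detect(c("color", "colour"), "colou?r")

# * allows zero or more, + requires one or more, {2} requires exactly two
zero_or_more_b <- str_detect(c("ac", "abc", "abbc"), "ab*c")
one_or_more_b  <- str_detect(c("ac", "abc", "abbc"), "ab+c")
exactly_two_b  <- str_detect(c("ac", "abc", "abbc"), "ab{2}c")

# | chooses among alternatives, [] matches any character in the set
alternatives <- str_detect(c("cat", "dog", "cow"), "cat|dog")
char_set     <- str_detect(c("gray", "grey", "groy"), "gr[ae]y")

# \\d matches a digit; \\. matches a literal dot, while . matches anything
has_digit   <- str_detect(c("a1", "ab"), "\\d")
literal_dot <- str_detect(c("3.14", "314"), "\\.")
any_single  <- str_detect("314", ".")

one_or_more_b
literal_dot
```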