RFP - Part1: R Vectors

Master R

By Guangming Lang Comment

The simplest data structure in R is the vector. A vector is one dimensional and can be imagined as a sequence of blocks containing values:

    | v1 | v2 | ... |

A R vector can have any length. Its elements must have the same data type. (We’ll see later that each element is itself a length-1 vector.) There are four common types: logical, integer, double (also called numeric), and character. A vector is numeric if and only if its elements are doubles. Similarly, a logical vector has TRUE or FALSE as its elements. An integer vector contains only integers, and a character vector has only strings. Given a vector x, we can call typeof(x) to find its type.

Empty vectors

There’s an empty vector for each type, with the following syntax:

  • logical()
  • integer()
  • numeric()
  • character()

An empty vector has 0 elements and 0 length.

Length-1 vectors

Unlike most other programming languages, R doesn’t have scalar types or values (or as I like to call them, singletons). What appear as singletons are really just vectors of length one. For example, literals like TRUE, 1L, 3.124, "R is awesome" are vectors of length 1, and each has a different type. Constants like NA, NA_integer_, NA_real_ and NA_character_ are also vectors of length 1, where NA has type logical even though it’s not written explicitly like the others.

print_atom_info = function(x) {
        # Prints the type and length of an atomic/single value. 
        # Shows atomic/singleton values are really just vectors of length 1.
        #
        # x: an atomic/singleton value such as 1L, "R is awesome", NA and etc.
        data_type = typeof(x)
        
        if (is.finite(x)) {
                if (data_type == "integer") x_str = paste0(x, "L")
                else x_str = x
        } else {
                if (is.nan(x) | is.infinite(x)) x_str = x
                else if (is.na(x)) { 
                        if (data_type == "logical") x_str = x
                        if (data_type == "integer") x_str = "NA_integer_"
                        if (data_type == "double") x_str = "NA_real_"
                        if (data_type == "character") x_str = "NA_character_"
                } else { # x must be a string
                        x_str = paste0("'", x, "'")   
                }        
        }
                
        print(paste(x_str, "is a vector of type", typeof(x), 
                    "and length", length(x)))
}

# literals are vectors of length 1
{
        print_atom_info(TRUE)        
        print_atom_info(1L) 
        print_atom_info(3.124)        
        print_atom_info("R is awesome")
}
## [1] "TRUE is a vector of type logical and length 1"
## [1] "1L is a vector of type integer and length 1"
## [1] "3.124 is a vector of type double and length 1"
## [1] "'R is awesome' is a vector of type character and length 1"
# constants are also vectors of length 1
{
        print_atom_info(NA)
        print_atom_info(NA_integer_)
        print_atom_info(NA_real_)
        print_atom_info(NA_character_)
        print_atom_info(NaN)
        print_atom_info(Inf)
        print_atom_info(-Inf)
}
## [1] "NA is a vector of type logical and length 1"
## [1] "NA_integer_ is a vector of type integer and length 1"
## [1] "NA_real_ is a vector of type double and length 1"
## [1] "NA_character_ is a vector of type character and length 1"
## [1] "NaN is a vector of type double and length 1"
## [1] "Inf is a vector of type double and length 1"
## [1] "-Inf is a vector of type double and length 1"

Length-n vectors, n > 1

The syntax for a vector with at least 2 values is c(v1, v2, ..., vn). (Now we know each value v is itself a length-1 vector, we’ll stop repeating this and simply treat them as if they are single atomic values.) We can make a vector with c(e1, ..., en) where each expression1 e is evaluated to a value. In practice, it’s more common to make a vector with c(e1, e2), called “e1 combined with e2,” where e1 evaluates to a “vector of type t” and e2 evaluates to another “vector of type t.” The result is a new vector that starts with the elements in e1 followed by the elements in e2. When e1 evaluates to a single value, borrowing the word “cons” from FP (Functional Programming), c(e1, e2) can also be called “e1 consed onto e2.” The result is then a new vector that starts with the value of e1 followed by the elements in e2.

How to use vectors

One goal of this RFP (R Functional Programming) series is to learn the fundamental ideas of functional programming using R2. These ideas are very powerful, and the first one we’ll look at is the emphasis on recursion. As we’ll see, because of recursion, all we need, when working with vectors, are three basic operations:

  1. Check if a vector is empty.
  2. Get the first element of a vector, raising an exception if the vector is empty.
  3. Get the tail of a vector without its first element, raising an exception if the vector is empty.

And we can solve almost all problems that involve one or more vectors. But R doesn’t provide these basic operations perfectly out of the box. Instead, we have to write our own functions for them.

# Check if a vector is empty. Returns TRUE if it is, FALSE otherwise.
is_empty = function(xs) { # xs: a vector of any type
        is.null(xs) | identical(xs, logical()) | identical(xs, integer()) |
                identical(xs, numeric()) | identical(xs, character())
}

# Get the first element of a vector, raising an exception if the vector is empty.
hd = function(xs) { # xs: a vector of any type
        if (is_empty(xs)) stop("Vector is empty.")
        else xs[1]
}

# Get the tail of a vector without its first element, raising an exception if the vector is empty.
tl = function(xs) { # xs: a vector of any type
        if (is_empty(xs)) stop("Vector is empty.")
        else xs[-1]
}

Having defined is_empty(), hd() and tl(), we can use them inside of recursive functions to perform complex operations on vectors. For example, we can sum up all values in an integer or numeric vector.

sum_vec = function(xs) {
        # xs: an interger or numeric vector
        if (is_empty(xs)) 0
        else hd(xs) + sum_vec(tl(xs))
}

{
        sum_vec(integer())
        sum_vec(1)
        sum_vec(1:10)
}
## [1] 55

We can also count from n down to 0 and return a vector with integer elements of n, n-1, …, 0.

countdown = function(n) {
        # n: a positive integer
        if (n == 0) integer()
        else c(n, countdown(n-1))
}
countdown(10)
##  [1] 10  9  8  7  6  5  4  3  2  1

Both sum_vec() and countdown() are recursive functions. A recursive function (or recursion) has a base case and a recursive case. For example, in sum_vec(), the base case is when the input vector is empty, and the result is just 0. When the vector is not empty, we enter the recursive case and get the result by adding its first element and the result of calling sum_vec() on its tail, which is also a vector. In countdown(), the base case is when n is 0, and the result is an empty integer vector. We enter the recursive case as long as n > 0, and get the result by consing n onto the result of calling countdown() on n-1. In general, when thinking about recursion, we want to reason as follows:

  1. What’s the base case? What should the result be under the base case?
  2. What’s the recursive case? How can the result be expressed in terms of the result for the sub-problem (for example, the rest of the vector or n-1).

It is not a coincidence that we’ve written both sum_vec() and countdown() recursively. From the FP perspective, recursion is almost always THE approach when processing or building vectors because a vector can grow or shrink and its length isn’t needed for recursion to work. The alternative approach of using loops and assignment statements is inferior and discouraged. To learn why, google “loops discouraged in functional programming.” There are many good and thorough discussions about this topic on the internet. Here I’ll just give a superficial but important reason: it takes more lines to write the same sum_vec() using a while-loop. We’ll also need extra things that recursion doesn’t, namely, local variables and assignment statements.

sum_vec_while_loop = function(xs) {
        tot = 0 # need a local variable to hold the result
        while (!is_empty(xs)) {    # as long as the loop doesn't end,
                tot = tot + hd(xs) #   need to keep copy-n-modify tot, and 
                xs = tl(xs)        #   need to keep copy-n-modify xs.
        }
        tot
}
  1. The word ‘expression’ here is being used in the sense of expressions in any programming language or mathematical expressions. Do NOT confuse it with the R expression object. 

  2. R is a functional programming language in its core. 

If you enjoyed this post, get updates. It's FREE