Filtering and subsetting data in R Programming Language
The data that we read in our previous recipes exists in R as data frames. Data frames are the primary structures for tabular data in R, that is, data in a row-column format. The columns of a data frame can hold different types, such as numeric or factor. In this recipe, we will cover some simple operations for extracting parts of a data frame, adding a new chunk of data, and filtering the rows that satisfy certain conditions.
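As a quick illustration of the row-column format and the mixed column types described above, the following minimal sketch inspects the built-in iris data set (it ships with base R in the datasets package). The variable names dims and col_types are only illustrative.

```r
# iris is a built-in data frame from the datasets package
dims <- dim(iris)                 # number of rows and number of columns
col_types <- sapply(iris, class)  # class of each column, as a named vector

print(dims)
print(col_types)
```

The first four columns should be reported as numeric and the last one as a factor, which matches the structure shown later with str.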
The following items are needed for this recipe:
- A data frame loaded to be modified or filtered in the R session (in our case, the iris data)
- Another set of data to be added to item 1, or a set of filter conditions to apply to item 1
Perform the following steps to filter and create a subset from a data frame:
- Load the iris data as explained in the earlier recipe.
- To extract the names of the species and corresponding sepal dimensions (length and width), take a look at the structure of the data as follows:
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 …
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 …
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 …
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 …
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- To extract the relevant data to the myiris object, use the data.frame function, which creates a data frame with the defined columns, as follows:
> myiris <- data.frame(Sepal.Length = iris$Sepal.Length, Sepal.Width = iris$Sepal.Width, Species = iris$Species)
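- Alternatively, extract the relevant columns or remove the irrelevant ones by position (however, this style of subsetting should be avoided, because numeric positions silently pick the wrong columns if the column order ever changes):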
> myiris <- iris[,c(1,2,5)]
- Instead of the two previous methods, you can also use the removal approach to extract the data as follows:
> myiris <- iris[,-c(3,4)]
- You can add to the data by adding a new column with cbind or a new row through rbind (the rnorm function generates a random sample from a normal distribution and will be discussed in detail in the next recipe):
> Stalk.Length <- c(rnorm(30,1,0.1), rnorm(30,1.3,0.1), rnorm(30,1.5,0.1), rnorm(30,1.8,0.1), rnorm(30,2,0.1))
> myiris <- cbind(iris, Stalk.Length)
- Alternatively, you can do it in one step as follows:
> myiris$Stalk.Length = c(rnorm(30,1,0.1),rnorm(30,1.3,0.1), rnorm(30,1.5,0.1),rnorm(30,1.8,0.1), rnorm(30,2,0.1))
- Check the new data frame using the following commands:
> dim(myiris)
[1] 150 6
> colnames(myiris) # get column names for the data frame myiris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" "Stalk.Length"
- Use rbind to add a new row as depicted:
> newdat <- data.frame(Sepal.Length=10.1, Sepal.Width=0.5, Petal.Length=2.5, Petal.Width=0.9, Species="myspecies")
> myiris <- rbind(iris, newdat)
> dim(myiris)
[1] 151 5
> myiris[151,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
151         10.1         0.5          2.5         0.9 myspecies
- Extract the part of the data frame that meets certain conditions in one of the following ways:
  - One way is to use the subset function:
> mynew.iris <- subset(myiris, Sepal.Length == 10.1)
  - An alternative is to index the rows with a logical condition:
> mynew.iris <- myiris[myiris$Sepal.Length == 10.1, ]
> mynew.iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
151         10.1         0.5          2.5         0.9 myspecies
  - The same approach works on a factor column, for example to keep only the setosa rows:
> mynew.iris <- subset(iris, Species == "setosa")
- Check the first row of the extracted data:
> mynew.iris[1,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
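The steps above can be sketched in a slightly more defensive form. This is only an illustration, not part of the original recipe: it selects columns by name instead of by position, fixes the random seed so the simulated Stalk.Length column is reproducible, and combines row filtering with column selection through the select argument of subset. The object names (by_name, stalk, with_stalk, setosa_sepals) and the seed value are assumptions made for the example.

```r
# Select columns by name rather than by position (robust to column reordering)
by_name <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

# Fix the random seed so the simulated column is reproducible
set.seed(42)  # arbitrary seed, chosen only for illustration
stalk <- c(rnorm(30, 1, 0.1), rnorm(30, 1.3, 0.1), rnorm(30, 1.5, 0.1),
           rnorm(30, 1.8, 0.1), rnorm(30, 2, 0.1))
with_stalk <- cbind(iris, Stalk.Length = stalk)

# Filter rows and pick columns in a single subset() call
setosa_sepals <- subset(iris, Species == "setosa",
                        select = c(Sepal.Length, Sepal.Width, Species))

# Filtering keeps the unused factor levels; droplevels() removes them
setosa_sepals$Species <- droplevels(setosa_sepals$Species)
```

Selecting by name gives the same result as iris[,c(1,2,5)] on the unmodified iris data, but it keeps working if columns are later added or reordered.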
You can use any comparison operator and combine more than one condition with logical operators such as & (AND), | (OR), and ! (NOT), if required.
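As a minimal sketch of combining conditions, the following example applies each of the three logical operators to the built-in iris data. The thresholds and object names (long_setosa, setosa_or_wide, not_setosa) are arbitrary choices made for illustration.

```r
# AND: setosa flowers whose sepal length is greater than 5
long_setosa <- subset(iris, Species == "setosa" & Sepal.Length > 5)

# OR: rows that are setosa or have a sepal width above 4
setosa_or_wide <- iris[iris$Species == "setosa" | iris$Sepal.Width > 4, ]

# NOT: every row except the setosa ones
not_setosa <- subset(iris, !(Species == "setosa"))

# Number of rows matched by each filter
sapply(list(and = long_setosa, or = setosa_or_wide, not = not_setosa), nrow)
```

Note that filtering uses the vectorized forms & and |, which compare element by element; the double forms && and || are meant for single logical values in control flow and are not suitable for row selection.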