Working with data in a matrix (2024)

Loading data

Our example data is quality measurements (particle size) on PVC plastic production, using eight different resin batches, and three different machine operators.

The data set is stored in comma-separated value (CSV) format. Each row is a resin batch, and each column is an operator. In RStudio, open pvc.csv and have a look at what it contains.

read.csv("r-intro-files/pvc.csv", row.names=1)

We have called read.csv with two arguments: the name of the file we want to read, and which column contains the row names. The filename needs to be a character string, so we put it in quotes. Assigning the second argument, row.names, to be 1 indicates that the data file has row names, and which column number they are stored in. If we don’t specify row.names the result will not have row names.
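To see the effect of row.names without touching the real file, the sketch below reads a two-row CSV from a string rather than from disk. The CSV text and its "resin" header are made up for illustration, and the exact layout of pvc.csv may differ. The text argument is a standard read.csv option.

```r
# A tiny, made-up CSV held in a string (hypothetical; not the real pvc.csv).
csv_text <- "resin,Alice,Bob
Resin1,36.25,35.40
Resin2,35.15,35.35"

# Without row.names, the first column is read as ordinary data
# and the rows are simply numbered.
without_names <- read.csv(text = csv_text)
rownames(without_names)
ncol(without_names)

# With row.names = 1, the first column becomes the row names
# and is no longer one of the data columns.
with_names <- read.csv(text = csv_text, row.names = 1)
rownames(with_names)
ncol(with_names)
```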

dat <- read.csv("r-intro-files/pvc.csv", row.names=1)

dat

##        Alice   Bob  Carl
## Resin1 36.25 35.40 35.30
## Resin2 35.15 35.35 33.35
## Resin3 30.70 29.65 29.20
## Resin4 29.70 30.05 28.65
## Resin5 31.85 31.40 29.30
## Resin6 30.20 30.65 29.75
## Resin7 32.90 32.50 32.80
## Resin8 36.80 36.45 33.15

class(dat)

## [1] "data.frame"

str(dat)

## 'data.frame':    8 obs. of  3 variables:
##  $ Alice: num  36.2 35.1 30.7 29.7 31.9 ...
##  $ Bob  : num  35.4 35.4 29.6 30.1 31.4 ...
##  $ Carl : num  35.3 33.4 29.2 28.6 29.3 ...

read.csv has loaded the data as a data frame. A data frame contains a collection of “things” (rows) each with a set of properties (columns) of different types.

Actually this data is better thought of as a matrix [1]. In a data frame the columns can contain different types of data, but in a matrix all the elements are the same type. A matrix in R is like a mathematical matrix, containing all the same type of thing (usually numbers).

R often, but not always, lets these be used interchangeably. When thinking about data it also helps to distinguish between the two, because different operations make sense for data frames and for matrices. Data frames are central to R, and mastering R is very much about thinking in data frames. However, statistical work often involves matrices: for example, when we work with RNA-Seq data we use a matrix of read counts. So it will be worth our time to learn to use matrices as well.

Let us insist to R that what we have is a matrix. as.matrix “casts” our data to have matrix type.

mat <- as.matrix(dat)

class(mat)

## [1] "matrix"

str(mat)

##  num [1:8, 1:3] 36.2 35.1 30.7 29.7 31.9 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8] "Resin1" "Resin2" "Resin3" "Resin4" ...
##   ..$ : chr [1:3] "Alice" "Bob" "Carl"

Much better.
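The "all the same type" rule is easy to see with as.matrix itself. The following is a small sketch using two made-up data frames (not the PVC data): an all-numeric one stays numeric, while one with mixed column types is converted to character throughout.

```r
# An all-numeric data frame becomes a numeric matrix.
numeric_df <- data.frame(
    Alice = c(36.25, 35.15),
    Bob   = c(35.40, 35.35),
    row.names = c("Resin1", "Resin2")
)
numeric_mat <- as.matrix(numeric_df)
is.numeric(numeric_mat)      # TRUE

# A data frame with mixed column types cannot stay numeric:
# every element is converted to character.
mixed_df <- data.frame(
    operator = c("Alice", "Bob"),
    size     = c(36.25, 35.40)
)
mixed_mat <- as.matrix(mixed_df)
is.character(mixed_mat)      # TRUE
```

This is one reason to convert to a matrix only when every column holds the same kind of value.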

Indexing matrices

We can check the size of the matrix with the functions nrow and ncol:

nrow(mat)

## [1] 8

ncol(mat)

## [1] 3

This tells us that our matrix, mat, has 8 rows and 3 columns.

If we want to get a single value from the matrix, we can provide a row and column index in square brackets:

# first value in mat
mat[1, 1]
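The same bracket notation extends to names, ranges, and whole rows or columns. The sketch below rebuilds the example matrix by hand, with values copied from the printout above, so that it runs without the CSV file. Leaving an index empty means "everything along that dimension".

```r
# Rebuild the example matrix by hand so the snippet is self-contained.
mat <- matrix(
    c(36.25, 35.40, 35.30,
      35.15, 35.35, 33.35,
      30.70, 29.65, 29.20,
      29.70, 30.05, 28.65,
      31.85, 31.40, 29.30,
      30.20, 30.65, 29.75,
      32.90, 32.50, 32.80,
      36.80, 36.45, 33.15),
    ncol = 3, byrow = TRUE,
    dimnames = list(paste0("Resin", 1:8), c("Alice", "Bob", "Carl"))
)

mat[4, 2]               # row 4, column 2
mat["Resin4", "Bob"]    # the same element, selected by name
mat[1:2, c(1, 3)]       # rows 1 to 2, columns 1 and 3 (a smaller matrix)
mat[5, ]                # all of row 5 (a named vector)
mat[, "Carl"]           # all of the Carl column (a named vector)
```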

Summary functions

Now let’s perform some common mathematical operations to learn about our data. When analyzing data we often want to look at partial statistics, such as the maximum value per resin or the average value per operator. One way to do this is to select the data we want as a new temporary variable, and then perform the calculation on this subset:

# first row, all of the columns
resin_1 <- mat[1, ]

# max particle size for resin 1
max(resin_1)

## [1] 36.25

We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:

# max particle size for resin 2
max(mat[2, ])

## [1] 35.35

R has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:

# minimum particle size for operator 3
min(mat[, 3])

## [1] 28.65

# mean for operator 3
mean(mat[, 3])

## [1] 31.4375

# median for operator 3
median(mat[, 3])

## [1] 31.275

# standard deviation for operator 3
sd(mat[, 3])

Challenge - Subsetting data in a matrix

Suppose you want to determine the maximum particle size for resin 5 across operators 2 and 3. To do this you would extract the relevant slice from the matrix and calculate the maximum value. Which of the following lines of R code gives the correct answer?

max(mat[5, ])
max(mat[2:3, 5])
max(mat[5, 2:3])
max(mat[5, 2, 3])

Summarizing matrices

What if we need the maximum particle size for all resins, or the average for each operator? As the diagram below shows, we want to perform the operation across a margin of the matrix:

To support this, we can use the apply function.

apply allows us to repeat a function on all of the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix. We can think of apply as collapsing the matrix down to just the dimension specified by MARGIN, with rows being dimension 1 and columns dimension 2 (recall that when indexing the matrix we give the row first and the column second).

Thus, to obtain the average particle size of each resin we will need to calculate the mean of all of the rows (MARGIN = 1) of the matrix.

avg_resin <- apply(mat, 1, mean)

And to obtain the average particle size for each operator we will need to calculate the mean of all of the columns (MARGIN = 2) of the matrix.

avg_operator <- apply(mat, 2, mean)

Since the second argument to apply is MARGIN, the above command is equivalent to apply(mat, MARGIN = 2, mean).
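For the specific case of means, base R also provides rowMeans and colMeans. As a quick check, the sketch below rebuilds the matrix by hand (values copied from the printout earlier, so it runs without the CSV file) and confirms that they agree with the apply calls.

```r
# Rebuild the example matrix by hand so the snippet is self-contained.
mat <- matrix(
    c(36.25, 35.40, 35.30,
      35.15, 35.35, 33.35,
      30.70, 29.65, 29.20,
      29.70, 30.05, 28.65,
      31.85, 31.40, 29.30,
      30.20, 30.65, 29.75,
      32.90, 32.50, 32.80,
      36.80, 36.45, 33.15),
    ncol = 3, byrow = TRUE,
    dimnames = list(paste0("Resin", 1:8), c("Alice", "Bob", "Carl"))
)

avg_resin    <- apply(mat, 1, mean)   # one mean per row (resin)
avg_operator <- apply(mat, 2, mean)   # one mean per column (operator)

# The dedicated helpers give the same answers.
all.equal(avg_resin, rowMeans(mat))
all.equal(avg_operator, colMeans(mat))
```

apply remains the more general tool, since it accepts any function (max, sd, or one you write yourself), not just the mean.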

Challenge - summarizing the matrix

How would you calculate the standard deviation for each resin?

Advanced: How would you calculate the values two standard deviations above and below the mean for each resin?

t test

R has many statistical tests built in. One of the most commonly used tests is the t test. Do the means of two vectors differ significantly?

mat[1,]

## Alice   Bob  Carl
## 36.25 35.40 35.30

mat[2,]

## Alice   Bob  Carl
## 35.15 35.35 33.35

t.test(mat[1,], mat[2,])

##
##  Welch Two Sample t-test
##
## data:  mat[1, ] and mat[2, ]
## t = 1.4683, df = 2.8552, p-value = 0.2427
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.271985  3.338652
## sample estimates:
## mean of x mean of y
##  35.65000  34.61667

Actually, this can be considered a paired sample t-test, since the values can be paired up by operator. By default t.test performs an unpaired t test. We see in the documentation (?t.test) that we can give paired=TRUE as an argument in order to perform a paired t-test.

t.test(mat[1,], mat[2,], paired=TRUE)

##
##  Paired t-test
##
## data:  mat[1, ] and mat[2, ]
## t = 1.8805, df = 2, p-value = 0.2008
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.330952  3.397618
## sample estimates:
## mean of the differences
##                1.033333
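One way to see what paired=TRUE does is to note that a paired t test is the same as a one-sample t test on the per-operator differences. The sketch below types in the two rows by hand (values copied from the output above) so that it runs on its own. The variable names are chosen here for illustration.

```r
# The two rows of the matrix, typed in by hand.
resin_1 <- c(Alice = 36.25, Bob = 35.40, Carl = 35.30)
resin_2 <- c(Alice = 35.15, Bob = 35.35, Carl = 33.35)

# Paired test on the two vectors...
paired <- t.test(resin_1, resin_2, paired = TRUE)

# ...and a one-sample test on their element-wise differences.
one_sample <- t.test(resin_1 - resin_2)

# Both give the same t statistic and p-value.
paired$statistic
one_sample$statistic
paired$p.value
one_sample$p.value
```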

Challenge - using t.test

Can you find a significant difference between any two resins?

When we call t.test it returns an object that behaves like a list. Recall that in R a list is a miscellaneous collection of values.

result <- t.test(mat[1,], mat[2,], paired=TRUE)

names(result)

## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
## [6] "null.value"  "alternative" "method"      "data.name"

result$p.value

## [1] 0.2007814

This means we can write software that uses the various results from t.test, for example performing a whole series of t tests and reporting the significant results.
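As a rough sketch of that idea, the code below runs a paired t test for every pair of resins and keeps the p-values. It rebuilds the matrix by hand so that it runs without the CSV file. The 0.05 threshold and the variable names are assumptions made for illustration, not part of the lesson.

```r
# Rebuild the example matrix by hand so the snippet is self-contained.
mat <- matrix(
    c(36.25, 35.40, 35.30,
      35.15, 35.35, 33.35,
      30.70, 29.65, 29.20,
      29.70, 30.05, 28.65,
      31.85, 31.40, 29.30,
      30.20, 30.65, 29.75,
      32.90, 32.50, 32.80,
      36.80, 36.45, 33.15),
    ncol = 3, byrow = TRUE,
    dimnames = list(paste0("Resin", 1:8), c("Alice", "Bob", "Carl"))
)

# All unordered pairs of resins, one pair per column.
resin_pairs <- combn(rownames(mat), 2)

# Paired t test for each pair, keeping only the p-value.
p_values <- apply(resin_pairs, 2, function(pair) {
    t.test(mat[pair[1], ], mat[pair[2], ], paired = TRUE)$p.value
})
names(p_values) <- paste(resin_pairs[1, ], resin_pairs[2, ], sep = " vs ")

# Report the pairs below the chosen threshold, smallest p-value first.
alpha <- 0.05
sort(p_values[p_values < alpha])
```

With this many tests, some small p-values are expected by chance alone. The base R function p.adjust provides standard corrections for multiple comparisons.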

Plotting

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R’s plotting features.

Let’s take a look at the average particle size per resin. Recall that we already calculated these values above using apply(mat, 1, mean) and saved them in the variable avg_resin. Plotting the values is done with the function plot.

plot(avg_resin)

Above, we gave the function plot a vector of numbers corresponding to the average per resin across all operators. plot created a scatter plot where the y-axis is the average particle size and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 8 resins.

plot can take many different arguments to modify the appearance of the output. Here is a plot with some extra arguments:

plot(avg_resin, xlab="Resin", ylab="Particle size", main="Average particle size per resin", type="b")

Challenge - plotting data

Create a plot showing the standard deviation for each resin.

Saving plots

It’s possible to save a plot as a .PNG or .PDF from the RStudio interface with the “Export” button. However if we want to keep a complete record of exactly how we create each plot, we prefer to do this with R code.

Plotting in R is sent to a “device”. By default, this device is RStudio. However we can temporarily send plots to a different device, such as a .PNG file (png("filename.png")) or .PDF file (pdf("filename.pdf")).

pdf("test.pdf")
plot(avg_resin)
dev.off()

dev.off() is very important. It tells R to stop outputting to the pdf device and return to using the default device. If you forget, your interactive plots will stop appearing as expected!
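One way to guard against a forgotten dev.off() is to wrap the plotting in a function and register the call with on.exit, so the device is closed even if the plot fails. This is a sketch only: save_plot_pdf is a hypothetical helper name, and the values plotted are arbitrary.

```r
# Hypothetical helper: write a plot of `values` to a PDF file at `path`.
save_plot_pdf <- function(values, path) {
    pdf(path)
    # Close the device when the function exits, even if plot() fails.
    on.exit(dev.off())
    plot(values, type = "b")
    invisible(path)
}

# Arbitrary example values, written to a temporary file.
out_file <- save_plot_pdf(c(1, 4, 9, 16), tempfile(fileext = ".pdf"))
file.exists(out_file)
```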

The file you created should appear in the file manager pane of RStudio. You can view it by clicking on it.

[1] We use matrix here in the mathematical sense, not the biological sense.
