Data Analysis in the Geosciences (2024)

Understanding and characterizing variation in samples is an important part of statistics. Variation can be measured with several different statistics. Range is the difference between the largest and smallest values in a distribution; in R, range() returns the smallest and largest values, from which that difference can be calculated. Because range tends to increase with sample size, and because it is highly sensitive to outliers, it is often not a desirable measure of variation.
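For example, with a small hypothetical vector of shell lengths (the values here are made up purely for illustration):

shellLength <- c(4.1, 5.3, 3.8, 6.2, 4.9)   # hypothetical shell lengths (mm)
range(shellLength)                           # smallest and largest values: 3.8 6.2
diff(range(shellLength))                     # their difference, the range: 2.4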

Variance is a much more useful measure of variation. Variance of a population is equal to the average squared deviation of every observation from the population mean. It is symbolized by a Greek lowercase sigma-squared (σ²).

σ² = Σ(xᵢ − μ)² / N, where xᵢ is each observation, μ is the population mean, and N is the number of observations in the population.

The population mean is typically unknown, so sample variance is calculated as the sum of squared deviations of every observation from the sample mean, divided by the degrees of freedom (n − 1). Dividing by n − 1 rather than n is the penalty for using the sample mean as an estimate of the population mean. Sample variance is symbolized by a Roman lowercase s-squared (s²).

s² = Σ(xᵢ − x̄)² / (n − 1), where x̄ is the sample mean and n is the sample size.

In R, sample variance is calculated with the var() function. In those rare cases where you need a population variance (that is, when your data constitute the entire population rather than a sample from it), calculate the variance with var() and multiply the result by (n−1)/n; note that as sample size gets very large, sample variance converges on the population variance.
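As a minimal sketch with hypothetical values, the following shows that var() matches the sum-of-squares formula, and how the (n−1)/n correction converts it to a population variance:

x <- c(2.3, 3.1, 4.8, 2.9, 3.6)      # hypothetical measurements
n <- length(x)
sum((x - mean(x))^2) / (n - 1)       # sample variance from the formula
var(x)                               # the same value, from the built-in function
var(x) * (n - 1) / n                 # population variance, if x were the entire population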

Because variance is the average squared deviation from the mean, the units of variance are the square of the original measurements. For example, if the length of shells is measured in millimeters, variance has units of mm². Taking the square root of variance gives standard deviation, which therefore has the same units as the original measurements, making it more easily understood. In R, sample standard deviation is calculated with the sd() function.
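Continuing with the hypothetical vector from above, sd() is simply the square root of var():

sd(x)           # sample standard deviation, about 0.94 for the hypothetical data
sqrt(var(x))    # identical value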

A normal distribution is scaled by the standard deviation, with 68.3% of the distribution within one standard deviation of the mean, 95.4% within two standard deviations of the mean, and 99.7% within three standard deviations (you can verify these percentages with the pnorm() function). These are good rules of thumb to remember, particularly that plus or minus two standard deviations encompasses roughly 95% of a normal distribution.
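For example, these percentages can be verified for a standard normal distribution:

pnorm(1) - pnorm(-1)    # 0.683, within one standard deviation of the mean
pnorm(2) - pnorm(-2)    # 0.954, within two standard deviations
pnorm(3) - pnorm(-3)    # 0.997, within three standard deviations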


The coefficient of variation is the standard deviation divided by the mean; because the units cancel, it is a dimensionless number. Its value lies in providing a unit-free way of describing variation.

CV = s / x̄

There is no built-in function for the coefficient of variation in R, but writing such a function is straightforward:

CV <- function(x) { sd(x) / mean(x) }
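For instance, applying it to the hypothetical vector from above:

CV(x)          # roughly 0.28 for the hypothetical data
CV(x) * 100    # the same quantity expressed as a percentage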

Comparing variances in two samples

To compare variances, we express them as a ratio, known as an F statistic. The F statistic is named for its discoverer, the biostatistician R.A. Fisher (of p-value and Modern Synthesis fame). It is therefore also called Fisher’s F.

F = s₁² / s₂², the ratio of the two sample variances.

For two samples drawn from the same population, which therefore ought to have the same variance, it is easy to simulate the distribution of the F-statistic. To do this, we will assume the population is normally distributed. We set the two sample sizes, generate normally distributed values for each sample, calculate their variances, and take the ratio of those variances. This is repeated many times (10,000 here) using the replicate() function, which saves the simulated F-ratios to an object named F.

sampleSize1 <- 12
sampleSize2 <- 25
numTrials <- 10000

F <- replicate(numTrials, var(rnorm(sampleSize1))/var(rnorm(sampleSize2)))

hist(F, breaks=50, col='salmon', main=paste('n1=',sampleSize1,', n2=',sampleSize2, sep=''))
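Although not part of the original walkthrough, the middle 95% of these simulated F-ratios can be read off directly with quantile(), which gives a feel for how far the ratio can stray from 1 by chance alone:

quantile(F, c(0.025, 0.975))    # simulated 2.5% and 97.5% cutoffs for the F-ratio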


This distribution matches what we would expect. Two samples from the same population should typically have similar variances, so the mode of the F-ratio is commonly near 1. Less commonly, one of the samples will have a much larger variance than the other, creating the two tails of the distribution. Because variance is always positive, the smallest possible value of the F-ratio is 0 (when the numerator is zero), and the largest possible value is positive infinity (when the denominator is zero). As a result, the left tail is short, and the right tail is long. The probability of generating such extreme F-ratios depends on the sizes of the two samples, so the shape of the F-distribution reflects the degrees of freedom for the numerator and for the denominator.

In practice, F distributions come from analytic solutions, not from simulations. These analytic solutions assume that both samples come from normal distributions, and this is an important consideration in any application of the F distribution.

As with all distributions, R comes with functions to explore the F distribution: df() gives the height (density) of the distribution for a given F and the two degrees of freedom; pf() gives the cumulative probability up to an observed F, from which a p-value can be calculated; qf() gives a critical value of F from a probability (such as alpha) and the two degrees of freedom; and rf() generates random variates from an F distribution with the two degrees of freedom. Here is the same distribution that we simulated above, but calculated analytically with df().

F <- seq(from=0, to=5, by=0.01)
density <- df(F, df1=sampleSize1-1, df2=sampleSize2-1)
plot(F, density, type='l', lwd=2, las=1)
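In the same way, qf() and pf() give critical values and tail probabilities for these degrees of freedom; the F value of 2.5 below is just an illustrative number:

qf(0.975, df1=sampleSize1-1, df2=sampleSize2-1)      # upper critical value for a two-tailed test at alpha = 0.05
1 - pf(2.5, df1=sampleSize1-1, df2=sampleSize2-1)    # probability of an F-ratio of 2.5 or larger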


Once we have a statistic and the distribution of its expected values, we can place confidence limits on an F-ratio. We can also test a null hypothesis about the F-ratio, usually that the two samples were drawn from populations with the same variance (i.e., a null of no difference in variance, corresponding to an F-ratio of 1).

The F-test is based on two assumptions, which you must demonstrate to be true before you perform the test. The first is random sampling, an assumption shared by all statistical tests. The second is that the populations from which both samples are drawn are normally distributed. If your data are not normally distributed, a data transformation may help; if it does not, you will need to use a nonparametric test to compare the variances.
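Checking normality is not covered in this section, but as a hedged sketch, a normal quantile plot is one common way to judge normality and whether a log transformation helps; the vector myMeasurements is hypothetical:

myMeasurements <- rlnorm(40)      # hypothetical right-skewed (lognormal) data
qqnorm(myMeasurements)            # points fall near a straight line only if the data are normal
qqline(myMeasurements)
qqnorm(log(myMeasurements))       # after a log transformation, these data look normal
qqline(log(myMeasurements))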

An F test is performed in R with the var.test() function. Here is a demonstration using two simulated data sets:

mydata1 <- rnorm(50, mean=3, sd=2)
mydata2 <- rnorm(44, mean=2, sd=1.4)
var.test(mydata1, mydata2)

This is by default a two-tailed test, and it does not matter which variance you put in the numerator and denominator. You can specify one-tailed tests with the alternative argument. You can also change the confidence level with the conf.level argument. The var.test() function produces the following:

F test to compare two variances

data: mydata1 and mydata2
F = 1.8911, num df = 49, denom df = 43, p-value = 0.03519
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.046032 3.380349
sample estimates:
ratio of variances
1.891094

The ratio of variances (F ratio) is listed at the end of the output, along with the confidence interval on the ratio of variances three lines above that. If you are performing a hypothesis test, report the values shown in the line beginning “F = ”; be sure to include the F-ratio, both degrees of freedom, and the p-value.
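To illustrate those arguments (the choices here are purely for demonstration), a one-tailed test of whether the first variance is the larger one, and a test reporting a 99% confidence interval, would be called like this:

var.test(mydata1, mydata2, alternative='greater')   # one-tailed: is the variance of mydata1 greater?
var.test(mydata1, mydata2, conf.level=0.99)         # two-tailed test with a 99% confidence interval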

Testing multiple variances: Bartlett test

In some cases, you may have multiple variances that you wish to compare. To do this, use Bartlett's test for the homogeneity of variances. Like the F-test, Bartlett's test is a parametric test: it assumes that the data are normally distributed. In R, Bartlett's test is run with bartlett.test(). Your data should be set up with the measurement variable in one vector and a grouping variable in a second vector. The test is run like this:

bartlett.test(measurementVariable, groupingVariable)

For example, suppose you have alkalinity measurements made in several streams, with multiple replicates in each stream. If you wanted to test whether the variance in alkalinity is the same in all streams, your test would be called like this:

bartlett.test(alkalinity, stream)
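A minimal sketch of that setup, with made-up alkalinity values and stream names (all hypothetical):

alkalinity <- c(120, 132, 128, 95, 103, 99, 151, 147, 160)    # hypothetical alkalinity values (mg/L)
stream <- c('Bear', 'Bear', 'Bear', 'Otter', 'Otter', 'Otter', 'Mill', 'Mill', 'Mill')    # stream for each measurement
bartlett.test(alkalinity, stream)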

The non-parametric Ansari-Bradley test

If your data are not normally distributed, you can try a data transformation (such as the log transformation) to see if the distributions can be made normal. If a data transformation does not produce normality, you will need to use a non-parametric test, such as the Ansari-Bradley test.

The Ansari-Bradley test makes only one assumption: random sampling. It is called in R like this:

ansari.test(mydata1, mydata2)

The output consists of the AB statistic and a p-value. As with all non-parametric tests, confidence limits on a population parameter are not possible.
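As a quick hedged sketch with simulated non-normal data (exponentially distributed values, just for illustration):

skewed1 <- rexp(30)              # hypothetical right-skewed data
skewed2 <- rexp(35)              # a second right-skewed sample
ansari.test(skewed1, skewed2)    # reports the AB statistic and a p-value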

Background reading

Read chapter 4 of Crawley, entering the code in R as always. Skim (or read, but don’t get bogged down on) the part on the bootstrap, a technique we will cover later.


