Understanding and characterizing variation in samples is an important part of statistics. Variation can be measured with several different statistics. Range is the difference between the largest and smallest values in a distribution, and is calculated in R with range(). Because range tends to increase with sample size, and because it is highly sensitive to outliers, it is often not a desirable measure of variation.
Variance is a much more useful measure of variation. Variance of a population is equal to the average squared deviation of every observation from the population mean. It is symbolized by a Greek lowercase sigma-squared (σ²).
The population mean is typically unknown, so sample variance is calculated as the sum of squared deviations of every observation from the sample mean, divided by the degrees of freedom (n-1). Losing one degree of freedom is the penalty for using the sample mean as an estimate of the population mean. Sample variance is symbolized by a Roman lowercase s-squared (s²).
In R, sample variance is calculated with the var() function. In those rare cases where you need a population variance, multiply the result of var() by (n-1)/n, which changes the divisor from n-1 to n. Note that as sample size gets very large, sample variance converges on the population variance.
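For example, the conversion can be checked against the definition directly (the vector here is arbitrary):

```r
x <- c(4, 7, 8, 11, 15)             # a small illustrative sample
n <- length(x)

sampleVar <- var(x)                 # divides the sum of squares by n-1
popVar <- sampleVar * (n - 1) / n   # rescale so the divisor is n

# the same value, computed straight from the definition of population variance
mean((x - mean(x))^2)
```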
Because variance is the average squared deviation from the mean, the units of variance are the square of the original measurements. For example, if the length of shells is measured in millimeters, variance has units of mm². Taking the square root of variance gives standard deviation, which therefore has the same units as the original measurements, making it more easily understood. In R, sample standard deviation is calculated with the sd() function.
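The relationship between the two functions is easy to check on any vector (the shell lengths here are invented):

```r
shellLengths <- c(12.1, 13.5, 14.2, 15.8)  # hypothetical lengths, in mm

var(shellLengths)   # variance, in mm^2
sd(shellLengths)    # standard deviation, in mm: the square root of the variance
```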
A normal distribution is scaled by the standard deviation, with 68.3% of the distribution within one standard deviation of the mean, 95.4% within two standard deviations of the mean, and 99.7% within three standard deviations (you can calculate these easily with the pnorm() function). These are good rules of thumb to remember, particularly that plus or minus two standard deviations encompasses roughly 95% of a normal distribution.
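These percentages come from the cumulative normal distribution, which pnorm() returns; subtracting the lower tail from the upper gives the probability of falling within a given number of standard deviations:

```r
pnorm(1) - pnorm(-1)   # about 0.683
pnorm(2) - pnorm(-2)   # about 0.954
pnorm(3) - pnorm(-3)   # about 0.997
```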
The coefficient of variation is the standard deviation divided by the mean. Because the units cancel, it is a dimensionless number, which makes it useful for describing and comparing variation independently of the units of measurement.
There is no built-in function for the coefficient of variation in R, but writing such a function is straightforward:
CV <- function(x) { sd(x) / mean(x) }
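For example (the vector is arbitrary, and the function is repeated here so the snippet runs on its own):

```r
CV <- function(x) { sd(x) / mean(x) }   # as defined above

CV(c(2, 4, 6, 8))   # sd is about 2.58 and the mean is 5, so CV is about 0.516
```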
Comparing variances in two samples
To compare variances, we express them as a ratio, known as an F statistic. The F statistic is named for its discoverer, the biostatistician R.A. Fisher (of p-value and Modern Synthesis fame). It is therefore also called Fisher’s F.
For two samples drawn from the same population, and therefore expected to have the same variance, it is easy to simulate the distribution of the F-statistic. To do this, we will assume that they are drawn from a population that is normally distributed. We set the two sample sizes, generate normally distributed values for both samples, calculate their variances, and then take the ratio of those variances. This is repeated many times (10,000 here) using the replicate() function, which saves the simulated F-ratios to an object named F.
sampleSize1 <- 12
sampleSize2 <- 25
numTrials <- 10000
F <- replicate(numTrials, var(rnorm(sampleSize1))/var(rnorm(sampleSize2)))
hist(F, breaks=50, col='salmon', main=paste('n1=',sampleSize1,', n2=',sampleSize2, sep=''))
This distribution matches what we would expect. Two samples from the same population should typically have similar variances, so the mode of the F-ratio is commonly near 1. Less commonly, one of the samples will have a much larger variance than the other, creating the two tails of the distribution. Because variance is always positive, the smallest possible value of the F-ratio is 0 (when the numerator is zero). The largest possible value would be positive infinity (when the denominator is zero). As a result, the left tail is short, and the right tail is long. The probability of generating such extreme F-ratios depends on the two sample sizes, so the shape of the F-distribution reflects the degrees of freedom for the numerator and for the denominator.
In practice, F distributions come from analytic solutions, not from simulations. These analytic solutions assume that both samples come from normal distributions, and this is an important consideration in any application of the F distribution.
As with all distributions, R comes with functions to explore the F distribution, which include df() to find the height of the distribution from F and the two degrees of freedom; pf() to obtain a p-value from an observed F and the two degrees of freedom; qf() to obtain a critical value from alpha and the two degrees of freedom; and rf() to obtain random variates from an F distribution from the two degrees of freedom. Here is the same distribution that we simulated above, but calculated analytically with df().
F <- seq(from=0, to=5, by=0.01)
density <- df(F, df1=sampleSize1-1, df2=sampleSize2-1)
plot(F, density, type='l', lwd=2, las=1)
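Of these, pf() and qf() are the ones needed most often. Continuing with the same degrees of freedom (11 and 24), a critical value and a p-value would be found like this:

```r
df1 <- 12 - 1   # numerator degrees of freedom (sampleSize1 - 1)
df2 <- 25 - 1   # denominator degrees of freedom (sampleSize2 - 1)

qf(0.95, df1, df2)                    # critical value for alpha = 0.05, one-tailed
pf(2.5, df1, df2, lower.tail=FALSE)   # p-value for a hypothetical observed F of 2.5
```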
Once we have a statistic and the distribution of expected values, we can place confidence limits on an F-ratio. We can also test a null hypothesis about the F-ratio; the null hypothesis is usually that the two samples were drawn from populations with the same variance (i.e., that the true ratio of variances equals 1).
The F-test is based on two assumptions, which you must verify before performing the test. The first is random sampling, as with all tests. The second is that the populations from which both samples are drawn are normally distributed. If your data are not normally distributed, a data transformation may help; if it does not, you will need to use a nonparametric test to compare the variances.
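One common way to assess the normality assumption is a normal quantile plot, optionally backed by a Shapiro-Wilk test; this is a sketch with simulated data, not the only approach:

```r
set.seed(42)                        # makes the simulated data reproducible
mydata <- rnorm(30, mean=3, sd=2)   # normally distributed by construction

qqnorm(mydata)          # points should fall close to a straight line
qqline(mydata)
shapiro.test(mydata)    # a small p-value suggests departure from normality
```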
An F test is performed in R with the var.test() function. Here is a demonstration using two simulated data sets:
mydata1 <- rnorm(50, mean=3, sd=2)
mydata2 <- rnorm(44, mean=2, sd=1.4)
var.test(mydata1, mydata2)
This is by default a two-tailed test, and it does not matter which variance you put in the numerator and denominator. You can specify one-tailed tests with the alternative argument. You can also change the confidence level with the conf.level argument. The var.test() function produces the following:
F test to compare two variances
data: mydata1 and mydata2
F = 1.8911, num df = 49, denom df = 43, p-value = 0.03519
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.046032 3.380349
sample estimates:
ratio of variances
1.891094
The ratio of variances (F ratio) is listed at the end of the output, along with the confidence interval on the ratio of variances three lines above that. If you are performing a hypothesis test, report the values shown in the line beginning “F = ”; be sure to include the F-ratio, both degrees of freedom, and the p-value.
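If you need these values for later calculations, the object returned by var.test() is a list whose components can be extracted by name (shown here with simulated data):

```r
set.seed(1)
mydata1 <- rnorm(50, mean=3, sd=2)
mydata2 <- rnorm(44, mean=2, sd=1.4)

result <- var.test(mydata1, mydata2)

result$statistic   # the F-ratio
result$parameter   # the numerator and denominator degrees of freedom
result$p.value     # the p-value
result$conf.int    # the confidence interval on the ratio of variances
```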
Testing multiple variances: Bartlett test
In some cases, you may have multiple variances that you wish to compare. To do this, use Bartlett's test for the homogeneity of variances. Like the F-test, Bartlett's test is a parametric test: it assumes that the data are normally distributed. In R, Bartlett's test is run with bartlett.test(). Your data should be set up with the measurement variable in one vector and a grouping variable in a second vector. The test is run like this:
bartlett.test(measurementVariable, groupingVariable)
For example, suppose you have alkalinity measurements made in several streams, with multiple replicates in each stream. To test whether the variance in alkalinity is the same in all streams, the call would be:
bartlett.test(alkalinity, stream)
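A self-contained version with simulated data (the stream labels and alkalinity values are invented for illustration):

```r
set.seed(10)
alkalinity <- c(rnorm(8, mean=120, sd=10),   # stream A
                rnorm(8, mean=135, sd=10),   # stream B
                rnorm(8, mean=150, sd=10))   # stream C
stream <- factor(rep(c('A', 'B', 'C'), each=8))

bartlett.test(alkalinity, stream)
```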
The non-parametric Ansari-Bradley test
If your data are not normally distributed, you can try a data transformation (such as the log transformation) to see if the distributions can be made normal. If a data transformation does not produce normality, you will need to use a non-parametric test, such as the Ansari-Bradley test.
The Ansari-Bradley test makes only one assumption: random sampling. It is called in R like this:
ansari.test(mydata1, mydata2)
The output consists of the AB statistic and a p-value. As with other non-parametric tests, it does not provide confidence limits on a population parameter.
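A complete call looks like this (data simulated for illustration):

```r
set.seed(3)
mydata1 <- rnorm(30, mean=5, sd=2)
mydata2 <- rnorm(30, mean=5, sd=2)

result <- ansari.test(mydata1, mydata2)
result$statistic   # the AB statistic
result$p.value
```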
Background reading
Read chapter 4 of Crawley, entering the code in R as always. Skim (or read, but don’t get bogged down on) the part on the bootstrap, a technique we will cover later.