Introduction
This Reading Assignment examines how data in a sample can be collected and then used to provide information on the wider population. Many of the examples use the sample mean to estimate the population mean, a practice often used in finance. The Central Limit Theorem allows us to make probability statements about a population mean based on sample data. It is imperative that you understand the concept and calculation of confidence intervals for the population mean, and when to use the z-statistic or the t-statistic.
Sampling
There are different ways of selecting a sample from a population. The basic type of sample is a simple random sample. In a simple random sample, each item or person in the population has an equal probability of being included.
Example -1: Simple random sample
Put each member of a population in a sequence and identify each member by a number. Then use random number tables to select as many numbers as the required sample size. Match these numbers to the members of the population to identify the sample.
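The selection procedure in this example can be sketched in Python. This is only an illustration: the population of 500 numbered members, the sample size of 20 and the seed are assumptions, and the standard library's random number generator stands in for random number tables.

```python
import random

# Hypothetical population: members identified by the numbers 1 to 500.
population_ids = list(range(1, 501))

# A seeded generator stands in for random number tables (seed chosen arbitrarily).
rng = random.Random(42)

# Draw 20 distinct numbers; every member has the same chance of being included.
sample_ids = rng.sample(population_ids, k=20)

print(sorted(sample_ids))
```

The selected numbers would then be matched back to the members of the population.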
When it is not practical to assign a number to each item in a population then we might use systematic random sampling. In this case the items are arranged and then every nth item is included in the sample. This assumes that there is no pattern to the way that the items are arranged.
Example 11-2: Systematic random sample
A chocolate bar manufacturer selects every 100th chocolate bar coming off a conveyor belt for inclusion in a sample to test the weights of chocolate bars being produced.
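Systematic selection of every 100th item can be sketched as follows. The production run of 1,000 numbered bars is an assumption made purely for illustration.

```python
# Hypothetical production run: bars numbered in the order they leave the belt.
bars = list(range(1, 1001))

interval = 100  # select every 100th bar, as in the example

# Start at the 100th bar and step by the interval: bars 100, 200, ..., 1000.
systematic_sample = bars[interval - 1::interval]

print(systematic_sample)
```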
Although a random sample will reflect the characteristics of the population in an unbiased way, there is likely to be a difference between the estimate from the sample and the actual population characteristic.
Sampling error is defined as the difference between the observed value of a sample statistic and the quantity that it is being used to estimate from the population.
Example -3: Sampling error
A chocolate bar manufacturer selects every 100th chocolate bar coming off a conveyor belt for inclusion in a sample to test the weights of chocolate bars being produced. The mean weight of chocolate bars in the sample is 105 grams, the mean population weight is 100 grams, and sampling error is therefore 5 grams.
A sampling distribution of a statistic is the distribution of all possible distinct values that the statistic can assume when samples of the same size are randomly taken from the population.
For example the sampling distribution of the sample mean is the distribution of all possible sample means of a given sample size and the probability of occurrence of each sample mean.
Example -4: Sampling distribution
The four employees of a firm have worked for the firm for 3, 7, 8 and 12 years. To calculate the sampling distribution of the sample mean for samples of two workers, calculate the means for all possible samples:
| Employees in Sample (years worked) | Sample Mean (in years) |
| 3 and 7 | 5.0 |
| 3 and 8 | 5.5 |
| 3 and 12 | 7.5 |
| 7 and 8 | 7.5 |
| 7 and 12 | 9.5 |
| 8 and 12 | 10.0 |
Therefore the sampling distribution of the sample mean is:
| Sample Mean (in years) | Probability |
| 5.0 | 0.167 |
| 5.5 | 0.167 |
| 7.5 | 0.333 |
| 9.5 | 0.167 |
| 10.0 | 0.167 |
We can see from the example that the mean of the sample means (7.5 years) is the same as the population mean, and that the standard deviation of the distribution of sample means is less than that of the population.
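The sampling distribution in this example can be reproduced by enumerating every possible sample of two employees. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from statistics import mean, pstdev

years = [3, 7, 8, 12]  # population from the example

# Mean of every possible sample of two employees.
sample_means = [sum(pair) / 2 for pair in combinations(years, 2)]

# Probability of each distinct sample mean.
counts = Counter(sample_means)
distribution = {
    m: Fraction(c, len(sample_means)) for m, c in sorted(counts.items())
}

for m, p in distribution.items():
    print(f"{m:5.1f}  {float(p):.3f}")

print("population mean:", mean(years))
print("mean of sample means:", mean(sample_means))
print("population standard deviation:", round(pstdev(years), 3))
print("standard deviation of sample means:", round(pstdev(sample_means), 3))
```

Both means come out at 7.5 years, while the standard deviation of the sample means is smaller than the population standard deviation.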
Another method of taking a sample is stratified random sampling. In this case we divide the population into subgroups (or strata) and select a sample from each subgroup. If it is a proportional sample, then the number of items selected from each subgroup is proportional to the subgroup's share of the total population.
Example -5: Stratified random sampling
If we wish to study the usage of cars by a population of car owners we might decide to divide car owners into three subgroups by age as shown below.
| Age | Percentage of car owners | Number in sample |
| Under 25 years | 15% | 300 |
| 25 years to 55 years | 60% | 1,200 |
| 55 years and over | 25% | 500 |
| Total | 100% | 2,000 |
The number from each group selected for the sample is based on the percentage of car owners in that group.
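The proportional allocation in the table can be checked with a few lines of Python. The percentages and the total sample size of 2,000 come from the example; the dictionary layout is just one way to arrange them.

```python
# Percentage of car owners in each age group, from the example.
percent_by_group = {
    "Under 25 years": 15,
    "25 years to 55 years": 60,
    "55 years and over": 25,
}
total_sample_size = 2000

# Each stratum receives the same share of the sample as it has of the population.
allocation = {
    group: total_sample_size * pct // 100
    for group, pct in percent_by_group.items()
}

for group, size in allocation.items():
    print(f"{group}: {size}")
print("Total:", sum(allocation.values()))
```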
Two different forms of data are:
- Time-series data: a sequence of observations, such as returns, collected at discrete and equally spaced time intervals, for example historic monthly stock returns.
- Cross-sectional data: data collected on a characteristic of a group, which might be a group of individuals or companies, at a single point in time. Last year's closing prices for stocks that trade on the NYSE are an example of cross-sectional data.
Central Limit Theorem
For a population with a mean of μ and a variance of σ², the sampling distribution of the sample mean (x̄) of all possible samples of size n will be approximately normally distributed with mean μ and variance σ²/n (assuming n is large, say 30 or over).
To summarize:
- Even if the distribution of the population is not normal, the sampling distribution of the sample mean, x̄, is approximately a normal distribution.
- The mean of the distribution of x̄ will be equal to the mean of the population.
- The variance of the distribution of x̄ will be equal to the variance of the population divided by the sample size.
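These properties can be illustrated with a small simulation. The sketch below is not a proof: it assumes an exponential population (clearly non-normal, with mean 1 and variance 1), a sample size of 30, 5,000 repeated samples and an arbitrary seed.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed non-normal population: exponential with mean 1 and variance 1.
population_mean = 1.0
population_variance = 1.0

n = 30              # size of each sample
num_samples = 5000  # number of repeated samples

# Mean of each simulated sample.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)
variance_of_means = pvariance(sample_means)

print("mean of sample means:", round(mean_of_means, 3))
print("variance of sample means:", round(variance_of_means, 4))
print("population variance / n:", round(population_variance / n, 4))
```

The mean of the simulated sample means should be close to 1, and their variance close to 1/30.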
The standard error of the sample mean is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad (1)$$
This is the standard deviation of the sampling distribution of the sample mean.
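If the population standard deviation (σ) is not known, then we can use the sample standard deviation, s, to estimate the standard error. It is then denoted by:
$$s_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad (2)$$
where:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (3)$$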
Example -6: Standard error of the sample mean
If the standard deviation of a population is 10 and a sample of 49 items is taken from the population, then the standard error of the sample mean is 10/√49 = 10/7 ≈ 1.43.
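As a quick check of this calculation, the standard error can be computed directly. The function name below is illustrative.

```python
from math import sqrt


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the sample mean: standard deviation / sqrt(n)."""
    return std_dev / sqrt(n)


se = standard_error(10, 49)
print(round(se, 3))  # 10 / 7, about 1.429
```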
Estimating a Population Parameter
The formulae that we use to calculate a sample statistic are estimators. The particular value that we calculate using an estimator is an estimate.
A point estimate is a single estimate calculated from a sample which is used to estimate the population parameter. An example of this would be a sample mean being calculated as a point estimate of the population mean.
Another approach is to make an interval estimate of the parameter; this means we find an interval that will include the population parameter with a certain level of probability. This is a confidence interval.
The three desirable properties of an estimator (or estimation formula) are:
- Unbiased - the expected value of the estimator (the mean of its sampling distribution) is the same as the parameter it is intended to estimate.
- Efficient - there is no other unbiased estimator of the same parameter with a sampling distribution of smaller variance.
- Consistent - the probability of accurate estimates increases as the sample size increases.
Confidence Intervals
A confidence interval is an interval within which the population parameter is expected to lie with a specified probability (1 - α). This probability is the degree of confidence, and the interval is called the 100(1 - α)% confidence interval for the parameter.
The end points of the interval are called the lower and upper confidence limits.
A 95% confidence interval can be interpreted by considering the case when we take a large number of samples from the population and construct a confidence interval for each sample. We expect 95% of these confidence intervals to include the population mean. Following on, we can say that we are 95% confident that a single confidence interval includes the population mean.
Constructing a confidence interval
A confidence interval is defined by:
$$\text{Point estimate} \pm \text{reliability factor} \times \text{standard error} \qquad (4)$$
where
| Point estimate | = | a point estimate of the parameter |
| reliability factor | = | a number based on the assumed distribution of the point estimate and degree of confidence for the interval |
| standard error | = | standard error of the sample statistic providing the point estimate |
Applying this to the case where we estimate the population mean from a sample taken from a normally distributed population with known variance, the confidence interval is given by:
$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \qquad (5)$$
where
| X̄ | = | sample mean, which is the point estimate of the population mean |
| σ | = | population standard deviation |
| n | = | sample size |
| zα/2 | = | reliability factor, the point where α/2 of the probability is in the right tail |
Using the characteristics of a normal distribution, we can see that the reliability factors are z0.05 = 1.645 for a 90% confidence interval, z0.025 = 1.960 for a 95% confidence interval, and z0.005 = 2.575 for a 99% confidence interval.
Another way of saying this is that:
- 90% of the sample means will be within 1.645 standard errors of the population mean.
- 95% of the sample means will be within 1.960 standard errors of the population mean.
- 99% of the sample means will be within 2.575 standard errors of the population mean.
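The reliability factors quoted above can be recovered from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; note that the exact 99% value is about 2.576, which the text rounds to 2.575.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

reliability_factors = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    # z such that alpha/2 of the probability lies in the right tail.
    reliability_factors[confidence] = standard_normal.inv_cdf(1 - alpha / 2)

for confidence, z in reliability_factors.items():
    print(f"{confidence:.0%}: z = {z:.3f}")
```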
For any distribution, if we do not know the variance and it is a large sample, we can use:
$$\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}} \qquad (6)$$
where
| s | = | sample standard deviation |
| x̄ | = | sample mean |
| n | = | sample size |
Therefore:
- The 90% confidence interval for the mean is x̄ ± 1.645 s/√n
- The 95% confidence interval for the mean is x̄ ± 1.960 s/√n
- The 99% confidence interval for the mean is x̄ ± 2.575 s/√n
Example -7: Confidence intervals
A sample of 81 observations is taken from a normal population. The sample mean is 20 and the standard deviation is 3, so the standard error is 3/√81 = 3/9.
The 90% confidence interval is 20 ± (1.645 × 3)/9, which is 19.45 up to 20.55.
This means we can be 90% confident that the population mean lies between 19.45 and 20.55.
The 95% confidence interval is 20 ± (1.960 × 3)/9, which is 19.35 up to 20.65.
The 99% confidence interval is 20 ± (2.575 × 3)/9, which is 19.14 up to 20.86.
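The three intervals in this example can be reproduced with a short calculation. The helper name `z_interval` is illustrative, and the reliability factors are the rounded values used in the text.

```python
from math import sqrt


def z_interval(sample_mean, std_dev, n, z):
    """Return (lower, upper) limits of sample_mean +/- z * std_dev / sqrt(n)."""
    half_width = z * std_dev / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


intervals = {}
for level, z in ((90, 1.645), (95, 1.960), (99, 2.575)):
    intervals[level] = z_interval(20, 3, 81, z)

for level, (lower, upper) in intervals.items():
    print(f"{level}%: {lower:.2f} to {upper:.2f}")
```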
Student's t-Distribution
An alternative method for constructing confidence intervals is to use the t-distribution. It is a more conservative method, giving wider intervals, and ideally is used whenever the population variance is unknown, even for a large sample. When the sample is small (fewer than 30 observations) and the population variance is unknown, it is essential to use the t-distribution approach.
The t-distribution is a symmetrical probability distribution defined by a single parameter, the number of degrees of freedom (df).
Degrees of freedom are the number of independent observations used.
The t-distribution with a mean of 0 and (n - 1) degrees of freedom is given by:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \qquad (7)$$
The statistic is not normally distributed because it depends on two random variables, the sample mean and the sample standard deviation. However, as the number of degrees of freedom increases, the t-distribution approaches the normal distribution, as shown below:
Confidence Intervals for the Population Mean
If we are considering a population with unknown variance and either the sample is large, or the sample is small and the population is normally distributed, the 100(1 - α)% confidence interval is given by:
$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \qquad (8)$$
where the number of degrees of freedom for tα/2 is (n -1), with a sample size of n.
In order to answer hypothesis questions you may be required to read t-distribution tables to find the critical value of t. We show an excerpt from the tables below. Note that these are for one-tailed tests, so for α = 0.05 then p = 0.05, whereas for a two-tailed test you would need to use p = 0.025, which is half the significance level.
For example, with 5 degrees of freedom and α = 0.05, the critical t-value for a one-tailed test would be 2.015. For a two-tailed test (p = 0.025) it would be 2.571.
| df | p = 0.10 | p = 0.05 | p = 0.025 |
| 1 | 3.078 | 6.314 | 12.706 |
| 2 | 1.886 | 2.920 | 4.303 |
| 3 | 1.638 | 2.353 | 3.182 |
| 4 | 1.533 | 2.132 | 2.776 |
| 5 | 1.476 | 2.015 | 2.571 |
| Etc. | | | |
Example -8: Confidence intervals
An investor is looking at the quarterly returns from a mutual fund portfolio, which are assumed to be normally distributed and have a sample mean of 3% and a sample standard deviation of 2%. He looks at 3 years' data and wishes to compute the 95% confidence interval. Since the sample is small he uses Equation 3-12.
He will need to use t-distribution tables to look up t0.025 for 11 degrees of freedom (since the sample size is 12); this is 2.201.
The confidence interval is 3% ± 2.201 × 2%/√12 = 3% ± 1.27%.
The 95% confidence interval is between 1.73% and 4.27%.
The investor can be confident, at the 95% level, that this range includes the population mean.
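The arithmetic in this example can be checked as follows. The Python standard library has no inverse t-distribution, so the critical value 2.201 is taken from the tables, as in the text; the variable names are illustrative.

```python
from math import sqrt

sample_mean = 3.0   # mean quarterly return, in percent
sample_std = 2.0    # sample standard deviation, in percent
n = 12              # 3 years of quarterly returns
t_critical = 2.201  # t(0.025) with n - 1 = 11 degrees of freedom, from tables

# Half-width of the interval: reliability factor times standard error.
half_width = t_critical * sample_std / sqrt(n)
lower = sample_mean - half_width
upper = sample_mean + half_width

print(f"95% confidence interval: {lower:.2f}% to {upper:.2f}%")
```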
In summary, the table below shows which statistic to use for different samples.
| Distribution | Variance | Small Sample | Large Sample |
| Normal | Known | z | z |
| Normal | Unknown | t | z or t |
| Nonnormal | Known | Not Available | z |
| Nonnormal | Unknown | Not Available | z or t |
If a larger sample size is taken, the confidence interval will be narrower because the standard error is lower. As you would expect, a larger sample gives more precise results.
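The effect of sample size on the interval can be seen numerically. The sketch below assumes a standard deviation of 3 and the 95% reliability factor of 1.960; the sample sizes are chosen only for illustration.

```python
from math import sqrt


def interval_width(std_dev, n, z=1.960):
    """Full width of the interval: 2 * z * std_dev / sqrt(n)."""
    return 2 * z * std_dev / sqrt(n)


# Assumed standard deviation of 3; each sample is four times the previous one.
widths = {n: interval_width(3, n) for n in (25, 100, 400)}

for n, width in widths.items():
    print(f"n = {n:3d}: width = {width:.3f}")
```

Because the standard error falls with the square root of n, quadrupling the sample size halves the width of the interval.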
Biases Impacting on Data Selected
Data-Snooping Bias
This is the bias that occurs if you use the empirical results of other analysts' research, or focus on patterns that may have been identified by other research. Ideally you would study new data but unfortunately this may not be practical in financial markets where much of the research is based on historic data.
Data-Mining Bias
This is when forecasting models are derived from searching through historic data for patterns/trading rules. The problems occur when a large number of models are tested but only the successful ones reported.
Sample Selection Bias
This occurs when certain data is excluded from the analysis, possibly because the data was not available.
Survivorship Bias
This is one type of sample selection bias, which occurs when companies that have gone bankrupt, or funds or portfolios that have been liquidated, are not included in the analysis.
Look-Ahead Bias
This is when a test uses information that was not available at the test date. An example is a test of the success of valuation ratios in which not all investors would have had access, at the test date, to the accounting data incorporated in the valuation ratio.
Time-Period Bias
This is when the test period used does not match the conclusion being drawn; for example, short-term data is applied to produce long-term forecasts.