Introduction
Here we provide examples of how the Universal Probability Calculator (UPC) can be used to calculate probabilities for normal and non-normal data. We also perform the calculations using Microsoft Excel and Minitab, so the user can learn how to compute probabilities in those tools and see how the process differs from UPC.
Going straight to the point: when using UPC you do not need to worry about whether the distribution is normal or not; you just paste your data and press "Calculate". With other tools, you must first do this analysis in order to decide on the next steps, which is complex and full of pitfalls that can mislead the user into making bad decisions.
Example 1 (normal distribution)
Description of the problem:
Assume you measured the commuting time from your house to the office 15 times. From these measurements you find that the average time is 53 minutes. You also wish to know the odds of taking up to 1 hour to get to the office.
Measured values (minutes):
52.7 | 43.5 | 43.3 | 59.2 | 47.8
65.2 | 38.7 | 51.7 | 53.6 | 54.3
67.9 | 49.7 | 51.6 | 63.8 | 53.6
Solution using the “Universal Probability Calculator (UPC)”:
The first step is shown in the following figure:
In the second step, copy and paste the values directly into the field:
After clicking the “Calculate” button, a message is displayed saying that the odds are 81.9%.
Solution using “Excel - Windows”:
On the Excel menu, choose Data -> Data Analysis -> Descriptive Statistics to get the table below:
Mean | 53.10165
Standard Error | 2.137853
Median | 52.6883
Mode | #N/A
Standard Deviation | 8.279871
Sample Variance | 68.55626
Kurtosis | -0.36244
Skewness | 0.208165
Range | 29.1862
Minimum | 38.7058
Maximum | 67.892
Sum | 796.5247
Count | 15
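As a cross-check outside Excel, the main figures in the table above can be approximated with Python's standard statistics module. This is a minimal sketch: the inputs are the rounded measurements listed earlier, so the results agree with the Excel table only to about two decimal places, and the variable names are illustrative.

```python
import math
import statistics

# Commuting times in minutes, as listed (rounded to one decimal).
times = [52.7, 43.5, 43.3, 59.2, 47.8, 65.2, 38.7, 51.7,
         53.6, 54.3, 67.9, 49.7, 51.6, 63.8, 53.6]

n = len(times)
mean = statistics.mean(times)
stdev = statistics.stdev(times)        # sample standard deviation (divides by n - 1)
std_error = stdev / math.sqrt(n)       # standard error of the mean
median = statistics.median(times)

print(f"count={n} mean={mean:.2f} stdev={stdev:.2f} "
      f"std_error={std_error:.2f} median={median:.1f}")
```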
Because kurtosis and skewness are close to zero, we can assume that the distribution is normal, or at least close to it. The sample size is 15, which is small, and we do not know the variance of the population. For these reasons it is appropriate to use a Student's t distribution.
With 14 degrees of freedom, we have t = (60 - 53.10)/8.28 = 0.833.
Using the Excel command T.DIST(0.833,14,1), we get that the probability of having a value up to 60 minutes is 79.06%.
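The same calculation can be sketched in Python without Excel. The standard library has no Student's t cumulative distribution, so the sketch below integrates the t density numerically with Simpson's rule. The function names t_pdf and t_cdf are illustrative, and this is only one possible stand-in for T.DIST(x, df, 1), not a description of how Excel computes it internally.

```python
import math

def t_pdf(t, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, x] plus the 0.5 mass below zero.

    steps must be even; a negative x gives a negative step, so the
    integral is subtracted from 0.5 automatically.
    """
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

mean, stdev, n = 53.10165, 8.279871, 15   # values from the Excel table
t = (60 - mean) / stdev
p = t_cdf(t, n - 1)
print(f"t = {t:.3f}, P(X <= 60) = {p:.4f}")
```

For a threshold below the mean, t is negative and the function returns the lower-tail probability; the upper tail is then 1 minus that value.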
Solution using “Minitab”:
First we perform a goodness-of-fit test for the normal distribution. In Minitab: Stat -> Basic Statistics -> Normality Test. For Anderson-Darling and Kolmogorov-Smirnov we have the results on the right; neither rejects the null hypothesis of normality. Therefore, it is plausible to assume the distribution is normal.
Because the sample size is small and we do not know the variance of the population, we use the Student's t distribution. In Minitab: Calc -> Probability Distributions -> t. Select “cumulative probability”, and in the field “input constant” enter 0.833 (the same value used for the Excel solution). Doing so, we get the answer (on the right): the probability of having a value smaller than 60 minutes is 79.06%.
Discussion of the results
Regarding the source of the data: a population of 20000 values was generated using the software Matlab, with the function (randn(20000,1) * 5) + 50. From this population, we drew 15 values at random (our working sample).
Summary of the results:
UPC-Dunamath | Excel | Minitab | Correct answer
81.9% | 79.06% | 79.06% | 97.72%
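Because the population was generated as a normal distribution with mean 50 and standard deviation 5, the "correct answer" in the table can be checked directly. A minimal sketch using Python's statistics.NormalDist follows; the variable names are illustrative, and the second calculation (a normal distribution fitted with the sample estimates) is added only for comparison.

```python
from statistics import NormalDist

# Population used to generate the data: mean 50, standard deviation 5.
population = NormalDist(mu=50, sigma=5)
p_true = population.cdf(60)            # P(X <= 60), threshold is 2 sigma above the mean

# Same question answered with the sample estimates, still assuming normality.
sample_fit = NormalDist(mu=53.10165, sigma=8.279871)
p_sample = sample_fit.cdf(60)

print(f"population: {p_true:.4f}, sample estimates: {p_sample:.4f}")
```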
Excel and Minitab returned the same result because they both used the Student's t distribution with the same value of t. Both assumed a normal distribution, which is correct in this case because the population was generated from a normal distribution. However, the mean and standard deviation used to calculate t are significantly off. For the sample, the mean and standard deviation are 53.1 and 8.28 respectively, while for the population they are 50 and 5. This explains the large error in the answer.
Another point is that, even when Excel and/or Minitab are used correctly, the decision maker is likely to trust the result (79.06%) and base a decision on it. These tools do not give information about the accuracy of the answer, and users are often not even aware that the result carries uncertainty.
The Universal Probability Calculator (UPC) returned a probability of 81.9%, slightly better than the others, and it also reports a low confidence level of 64%, alerting the user to this.
UPC not only computes the probability in a very simple and direct way, but also gives an estimate of the uncertainty present in the result. In this sense it is fairer to the decision maker, who needs to increase the sample size to obtain a smaller uncertainty.
Note that we are not calculating the probability of having a sample mean smaller than 1 hour. That would be a different question.
Example 2 (non-normal distribution)
Problem description:
A product engineer is studying the lifetime of a hard drive disc. In one experiment, the lifetime in hours of 10 discs is measured. The results are in the following table:
1988.77 | 2026.69
2074.94 | 2018.67
1973.65 | 1921.29
1941.77 | 1937.22
1895.03 | 1942.83
A) What is the probability of having a disc lasting longer than 1900 hours?
Solution using the “Universal Probability Calculator (UPC)”:
Step 1 as follows:
Note that you could have selected the symbol >=, but because the data are values of a continuous variable, the choice makes no difference.
Step 2 as follows, copying and pasting the data:
After clicking “Calculate”, a message is displayed saying that the probability of having a disc lasting longer than 1900 hours is 90.56%.
Solution using “Excel - Windows”:
Excel menu: Data -> Data Analysis -> Descriptive Statistics, to get the following table:
Mean | 1972.086
Standard Error | 17.48647
Median | 1958.24
Mode | #N/A
Standard Deviation | 55.29706
Sample Variance | 3057.765
Kurtosis | -0.35796
Skewness | 0.552605
Range | 179.91
Minimum | 1895.03
Maximum | 2074.94
Sum | 19720.86
Count | 10
Because kurtosis and skewness are close to zero, it is plausible to assume the distribution is normal or approximately normal. The sample size is small and we do not know the variance of the population, so we decide to use the Student's t distribution.
With 9 degrees of freedom, we have t = (1900 - 1972.09)/55.30 = -1.304.
The Excel command T.DIST(-1.304,9,1) returns a cumulative probability of 11.2% for values below 1900, so the probability of having a value greater than 1900 is (100-11.2) = 88.8%.
Solution using “Minitab”:
First we perform a goodness-of-fit test for the normal distribution. In Minitab: Stat -> Basic Statistics -> Normality Test. For Anderson-Darling and Kolmogorov-Smirnov we have the results on the right; neither rejects the null hypothesis of normality. Therefore, it is plausible to assume the distribution is normal.
Because the sample size is small and we do not know the variance of the population, we use the Student's t distribution. In Minitab: Calc -> Probability Distributions -> t. Select “cumulative probability”, and in the field “input constant” enter -1.304 (the same t computed in Excel). The answer is on the right, where the probability is (100-11.2) = 88.8%.
B) How sure are you about the probability you have just calculated?
Using UPC-Dunamath, a message is displayed saying: “We are 68% confidence that the true value is between 85.56% and 95.56%”.
This means we are 68% confident that the true value falls within this interval. It also means that if you collect further samples to repeat the test, and keep doing that many times, at least 68% of the probabilities will fall within the interval. Note that we do not get this information from Excel or Minitab.
C) In order to improve the confidence level, you obtain the lifetimes of another 30 discs and repeat the test using the previous sample as well, for a total sample of 40 discs. What is the probability of having a disc lasting longer than 1900 hours?
1988.77 | 2026.69 | 2053.48 | 2140.11 | 2132.87 | 2062.56 | 1970.53 | 2164.22
2074.94 | 2018.67 | 1982.92 | 1924.92 | 2154.11 | 1788.89 | 2046.63 | 2019.41
1973.65 | 1921.29 | 1968.29 | 1753.65 | 1972.47 | 2028.2 | 2000.97 | 1960.72
1941.77 | 1937.22 | 1943.67 | 1957.47 | 1909.35 | 2018.27 | 2102.17 | 1695.47
1895.03 | 1942.83 | 2063.94 | 1678.59 | 1948.96 | 2050.25 | 1899.61 | 2058.53
Solution using the “Universal Probability Calculator (UPC)”:
Step 1:
Step 2 (copy and paste data):
Note that there are more values to the right of the field in the picture.
After clicking “Calculate”, a message is displayed saying that the probability of having a disc lasting longer than 1900 hours is 85.09%, with 79% confidence that the true value is between 80.09% and 90.09%.
Solution using “Excel - Windows”:
Excel menu: Data -> Data Analysis -> Descriptive Statistics:
Mean | 1979.302
Standard Error | 17.43848
Median | 1978.285
Mode | #N/A
Standard Deviation | 110.2906
Sample Variance | 12164.02
Kurtosis | 1.321178
Skewness | -0.90893
Range | 485.63
Minimum | 1678.59
Maximum | 2164.22
Sum | 79172.09
Count | 40
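The skewness and kurtosis in the table above can be reproduced from the 40 lifetimes. The sketch below uses the bias-corrected sample formulas that Excel documents for its SKEW and KURT functions; the helper names are illustrative.

```python
import math

lifetimes = [
    1988.77, 2026.69, 2053.48, 2140.11, 2132.87, 2062.56, 1970.53, 2164.22,
    2074.94, 2018.67, 1982.92, 1924.92, 2154.11, 1788.89, 2046.63, 2019.41,
    1973.65, 1921.29, 1968.29, 1753.65, 1972.47, 2028.2, 2000.97, 1960.72,
    1941.77, 1937.22, 1943.67, 1957.47, 1909.35, 2018.27, 2102.17, 1695.47,
    1895.03, 1942.83, 2063.94, 1678.59, 1948.96, 2050.25, 1899.61, 2058.53,
]

def standardized(xs):
    """Return (x - mean) / s for each x, with s the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / s for x in xs]

def sample_skewness(xs):
    n = len(xs)
    return n / ((n - 1) * (n - 2)) * sum(z ** 3 for z in standardized(xs))

def sample_excess_kurtosis(xs):
    n = len(xs)
    fourth = sum(z ** 4 for z in standardized(xs))
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

skew = sample_skewness(lifetimes)
kurt = sample_excess_kurtosis(lifetimes)
print(f"skewness = {skew:.4f}, kurtosis = {kurt:.4f}")
```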
Kurtosis and skewness are NOT close to zero. They are not very far from it either, but in this case it is safer not to assume the distribution is normal.
In Excel there is no direct method to deal with non-normal data. One alternative is to assume the data is not far from normal and use the Student's t distribution, with t = (1900-1979.30)/110.29 = -0.719. The Excel command T.DIST(-0.719,39,1) gives the cumulative probability below 1900, and its complement results in 76.2%.
Another alternative is to use an Empirical Distribution Function (EDF), as shown in the next step.
A table with the Empirical Distribution is shown as follows:
X(i) | q <= X(i) | EDF <= x | EDF > x
1678.59 | 1 | 0.025 | 0.975
1695.47 | 2 | 0.05 | 0.95
1753.65 | 3 | 0.075 | 0.925
1788.89 | 4 | 0.1 | 0.9
1895.03 | 5 | 0.125 | 0.875
1899.61 | 6 | 0.15 | 0.85
1909.35 | 7 | 0.175 | 0.825
1921.29 | 8 | 0.2 | 0.8
1924.92 | 9 | 0.225 | 0.775
1937.22 | 10 | 0.25 | 0.75
1941.77 | 11 | 0.275 | 0.725
1942.83 | 12 | 0.3 | 0.7
1943.67 | 13 | 0.325 | 0.675
1948.96 | 14 | 0.35 | 0.65
1957.47 | 15 | 0.375 | 0.625
1960.72 | 16 | 0.4 | 0.6
1968.29 | 17 | 0.425 | 0.575
1970.53 | 18 | 0.45 | 0.55
1972.47 | 19 | 0.475 | 0.525
1973.65 | 20 | 0.5 | 0.5
1982.92 | 21 | 0.525 | 0.475
1988.77 | 22 | 0.55 | 0.45
2000.97 | 23 | 0.575 | 0.425
2018.27 | 24 | 0.6 | 0.4
2018.67 | 25 | 0.625 | 0.375
2019.41 | 26 | 0.65 | 0.35
2026.69 | 27 | 0.675 | 0.325
2028.2 | 28 | 0.7 | 0.3
2046.63 | 29 | 0.725 | 0.275
2050.25 | 30 | 0.75 | 0.25
2053.48 | 31 | 0.775 | 0.225
2058.53 | 32 | 0.8 | 0.2
2062.56 | 33 | 0.825 | 0.175
2063.94 | 34 | 0.85 | 0.15
2074.94 | 35 | 0.875 | 0.125
2102.17 | 36 | 0.9 | 0.1
2132.87 | 37 | 0.925 | 0.075
2140.11 | 38 | 0.95 | 0.05
2154.11 | 39 | 0.975 | 0.025
2164.22 | 40 | 1 | 0
In the Empirical Distribution table, the first column holds the data sorted in ascending order. The second column holds, for each value, the number of values smaller than or equal to it (which coincides with the row number). The third column holds the second column divided by the sample size, resulting in a cumulative frequency. Finally, the fourth column is the complement of the third column.
We want to calculate the probability of having a value greater than 1900. In the table, the value 1900 falls between rows 6 and 7 (1899.61 and 1909.35), so it is possible to say that the probability is between 82.5% and 85%. Note that there is no guarantee the true value is within this interval, but for non-normal data this is a simple method to get a notion of the probability.
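The table and the reading at 1900 hours can be reproduced with a few lines of Python. This is a minimal sketch of an empirical distribution function; the helper name edf_upper is illustrative. Evaluating the EDF exactly at x = 1900 gives the upper end of the 82.5%-85% range, because six of the 40 values are at or below 1900.

```python
lifetimes = [
    1988.77, 2026.69, 2053.48, 2140.11, 2132.87, 2062.56, 1970.53, 2164.22,
    2074.94, 2018.67, 1982.92, 1924.92, 2154.11, 1788.89, 2046.63, 2019.41,
    1973.65, 1921.29, 1968.29, 1753.65, 1972.47, 2028.2, 2000.97, 1960.72,
    1941.77, 1937.22, 1943.67, 1957.47, 1909.35, 2018.27, 2102.17, 1695.47,
    1895.03, 1942.83, 2063.94, 1678.59, 1948.96, 2050.25, 1899.61, 2058.53,
]

def edf_upper(data, x):
    """Empirical P(X > x): share of observations strictly greater than x."""
    return sum(1 for v in data if v > x) / len(data)

ordered = sorted(lifetimes)
n = len(ordered)

# Rebuild the table rows: value, rank, EDF, complement of the EDF.
rows = [(v, i, i / n, 1 - i / n) for i, v in enumerate(ordered, start=1)]
for row in rows[4:8]:                  # rows around the 1900-hour threshold
    print("%8.2f | %2d | %.3f | %.3f" % row)

print("P(X > 1900) ~", edf_upper(lifetimes, 1900))
```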
Solution using “Minitab”:
First we perform a goodness-of-fit test for the normal distribution. In Minitab: Stat -> Basic Statistics -> Normality Test. For Anderson-Darling, the null hypothesis of normality is rejected. Therefore, it is not plausible to assume the distribution is normal.
Because the distribution is not normal, we need to estimate the type of the distribution. Minitab menu: Stat -> Quality Tools -> Individual Distribution Identification.
We get the table on the right, with an Anderson-Darling test applied to different types of distribution. In general, all distributions with P smaller than 0.05 are immediately discarded. From the remaining ones, we pick the one with the greatest P value.
In our case, the first is “Johnson Transformation”, then “Box-Cox Transformation”, and after that, “Weibull”. Because the first two are transformations and not native distributions, and there is no direct method to use them in Minitab, we pick the “Weibull” distribution.
Along with the previous table we also get the following table with the parameters of each distribution. In our case, for “Weibull”, there are 2 parameters: 22.30053 (shape) and 2027 (scale).
In the next step, on the Minitab menu: Calc -> Probability Distributions -> Weibull. Select “cumulative probability”, type the values of the 2 parameters, and in the field “input constant” type the value 1900.
Doing so, we have the answer as follows:
We want the probability of having values greater than 1900, so we have (1-0.2098) = 0.7902 = 79.02%.
Phew! Finally!
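The final Minitab step is a plain evaluation of the two-parameter Weibull cumulative distribution function, F(x) = 1 - exp(-(x/scale)^shape). A minimal Python sketch with the fitted parameters quoted above follows. The function name weibull_cdf is illustrative, and because the scale appears rounded in the text, the last digits may differ slightly from Minitab's output.

```python
import math

def weibull_cdf(x, shape, scale):
    """P(X <= x) for a two-parameter Weibull distribution."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))

shape, scale = 22.30053, 2027.0        # fitted parameters reported by Minitab
p_below = weibull_cdf(1900, shape, scale)
p_above = 1.0 - p_below                # we want lifetimes longer than 1900 hours
print(f"P(X <= 1900) = {p_below:.4f}, P(X > 1900) = {p_above:.4f}")
```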
Discussion of the results:
Regarding the source of the data: we generated 20000 values using the software Matlab, with the function wblrnd(2042.6,25.8773,20000,1), producing a population with a Weibull distribution, mean 2000.3 and standard deviation 97.192. From that population, we drew our samples at random.
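For reference, the theoretical properties of this population follow from the first two arguments passed to wblrnd, which Matlab documents as the scale (2042.6) and the shape (25.8773). The sketch below computes the theoretical mean, standard deviation, and P(X > 1900). These come out close to, but not identical with, the figures quoted for the 20000 generated values, which are subject to sampling variation. The function name weibull_summary is illustrative.

```python
import math

def weibull_summary(scale, shape, threshold):
    """Theoretical mean, standard deviation and P(X > threshold) of a Weibull."""
    g1 = math.gamma(1 + 1 / shape)
    g2 = math.gamma(1 + 2 / shape)
    mean = scale * g1
    stdev = scale * math.sqrt(g2 - g1 * g1)
    p_above = math.exp(-((threshold / scale) ** shape))   # survival function
    return mean, stdev, p_above

mean, stdev, p_above = weibull_summary(scale=2042.6, shape=25.8773, threshold=1900)
print(f"mean = {mean:.1f}, stdev = {stdev:.1f}, P(X > 1900) = {p_above:.4f}")
```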
For the first sample (N=10):
UPC-Dunamath | Excel | Minitab | Correct answer
90.56% | 88.8% | 88.8% | 85.82%
For the extended sample (N=40):
UPC-Dunamath | Excel (using Student Distribution) | Excel (empirical distribution) | Minitab | Correct answer
85.09% | 76.2% | [82.5% - 85%] | 79.02% | 85.82%
In the first table (N=10), we see that Excel and Minitab returned the same result because they both used the Student's t distribution with the same t. Note, however, that although the data passed the goodness-of-fit test for the normal distribution, the correct distribution is Weibull.
The mean and standard deviation of the sample are 1972.09 and 55.30 respectively, while for the population they are 2000.3 and 97.192. Even though we assumed the wrong distribution, the probability error is small. This might be just luck, for example a numerical coincidence influenced by the relation between mean and standard deviation.
For UPC, the probability is 90.56%, with 68% confidence that the true value is between 85.56% and 95.56%. The confidence is low because of the small sample, which is a warning for the user that the other tools do not provide. Despite that, the true value is within the interval.
In the second table (N=40), we rejected the assumption of normality in both Excel and Minitab. For Excel, we proposed using the Empirical Distribution Function just to get an idea of the probability, obtaining a value between 82.5% and 85%, which is plausible when compared with the correct answer.
Using Minitab, after the hard work of identifying a suitable distribution type and its parameters and performing the calculation, the result was even worse than in the case with N=10. This is inconvenient but possible: we are using small samples, and maybe the additional values are less representative of the population than the initial sample, or it is just a numerical coincidence.
For UPC, the probability is 85.09%, with 79% confidence that the true value is between 80.09% and 90.09%. Compared with N=10, the confidence level increased significantly because of the bigger sample size. The error is smaller than for Excel and Minitab, and the true value is within the estimated interval.
This example shows how complicated these analyses can become. It is complicated to calculate a value for the probability, and after that, you still do not know the uncertainty of the result. The Universal Probability Calculator (UPC) makes this calculation much easier and also gives an estimate of the uncertainty involved. You do not need to worry about all the statistical assumptions and tricky details; everything is handled by our algorithm.