
Data Fitting with R


What is Data Fitting?

Introduction

Fitting distributions consists of finding a mathematical function that represents a statistical variable. Data scientists and statisticians are often faced with this problem: they have some observations of a quantitative character x1, x2, …, xn and they wish to test whether those observations, being a sample of an unknown population, belong to a population with a pdf (probability density function) f(x,θ), where θ is a vector of parameters to be estimated from the available data.

We can identify 4 steps in fitting distributions:

  1. Model/function choice: hypothesize families of distributions;
  2. Estimate parameters;
  3. Evaluate quality of fit;
  4. Goodness-of-fit statistical tests.

This article addresses fitting distributions, dealing briefly with both theoretical and practical issues using the statistical environment and language R.

R is a flexible and powerful language and environment for statistical computing and graphics. We are going to use R statements concerning graphical techniques, model/function choice, parameter estimation, measures of goodness-of-fit, and the most common goodness-of-fit tests.

To understand this work a basic knowledge of R is needed. We suggest reading "An Introduction to R". R statements, unless otherwise specified, are included in the stats package.

Graphics

Exploratory data analysis can be the first step: obtaining descriptive statistics (mean, standard deviation, skewness, kurtosis, etc.) and using graphical techniques (histograms, density estimates, ECDFs) which can suggest the kind of pdf to use to fit the model.
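For instance, a first descriptive pass might look like the following minimal sketch (the vector x is simulated here purely for illustration and is not one of the samples used later in the article):

x<-rnorm(n=200,mean=10,sd=2) ## a simulated sample, for illustration only
summary(x) ## five-number summary plus the mean
mean(x); sd(x) ## location and spread
hist(x) ## histogram of the sample
plot(density(x)) ## kernel density estimate
plot(ecdf(x)) ## empirical cumulative distribution function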

We can obtain samples from some pdf (such as Gaussian, Poisson, Weibull, gamma, etc.) using R statements and then draw a histogram of these data.

EXAMPLE 1. Fitting a Normal Distribution

Suppose we have a sample of size n=200 drawn from a normal population N(10,2) with mean=10 and standard deviation=2:

x.norm<-rnorm(n=200,m=10,sd=2)

We can get a histogram using the hist() statement (Figure 1):

hist(x.norm,main="Histogram of observed data")

 


Figure 1. Histogram of random Normal data

Histograms can provide insight into skewness, behavior in the tails, the presence of multimodal behavior, and outliers; histograms can be compared with the fundamental shapes associated with standard analytic distributions.

We can estimate the frequency density using density() and plot() to draw the graphic (Figure 2):

plot(density(x.norm),main="Density estimate of data")

R allows us to compute the empirical cumulative distribution function with ecdf() (Figure 3):

plot(ecdf(x.norm),main="Empirical cumulative distribution function")

A Quantile-Quantile (Q-Q) plot is a scatterplot comparing the fitted and empirical distributions in terms of the dimensional values of the variable (i.e., empirical quantiles). It is a graphical technique for determining whether a data set comes from a known population. In this plot the y-axis shows the empirical quantiles and the x-axis shows the quantiles obtained from the theoretical model.

R offers two statements: qqnorm(), to test the goodness-of-fit of a Gaussian distribution, and qqplot(), for any kind of distribution. In our example we have (Figure 4):

z.norm<-(x.norm-mean(x.norm))/sd(x.norm) ## standardized data
qqnorm(z.norm) ## drawing the QQ-plot
abline(0,1) ## drawing a 45-degree reference line


Figure 2. Estimated frequency density plot

A 45-degree reference line is also plotted. If the empirical data come from the population with the chosen distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence that the data set has come from a population with a different distribution.
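If the data have not been standardized first, qqline() can be used instead of abline(0,1); it adds a reference line through the first and third quartiles (a minimal sketch):

qqnorm(x.norm) ## Q-Q plot of the raw (non-standardized) data
qqline(x.norm) ## reference line through the first and third quartiles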


Figure 3. Empirical cumulative distribution function


Figure 4. Goodness-of-fit of a Gaussian distribution

EXAMPLE 2. Fitting Other Distributions

If the data differ from a normal distribution (e.g., data drawn from a Weibull pdf) we can use qqplot() in this way (Figure 5):

## sampling from a Weibull distribution with parameters
## shape=2.1 and scale=1.1
x.wei<-rweibull(n=200,shape=2.1,scale=1.1)
## theoretical quantiles from a Weibull population with
## known parameters shape=2 and scale=1
x.teo<-rweibull(n=200,shape=2,scale=1)
qqplot(x.teo,x.wei,main="QQ-plot distr. Weibull") ## QQ-plot
abline(0,1) ## a 45-degree reference line is plotted


Figure 5. Q-Q plot of data belonging from a Weibull pdf

where x.wei is the vector of empirical data and x.teo contains the quantiles from the theoretical model.

Model choice

The first step in fitting distributions consists of choosing the mathematical model or function that represents the data in the best way.

Sometimes the type of model or function can be suggested by hypotheses concerning the nature of the data, and histograms and other graphical techniques can often help in this step, but graphics can be quite subjective, so there are methods based on analytical expressions such as Pearson's K criterion. By solving a particular differential equation we can obtain several families of functions able to represent nearly all empirical distributions. Those curves depend only on the mean, variability, skewness, and kurtosis. If the data are standardized, the type of curve depends only on the skewness and kurtosis measures. For a mathematical treatment, see Strickland, J. (2015). Operations Research using Open-Source Tools. Lulu.com. ISBN 978-1-329-00404-7.
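As an illustration only, the sketch below computes one common form of the criterion, K = b1(b2+3)^2 / [4(4b2−3b1)(2b2−3b1−6)], where b1 is the squared skewness and b2 the (non-excess) kurtosis; this formula and the helper code are assumptions added here for illustration, not taken from the original text:

library(fBasics) ## for skewness() and kurtosis(), introduced again further below
x<-rnorm(n=200,mean=10,sd=2) ## illustrative sample
b1<-skewness(x)^2 ## squared skewness
b2<-kurtosis(x)+3 ## fBasics returns excess kurtosis, so add 3
K<-b1*(b2+3)^2/(4*(4*b2-3*b1)*(2*b2-3*b1-6)) ## Pearson's criterion (assumed form)
K ## the value of K indicates which Pearson family of curves applies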

According to the value of K obtained from the available data, we have a particular kind of function. Here are some examples of continuous and discrete distributions; they will be used later in this paper. For each distribution, the graphic shape and the R statements to produce it are shown.

EXAMPLE 3. Fitting a Discrete Distribution

Dealing with discrete data we can refer to the Poisson distribution (Figure 6), with probability mass function f(x) = λ^x e^(−λ)/x!, x = 0, 1, 2, …:

x.poi<-rpois(n=200,lambda=2.5)
hist(x.poi,main="Poisson distribution")

As concerns continuous data, we have:

normal (Gaussian) distribution (Figure 7)

curve(dnorm(x,m=10,sd=2),from=0,to=20,main="Normal distribution")

gamma distribution (Figure 8)

curve(dgamma(x, scale=1.5, shape=2),from=0, to=15, main="Gamma distribution")


Figure 6. A Poisson Distribution

Figure 7. A normal (Gaussian) distribution

Figure 8. A Gamma Distribution

EXAMPLE 4. Weibull Distribution (Figure 9)

Suppose we have a Weibull distribution with scale parameter 2.5 and shape parameter 1.5 (Figure 9).

curve(dweibull(x, scale=2.5, shape=1.5),from=0, to=15, main="Weibull distribution")

To compute the skewness and kurtosis indices we can use the statements skewness() and kurtosis() included in the fBasics package (you need to download this package from the CRAN website):

library(fBasics) ## package loading
skewness(x.norm) ## skewness of a normal distribution
[1] 0.1242952
kurtosis(x.norm) ## kurtosis of a normal distribution
[1] 0.01372539
skewness(x.wei) ## skewness of a Weibull distribution
[1] 0.7788843
kurtosis(x.wei) ## kurtosis of a Weibull distribution
[1] 0.4331281

Figure 9. A Weibull Distribution

Parameter estimation

After choosing a model that can mathematically represent our data, we have to estimate the parameters of that model. There are several estimation methods in the statistical literature; in this paper we focus on the following:

  1. analogic
  2. moments
  3. maximum likelihood

The analogic method consists of estimating model parameters by applying the same function to the empirical data. For example, we estimate the unknown mean of a normal population using the sample mean:

mean.hat<-mean(x.norm)
mean.hat
[1] 9.935537
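The same idea applies to the other parameter of the normal model; for example, a minimal sketch estimating the standard deviation analogically (sd.hat is just an illustrative name):

sd.hat<-sd(x.norm) ## analogic estimate of the standard deviation
sd.hat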

The method of moments is a technique for constructing estimators of the parameters that is based on matching the sample moments with the corresponding distribution moments; it equates sample moments to population (theoretical) ones. When moment methods are available, they have the advantage of simplicity. To see how sample (empirical) moments are defined, refer to Strickland, J. (2015). Operations Research using Open-Source Tools. Lulu.com. ISBN 978-1-329-00404-7.

EXAMPLE 5. A Gamma Distribution

Suppose we have a gamma distribution with rate parameter 0.5 and shape parameter 3.5.

x.gam<-rgamma(200,rate=0.5,shape=3.5) ## sampling from a gamma distribution with λ=0.5 (rate parameter) and α=3.5 (shape parameter)
med.gam<-mean(x.gam) ## sample mean
var.gam<-var(x.gam) ## sample variance
l.est<-med.gam/var.gam ## lambda estimate (corresponds to rate)
a.est<-((med.gam)^2)/var.gam ## alfa estimate
l.est
[1] 0.5625486
a.est
[1] 3.916339

The method of maximum likelihood is used in statistical inference to estimate parameters. We have a random variable with a known pdf f(x,θ) describing a quantitative character in the population. We must estimate the vector of constant and unknown parameters θ from the sampling data x1, x2, …, xn. Maximum likelihood estimation begins with the mathematical expression known as the likelihood function of the sample data. Loosely speaking, the likelihood of a set of data is the probability of obtaining that particular set of data given the chosen probability model. This expression contains the unknown parameters. The parameter values that maximize the sample likelihood are known as the maximum likelihood estimates (MLEs). MLEs have several desirable statistical properties.
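Before turning to the two R statements below, here is a minimal sketch (not from the original text; negll and fit are illustrative names) of the idea: the normal log-likelihood of x.norm is maximized numerically with optim(), and the resulting estimates should be close to the sample mean and standard deviation:

## negative log-likelihood of a normal sample as a function of (mean, sd)
negll<-function(par,data) -sum(dnorm(data,mean=par[1],sd=par[2],log=TRUE))
fit<-optim(par=c(5,1),fn=negll,data=x.norm) ## numerical minimization from rough starting values
fit$par ## approximate MLEs of the mean and standard deviation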

In the R environment we can get MLEs using two statements:

  1. mle() included in package stats4
  2. fitdistr() included in package MASS

mle() fits parameters by the maximum likelihood method, using iterative numerical methods to minimize the negative log-likelihood (which is the same as maximizing the log-likelihood). You have to supply the analytical expression of the negative log-likelihood as an argument and give some starting parameter estimates. In the case of a gamma distribution:

library(stats4) ## loading package stats4
## negative log-likelihood function of a gamma sample
ll<-function(lambda,alfa) {
  n<-200
  x<-x.gam
  -n*alfa*log(lambda)+n*log(gamma(alfa))-(alfa-1)*sum(log(x))+lambda*sum(x)
}
est<-mle(minuslogl=ll,start=list(lambda=2,alfa=1))
summary(est)
Maximum likelihood estimation
Call:
mle(minuslogl = ll, start = list(lambda = 2, alfa = 1))
Coefficients:
Estimate Std. Error
lambda        0.5290189     0.05430615
alfa          3.6829126     0.35287672
-2 log L: 1044.282

We supplied arbitrary starting values for the parameter estimates, but we could instead use the estimates obtained by the method of moments. The statement mle() allows us to estimate parameters for any kind of pdf; it only needs the analytical expression of the likelihood to be optimized.
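For instance, the method-of-moments estimates l.est and a.est from Example 5 could seed mle() directly (a minimal sketch; est2 is just an illustrative name):

est2<-mle(minuslogl=ll,start=list(lambda=l.est,alfa=a.est)) ## moment estimates as starting values
summary(est2)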

The MASS package provides fitdistr() for maximum-likelihood fitting of univariate distributions without any information about the analytical expression of the likelihood. It is enough to specify a data vector, the type of pdf (densfun) and, optionally, the list of starting values for the iterative procedure (start).

library(MASS) ## loading package MASS
fitdistr(x.gam,"gamma") ## fitting gamma pdf parameters
      shape         rate
   3.68320097   0.52910229
  (0.35290545) (0.05431458)
fitdistr(x.wei,densfun=dweibull,start=list(scale=1,shape=2)) ## fitting Weibull pdf parameters
      scale        shape
   1.04721828   2.04959159
  (0.03814184) (0.11080258)
fitdistr(x.norm,"normal") ## fitting Gaussian pdf parameters
      mean          sd
   9.9355373    2.0101691
  (0.1421404)  (0.1005085)

Measures of goodness-of-fit

A goodness-of-fit measure is useful for matching empirical frequencies with the frequencies fitted by a theoretical model. We have absolute and relative measures. The details are covered in Strickland, J. (2015). Operations Research using Open-Source Tools. Lulu.com. ISBN 978-1-329-00404-7.

EXAMPLE 6. Using R for Count Data

Here is an example using R for count data (Poisson distribution):

lambda.est<-mean(x.poi) ## estimate of parameter lambda
tab.os<-table(x.poi) ## table with empirical frequencies
freq.os<-vector()
for(i in 1:length(tab.os)) freq.os[i]<-tab.os[[i]] ## vector of empirical frequencies
freq.ex<-(dpois(0:max(x.poi),lambda=lambda.est)*200) ## vector of fitted (expected) frequencies
freq.os
[1] 21 29 46 53 28 16  4  2  1
freq.ex
[1] 15.0040080 38.8603808 50.3241931 43.4465534 28.1316433 14.5721912  6.2903292
[8]  2.3274218  0.7535028
acc<-mean(abs(freq.os-trunc(freq.ex))) ## absolute goodness-of-fit index
acc
[1] 2.111111
acc/mean(freq.os)*100 ## relative (percent) goodness-of-fit index
[1] 17

A graphical technique to evaluate the goodness-of-fit is to draw the pdf curve and the histogram together (Figure 10):

h<-hist(x.norm,breaks=15)
xhist<-c(min(h$breaks),h$breaks)
yhist<-c(0,h$density,0)
xfit<-seq(min(x.norm),max(x.norm),length=40)
yfit<-dnorm(xfit,mean=mean(x.norm),sd=sd(x.norm))
plot(xhist,yhist,type="s",ylim=c(0,max(yhist,yfit)),main="Normal pdf and histogram")
lines(xfit,yfit,col="red")

Figure 10. A Normal PDF and Histogram

Goodness-of-fit tests

Goodness-of-fit tests indicate whether or not it is reasonable to assume that a random sample comes from a specific distribution. They are a form of hypothesis testing where the null and alternative hypotheses are:

H0 : Sample data come from the stated distribution

HA : Sample data do not come from the stated distribution

These tests are sometimes called omnibus tests, and they are distribution free, meaning they do not depend on the specific pdf. We shall focus our attention on normality tests.

The Chi-square test is the oldest goodness-of-fit test dating back to Karl Pearson (1900). The test may be thought of as a formal comparison of a histogram with the fitted density.

An attractive feature of the chi-square (χ²) goodness-of-fit test is that it can be applied to any univariate distribution for which you can calculate the cumulative distribution function. The chi-square goodness-of-fit test is applied to binned data (i.e., data put into classes). This is not really a restriction, since for non-binned data you can simply calculate a histogram or frequency table before performing the chi-square test. However, the value of the chi-square test statistic depends on how the data are binned. Another disadvantage of this test is that it requires a sufficiently large sample size for the chi-square approximation to be valid. The chi-square goodness-of-fit test can be applied to either discrete or continuous distributions, while the Kolmogorov-Smirnov and Anderson-Darling tests are restricted to continuous distributions. Estimating model parameters from the sample data is allowed with this test. The chi-square test is defined for the hypotheses:

H0 : the data follow a specified distribution

HA : the data do not follow the specified distribution

In the R environment there are three ways to perform a chi-square test.

In the case of count data we can use goodfit(), included in the vcd package (to be downloaded from the CRAN website):

library(vcd) ## loading vcd package
gf<-goodfit(x.poi,type="poisson",method="MinChisq")
summary(gf)

Goodness-of-fit test for Poisson distribution

                  X^2 df  P(> X^2)
Pearson      8.378968  7 0.3003653

plot(gf,main="Count data vs Poisson distribution")

In the case of a continuous variable, such as the gamma distribution in the following example, with parameters estimated from the sample data, we can proceed as follows:

x.gam.cut<-cut(x.gam,breaks=c(0,3,6,9,12,18)) ## binning data
table(x.gam.cut) ## binned data table
x.gam.cut
  (0,3]   (3,6]   (6,9]  (9,12] (12,18]
     26      64      60      27      23
## computing expected frequencies
(pgamma(3,shape=a.est,rate=l.est)-pgamma(0,shape=a.est,rate=l.est))*200
[1] 19.95678
(pgamma(6,shape=a.est,rate=l.est)-pgamma(3,shape=a.est,rate=l.est))*200
[1] 70.82366
(pgamma(9,shape=a.est,rate=l.est)-pgamma(6,shape=a.est,rate=l.est))*200
[1] 60.61188
(pgamma(12,shape=a.est,rate=l.est)-pgamma(9,shape=a.est,rate=l.est))*200
[1] 30.77605
(pgamma(18,shape=a.est,rate=l.est)-pgamma(12,shape=a.est,rate=l.est))*200
[1] 16.12495
f.ex<-c(20,71,61,31,17) ## expected frequencies vector
f.os<-vector()
for(i in 1:5) f.os[i]<- table(x.gam.cut)[[i]] ## empirical frequencies
X2<-sum(((f.os-f.ex)^2)/f.ex) ## chi-square statistic
gdl<-5-2-1 ## degrees of freedom
1-pchisq(X2,gdl) ## p-value
[1] 0.07652367

H0 is accepted, since the p-value is greater than the significance level, fixed at 5%.

Figure 11. Count Data versus the Poisson Distribution

When we are dealing with a continuous variable and all of its pdf parameters are known, we can use chisq.test():

## computing relative expected frequencies
p<-c((pgamma(3,shape=3.5,rate=0.5)-pgamma(0,shape=3.5,rate=0.5)),
     (pgamma(6,shape=3.5,rate=0.5)-pgamma(3,shape=3.5,rate=0.5)),
     (pgamma(9,shape=3.5,rate=0.5)-pgamma(6,shape=3.5,rate=0.5)),
     (pgamma(12,shape=3.5,rate=0.5)-pgamma(9,shape=3.5,rate=0.5)),
     (pgamma(18,shape=3.5,rate=0.5)-pgamma(12,shape=3.5,rate=0.5)))
chisq.test(x=f.os,p=p,rescale.p=TRUE) ## chi-square test (rescale.p makes the probabilities sum to 1)
Chi-squared test for given probabilities
data:  f.os
X-squared = 2.8361, df = 4, p-value = 0.5856

We can't reject the null hypothesis as the p-value is rather high, so the sample data probably come from a gamma distribution with shape parameter 3.5 and rate parameter 0.5.

The Kolmogorov-Smirnov test is used to decide whether a sample comes from a population with a specific distribution. It can be applied to discrete (count) data and to binned continuous data (even if some authors do not agree on this point), as well as to continuous variables. It is based on a comparison between the empirical cumulative distribution function (ECDF) and the theoretical one, F(x) = ∫f(y,θ)dy, where the integral is taken up to x and f(y,θ) is the pdf.
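The statistic behind the test is the largest vertical distance between the ECDF and F(x); a minimal sketch of that computation for the Weibull sample of Example 2 is shown below (the helper objects are illustrative, and ks.test(), used in Example 7, is what one would normally call):

n<-length(x.wei)
x.s<-sort(x.wei) ## ordered sample
F.teo<-pweibull(x.s,shape=2,scale=1) ## theoretical CDF at the ordered points
D<-max(pmax(abs((1:n)/n-F.teo),abs((0:(n-1))/n-F.teo))) ## largest vertical distance
D ## should be close to the D reported by ks.test() in Example 7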

The Kolmogorov-Smirnov test is more powerful than the chi-square test when the sample size is not too large. For large samples both tests have the same power. The most serious limitation of the Kolmogorov-Smirnov test is that the distribution must be fully specified; that is, the location, scale, and shape parameters cannot be estimated from the data sample. Due to this limitation, many analysts prefer to use the Anderson-Darling goodness-of-fit test. However, the Anderson-Darling test is only available for a few specific distributions.

EXAMPLE 7. The K-S Test in R

In R we can perform the Kolmogorov-Smirnov test using the function ks.test() and apply it to a sample drawn from a Weibull pdf with known parameters (shape=2 and scale=1):

ks.test(x.wei,"pweibull", shape=2,scale=1)
One-sample Kolmogorov-Smirnov test
data: x.wei
D = 0.0623, p-value = 0.4198
alternative hypothesis: two.sided

We accept the null hypothesis that the data follow a Weibull distribution because the p-value is well above the significance levels usually cited in the statistical literature. In Figure 12 the ECDF and the theoretical CDF are drawn in the same plot:

x<-seq(0,2,0.1)
plot(x,pweibull(x,scale=1,shape=2),type="l",col="red", main="ECDF and Weibull CDF")
plot(ecdf(x.wei),add=TRUE)

Figure 12. ECDF and a Weibull CDF

Normality tests

Very often a statistician is asked to test whether collected data come from a normal population; we shall examine the main normality tests. In the statistical literature there are tests for testing only the skewness or only the kurtosis (or both at the same time) of a distribution, based on the well-known indices b3 and b4 (or gamma3 and gamma4). The Shapiro-Wilk test is one of the most powerful normality tests, especially for small samples. Normality is tested by matching two alternative variance estimates: a non-parametric estimator obtained by a linear combination of ordered sample values, and the usual parametric estimator. The weights for the linear combination are available in statistical tables.

The statement performing the Shapiro-Wilk test is shapiro.test(), and it supplies the W statistic and the p-value:

shapiro.test(x.norm)
Shapiro-Wilk normality test
data: x.norm
W = 0.9938, p-value = 0.5659

The p-value is higher than the significance levels usually used to test statistical hypotheses, so we accept the null hypothesis that the sample data come from a Gaussian distribution.

The Jarque-Bera test is widely used to test normality in econometrics. It is based on the skewness and kurtosis measures of a distribution, considering the asymptotic distribution of b3 and b4, which, under the null hypothesis, is a chi-square with 2 degrees of freedom.
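As an illustration only, the statistic can be computed by hand (a minimal sketch, assuming the usual form JB = n/6·(S² + K²/4), with S the sample skewness and K the excess kurtosis; the result is only approximately comparable with the packaged test below):

library(fBasics) ## for skewness() and kurtosis()
n<-length(x.norm)
S<-skewness(x.norm) ## sample skewness
K<-kurtosis(x.norm) ## sample excess kurtosis (fBasics convention)
JB<-n/6*(S^2+K^2/4) ## Jarque-Bera statistic (assumed form)
1-pchisq(JB,df=2) ## approximate p-value under the null hypothesis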

In R this test is available in the tseries package (it should be downloaded from the CRAN website) using the statement jarque.bera.test(), which supplies the value of the statistic, the degrees of freedom and the p-value:

library(tseries) ## package tseries loading
jarque.bera.test(x.norm)
Jarque Bera Test
data: x.norm
X-squared = 0.539, df = 2, p-value = 0.7638

A test proposed by Cucconi (an Italian statistician) allows one to test normality without the problem of estimating parameters from the sample data.

It can be demonstrated that, if the data come from a normal population, the transformed values y computed below have a standard normal distribution. We can employ the Kolmogorov-Smirnov test to check this hypothesis. Here is an example of R code:

zz<-rnorm(n=200,m=0,sd=1) ## sampling random numbers from N(0,1)
r<-zz[200]
q<-sd(zz[-200])
m<-mean(x.norm)
s<-sqrt(var(x.norm))
y<-q*((x.norm-m)/s)+(r/sqrt(200))
ks.test(y,"pnorm",m=0,sd=1)
One-sample Kolmogorov-Smirnov test
data:  y
D = 0.0298, p-value = 0.9943
alternative hypothesis: two.sided

A package called nortest (it should be downloaded from the CRAN website) allows us to perform five different normality tests:

1) sf.test() performs Shapiro-Francia test:

library(nortest) ## package loading
sf.test(x.norm)
Shapiro-Francia normality test
data: x.norm
W = 0.9926, p-value = 0.3471

2) ad.test() performs the Anderson-Darling test:

It is a modification of the Kolmogorov-Smirnov test and it makes use of the specific distribution in calculating critical values. Currently, tables of critical values are available for the normal, lognormal, exponential, Weibull, extreme value type I, and logistic distributions. The Anderson-Darling test is based on the statistic

A2 = −n − (1/n) Σ (2i−1)[ln F(X(i)) + ln(1 − F(X(n+1−i)))], summed over i = 1, …, n,

where n is the sample size, X(i) denotes the ordered data, and F(x) is the theoretical CDF (in our case the normal CDF). In the R environment this test can be used only to check normality:

library(nortest) ## package loading
ad.test(x.norm)
Anderson-Darling normality test
data: x.norm
A = 0.4007, p-value = 0.3581

3) cvm.test() performs the Cramer-von Mises test, based on the statistic W2:

library(nortest) ## package loading
cvm.test(x.norm)
Cramer-von Mises normality test
data: x.norm
W = 0.0545, p-value = 0.4449

4) lillie.test() performs Lilliefors test:

It is a modification of the Kolmogorov-Smirnov test, which can't be used to test normality when the mean and standard deviation of the hypothesized normal distribution are not known (i.e., they are estimated from the sample data); it is particularly useful in the case of small samples. The Lilliefors test evaluates the hypothesis that X has a normal distribution with unspecified mean and variance, against the alternative that X does not have a normal distribution. This test compares the empirical distribution of X with a normal distribution having the same mean and variance as X. It is similar to the Kolmogorov-Smirnov test, but it adjusts for the fact that the parameters of the normal distribution are estimated from X rather than specified in advance.

library(nortest) ## package loading
lillie.test(x.norm)
Lilliefors (Kolmogorov-Smirnov) normality test
data: x.norm
D = 0.0414, p-value = 0.5509

5) pearson.test() performs Pearson's chi-square test: it is the same χ² test treated previously, used here to test the goodness-of-fit of a normal distribution:

library(nortest) ## package loading
pearson.test(x.norm)
Pearson chi-square normality test
data: x.norm
P = 10.12, p-value = 0.753

Appendix

List of R statements useful in fitting distributions. The package that includes each statement is given in parentheses.

ad.test(): Anderson-Darling test for normality (nortest)
chisq.test(): chi-squared test (stats)
cut(): divides the range of a data vector into intervals (base)
cvm.test(): Cramer-von Mises test for normality (nortest)
ecdf(): computes an empirical cumulative distribution function (stats)
fitdistr(): Maximum-likelihood fitting of univariate distributions (MASS)
goodfit(): fits a discrete (count data) distribution for goodness-of-fit tests (vcd)
hist(): computes a histogram of the given data values (graphics)
jarque.bera.test(): Jarque-Bera test for normality (tseries)
ks.test(): Kolmogorov-Smirnov test (stats)
kurtosis(): returns value of kurtosis (fBasics)
lillie.test(): Lilliefors test for normality (nortest)
mle(): estimate parameters by the method of maximum likelihood (stats4)
pearson.test(): Pearson chi-square test for normality (nortest)
plot(): generic function for plotting R objects (graphics)
qqnorm(): produces a normal QQ plot (stats)
qqline(): adds a reference line to a QQ plot (stats); qqplot(): produces a QQ plot of two datasets (stats)
sf.test(): Shapiro-Francia test for normality (nortest)
shapiro.test(): Shapiro-Wilk test for normality (stats)
skewness(): returns value of skewness (fBasics)
table(): builds a contingency table (base)


Jeffrey Strickland

Authored by:
Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the Author of “Predictive Analytics Using R” and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional. He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

  • Operations Research using Open-Source Tools
  • Discrete Event simulation using ExtendSim
  • Crime Analysis and Mapping
  • Missile Flight Simulation
  • Mathematical Modeling of Warfare and Combat Phenomenon
  • Predictive Modeling and Analytics
  • Using Math to Defeat the Enemy
  • Verification and Validation for Modeling and Simulation
  • Simulation Conceptual Modeling
  • System Engineering Process and Practices
  • Weird Scientist: the Creators of Quantum Physics
  • Albert Einstein: No one expected me to lay a golden eggs
  • The Men of Manhattan: the Creators of the Nuclear Era
  • Fundamentals of Combat Modeling
  • LinkedIn Memoirs
  • Quantum Phaith
  • Dear Mister President
  • Handbook of Handguns
  • Knights of the Cross: The True Story of the Knights Templar


2 replies

  1. Thank you Dr. Strickland for a great blog post. I am having trouble with your code in a few places and I am wondering if the error is mine. I have listed the issues below:

    1) Method of Maximum Likelihood:
    >est gf summar
    Error: object ‘summar’ not found
    > summary
    standardGeneric for “summary” defined from package “base”

    function (object, …)
    standardGeneric(“summary”)

    Methods may be defined for arguments: object
    Use showMethods(“summary”) for currently available ones.

    2b) Goodness-of-Fit Methods:
    ## computing relative expected frequencies
    >pchisq.test(x=f.os,p=p) ## chi-square test
    Error in chisq.test(x = f.os, p = p) : probabilities must sum to 1.

    3) Normality Tests
    Is the code below for the test proposed by Cucconi?

    zz<-rnorm(n=200,m=0,sd=1) ## sampling random numbers from N(0,1)
    r<-zz[200]
    q<-sd(zz[-200])
    m<-mean(x.norm)
    s<-sqrt(var(x.norm))
    y<-q*((x.norm-m)/s)+(r/sqrt(200))
    ks.test(y,"pnorm",m=0,sd=1)

    Sincerely,

    -Marc d. Paradis


    • Sometimes the code gets "corrupted" when it is processed by WordPress for posting. I am working on the issue, but your corrections are probably right.

