Terms used in Statistical Analysis
Statistics is the scientific discipline which provides methods to help people make sense of data. It is a set of methods used to analyze data, and one of its goals is to extract information from data to get a better understanding of the situations the data represent. Hence, statistics can be thought of as the science of learning from data.
Statistics is a mathematical science which includes methods of collecting, organizing, analyzing, and summarizing data in such a way that meaningful conclusions can be drawn from them. In the present-day environment, a familiarity with statistical techniques and statistical literacy is necessary for the people working in the iron and steel industry.
In general, investigations and analyses of statistics fall into two broad categories called descriptive and inferential statistics. Descriptive statistics deals with the processing of data without attempting to draw any inferences from it. It involves the tabulating, depicting, and describing of collections of data. The data are presented in the form of tables and graphs, and their characteristics are described in simple terms. The data can be either quantitative or qualitative. Descriptive statistics provides a picture or description of the properties of the data collected in order to summarize them into a manageable form. Events that are dealt with include everyday happenings.
Inferential statistics is a scientific discipline which uses mathematical tools to make forecasts and projections by analyzing the given data. It is a formalized body of techniques which infer the properties of a larger collection of data (the population) from the inspection of a sample of that collection. These techniques build on descriptive statistics as they generalize the properties of samples to various populations.
Design and analysis statistics have been developed for the discovery and confirmation of causal relationships among variables. They use a variety of statistical tests related to aspects such as prediction and hypothesis testing. Experimental analysis is related to comparisons, variance, and ultimately testing whether differences between variables are significant. These two types of statistics are normally either parametric or non-parametric.
The widespread use of statistical analyses in diverse fields has led to increased recognition that statistical literacy, which is a familiarity with the goals and methods of statistics, is important for the people involved in designing, engineering, constructing, operating, and maintaining an iron and steel plant as well as marketing its products. This is because the field of statistics teaches how to make intelligent judgments and informed decisions in the presence of uncertainty and variation. Various terms used in statistical analysis along with their definitions are given below.
Absolute value (modulus) – The absolute value is the value of a number, disregarding its sign. It is denoted by a pair of ‘|’ signs. For example, the modulus of –2.5 is |-2.5| = 2.5.
Accelerated failure-time model – It is a type of regression model in survival analysis in which the study endpoint is the natural logarithm of survival time. It necessitates knowledge of the probability distribution for survival time, since the estimation method is maximum likelihood.
Accuracy – It is the degree to which some estimate matches the true state of nature.
Adjusted means – These are the estimated means on a study endpoint for different groups after controlling for the groups’ different distributions on the quantitative control variables in an ANCOVA (analysis of covariance) model.
Aggregate – It is the value of a single variable which summarizes, adds, or represents the mean of a group or collection of data.
Aggregation – It is the compounding of primary data, normally for the purpose of expressing them in summary form.
Alpha reliability (also known as Cronbach’s alpha) – It is a measure ranging from 0 to 1 which represents the proportion of a composite measure (i.e., a sum of individual items) which consists of a stable underlying attribute.
Alpha-level for a test – It is the criterion probability which is compared to the ‘p’ value to determine whether the null hypothesis is to be rejected or not. The normal alpha level is 0.05.
Analysis of covariance (ANCOVA) – It is a statistical technique for equating groups on one or more variables when testing for statistical significance using the F-test statistic (An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis). It adjusts scores on a dependent variable for initial differences on other variables, such as pre-test performance or IQ (IQ tests are made to have an average score of 100).
Analysis of variance (ANOVA) – It is a statistical technique for determining the statistical significance of differences among means. It can be used with two or more groups and uses the F-test statistic.
Approximation error – In general, it is an error which arises from making a rough calculation, estimate, or guess. In numerical calculations, approximation errors result from rounding, for example when pi (3.1416 to four decimal places) is approximated by 22/7 (3.1429 to four decimal places).
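The size of an approximation error can be checked numerically. The sketch below is only an illustration; the helper name `approximation_error` is hypothetical. It compares the fraction 22/7 with the value of pi held by Python's `math` module.

```python
import math

def approximation_error(approx, true_value):
    """Return (absolute error, relative error) of an approximation."""
    abs_err = abs(approx - true_value)
    rel_err = abs_err / abs(true_value)
    return abs_err, rel_err

abs_err, rel_err = approximation_error(22 / 7, math.pi)
print(f"22/7 = {22 / 7:.4f}, pi = {math.pi:.4f}")
print(f"absolute error = {abs_err:.6f}, relative error = {rel_err:.4%}")
```

The absolute error is a little over 0.001, which is why 22/7 agrees with pi only to two decimal places.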
Arithmetic mean – It is the result of summing all measurements from a population or sample and dividing by the number of population or sample members. The arithmetic mean is also called the average, which is a measure of central tendency.
Area under the curve or AUC (also known as concordance index) – In logistic regression, the area under the ROC (receiver operating characteristic) curve. It represents the likelihood that a case will have a higher predicted probability of the event than a control across the range of criterion probabilities. In Cox regression, it is referred to as the concordance, or ‘c’ index and serves the comparable function of indicating the predictive power of the model.
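The concordance interpretation of the AUC can be computed directly by comparing every case with every control. The following is a minimal sketch under stated assumptions: the predicted probabilities are invented for illustration, the function name `concordance_auc` is hypothetical, and ties are counted as one half, which is a common convention rather than something stated above.

```python
def concordance_auc(case_scores, control_scores):
    """Share of case/control pairs in which the case has the higher
    predicted probability. Ties count as one half (an assumption)."""
    pairs = 0
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            pairs += 1
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / pairs

# Hypothetical predicted probabilities from a fitted model.
cases = [0.9, 0.8, 0.55, 0.4]
controls = [0.7, 0.5, 0.3, 0.2]
auc = concordance_auc(cases, controls)
print(f"AUC (concordance) = {auc:.4f}")
```

A value of 0.5 corresponds to a model with no predictive power, while 1.0 means every case is ranked above every control.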
Association – It is a relationship between two variables in which the distribution of the first variable changes over the levels of the second variable. That is, the second variable appears to influence the distribution of the first variable. If the second variable causes the first, then the two variables are associated.
Assumptions – Statistical inference normally involves using a sample to estimate the parameters of a model. The conclusions, i.e., the validity of the estimates, only hold if certain assumptions are true. For example, a sample of 50 years of rainfall data can be used to estimate the parameters of a normal model. The assumptions are then that (i) the 50 years behave like a random sample, (they are not random, they are 50 successive years), (ii) the data are from a single population, i.e., there is no climate change, and (iii) the population has a normal distribution.
Autoregressive model – It is a regression model in which one of the explanatory variables is an earlier measurement of the study endpoint.
Autoregressive integrated moving average (ARIMA) – This statistic is a Box-Jenkins approach to time series analysis. It tests for changes in the data patterns pre-intervention and post-intervention within the context of analyzing the outcomes of a time series design.
Average – For a numeric variable the average is a loosely used term for a measure of location. It is normally taken to be the mean, but it can also denote the median, the mode, among other things.
Average causal effect – It is the average of the causal effects for all cases in the population.
Bar chart – It is a diagram for showing the frequencies of a variable which is categorical or discrete. The lengths of the bars are proportional to the frequencies or the percentages. The widths of the bars are to be equal. Fig 1 shows a bar chart.
Fig 1 Bar chart
Before-after study – It is a study wherein data are collected prior to and following an event, treatment, or action. The event, treatment, or action applied between the two periods is thought to affect the data under investigation. The purpose of this type of study is to show a relationship between the data and the event, treatment, or action. In an experimental study, all other factors are either randomized or controlled.
Beginning of observation – In survival analysis, it is the moment in time when subjects begin to be followed by the person carrying out the study.
Between-subjects variable – It is a variable in repeated-measures ANOVA or linear mixed modelling which does not change over time for a given subject but takes on different values for different subjects.
Bias – In problems of estimation of population parameters, an estimator is said to be biased if its expected value does not equal the parameter it is intended to estimate. In sampling, a bias is a systematic error introduced by selecting items non-randomly from a population when the selection is assumed to be random. A survey question can be biased if it is poorly phrased.
Biased estimator – It is a sample statistic which is an inaccurate estimator of the corresponding population parameter, in particular, the mean of its sampling distribution is not equal to the population parameter.
Binomial distribution – The binomial distribution is used to model data from categorical variables, when there are just two categories, or levels.
Binomial test – It is an exact test of the statistical significance of deviations from a theoretically expected distribution of observations into two categories.
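An exact binomial test can be sketched with the standard library alone. The version below is an illustration, not a definitive implementation. It assumes a two-sided test in which the 'p' value is the total probability of all outcomes which are no more likely than the observed one (one of several conventions in use). The function names and the example counts are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_test_two_sided(k, n, p=0.5):
    """Exact two-sided p-value: sum the probabilities of all outcomes
    which are no more likely than the observed count k."""
    observed = binomial_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        prob = binomial_pmf(i, n, p)
        # Small tolerance guards against floating-point ties.
        if prob <= observed * (1 + 1e-9):
            total += prob
    return min(1.0, total)

# Hypothetical data: 8 observations in the first category out of 10,
# tested against an expected proportion of 0.5.
p_value = binomial_test_two_sided(8, 10, 0.5)
print(f"p-value = {p_value:.6f}")
```

With these counts the 'p' value is above 0.05, so the null hypothesis of equal proportions is not rejected at the normal alpha level.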
Bivariate statistics – It is the statistical procedures for testing and assessing the association between two different variables.
Block diagram – It consists of vertically placed rectangles on a common base line, normally the height of the rectangles being proportional to a quantitative variable.
Bonferroni post-hoc test – It is a multiple-comparison procedure allowing the person carrying out the study to test differences between pairs of group means without incurring capitalization on chance.
Boxplot (or ‘box and whisker’ plot) – It is a graphical representation of numerical data, based on the five-number summary and introduced by Tukey in 1970. The diagram has a scale in one direction only. A rectangular box is drawn, extending from the lower quartile to the upper quartile, with the median shown dividing the box. ‘Whiskers’ are then drawn extending from the end of the box to the greatest and least values. Multiple boxplots, arranged side by side, can be used for the comparison of several samples. In refined boxplots, the whiskers have a length not exceeding 1.5 times the interquartile range. Any values beyond the ends of the whiskers are shown individually as outliers. Sometimes any values further than 3 times the interquartile range are indicated with a different symbol as extreme outliers. Fig 2 shows boxplot.
Fig 2 Boxplot
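The quantities drawn in a refined boxplot can be computed as follows. This is a sketch under stated assumptions: the data are invented, the function name `boxplot_summary` is hypothetical, and the quartiles use the 'inclusive' method of Python's `statistics.quantiles`. Other quartile conventions exist and can give slightly different values.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Five-number summary, whisker ends, and outliers for a boxplot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme values still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "five_number": (data[0], q1, median, q3, data[-1]),
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical sample with one unusually large value.
summary = boxplot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

In this sample the value 30 lies more than 1.5 times the interquartile range above the upper quartile, so it is reported as an outlier and the upper whisker stops at 9.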
Capitalization on chance – It is the situation in which performing multiple tests of hypothesis raises the probability of rejecting a true null hypothesis beyond the alpha level desired for the group of tests.
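The inflation of the error rate can be illustrated with a short calculation. The sketch below assumes that the tests are independent, which is an assumption made only for this example, and uses hypothetical function names. It also shows the Bonferroni remedy of dividing the alpha level by the number of tests.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of rejecting at least one true null hypothesis
    among n independent tests, each carried out at level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha level under the Bonferroni correction."""
    return alpha / n_tests

alpha, n_tests = 0.05, 10
uncorrected = familywise_error_rate(alpha, n_tests)
corrected = familywise_error_rate(bonferroni_alpha(alpha, n_tests), n_tests)
print(f"uncorrected familywise error rate = {uncorrected:.4f}")
print(f"Bonferroni-corrected rate = {corrected:.4f}")
```

With ten tests at the 0.05 level, the chance of at least one false rejection is around 40 %. The corrected per-test level of 0.005 brings it back below 0.05.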
Cases – These are the units of analysis in one’s study. In logistic regression, however, the term also refers to the units of analysis which have experienced the event of interest.
Categorical variable – It is a variable with values which range over categories, rather than being numerical. Examples include gender (male, female), paint colour (red, white, blue), type of animal (elephant, leopard, lion). Some categorical variables are ordinal.
Causal effect – It is the difference between the study endpoint’s value if a subject experiences the treatment condition vs. its value if the same subject were to experience the control condition instead. This is a counter-factual definition, since it is impossible to observe the same subject under both conditions.
Censored cases – In survival analysis, these are cases with incompletely observed survival times. Right censoring occurs when the subject has not yet experienced the event by the end of the observation period. Left censoring occurs when subjects have already experienced the event by the beginning of observation.
Centre of a distribution – It is the typical or average value in a variable’s distribution.
Central limit theorem – This result explains why the normal distribution is so important in statistics. Frequently, it is desired to use the sample mean ‘x’ to estimate the mean, ‘mu’, of the population. The central limit theorem says that, as long as the sample size is reasonably large, the distribution of ‘x’ about ‘mu’ is roughly normal, whatever the distribution of the data.
Check sheets – In the early stages of process improvement, it becomes frequently necessary to collect either historical or current operating data about the process under investigation. This is a common activity in the measure step of DMAIC (define, measure, analyze, improve, and control). A check sheet can be very useful in this data collection activity. The check sheet was developed by an aerospace company engineer who was investigating defects which occurred on one of the company’s tanks. The time-oriented summary is particularly valuable in looking for trends or other meaningful patterns.
Chi-squared test – It is a test of hypothesis used for testing the association between two qualitative variables or for testing model utility for models estimated with maximum likelihood.
Class – It consists of observations grouped according to convenient divisions of the variate range, normally to simplify subsequent analysis (the upper and lower limits of a class are called class boundaries, the interval between them the class interval, and the frequency falling into the class is the class frequency).
Classification table – In logistic regression, it is a table showing the cross-tabulation of a subject’s actual status as case or control with the model’s prediction of whether that subject is a case or control.
Clinical significance – It is the condition in which sample results which are statistically significant are also clinically meaningful.
Cluster sampling – It is a type of sampling whereby observations are selected at random from several clusters instead of at random from the entire population. It is intended that the heterogeneity in the phenomenon of interest is reflected within the clusters, i.e., members in the clusters are not homogenous with respect to the response variable. Cluster sampling is less satisfactory from a statistical stand-point, but frequently can be more economical and / or practical.
Coefficient of determination – It is the square of the correlation coefficient (‘r’). It indicates the degree of relationship strength by potentially explained variance between two variables.
Confidence interval – It is an interval of numbers which people are very confident contains the true value of a population parameter. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter. The width of the confidence interval gives an idea of how uncertain people are about the unknown parameter. A very wide interval can indicate that more data need to be collected before an effective analysis can be undertaken.
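A confidence interval for a population mean can be sketched as below. This is a simplified illustration: it uses the normal approximation (mean plus or minus z times the standard error), which is an assumption suited to reasonably large samples. For small samples a t-distribution multiplier is normally used instead. The data and the function name are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95 % confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical measurements.
sample = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.1, 9.9,
          10.2, 10.0, 9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 10.0, 9.6]
low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
print(f"95 % interval: ({low95:.3f}, {high95:.3f})")
print(f"99 % interval: ({low99:.3f}, {high99:.3f})")
```

Asking for more confidence gives a wider interval, and collecting more data narrows it since the standard error shrinks with the square root of the sample size.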
Continuous variable – A numeric variable is continuous if the observations can take any value within an interval. Variables such as height, weight, and temperature are continuous. In descriptive statistics the distinction between discrete and continuous variables is not very important. The same summary measures, like mean, median, and standard deviation can be used. There is frequently a bigger difference once inferential methods are used in the analysis. The model which is assumed to generate a discrete variable is different to models which are appropriate for a continuous variable. Hence different parameters are estimated and used.
Control chart – It is a graphical device used to display the results of small scale, repeated sampling of a manufacturing process, normally showing the average value, together with upper and lower control limits between which a stated portion of the sample statistics falls. Fig 3 gives a control chart.
Fig 3 Control chart
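The centre line and control limits of such a chart can be sketched as follows. This is a deliberate simplification and an assumption of this example: the limits are set at three standard deviations of the baseline sample averages, whereas charts used in practice frequently estimate the spread from within-sample ranges. The data and function names are hypothetical.

```python
import statistics

def control_limits(subgroup_means, n_sigma=3.0):
    """Return (lower limit, centre line, upper limit) from baseline
    sample averages, using n_sigma standard deviations."""
    centre = statistics.fmean(subgroup_means)
    sigma = statistics.stdev(subgroup_means)
    return centre - n_sigma * sigma, centre, centre + n_sigma * sigma

def out_of_control(points, lcl, ucl):
    """Indices of points which fall outside the control limits."""
    return [i for i, x in enumerate(points) if x < lcl or x > ucl]

# Hypothetical baseline sample averages from a stable process.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lcl, centre, ucl = control_limits(baseline)

# Hypothetical new sample averages to be checked against the limits.
new_points = [10.0, 10.9, 9.95]
flagged = out_of_control(new_points, lcl, ucl)
print(f"LCL = {lcl:.3f}, centre = {centre:.3f}, UCL = {ucl:.3f}")
print("out-of-control indices:", flagged)
```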
Controls – In logistic regression, these are the units of analysis which have not experienced the event of interest.
Correlation coefficient – It is a decimal number between –1.00 and +1.00 which indicates the degree to which two quantitative variables are related. The most common one used is the ‘Pearson Product Moment’ correlation coefficient or just the Pearson coefficient. It indicates the strength and direction of linear association between two quantitative variables.
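The Pearson coefficient can be computed from the deviations of each variable about its mean. The sketch below uses invented data and a hypothetical function name. Squaring the result gives the coefficient of determination.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

The sign of 'r' gives the direction of the linear association and its magnitude gives the strength.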
Cox regression model (also known as proportional hazards model) – It is the most commonly used regression model for survival data. The response variable is the log of the hazard function.
Cronbach’s alpha coefficient – It is a coefficient of consistency which measures how well a set of variables or items measures a single, unidimensional, latent construct in a scale or inventory. Alpha scores are conventionally interpreted as (i) high – 0.90 and above, (ii) medium – 0.70 to 0.89, and (iii) low – 0.55 to 0.69.
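Cronbach's alpha is normally computed from the item variances and the variance of the total score, as alpha = k/(k – 1) x (1 – sum of item variances / variance of totals), where 'k' is the number of items. The sketch below applies this standard formula to invented scores. The function name is hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: list of k lists, one per item, each holding the scores
    of the same n respondents in the same order."""
    k = len(items)
    item_variances = sum(statistics.variance(item) for item in items)
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - item_variances / statistics.variance(totals))

# Hypothetical scores of five respondents on three items.
items = [
    [2, 3, 4, 5, 4],
    [3, 3, 5, 5, 4],
    [2, 4, 4, 5, 3],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.3f}")
```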
Cross tabulation table (also known as contingency table) – It is a table displaying the association between two qualitative variables.
Cumulative frequency distribution – It is a graphic depiction of how many times groups of scores appear in a sample.
Cumulative frequency graph – For a numerical variable, the cumulative frequency corresponding to a number ‘x’ is the total number of observations which are less than or equal to ‘x’. The y-axis of this graph (Fig 4) can show the frequency, the proportion or the percentage. With the percentage, this graph allows any percentile to be read from the graph. For example, in the graph (Fig 4), the 25 % point (lower quartile) is about 940 mm. The 90 % point is around 1,300 mm.
Fig 4 Cumulative frequency graph
Data – Data consist of numbers, letters, or special characters representing measurements of the properties of one’s analytic units, or cases, in a study. Data are the raw material of statistics.
Decile – Deciles are used to divide a numeric variable into 10ths, whereas the quartiles divide it into quarters, and percentiles into 100ths. An approximate value for the ‘r’th decile can be read from a cumulative frequency graph as the value corresponding to a cumulative relative frequency of 10r %. So, the 5th decile is the median and the second decile is the 20th percentile. The term decile was introduced by Galton in 1882.
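The relationship between deciles, percentiles, and the median can be checked with Python's `statistics.quantiles`. The sketch below uses an invented sample and the function's default interpolation method. Other interpolation conventions can give slightly different cut points.

```python
import math
import statistics

# Hypothetical sample of 20 measurements.
data = [12, 15, 17, 18, 21, 22, 24, 25, 27, 30,
        31, 33, 36, 38, 40, 41, 45, 47, 52, 60]

# n=10 returns the nine cut points which split the data into 10ths.
deciles = statistics.quantiles(data, n=10)
# n=100 returns the 99 percentile cut points.
percentiles = statistics.quantiles(data, n=100)

fifth_decile = deciles[4]
second_decile = deciles[1]
print("5th decile:", fifth_decile, "median:", statistics.median(data))
print("2nd decile:", second_decile, "20th percentile:", percentiles[19])
```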
Dependent t-test – It is a data analysis procedure which assesses whether the means of two related groups are statistically different from each other, for example, one group’s mean score (time one) compared with the same group’s mean score (time two). It is also called the paired samples t-test.
Degrees of freedom – It is a technical term reflecting the number of independent elements comprising a statistical measure. Certain distributions need a degrees of freedom value to fully characterize them.
Descriptive statistics – It is the body of statistical techniques concerned with describing the salient features of the variables used in one’s study. If one has a large set of data, then descriptive statistics provides graphical (e.g., boxplots) and numerical (e.g., summary tables, means, quartiles) ways to make sense of the data. The branch of statistics devoted to the exploration, summary, and presentation of data is called descriptive statistics. If people need to do more than descriptive summaries and presentations, they are to use the data to make inferences about some larger population. Inferential statistics is the branch of statistics devoted to making generalizations.
Deviation score – It is the difference between a variable’s value and the mean of the variable.
Directional conclusion – It is a conclusion in a two-tailed test which uses the nature of the sample results to suggest where the true parameter lies in relation to the null hypothesized value.
Discrete variable – A set of data is discrete if the values belonging to it are distinct, i.e., they can be counted. Examples are the number of children in a family, the number of rainy days in the month, or the length (in days) of the longest dry spell in the growing season. A discrete variable is measured on the nominal or ordinal scale, and can assume a finite number of values within an interval or range. Discrete variables are less informative than are continuous variables.
Dispersion – It is the degree of scatter or concentration of observations around its centre or middle. Dispersion is normally measured as a deviation around some central value such as the mean, standard, or absolute deviation, or by an order statistic such as deciles, quintiles, and quartiles.
Distribution – The set of frequencies or probabilities assigned to different outcomes of a particular event or trial.
Dispersion of a distribution – It is the degree of spread shown by a variable’s values, typically assessed with the standard deviation.
Distribution function – It is the function, denoted F (x), which gives the cumulative frequency or probability that random variable ‘X’ takes on a value less than or equal to ‘x’.
Distribution (or probability distribution) of a variable – It is the collection of all values of a variable along with their associated probabilities of being observed.
Dot plot – A dot-plot is an alternative to a boxplot where each value is recorded as a dot. It is used when there are only a few data values. The dots can be jittered, so each value is made visible. They can alternatively be stacked, to produce a simple histogram.
Dummy variable – It is a variable in a regression model coded 1 if the case falls into a certain category of an explanatory variable and 0 otherwise. Used to represent qualitative predictors in a regression model.
Effect size – It is a measure of the strength of a relationship between two variables. Effect size statistics are used to assess comparisons between correlations, percentages, mean differences, probabilities, and so on.
Efficiency (statistical) – A statistical estimator or estimate is said to be efficient if it has small variance. In the majority of cases, a statistical estimate is preferred if it is more efficient than alternative estimates. It can be shown that the Cramer-Rao bound represents the best possible efficiency (lowest variance) for an unbiased estimator. That is, if the variance of an unbiased estimator is shown to attain the Cramer-Rao bound, then there are no other unbiased estimators which are more efficient. It is possible in some cases to find a more efficient estimate of a population parameter which is biased.
Error – In a statistical interpretation, the word ‘error’ is used to denote the difference between an observed value and its ‘expected’ value as predicted or explained by a model. In addition, errors occur in data collection, sometimes resulting in outlying observations. Finally, type I and type II errors refer to specific interpretive errors made when analyzing the results of hypothesis tests.
Estimation – Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population. The results of estimation can be expressed as a single value, known as a point estimate. It is normal to also give a measure of precision of the estimate. This is called the standard error of the estimate. A range of values, known as a confidence interval can also be given.
Estimator – An estimator is a quantity calculated from the sample data, which is used to give information about an unknown quantity (normally a parameter) in the population. For example, the sample mean is an estimator of the population mean. Estimators of population parameters are sometimes distinguished from the true (but unknown) population value, by using the symbol ‘hat’.
Eta – It is an index which indicates the degree of a curvilinear relationship.
Exogenous variables – An exogenous variable in a statistical model refers to a variable whose value is determined by influences outside of the statistical model. An assumption of statistical modelling is that explanatory variables are exogenous. When explanatory variables are endogenous, problems arise when using these variables in statistical models.
Expectation – It is the expected or mean value of a random variable, or function of that variable such as the mean or variance.
Exponential – It is a constant raised to the power of a variable ‘x’. The function F (x) = a^x is an exponential function.
Exponential distribution – The exponential distribution is a continuous distribution, and is typically used to model life cycles or decay of materials or events.
Exponential smoothing – It is a time series regression in which recent observations are given more weight by way of exponentially decaying regression coefficients.
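The weighting idea can be illustrated with simple (single) exponential smoothing, in which each smoothed value is a weighted average of the newest observation and the previous smoothed value. This is a minimal sketch and not the only form of the technique. The smoothing constant `alpha`, the starting rule (first smoothed value equals the first observation), and the function name are assumptions of this example.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.
    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def observation_weights(alpha, n_lags):
    """Weight given to the observation 0, 1, 2, ... steps back."""
    return [alpha * (1 - alpha) ** lag for lag in range(n_lags)]

smoothed = exponential_smoothing([10, 12, 14], alpha=0.5)
weights = observation_weights(0.5, 4)
print(smoothed)
print(weights)
```

The weights fall by the factor (1 – alpha) at each step back in time, which is the exponential decay referred to in the definition.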
F ratio – It is the ratio of two independent unbiased estimates of variance of a normal distribution. It has widespread application in the analysis of variance (ANOVA).
F-test – It is a parametric statistical test of the equality of the means of two or more samples. It compares the means and variances between and within groups over time. It is also called analysis of variance (ANOVA).
Factor analysis – It is a statistical method for reducing a set of variables to a smaller number of factors or basic components in a scale or instrument being analyzed. Two main forms are exploratory (EFA) and confirmatory factor analysis (CFA).
Fixed effect – It is an unobserved characteristic of subjects which is both a predictor of the study endpoint and correlated with one or more explanatory variables in a regression model. Left unaddressed, it leads to biased regression estimates.
Five-number summary – It is the summary for a numeric variable, the least value (minimum), the lower quartile, the median, the upper quartile, and the greatest value (maximum), in that order. These are shown graphically in a boxplot.
Fisher’s exact test – It is a non-parametric statistical significance test used in the analysis of contingency tables where sample sizes are small. The test is useful for categorical data which result from classifying objects in two different ways. It is used to examine the significance of the association (contingency) between two kinds of classifications.
Frequencies – The frequency is the number of times that particular values are obtained in a variable. For example, with the data: dry, rain, dry, dry, dry, rain, rain, dry, the frequency of ‘rain’ is 3 and the frequency of ‘dry’ is 5. The sum of the frequencies is the number of observations, ‘n’, in the variable. In this example, n = 8. The percentage is (100*frequency/n), = 100* 3/8 = 37.5 % for rain, in this example. With as few as 8 values percentages are not normally used, instead state 3 out of the 8 values are rain.
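The rain and dry example can be reproduced with a few lines of code. The sketch uses `collections.Counter` from the Python standard library. The variable names are chosen only for illustration.

```python
from collections import Counter

observations = ["dry", "rain", "dry", "dry", "dry", "rain", "rain", "dry"]

frequencies = Counter(observations)
n = sum(frequencies.values())
# Percentage of each value: 100 * frequency / n.
percentages = {value: 100 * count / n for value, count in frequencies.items()}

print(frequencies)
print("n =", n)
print(percentages)
```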
Friedman two-way analysis of variance – It is a non-parametric inferential statistic which is used to compare two or more groups by ranks which are not independent.
Gamma distribution – The gamma distribution includes as special cases the chi-square distribution and the exponential distribution. It has several important applications. In Bayesian inference, for example, it is sometimes used as the a priori distribution for the parameter (mean) of a Poisson distribution.
Gantt chart – It is a bar chart showing actual performance or output expressed as a percentage of a quota or planned performance per unit of time. Fig 5 shows a Gantt chart.
Fig 5 Gantt chart
Gaussian distribution – The gaussian distribution is another name for the normal distribution.
Goodness of fit – Goodness of fit describes a class of statistics used to assess the fit of a model to observed data. There are several measures of goodness of fit which include the coefficient of determination, the F-test, the chi-square test for frequency data, and numerous other measures. It is to be noted that goodness can refer to the fit of a statistical model to data used for estimation, or data used for validation.
Greek letters – In statistics, Greek letters are used for the parameters of the population and for a few other things.
Growth-curve modelling (also known as the linear mixed model) – It is a regression analysis in which the response variable is the trajectory of change over time in a quantitative study end point. Interest centres on describing the average trajectory of change, as well as what subject characteristics lead to different trajectories of change for different types of subjects.
Heterogeneity – This term is used in statistics to describe samples or individuals from different populations, which differ with respect to the phenomenon of interest. If the populations are not identical then they are said to be heterogeneous, and by extension, the sample data is also said to be heterogeneous.
Heteroscedasticity – In regression analysis, the property that the conditional distributions of the response variable ‘Y’ for fixed values of the independent variables do not all have constant variance. Non-constant variance in a regression model results in inflated estimates of model mean square error. Standard remedies include transformations of the response, and / or employing a generalized linear model.
Histogram – A histogram is a graphical representation (bar chart) of the distribution of data. It is an estimate of the probability distribution of a continuous variable and was first introduced by Karl Pearson. A histogram is a representation of tabulated frequencies, shown as adjacent rectangles, erected over discrete intervals (bins), with an area equal to the frequency of the observations in the interval. The height of a rectangle is also equal to the frequency density of the interval, i.e., the frequency divided by the width of the interval. The total area of the histogram is equal to the quantity of data. Fig 6 shows different patterns of histograms.
Fig 6 Patterns of histograms
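The relation between frequency, interval width, and frequency density in a histogram can be sketched as below. The data, the interval boundaries, and the function name are hypothetical. The example assumes that each interval includes its lower boundary and that only the last interval includes its upper boundary.

```python
def frequency_density(values, bin_edges):
    """Return one (left, right, frequency, density) row per interval,
    where density = frequency / interval width."""
    rows = []
    last_edge = bin_edges[-1]
    for left, right in zip(bin_edges, bin_edges[1:]):
        count = sum(
            1 for v in values
            if left <= v < right or (right == last_edge and v == right)
        )
        rows.append((left, right, count, count / (right - left)))
    return rows

# Hypothetical data with intervals of unequal width.
values = [1, 2, 2, 3, 5, 6, 7, 12, 15, 18]
rows = frequency_density(values, [0, 5, 10, 20])
for left, right, count, density in rows:
    print(f"{left}-{right}: frequency {count}, density {density}")

# Area of each rectangle = density * width; the areas sum to n.
total_area = sum(density * (right - left) for left, right, _, density in rows)
print("total area =", total_area)
```

With unequal interval widths, it is the height as frequency density (not the raw frequency) which keeps the area of each rectangle equal to the frequency.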
Homogeneity – This term is used in statistics to describe samples or individuals from populations, which are similar with respect to the phenomenon of interest. If the populations are similar then they are said to be homogenous, and by extension, the sample data is also said to be homogenous.
Homoscedasticity – In regression analysis, it is the property that the conditional distributions of ‘Y’ for fixed values of the independent variable all have the same variance.
Hypothesis – A statistical hypothesis is a hypothesis concerning the value of parameters or form of a probability distribution for a designated population or populations. More generally, a statistical hypothesis is a formal statement about the underlying mechanisms which generated some observed data.
Hypothesis test – Testing of hypotheses is a common part of statistical inference. To formulate a test, the question of interest is simplified into two competing hypotheses, between which one has a choice. The first is the null hypothesis, denoted by H0, against the alternative hypothesis, denoted by H1. For example, with 50 years of annual rainfall totals a hypothesis test can be whether the mean is different in El Nino and ordinary years. Then normally (i) the null hypothesis, H0, is that the two means are equal, i.e., there is no difference, and (ii) the alternative hypothesis, H1, is that the two means are unequal, i.e., there is a difference. If the 50 years are considered as being of three types, El Nino, ordinary, and La Nina then normally (i) the null hypothesis, H0, is that all three means are equal, and (ii) the alternative hypothesis, H1, is that there is a difference somewhere between the means. The hypotheses are frequently statements about population parameters.
Independent t-test – It is a statistical procedure for comparing measurements of mean scores in two different groups or samples. It is also called the independent samples t-test.
Inference – Inference is the process of deducing properties of the underlying distribution or population, by analysis of data. It is the process of making generalizations from the sample to a population.
Inferential statistics – The body of statistical techniques concerned with making inferences about a population based on drawing a sample from it.
Internal validity – It is the extent to which treatment-group differences on a study endpoint represent the causal effect of the treatment on the study endpoint.
Inter-quartile range (IQR) – The interquartile range is the difference between the upper and lower quartiles. If the lower and upper quartiles are denoted by Q1 and Q3, respectively, the interquartile range is (Q3 – Q1). The phrase ‘inter-quartile range’ was first used by Galton in 1882.
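As a minimal sketch of the calculation, the snippet below computes Q1, Q3, and the inter-quartile range for a small set of illustrative values (the numbers are made up for the example). It assumes Python's standard-library function statistics.quantiles with its default ‘exclusive’ method. Other quartile conventions can give slightly different values.

```python
import statistics

# Illustrative values only (not taken from any real study).
values = [12, 15, 11, 18, 13, 14, 18]

# quantiles(..., n=4) returns the three cut points: Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)

# The inter-quartile range is the difference between the quartiles.
iqr = q3 - q1
print(q1, q3, iqr)
```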
Jittered dot-plot – In a simple dot plot, the dots can overlap. In some cases, they can coincide completely, so obscuring some of the points. A solution is to randomly move the dots perpendicularly from the axis, to separate them from one another. This is called jittering. It results in a jittered dot-plot.
Joint probability – The joint probability is the joint density function of two random variables, or bivariate density.
Kaplan–Meier (also known as product-limit) estimator – It is a non-parametric estimator of the survival function in survival analysis.
Kendall’s coefficient of rank correlation – Denoted as ‘tau’ij, where i and j refer to two variables, Kendall’s coefficient of rank correlation reflects the degree of monotonic association between two ordinal variables, and is bounded between +1 for perfect positive correlation and –1 for perfect negative correlation.
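One way to see how Kendall's coefficient works is to count concordant and discordant pairs directly. The sketch below implements the simple ‘tau-a’ variant, which applies no correction for ties. The choice of this variant, the function name kendall_tau_a, and the example ranks are assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables order the pair the same way
        elif product < 0:
            discordant += 1      # the variables order the pair oppositely
        # tied pairs (product == 0) count as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of five items by two judges.
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [1, 3, 2, 4, 5]
tau = kendall_tau_a(ranks_a, ranks_b)
print(tau)
```

Here only one of the ten pairs is ordered differently by the two rankings, so the coefficient is close to, but below, +1.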
Least squares estimation – It is a technique of estimating statistical parameters from sample data whereby parameters are determined by minimizing the squared differences between model predictions and observed values of the response. The method can be regarded as possessing an empirical justification in that the process of minimization gives an optimum fit of observation to theoretical models, but for restricted cases such as normal theory linear models, estimated parameters have optimum statistical properties of unbiasedness and efficiency.
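The minimization described above can be sketched for the simplest case, a straight line with one explanatory variable, where the least squares estimates have a closed form. The function names and the data are hypothetical and serve only to illustrate the idea.

```python
def fit_line(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up observations lying roughly on a line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best = sum_squared_errors(x, y, b0, b1)
print(b0, b1, best)
```

Moving either coefficient away from the fitted value increases the sum of squared errors, which is what ‘least squares’ means in practice.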
Left skewed – It is said of distributions where the majority of the cases have high values of the variable, and a few outliers have very low values.
Left-truncated cases – These are subjects in survival analysis who have already been at risk for event occurrence for some time when they come under observation.
Levels – The levels are the number of categories in a categorical (factor) variable. The categorical variable for gender (male, female) has two levels. There are five levels in the variable with the categories (very bad, bad, middling, good, and very good).
Level of significance – The level of significance is the probability of rejecting a null hypothesis, when it is in fact true. It is also known as alpha, or the probability of committing a Type I error.
Likelihood function – It is the probability or probability density of obtaining a given set of sample values, from a certain population, when this probability or probability density is regarded as a function of the parameter(s) of the population and not as a function of the sample data. It is the formula for the probability of observing the collection of study end points observed in the sample, written as a function of the statistical model in question. Once a sample is collected, this formula is only influenced by the values of the coefficients in one’s model.
Likelihood-ratio chi-squared test – It is the counterpart of linear regression’s F test for logistic regression. This is a test of overall model utility.
Line graph – A line graph is a scatter plot where individual points are connected by a line. The line represents a sequence in time, space, or some other quantity. Where the graph also includes a category variable, a separate line can be drawn for each level of this variable. Fig 7 shows a line graph.
Fig 7 Line graph
Linear correlation – It is a somewhat ambiguous expression used to denote either (a) Pearson’s Product Moment Correlation in cases where the corresponding variables are continuous, or (b) a Correlation Coefficient on ordinal data such as Kendall’s Rank Correlation Coefficient. There are other linear correlation coefficients besides the two listed here as well.
Linear model – It is a mathematical model in which the equation relating the random variables and parameters is linear in the parameters.
Linear regression – It is a type of analysis in which a quantitative study endpoint is posited to be determined by one or more explanatory variables in a linear equation, i.e., a formula involving a weighted sum of coefficients times variables plus an error term.
Linearity in the parameters – It is the condition in which the right-side of a statistical model is a weighted sum of coefficients times variables.
Lot – It is a group of units produced under similar conditions.
Maximum – The maximum is the highest value in a numerical variable. In the variable with the values 12, 15, 11, 18, 13, 14, 18 then 18 is the maximum value.
Maximum likelihood estimation – It is a means of estimating the coefficients of a statistical model which relies on finding the coefficient values that maximize the likelihood function for the collection of study endpoints in the sample.
Maximum likelihood method – It is a method of parameter estimation in which a parameter is estimated by the value of the parameter which maximizes the likelihood function. In other words, the maximum likelihood estimator is the value of theta which maximizes the probability of the observed sample. The method can also be used for the simultaneous estimation of several parameters, such as regression parameters. Estimates obtained using this method are called maximum likelihood estimates.
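A small numerical sketch can make the idea concrete. It assumes a very simple model, independent yes/no trials with an unknown success probability ‘p’, and searches a grid of candidate values for the one which maximizes the log-likelihood. The sample counts and the grid search are illustrative assumptions. In practice, dedicated optimization routines are normally used.

```python
import math


def log_likelihood(p, successes, trials):
    """Log-likelihood of success probability p for independent yes/no trials."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


successes, trials = 7, 20  # hypothetical sample: 7 successes in 20 trials

# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

# The maximum likelihood estimate is the candidate with the largest likelihood.
p_hat = max(grid, key=lambda p: log_likelihood(p, successes, trials))

print(p_hat, successes / trials)
```

For this model the grid maximum coincides with the sample proportion, which is the known closed-form maximum likelihood estimate.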
Mean – The mean is a measure of the ‘middle’, sometimes called the ‘average’. It is that value of a variate such that the sum of deviations from it is zero, and hence it is the sum of a set of values divided by their number.
Mean deviation – The mean deviation is a measure of spread. The mean deviation is an average (mean) absolute difference from the mean. The value it gives is similar to, but slightly smaller than, the standard deviation. Although it is intuitively simpler than the standard deviation it is used less. This is largely because the standard deviation is used in inference, since the population standard deviation is one of the parameters of the normal distribution.
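The comparison between the mean deviation and the standard deviation can be checked on a small example. The values below are illustrative, and the population form of the standard deviation (divisor ‘n’) is assumed so that the two measures are computed on the same footing.

```python
import statistics

values = [12, 15, 11, 18, 13, 14, 18]  # illustrative values

mean = statistics.fmean(values)

# Mean deviation: the average absolute distance from the mean.
mean_deviation = sum(abs(v - mean) for v in values) / len(values)

# Population standard deviation (divisor n).
std_dev = statistics.pstdev(values)

print(round(mean_deviation, 3), round(std_dev, 3))
```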
Mean of a variable – It is the arithmetic average of the variable’s values.
Mean square error – For unbiased estimators, the mean square error is an estimate of the population variance, and is normally denoted as MSE. For biased estimators, the mean squared deviation of an estimator from the true value is equal to the variance plus the squared bias. The square root of the mean square error is referred to as the root mean square error.
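The relationship between mean square error, variance, and bias can be verified numerically. The sketch below uses a handful of made-up estimates of a known true value. The numbers are hypothetical, and the variance is computed with divisor ‘n’ so that the identity holds exactly.

```python
import statistics

true_value = 10.0
# Hypothetical estimates obtained from repeated samples (made-up numbers).
estimates = [10.4, 11.1, 10.8, 11.5, 10.7]

# Mean squared deviation of the estimates from the true value.
mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])

# Bias: average estimate minus the true value.
bias = statistics.fmean(estimates) - true_value

# Variance of the estimates around their own mean (divisor n).
variance = statistics.pvariance(estimates)

rmse = mse ** 0.5  # root mean square error
print(mse, variance + bias ** 2, rmse)
```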
Mechanism – It is a characteristic which transmits the effect of one variable on another. It is also called an intervening variable or a mediating variable.
Median – The median is the middle most number in an ordered series of numbers. It is a measure of central tendency, and is frequently a more robust measure of central tendency, that is, the median is less sensitive to outliers than is the sample mean. If the list has an odd number of entries, the median is the middle entry after sorting the list into increasing order. If the list has an even number of entries, the median is halfway between the two middle numbers after sorting.
Median of a variable – It is the value of the variable such that half of the cases are lower in value and half are higher in value.
Missing at random – It is said of missing data when the probability of being missing on a variable is unrelated to the value of that variable had it been observed.
Missing data – It is the problem of data being absent for one or more variables in one’s study.
Mixed variable – Some variables are between being categorical and numerical. For example, daily rainfall is exactly zero on all dry days, but is a continuous variable on rainy days. Wind speed is similar, with zero being calm. Frequently there is a single ‘special value’, here zero, and otherwise the variable is continuous. This is not always the case. For example, sunshine hours expressed as a fraction of the day length, is zero on cloudy days and 1 (or 100 %) on days with no cloud (or haze). In the analysis it is normal to treat the categorical and the numerical parts separately. However, categorical variables are normally summarized using frequencies and percentages.
Mode – It is the most commonly occurring value in a distribution. It is the most common or most probable value observed in a set of observations or sample.
Model (statistical) – The word ‘model’ is used in several ways and means different things, depending on the discipline. Statistical models form the bedrock of data analysis. A statistical model is a simple description of a process which can have given rise to observed data. A model is a formal expression of a theory or causal or associative relationship between variables, which is regarded by the analyst as having generated the observed data. A statistical model is always a simplified expression of a more complex process, and hence, the analyst is to anticipate some degree of approximation a priori. A statistical model which can explain the greatest amount of underlying complexity with the simplest model form is preferred to a more complex model. There are several probability distributions which are key parts of models in statistics, including the normal distribution and the binomial distribution.
Moving average (MA) processes – These are stationary time series which are characterized by a linear relationship between observations and past innovations. The order of the process ‘q’ defines the number of past innovations on which the current observation depends.
Multi-collinearity – It is the situation in a regression model in which two or more predictors are highly correlated with each other, leading to poor-quality coefficient estimates. Multi-collinearity is a term used to describe when two variables are correlated with each other. In statistical models, multi-collinearity causes problems with the efficiency of parameter estimates. It also raises some philosophical issues, since it becomes difficult to determine which variables (both, either, or none), are causal and which are the result of illusory correlation.
Multinomial logistic regression – It is a logistic regression model for a study end point with more than two values.
Multiple-comparison procedure – It is a statistical procedure for comparing group means which avoids capitalization on chance.
Multiple imputation – It is a means of filling in missing data which involves using the inter-relationships among variables in one’s analysis, along with random error, to estimate the missing values. This process is repeated to create multiple copies of one’s data; then one’s statistical analysis of the data is repeated with each copy of the dataset and the results are combined into one final set of results.
Multiple linear regression – It is a linear regression involving two or more independent variables. Simple linear regression, which is merely used to illustrate the basic properties of regression models, contains one explanatory variable and is rarely if ever used in practice.
Multivariate (or multivariable) analysis – It is an analysis in which one examines the simultaneous effect of two or more explanatory variables on a study end point.
Multivariate normal distribution – It is a multi-dimensional version of the normal distribution which characterizes a collection of variables. If a set of variables has a multivariate normal distribution, then the variables are all inter-correlated and each individual variable is normally distributed.
Mutually exclusive events – In probability theory, two events are said to be mutually exclusive if and only if they are represented by disjoint subsets of the sample space, namely, by subsets which have no elements or events in common. By definition the probability of mutually exclusive events A and B both occurring is zero.
Natural logarithm – It is the power to which Euler’s number, ‘e’ (around 2.718), is raised in order to arrive at the value in question.
Negative binomial regression – It is similar to Poisson regression except that there is no restriction that the mean and variance of the study end point are to be identical.
Newman-Keuls test – It is a type of post hoc or a posteriori multiple comparison test of data which makes precise comparisons of group means after ANOVA has rejected the null hypothesis.
Noise – It is a convenient term for a series of random disturbances, or deviation from the actual distribution. Statistical noise is a synonym for error term, disturbance, or random fluctuation.
Nominal scale – A variable measured on a nominal scale is the same as a categorical variable. The nominal scale lacks order and does not possess even intervals between levels of the variable. An example of a nominal scale variable is vehicle type, where levels of response include truck, van, and auto. The nominal scale variable provides the statistician with the least quantity of information relative to other scales of measurement.
Non-linear association – It is an association between two quantitative variables in which the scatter plot does not follow a linear trend.
Non-linear interaction effect – It is an interaction effect in which the non-linear relationship between the study end point and an explanatory factor takes on different shapes over levels of another explanatory variable.
Non-linear model – It is a statistical model which is not linear in the parameters, e.g., the logistic regression model, the Poisson regression model, the proportional hazards model.
Non-linear relation – A non-linear relation is one where a scatter plot between two variables X1 and X2 does not produce a straight-line trend. In several cases a linear trend can be observed between two variables by transforming the scale of one or both variables. For example, a scatter plot of log(X1) and X2 can produce a linear trend. In this case the variables are said to be non-linearly related in their original scales, but linear in transformed scale of X1.
Non-parametric test – It is a statistical test which makes very few assumptions about population distributions.
Non-probability sample – It is a sample which is not a probability sample, i.e., a hand-picked sample, a convenience sample, or a ‘snowball sample’, etc. Study results using this type of sample can only be generalized to a hypothetical population.
Non-random sample – It is a sample selected by a non-random method. For example, a scheme whereby units are self-selected yields a non-random sample, where units which prefer to participate do so. Some aspects of non-random sampling can be overcome, however.
Normal distribution – It is a continuous distribution which was first studied in connection with errors of measurement and, hence, referred to as the normal curve of errors. It is also called the Gaussian distribution. The normal distribution forms the cornerstone of a substantial portion of statistical theory and is used to model some continuous variables. It is a symmetrical bell-shaped curve which is completely determined by two parameters, the distribution (or population) mean, ‘mu’, and the standard deviation, ‘sigma’. Hence, once the mean and standard deviation are provided, it is possible to calculate any percentile (or risk) of the distribution. When mu = 0 and sigma = 1 it is said to be in its standard form, and it is referred to as the standard normal distribution. For a normal population, around 68 % of the cases are within one standard deviation of the mean, around 95 % of the cases are within two standard deviations of the mean, and nearly all the cases (around 99.7 %) are within three standard deviations of the mean. This is the origin of the ‘70 %, 95 %, 100 % rule of thumb’ which is used to help interpretation of the sample standard deviation, ‘s’. The sample mean and the sample standard deviation are important because, as well as being simple summaries of average and spread, they can also be used to estimate the parameters ‘mu’ and ‘sigma’ of the normal distribution. In addition, the central-limit theorem justifies the use of the methods of inference developed for data from a normal model (and hence also the use of the sample mean and standard deviation), even when the raw data are not normally distributed. Certain sample statistics have a normal sampling distribution.
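The shares of cases within one, two, and three standard deviations of the mean can be computed from the cumulative distribution function. The sketch below assumes Python's standard-library class statistics.NormalDist. The helper name share_within is hypothetical.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k sigma of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The result does not depend on the particular values of the mean and standard deviation, which is why the rule of thumb applies to any normal distribution.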
Null hypothesis – The null hypothesis represents a theory which has been put forward, normally as a basis for argument. The null hypothesis is normally simpler than the alternative hypothesis and is given special consideration. Hence, the conclusion is given in terms of the null hypothesis. The null hypothesis is the opposite of the study hypothesis. In general, this term relates to a particular hypothesis being tested, as distinct from the alternative hypothesis, which is accepted if the null hypothesis is rejected. Contrary to intuition, the null hypothesis is frequently a hypothesis which the analyst prefers to reject in favour of the alternative hypothesis, but this is not always the case. Erroneous rejection of the null hypothesis is known as a Type I error, whereas erroneous acceptance of the null hypothesis is known as a Type II error.
Numerical variable – It refers to a variable whose possible values are numbers (as opposed to categories).
Observational data – Observational data are non-experimental data, and there is no control of potential confounding variables in the study. Because of the weak inferential grounds of statistical results based on observational data, the support for conclusions based on observational data is to be strongly supported by logic, underlying material explanations, identification of potential omitted variables and their expected biases, and caveats identifying the limitations of the study.
Observational study – It is the study in which the study treatments (or levels of the explanatory variables) are not randomly assigned to cases.
Odds – It is the ratio of probabilities for two different events for one group.
Odds ratio – It is the ratio of the odds of an event for two different groups.
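A short sketch of the two quantities follows. It assumes the most common special case of odds, in which the two events are ‘the event occurs’ and ‘the event does not occur’, so that the odds are p / (1 − p). The function names and the probabilities are hypothetical.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p_group1, p_group2):
    """Ratio of the odds of the event in group 1 to the odds in group 2."""
    return odds(p_group1) / odds(p_group2)


# Hypothetical event probabilities in two groups.
print(odds(0.8), odds(0.5), odds_ratio(0.8, 0.5))
```

An odds ratio of 1 indicates the same odds in both groups, while values above or below 1 indicate higher or lower odds in the first group.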
Offset – It is the log of the length of the time period over which an event count is taken, entered into a regression model with its coefficient constrained to equal 1. This converts the study end point into the rate of event occurrence.
Omitted variable bias – Variables which affect the dependent variable that are omitted from a statistical model are problematic. Irrelevant omitted variables cause no bias in parameter estimates. Important variables which are uncorrelated with included variables also cause no bias in parameter estimates, but the estimate of ‘sigma’ square is biased high. Omitted variables which are correlated with an included variable X1 produce biased parameter estimates. The sign of the bias depends on the product of the covariance of the omitted variable and X1 and b1, the biased parameter. For example, if the covariance is negative and b1 is negative, then the parameter is biased positive. In addition, ‘sigma’ square is also biased.
One-tail test – It is a test of hypothesis for which the study hypothesis is directional, i.e., if the null hypothesis is false, the true parameter value is hypothesized to be either strictly above the null-hypothesized value or strictly below it. One-tail test is also known as a one-sided test, a test of a statistical hypothesis in which the region of rejection consists of either the right-hand tail or the left-hand tail of the sampling distribution of the test statistic. Philosophically, a one-sided test represents the analyst’s a priori belief that a certain population parameter is either negative or positive.
One-way analysis of variance (ANOVA) – It is an extension of the independent group t-test where one has more than two groups. It computes the differences in means both between and within groups and compares the variability between groups with the variability within groups. Its parametric test statistic is the F statistic.
Opinion – It is a belief or conviction, based on what seems probable or true but not demonstrable fact. The term also covers the collective views of a large number of people, especially on some particular topic. Several studies have shown that individuals do not possess the skills to adequately assess risk or estimate probabilities, or predict the natural process of randomness. Hence, opinions can frequently be contrary to statistical evidence.
Ordinary differencing – It consists of creating a transformed series by subtracting the immediately adjacent observations.
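Differencing is simple to express in code. The sketch below uses a hypothetical function name and a made-up series. It subtracts each observation from the one which follows it, and applies the operation twice to show second differences.

```python
def difference(series):
    """Return the series of differences between adjacent observations."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


series = [5, 7, 10, 14, 19]           # made-up series with a curved trend
first_diff = difference(series)        # changes from one point to the next
second_diff = difference(first_diff)   # changes in those changes

print(first_diff, second_diff)
```

Each round of differencing shortens the series by one observation.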
Ordinary least squares (OLS) – It is a means of estimating coefficients in linear regression and ANOVA models which depends on finding the estimates which minimize the sum of squared prediction errors.
Ordinal logistic regression – It is a logistic regression model for a study end point with more than two values where the values also represent rank order on the characteristic of interest.
Ordinal scale – The ordinal scale of measurement occurs when a random variable can take on ordered values, but there is not an even interval between levels of the variable. Examples of ordinal variables include the choice between three automobile brands, where the response is highly desirable, desirable, and least desirable. Ordinal variables provide the second lowest quantity of information compared to other scales of measurement.
Ordinal variable – An ordinal variable is a categorical variable in which the categories have an obvious order, e.g. (strongly disagree, disagree, neutral, agree, strongly agree), or (dry, trace, light rain, heavy rain).
Orthogonality condition – It is the assumption that the experimental-error term in a statistical model is uncorrelated with the explanatory variables in the model.
Outliers – An outlier is an observation which is very different to other observations in a set of data. Since the most common cause is recording error, it is sensible to search for outliers (by means of summary statistics and plots of the data) before conducting any detailed statistical modelling. Outliers are identified as such since they ‘appear’ to be outlying with respect to a large number of apparently similar observations or experimental units according to a specified model. In several cases, outliers can be traced to errors in data collecting, recording, or calculation, and can be corrected or appropriately discarded. However, outliers can be so without a plausible explanation. In these cases, it is normally the analyst’s omission of an important variable which differentiates the outlier from the remaining otherwise similar observations, or a mis-specification of the statistical model which fails to capture the correct underlying relationships. Outliers of this latter kind are not to be discarded from the ‘other’ data unless they can be modelled separately, and their exclusion justified. Several indicators are normally used to identify outliers. One is that an observation has a value which is more than 2.5 standard deviations from the mean. Another indicator is an observation with a value more than 1.5 times the interquartile range beyond the upper or the lower quartile. It is sometimes tempting to discard outliers, but this is imprudent unless the cause of the outlier can be identified, and the outlier is determined to be spurious. Otherwise, discarding outliers can cause one to under-estimate the true variability of the data.
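The two indicators mentioned above can be sketched as simple screening functions. The function names and the data are hypothetical. The quartiles are taken from Python's statistics.quantiles with its default method, and the sample standard deviation (divisor n − 1) is assumed. Flagged values are candidates for checking, not values to be discarded automatically.

```python
import statistics


def outliers_by_sd(values, k=2.5):
    """Values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def outliers_by_iqr(values, k=1.5):
    """Values more than k * IQR below the lower or above the upper quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Made-up measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 12, 10, 13, 11, 12, 50]

print(outliers_by_sd(values), outliers_by_iqr(values))
```

In very small samples the standard-deviation rule can fail to flag anything, since an extreme value also inflates the standard deviation against which it is judged. The quartile-based rule is less affected by this.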
Over-dispersion parameter – It is a parameter in the negative binomial regression model which allows for the possibility that the variance of the study end point can be larger than the mean of the study end point.
p-value – p-value is the probability of obtaining sample results at least as unfavourable to the null hypothesis as those observed, if the null hypothesis is true. The probability value (p-value) of a hypothesis test is the probability of getting a value of the test statistic as extreme, or more extreme, than the one observed, if the null hypothesis is true. Small p-values suggest the null hypothesis is unlikely to be true. The smaller it is, the more convincing is the evidence to reject the null hypothesis. In the pre-computer era, it was common to select a particular threshold for the p-value (frequently 0.05, i.e., 5 %) and reject H0 if (and only if) the calculated probability was less than this fixed value. Now it is much more common to calculate the exact p-value and interpret the data accordingly.
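As an illustration, the sketch below computes a two-sided p-value for a test statistic which is assumed to follow a standard normal distribution under the null hypothesis (a ‘z’ test). The function name is hypothetical and the normality assumption does not hold for every test.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A test statistic of about 1.96 corresponds to a p-value of about 0.05.
for z in (0.5, 1.96, 3.0):
    print(z, round(two_sided_p_value(z), 4))
```

Larger absolute values of the test statistic give smaller p-values, i.e., stronger evidence against the null hypothesis.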
Paired t test – It is a test for the difference between means for two groups when the groups are not independently sampled.
Parameter – It is a summary measure of some characteristic for the population, such as the population mean or proportion. This word occurs in its customary mathematical meaning of an unknown quantity which varies over a certain set of inputs. In statistical modelling, it most normally occurs in expressions defining frequency or probability distributions in terms of their relevant parameters (such as mean and variance of normal distribution), or in statistical models describing the estimated effect of a variable or variables on a response. Of utmost importance is the notion that statistical parameters are merely estimates, computed from the sample data, which are meant to provide insight as to what the true population parameter value is, although the true population parameter always remains unknown to the analyst. The population values are frequently modelled from a distribution. Then the shape of the distribution depends on its parameters. For example, the parameters of the normal distribution are the mean, and the standard deviation. For the binomial distribution, the parameters are the number of trials, and the probability of success.
Partial likelihood estimation – It is the estimation method for the Cox regression model. It uses only the part of the likelihood function which is based exclusively on the regression coefficients.
Partial regression coefficient (also known as partial slope) – It is the coefficient for a predictor in a regression model which contains more than one explanatory variable. It represents the effect of that predictor controlling for all other predictors in the model.
Pattern – A good statistical analysis is one which takes account of all the ‘pattern’ in the data. In inference this can be expressed or ‘modelled’ as data = pattern + residual. The idea is that the data has variability, i.e., the values differ from each other. Some of the variability can be understood, and hence, it is part of the ‘pattern’ or ‘signal’ in the data. What is left over is not understood and is called the residual, (or ‘noise’ or ‘error’). A good analysis is one which explains as much as possible. Hence if one can still see any patterns in the residual part, then consider how it can be moved over into the pattern (or model). Even with descriptive statistics, it is important that the analysis reflects the possible patterns in the data. At least the obvious patterns in the data are to be considered when doing the analysis.
Percentile – It is the value in a distribution such that a certain percentage of cases are lower than that value. For example, the 75th percentile is the value such that 75 % of cases have lower values.
Pearson correlation coefficient (r) – This is a measure of the correlation or linear relationship between two variables ‘x’ and ‘y’, giving a value between +1 and −1 inclusive. It is widely used in the sciences as a measure of the strength of linear dependence between two variables.
Pearson’s product moment correlation coefficient – Denoted as ‘r’ij, where i and j refer to two variables, Pearson’s product moment correlation coefficient reflects the degree of linear association between two continuous (ratio or interval scale) variables, and is bounded between +1 for perfect positive correlation and –1 for perfect negative correlation.
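The coefficient can be computed directly from its definition as the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations. The function name and the data in the sketch below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up paired measurements.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
print(round(r, 4))
```

The function divides by zero if either variable is constant, in which case the coefficient is undefined.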
Percentage – For a variable with ‘n’ observations, of which the frequency of a particular characteristic is ‘r’, the percentage is 100*r/n. For example, if the frequency of an activity is 11 times in 55 years, then the percentage is 100*11/55 = 20 % of the years. Percentages are widely used (and misused). Whenever percentages are used it is to be made clear what is the 100 %. In the example above it is the value 55.
Percentile – The ‘p’th percentile of a list is the number such that at least ‘p’ % of the values in the list are no larger than it. So, the lower quartile is the 25th percentile and the median is the 50th percentile. One definition used to give percentiles is that the ‘p’th percentile is the (p/100)*(n+1)’th observation in the sorted list. For example, with 7 observations, the 25th percentile is the (25/100)*8 = 2nd observation in the sorted list. Similarly, the 20th percentile is the (20/100)*8 = 1.6th observation, i.e., a value between the 1st and 2nd observations. An approximate value for the ‘p’th percentile can be read from a cumulative frequency graph as the value of the variable corresponding to a cumulative frequency of ‘p’ %. The term ‘percentile’ was introduced by Galton in 1885.
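The position rule used in the worked example, in which the ‘p’th percentile is the (p/100) × (n + 1)’th observation of the sorted list (giving the 2nd and 1.6th positions for 7 observations), can be sketched as follows. Linear interpolation between neighbouring observations for fractional positions, the clamping at the ends of the list, and the function name are assumptions made for illustration. Other percentile conventions exist.

```python
import math


def percentile(values, p):
    """p-th percentile using the (p/100) * (n + 1) position rule."""
    data = sorted(values)
    n = len(data)
    position = p / 100 * (n + 1)          # 1-based position in the sorted list
    position = min(max(position, 1), n)   # keep the position inside the list
    lower = math.floor(position)
    fraction = position - lower
    if lower >= n:
        return data[-1]
    # Interpolate between the two neighbouring observations.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


values = [12, 15, 11, 18, 13, 14, 18]  # 7 illustrative observations

print(percentile(values, 25), percentile(values, 50), percentile(values, 20))
```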
Person-period data format – It is a type of dataset for statistical analysis in which each subject contributes to the dataset as many records as there are occasions on which that subject is measured. Datasets in this format are frequently necessary in survival analysis and growth-curve analysis.
Pilot survey – It consists of a study, normally on a minor scale, carried out prior to the main survey, primarily to gain information about the appropriateness of the survey instrument, and to improve the efficiency of the main survey. Pilot surveys are an important step in the survey process, specifically for removing unintentional survey question biases, clarifying ambiguous questions, and for identifying gaps and / or inconsistencies in the survey instrument.
Point estimate – It is the best single estimated value of a parameter.
Poisson distribution – It is a probability distribution for an integer variable representing an event count. The Poisson distribution is frequently referred to as the distribution of rare events. It is typically used to describe the probability of occurrence of an event over time, space, or length. In general, the Poisson distribution is appropriate when such conditions hold as the probability of ‘success’ in any given trial is relatively small, the number of trials is large, and the trials are independent.
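The probabilities of the Poisson distribution follow from a single parameter, the mean event count. The sketch below evaluates the probability of observing ‘k’ events as exp(−rate) × rate^k / k!. The function name and the rate of 2 events per period are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the mean count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


rate = 2.0  # hypothetical mean number of events per period

for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))

# The mean and the variance of the distribution both equal the rate.
# Counts above 49 have negligible probability for this rate.
mean = sum(k * poisson_pmf(k, rate) for k in range(50))
variance = sum((k - mean) ** 2 * poisson_pmf(k, rate) for k in range(50))
```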
Poisson regression – It is a type of regression analysis in which the study end point is a count of the number of occurrences of an event which has happened to subjects in some fixed period of time.
Pooled point estimate – It is an approximation of a point, normally a mean or variance, which combines information from two or more independent samples believed to have the same characteristics. It is used to assess the effects of treatment samples versus comparative samples.
Population – It is the total collection of cases people wish to generalize the results of their study to. In statistical usage the term population is applied to any finite or infinite collection of individuals. The term population is also used for the infinite population of all possible results of a sequence of statistical trials, for example, tossing a coin. It is important to distinguish between the population, for which statistical parameters are fixed and unknown at any given instant in time, and the sample of the population, from which estimates of the population parameters are computed. Population statistics are normally unknown since the analyst can rarely afford to measure all members of a population, and so a random sample is drawn. Much of statistics is concerned with estimating numerical properties (parameters) of an entire population from a random sample of units from the population. Greek letters are normally used for population parameters. This is to distinguish them from sample statistics.
Post hoc test – The post hoc test (or post hoc comparison test) is used at the second stage of the analysis of variance (ANOVA) or multivariate analysis of variance (MANOVA) if the null hypothesis is rejected.
Post-hoc theorizing – Post-hoc theorizing is likely to occur when the analyst attempts to explain analysis results after the fact. In this second-rate approach to scientific discovery, the analyst develops hypotheses to explain the data, instead of the converse (collecting data to nullify the hypotheses). The number of post-hoc theories which can be developed to ‘fit’ the data is limited only by the imagination of a group of people. With an abundance of competing hypotheses, and little forethought as to which hypothesis can be afforded more credence, there is little in the way of statistical justification to prefer one hypothesis to another. More importantly, there is little evidence to eliminate the prospect of illusory correlation.
Power – In general, the power of a statistical test of some hypothesis is the probability that it rejects the null hypothesis when the null hypothesis is false. The power is greatest when the probability of a Type II error is least. Power is 1-beta, whereas level of confidence is 1-alpha.
Power of the test – It is the probability that one rejects a false null hypothesis with a particular statistical test.
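The relationship power = 1-beta can be illustrated with a short sketch for the simplest textbook case, a one-sided z-test with known standard deviation. The function name, its parameters, and the choice of a z-test are assumptions made for illustration only. Other tests need their own power calculations.

```python
from statistics import NormalDist

def z_test_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test of H0: mean = mu0 when the true mean is mu0 + effect."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma        # true effect in standard-error units
    beta = std_normal.cdf(z_crit - shift)    # probability of a Type II error
    return 1 - beta

# Hypothetical values: true shift of 0.5 units, sigma = 1, 25 observations.
print(round(z_test_power(0.5, 1.0, 25), 3))
```

With no true effect the function returns alpha, and the power rises towards 1 as the sample size or the true effect grows.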
Precision – Precision is a measure of how close an estimator is expected to be to the true value of a parameter. It is the degree of agreement within a given set of observations. The precision or efficiency of an estimator is its tendency to have its values cluster closely around the mean of its sampling distribution. Precise estimators are preferred to less precise estimators. Precision is normally expressed in terms of the standard error of the estimator. Less precision is reflected by a larger standard error. Fig 8 illustrates precision and bias, where the target value is the bullseye.
Fig 8 Precision and bias
Prediction interval – A prediction interval is a calculated range of values expected to contain some future observation, over the average of repeated trials, with a specified probability. The correct interpretation of a prediction interval is that if the analyst is to repeatedly draw samples at the same levels of the independent variables and compute the interval each time, then a future observation lies in the 100(1-alpha) % prediction interval in around 100(1-alpha) out of every 100 repetitions. The prediction interval differs from the confidence interval in that the confidence interval provides certainty bounds around a mean, whereas the prediction interval provides certainty bounds around an observation.
Predictive nomogram – It is a mathematical formula, based on statistical modelling, which facilitates forecasting patient outcomes. In survival analysis, the predicted outcome is typically the probability of surviving a given length of time before experiencing the study end point.
Probability density functions – The term is used synonymously with probability distributions. Knowing the probability that a random variable takes on certain values, judgements can be made as to how likely or unlikely the observed values were. In general, observing an unlikely outcome tends to support the notion that chance is not acting alone. By posing alternative hypotheses to explain the generation of data, an analyst can conduct hypothesis tests to determine which of two competing hypotheses is best supported by the observed data.
Probability sample – It is a type of sample for which one can specify the probability that any member of the population is selected into it. This type of sample enables generalization of the study results to a known population.
Productivity – It is the unit output per unit of resource input.
Propensity scores – These are the predicted probabilities of receiving the treatment for different subjects. Subjects which have the same propensity scores can be treated in statistical analyses as though they were randomly assigned to treatment groups.
Propensity-score analysis – It is a statistical analysis which controls for propensity scores and thereby balances the distributions on control variables across groups of subjects.
Proportion – For a variable with ‘n’ observations, of which the frequency of a particular characteristic is ‘r’, the proportion is r/n. For example, if the frequency of an activity is 11 times in 55 years, then the proportion is 11/55 = 0.2 of the years, or one fifth of the years.
Pseudo-R2 measure – It consists of any of several analogs of the linear regression R2 used for non-linear models such as logistic regression, Poisson regression, and Cox regression.
Quadratic model – It is a regression model which includes a variable along with its square as explanatory factors in the model. Such a model allows for a non-linear relationship between the study end point and that factor. The curve describing that relationship can have one bend in it.
Qualitative variable – It is a variable whose values indicate a difference in kind, or nature, only. Even if represented by numbers (which it normally is), the values convey no quantitative meaning.
Quantiles – Quantiles are a set of ‘cut points’ which divide a numerical variable into groups containing (as far as possible) equal numbers of observations. Examples of quantiles include quartiles, quintiles, deciles, percentiles.
Quantitative variable – It is a variable whose values indicate either the exact quantity of the characteristic present or a rank order on the characteristic.
Quartiles – There are three quartiles. To find them, first sort the list into increasing order. The first or lower quartile of a list is a number (not necessarily in the list) such that at least 1/4 of the values in the sorted list are no larger than it, and at least 3/4 are no smaller than it. With ‘n’ numbers, one definition is that the lower quartile is the (n+1)/4th observation in the sorted list. The second quartile is the median. The third or upper quartile is the ‘mirror image’ of the lower quartile.
Quintile – It is like a quartile, but dividing the data into five sets, rather than four. The lowest quintile is the 20 % point. The first quintile (20 % point) is the (n+1)/5 value.
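The (n+1)/4 and (n+1)/5 position rules given for quartiles and quintiles can be sketched as a single helper. The function name is hypothetical, and the linear interpolation used when the position is not a whole number is an assumption, since the rule itself does not say how to handle that case. The data values are made up.

```python
def quantile_by_position(values, fraction):
    """Value at position (n + 1) * fraction in the sorted list (counting from 1).

    Interpolates linearly between neighbouring observations when the
    position is not a whole number (an assumed convention).
    """
    data = sorted(values)
    n = len(data)
    pos = (n + 1) * fraction
    if pos <= 1:
        return data[0]
    if pos >= n:
        return data[-1]
    lower = int(pos)        # whole part of the position
    frac = pos - lower      # fractional part used for interpolation
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

# Hypothetical data set with n = 11 observations.
data = [12, 5, 7, 3, 9, 21, 15, 18, 2, 11, 8]
print(quantile_by_position(data, 0.25))  # lower quartile, 3rd sorted value
print(quantile_by_position(data, 0.50))  # median, 6th sorted value
print(quantile_by_position(data, 0.75))  # upper quartile, 9th sorted value
print(quantile_by_position(data, 0.20))  # first quintile, position 2.4
```

Statistical packages use several slightly different conventions for quantiles, so results for small data sets can differ a little from this sketch.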
R-square – It is a measure of the strength of association between a quantitative study end point and one or more quantitative explanatory variables. It has the additional property that it can be interpreted as the proportion of variation in the study end point which is accounted for by the explanatory variable(s).
Random error – It is a deviation of an observed from a true value which occurs as though chosen at random from a probability distribution of such errors.
Randomization – Randomization is used in the design of experiments. When certain factors cannot be controlled, and omitted variable bias has potential to occur, randomization is used to randomly assign subjects to treatment and control groups, such that any systematic omitted variable bias is distributed evenly between the two groups. Randomization is not to be confused with random sampling, which serves to provide a representative sample.
Random sampling – It is a sample strategy whereby population members have equal probability of being recruited into the sample. Frequently called simple random sampling, it provides the greatest assurance that the sample is representative of the population of interest.
Random selection – Synonymous with random sampling, a sample selected from a finite population is said to be random if every possible sample has equal probability of selection. This applies to sampling without replacement. A random sample with replacement is still considered random as long as the population is sufficiently large such that the replaced experimental unit has small probability of being recruited into the sample again.
Random variable – It is a variable whose exact value is not known prior to measurement. Typically, independent variables in experiments are not random variables since their values are assigned or controlled by the analyst. For example, steel billet is applied in exact quantities to the plant under study, hence the quantity of steel billet is a known constant. In observational studies, in contrast, independent variables are frequently random variables since the analyst does not control them. For example, in a study of the effect of acid rain on habitation in the north-east, the analyst cannot control the concentration of pollutants in the rain, and so the concentration of contaminant X is a random variable.
Range – It is the difference between the maximum and the minimum values in a distribution. It is a simple measure of the spread of the data.
Rate of event occurrence – It is an event count divided by the time period over which the count is taken.
Ratio scale – A variable measured on a ratio scale has order, possesses even intervals between levels of the variable, and has an absolute zero. An example of a ratio scale variable is height, where levels of response include 0.000 and 5,000 centimetres. The ratio scale variable provides the statistician with the greatest amount of information relative to other scales of measurement.
Raw data – It is the data which has not been subjected to any sort of mathematical manipulation or statistical treatment such as grouping, coding, censoring, or transformation.
Receiver operating characteristic (ROC) curve – In logistic regression, it is a curve showing the sensitivity of classification plotted against the false positive rate as the criterion probability is varied from 0 to 1. It is used to indicate the predictive efficacy, or discriminatory power, of the model.
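A minimal sketch of how the points of such a curve can be obtained is given below. It assumes the predicted probabilities are already available, and the function name, the labels, and the scores are all hypothetical. A subject is classified as a case when its predicted probability is at or above the criterion probability.

```python
def roc_points(y_true, scores, thresholds):
    """Return (false positive rate, sensitivity) pairs, one per criterion probability."""
    positives = sum(y_true)                 # number of actual cases
    negatives = len(y_true) - positives     # number of actual controls
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical outcomes (1 = case, 0 = control) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
for fpr, sens in roc_points(y_true, scores, [0.0, 0.5, 1.0]):
    print(round(fpr, 3), round(sens, 3))
```

Plotting sensitivity against the false positive rate for a fine grid of criterion probabilities traces out the curve. A curve which rises steeply towards the top left corner indicates good discriminatory power.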
Regression – It is a statistical method for investigating the inter-dependence of variables.
Relative risk – It is the ratio of the probability of an event for two different groups.
Repeatability – It is the degree of agreement between successive runs of an experiment.
Repeated-measures ANOVA – It is a type of ANOVA in which subjects are repeatedly measured on the study endpoint over time, so that time becomes an additional explanatory variable in the analysis. Frequently, repeated-measures ANOVA features a treatment factor and time as the two explanatory variables.
Replication – It is the execution of an experiment or survey more than once so as to increase the precision and to obtain a closer estimation of the sampling error.
Representative sample – It is a sample which is representative of a population. It is a moot point whether the sample is chosen at random or selected to be ‘typical’ of certain characteristics. Hence, it is better to use the term for samples which turn out to be representative, however chosen, rather than apply it to a sample chosen with the object of being representative.
Reproducibility – An experiment or survey is said to be reproducible if, on repetition or replication under similar conditions, it gives the same results.
Research – Research is a systematic search for facts or information. What separates scientific research from other means for making statements about the universe, society, and the environment is that scientific research is rigorous. It is constantly reviewed by professional colleagues, and it relies on consensus building based on repeated similar results. One of the foundations of research is the scientific method, which relies heavily on statistical methods. Frequently to the dismay of the general public, the use of statistics and the scientific method cannot prove that a theory, relationship, or hypothesis is true. On the contrary, the scientific method can be used to prove that a theory, relationship, or hypothesis is false. Through consensus building and peer review of scientific work, theories, hypotheses, and relationships can be shown to be highly likely, but there always exists a shred of uncertainty which puts the results at risk of being incorrect, and there can always be an alternative explanation of the phenomenon which better explains the phenomenon scientists are trying to explain.
Research hypothesis – It is the hypothesis which the researcher is trying to marshal evidence for. This is normally the hypothesis which is suggested either by prior study or theory as being true.
Residual – A residual is defined as the difference between the observed value and the fitted value in a statistical model. Residual is synonymous with error, disturbance, and statistical noise.
Residual method – In time series analysis, it is a classical method of estimating cyclical components by first eliminating the trend, seasonal variations, and irregular variations, hence leaving the cyclical relatives as residuals.
Return period – The return period is the average time to one occurrence of an event. For example, if the probability of an event each year is ‘p’ = 0.2, or 20 %, then the return period = 1/p = 1/0.2 = 5 years. Hence, on average the event ‘returns’ once in five years.
Reverse causation – It is the situation in which the study end point in a regression model is actually the cause of one of the explanatory variables in the model, rather than the other way around.
Right skewed – It is said of distributions where the majority of the cases have low values of the variable, and a few outliers have very high values.
Risk – The risk of an event is the probability of that event occurring. If an activity is needed on 10 years out of 50, this is a probability, or risk, of 0.2 (or 20 %).
Risk set – In survival analysis, it is the total group of subjects who are at risk for event occurrence at any given time.
Robust – It is the property of a statistical procedure of providing valid results even when the assumptions for that procedure are not met.
Robustness – A method of statistical inference is said to be robust if it remains relatively unaffected when not all of its underlying assumptions are met.
Runs test – It is a test where measurements are made according to some well-defined ordering, in either time or space. A frequent question is whether or not the average value of the measurement is different at different points in the sequence. This non-parametric test provides a means for this.
Sample – It is a part or subset of a population, which is obtained through a recruitment or selection process, normally with the objective of understanding better the parent population. Statistics are computed on sample data to make formal statements about the population of interest. If the sample is not representative of the population, then statements made based on sample statistics are incorrect to some degree. By studying the sample, it is hoped to draw valid conclusions (inferences) about the population. A sample is normally used since the population is too large to study in its entirety. The sample is to be representative of the population. This is best achieved by random sampling. The sample is then called a random sample.
Sampling distribution – A sampling distribution describes the probabilities associated with an estimator, when a random sample is drawn from a population. The random sample is considered as one of the several samples which might have been taken. Each would have given a different value for the estimator. The distribution of these different values is called the sampling distribution of the estimator. Deriving the sampling distribution is the first step in calculating a confidence interval, or in conducting a hypothesis test. The standard deviation of the sampling distribution is a measure of the variability of the estimator, from sample to sample, and is called the standard error of the estimator. In many cases, the sampling distribution of an estimator is approximately normal. This follows from the central limit theorem. Then approximate (95 %) confidence intervals are found by simply taking the value of the estimate +/- 2 standard errors.
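As a rough sketch of the approximate 95 % interval for a sample mean, the estimate can be taken plus or minus two standard errors, with the standard error estimated as the sample standard deviation divided by the square root of the sample size. The function name and the data are hypothetical, and the approximation assumes the sampling distribution of the mean is close to normal.

```python
import math
import statistics

def approx_95_ci(sample):
    """Approximate 95 % confidence interval for the mean: estimate +/- 2 standard errors."""
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - 2 * std_error, mean + 2 * std_error

# Hypothetical measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = approx_95_ci(sample)
print(round(low, 2), round(high, 2))
```

For small samples the multiplier 2 is somewhat too small, and a value taken from the t-distribution is normally used instead.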
Sample size – It is the number of sampling units which are to be included in the sample.
Sampling distribution – It is the probability distribution for a sample statistic. This distribution determines the ‘p’ values for statistical tests.
Sampling error – It is that part of the difference between a population value and an estimate thereof, derived from a random sample, which is due to the fact that only a sample of values is observed.
Sampling to a population – It is conjuring up a hypothetical population to which non-probability sample results can be generalized, by imagining the sampling procedure repeated ad infinitum to generate that population. One’s current sample can then be considered a random sample from this hypothetical population.
Scatter diagram – It is also known as a scatter plot and is a simple display used when the data consist of pairs of values. The data are plotted as a series of points. If the data are ordered (for example, in time) then it can be sensible to join the successive points with a line. This is then called a line graph. If there are other categorical variables, their values can be indicated using different plotting symbols or different colours. The scatter diagram consists of graphs drawn with pairs of numerical data, with one variable on each axis, to look for a relationship between them. If the variables are correlated, the points fall along a line or curve. A scatter diagram is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis (x axis) and the value of the other variable determining the position on the vertical axis (y axis). Typical scatter diagrams are shown in Fig 9.
Fig 9 Typical scatter diagrams
Science – Science is the accumulation of knowledge acquired by careful observation, by deduction of the laws which govern changes and conditions, and by testing these deductions by experiment. The scientific method is the corner-stone of science, and is the primary mechanism by which scientists make statements about the universe and phenomena within it.
Scientific method – It is the theoretical and empirical processes of discovery and demonstration considered characteristic and necessary for scientific investigation, normally involving observation, formulation of a hypothesis, experimentation to provide support for the truth or falseness of the hypothesis, and a conclusion which validates or modifies the original hypothesis. The scientific method cannot be used to prove that a hypothesis is true, but can be used to disprove a hypothesis. However, it can be used to mount substantial evidence in support of a particular hypothesis, theory, or relationship.
Seasonal cycle length – It is the length of the characteristic recurrent pattern in seasonal time series, given in terms of number of discrete observation intervals.
Seasonal differencing – It is creating a transformed series by subtracting observations which are separated in time by one seasonal cycle.
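A minimal sketch of seasonal differencing is shown below. The function name and the series are hypothetical. The series is assumed to be equally spaced with a seasonal cycle length of four observation intervals.

```python
def seasonal_difference(series, cycle_length):
    """Subtract from each observation the one made one seasonal cycle earlier."""
    return [series[t] - series[t - cycle_length]
            for t in range(cycle_length, len(series))]

# Hypothetical quarterly series with a repeating pattern and a small upward drift.
series = [10, 20, 30, 40, 12, 22, 32, 42]
print(seasonal_difference(series, 4))
```

The transformed series is shorter than the original by one cycle length, and the recurrent seasonal pattern is removed, leaving only the change from one cycle to the next.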
Seasonality – It is the time series characteristic defined by a recurrent pattern of constant length in terms of discrete observation intervals.
Selection bias – It is the bias in one’s regression estimates brought about either by an unmeasured characteristic of cases which causes only certain kinds of cases to be assigned certain treatments (self-selection bias) or by an unmeasured characteristic which causes only certain kinds of cases to be present in one’s sample (sample-selection bias).
Self-selection – Self-selection is a problem which plagues survey studies. Self-selection is a term used to describe what happens when survey respondents are allowed to decline participation in a survey. The belief is that respondents who are opposed to, or who are apathetic about, the objectives of the survey refuse to participate, and their removal from the sample biases the results of the survey. Self-selection can also occur since only respondents who are either strongly opposed to or strongly supportive of a survey’s objectives respond to the survey.
Sensitivity analysis – It is an alternative analysis using a different model or different assumptions to explore whether one’s main findings are robust to different analytical approaches to the study problem.
Sensitivity of classification – In logistic regression, it is the probability of a case being classified as a case by the prediction equation.
Siegel-Tukey test – It is a non-parametric test named after Sidney Siegel and John Tukey, which tests for differences in scale between two groups. The data measured are to be at least ordinal.
Sign test – It is a test which can be used whenever an experiment is conducted to compare a treatment with a control on a number of matched pairs, provided the two treatments are assigned to the members of each pair at random.
Significance – An effect is significant if the value of the statistic used to test it lies outside acceptable limits, i.e., if the hypothesis that the effect is not present is rejected.
Significance level, of a hypothesis test – The significance level of a statistical hypothesis test is the probability of wrongly rejecting the null hypothesis H0, if it is in fact true. It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. That is, one wants to make the significance level as small as possible in order to protect the null hypothesis and to prevent the investigator from inadvertently making false claims. Normally, the significance level is chosen to be 0.05 (or equivalently, 5 %).
Simple random sample – It is a sample in which every member of the population has the same chance of being selected into the sample.
Skew – If the distribution (or ‘shape’) of a variable is not symmetrical about the median or the mean it is said to be skew. The distribution has positive skewness if the tail of high values is longer than the tail of low values, and negative skewness if the reverse is true.
Skewness – Skewness is the lack of symmetry in a probability distribution. In a skewed distribution the mean and median are not coincident.
Smoothing – It is the process of removing fluctuations in an ordered series so that the result is ‘smooth’ in the sense that the first differences are regular and higher order differences are small. Although smoothing can be carried out by free-hand methods, it is normal to make use of moving averages or the fitting of curves by least squares procedures. The philosophical grounds for smoothing stem from the notion that measurements are made with error, such that artificial ‘bumps’ are observed in the data, whereas the data are meant to represent a smooth or continuous process. When these ‘lumpy’ data are smoothed appropriately, the data are thought to better reflect the true process which generated the data. An example is the speed-time trace of a vehicle, where speed is measured in integer kilometres per hour. Accelerations of the vehicle computed from differences in successive speeds are over-estimated because of the lumpy nature of measuring speed. Hence, an appropriate smoothing process on the speed data results in data which more closely resemble the underlying data generating process. Of course, the technical difficulty with smoothing lies in selecting the appropriate smoothing process, since the real data are typically never observed.
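One of the simplest smoothing procedures mentioned above, the moving average, can be sketched as follows. The function name, the window length of three, and the speed values are illustrative assumptions. The choice of window is a judgement the analyst has to make.

```python
def moving_average(series, window):
    """Simple moving average over `window` consecutive observations.

    Returns len(series) - window + 1 values. An odd window keeps each
    average centred on an observed time point.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical speed trace recorded in whole kilometres per hour.
speeds = [50, 51, 51, 53, 52, 54, 55]
print(moving_average(speeds, 3))
```

A longer window removes more of the ‘bumps’ but also flattens genuine changes in the underlying process.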
Spearman’s rank order correlation – It is a non-parametric test used to measure the relationship between two rank ordered scales. Data are in ordinal form.
Specificity of classification – In logistic regression, it is the probability of a control being classified as a control by the prediction equation.
Spread – The majority of data sets show variability, i.e., all the values are not the same. Two aspects of the distribution of values are particularly important, namely the centre and the spread. The ‘centre’ is a typical value around which the data are located. The mean and median are examples of typical values. The spread describes the distance of the individual values from the centre. The range (maximum – minimum) and the inter-quartile range (upper quartile – lower quartile) are two summary measures of the spread of the data. The standard deviation is another summary measure of spread.
Standard deviation – The sample standard deviation is the square root of the sample variance. The standard deviation is the most commonly used summary measure of variation or spread of a set of data, and represents approximately the average distance of values from the mean of a distribution. The sample standard deviation is a biased estimator, even though the sample variance is unbiased, and the bias becomes larger as the sample size gets smaller. The standard deviation is a ‘typical’ distance from the mean. Normally, around 70 % of the observations are closer than 1 standard deviation from the mean and most (around 95 %) are within 2 standard deviations of the mean. The standard deviation is a symmetrical measure of spread, and hence is less useful and more difficult to interpret for data sets which are skew. It is also sensitive to outliers, i.e., its value can be greatly changed by the presence of outliers in the data.
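The calculation can be sketched as below, using the usual sample variance with divisor n-1 (the version which is unbiased for the population variance). The function name and the data are hypothetical.

```python
import math

def sample_std_dev(values):
    """Square root of the sample variance (sum of squared deviations / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_std_dev(data), 3))
```

Replacing the largest value with a much larger one and re-running the calculation shows how strongly a single outlier can change the result.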
Standard error – It is the standard deviation of the sampling distribution of a statistic, i.e., the positive square root of the variance of the sampling distribution of that statistic. The standard error is a measure of precision. It is a key component of statistical inference. The standard error of an estimator is a measure of how close it is likely to be to the parameter it is estimating.
Standard error of estimate – It is the standard deviation of the observed values about a regression line.
Standard error of the ‘mean’ (SEM) – It is the standard deviation of the means of several samples drawn at random from a large population. It is an estimate of the quantity by which an obtained mean can be expected to differ by chance from the true mean. It is an indication of how well the mean of a sample estimates the mean of a population.
Standard normal transformation – Fortunately, the analyst can transform any normal distributed variable into a standard normal distributed variable by making use of a simple transformation.
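The transformation in question is the standard one, z = (x – mean) / standard deviation. A minimal sketch follows, in which the function name and the numerical values are hypothetical.

```python
def standardize(x: float, mean: float, std_dev: float) -> float:
    """Convert a value from a normal distribution to a standard normal score."""
    return (x - mean) / std_dev

# Hypothetical variable with mean 100 and standard deviation 15.
print(standardize(130, 100, 15))  # number of standard deviations above the mean
```

The resulting value has mean 0 and standard deviation 1, so tables or functions for the standard normal distribution can be applied to it directly.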
Standard scores – These are scores expressed in terms of standard deviations away from the mean.
Statistic – It is a summary value calculated from a sample of observations.
Statistics – It is the branch of mathematics which deals with all aspects of the science of decision-making and analysis of data in the face of uncertainty.
Statistical control – It consists of statistically holding other explanatory variables constant when looking at the effect of a given predictor on a study end point. It is designed to mimic the kind of control achieved with random assignment to levels of the predictor. However, it is no substitute for random assignment, as it only controls for measured characteristics.
Statistical independence – In probability theory, two events are said to be statistically independent if, and only if, the probability that they both occur equals the product of the probabilities that each one individually occurs, i.e., one event does not depend on another for its occurrence or non-occurrence.
Statistical inference – Also called inductive statistics, statistical inference is a form of reasoning from sample data to population parameters, that is, any generalization, prediction, estimate, or decision based on a sample and made about the population. There are two schools of thought in statistical inference, classical or frequentist statistics, for which RA Fisher is considered to be the founding father, and Bayesian inference, named after Thomas Bayes.
Statistical interaction (also known as stratification effects) – It is the situation in which the nature of the association between a predictor and a study end point is different for different levels of a third variable.
Statistical methods – Statistical methods are similar to a glass lens through which the analyst inspects phenomena of interest. The underlying mechanisms present in the population represent reality, the sample represents a blurry snap shot of the population, and statistical methods represent a means of quantifying different aspects of the sample.
Statistical model – It is a set of one or more equations describing the process or processes which generated the scores on the study end point.
Statistical power – It is the capability of a test to detect a significant effect or how frequently a correct interpretation can be reached about the effect if it is possible to repeat the test several times.
Statistical significance – It is the condition in which the ‘p’ value for a statistical test is below the alpha level for the test, leading to rejection of the null hypothesis.
Strength of association – It is the degree to which knowledge of one’s status on one variable enables prediction of one’s status on another variable with which it is associated. Measures of strength of association ideally range in absolute value from 0 to 1.
Stochastic – The adjective ‘stochastic’ implies that a process or data generating mechanism involves a random component or components. A statistical model consists of stochastic and deterministic components.
Stratification – It is the division of a population into parts, known as strata. Stratified random sampling is a method of sampling from a population whereby the population is divided into strata, especially for the purpose of drawing a sample, and assigned proportions of the sample are then drawn from each stratum. The process of stratification is undertaken in order to reduce the variability of stratification statistics. In other words, strata are normally selected such that inter-strata variability is maximized, and intra-strata variability is small. When stratified sampling is performed as desired, estimates of strata statistics are more precise than the same estimates computed on a simple random sample.
Student-Newman-Keuls (SNK) test – It is a post ANOVA test, also called a post hoc test. It is used to analyze the differences after the F-test (ANOVA) is found to be significant, for example, to locate where differences truly occur between means.
Student t-test (t) – It is a statistical hypothesis test in which the test statistic follows a student’s t-distribution if the null hypothesis is true, for example, a t-test for paired or independent samples.
Study endpoint (also known as outcome, dependent variable, criterion variable or response variable) – It is the ‘effect’ variable whose ‘behaviour’ one is trying to explain using one or more explanatory variables in the study.
Sub-classification on propensity scores – It is a means of performing propensity-score analysis in which the substantive analysis is repeated on different groups having roughly the same propensity scores. The analysis results from the different groups are then combined into one final result through weighted averaging.
Survival analysis – It is the analysis of time-to-event data, i.e., the length of time until an event occurs to subjects. The most popular multi-variable technique, Cox regression, is a model for the log of the hazard of the event.
Survival function – It is the probability of surviving to a particular point in time without experiencing the event of interest. This changes over time and is hence a function of time.
Symmetric – It is said of distributions which show no skewness, and for which exactly 50 % of cases lie above and 50 % below the mean of the distribution.
Symmetrical – A list of numbers is symmetrical if the data values are distributed in the same way, above and below the middle. Symmetrical data sets are (i) easily interpreted, (ii) allow the presence of outliers to be detected similarly (i.e., using the same criteria), whether they are above the middle or below and (iii) allow the spread (variability) of similar data sets to be compared. Some statistical techniques are appropriate only for data sets which are roughly symmetrical, (e.g., calculating and using the standard deviation). Hence, skew data are sometimes transformed, so they become roughly symmetric.
Systematic error – It is an error, which is in some sense biased, having a distribution with a mean that is not zero (as opposed to a random error).
T-distribution – It is a statistical distribution describing the means of samples taken from a population with an unknown variance. It is the distribution, for particular degrees of freedom, of the difference between the sample mean and the population mean divided by the standard error of the mean. It is symmetric and resembles the normal distribution except that it shows more dispersion. Some sample statistics have a ‘t’ sampling distribution.
T-score – It is a standard score derived from a z-score by multiplying the z-score by 10 and adding 50. It is useful in comparing different test scores to each other as it is a standard metric which reflects the cumulative frequency distribution of the raw scores.
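The conversion described above (T = 50 + 10 z) can be sketched as follows. The function name and the raw scores are hypothetical, and the use of the population standard deviation for the z-scores is an assumption of this sketch.

```python
import statistics


def t_scores(raw_scores):
    """Convert raw scores to T-scores: T = 50 + 10 * z."""
    mean = statistics.fmean(raw_scores)
    sd = statistics.pstdev(raw_scores)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all scores are identical, so z-scores are undefined")
    return [50 + 10 * (x - mean) / sd for x in raw_scores]


raw = [40, 50, 60, 70, 80]  # hypothetical test scores
print(t_scores(raw))
```

By construction the resulting T-scores have a mean of 50 and a standard deviation of 10, which is what makes scores from different tests comparable.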
t-statistic – When a sample is used to calculate ‘s’ square, an estimate of the population variance ‘sigma’ square, and the parent population is normally distributed, the sampling distribution of the test statistic ‘t’ is a t-distribution.
t-test for correlated means – It is a parametric test of statistical significance used to determine whether there is a statistically significant difference between the means of two matched, or non-independent, samples. It is also used for pre–post comparisons.
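For matched samples, the test statistic is normally computed from the pair-wise differences. The sketch below shows only this arithmetic. The function name and the pre–post data are hypothetical, and the p-value lookup in the t-distribution is left out.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for two matched samples."""
    if len(before) != len(after):
        raise ValueError("matched samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are needed")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divisor n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical, so t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical pre-post measurements on the same five subjects.
before = [10, 12, 9, 11, 13]
after = [12, 13, 12, 12, 15]
t, df = paired_t_statistic(before, after)
print(t, df)
```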
t-test for correlated proportions – It is a parametric test of statistical significance used to determine whether there is a statistically significant difference between two proportions based on the same sample or otherwise non-independent groups.
t-test for independent means – It is a parametric test of significance used to determine whether there is a statistically significant difference between the means of two independent samples.
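A minimal sketch of the statistic for two independent samples is given below. It assumes that the two populations have equal variances (the pooled-variance form), which is one common variant and not the only one. The function name and data are hypothetical.

```python
import math
import statistics


def independent_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) using the pooled-variance form."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    var_a = statistics.variance(sample_a)  # sample variances (divisor n - 1)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / std_error
    return t, df


group_a = [5, 7, 9]  # hypothetical measurements, group A
group_b = [1, 3, 5]  # hypothetical measurements, group B
t_ind, df_ind = independent_t_statistic(group_a, group_b)
print(t_ind, df_ind)
```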
t-test for independent proportions – It is a parametric test of statistical significance used to determine whether there is a statistically significant difference between two independent proportions.
Table – When data are split into categories, tables provide a way of summarizing them. A simple table gives the frequency, or the percentage, in each category. There are as many cells in the table as there are categories, plus the last cell, which is called the margin. Tables can summarize data for two or more factors (category variables), and an example is shown below. The contents of a table can be the frequencies (or percentages) at each combination of the factor levels. Alternatively, they can be summary values of a numeric variable, for each category.
Technology – It is the science of technical processes which is a wide, though related, body of knowledge. Technology embraces the chemical, mechanical, electrical, and physical sciences as they are applied to society, the environment, and otherwise human endeavours.
Technology transfer – It is the dissemination of knowledge leading to the successful implementation of the results of research and development. Technology transfer outputs from a research project, such as prototypes, software, devices, specifications, designs, processes, or practices, are either expendable or frequently have only temporary and limited utility.
Test of hypothesis – It is a statistical test of the plausibility of the null hypothesis in a study.
Test statistic – It is a sample statistic measuring the discrepancy between what is observed in the sample, as opposed to what one expects to observe if the null hypothesis is true. A test statistic is a quantity calculated from the sample of data. It is used in hypothesis testing, where its value dictates whether the null hypothesis is to be rejected or not. The choice of a test statistic depends on the assumed model and the hypothesis being tested.
The central limit theorem – It is a mathematical theorem specifying the sampling distribution of a sample statistic (e.g., the sample mean) when one has a large sample.
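The theorem can be illustrated with a small simulation. In this sketch, which is an illustration and not a proof, repeated samples are drawn from a uniform distribution on (0, 1), and the standard deviation of the sample means is compared with the theoretical value ‘sigma’ divided by the square root of the sample size. The seed, sample size, and number of samples are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so that the sketch is repeatable

SAMPLE_SIZE = 30
N_SAMPLES = 2000

# Mean of each of the repeated samples drawn from uniform(0, 1).
sample_means = [
    statistics.fmean([rng.random() for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# uniform(0, 1) has mean 0.5 and standard deviation sqrt(1/12).
theoretical_se = math.sqrt(1 / 12) / math.sqrt(SAMPLE_SIZE)

print(mean_of_means, sd_of_means, theoretical_se)
```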
Third quartile – It is the value in a distribution such that 75 % of cases have lower values.
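As a short sketch, the quartiles of a small hypothetical data set can be obtained with `statistics.quantiles` from the Python standard library. Several conventions exist for interpolating quartiles, so other software can return slightly different values for the same data.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7]  # hypothetical data

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)

share_below_q3 = sum(1 for v in values if v < q3) / len(values)
print(q3, share_below_q3)
```

With only seven values, the share of cases below the third quartile can only approximate 75 %. The approximation improves as the data set grows.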
Time series – It is a series of measurements of a variable over time, normally at regular intervals. It is a set of ordered observations on a quantitative characteristic of an individual or collective phenomenon taken at different points of time. Although it is not a requirement, it is common for these points to be equidistant in time.
Time-varying covariates – These are explanatory variables whose values can change at different occasions of measurement for the same subject.
Transformation – A transformation is the change in the scale of a variable. Transformations are performed to simplify calculations, to meet specific statistical modelling assumptions, to linearize an otherwise non-linear relation with another variable, to impose practical limitations on a variable, and to change the characteristic shape of the probability distribution of the variable in its original scale.
Transforming variables – If there is evidence of marked skewness in a variable, then applying a transformation can make the resulting transformed variable more symmetrical. Transforming skew data was very important 50 years ago, since the analysis was frequently simpler for variables which were symmetrical. This was partly since a normal distribution was then frequently an appropriate model, and much of the statistical inference / modelling depended on the data being from a normal distribution. Recent advances in statistics have led to analyses being (almost as) simple for a wide range of statistical models, some of which are appropriate for modelling skew data. So now it is more important to consider the appropriate statistical model than to assume that data always need to be transformed if they lack symmetry. Transforming data is not ‘cost free’. One is to be wary of transforming when there are zeros in the data. A popular action used to be to add a small arbitrary value to the zeros and then to transform. Analysing the zeros separately is almost always to be preferred.
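The effect of a transformation on skewness can be sketched with a logarithm applied to hypothetical right-skewed data. The moment-based skewness function below is one common definition and is an assumption of this sketch. The data are strictly positive, since the logarithm is not defined for zeros.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness m3 / m2**1.5, which is 0 for symmetrical data."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


raw = [1, 2, 4, 8, 16, 32, 64]       # hypothetical right-skewed, positive data
logged = [math.log(v) for v in raw]  # math.log raises an error for a zero value

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(skew_raw, skew_logged)
```

Here the raw values are markedly skewed to the right, while their logarithms are evenly spaced and hence symmetrical.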
Transportation – It is the act and / or means for moving people and goods.
Truncated distribution – A truncated statistical distribution occurs when a response above or below a certain threshold value is discarded. For example, assume that a certain instrument can only read measurements within a certain range. Data obtained from this instrument result in a truncated distribution, as measurements outside the range are discarded. If measurements outside the range are instead recorded at the extreme of the range of the measurement device, then the distribution is censored.
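The difference between truncation and censoring can be sketched with a hypothetical instrument whose measuring range is 0 to 100. The range limits and the true values are invented for illustration.

```python
LOWER, UPPER = 0.0, 100.0  # hypothetical measuring range of the instrument

true_values = [-5.0, 12.5, 47.0, 99.0, 103.2, 150.0]

# Truncation: readings outside the range are discarded entirely.
truncated = [v for v in true_values if LOWER <= v <= UPPER]

# Censoring: readings outside the range are recorded at the nearest limit.
censored = [min(max(v, LOWER), UPPER) for v in true_values]

print(truncated)
print(censored)
```

The truncated data set loses three observations, while the censored data set keeps all six but records the out-of-range ones at the limits.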
Tukey’s test of significance – It is a single-step multiple comparison procedure and statistical test normally used in conjunction with an ANOVA to find which means are significantly different from one another. Named after John Tukey, it compares all possible pairs of means and is based on a studentized range distribution ‘q’ (this distribution is similar to the distribution of ‘t’ from the t-test).
Two-tailed test – It is a test of hypothesis for which the study hypothesis is not directional, i.e., the study hypothesis allows for the possibility that the true parameter value can fall on either side of the null-hypothesized value. It is a test of significance in which both directions are, a priori, equally likely.
Type I error – It is the error of rejecting a true null hypothesis in a statistical test. If, as the result of a test statistic computed on sample data, a statistical hypothesis is rejected when it is true, i.e., when it should not have been rejected, then a type I error has been made. Alpha, or the level of significance, is pre-selected by the analyst and sets the type I error rate, i.e., the probability of making this error. The level of confidence of a particular test is given by 1 - alpha.
Type II error – It is the error of failing to reject a false null hypothesis in a statistical test. If, as the result of a test statistic computed on sample data, a statistical hypothesis is accepted when it is false, i.e., when it should have been rejected, then a type II error has been made. Beta denotes the type II error rate, i.e., the probability of making this error, and is set by the analyst when the study is designed. The power of a particular test is given by 1 - beta.
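The relation between alpha, beta, and power can be sketched for one simple case, namely a one-sided (upper-tail) z-test for a mean with known ‘sigma’. The function name and the numbers are hypothetical. Other tests need other formulas.

```python
import math
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided, upper-tail z-test with sigma known."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # reject the null when Z exceeds this
    shift = effect * math.sqrt(n) / sigma       # mean of Z when the true effect applies
    beta = std_normal.cdf(critical_z - shift)   # chance of failing to reject
    return 1 - beta


# Hypothetical design: true mean shift 0.5, sigma 1, sample size 25.
power = z_test_power(effect=0.5, sigma=1.0, n=25)

# With no true effect the rejection probability equals alpha (type I error rate).
false_alarm_rate = z_test_power(effect=0.0, sigma=1.0, n=25)

print(power, false_alarm_rate)
```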
Unbiased estimator – It is a sample statistic for which the mean of its sampling distribution is equal to the population parameter it is designed to estimate. In other words, it is an estimator whose expected value (namely the mean of the sampling distribution) equals the parameter it is supposed to estimate. This is considered a desirable property of an estimator. In general, unbiased estimators are preferred to biased estimators of population parameters. There are rare cases, however, when biased estimators are preferred since they are much more efficient than alternative estimators.
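Unbiasedness can be checked exactly for a tiny hypothetical population by listing every possible sample. The sketch below assumes samples of size 2 drawn with replacement, and compares the sample variance computed with divisor n - 1 against the one computed with divisor n.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population
n = 2                      # sample size, drawn with replacement

pop_mean = Fraction(sum(population), len(population))
pop_var = sum((x - pop_mean) ** 2 for x in population) / len(population)


def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by divisor."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor


samples = list(product(population, repeat=n))  # all 16 equally likely samples

# Mean of the sampling distribution of each estimator.
expected_unbiased = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
expected_biased = sum(sample_variance(s, n) for s in samples) / len(samples)

print(pop_var, expected_unbiased, expected_biased)
```

The estimator with divisor n - 1 averages exactly to the population variance, while the estimator with divisor n averages to a smaller value.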
Uncensored cases – In survival analysis, these are the subjects who experience the event of interest during the observation period of the study.
Uniform distribution – Uniform distributions are appropriate for cases when the probability of achieving an outcome within a range of outcomes is constant. An example is the probability of observing a crash at a specific location between two consecutive post miles on a homogenous section of freeway.
Universe – Universe is synonymous with population and is found primarily in older statistical textbooks. The majority of the newer textbooks and statistical literature use population to define the experimental units of primary interest.
Unmeasured heterogeneity – It is an unmeasured characteristic of one’s cases which is related to one or more explanatory variables in the study, as well as the study endpoint. Part or all of the supposed ‘effect’ of the explanatory variables on the study endpoint is actually attributable to this unmeasured confounding factor.
Validity – It is the degree to which some procedure is founded on logic (internal or formal validity) or corresponds to nature (external or empirical validity).
Validation – Validation is a term used to describe the important activity of validating a statistical model. The only way to validate the generalizability or transferability of an estimated model is to make forecasts with a model and compare them to data which are not used to estimate the model. This exercise is called external validation. The importance of this step of model building cannot be overstated, but it remains perhaps the least practiced step of model building, since it is expensive and time consuming, and since some modelers and practitioners confuse goodness of fit statistics computed on the sample data with the same computed on validation data.
Variable – It is a quantity which can take any one of a specified set of values. It is the characteristic measured or observed when an observation is made. Variables can be non-numerical or numerical. The distinction between a categorical variable and a numerical variable is sometimes blurred. A categorical variable can always be coded numerically, for example, a gender (male, female) can be coded as 1 for male or 2 for female (or vice versa). Similarly, a numerical variable can be recoded into categories if needed. For example, the variable ‘age’ can be recoded into the three categories, of young, middle, or old. An average (e.g., mean or median) and a measure of spread, (e.g., standard deviation or quartiles) are frequently used to summarize a numerical variable. A table of the frequencies or percentages, at each level (or category) is frequently used to summarize a categorical variable.
Variability – Variability is a statistical term used to describe and quantify the spread or dispersion of data around its centre, normally the mean. Knowledge of data variability is essential for conducting statistical tests and for fully understanding data. Hence, it is frequently desirable to obtain measures of both central tendency and spread. In fact, it can be misleading to consider only measures of central tendency when describing data.
Variability or variation or dispersion – The variability (or variation) in data is the extent to which successive values are different. The quantity of variability in the data, and the different causes of the variability are frequently of importance in their own right. The variability (or ‘noise’) in the data can also obscure the important information (or ‘signal’).
Variance – It is the square of the standard deviation. The variance is a measure of variability, and is frequently denoted by ‘sigma’ square. In simple statistical methods the square root of the variance, ‘sigma’, which is called the standard deviation, is frequently used more. The standard deviation has the same units as the data themselves and is hence easier to interpret. The variance becomes more useful in its own right when the contributions of different sources of variation are being assessed. This leads to a presentation called the ‘analysis of variance’, frequently written as ANOVA.
Variance of a variable – It is the average of the squared deviation scores.
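The definition above can be sketched directly as the average of the squared deviation scores. The data are hypothetical. The sample variance, which divides by n - 1 instead of n, is shown alongside for comparison.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

mean = statistics.fmean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Average of the squared deviation scores (divisor n).
population_variance = sum(squared_deviations) / len(data)

# Sample variance (divisor n - 1), used when estimating from a sample.
sample_variance = sum(squared_deviations) / (len(data) - 1)

# The standard deviation is the square root of the variance.
standard_deviation = population_variance ** 0.5

print(population_variance, sample_variance, standard_deviation)
```

The same results are available from `statistics.pvariance` and `statistics.variance` in the Python standard library.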
Variate – It is a quantity which can take any of the values of a specified set with a specified relative frequency or probability, also known as a random variable.
Weight – It is a numerical coefficient attached to an observation, frequently by multiplication, in order that it assumes a desired degree of importance in a function of all the observations of the set.
Weighted average – It is an average of quantities to which have been attached a series of weights in order to make allowance for their relative importance.
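A weighted average can be sketched as the sum of each quantity multiplied by its weight, divided by the sum of the weights. The function name, values, and weights below are hypothetical.

```python
def weighted_average(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [70, 80, 90]  # hypothetical quantities
weights = [1, 2, 2]    # hypothetical relative importance

print(weighted_average(values, weights))
print(weighted_average(values, [1, 1, 1]))  # equal weights give the plain mean
```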
White noise – For time series analysis, white noise is defined as a series whose elements are uncorrelated and normally distributed with mean zero and constant variance. The residuals from properly specified and estimated time series models are expected to be white noise.
Wald-Wolfowitz test – It is a non-parametric statistical test used to test the hypothesis that a series of numbers is random. It is also known as the runs test for randomness.
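A minimal sketch of the runs test for a sequence of two symbols is given below. It uses the large-sample normal approximation for the number of runs, so it is not suitable for very short sequences. The function name and the example sequences are hypothetical.

```python
import math


def runs_test(sequence):
    """Return (number of runs, z statistic) for a two-valued sequence."""
    symbols = sorted(set(sequence))
    if len(symbols) != 2:
        raise ValueError("sequence must contain exactly two distinct symbols")
    n1 = sum(1 for s in sequence if s == symbols[0])
    n2 = len(sequence) - n1
    n = n1 + n2
    # A new run starts wherever a symbol differs from the previous one.
    runs = 1 + sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if variance == 0:
        raise ValueError("sequence is too short for the normal approximation")
    z = (runs - expected) / math.sqrt(variance)
    return runs, z


clustered = "AAAAABBBBB"    # too few runs for a random sequence
alternating = "ABABABABAB"  # too many runs for a random sequence

print(runs_test(clustered))
print(runs_test(alternating))
```

A strongly negative ‘z’ points to clustering and a strongly positive ‘z’ points to alternation. In both cases the hypothesis of randomness becomes doubtful.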
Wilcoxon signed-rank test – It is a non-parametric statistical hypothesis test for the case of two related samples or repeated measurements on a single sample. It can be used as an alternative to the paired Student’s t-test when the population cannot be assumed to be normally distributed.
Wilcoxon rank sum test – It is a non-parametric test for the difference in the study endpoint between two independently sampled groups.
Wilks’s lambda – It is a general test statistic used in multi-variate tests of mean differences among more than two groups. It is the numerical index calculated when carrying out MANOVA or MANCOVA.
Within-subjects variable – It is a variable in repeated-measures ANOVA or linear mixed modelling which takes on different values over time for the same subject.
Z-score – It is a score expressed in units of standard deviations from the mean. It is also known as a standard score.
Z-statistic – If ‘x’ (average) is calculated on a sufficiently large sample selected from a distribution with mean ‘mu’ and known finite variance ‘sigma’ square, then the sampling distribution of the test statistic ‘Z’ is approximately standard normal, regardless of the characteristics of the parent distribution (i.e., normal, Poisson, binomial, etc.). Hence, the random variable ‘Z’ so computed is approximately standard normal (mean=0, variance=1) distributed.
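The arithmetic of the Z-statistic can be sketched as follows, assuming that ‘mu’ and ‘sigma’ are known. The sample, ‘mu’, and ‘sigma’ are hypothetical, and the small sample is used only to keep the arithmetic easy to follow. The two-sided p-value is taken from the standard normal distribution.

```python
import math
from statistics import NormalDist, fmean


def z_statistic(sample, mu, sigma):
    """Z = (sample mean - mu) / (sigma / sqrt(n)), with sigma assumed known."""
    n = len(sample)
    if n == 0 or sigma <= 0:
        raise ValueError("need a non-empty sample and a positive sigma")
    return (fmean(sample) - mu) / (sigma / math.sqrt(n))


sample = [52, 48, 55, 51, 49, 53, 50, 54, 47]  # hypothetical values, mean 51
z = z_statistic(sample, mu=50, sigma=3)

# Two-sided p-value under the standard normal distribution.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(z, p_two_sided)
```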
Z-test – It is a test of any of a number of hypotheses in inferential statistics which has validity if sample sizes are sufficiently large or the underlying data are normally distributed with known variance.
Zero values – Variables can include zero values or other special values. Zeros are to be considered as an opportunity, rather than a problem. The data are normally to be analysed (or modelled) in two parts. The first part considers the chance of getting a zero value (as opposed to a non-zero value). Then the non-zero data are analysed further. In the past, one strategy was to treat the zeros as representing something that had to be hidden, normally by adding a small value and then transforming. This is almost always counter-productive.
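The two-part approach can be sketched with hypothetical data containing several zeros. The variable name and values are invented, and the summaries chosen for the second part (mean and median) are only one possible choice.

```python
import statistics

rainfall = [0, 0, 3.2, 0, 12.5, 0.8, 0, 0, 7.1, 1.4]  # hypothetical daily values

# Part 1: the chance of getting a zero value (as opposed to non-zero).
zero_share = sum(1 for x in rainfall if x == 0) / len(rainfall)

# Part 2: the non-zero values are analysed further on their own.
non_zero = [x for x in rainfall if x != 0]
non_zero_mean = statistics.fmean(non_zero)
non_zero_median = statistics.median(non_zero)

print(zero_share, non_zero_mean, non_zero_median)
```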