Data Analysis Techniques

Every organization collects a large volume of data during its day-to-day operation. This data is of little use until it is organized and analyzed to extract useful information from it. Data analysis is the science of examining raw data with the purpose of drawing conclusions about the information contained in the data. It is used in several industries to allow organizations to make better decisions.

Two things are needed for the data to be of use, namely (i) to secure the data, and (ii) to use techniques to extract useful information from the data. The process of analyzing and discovering hidden patterns, undiscovered correlations, and other valuable information from a vast volume of data is known as data analysis. Data analysis is crucial for making effective decisions and for the management of the organization.

The purpose of analyzing data is to get usable and useful information. The analysis, irrespective of whether the data is qualitative or quantitative, can (i) describe and summarize the data, (ii) identify relationships between variables, (iii) compare variables, (iv) identify the difference between variables, and (v) forecast outcomes.

Data analysis is described as a set of concepts and methods intended for presentation of the data in a form which improves the quality of the decision making in the organization. It is the process of collecting, modelling, and analyzing data using different logical methods and techniques. Organizations rely on analytic processes and tools to extract insights which support strategic and operational decision making. Data analysis is important, since it leads to the valuable and useful information which can be used for improving the organizational performance.

Data analysis primarily consists of analysis methodologies, systematic architecture, data mining, and analysis tools. The most crucial aspect of data analysis involves processing of the real-time data, analyzing it, and producing highly accurate analysis results. Data analysis can also be used to investigate potential value, where the information arrived at through data analysis is used for organizational development, performance improvement, and deciding on the actions to be implemented. Data analysis is wide, dynamic, and complex since data comes in different types and grows considerably with time.

The purpose of the data analysis varies depending on the type of application needed. Data analysis normally aims to answer three categories of questions namely (i) what happened in the past, (ii) what is happening now, and (iii) what is anticipated in future. Processing and obtaining the necessary information from an extensive database needs a lot of time and processing power. Moreover, the interdisciplinary nature of the investigation makes it difficult for the organizations to identify the specialist skills and the abilities needed for data analysis.

Data analysis can be defined as a data science used to break data into individual components for inspection and to integrate these components to create knowledge. Informally, there is a five-step ‘value-chain’ approach for extracting useful value from the data using data analysis. These steps are (i) identifying the organizational needs for data, (ii) collecting the data, (iii) exploring and cleaning the data, (iv) analyzing the data, and (v) interpreting the data and drawing inferences. The data interpretation helps in making a decision and in measuring the outcome for the purpose of updating the process or system with the results of the decision. Fig 1 gives the data analysis process.

Fig 1 Data analysis process

Data analysis is simply the process of converting the gathered data to meaningful information. Different techniques, such as modelling, are employed in this process to identify trends and relationships, and hence to reach conclusions which support the decision making process. However, the data needs to be prepared before being used in the data analysis process.

Data are obtained and collected after the measurements. Subsequent to collection, ideally the data are organized, displayed, and examined by using various graphical techniques. As a general rule, the data are to be arranged into categories so that each measurement is classified into one, and only one, of the categories. This procedure eliminates any ambiguity which can otherwise arise when categorizing measurements.

After the data are organized, there are several ways to display the data graphically. The first and simplest graphical procedure for data organization in this manner is the pie chart. It is used to display the percentage of the total number of measurements falling into each of the categories of the variable by partitioning a circle (similar to slicing a pie).

A second graphical technique for displaying the organized data is the bar chart, or bar graph. There are several variations of the bar chart. Sometimes the bars are displayed horizontally. Bars can also be used to display data across time. Bar charts are relatively easy to construct. The two other graphical techniques are the frequency histogram and the relative frequency histogram. These two are constructed with the help of a frequency table, class intervals, class frequencies, and relative frequencies. A histogram can be unimodal, bimodal, uniform, symmetric, or skewed (either to the right or to the left).

Probability is an important aspect while using these techniques. This is because, if a single measurement is selected at random from the set of sample measurements, the chance, or probability, that it lies in a particular interval is equal to the fraction of the total number of sample measurements falling in that interval. This same fraction is used to estimate the probability that a measurement randomly selected from the population lies in the interval of interest.
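The relative frequency of each class interval, read as an estimated probability, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sample values, the class edges, and the function name relative_frequencies are hypothetical and are not taken from any data set in this article.

```python
from collections import Counter

def relative_frequencies(measurements, class_edges):
    """Return (interval, frequency, relative frequency) for each class interval.

    Intervals are half-open [low, high), so each measurement falls into
    one, and only one, class.
    """
    intervals = list(zip(class_edges[:-1], class_edges[1:]))
    counts = Counter()
    for y in measurements:
        for low, high in intervals:
            if low <= y < high:
                counts[(low, high)] += 1
                break
    n = len(measurements)
    return [(iv, counts[iv], counts[iv] / n) for iv in intervals]

# Hypothetical sample of 10 measurements and class edges (illustration only).
sample = [2.1, 3.4, 3.9, 4.2, 4.8, 5.1, 5.5, 6.3, 6.7, 7.9]
edges = [2, 4, 6, 8]
table = relative_frequencies(sample, edges)
for interval, frequency, relative in table:
    print(interval, frequency, relative)
```

Here the relative frequency of the interval from 4 to 6 is 4/10, which is the estimated probability that a randomly selected measurement lies in that interval.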

Since probability is the tool for making inferences, it is required to define probability. In the previous paragraph, the term probability is used in its everyday sense. However, this idea needs to be examined more closely. Observations of phenomena can result in several different outcomes, some of which are more likely than others. Several attempts have been made to give a precise definition for the probability of an outcome. Three of these are given here.

The first interpretation of probability, called the classical interpretation of probability, arose from games of chance. Typical probability statements of this type are, for example, the probability that a flip of a balanced coin shows ‘heads’ is 1/2, and the probability of drawing an ace when a single card is drawn from a standard deck of 52 cards is 4/52. The numerical values for these probabilities arise from the nature of the games. A coin flip has two possible outcomes (a head or a tail), hence the probability of a head is 1/2 (1 out of 2). Similarly, there are 4 aces in a standard deck of 52 cards, so the probability of drawing an ace in a single draw is 4/52, or 4 out of 52. In the classical interpretation of probability, each possible distinct result is called an outcome, and an event is identified as a collection of outcomes. The probability of an event ‘E’ under the classical interpretation of probability is computed by taking the ratio of the number of outcomes, ‘Ne’, favourable to event ‘E’ to the total number ‘N’ of possible outcomes. Hence, ‘P(event E) = Ne/N’. The applicability of this interpretation depends on the assumption that all outcomes are equally likely. If this assumption does not hold, the probabilities indicated by the classical interpretation of probability are in error.
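The classical ratio of favourable outcomes to total outcomes can be illustrated with a short Python sketch that enumerates a standard deck. The function name classical_probability and the card encoding are assumptions made for this illustration, and the calculation is valid only under the equally-likely assumption stated above.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely outcomes

def classical_probability(event, outcomes):
    """P(E) = Ne / N, where Ne counts the outcomes favourable to the event."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

def is_ace(card):
    return card[0] == "A"

p_ace = classical_probability(is_ace, DECK)
p_head = classical_probability(lambda side: side == "head", ["head", "tail"])
print(p_ace, p_head)
```

Using exact fractions keeps the result in the same form as the text, with 4/52 reducing to 1/13.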

The second interpretation of probability is called the relative frequency concept of probability. This is an empirical approach to probability. If an experiment is repeated a large number of times and event ‘E’ occurs 40 % of the time, then 0.4 is a very good approximation to the probability of the event ‘E’. Symbolically, if an experiment is conducted ‘n’ different times and if event ‘E’ occurs on ‘ne’ of these trials, then the probability of event ‘E’ is ‘P(event E) is around ne/n’. The word ‘around’ is used since the actual probability ‘P(event E)’ is thought of as the relative frequency of the occurrence of event ‘E’ over a very large number of observations or repetitions of the phenomenon. The fact that people can check probabilities which have a relative frequency interpretation (by simulating several repetitions of the experiment) makes this interpretation very appealing and practical.
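Checking a probability by simulating repetitions of the experiment can be sketched as follows. This is an illustrative sketch only: the balanced coin, the seed, the numbers of repetitions, and the function names are assumptions chosen for the example.

```python
import random

def flip_coin(rng):
    """One repetition of the experiment: flip a balanced coin."""
    return rng.choice(["head", "tail"])

def is_head(outcome):
    return outcome == "head"

def relative_frequency(experiment, event, n, seed=0):
    """Estimate P(E) as ne / n over n simulated repetitions."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    ne = sum(1 for _ in range(n) if event(experiment(rng)))
    return ne / n

# The estimate settles near the underlying probability as n grows.
for n in (10, 100, 10_000):
    print(n, relative_frequency(flip_coin, is_head, n))

estimate = relative_frequency(flip_coin, is_head, 10_000)
```

With a small number of repetitions the ratio ne/n can be far from 1/2, which is why the text speaks of a large number of repetitions.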

The third interpretation of probability can be used for problems in which it is difficult to imagine a repetition of an experiment. These are ‘one-shot’ situations. For example, the HR director of an organization who estimates the probability that a proposed revision in eligibility rules will be passed by the board of directors of the organization is not thinking in terms of a long series of trials. Rather, the director uses a personal or subjective probability to make a one-shot statement of belief regarding the likelihood of passage of the proposed revision in the meeting of the board of directors. The problem with subjective probabilities is that they can vary from person to person and they cannot be checked.

Of the three interpretations given here, the relative frequency concept seems to be the most reasonable one since it provides a practical interpretation of the probability for most events of interest. Even though people never run the necessary repetitions of the experiment to determine the exact probability of an event, the fact that people can check the probability of an event gives meaning to the relative frequency concept.

Several of the graphical techniques belong to exploratory data analysis (EDA). Professor John Tukey has been the leading proponent of this practical philosophy of data analysis aimed at exploring and understanding data.

The stem-and-leaf plot is a clever, simple graphical technique for constructing a histogram-like picture of a frequency distribution. It allows people to use the information contained in a frequency distribution to show the range of scores, where the scores are concentrated, the shape of the distribution, whether there are any specific values or scores not represented, and whether there are any stray or extreme scores. The stem-and-leaf plot does not follow the organization principles used for histograms.
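A stem-and-leaf plot for two-digit scores can be sketched in Python as follows. The scores are hypothetical, and the choice of the tens digit as the stem and the units digit as the leaf is an assumption which suits non-negative integer scores only.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Map each stem (tens and above) to its sorted leaves (units digit)."""
    plot = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        plot[stem].append(leaf)
    # Keep empty stems so that gaps in the distribution stay visible.
    return {stem: plot.get(stem, []) for stem in range(min(plot), max(plot) + 1)}

# Hypothetical scores (illustration only).
scores = [52, 47, 61, 68, 55, 59, 73, 91, 58, 64]
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The empty stem 8 and the lone leaf on stem 9 show, respectively, a range of scores which is not represented and a stray score.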

There is another graphical technique which deals with how certain variables change over time. For macro-data and micro-data, plots of data over time are fundamental to the organizational management. The technique shows how variables change over time. A pictorial method of presenting changes in a variable over time is called a time series. Normally, time points are labelled chronologically across the horizontal axis, and the numerical values (frequencies, percentages, rates, etc.) of the variable of interest are labelled along the vertical axis. Time can be measured in days, months, years, or whichever unit is most appropriate.

As a rule of thumb, a time series should consist of no fewer than four or five time points. Typically, these time points are equally spaced. Many more time points than this are desirable, though, in order to show a more complete picture of changes in a variable over time. How to display the time axis in a time series frequently depends on the time intervals at which data are available. When information about a variable of interest is available in different units of time, it is to be decided which unit or units are most appropriate. Time series plots are useful for examining general trends and seasonal or cyclic patterns.

Sometimes it is important to compare trends over time in a variable for two or more groups. Sometimes information is not available in equal time intervals. When information is not available in equal time intervals, it is important for the interval width between time points (the horizontal axis) to reflect this fact.

Numerical descriptive measures are measures of central tendency and measures of variability. These are normally used to convey a mental image of pictures, objects, and other phenomena. There are two main reasons for this. First, graphical descriptive measures are inappropriate for statistical inference, since it is difficult to describe the similarity of a sample frequency histogram and the corresponding population frequency histogram. The second reason for using numerical descriptive measures is one of expediency, as people never seem to carry the appropriate graphs or histograms with them, and so have to resort to their powers of verbal communication to convey the appropriate picture.

The two most common numerical descriptive measures are measures of central tendency and measures of variability, i.e., people seek to describe the centre of the distribution of measurements and also how the measurements vary around the centre of the distribution. A distinction is to be drawn between numerical descriptive measures for a population, called parameters, and numerical descriptive measures for a sample, called statistics. In problems needing statistical inference, people are not able to calculate values for various parameters, but they are able to compute corresponding statistics from the sample and use these quantities to estimate the corresponding population parameters.

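The first measure of central tendency is the mode. The mode of a set of measurements is the measurement which occurs most frequently. For grouped data, the mode is taken as the midpoint of the modal interval (the class interval with the highest frequency), and it is an approximation to the mode of the actual sample measurements. The mode is also normally used as a measure of popularity which reflects central tendency or opinion.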

The second measure of central tendency is the median. The median of a set of measurements is defined to be the middle value when the measurements are arranged from lowest to highest. The median is most frequently used to measure the midpoint of a large set of measurements. The median reflects the central value of the data, i.e., the value which divides the set of measurements into two groups, with an equal number of measurements in each group. For small sets of measurements, the median is defined by a convention. As per the convention, the median for an even number of measurements is the average of the two middle values when the measurements are arranged from lowest to highest. When there are an odd number of measurements, the median is still the middle value. Hence, whether there are an even or odd number of measurements, there are an equal number of measurements above and below the median.
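The convention for odd and even numbers of measurements can be sketched in Python as follows. The function name and the sample values are hypothetical and serve only as an illustration.

```python
def median(measurements):
    """Middle value of the ordered measurements (average of the two middle
    values when the number of measurements is even)."""
    ordered = sorted(measurements)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([7, 3, 9])       # ordered: 3, 7, 9
even_case = median([7, 3, 9, 5])   # ordered: 3, 5, 7, 9
print(odd_case, even_case)
```

In both cases the same number of measurements lies below and above the returned value.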

The median for grouped data is slightly more difficult to compute. Since the actual values of the measurements are unknown, it is known that the median occurs in a particular class interval, but it is not known where to locate the median within the interval. If it is assumed that the measurements are spread evenly throughout the interval, the following result is obtained. Let ‘L’ be the lower class limit of the interval which contains the median, ‘n’ the total frequency, ‘cf’ the sum of frequencies (cumulative frequency) for all classes before the median class, ‘fm’ the frequency of the class interval containing the median, and ‘w’ the interval width. Then, for grouped data, median = L + (w/fm)(0.5n – cf).
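The grouped-data formula can be sketched in Python as follows, using the same symbols (L, n, cf, fm, w). The class intervals and frequencies are hypothetical, and the representation of each class as a (lower limit, upper limit, frequency) tuple is an assumption made for this illustration.

```python
def grouped_median(classes):
    """Median for grouped data: L + (w / fm) * (0.5 * n - cf).

    classes: list of (lower_limit, upper_limit, frequency) in ascending order.
    """
    n = sum(frequency for _, _, frequency in classes)
    if n == 0:
        raise ValueError("no measurements")
    cf = 0  # cumulative frequency of all classes before the current one
    for lower, upper, fm in classes:
        if cf + fm >= 0.5 * n:          # this class contains the median
            w = upper - lower
            return lower + (w / fm) * (0.5 * n - cf)
        cf += fm
    raise ValueError("median class not found")

# Hypothetical frequency table (illustration only).
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 5)]
result = grouped_median(classes)
print(result)
```

For this table n = 30 and the median class is 20 to 30 with cf = 13 and fm = 12, so the median is 20 + (10/12)(15 – 13), or around 21.67.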

The third, and last, measure of central tendency is the ‘arithmetic mean’, known simply as the mean. The arithmetic mean, or mean, of a set of measurements is defined to be the sum of the measurements divided by the total number of measurements. When people talk about an ‘average’, they quite frequently are referring to the mean. It is the balancing point of the data set. Because of the important role that the mean plays in statistical inference, special symbols are normally given to the population mean and the sample mean. The population mean is denoted by the Greek letter ‘mu’, and the sample mean is denoted by the symbol ‘y-bar’.

The mean is a useful measure of the central value of a set of measurements, but it is subject to distortion because of the presence of one or more extreme values in the set. In these situations, the extreme values (called outliers) pull the mean in the direction of the outliers to find the balancing point, hence distorting the mean as a measure of the central value. A variation of the mean, called a trimmed mean, drops the highest and lowest extreme values and averages the rest.

By trimming the data, people are able to reduce the impact of very large (or small) values on the mean, and hence get a more reliable measure of the central value of the set. This is particularly important when the sample mean is used to predict the corresponding population central value. It is to be noted that in a limiting sense the median is a 50 % trimmed mean. Hence, the median is frequently used in place of the mean when there are extreme values in the data set.
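The effect of trimming on a data set with one extreme value can be sketched as follows. The data values, the 10 % trimming fraction, and the function name trimmed_mean are assumptions chosen for this illustration.

```python
def trimmed_mean(measurements, fraction):
    """Drop `fraction` of the values from each end, then average the rest.

    `fraction` is assumed to be below 0.5 so that some values remain.
    """
    ordered = sorted(measurements)
    k = int(len(ordered) * fraction)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical measurements with one outlier (95).
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]
mean = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.10)
print(mean, trimmed)
```

The mean of 24.2 is pulled towards the outlier, while the 10 % trimmed mean of 16.875 stays close to the bulk of the measurements.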

How the measures of central tendency (mode, median, mean, and trimmed mean) are related for a given set of measurements depends on the skewness of the data. If the distribution is mound-shaped and symmetrical about a single peak, the mode, median, mean, and trimmed mean are the same. If the distribution is skewed, having a long tail in one direction and a single peak, the mean is pulled in the direction of the tail, the median falls between the mode and the mean, and, depending on the degree of trimming, the trimmed mean normally falls between the median and the mean. The distributions can be skewed to the left or to the right. The important thing to remember is that people are not restricted to using only one measure of central tendency. For some data sets, it is necessary to use more than one of these measures to provide an accurate descriptive summary of central tendency for the data.

It is not sufficient to describe a data set using only measures of central tendency, such as the mean or the median. There is also a need to determine how dispersed the data are about the mean. Graphically, the need for some measure of variability can be observed by examining the relative frequency histograms. There can be several histograms having the same mean but each having a different spread, or variability, about the mean.

The simplest but least useful measure of data variation is the range. The range of a set of measurements is defined to be the difference between the largest and the smallest measurements of the set.

For grouped data, the range is taken to be the difference between the upper limit of the last interval and the lower limit of the first interval since the individual measurements are not known. Although the range is easy to compute, it is sensitive to outliers since it depends on the most extreme values. It does not give much information about the pattern of variability. Two data sets can have identical means and identical ranges. However, the data in one set can be more spread out about the mean than the data in the other set. What is required is a measure of variability which discriminates between data sets having different degrees of concentration of the data about the mean.

A second measure of variability involves the use of percentiles. The ‘pth’ percentile of a set of ‘n’ measurements arranged in order of magnitude is that value which has at most ‘p’ % of the measurements below it and at most ‘(100 – p)’ % above it. Specific percentiles of interest are the 25th, 50th, and 75th percentiles, which are frequently called the lower quartile, the middle quartile (median), and the upper quartile, respectively.

The computation of percentiles is accomplished as described here. Each data value corresponds to a percentile for the percentage of the data values which are less than or equal to it. Let y(1), y(2), ..., y(n) denote the ordered observations for a data set, i.e., y(1) ≤ y(2) ≤ ... ≤ y(n). The ‘jth’ observation, y(j), corresponds to the 100(j – 0.5)/n percentile. This formula is used in place of assigning the percentile 100j/n so that assigning the 100th percentile to ‘y(n)’ is avoided. Such an assignment implies that the largest possible data value in the population is observed in the data set, which is an unlikely happening.
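The assignment of the 100(j – 0.5)/n percentile to the ‘jth’ ordered observation can be sketched as follows. The five data values and the function name are hypothetical and are used only for illustration.

```python
def observation_percentiles(measurements):
    """Pair each ordered observation y(j) with its percentile 100(j - 0.5)/n."""
    ordered = sorted(measurements)
    n = len(ordered)
    return [(y, 100 * (j - 0.5) / n) for j, y in enumerate(ordered, start=1)]

# Hypothetical data set (illustration only).
pairs = observation_percentiles([8, 3, 5, 10, 6])
for y, percentile in pairs:
    print(y, percentile)
```

The largest observation, 10, is assigned the 90th percentile and not the 100th, which is the point of using ‘j – 0.5’ in place of ‘j’.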

When dealing with large data sets, the percentiles are generalized to quantiles, where a quantile, denoted ‘Q(u)’, is a number which divides a sample of ‘n’ data values into two groups so that the specified fraction u of the data values is less than or equal to the value of the quantile, ‘Q(u)’. Plots of the quantiles ‘Q(u)’ versus the data fraction ‘u’ provide a method of obtaining estimated quantiles for the population from which the data are selected.

The interquartile range (IQR) of a set of measurements is defined to be the difference between the upper and lower quartiles, i.e., IQR = 75th percentile – 25th percentile. The interquartile range, although more sensitive to data pileup about the midpoint than the range, is still not sufficient for different purposes. In fact, the IQR can be very misleading when the data set is highly concentrated about the median.

What is needed is a sensitive measure of variability, not only for comparing the variabilities of two sets of measurements but also for interpreting the variability of a single set of measurements. For doing this, people work with the deviation ‘y – y-bar’ of a measurement ‘y’ from the mean ‘y-bar’ of the set of measurements.

A data set with very little variability has the majority of the measurements located near the centre of the distribution. Deviations from the mean for a more variable set of measurements are relatively large. Several different measures of variability can be constructed by using the deviations ‘y – y-bar’. A first thought is to use the mean deviation, but this is always equal to zero since the positive and negative deviations cancel. A second possibility is to ignore the minus signs and compute the average of the absolute values. However, a more easily interpreted function of the deviations involves the sum of the squared deviations of the measurements from their mean. This measure is called the variance. The variance of a set of ‘n’ measurements y1, y2, ..., yn with mean ‘y-bar’ is the sum of the squared deviations divided by ‘n – 1’. As with the sample and population means, there are special symbols to denote the sample and population variances. The symbol ‘s square’ represents the sample variance, and the corresponding population variance is denoted by the symbol ‘sigma square’.

The definition for the variance of a set of measurements depends on whether the data are regarded as a sample or population of measurements. The definition of variance given here assumes working with the sample, since the population measurements normally are not available. Some people define the sample variance to be the average of the squared deviations. However, the use of ‘(n – 1)’ as the denominator of ‘s square’ is not arbitrary. This definition of the sample variance makes it an unbiased estimator of the population variance ‘sigma square’. This means roughly that if a very large number of samples are drawn, each of size ‘n’, from the population of interest and if ‘s square’ is computed for each sample, the average sample variance is equal to the population variance ‘sigma square’.

Another useful measure of variability is the standard deviation. It involves the square root of the variance. One reason for defining the standard deviation is that it yields a measure of variability having the same units of measurement as the original data, whereas the units for variance are the square of the measurement units. The standard deviation of a set of measurements is defined to be the positive square root of the variance. Hence, the symbol ‘s’ denotes the sample standard deviation and the symbol ‘sigma’ denotes the corresponding population standard deviation.
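The sample variance with the ‘n – 1’ denominator and the corresponding standard deviation can be sketched as follows. The data values and function names are hypothetical. The cross-check uses the variance function of the Python standard library statistics module, which also divides by ‘n – 1’.

```python
import math
import statistics

def sample_variance(measurements):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(measurements)
    y_bar = sum(measurements) / n
    return sum((y - y_bar) ** 2 for y in measurements) / (n - 1)

def sample_std(measurements):
    """Positive square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(measurements))

# Hypothetical measurements (illustration only); the mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
deviations = [y - sum(data) / len(data) for y in data]

print(sum(deviations))            # the deviations always sum to zero
print(sample_variance(data))      # 32 / 7
print(sample_std(data))
print(statistics.variance(data))  # library cross-check
```

The deviations sum to zero, which is why the mean deviation cannot serve as a measure of variability, while the squared deviations sum to 32.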

As seen above, there are several measures of variability, each of which can be used to compare the variabilities of two or more sets of measurements. The standard deviation is particularly appealing for two reasons namely (i) the variabilities of two or more sets of data can be compared using the standard deviation, and (ii) the results of the rule which follows can be used to interpret the standard deviation of a single set of measurements. This rule applies to data sets with roughly a ‘mound-shaped’ histogram, i.e., a histogram which has a single peak, is symmetrical, and tapers off gradually in the tails. Since so many data sets can be classified as mound-shaped, the rule has wide applicability. For this reason, it is called the ‘empirical rule’.

The ‘empirical rule’ is that, given a set of ‘n’ measurements possessing a mound-shaped histogram, the interval ‘y-bar +/- s’ contains around 68 % of the measurements, the interval ‘y-bar +/- 2s’ contains around 95 % of the measurements, and the interval ‘y-bar +/- 3s’ contains around 99.7 % of the measurements.
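The ‘empirical rule’ can be checked on simulated mound-shaped data with a short Python sketch. The normally distributed sample, its mean of 50 and standard deviation of 10, the seed, and the function name coverage are all assumptions made for this illustration.

```python
import random
import statistics

def coverage(measurements, k):
    """Fraction of measurements inside the interval y-bar +/- k*s."""
    y_bar = statistics.mean(measurements)
    s = statistics.stdev(measurements)
    inside = sum(1 for y in measurements if abs(y - y_bar) <= k * s)
    return inside / len(measurements)

# Simulated mound-shaped sample (assumed normal, for illustration only).
rng = random.Random(1)
sample = [rng.gauss(50, 10) for _ in range(10_000)]

for k in (1, 2, 3):
    print(k, coverage(sample, k))
```

The printed fractions are expected to be close to, but not exactly, 0.68, 0.95, and 0.997, since the rule is an approximation.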

The results of the ‘empirical rule’ make it possible to get a quick approximation to the sample standard deviation ‘s’. The ‘empirical rule’ states that around 95 % of the measurements lie in the interval ‘y-bar +/- 2s’. The length of this interval is, hence, ‘4s’. Since this interval contains nearly all of the measurements, the range of the measurements is around ‘4s’, and an approximate value for ‘s’ is obtained by dividing the range by 4.

Some people may wonder why the range is not equated to ‘6s’, since the interval ‘y-bar +/- 3s’ contains almost all the measurements. This procedure yields an approximate value for ‘s’ which is smaller than the one obtained by the preceding procedure. If people are going to make an error (as people are bound to do with any approximation), it is better to over-estimate the sample standard deviation so that people are not led to believe there is less variability than is the case.

The standard deviation can be deceptive when comparing the quantity of variability of different types of populations. A unit of variation in one population can be considered quite small, whereas that same quantity of variability in a different population is considered excessive. For comparing the variability in two considerably different processes or populations, it is required to define another measure of variability. The coefficient of variation measures the variability in the values in a population relative to the magnitude of the population mean. In a process or population with mean ‘mu’ and standard deviation ‘sigma’, the coefficient of variation (CV) is defined as CV = sigma/|mu| provided ‘mu’ is not 0. Hence, the coefficient of variation is the standard deviation of the population or process expressed in units of ‘mu’.

The CV is a unit-free number since the standard deviation and mean are measured using the same units. Hence, the CV is frequently used as an index of process or population variability. In several applications, the CV is expressed as a percentage i.e., CV = 100(sigma/|mu|) %. Hence, if a process has a CV of 20 %, the standard deviation of the output of the process is 20 % of the process mean. Using sampled data from the population, people estimate CV with 100(s/|y-bar|) %.
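The estimate 100(s/|y-bar|) % can be sketched in Python as follows. The two sets of measurements are hypothetical and are chosen so that both have the same standard deviation but very different means.

```python
import statistics

def coefficient_of_variation(measurements):
    """Estimated CV as a percentage: 100 * s / |y-bar|."""
    y_bar = statistics.mean(measurements)
    if y_bar == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return 100 * statistics.stdev(measurements) / abs(y_bar)

# Hypothetical outputs of two processes (illustration only).
process_a = [98, 100, 102, 99, 101]   # mean 100
process_b = [8, 10, 12, 9, 11]        # mean 10

cv_a = coefficient_of_variation(process_a)
cv_b = coefficient_of_variation(process_b)
print(cv_a, cv_b)
```

Both processes have a standard deviation of around 1.58, yet the CV is around 1.6 % for the first process and around 15.8 % for the second, which shows why the standard deviation alone can be deceptive when the means differ considerably.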

A stem-and-leaf plot provides a graphical representation of a set of scores which can be used to examine the shape of the distribution, the range of scores, and where the scores are concentrated. The boxplot, which builds on the information displayed in a stem-and-leaf plot, is more concerned with the symmetry of the distribution and incorporates numerical measures of central tendency and location to study the variability of the scores and the concentration of scores in the tails of the distribution. The boxplot uses the median and quartiles of a distribution. Besides the median, there are the lower quartile and the upper quartile.

These three descriptive measures and the smallest and largest values in a data set are used to construct a skeletal boxplot. The skeletal boxplot is constructed by drawing a box between the lower and upper quartiles with a solid line drawn across the box to locate the median. A straight line is then drawn connecting the box to the largest value and a second line is drawn from the box to the smallest value. These straight lines are sometimes called whiskers, and the entire graph is called a box-and-whiskers plot.

With a quick glance at a skeletal boxplot, it is easy to get an impression about the several aspects of the data such as (i) the lower and upper quartiles Q1 and Q3, (ii) the interquartile range (IQR), i.e., the distance between the lower and upper quartiles, (iii) the most extreme (lowest and highest) values, and (iv) the symmetry or asymmetry of the distribution of scores.

The skeletal boxplot can be expanded to include more information about extreme values in the tails of the distribution. For doing so, people need the following additional quantities namely (i) lower inner fence, Q1 – 1.5(IQR), (ii) upper inner fence, Q3 + 1.5(IQR), (iii) lower outer fence, Q1 – 3(IQR), and (iv) upper outer fence, Q3 + 3(IQR). Any score beyond an inner fence on either side is called a mild outlier, and a score beyond an outer fence on either side is called an extreme outlier.
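The fences and the classification of outliers can be sketched in Python as follows. The scores are hypothetical. The quartiles are taken from the quantiles function of the standard library statistics module, whose default interpolation method is only one of several quartile conventions, so the fence values can differ slightly under other conventions. In this sketch a score is counted as mild only when it lies between an inner fence and the corresponding outer fence.

```python
import statistics

def classify_outliers(scores):
    """Return (mild, extreme) outliers using inner and outer fences."""
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr
    extreme = [y for y in scores if y < lower_outer or y > upper_outer]
    mild = [y for y in scores
            if lower_outer <= y < lower_inner or upper_inner < y <= upper_outer]
    return mild, extreme

# Hypothetical scores (illustration only).
scores = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 40]
mild, extreme = classify_outliers(scores)
print("mild:", mild)
print("extreme:", extreme)
```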

Several pieces of information can be drawn from a boxplot. First, the centre of the distribution of scores is indicated by the median line in the boxplot. Second, a measure of the variability of the scores is given by the interquartile range, the length of the box. Since the box is constructed between the lower and upper quartiles, it contains the middle 50 % of the scores in the distribution, with 25 % on either side of the median line inside the box. Third, by examining the relative position of the median line, people can gauge the symmetry of the middle 50 % of the scores. For example, if the median line is closer to the lower quartile than the upper quartile, there is a higher concentration of scores on the lower side of the median within the box than on the upper side. A symmetric distribution of scores has the median line located in the centre of the box. Fourth, additional information about skewness is obtained from the lengths of the whiskers. The longer one whisker is relative to the other one, the more skewness there is in the tail with the longer whisker. Fifth, a general assessment can be made about the presence of outliers by examining the number of scores classified as mild outliers and the number classified as extreme outliers. Boxplots provide a powerful graphical technique for comparing samples from several different treatments or populations.

The graphical methods and numerical descriptive methods for summarizing data from a single variable have been described above. Frequently, more than one variable is being studied at the same time, and people can be interested in summarizing the data on each variable separately, and also in studying relations among the variables. There are a few techniques for summarizing data from two (or more) variables such as chi-square methods, analysis of variance (ANOVA), and regression.

First the problem of summarizing data from two qualitative variables is considered. Cross-tabulations can be constructed to form a contingency table. The rows of the table identify the categories of one variable, and the columns identify the categories of the other variable. The entries in the table are the number of times each value of one variable occurs with each possible value of the other. The simplest method for looking at relations between variables in a contingency table is a percentage comparison based on the row totals, the column totals, or the overall total.
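A cross-tabulation with a percentage comparison based on the row totals can be sketched as follows. The two qualitative variables (shift and inspection result), the counts, and the function names are hypothetical and are used only for illustration.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row category, column category) pairs into counts."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table

def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    return [[100 * cell / sum(row) for cell in row] for row in table]

# Hypothetical observations (illustration only).
observations = ([("day", "pass")] * 8 + [("day", "fail")] * 2
                + [("night", "pass")] * 6 + [("night", "fail")] * 4)

rows, cols, table = contingency_table(observations)
percentages = row_percentages(table)
print(cols)
for row, counts_row, pct_row in zip(rows, table, percentages):
    print(row, counts_row, pct_row)
```

Comparing the rows shows a failure percentage of 20 % for the day shift against 40 % for the night shift in this made-up data, which is the kind of relation the percentage comparison is meant to reveal.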

An extension of the bar graph known as the stacked bar graph provides a convenient method for displaying data from a pair of qualitative variables. Another extension of the bar graph known as the cluster bar graph provides a convenient method for displaying the relationship between a single quantitative and a qualitative variable.

Data plots can be constructed to summarize the relation between two quantitative variables. This can be done by using a scatterplot. Each point on the plot represents one observation, with its value of one variable plotted on the x axis and its value of the other variable plotted on the y axis. The line fitted to the data points, called the least squares line, summarizes the relationship between y and x. This line allows the prediction of y for values of x which are not represented in the data set.
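As a minimal sketch of how a least squares line is obtained and then used for prediction, the slope and intercept can be computed directly from the data. The function name `least_squares_line` and the data values are assumptions made for this example.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = least_squares_line(xs, ys)

# Predict y for an x value which is not in the data set.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

Predictions are most trustworthy for x values close to the observed range, since the fitted line says nothing about how the relation behaves far outside the data.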

When there is a need to construct data plots for summarizing the relation between several quantitative variables, then side-by-side boxplots are made for the comparison.

People are often unsure about which type of analysis is to be used on a set of data and which forms of presentation or data display are relevant for it. The decision is based on the scale of measurement of the data. These scales are nominal, ordinal, and numerical. The nominal scale is where the data can be classified into non-numerical or named categories, and the order in which these categories can be written or asked is arbitrary. The ordinal scale is where the data can be classified into non-numerical or named categories and an inherent order exists among the response categories. Ordinal scales are seen in questions which call for ratings of quality (e.g., very good, good, fair, poor, very poor) and agreement (e.g., strongly agree, agree, disagree, strongly disagree). The numerical scale is where numbers represent the possible response categories, there is a natural ranking of the categories, zero on the scale has meaning, and there is a quantifiable difference within categories and between consecutive categories.

Data analysis can be categorized into six main methods namely (i) descriptive, (ii) exploratory, (iii) inferential, (iv) predictive, (v) explanatory or causal, and (vi) mechanistic.

Descriptive – It is recognized as the first type of data analysis. It is known as the method which needs the least amount of effort. Hence, it can be used for large volumes of data. Here the data is used to describe the data set.

Exploratory – This method is used to explore the unknown relationships and discover new connections, and define future studies or questions.

Inferential – The inferential method uses a small sample to draw conclusions about a larger population. It means that data from a sample of subjects is used to test a general theory about the nature of the world from which the sample is drawn. The types of data sets which can be used in this method are observational, retrospective, and cross-sectional time study data sets.

Predictive – Predictive analysis utilizes historical and present facts to reach future predictions. It can also use data from a subject to predict the values of another subject. There are different predictive models. However, a simple model with more data can work better in general. Hence, the prediction data set and also the determination of the measuring variables are important aspects to consider.

Explanatory – This method is used to determine what happens to one variable when another one is changed, using randomized trial data sets.

Mechanistic – This method needs the most effort, since it determines the exact changes in variables which lead to changes in other variables, using randomized trial data sets. Mechanistic analysis is hard to infer. Hence, it can be the choice when people need high precision in the result and are to minimize the errors.

Descriptive analysis is a method which summarizes the data to reach a simple presentation as a result. This method can be categorized into (i) univariate analysis, and (ii) bivariate analysis. Univariate analysis is a set of different statistical tools which look for characteristics and general properties of one variable. These statistical techniques are (i) frequency, (ii) central tendency, and (iii) dispersion.

Frequency distribution is the most basic method to determine the distribution of a variable. It determines all possible values for a specific variable and the number of times, or the frequency, with which each of those values occurs in the data set. The central tendency of a distribution, which is also known as the three Ms (mode, mean, and median), is used to determine a single representative value which people can use in place of the whole set of data when making comparisons. The mode, mean, and median are the most commonly used measures of central tendency. Mode is the most frequently occurring value, mean is simply the average of the values, and median is the middle value in the ordered data set.
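The frequency distribution and the three measures of central tendency can be illustrated with the Python standard library. The scores below are hypothetical and are chosen so that the mode, median, and mean all differ.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical scores for a single variable.
scores = [2, 3, 3, 3, 4, 5, 6, 8, 11]

# Frequency distribution: each distinct value and how often it occurs.
frequency = Counter(scores)

# The three Ms of central tendency.
summary = {
    "mode": mode(scores),      # most frequently occurring value
    "mean": mean(scores),      # arithmetic average
    "median": median(scores),  # middle value of the ordered data
}

for value in sorted(frequency):
    print(value, frequency[value])
print(summary)
```

In this sample the mean is pulled above the median by the few large scores, which is one reason why the three measures are normally reported together.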

Dispersion is the way in which the values of a variable spread around the central tendency. Common tools are range, variance, and the square root of the variance which is known as standard deviation. The range is the difference between the highest and the lowest values. The variance shows how closely the values are concentrated around the average value. Bivariate analysis is used when there are two variables in the data set and when there is also need of comparison between two data sets. Hence, it can show the relationship and connections between these two variables.
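The three dispersion measures can be computed as in the sketch below. The data values are hypothetical, and the sample form of the variance (divisor n - 1) is an assumption, since the text does not state whether the sample or the population form is meant.

```python
from statistics import stdev, variance

# Hypothetical measurements of a single variable.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the highest and the lowest values.
value_range = max(data) - min(data)

# Sample variance (divisor n - 1, an assumption for this example).
sample_variance = variance(data)

# Standard deviation: square root of the variance.
sample_std = stdev(data)

print(value_range, sample_variance, sample_std)
```

The functions `pvariance` and `pstdev` from the same module give the population forms when the data covers the whole population and not only a sample.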

Bivariate correlation, or simply correlation, is the most common measure in bivariate analysis. This measure uses a specific formula to calculate correlation using the sample mean values and standard deviations. This method can also be used when the number of variables is more than two. Although applying the formulas manually is complex, these problems can be readily solved with software.
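One common formula of this kind is the Pearson correlation coefficient, which divides the sample covariance by the product of the two sample standard deviations. The sketch below assumes that this is the intended measure. The function name `pearson_r` and the data are hypothetical.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation from sample means and sample standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    # Sample covariance (divisor n - 1, matching statistics.stdev).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return covariance / (stdev(xs) * stdev(ys))


# Hypothetical paired observations of two variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(xs, ys)
print(round(r, 4))
```

The coefficient lies between -1 and +1. Values near the extremes indicate a strong linear relation, and values near zero indicate little or no linear relation.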

Explanatory analysis as discussed before looks to find influences. It means explanatory analysis tries to reach the answer to those questions which are related to the connections, relations, and patterns between variables. The main explanatory analysis techniques are (i) dependence, and (ii) inter-dependence methods. Fig 2 gives the main tools for these techniques.

Fig 2 Explanatory analysis techniques

Dependence is concerned with the impact of a set of predictor variables on a single outcome variable. For dependence techniques the common tools used are (i) analysis of variance (ANOVA), (ii) multivariate analysis of variance (MANOVA), (iii) structural equation modelling, (iv) logistic regression, and (v) multiple discriminant analysis.

Analysis of variance (ANOVA) – It is utilized to compare the mean outcomes, calculated separately, of different groups, where the groups are the categories of a multi-category predictor variable. This tool can be used in inferential analysis.
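A minimal sketch of the one-way case is given below. It computes the F statistic as the ratio of the between-group mean square to the within-group mean square. The function name `one_way_anova_f` and the three groups are hypothetical, and the sketch stops at the F statistic without deriving a p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical outcome values for three categories of the predictor variable.
groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]

f_statistic = one_way_anova_f(groups)
print(f_statistic)
```

A large F value indicates that the group means differ by more than the variation within the groups can explain. Judging significance needs the F distribution with k - 1 and n - k degrees of freedom, which statistical software provides.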

Multivariate analysis of variance (MANOVA) – As an extension of the ANOVA tool, it allows the comparison to be made across two or more outcome variables. Here the groups are the equivalent of a categorical predictor variable.

Structural equation modelling – The relationships between a series of inter-related and predictor variables can be handled using this method which is also known as linear structural relations (LISREL). The structural relationship between the variable which is measured and the latent construct can be defined using this tool. As a comprehensive method, it is beneficial for testing hypotheses as well as for testing, presenting, and estimating theoretical networks of relations between variables, which are mostly linear. It can also be applied to measurement equivalence in complex analyses of first-order and higher-order constructs.

Logistic regression – The logistic regression method covers multiple regression with dichotomous outcome variables and categorical or metric predictor variables. Logistic regression derives its name from the sigmoid function, which is also known as the logistic function. The logistic function is an S-shaped curve which stretches from zero to one, while never being exactly zero and never being exactly one, either. There are three main types of logistic regression namely (i) binary, (ii) multinomial, and (iii) ordinal. They differ in execution and theory. Binary logistic regression deals with two possible outcome values, essentially ‘yes or no’. Multinomial logistic regression deals with three or more unordered values. Ordinal logistic regression deals with three or more classes in a pre-determined order.
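The S-shaped logistic function, and its use for a binary ‘yes or no’ outcome, can be sketched as below. The function `predict_binary`, the weight, the bias, and the 0.5 threshold are hypothetical values chosen for illustration and are not fitted to any data.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_binary(x, weight, bias, threshold=0.5):
    """Return (probability of 'yes', decision) for a single predictor value."""
    probability = logistic(weight * x + bias)
    return probability, probability >= threshold


# The curve rises from near zero, through 0.5 at z = 0, to near one.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, round(logistic(z), 4))

# Hypothetical coefficients, not estimated from data.
probability, decision = predict_binary(2.0, weight=1.5, bias=-1.0)
print(round(probability, 4), decision)
```

In practice the weight and bias are estimated from data by statistical software, and the function above only shows how a fitted model turns a predictor value into a probability.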

Multiple discriminant analysis – This method can be used as an alternative to the forms of logistic regression. It can handle any number of predictor variables together with a single outcome variable, whether that outcome is dichotomous or multichotomous.

Inter-dependence techniques are multivariate analyses which aim to determine the inter-relationships between the variables without any assumption about the direction of influence. These are used to find the relationships among a set of variables, and they consider neither explanation nor influence. The common tools used in these analyses are (i) factor analysis, (ii) cluster analysis, and (iii) multi-dimensional scaling.

Factor analysis – This method is useful to analyze and discover the patterns and relationships when people face a large number of variables, by reducing the variables to either a new variate or a derived smaller set of factors. It is not a method aiming to predict the outcome variables. As a series of statistical techniques, it is used to uncover the latent factors which drive the observable variables.

Cluster analysis – This method is used to classify objects or individuals. For this purpose, mutually exclusive groups are constructed to maximize both the homogeneity within each cluster and the heterogeneity between clusters. Factor analysis groups variables as factors, while the cluster method groups objects or people together by considering different criteria.

Multi-dimensional scaling – It is also called perceptual mapping and is used to identify key dimensions considering individuals’ judgments and perceptions. For this purpose, judgments and perceptions are transformed into distances which are represented in multi-dimensional space. In this method, the outcome variable, which is the judgment or perception, is known, and people are to determine the independent variables, which are the perceptual dimensions. This is, in fact, the main difference between this method and cluster analysis.
