Data and Presentation of Data
Data refer to the set of observations, values, elements, or objects under consideration. They also refer to the known facts or things used as a basis for inference or reckoning, i.e., facts, information, or material to be processed or stored.
Data are a set of facts, and provide a partial picture of reality. Whether data are being collected for a certain purpose or already collected data are being utilized, questions regarding what information the data are conveying, how the data can be used, and what is to be done to include more useful information must constantly be kept in mind. A large quantity of data is generated during the operation of a manufacturing plant such as a steel plant.
The word ‘data’ means information. The adjective ‘raw’ attached to data indicates that the information collected cannot be used directly. It has to be converted into a more suitable form before it begins to make sense and can be utilized gainfully. Raw data are to be converted into a proper form such as a tabulation or a frequency distribution before any inference is drawn from them.
For understanding the nature of data, it becomes necessary to study about the various forms of data. The forms of data are (i) qualitative or categorical and quantitative or numerical data, (ii) continuous and discrete data, and (iii) primary and secondary data.
The categorical or qualitative data result from information which has been classified into categories. Such categories are listed alphabetically or in order of decreasing frequencies or in some other conventional way. Each piece of data clearly belongs to one classification or category. The numerical or quantitative data result from counting or measuring. Numerical or quantitative data can be continuous or discrete depending on the nature of the elements or objects being observed.
Continuous data arise from the measurement of continuous attributes or variables, in which individuals can differ by quantities just approaching zero. Discrete data are characterised by gaps in the scale, for which no real values can ever be found. Such data are normally expressed in whole numbers. All measurements of continuous attributes are approximate in character and as such do not provide a basis for distinguishing between continuous and discrete data. The distinction is made on the basis of the variable being measured. The parameter being measured can be a continuous variable even though its recorded values form discrete data.
Data collected by or on behalf of the person or people who are going to make use of them are referred to as primary data. When an individual personally collects data or information pertaining to an event according to a definite plan or design, the data are primary data. The primary data are of several types, namely (i) nominal data, (ii) ordinal data, (iii) ranked data, (iv) discrete data, and (v) continuous data.
Nominal data – In certain types of studies, the investigator meets several different types of numerical data, which have varying degrees of structure in the relationships among possible values. Nominal data are one of the simplest types of data. In nominal data, the values fall into unordered categories or classes. In a certain study, for instance, steel rounds can be assigned the value 1 and steel sections can be assigned the value 0. Numbers are used mainly for the sake of convenience. Numerical values allow the use of computers to perform complex analysis of the data. Nominal data which take on one of two distinct values, such as rounds and sections, are said to be dichotomous or binary, depending on whether the Greek or the Latin root for two is preferred. However, not all nominal data need to be dichotomous. Frequently there are three or more possible categories into which the observations can fall.
Ordinal data – When the order among categories becomes important, the observations are referred to as ordinal data. For example, injuries can be classified according to their level of severity, so that ‘1’ represents a fatal injury, ‘2’ is severe injury, ‘3’ is moderate injury, and ‘4’ is minor injury. Here a natural order exists among the groupings i.e., a smaller number represents a more serious injury.
Ranked data – In some situations, there exists a group of observations which are first arranged from highest to lowest according to magnitude and then assigned numbers which correspond to each observation’s place in the sequence. This type of data is known as ranked data. As an example, consider all possible causes of accidents in the plant. A list of all of those causes is made along with the number of man-days lost because of each cause. If the causes are ordered from the one which has resulted in the highest number of man-days lost to the one which has caused the lowest and then assigned consecutive integers, the data are said to have been ranked.
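The ranking procedure described above can be sketched in a few lines of Python. The causes and man-days figures below are hypothetical and serve only as an illustration. Ties are not treated specially in this sketch.

```python
# Hypothetical man-days lost per accident cause (illustrative numbers only).
man_days_lost = {
    "slips and falls": 42,
    "burns": 65,
    "material handling": 30,
    "electrical": 12,
}

# Order the causes from highest to lowest man-days lost,
# then assign consecutive integers 1, 2, 3, ... as ranks.
ordered = sorted(man_days_lost.items(), key=lambda item: item[1], reverse=True)
ranks = {cause: rank for rank, (cause, _) in enumerate(ordered, start=1)}

for cause, rank in ranks.items():
    print(rank, cause, man_days_lost[cause])
```

Once ranked, only the position in the sequence is retained. The rank does not indicate by how many man-days two neighbouring causes differ.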
Discrete data – For discrete data, both ordering and magnitude are important. In this case, the numbers represent actual measurable quantities rather than mere labels. In addition, discrete data are restricted to taking on only specified values (frequently integers or counts) which differ by fixed quantities, and no intermediate values are possible. Examples of discrete data include the number of billets produced in a particular heat, the tonnage of sections produced in a month, the percentage of primed steel produced over one year period, and the percent yield achieved in the wire rod mill. It is to be noted that for discrete data a natural order exists among the possible values.
Continuous data – Data which represent measurable quantities but are not restricted to taking on certain specified values (such as integers) are known as continuous data. In this case, the difference between any two possible data values can be arbitrarily small. Examples of continuous data include air / fuel ratio, the concentration of a pollutant, and temperature of the steel stock in a reheating furnace. In all instances, fractional values are possible. Since it is possible to measure the interval between two observations in a meaningful way, arithmetic averages can be applied. The only limiting factor for a continuous observation is the degree of accuracy with which it can be measured. Hence, time is frequently rounded off to the nearest second and quantity to the nearest unit value. The more accurate the measuring instrument, the greater the amount of detail which can be achieved in the recorded data.
In general, the degree of precision needed in a given set of data depends on the questions which are being studied. As the investigation progresses from nominal to continuous data, the nature of the relationship between possible data values becomes increasingly complex. Distinctions are to be made among the various types of data since different techniques are used to analyze them.
When a person uses data already collected by another person for a study, the data are called secondary data. Secondary data are data collected by some other person or organization for their own use which the investigator also obtains for use. For several reasons it becomes necessary to use secondary data. These are to be used carefully, since the data can have been collected with a purpose different from that of the investigator and can lose some detail or not be fully relevant. The same data can be primary for one purpose and secondary for another.
For using secondary data, it is always useful to know (i) how the data have been collected and processed, (ii) the accuracy of data, (iii) how far the data have been summarized, (iv) how comparable the data are with other tabulations, and (v) how to interpret the data, especially when figures collected for one purpose are used for another purpose.
Data collection and related terms
Population – The complete set of all possible elements or objects is called a population. Each of the elements is called a piece of data. A population is a collection of units or objects of which some property is defined for every unit or object. A population can consist of a finite or an infinite number of units. The population is also called the universe by some people. The number of employees in the organization, number of rolling mills in a plant, length of rail track in the plant, and number of feeders in a sinter plant are a few examples of finite populations. All real numbers, and inclusions in liquid steel are examples of infinite populations. Normally, the population has a large number of animates and inanimates. Moreover, the units or subjects constituting the population can vary from study to study in the same area of activity depending upon the aims and objective of the study. In brief, one is to keep in mind that a statistical population is not the human population which is normally meant by ‘population’ in the literary sense. It is normally a group or collection of items specified by certain characteristics or defined under certain restrictions.
Sample – A sample is the portion (a part or fraction) of the population which is examined to make inferences about the population and which represents it. A sample consists of a few items of the population. In principle, a sample is to be a true representative of the population.
Sampling unit – The constituents of a population which are the individuals to be sampled from the population and cannot be further subdivided for the purpose of sampling at a time are called sampling units.
Sampling frame – For accepting any sampling procedure, it is necessary to have a list or a map identifying each sampling unit by a number. Such a list or map is called sampling frame.
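As an illustration of how a sampling frame is used, the sketch below numbers each sampling unit and then draws a simple random sample without replacement. Simple random sampling is an assumption made here for the example, since the text does not prescribe a sampling procedure. The billet labels and the function name `draw_simple_random_sample` are hypothetical.

```python
import random

# Hypothetical sampling frame: each sampling unit (here, a billet)
# is identified by a number, as the definition above requires.
sampling_frame = {number: f"billet-{number:03d}" for number in range(1, 201)}


def draw_simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size distinct unit numbers from the frame without replacement."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(frame), sample_size)
    return {number: frame[number] for number in chosen}


sample = draw_simple_random_sample(sampling_frame, sample_size=10, seed=1)
print(sorted(sample))
```

Fixing the seed makes the draw repeatable, which can be convenient when the selection has to be documented or audited.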
Once data has been collected, it has to be classified and organized in such a way that it becomes easily readable and interpretable, i.e., converted to information. Before the calculation of descriptive statistics, it is sometimes a good idea to present data as tables, charts, diagrams or graphs. Majority of people find ‘pictures’ much more helpful than ‘numbers’ in the sense that, in their opinion, they present data more meaningfully.
People frequently have to deal with very large amounts of data. These data are represented by a jumble of numbers. To make sense out of these data, one has to organize and summarize them in some systematic fashion. The most basic method for organizing data is to classify the observations into a frequency distribution. A frequency distribution is a table which reports the number of data which fall into each category of the variable which is being analyzed. Constructing a frequency distribution is normally the first step in the analysis of the data.
Data are normally collected in a raw format and hence the inherent information is difficult to understand. Hence, raw data need to be summarized, processed, and analyzed. However, no matter how well manipulated, the information derived from the raw data is to be presented in an effective format. These days, data are frequently summarized, organized, and analyzed with statistical packages or graphics software. Data need to be prepared in such a way they are properly recognized by the program being used.
The techniques of data and information presentation are textual, tabular, and graphical forms. The method of presentation is to be determined according to the data format, the method of analysis to be used, and the information to be emphasized. Inappropriately presented data fail to clearly convey information. Even when the same information is being conveyed, different methods of presentation are normally used depending on what specific information is going to be emphasized. A method of presentation is to be chosen after carefully weighing the advantages and disadvantages of different methods of presentation.
Text is the principal method for explaining findings, outlining trends, and providing contextual information. Data are fundamentally presented in paragraphs or sentences. Text can be used to provide interpretation or emphasize certain data. If the quantitative information to be conveyed consists of one or two numbers, it is more appropriate to use written language than tables or graphs. If this information is presented in a graph or a table, it occupies an unnecessarily large space, without enhancing the readers’ understanding of the data. If more data are to be presented, or other information such as that regarding data trends is to be conveyed, a table or a graph is more appropriate. By nature, data take longer to read when presented in a text form, and when the main text includes a long list of information, readers have difficulties in understanding the information.
A table is best suited for representing individual information and can represent both quantitative and qualitative information. Tables, which convey information which has been converted into words or numbers in rows and columns, have been used for nearly 2,000 years. Anyone with a sufficient level of literacy can easily understand the information presented in a table. The strength of tables is that they can accurately present information which cannot be presented with a graph. A number such as 35.253485 can be accurately expressed in a table. Another strength is that information with different units can be presented together. Tables are also useful for summarizing and comparing quantitative information of different variables. However, the interpretation of information takes longer in tables than in graphs, and tables are not appropriate for studying data trends. Furthermore, since all data are of equal importance in a table, it is not easy to identify and selectively choose the information required.
Heat maps are used for better visualization of information. Heat maps help to further visualize the information presented in a table by applying colours to the background of cells. By adjusting the colours or colour saturation, information is conveyed in a more visible manner, and people can quickly identify the information of interest. Several software packages have features which enable easy creation of heat maps.
A graph is a very effective visual tool as it displays data at a glance, facilitates comparison, and can reveal trends and relationships within the data such as changes over time, frequency distribution, and correlation or relative share of a whole. Text, tables, and graphs for data and information presentation are very powerful communication tools. A graph format which best presents information is to be chosen so that readers can easily understand the information. Further, majority of the recently introduced statistical packages and graphics software have the three-dimensional (3D) effect feature. The 3D effects can add depth and perspective to a graph. However, since they can make reading and interpreting data more difficult, they must only be used after careful consideration.
Whereas tables can be used for presenting all the information, graphs simplify complex information by using images and emphasizing data patterns or trends, and are useful for summarizing, explaining, or exploring quantitative data. While graphs are effective for presenting large quantities of data, they can be used in place of tables to present small sets of data.
Visual presentation of data has become more popular and is frequently used for data analysis. Visual presentation of data means presentation of the data in the form of diagrams and graphs. These days, almost every analysis of data is supported with visual presentation since it has several advantages, which are described below.
- Visual presentation relieves the dullness of the numerical data. A list of numerical figures becomes less comprehensible and difficult to draw conclusions from as its length increases. Scanning of the numerical figures from tables causes undue strain on the mind. The data when presented in the form of diagrams and graphs, gives a bird’s eye-view of the entire data and creates interest and leaves an impression on the mind of readers for a long period.
- The visual presentation makes comparison easy, and this is one of the prime objectives of visual presentation of data. Diagrams and graphs make quick comparison between two or more sets of data simpler, and the direction of curves brings out hidden facts and associations of the data.
- The visual presentation saves time and effort. The characteristics of statistical data, through tables, can be grasped only after a great strain on the mind. Diagrams and graphs reduce the strain and save a lot of time in understanding the basic characteristics of the data.
- The visual presentation facilitates the location of different measures and establishes the trends. Graphs make it possible to locate several measures of central tendency such as median, quartiles, and mode etc. They help in establishing trends of the past performance and are useful in interpolation or extrapolation, line of best fit, and establishing correlation etc. Hence, they help in forecasting.
- The visual presentation has universal applicability since it is a universal practice to present the numerical data in the form of diagrams and graphs. In these days, it is an extensively used technique in several fields.
- Diagrammatic and graphic presentation have become an integral part of various studies. In fact, nowadays it is difficult to find a study without visual support. This is because visual presentation is the most convincing and appealing way of presenting the data.
Graphic representation is one of the ways of analyzing numerical data. A graph is a diagram through which data are represented in the form of lines or curves drawn across the coordinate points plotted on its surface. Graphs enable the study of the cause-and-effect relationship between two variables. They help to measure the extent of change in one variable when another variable changes by a certain amount. They also enable the study of both time series and frequency distributions, as they give a clear account and precise picture of the problem. They are also easy to understand and eye catching.
General principles of graphic presentation are some algebraic principles which apply to all types of graphic presentation of data. In a graph, there are two lines called coordinate axes. One is vertical, known as the y-axis, and the other is horizontal, called the x-axis. These two lines are perpendicular to each other. The point where these two lines intersect each other is called ‘0’ or the ‘origin’. On the x-axis, distances to the right of the origin have positive values and distances to the left of the origin have negative values. On the y-axis, distances above the origin have positive values and distances below the origin have negative values.
There are some minimum standards and rules for the graphical data presentation. The data-to-ink ratio and the data density index, i.e., the number of entries in the data matrix per area of data graphic, need to be as high as possible. The things to be avoided in the graphical data presentation are unnecessary shading, gridlines, three-dimensionality, or overlap. A figure tells a story, and byplay seriously distracts the people from the key message. The graphs are needed to convey a maximum of information with a minimum number of graphical elements. Although selective use of colour can greatly enhance information flow and highlight the key message, it has little meaning from a technical view. Technical information is not improved by turning graphs into artwork, but by as simple and comprehensible a design as possible. All axes and elements of a graph are to be unequivocally labelled. Natural scales with the entire range of values are strongly desired. Font sizes are to be adapted to the size of the graph and the graph area. The frequently used graph formats and the types of data which are appropriately presented with each format are described below.
Scatter diagrams present data on the x-axis and y-axis and are used to investigate an association between two variables. A point represents each individual or object, and an association between two variables can be studied by analyzing patterns across multiple points. A regression line is added to a graph to determine whether the association between two variables can be explained or not.
Scatter diagram enables a person to check whether there exists a relationship between two variables by examining the pattern of points. In fact, it even reveals the nature of the relationship, i.e., whether it is linear or non-linear, by the shape of the pattern. Scatter diagrams are especially useful in regression and correlation analyses.
Scatter diagram consists of graphs drawn with pairs of numerical data, with one variable on each axis, to look for relationship between them. If the variables are correlated, the points fall along a line or curve. The scatter diagram is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis (x-axis) and the value of the other variable determining the position on the vertical axis (y-axis).
Scatter diagram is used when a variable exists which is under the control of the operator. If a parameter exists which is systematically incremented and / or decremented by the other, it is called the control parameter or independent variable and is customarily plotted along the x-axis. The measured or dependent variable is customarily plotted along the y-axis. If no dependent variable exists, either type of variable can be plotted on either axis, and a scatter diagram illustrates only the degree of correlation (not causation) between two variables.
The scatter diagram is a useful plot for identifying a potential relationship between two variables. Data are collected in pairs on the two variables, say, (yi, xi) for i = 1, 2, ..., n. Then yi is plotted against the corresponding xi. The shape of the scatter diagram frequently indicates what type of relationship can exist between the two variables. Scatter diagram is useful in regression modelling. Regression is a very useful technique in the ‘analyze’ step of DMAIC (define, measure, analyze, improve, and control). Typical scatter diagrams are shown in Fig 1.
Fig 1 Typical scatter diagrams
Fig 1b shows a scatter diagram relating metal recovery (in %) from a magna-thermic smelting process for magnesium against corresponding values of the quantity of reclaim flux added to the crucible. The scatter diagram indicates a strong positive correlation between metal recovery and flux quantity, i.e., as the quantity of flux added is increased, the metal recovery also increases. It is tempting to conclude that the relationship is one based on cause and effect. By increasing the quantity of reclaim flux used, one can always ensure high metal recovery. This thinking is potentially dangerous, since correlation does not necessarily imply causality.
This apparent relationship can be caused by something quite different. For example, both variables can be related to a third one, such as the temperature of the metal prior to the reclaim pouring operation, and this relationship can be responsible for what is seen in Fig 1. If higher temperatures lead to higher metal recovery and the practice is to add reclaim flux in proportion to temperature, then the addition of more flux when the process is running at low temperature does nothing to improve yield. The scatter diagram is useful for identifying potential relationships. Designed experiments are to be used to verify causality.
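The strength of a linear association seen in a scatter diagram is commonly summarized by the Pearson correlation coefficient, and the regression line is commonly fitted by least squares. The sketch below computes both from first principles. The flux and recovery figures are hypothetical and are not the data behind Fig 1. The function names are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical pairs: reclaim flux quantity (x) and metal recovery in % (y).
flux = [2.0, 4.0, 6.0, 8.0, 10.0]
recovery = [70.0, 74.0, 79.0, 83.0, 88.0]

r = pearson_r(flux, recovery)
slope, intercept = least_squares_line(flux, recovery)
print(f"r = {r:.3f}, recovery = {intercept:.2f} + {slope:.2f} * flux")
```

A coefficient close to +1 or -1 only describes how closely the points follow a straight line. As noted above, it says nothing about whether one variable causes the other.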
Bar graph, also known as a column graph, is a pictorial representation of data. It is used to indicate and compare values in a discrete category or group, and the frequency or other measurement parameters (e.g., mean). Depending on the number of categories, and the size or complexity of each category, bars can be created vertically or horizontally. In the bar graph, bars are shown in the form of rectangles of equal width with equal spaces between them. The equal width and equal space criteria are important characteristics of a bar graph. The height (or length) of each bar corresponds to the frequency of a particular observation. People can draw bar graphs either vertically or horizontally, depending on whether they take the frequency along the vertical axis or the horizontal axis.
The height (or length) of a bar represents the quantity of information in a category. Bar graphs are flexible, and can be used in a grouped or subdivided bar format in cases of two or more data sets in each category. By comparing the endpoints of bars, one can identify the largest and the smallest categories, and understand gradual differences between each category. It is advised to start the x-axis and y-axis from ‘0’. Illustration of comparison results in the x-axis and y-axis which do not start from ‘0’ can deceive people’s eyes and lead to over-representation of the results.
One form of vertical bar graph is the stacked vertical bar graph. A stacked vertical bar graph is used to compare the sum of each category, and analyze parts of a category. While stacked vertical bar graphs are excellent from the aspect of visualization, they do not have a reference line, making comparison of parts of various categories challenging. Fig 2 is an example of vertical bar graphs.
Fig 2 Vertical bar graphs
A pie chart, which is used to represent nominal data (in other words, data classified in different categories), visually represents a distribution of categories. A pie chart needs a list of categorical variables and the corresponding numerical values. Here, the term ‘pie’ represents the whole and the ‘slices’ represent the parts of the whole. The pie chart is also known as a ‘circle graph’ since it divides the circular statistical graphic into sectors or slices in order to illustrate the numerical data.
The pie chart is normally the most appropriate format for representing information grouped into a small number of categories. It is also used for data which cannot be represented in any other way. The application of 3D effects on a pie chart makes distinguishing the size of each slice difficult. Even if slices are of similar sizes, slices farther from the front of the pie chart can appear smaller than the slices closer to the front because of the false perspective.
A pie chart contains different segments or sectors, in which each sector forms a certain portion (percentage) of the total. Each sector denotes a proportionate part of the whole. The total of all the data corresponds to 360 degrees. The total value of the pie is always 100 %. Fig 3 shows pie charts. It also shows a comparison of a simple pie chart versus a 3D pie chart.
Fig 3 Pie charts
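Since the whole pie corresponds to 360 degrees and 100 %, the angle and percentage of each slice follow directly from the share of that category in the total. A minimal sketch is given below. The product categories and tonnages are hypothetical.

```python
# Hypothetical production by product category, in tons.
production = {"rounds": 450.0, "sections": 300.0, "flats": 150.0, "wire rod": 100.0}

total = sum(production.values())

# Each slice gets the same share of 100 % and of 360 degrees
# as its category has of the total.
slices = {
    category: {
        "percent": 100.0 * value / total,
        "angle_deg": 360.0 * value / total,
    }
    for category, value in production.items()
}

for category, s in slices.items():
    print(f"{category:<10}{s['percent']:6.1f} %{s['angle_deg']:8.1f} deg")
```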
A line graph is useful for representing time-series data such as monthly production or yearly capacity utilization. In other words, it is used to study variables which are observed over time. Line graphs are especially useful for studying patterns and trends across data which include climatic influence, large changes or turning points, and are also appropriate for representing not only time-series data, but also data measured over the progression of a continuous variable such as distance.
Data can also be presented in the form of line graphs. A line graph records the relationship between two variables. If one of the two variables is time then a time series line graph is obtained. If data are collected at a regular interval, values in between the measurements can be estimated. In a line graph, the x-axis represents the continuous variable, while the y-axis represents the scale and measurement values. It is also useful to represent multiple data sets on a single line graph to compare and analyze patterns across different data sets. Fig 4 gives such graph in which growth in production of a rolling mill is shown after its commissioning. In this graph, time is represented on the x-axis and production on the y-axis. Time and production are two variables in this graph. It is the production which changes with time. Since production changes with time, it is said to be dependent on time. Production is, hence, treated as a dependent variable. Time is not influenced by production and hence taken as an independent variable.
Fig 4 Line graphs
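Joining measured points with straight lines implies that values between the measurements can be estimated by linear interpolation. The sketch below shows this under the assumption that the variable changes roughly linearly between consecutive readings. The furnace temperature readings and the function name `interpolate` are hypothetical.

```python
def interpolate(points, x):
    """Linearly interpolate y at x from (x, y) points sorted by increasing x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the measured range")


# Hypothetical readings: (time in minutes, furnace temperature in deg C),
# taken at a regular interval of 10 minutes.
readings = [(0, 1100.0), (10, 1140.0), (20, 1180.0), (30, 1200.0)]

estimate = interpolate(readings, 15)
print(estimate)
```

The sketch deliberately refuses to estimate values outside the measured range, since extrapolation beyond the data needs a separate justification.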
Box and whisker chart
It is a graphical representation of numerical data, based on the five-number summary and introduced by Tukey in 1970. The diagram has a scale in one direction only. A rectangular box is drawn, extending from the lower quartile to the upper quartile, with the median shown dividing the box. ‘Whiskers’ are then drawn extending from the end of the box to the greatest and least values. Multiple box charts, arranged side by side, can be used for the comparison of several samples. In refined box charts, the whiskers have a length not exceeding 1.5 times the interquartile range. Any values beyond the ends of the whiskers are shown individually as outliers. Sometimes the values further than 3 times the interquartile range are indicated with a different symbol as extreme outliers.
Box and whisker chart is also known as boxplot. It is specially designed to display dispersion and skewness in a distribution. The figure consists of a ‘box’ in the middle from which two lines (whiskers) extend respectively to the minimum and maximum values of the distribution. The position of the median is also indicated inside the box. A box and whisker chart can be drawn either horizontally or vertically on a graph. One axis is scaled to accommodate the values of the observations while the other has no scale, given that the width of the box is irrelevant. The box and whisker chart is applicable for both discrete and continuous data. It is drawn according to five descriptive values, namely (i) minimum value, (ii) lower quartile, (iii) median, (iv) upper quartile, and (v) maximum value.
The box and whisker chart does not make any assumptions about the underlying statistical distribution, and represents variations in samples of a population. Hence, it is appropriate for representing non-parametric data. The box and whisker chart consists of boxes which represent the inter-quartile range (first quartile to third quartile), the median, and sometimes the mean of the data, and whiskers presented as lines outside of the boxes. Whiskers can be used to present the largest and smallest values in a set of data or only a part of the data (e.g., 95 % of all the data). Data which fall outside the range covered by the whiskers are presented as individual points and are called outliers. The spacing at both ends of the box indicates dispersion in the data. Fig 5 shows box and whisker charts.
Fig 5 Box and whisker charts
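The five values on which the box and whisker chart is based, and the 1.5 times interquartile range rule for outliers, can be computed as sketched below. Several conventions exist for computing quartiles. The sketch uses the ‘inclusive’ method of the Python standard library, which is an assumption and not necessarily the convention used for Fig 5. The data values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # Quartile convention is an assumption: 'inclusive' interpolation.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def outliers(values, k=1.5):
    """Values lying more than k times the interquartile range beyond the box."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical observations

print(five_number_summary(data))
print(outliers(data))         # beyond 1.5 x IQR
print(outliers(data, k=3.0))  # beyond 3 x IQR (extreme outliers)
```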
Frequency of a value of a variable is the number of times it occurs in the given data. Suppose there are data on the monthly number of accidents in a plant for a year. If 4 accidents in a month have occurred 5 times in a year, then the frequency of 4 accidents monthly is 5. Hence, a large mass of data can be compressed by writing the frequency corresponding to each value or range of values taken by the variable. For example, if the variable x takes the values x1, x2, ..., xn, then the frequency of xi is normally denoted by fi.
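The accident example above can be reproduced with a simple count. The monthly figures below are hypothetical and have been chosen so that 4 accidents in a month occur 5 times in the year.

```python
from collections import Counter

# Hypothetical number of accidents recorded in each of the 12 months of a year.
monthly_accidents = [4, 2, 4, 3, 4, 1, 2, 4, 3, 4, 2, 3]

# frequency[x] is the number of months in which exactly x accidents occurred.
frequency = Counter(monthly_accidents)

for x_i in sorted(frequency):
    print(f"x = {x_i}, f = {frequency[x_i]}")
```

The frequencies always add up to the number of observations, here 12 months, which is a quick check on any frequency table.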
There are two types of frequency distribution, namely, simple frequency distribution and grouped frequency distribution. Simple frequency distribution shows the values of the variable individually whereas the grouped frequency distribution shows the values of the variable in groups or intervals. A few useful terms associated with the grouped frequency distribution are class intervals or class, class frequency, cumulative frequency (greater than and lesser than type), class limits (upper and lower), class boundaries (upper and lower), mid-point of class interval, width of a class, relative frequency, and frequency density. These terms are defined below.
Class – When a large number of observations varying in a wide range are available, they are normally classified into several groups according to the size of the values. Each of these groups defined by an interval is called class interval or simply class.
Class frequency – The number of observations falling under each class is called its class frequency or simply frequency.
Class limits – The two numbers used to specify the limits of a class interval for tallying the original observations are called the class limits.
Class boundaries – The extreme values (observations) of a variable, which can ever be included in a class interval, are called class boundaries.
Mid-point of class interval – The value exactly at the middle of a class interval is called class mark or mid-value. It is used as the representative value of the class interval. Hence, mid-point of class interval = (lower class boundary + upper class boundary)/2.
Width of a class – Width of a class is defined as the difference between the upper and lower class boundaries. Hence, width of a class = upper class boundary – lower class boundary.
Relative frequency – The relative frequency of a class is the share of that class in total frequency. Hence, relative frequency = class frequency / total frequency.
Frequency density – Frequency density of a class is its frequency per unit width. Hence, frequency density = class frequency / width of the class.
Cumulative frequency – Cumulative frequency corresponding to a specified value of a variable or a class (in case of grouped frequency distribution) is the number of observations smaller (or greater) than that value or class. The number of observations up to a given value (or class) is called less-than type cumulative frequency distribution, whereas the number of observations greater than a value (or class) is called more-than type cumulative frequency distribution.
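The terms defined above can be illustrated with a small grouped frequency distribution. The class boundaries and class frequencies in this sketch are hypothetical. The more-than type value is taken here to include the class itself, which is one of the conventions in use.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower boundary, upper boundary, class frequency).
classes = [(0, 10, 4), (10, 20, 10), (20, 40, 6)]

total = sum(f for _, _, f in classes)
less_than = list(accumulate(f for _, _, f in classes))

rows = []
for (lower, upper, f), cum in zip(classes, less_than):
    width = upper - lower
    rows.append({
        "mid_point": (lower + upper) / 2,
        "width": width,
        "relative_frequency": f / total,
        "frequency_density": f / width,
        "less_than_cumulative": cum,
        # Assumption: observations in this class and in all higher classes.
        "more_than_cumulative": total - cum + f,
    })

for row in rows:
    print(row)
```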
Histograms are one of the visual ways to represent data. A histogram is a graphical representation (a chart of adjacent bars) of the distribution of data. It is an estimate of the probability distribution of a continuous variable and was first introduced by Karl Pearson. A histogram is a representation of tabulated frequencies, shown as adjacent rectangles, erected over discrete intervals (bins), with an area equal to the frequency of the observations in the interval. The height of a rectangle is hence equal to the frequency density of the interval, i.e., the frequency divided by the width of the interval. The total area of the histogram is equal to the number of observations.
Out of several methods of presenting a frequency distribution graphically, the histogram is the most popular and widely used in practice. A histogram is a set of vertical bars whose areas are proportional to the frequencies of the classes which they represent. While constructing a histogram, the variable is always taken on the x-axis while the frequencies are on the y-axis. Each class is then represented by a distance on the scale which is proportional to its class interval. The distance for each rectangle on the x-axis remains the same if the class intervals are uniform throughout the distribution. If the classes have different class intervals, the distances vary accordingly on the x-axis. With uniform class intervals, the y-axis represents the frequency of each class, which constitutes the height of the rectangle. With unequal class intervals, the height is the frequency density, so that the area remains proportional to the frequency.
A histogram is a compact summary of data. To construct a histogram for continuous data, it is necessary to divide the range of the data into intervals, which are normally called class intervals, cells, or bins. If possible, the bins are to be of equal width to improve the visual information in the histogram. Some judgment is to be used in selecting the number of bins so that a reasonable display can be developed. The number of bins depends on the number of observations and the amount of scatter or dispersion in the data. A histogram which uses either too few or too many bins is not informative. Normally, between 5 and 20 bins are considered satisfactory in most cases, and the number of bins increases with the number of observations. Choosing the number of bins approximately equal to the square root of the number of observations frequently works well in practice.
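The square root guideline can be sketched as a small helper. The function name, the rounding up, and the clamping to the range of 5 bins to 20 bins are assumptions made for illustration, and the result is only a starting point for judgment.

```python
import math

def suggested_bin_count(n_observations, low=5, high=20):
    """Square-root rule, kept within the 5 to 20 range mentioned in the text."""
    if n_observations <= 0:
        raise ValueError("need at least one observation")
    # Assumption: the square root is rounded up to a whole number of bins.
    k = math.ceil(math.sqrt(n_observations))
    return max(low, min(high, k))

print(suggested_bin_count(100))   # square root of 100 is 10
print(suggested_bin_count(16))    # 4 is raised to the lower limit of 5
print(suggested_bin_count(1000))  # 32 is reduced to the upper limit of 20
```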
Once the number of bins and the lower and upper boundary of each bin have been determined, the data are sorted into the bins and a count is made of the number of observations in each bin. To construct the histogram, the horizontal axis is used to represent the measurement scale for the data and the vertical scale to represent the counts, or frequencies. Sometimes the frequencies in each bin are divided by the total number of observations, and then the vertical scale of the histogram represents relative frequencies. Rectangles are drawn over each bin and the height of each rectangle is proportional to frequency (or relative frequency).
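The sorting of data into bins and the counting of observations in each bin can be sketched as below. The data values and the range of 0 to 10 are hypothetical. Each bin is assumed to include its lower boundary and to exclude its upper boundary, except the last bin, which includes both.

```python
def bin_counts(values, lower, upper, n_bins):
    """Count observations in n_bins equal-width bins covering [lower, upper]."""
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not lower <= v <= upper:
            continue  # outside the chosen range; ignored in this sketch
        # Bins are [left, right), except the last bin, which includes `upper`.
        index = min(int((v - lower) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements, for illustration only.
data = [1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1, 9.9, 10.0]
counts = bin_counts(data, lower=0, upper=10, n_bins=5)

# Dividing by the number of observations gives relative frequencies.
relative = [c / len(data) for c in counts]
print(counts)
print(relative)
```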
A histogram is unimodal if there is one hump, bimodal if there are two humps, and multi-modal if there are several humps. A histogram is called skewed if it is not symmetric. If the upper tail is longer than the lower tail then it is positively skewed. If the upper tail is shorter than the lower tail then it is negatively skewed.
A histogram can also be normalized to display relative frequencies. It then shows the proportion of cases which fall into each of several categories, with the total area equalling one. The categories are normally specified as consecutive, non-overlapping intervals of a variable. The categories (intervals) are to be adjacent, and are frequently chosen to be of the same width. The rectangles of a histogram are drawn so that they touch each other to indicate that the original variable is continuous. Histograms are used to plot the density of data, and frequently for density estimation, which is the estimation of the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to one. Fig 6 shows a histogram.
Fig 6 Histogram
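The normalization of a histogram to a total area of one can be checked with a short sketch. The bins and frequencies are hypothetical, and unequal widths are used deliberately to show that the height is the relative frequency divided by the bin width.

```python
# Hypothetical bins: (lower boundary, upper boundary, frequency).
bins = [(0, 10, 4), (10, 20, 10), (20, 40, 6)]
n = sum(f for _, _, f in bins)

# Height = relative frequency / width, so that area = relative frequency.
heights = [f / n / (upper - lower) for lower, upper, f in bins]

total_area = sum(h * (upper - lower)
                 for h, (lower, upper, _) in zip(heights, bins))
print(heights)
print(total_area)
```

The last bin holds more observations than the first, yet its rectangle is lower, since its observations are spread over twice the width.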
Histograms are used to understand the variation pattern in a measured characteristic with reference to location and spread. They give an idea about the setting of a process and its variability. Histograms indicate the ability of the process to meet the requirements as well as the extent of the non-conformance of the process. Different patterns of histogram are shown in Fig 7.
Fig 7 Patterns of histogram
A histogram is clearly distinct from a bar chart. The most striking physical difference between these two diagrams is that, unlike the bar chart, there are no ‘gaps’ between successive rectangles of a histogram. A bar chart is one-dimensional since only the length, and not the width, matters whereas a histogram is two-dimensional since both length and width are important. A histogram is mainly used to display data for continuous variables but can also be adjusted so as to present discrete data by making an appropriate continuity correction. Moreover, a histogram can be quite misleading if the distribution has unequal class intervals and the heights are not adjusted for the class widths.
Advantages of histogram include (i) it is easy to draw and simple to understand, (ii) it helps people to understand the distribution easily and quickly, and (iii) it is more precise than the frequency polygon. Limitations of histogram include (i) it is not possible to plot more than one distribution on the same axes, so that comparison of more than one frequency distribution on the same axes is not possible, and (ii) it is not possible to make it smooth. Uses of histogram are (i) it represents the data in graphic form, and (ii) it provides the knowledge of how the scores in the group are distributed, i.e., whether the scores are piled up at the lower or higher end of the distribution or are evenly and regularly distributed throughout the scale.
Polygon means ‘many-angled’ diagram. This is another way of depicting a frequency distribution graphically. It facilitates comparison of two or more frequency distributions. The frequency polygon is a frequency graph which is drawn by joining the points whose coordinates are the mid-values of the class intervals and their corresponding frequencies. It is used when the data are continuous and very large in number. It is very useful for comparing two different sets of data of the same nature, for example, comparing the performance of two different sections of the same class.
A frequency polygon can be drawn either from the histogram or from the given data directly. The procedure for the construction of a frequency polygon from a histogram is to first draw the histogram of the given data. Then, a dot is put at the mid-point of the top horizontal line of each rectangle and these dots are joined by straight lines. Another way of drawing a frequency polygon is to obtain the mid-values of the class intervals and mark them on the x-axis, mark the frequencies along the y-axis, plot the frequency value corresponding to each mid-point, and connect the plotted points by straight lines. When the polygon is closed at both ends on the x-axis, at the mid-points of the empty classes adjacent to the first and last classes, the area of the histogram left outside the polygon is just equal to the area outside the histogram which is included in the polygon. Hence, for uniform class intervals, the area of the polygon is equal to the area of the histogram. The difference between the histogram and the polygon is that the histogram depicts the frequency of each class separately whereas the polygon does it collectively. The histogram is normally associated with the data of discrete series, while frequency polygon is for continuous series data. Fig 8 shows a frequency polygon.
Fig 8 Frequency polygon
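The construction of a frequency polygon from class mid-values, and the equality of its area with that of the histogram, can be checked with the sketch below. The classes are hypothetical and of equal width. The polygon is assumed to be closed on the x-axis at the mid-points of the empty classes on either side.

```python
# Hypothetical equal-width classes: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 5), (30, 40, 1)]
width = classes[0][1] - classes[0][0]

# Vertices at the class mid-points. Assumption: the polygon is closed with
# zero-frequency points at the mid-points of the adjacent empty classes.
points = [(classes[0][0] - width / 2, 0)]
points += [((lo + hi) / 2, f) for lo, hi, f in classes]
points.append((classes[-1][1] + width / 2, 0))

# Trapezoid rule for the area under the polygon.
polygon_area = sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))
histogram_area = sum((hi - lo) * f for lo, hi, f in classes)
print(points)
print(polygon_area, histogram_area)
```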
The advantages of frequency polygon include (i) it is easy to draw and simple to understand, (ii) it is possible to plot two distributions at a time on the same axes, (iii) comparison of two distributions can be made through frequency polygon, and (iv) it is possible to make it smooth. Limitations of frequency polygon are (i) it is less precise, and (ii) it is not accurate in terms of area of the frequency upon each interval. Uses of frequency polygon are (i) it is used when two or more distributions are to be compared, (ii) it represents the data in graphic form, and (iii) it provides knowledge of how the scores in one or more groups are distributed, i.e., whether the scores are piled up at the lower or higher end of the distribution or are evenly and regularly distributed throughout the scale.
Cumulative frequency curve or ogive
Cumulative frequency is a self-explanatory term. It means that the frequencies of classes are accumulated over the entire distribution. There are two types of cumulative frequencies. The first is the ‘less than’ cumulative frequency of a class. It is the total number of observations, in the entire distribution, which are less than or equal to the upper real limit of the class. The second is the ‘more than’ cumulative frequency of a class. It is the total number of observations, in the entire distribution, which are greater than or equal to the lower real limit of the class.
A cumulative frequency curve or ogive is the curve obtained by plotting the cumulative frequency distribution of a data set on a graph. As there are two types of cumulative frequencies, there are accordingly two ogives, namely (i) ‘less than’ ogive, and (ii) ‘more than’ ogive, for any grouped frequency distribution data. Here, in place of simple frequencies as in the case of frequency polygon, cumulative frequencies are plotted along the y-axis against the class limits of the frequency distribution. For the ‘less than’ ogive, the cumulative frequencies are plotted against the respective upper limits of the class intervals, whereas for the ‘more than’ ogive, the cumulative frequencies are plotted against the respective lower limits of the class intervals. Tab 1 gives an example of ogive or cumulative frequency distribution.
|Tab 1 Ogive or cumulative frequency distribution|
|Less than cumulative frequencies||More than cumulative frequencies|
|Age group (years)||Number of employees||Age group (years)||Number of employees|
|Less than 25||2,100||More than 20||31,600|
|Less than 30||2,100+3,100=5,200||More than 25||31,600-2,100=29,500|
|Less than 35||4,600+5,200=9,800||More than 30||29,500-3,100=26,400|
|Less than 40||5,750+9,800=15,550||More than 35||26,400-4,600=21,800|
|Less than 45||6,550+15,550=22,100||More than 40||21,800-5,750=16,050|
|Less than 50||4,100+22,100=26,200||More than 45||16,050-6,550=9,500|
|Less than 55||3,200+26,200=29,400||More than 50||9,500-4,100=5,400|
|Less than 60||2,200+29,400=31,600||More than 55||5,400-3,200=2,200|
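The running sums of Tab 1 can be reproduced with the sketch below. The class frequencies are those which appear in the additions and subtractions of the table. The variable names are chosen only for illustration.

```python
from itertools import accumulate

# Age classes and class frequencies as read from Tab 1.
lower_limits = [20, 25, 30, 35, 40, 45, 50, 55]
upper_limits = [25, 30, 35, 40, 45, 50, 55, 60]
frequencies = [2100, 3100, 4600, 5750, 6550, 4100, 3200, 2200]

total = sum(frequencies)

# 'Less than' type: running total up to the upper limit of each class.
less_than = list(accumulate(frequencies))

# 'More than' type: observations from the lower limit of each class upwards.
more_than = [total - c + f for c, f in zip(less_than, frequencies)]

for up, lt, lo, mt in zip(upper_limits, less_than, lower_limits, more_than):
    print(f"Less than {up}: {lt:>6,}   More than {lo}: {mt:>6,}")
```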
An interesting feature of the two ogives together is that the x-coordinate of their intersection point gives the median of the frequency distribution. As the shapes of the two ogives suggest, the ‘less than’ ogive is never decreasing and the ‘more than’ ogive is never increasing. Fig 9 gives the graph showing the curves for the ‘less than’ ogive and the ‘more than’ ogive as well as the line graph for comparison.
Fig 9 Cumulative frequency curves and the line graph
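The two ogives intersect where the cumulative frequency equals half of the total number of observations. The sketch below locates that point for the data of Tab 1. It assumes that the plotted points of the ‘less than’ ogive are joined by straight lines, so that the value is found by linear interpolation within the class. The function name is hypothetical.

```python
def median_from_ogive(boundaries, frequencies):
    """Interpolate the x-value where the 'less than' ogive reaches n/2."""
    half = sum(frequencies) / 2
    cumulative = 0
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
        if f > 0 and cumulative + f >= half:
            # Assumption: straight line between the plotted points.
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("frequencies must contain at least one observation")

# Class boundaries and class frequencies as read from Tab 1.
boundaries = [20, 25, 30, 35, 40, 45, 50, 55, 60]
frequencies = [2100, 3100, 4600, 5750, 6550, 4100, 3200, 2200]

median_age = median_from_ogive(boundaries, frequencies)
print(round(median_age, 2))
```

Half of the 31,600 employees is 15,800, which is reached in the class of 40 years to 45 years, and hence the interpolated median lies just above 40 years.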
Cumulative percentage curve – Cumulative percentage is another way of expressing frequency distribution. It expresses the cumulative frequency up to each interval as a percentage of the total, much as relative frequency distribution expresses the frequency of each interval as a percentage. The main advantage of the cumulative percentage over cumulative frequency as a measure of frequency distribution is that it provides an easier way to compare different sets of data. Cumulative frequency and cumulative percentage graphs are exactly the same, with the exception of the vertical axis scale. In fact, it is possible to have the two vertical axes (one for cumulative frequency and another for cumulative percentage) on the same graph. Cumulative percentage is calculated by dividing the cumulative frequency by the total number of observations (n), then multiplying it by 100 (the last value is always equal to 100 %). Hence, cumulative percentage = (cumulative frequency ÷ n) × 100.
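The formula for cumulative percentage can be applied to the class frequencies of Tab 1 as in the following sketch. The variable names are chosen only for illustration.

```python
from itertools import accumulate

# Class frequencies as read from Tab 1.
frequencies = [2100, 3100, 4600, 5750, 6550, 4100, 3200, 2200]
n = sum(frequencies)

# cumulative percentage = (cumulative frequency / n) x 100
cumulative_percentage = [c / n * 100 for c in accumulate(frequencies)]

for value in cumulative_percentage:
    print(f"{value:.2f} %")
```

The list is never decreasing and its last value is 100 %, which is a quick check on the calculation.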