Using statistics



Using Data

Wikiprogress has a large number of data sets at Wikiprogress Stat which can be used to generate various statistics about the progress of societies. However, before using any data set, it is important to have a good understanding of it: the source of the data, data quality, conditions of use, how the variables were created, and so on. Every data set should have documentation explaining as much about the data set as possible, to allow for the best possible use.

This page briefly discusses a number of these issues, such as data quality, along with some basic ideas about how to use the data sets. The discussion applies to data issues in general.

Data Quality

There are, currently, no universally agreed-upon definitions of data quality.[1] A wide variety of dimensions, categories and approaches are used. This section briefly describes several of the issues relating to data quality.

What is Data Quality?

Data quality was, at first, mainly about statistical accuracy and related concerns. More recently, data quality has expanded to include issues relating to final use by "customers".[2] Currently, many organizations use a framework consisting of the following components: Relevance, Accuracy, Timeliness, Punctuality, Accessibility, Clarity, and Comparability.[3][4][5]

These, and other components, can be organized into three general dimensions of data quality.[6] First there is the process of obtaining the data, including maintaining a description of the data, and security and confidentiality. Second there is the quality of the data itself: e.g., accuracy, completeness, consistency and validity. Finally, there are considerations relating to usage of the data, including accessibility, integrability, interpretability, relevance and timeliness.

The issues relating to the quality of the data itself generally use six components: precision, reliability, validity, integrity, completeness, and timeliness.[7] Precision is whether the data are at a level of detail that allows for effective decisions. Reliability is about whether the data are the same or similar on repeated measurement. Validity refers to whether the data measure what they are intended to measure. Integrity is about preserving the accuracy of the data, and focuses on errors in obtaining, recording or processing the data. Completeness refers to the absence of missing values, and timeliness means the data should be up to date and represent the current situation.

A related methodological issue about data quality is comparability: when data from different time periods, domains, geographies, and so on are compared, are differences due to real differences, or due to different methodologies used to collect or obtain the data?

The issues around user needs can be described with the following components: relevance, the degree to which the data meet the users' needs; timeliness, whether the data become available soon after the phenomena they describe; punctuality, the time between when the data are released and when they should have been released; accessibility, how easy it is to get the data and the conditions under which they can be used; and clarity, whether there is a clear and detailed description of the data.

As can be seen, "data quality" is a rather complex issue, with many different aspects to consider. Thus, documentation about data quality should make clear from the start exactly what is being described, for example, issues about obtaining the data, the quality of the data themselves, or user needs. It should then spell out as much detail as possible about each of the issues covered.

Data Accuracy

One critical issue about data quality is accuracy of the data. Typically, accuracy refers to how close the data presented are to the actual reality. In some areas, like national accounts and labor force statistics, there are generally accepted definitions and methods of data collection, and so the data may be considered "accurate". However, in other areas, like poverty and inequality, there are serious problems in measurement. Further, in some areas, like human-rights conditions and corruption, there are not even any generally agreed-upon definitions, and so measurement and accuracy are even more difficult.[8]

Another source of problems for accuracy is that some economic and social data come from private organizations that have no incentive to provide the data to the public, so there are no data at all for some topics. In addition, these private organizations, and sometimes even governmental agencies, often have incentives to manipulate their data, or to present only certain data, so as to paint the picture they want to present. The "accuracy" of these data would be in doubt.[9]

Completeness, or Missing Data

Another very important aspect of data quality is whether the data set is complete or has missing values. An understanding of the missing data tells how the data may be used, whether they can represent the larger population, and what limitations apply to the analysis.

Basically, data might be missing randomly, or might be missing non-randomly. Suppose, for example, some organization was attempting to collect data about every single country in the world. Suppose further that on the day it was to collect the data, its contact people for some of the countries happened to be on vacation, or some of the countries experienced a brief power outage in their communication systems. If all of these types of events happened randomly, then the data are missing randomly.
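The difference between these two situations can be illustrated with a small sketch. The numbers below are invented purely for illustration; the point is only how randomly versus non-randomly missing values affect an estimate such as a mean.

```python
# Illustration (with invented numbers) of how randomly vs. non-randomly
# missing data affect an estimate such as a mean.

# Hypothetical "true" values for 100 countries: 1, 2, ..., 100.
population = list(range(1, 101))
true_mean = sum(population) / len(population)  # 50.5

# Missing randomly: every 5th country happens not to report.
# The remaining countries still look much like the whole population.
mcar_observed = [v for i, v in enumerate(population) if (i + 1) % 5 != 0]
mcar_mean = sum(mcar_observed) / len(mcar_observed)  # 50.0, close to 50.5

# Missing NOT randomly: the 30 countries with the lowest values
# fail to report (e.g., the least developed countries).
mnar_observed = [v for v in population if v > 30]
mnar_mean = sum(mnar_observed) / len(mnar_observed)  # 65.5, clearly biased upward
```

With random missingness the observed mean stays close to the true mean; with non-random missingness it is pulled away from it, which is exactly why the distinction matters for the conclusions one can draw.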

However, as may be guessed, most of the time data are not missing randomly. It may be, for instance, that some particular group of countries is less likely to contribute data.

The distinction between data missing randomly and not randomly is important. If data are missing randomly, then one could make the case that the data collected represent all of the countries of the world. So, one could write a report and conclude something like "This report describes the state of affairs about freedom (or health or some other condition) in the world today."

On the other hand, if the data are not missing randomly, then it is much more difficult to apply the results of analysis to the world as a whole. Reports should not say "The state of the world is ....". 

The more appropriate statements would be like "This report describes the state of affairs among countries with data about freedom (or health or some other condition)." Further, it would be appropriate to describe the group of countries that had data and the group of countries that did not have data, and to indicate whether those countries that did have data were different, somehow, from all the other countries.
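One simple way to carry out such a comparison is to use an auxiliary variable that is available for all countries and compare it between the countries that reported data and those that did not. The sketch below uses invented country names and hypothetical GDP per capita figures purely for illustration:

```python
# Sketch (with invented data) of comparing countries that reported a value
# with countries that did not, using an auxiliary variable (hypothetical
# GDP per capita) that is available for every country.

countries = {
    # name: (gdp_per_capita, reported_score_or_None)
    "A": (40000, 7.1),
    "B": (35000, 6.8),
    "C": (30000, 6.5),
    "D": (2000, None),   # did not report
    "E": (1500, None),   # did not report
    "F": (25000, 6.0),
}

reported = [gdp for gdp, score in countries.values() if score is not None]
missing = [gdp for gdp, score in countries.values() if score is None]

mean_reported = sum(reported) / len(reported)  # 32500
mean_missing = sum(missing) / len(missing)     # 1750

# A large gap like this suggests the data are probably NOT missing at
# random: poorer countries fail to report, so conclusions should be
# limited to countries like those that did report.
```

In a real analysis one would compare several characteristics (income, region, population size, and so on), but the logic is the same: if the two groups differ systematically, the missingness is unlikely to be random.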

There are a variety of ways of dealing with missing data, depending on whether the data are missing randomly or not.  

The easiest way to handle missing data is to restrict analysis to only those countries with complete data. However, if any of the data are not missing at random, then the set of countries that do have complete data could be biased, and may not represent the world.

Another approach, as described above, is to examine the characteristics of the countries with missing data and those with complete data and see whether the two groups differ. If, for example, the least developed countries more often have missing data, then analysis and reports should only draw conclusions about the world excluding the least developed countries. When the data are not missing at random, as in the example just given, but the researcher wants to generalize to all countries, then the missingness itself has to be modeled.[10] That is, one has to develop a model that accounts for why that particular group of countries has missing data, and then use that model in estimating the missing values.

Finally, when the data are missing randomly, there are methods of estimating what the missing data might be, such as maximum likelihood and multiple imputation. Multiple imputation creates several copies of the data set, filling in the missing values differently in each copy; each completed data set is then analyzed separately, and the results of the multiple analyses are combined.[10][11] Maximum likelihood uses the single data set[11] and has the basic goal of identifying the population parameter values that are most likely to have produced a particular sample of data.[12] So, for example: what is the most likely mean for all countries, given that we have data for some countries (our sample)? All of the methods mentioned above for estimating missing data are highly complex, but they are available in some statistical packages.
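The two simplest approaches can be sketched in a few lines. This is only a toy illustration with invented numbers; real multiple-imputation and maximum-likelihood procedures are far more sophisticated than the single-value fill shown here.

```python
# Minimal sketch (invented numbers) of two simple ways to handle
# missing values, marked here as None, in a single variable.

data = [12.0, 15.0, None, 14.0, None, 16.0, 13.0]

# 1. Complete-case analysis: drop the missing values entirely.
#    Unbiased only if the data are missing randomly.
complete = [x for x in data if x is not None]
complete_mean = sum(complete) / len(complete)  # 14.0

# 2. Mean imputation: fill each gap with the observed mean.
#    This keeps the mean unchanged but understates variability,
#    which is one reason multiple imputation is preferred in practice.
imputed = [x if x is not None else complete_mean for x in data]
imputed_mean = sum(imputed) / len(imputed)  # also 14.0
```

Multiple imputation would instead fill the gaps several times with different plausible values, analyze each completed data set, and pool the results, so that the extra uncertainty from the missing values is reflected in the final estimates.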

Data Quality - What To Do

One review indicates there are serious concerns about the use of data without regard for quality.[8] For example, as mentioned previously, variables like national accounts and labor force statistics have fairly standard guidelines on data collection and measurement, but still present some difficulties of comparability in definition and measurement across countries. Even more serious problems arise for poverty and inequality, and for variables such as human rights and corruption there are not even universally accepted definitions. Nonetheless, indicators for these concepts are often used, and users may not pay attention to, or may not discuss, the problems with the indicators they are using.

On the other hand, there are arguments for using such data.[8] First, these variables, even with their problems, can be useful for giving background to research or reports on other topics. Second, the data that are available may simply be the best data there are, and even these data can be useful for rough estimates.

Finally, when using data, there are various possible steps to address data quality.[8] These steps include: (1) using information about the quality of the data in statistical analysis, for example when calculating confidence intervals; and (2) carefully examining and documenting the quality of the data, in terms of how well it compares with other data, accuracy, reliability, missingness, and so on.

Thus, users of the data and readers of the results of analysis should always be given a good idea of the quality of the data and the kinds of conclusions that can be drawn. Research using high-quality data can draw much more precise conclusions than research using lower-quality data. Research using lower-quality data can still present conclusions, but they should be of the form "in general, the relationship between X and Y is fairly high (or low, or whatever)." It should not conclude something like "the relationship between X and Y was 0.57, which was abc points higher than the relationship between X and Z." That is, general and rough conclusions are acceptable, but very specific statistical statements are probably not a good idea.

How To Use Data

This section has a few brief pointers on using data.

Countries as Units of Analysis

Most international data sets use the country as the unit of analysis. That is, the data sets list, say, income per capita or the infant mortality rate (IMR) for each country. However, this assumes that everyone in the country experiences the same conditions, which is, obviously, not always true. For example, in the United States, one state, Mississippi, has an IMR more than twice as high as another state, California.

Table 3
Infant Mortality Rate by State, 2006[13]

State         Infant Mortality Rate
California     5.0
Mississippi   10.6

As table 3 shows, the IMR is not the same for every group within the country. The same would be true for other countries and other variables. Different groups within each country are likely to have different levels of IMR, income per capita, educational level and so on, so the IMR, income per capita, educational level, etc. of a "country" is a general indicator, not necessarily representative of every group within each country. 

Thus, analyses using multiple countries might want to include information about how the data are to be interpreted, noting this unit-of-analysis problem. Drawing conclusions about individuals based on analysis at the group level is typically called the "ecological fallacy".[14]

Rates and Percents

One typical problem in using countries as the unit of analysis is in the use of rates and percents. Many data sets present information by using rates or percents. For example, UNICEF presents infant mortality rate (IMR) as infant mortality per 1,000 live births.[15]  However, raw data are needed in order to calculate averages per region or per group of countries.

So, for example, the IMRs for several countries are shown below.

Table 1
Infant Mortality Rate

Country       Infant Mortality Rate
China          17.45
Afghanistan   154.8
Aruba          14.6

Averaging the rates in the table above gives an average IMR of 62.3 among these countries. However, we can see in the table below that this is not the correct group average.

Table 2
Country       Infant Mortality Rate   Number of Births   Number of Infant Deaths
China          17.45
Afghanistan   154.8
Aruba          14.6
Average        26.2                   5,576,374          146,195

In table 2, the average IMR is calculated using the average number of infant deaths and the average number of births: 146,195/5,576,374 = .0262. This means .0262 infant deaths per birth, or 26.2 infant deaths per 1,000 live births (the usual way the infant mortality rate is reported).

In table 1, the average IMR is wrong because each country is counted equally in averaging the IMR. That is, it is assumed that each country has the same number of births. So China, which actually had over 15 million births, counts the same as Aruba, which actually had just over 1,000 births. On the other hand, in table 2, China counts the most, Afghanistan the second most, and Aruba counts very little. They are counted according to the number of actual births in their country. 

As can be seen, the raw data for number of births and infant deaths are needed in order to correctly calculate the IMR. The same is true for calculations using any other rates and percents.
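The difference between the two calculations can be sketched as follows. The country names and counts here are invented for illustration; the point is that averaging the rates directly treats every country as the same size, while pooling the raw counts weights each country by its actual number of births.

```python
# Sketch (invented numbers) of why raw counts are needed to average
# rates such as the IMR correctly across countries.

# name: (number_of_births, number_of_infant_deaths)
countries = {
    "Big":   (1_000_000, 20_000),  # IMR 20 per 1,000 births
    "Mid":   (100_000, 10_000),    # IMR 100 per 1,000 births
    "Small": (10_000, 100),        # IMR 10 per 1,000 births
}

rates = {name: deaths / births * 1000
         for name, (births, deaths) in countries.items()}

# Wrong: averaging the rates counts every country equally.
unweighted = sum(rates.values()) / len(rates)  # (20 + 100 + 10) / 3 = 43.3

# Right: pool the raw counts, then compute a single rate.
total_births = sum(b for b, d in countries.values())
total_deaths = sum(d for b, d in countries.values())
pooled = total_deaths / total_births * 1000  # 30,100 / 1,110,000 * 1000, about 27.1
```

The small high-rate country pulls the unweighted average far above the pooled rate, which is the same distortion shown in tables 1 and 2 above.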


  1. MIT Data Total Quality Management Program, Definition of data quality. Accessed 12/25/2010,
  2. Jeffrey Gonzalez, Catherine Hackett, Nhien To, and Lucilla Tan. Definition of Data Quality for the Consumer Expenditure Survey: A Proposal. Final Report. Submitted: 2009.10.22. US Bureau of Labor Statistics. Retrieved 12/26/2010 from
  3. Statistical Data Quality in the UNECE. 2010 Version. Steven Vale, Quality Manager, Statistical Division. United Nations. Retrieved 13 December 2010 from
  4. Eurostat Quality Assurance Framework. Retrieved 12/25/2010 from
  5. Quality Framework and Guidelines for OECD Statistical Activities, Version 2003/1. Retrieved 12/25/2010 from,3343,en_2649_33715_21571947_1_1_1_1,00.html
  6. Alan F. Karr, Ashish P. Sanil, and David L. Banks. 2005. Data Quality: A Statistical Perspective. Technical Report 151. Research Triangle Park, NC: National Institute of Statistical Sciences. Retrieved 16 December 2010 from
  7. Centers for Disease Control and Prevention. 2009 Quality Assurance Standards for HIV Counseling, Testing, and Referral Data. Atlanta, GA: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2009: Accessed 12/25/2010 from Specifically see the section "What is data quality"
  8. 8.0 8.1 8.2 8.3 Data quality and comparability. Macro Data Guide. Norwegian Social Science Data Services. Retrieved 1/8/2011 from
  9. Sources of error based on: Philipp Bagus. The Problem of Accuracy of Economic Data. Mises Daily: Thursday, August 17, 2006. Retrieved 1/2/2011 from
  10. 10.0 10.1 David C. Howell, Last revised 3/7/2009, Treatment of Missing Data, Retrieved from, 12 December 2010. University of Vermont.
  11. 11.0 11.1 Karen Grace-Martin. Recommended Solutions to Missing Data. StatNews #52. Cornell University Statistical Consulting Unit. May 2002. Retrieved 12/26/2010 from
  12. Craig K. Enders. A Primer on the Use of Modern Missing-Data Methods in Psychosomatic Medicine Research.Psychosomatic Medicine 68:427-436 (2006). Retrieved 12/26/2010 from
  13. The 2011 Statistical Abstract: State Rankings. Infant Mortality Rate, 2006. US Census Bureau. Retrieved 16 January 2011 from
  14. Faculty Development and Instructional Design Center, Northern Illinois University. Responsible Authorship Quick Guide. See the section on "Detecting Common Mistakes and Considering Dilemmas in Responsible Authorship". One section is Aggregation Bias and Ecological Fallacy
  15. Infant mortality rate. ChildInfo: Statistics by Area / Child Survival and Health. UNICEF, last update 2009. Data available at

Further Reading

About data quality

  • Data quality and comparability From the Macro Data Guide, Norwegian Social Science Data Services. Discussion of data quality, use of low quality data, other related issues.

