Exploratory data analysis
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA),^{[1]} which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
Overview
Tukey defined data analysis in 1961 as: "[P]rocedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."^{[2]}
Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.
Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles. He favored it because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
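The robustness of the five-number summary described above can be illustrated with a minimal sketch. It assumes Python's standard `statistics` module and its "inclusive" quartile convention, which is only one of several conventions in use (Tukey's own hinges are defined slightly differently). The function name `five_number_summary` and both data sets are invented for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # n=4 asks for the three quartile cut points; "inclusive" interpolates
    # between observed data points (one of several quartile conventions).
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]


# Hypothetical data: a well-behaved sample, and the same sample with its
# largest value replaced by a gross outlier.
clean = [2, 3, 4, 5, 6, 7, 8, 9, 10]
contaminated = clean[:-1] + [1000]

for name, sample in (("clean", clean), ("contaminated", contaminated)):
    print(name, "five-number summary:", five_number_summary(sample))
    print(name, "mean:", statistics.mean(sample),
          "standard deviation:", round(statistics.stdev(sample), 2))
```

In this sketch the single outlier changes only the maximum of the five-number summary, while the quartiles and median stay where they were; the mean and standard deviation both shift substantially.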
Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.^{[3]}
Development
John W. Tukey wrote the book Exploratory Data Analysis in 1977.^{[4]} Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
The objectives of EDA are to:
 Suggest hypotheses about the causes of observed phenomena
 Assess assumptions on which statistical inference will be based
 Support the selection of appropriate statistical tools and techniques
 Provide a basis for further data collection through surveys or experiments^{[5]}
Many EDA techniques have been adopted into data mining, as well as into big data analytics.^{[6]} They are also being taught to young students as a way to introduce them to statistical thinking.^{[7]}
Techniques
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.^{[8]}
Typical graphical techniques used in EDA are:
 Box plot
 Histogram
 Multi-vari chart
 Run chart
 Pareto chart
 Scatter plot
 Stem-and-leaf plot
 Parallel coordinates
 Odds ratio
 Multidimensional scaling
 Targeted projection pursuit
 Principal component analysis
 Multilinear PCA
 Projection methods such as grand tour, guided tour and manual tour
 Interactive versions of these plots
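As a small illustration of one technique from the list above, the following sketch builds a text stem-and-leaf plot. It is a simplified version under stated assumptions: values are non-negative integers, the last digit is the leaf and the remaining digits are the stem. The function name `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a text stem-and-leaf display.

    Assumes non-negative integers: the final digit is the leaf,
    the remaining leading digits form the stem.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


# Hypothetical sample, invented for illustration.
sample = [12, 15, 21, 23, 23, 38, 41, 47, 49, 65]
print("\n".join(stem_and_leaf(sample)))
```

Unlike a histogram, the display keeps every original value readable while still showing the shape of the distribution, including the empty stem between the 40s and the 60s.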
Typical quantitative techniques are:
History
Many EDA ideas can be traced back to earlier authors, for example:
 Francis Galton emphasized order statistics and quantiles.
 Arthur Lyon Bowley used precursors of the stemplot and five-number summary (Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median; see his Elementary Manual of Statistics (3rd edn., 1920), p. 62, where he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions").
 Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).
The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.
Example
Findings from EDA are often orthogonal to the primary analysis task. The following example is described in more detail in Cook and Swayne.^{[9]} The analysis task is to find the variables which best predict the tip that a dining party will give to the waiter. The variables available are tip, total bill, gender, smoking status, time of day, day of the week and size of the party. The analysis task requires that a regression model be fit with either tip or tip rate as the response variable. The fitted model is
tip rate = 0.18 - 0.01 × size
which says that as the size of the dining party increases by one person, the predicted tip rate decreases by one percentage point. Making plots of the data reveals other interesting features not described by this model.
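As a worked reading of the fitted model, the sketch below evaluates it for several party sizes. The coefficients 0.18 and 0.01 are the rounded values quoted in the text; the function names and the $50 bill are hypothetical, chosen only to make the arithmetic concrete.

```python
def predicted_tip_rate(party_size, intercept=0.18, slope=-0.01):
    """Tip rate predicted by the fitted line: intercept + slope * size."""
    return intercept + slope * party_size


def predicted_tip(total_bill, party_size):
    """Predicted tip, in the same currency units as total_bill."""
    return total_bill * predicted_tip_rate(party_size)


if __name__ == "__main__":
    for size in range(1, 7):
        rate = predicted_tip_rate(size)
        print(f"party of {size}: predicted tip rate {rate:.2f}")
    # Hypothetical bill amount, used only to show the calculation.
    print(f"$50.00 bill, party of 2: predicted tip ${predicted_tip(50.0, 2):.2f}")
```

Each additional diner lowers the predicted rate by the same fixed amount, which is all this model can express; it says nothing about the spread of tips around that line.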

Histogram of tips given by customers with bins equal to $1 increments. Distribution of values is skewed right and unimodal, which says that there are few high tips, but lots of low tips.

Histogram of tips given by customers with bins equal to 10c increments. An interesting phenomenon is visible: peaks in the counts at the full and half-dollar amounts. This corresponds to customers rounding tips. This is a behaviour that is common to other types of purchases too, like gasoline.
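The effect of bin width described in the two histogram captions can be sketched as follows. The tip amounts are invented for illustration and are not the data set from the example; amounts are held in integer cents to avoid floating-point binning errors, and the function name `histogram` is hypothetical.

```python
from collections import Counter


def histogram(amounts_cents, bin_width_cents):
    """Count amounts per bin; each bin is keyed by its lower edge in cents."""
    return Counter((a // bin_width_cents) * bin_width_cents
                   for a in amounts_cents)


# Invented tip amounts in cents, for illustration only.
tips = [100, 150, 200, 200, 200, 215, 250, 300, 300, 348, 350, 400, 500, 500]

coarse = histogram(tips, 100)  # $1 bins
fine = histogram(tips, 10)     # 10c bins

for label, counts in (("$1 bins", coarse), ("10c bins", fine)):
    print(label)
    for edge in sorted(counts):
        print(f"  ${edge / 100:.2f}  {'*' * counts[edge]}")
```

With $1 bins only the overall shape is visible; with 10c bins the counts concentrate in the bins starting at whole and half dollars, which is the rounding pattern the finer histogram makes apparent.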

Scatterplot of tips vs bill. We would expect to see a tight positive linear association, but instead see a lot more variation. In particular, there are more points in the lower right than upper left. Points in the lower right correspond to tips that are lower than expected, and it is clear that more customers are cheap rather than generous.

Scatterplot of tips vs bill, shown separately by gender and by smoking status of the party. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and female non-smokers tend to be very consistent tippers (with the exception of three women).
What is learned from the graphics is different from what could be learned by the modeling. One could say that these pictures help the data tell a story: we have discovered some features of tipping that we perhaps did not anticipate in advance.
Software
 GGobi, free software for interactive data visualization.
 CMU-DAP (Carnegie-Mellon University Data Analysis Package, FORTRAN source for EDA tools with English-style command syntax, 1977).
 Graph Commons, a web-based collaborative network mapping, analysis, and publishing platform.
 Data Applied, a comprehensive web-based data visualization and data mining environment.
 High-D for multivariate analysis using parallel coordinates.
 JMP, an EDA package from SAS Institute.
 KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
 Orange, an open-source data mining and machine learning software suite.
 SOCR provides a large number of free online tools.
 TinkerPlots (for upper elementary and middle school students).
 Weka, an open-source data mining package that includes visualisation and EDA tools such as targeted projection pursuit.
See also
 Anscombe's quartet, on importance of exploration
 Data dredging
 Predictive analytics
 Structured data analysis (statistics)
 Configural frequency analysis
 Descriptive statistics
References
 ↑ Chatfield, C. (1995). Problem Solving: A Statistician's Guide (2nd ed.). Chapman and Hall. ISBN 0412606305.
 ↑ Tukey, John (July 1961). The Future of Data Analysis.
 ↑ "Conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler". Statistical Science. 15 (1): 79–94. 2000. doi:10.1214/ss/1009212675.
 ↑ Tukey, John W. (1977). Exploratory Data Analysis. Pearson. ISBN 9780201076165.
 ↑ Behrens (1997). Principles and Procedures of Exploratory Data Analysis. American Psychological Association.
 ↑ "Merging exploratory data analysis with operational data analysis". July 28, 2015.
 ↑ Konold, C. (1999). "Statistics goes to school". Contemporary Psychology. 44 (1): 81–82. doi:10.1037/001949.
 ↑ Tukey, John W. (1980). "We need both exploratory and confirmatory". The American Statistician. 34 (1): 23–25. doi:10.1080/00031305.1980.10482706.
 ↑ Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 9780387717616.
Bibliography
 Andrienko, N & Andrienko, G (2005) Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach. Springer. ISBN 3540259945
 Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 9780387717616.
 Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 0471097764.
 Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 0471097772.
 Inselberg, Alfred (2009). Parallel Coordinates: Visual Multidimensional Geometry and its Applications. London New York: Springer. ISBN 9780387686288.
 Leinhardt, G., Leinhardt, S., Exploratory Data Analysis: New Tools for the Analysis of Empirical Data, Review of Research in Education, Vol. 8, 1980 (1980), pp. 85–157.
 Martinez, W. L.; Martinez, A. R. & Solka, J. (2010). Exploratory Data Analysis with MATLAB, second edition. Chapman & Hall/CRC. ISBN 9781439812204.
 Theus, M., Urbanek, S. (2008), Interactive Graphics for Data Analysis: Principles and Examples, CRC Press, Boca Raton, FL, ISBN 9781584885948
 Tucker, L; MacCallum, R. (1993). Exploratory Factor Analysis.
 Tukey, John Wilder (1977). Exploratory Data Analysis. Addison-Wesley. ISBN 0201076160.
 Velleman, P. F.; Hoaglin, D. C. (1981). Applications, Basics and Computing of Exploratory Data Analysis. ISBN 087150409X.
 Young, F. W., Valero-Mora, P. and Friendly, M. (2006) Visual Statistics: Seeing your data with Dynamic Interactive Graphics. Wiley ISBN 9780471681601
 Jambu M. (1991) Exploratory and Multivariate Data Analysis. Academic Press ISBN 0123800900
 S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986) Graphical Exploratory Data Analysis. Springer ISBN 9781461293712