Outlier Detection: Do's and Don'ts


Examining data for outliers is a common first step in data analysis. There appears, however, to be a great deal of confusion and misinformation about the appropriate method for detecting them. One commonly cited procedure is the 2 standard deviation rule, which treats any value more than two standard deviations from the mean as an outlier. In the video below, I demonstrate that this procedure is invalid.
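To make the 2 standard deviation rule concrete, here is a minimal Python sketch of it. The function name `two_sd_flags` and the data are hypothetical, chosen only for illustration. The example shows one well-known weakness of the rule, which is not necessarily the argument made in the video: an extreme value inflates the very mean and standard deviation used to judge it, so in a small sample it can escape detection entirely.

```python
import statistics

def two_sd_flags(values, k=2.0):
    """Return the values lying more than k sample SDs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical data: four ordinary scores and one obviously extreme score.
scores = [1, 2, 3, 4, 100]

# The extreme score pulls the mean up to 22 and the SD up to about 43.6,
# so 100 sits only about 1.79 SDs from the mean and is not flagged.
flagged = two_sd_flags(scores)
print(flagged)
```

With only five observations, no value can lie more than (n - 1) / sqrt(n), or about 1.79, sample standard deviations from the mean, so the rule cannot flag anything in a sample this small, however extreme a value is.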


A more valid approach to detecting outliers is the 'outlier labeling rule', which flags any value that falls more than 1.5 times the Interquartile Range (IQR) below the first quartile or above the third quartile. In the video below, and drawing on some published simulation research, I show that the outlier labeling rule is probably more valid when using 2.2 as the multiplier, rather than 1.5. It's important to keep in mind that these rules are all applicable only to data that are normally distributed. More sophisticated approaches must be used for data that are non-normally distributed. I hope to prepare another post and video discussing this issue.
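The outlier labeling rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function name `outlier_labeling_rule` and the data are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile definitions (software packages differ, and Tukey's original hinges can give slightly different cut-offs).

```python
import statistics

def outlier_labeling_rule(values, multiplier=2.2):
    """Flag values outside [Q1 - g * IQR, Q3 + g * IQR], where g is the multiplier."""
    # Assumption: "inclusive" quartiles; other definitions shift the fences slightly.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

# Hypothetical data: Q1 = 12, Q3 = 15.5, so IQR = 3.5.
scores = [10, 11, 12, 12, 13, 13, 14, 15, 16, 22, 40]

flagged_15, fences_15 = outlier_labeling_rule(scores, multiplier=1.5)
flagged_22, fences_22 = outlier_labeling_rule(scores, multiplier=2.2)

print(flagged_15)  # 1.5 multiplier: upper fence 20.75, so 22 and 40 are flagged
print(flagged_22)  # 2.2 multiplier: upper fence about 23.2, so only 40 is flagged
```

On these made-up scores the two multipliers disagree about the value 22, which is the kind of borderline case where the choice between 1.5 and 2.2 matters.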



Youtube Link:

Some Key References:

Tukey, J.W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
      *** First published demonstration of the IQR multiplier approach to detecting outliers.

Hoaglin, D.C., Iglewicz, B., and Tukey, J.W. (1986). Performance of some resistant rules for outlier labeling. Journal of the American Statistical Association, 81, 991-999.
      *** First article to use the rubric 'outlier labeling rule'.

Hoaglin, D.C., and Iglewicz, B. (1987). Fine-tuning some resistant rules for outlier labeling. Journal of the American Statistical Association, 82, 1147-1149.
      *** Demonstrated that the 1.5 multiplier was inaccurate approximately 50% of the time; suggested that 2.2 is probably more valid in many applied cases.