Big Data…a few Outliers = Big Mistakes. A new process for identifying outliers

Maurizio Rosina

Abstract


Searching for and identifying outliers is a fundamental step, generally preliminary to the processing aimed at obtaining consistent results. The new approach devised for identifying outliers in R² relies on geometric and statistical techniques that are largely independent of the type of data distribution. It rests on four methodological pillars: clustering, the convex hull peeling technique, a specific metric, and Chebyshev’s inequality, which holds for any univariate distribution with finite mean and variance. The modularity and generality of the approach, together with an outlier search based on strictly statistical parameters, make it a practical everyday tool for anyone who needs to process bivariate data with the assurance that outliers have been identified beforehand.
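
As a rough illustration of the last pillar only: Chebyshev’s inequality bounds the probability that a value lies k or more standard deviations from the mean by 1/k², whatever the distribution, provided the variance is finite. The Python sketch below turns a tolerated tail probability into a cutoff and flags the values beyond it. It is not the author's procedure. The function names, the sample scores and the choice p_out = 0.1 are hypothetical, and the way the paper combines this step with clustering, hull peeling and its metric is not reproduced here.

```python
import math
import statistics


def chebyshev_k(p_out):
    """Smallest k for which Chebyshev gives P(|X - mean| >= k*std) <= p_out."""
    if not 0 < p_out <= 1:
        raise ValueError("p_out must be in (0, 1]")
    return 1.0 / math.sqrt(p_out)


def chebyshev_outliers(values, p_out=0.1):
    """Flag values outside mean +/- k*std, with k derived from p_out.

    Hypothetical helper: `values` stands for any univariate score,
    for example a distance computed for each bivariate point.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    k = chebyshev_k(p_out)
    lower, upper = mean - k * std, mean + k * std
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


# Made-up scores: nineteen values near 1 and one extreme value.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1, 0.9, 1.0,
          1.2, 0.95, 1.05, 1.15, 0.85, 1.0, 1.1, 0.9, 1.0, 25.0]
lower, upper, flagged = chebyshev_outliers(scores, p_out=0.1)
print(flagged)
```

One caveat of this simple form is that an extreme value inflates the standard deviation used to build the cutoff. With the population standard deviation, a single point can sit at most sqrt(n - 1) standard deviations from the mean, so in small samples a very small p_out can leave the cutoff out of reach.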









Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.