Data quality affects companies' decision making: decisions based on poor-quality data impose high costs on companies. Data quality has several dimensions, of which accuracy is the most important. Data cleaning requires error detection, and because of the huge volume of data, an automatic system is needed to perform this process without user interaction. This paper proposes an error detection approach based on k-means clustering. First, the data are clustered for each attribute. Then, for each data item in each cluster, a method similar to k-nearest neighbor is used to detect errors. The proposed method can detect multiple errors in a single record and can detect errors in fields of various attribute types. Experimental results show that the approach detects 91% of errors in the data on average. The proposed approach is also compared with an automatic rule-based method that detects errors in various attribute types; experimental results show that the proposed approach performs 25% better on average at detecting errors.
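The two-stage idea described above (cluster each attribute, then apply a nearest-neighbor style check inside each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the deterministic centroid initialization, the scoring rule (mean distance to the `n_neighbors` closest values in the same cluster), the `factor` threshold, and the treatment of very small clusters as suspicious are all assumptions made for illustration, and the sketch handles numeric attributes only.

```python
from statistics import mean, median


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's k-means on one numeric attribute; returns a cluster label per value."""
    ordered = sorted(values)
    # Assumed deterministic initialisation: spread starting centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its closest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def neighbour_scores(members, n_neighbors):
    """Mean distance from each value to its n nearest neighbours in the same cluster."""
    scores = []
    for i, v in enumerate(members):
        dists = sorted(abs(v - w) for j, w in enumerate(members) if j != i)
        scores.append(mean(dists[:n_neighbors]))
    return scores


def detect_errors(records, attributes, k=2, n_neighbors=2, factor=3.0):
    """Return a set of (row_index, attribute) pairs flagged as suspected errors."""
    flagged = set()
    for attr in attributes:  # each attribute is clustered independently
        values = [rec[attr] for rec in records]
        labels = kmeans_1d(values, k)
        for c in range(k):
            rows = [i for i, lab in enumerate(labels) if lab == c]
            # Assumption: a cluster too small to supply neighbours is itself suspicious.
            if len(rows) <= n_neighbors:
                flagged.update((i, attr) for i in rows)
                continue
            scores = neighbour_scores([values[i] for i in rows], n_neighbors)
            # Assumption: flag values whose score far exceeds the cluster's typical score.
            limit = factor * median(scores)
            flagged.update((i, attr) for i, s in zip(rows, scores) if s > limit)
    return flagged
```

Because every attribute is checked separately and the result is a set of (row, attribute) pairs, a single record can be flagged in more than one field, which mirrors the multiple-errors-per-record property claimed above. Non-numeric attribute types would need a suitable distance function, which this sketch does not attempt.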