One of the major goals for our databases is to have and manage data quality - data that is accurate, consistent and available. Data quality can:
1. minimize IT project risk
2. assist in making timely business decisions
3. ensure regulatory compliance
4. expand the customer base
Characteristics of quality data (according to Loshin (2006) and Russom (2006)) include:
1. uniqueness - each piece of data should exist only ONCE in the database
2. accuracy - data must be both accurate and precise enough for its intended use
3. consistency - values for data in one data set are in agreement with the values for related data in another data set
4. completeness - all data that MUST have a value (such as a key) does have an assigned value
5. timeliness - meeting the expectation for the time between when data are expected and when they are readily available for use
6. currency - data is recent enough to be useful
7. conformance - data is stored, exchanged and presented in a format specified by its metadata
8. referential integrity - data that refers to other data actually exists, and relationships respect their specified cardinalities
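To make a few of these characteristics concrete, here is a minimal sketch (not from the notes) of how uniqueness, completeness, conformance and referential integrity could be checked over in-memory records in Python. The customer/order records and field names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample data: customers and the orders that reference them.
customers = [
    {"cust_id": 1, "name": "Alice", "zip": "30301"},
    {"cust_id": 2, "name": "Bob",   "zip": "3030"},   # conformance problem: bad zip format
    {"cust_id": 2, "name": "Bobby", "zip": "10001"},  # uniqueness problem: duplicate key
    {"cust_id": 4, "name": None,    "zip": "60601"},  # completeness problem: missing name
]
orders = [
    {"order_id": 100, "cust_id": 1},
    {"order_id": 101, "cust_id": 9},                  # referential integrity problem
]

# Uniqueness: each cust_id should exist only ONCE.
dup_keys = [k for k, n in Counter(c["cust_id"] for c in customers).items() if n > 1]

# Completeness: fields that MUST have a value actually do.
incomplete = [c["cust_id"] for c in customers if not c["name"]]

# Conformance: zip codes stored in the 5-digit format the metadata specifies.
nonconforming = [c["cust_id"] for c in customers
                 if not re.fullmatch(r"\d{5}", c["zip"] or "")]

# Referential integrity: every order must reference an existing customer.
known_ids = {c["cust_id"] for c in customers}
orphans = [o["order_id"] for o in orders if o["cust_id"] not in known_ids]

print("duplicate keys:", dup_keys)        # [2]
print("missing names:", incomplete)       # [4]
print("bad zip format:", nonconforming)   # [2]
print("orphan orders:", orphans)          # [101]
```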
Unfortunately, much of the data in databases is not of high quality. Some studies have shown:
1. up to 70 percent of data warehousing projects fail because users reject the data as unreliable
2. A 2001 survey by the Data Warehouse Institute determined that only one in four US companies has implemented data quality initiatives
3. It is estimated that bad data can cost a business as much as 10-20 percent of its total operating budget through lost revenue
4. Experts say 2 percent of records in a customer file become obsolete in one month because of death, divorce, marriage or moves
Problems include:
1. Lack of control over data quality
2. Proliferation of databases with uncontrolled redundancy and metadata
3. Poor data capture controls
4. Companies don't recognize poor data quality as an organizational issue
Steps in Data Quality Improvement include:
1. conduct a data quality audit
2. improve data capture processes
3. establish a data stewardship program (subject matter experts for particular databases)
4. Apply TQM (total quality management) principles and practices
5. apply modern data management technologies
6. estimate return on investment
7. start with a high-quality data model
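As a concrete illustration of step 1 (the data quality audit), the sketch below profiles a table column by column, counting nulls, distinct values and the most common values - the kind of summary an audit typically starts from. The table and column names are hypothetical, not from the notes.

```python
from collections import Counter

def profile(rows, columns):
    """Very small column profiler for a data quality audit (sketch only)."""
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "rows": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report

# Hypothetical customer rows being audited.
rows = [
    {"cust_id": 1, "state": "GA", "email": "a@example.com"},
    {"cust_id": 2, "state": "ga", "email": ""},            # inconsistent case, missing email
    {"cust_id": 3, "state": "GA", "email": "c@example.com"},
]
for col, stats in profile(rows, ["cust_id", "state", "email"]).items():
    print(col, stats)
```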
Data being put into a data warehouse should go through a three-step process: extract-transform-load (ETL). This provides a single, authoritative source of data that should adhere to good data quality standards. The data will be:
1. detailed - rather than summarized
2. historical - periodic (point in time)
3. normalized
4. comprehensive - conforms to the enterprise model
5. timely - current enough for decision making
6. quality controlled
The ETL process prepares data to go into the data warehouse.
Extract captures the relevant data from files and databases - this is a subset of the data. A static extract captures a snapshot of the required source data at a point in time. An incremental extract captures only the changes that have occurred in the source data since the last capture.
Cleanse identifies erroneous data and corrects the errors. Errors include:
1. misspelled names, addresses, etc.
2. impossible or erroneous dates of birth
3. fields used for purposes for which they were never intended
4. mismatched addresses and area codes
5. missing data
6. duplicate data
7. inconsistencies
8. different primary keys across sources
Data scrubbing (also called data cleansing) is a technique that uses pattern recognition and other artificial intelligence techniques to upgrade the quality of raw data before transforming and moving the data to the data warehouse.
Load (and index) loads the selected, cleansed data into the target data warehouse and creates the necessary indexes.
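To tie the three steps together, here is a minimal, hypothetical ETL sketch in Python: an incremental extract filtered by a last-capture timestamp, a cleanse pass that normalizes names and drops duplicate keys, and a load step that builds a simple index on the target. The field names and data are made up; real ETL would run against actual source systems and a warehouse, not in-memory lists.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
source = [
    {"cust_id": 1, "name": " alice ", "modified": datetime(2024, 1, 5)},
    {"cust_id": 1, "name": "Alice",   "modified": datetime(2024, 1, 5)},  # duplicate data
    {"cust_id": 2, "name": "Bob",     "modified": datetime(2023, 12, 1)}, # unchanged since last capture
]

def extract_incremental(rows, last_capture):
    """Incremental extract: only rows changed since the last capture."""
    return [r for r in rows if r["modified"] > last_capture]

def cleanse(rows):
    """Cleanse: trim/normalize names and drop duplicate keys."""
    seen, clean = set(), []
    for r in rows:
        if r["cust_id"] in seen:
            continue                      # skip duplicate data
        seen.add(r["cust_id"])
        clean.append({**r, "name": r["name"].strip().title()})
    return clean

def load(rows):
    """Load (and index): write to the target and build an index on the key."""
    warehouse = list(rows)
    index = {r["cust_id"]: i for i, r in enumerate(warehouse)}
    return warehouse, index

changed = extract_incremental(source, last_capture=datetime(2024, 1, 1))
warehouse, index = load(cleanse(changed))
print(warehouse)   # one cleansed row for cust_id 1
print(index)       # {1: 0}
```

A static extract would simply take all the source rows at a point in time instead of filtering on the last-capture timestamp.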