In the ancient Indian parable of the elephant, six blind men touch an elephant and report six very different views of the same animal. Compare this scenario to a data warehouse that gets data from six different sources. “Harry Potter and the Sorcerer’s Stone” as a field in a database can be written as “HP and the Sorcerer’s Stone”, as “Harry Potter I”, or simply as “Sorcerer’s Stone”. In the data warehouse these are four separate movie titles. For a Harry Potter fan, they are the same movie. Now increase the number of movies to cover the entire Harry Potter series and add fifty languages. You now have a set of titles that may perplex even a real Harry Potter aficionado.

What does this have to do with data analytics? In our conversations with Information Management professionals, one common stumbling block to effective data analytics stands out: data quality. The root cause of this problem is that different data sources describe the same data field in different ways. When we start collecting data from those sources into a data warehouse, we often get multiple names for the same data field. In a typical Enterprise Data Warehouse (EDW) it is not uncommon to find the same customer referred to by five different names within the system. This introduces significant errors into data analytics and into the business decisions that depend on it. Enterprises that spend large amounts of money to create a business analytics solution are often tripped up by poor data quality, which undermines the investment in business intelligence infrastructure.
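To make the problem concrete, here is a minimal sketch in Python of how four spellings of one title inflate a count, and how mapping them to a single canonical name fixes it. The alias table `CANONICAL_TITLES` and the helpers `normalize` and `canonical_title` are hypothetical names made up for this illustration. A real warehouse would more likely rely on a maintained reference table or fuzzy matching than on a hand-written dictionary.

```python
from collections import Counter

CANONICAL = "Harry Potter and the Sorcerer's Stone"

# Hypothetical alias table: every known variant maps to one canonical title.
CANONICAL_TITLES = {
    "harry potter and the sorcerer's stone": CANONICAL,
    "hp and the sorcerer's stone": CANONICAL,
    "harry potter i": CANONICAL,
    "sorcerer's stone": CANONICAL,
}


def normalize(raw):
    """Straighten curly apostrophes, lowercase, and collapse whitespace."""
    return " ".join(raw.replace("\u2019", "'").lower().split())


def canonical_title(raw):
    """Return the canonical title, or the trimmed input if it is unknown."""
    return CANONICAL_TITLES.get(normalize(raw), raw.strip())


# The same movie as it might arrive from four different sources.
records = [
    "Harry Potter and the Sorcerer's Stone",
    "HP and the Sorcerer's Stone",
    "Harry Potter I",
    "Sorcerer's Stone",
]

raw_counts = Counter(records)
clean_counts = Counter(canonical_title(r) for r in records)

print(len(raw_counts), "distinct titles before cleaning")
print(len(clean_counts), "distinct title after cleaning")
```

Titles that are not in the alias table pass through unchanged, so unknown variants stay visible for review rather than being silently merged.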

