…covering everything which includes the word "data" or "information"
Data quality is no longer optional, since businesses rely on accurate information for purposes such as running their transactional systems (where multiple systems source data from the same system), analysis and regulatory reporting. Many tools can help cleanse the data and match and merge it to remove duplicates, which in turn reduces storage space and data anomalies. Matching and merging capabilities are bundled into either data quality tools or MDM products. Although these tools try to distinguish themselves from each other in various ways and vary in capability, the three main aspects on which they compete are
Although many algorithms have been developed, they can be classified into the following categories: deterministic, probabilistic, machine learning and hybrid.
The following sections will explain each algorithm in more detail.
Deterministic algorithms
This is based on a predictable comparison technique where each attribute in one record is compared with the corresponding attribute in another record. The technique is very simple and performs very quickly, although it requires the data to be cleansed and standardised beforehand.
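As a minimal sketch of the deterministic approach described above, the following Python compares two records attribute by attribute. The `CustomerRecord` fields and the sample values are assumptions made for illustration and are not taken from any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CustomerRecord:
    # Hypothetical attributes, chosen only for this example.
    first_name: str
    last_name: str
    postcode: str


def deterministic_match(a: CustomerRecord, b: CustomerRecord) -> bool:
    """Return True only if every attribute of a equals the same attribute of b."""
    return all(
        getattr(a, f.name) == getattr(b, f.name) for f in fields(CustomerRecord)
    )


record_a = CustomerRecord("JOHN", "SMITH", "EC3R 8AH")
record_b = CustomerRecord("JOHN", "SMITH", "EC3R 8AH")
# Same person, but the data has not been standardised.
record_c = CustomerRecord("John", "Smith", "ec3r8ah")

print(deterministic_match(record_a, record_b))
print(deterministic_match(record_a, record_c))
```

The third record describes the same customer but differs in case and spacing, so the exact comparison rejects it. This is why deterministic matching depends on the data being cleansed and standardised first.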
Probabilistic algorithms
This type of algorithm matches and merges similar-looking records based on probability theory. Over time the algorithm develops and improves its capability: records which initially had to be matched manually, or corrected manually in the case of false positives, no longer need manual intervention thanks to its self-learning capability.
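The description above does not name a specific method. One well-known probabilistic approach is Fellegi-Sunter-style scoring, sketched below as an assumption rather than as what any given tool does. Each field contributes a weight based on how likely it is to agree for true matches (m) versus non-matches (u). The m and u values and the two thresholds are invented for illustration; real tools estimate them from data, and the self-learning step is not shown.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "first_name": (0.90, 0.05),
    "last_name": (0.95, 0.01),
    "postcode": (0.85, 0.001),
}


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum the agreement/disagreement weights over all configured fields."""
    total = 0.0
    for name, (m, u) in FIELD_PROBS.items():
        if rec_a[name] == rec_b[name]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight: float, upper: float = 10.0, lower: float = 0.0) -> str:
    """Map a weight to match / non-match, leaving the middle band for review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"


a = {"first_name": "JOHN", "last_name": "SMITH", "postcode": "EC3R 8AH"}
b = {"first_name": "JON", "last_name": "SMITH", "postcode": "EC3R 8AH"}
c = {"first_name": "MARY", "last_name": "JONES", "postcode": "B1 1AA"}

print(classify(match_weight(a, b)))
print(classify(match_weight(a, c)))
```

Records that fall between the two thresholds are the ones a data steward would inspect manually.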
Machine learning algorithms
Using this technique, records are matched and merged based on machine learning and artificial intelligence. It compares records in the way a human being would and improves its capabilities over time.
Hybrid algorithms
As the name suggests, these algorithms are hybrids, i.e. combinations of the above algorithms plus others such as matching based on Soundex, phonetics, fuzzy logic etc.
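To illustrate the hybrid idea, here is a small sketch that combines an exact (deterministic) check on one attribute with a fuzzy comparison on another. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a fuzzy matcher; the field names and the 0.85 threshold are assumptions for this example only.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 after trivial standardisation."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def hybrid_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Exact match on postcode AND fuzzy match on name."""
    same_postcode = rec_a["postcode"] == rec_b["postcode"]
    similar_name = name_similarity(rec_a["name"], rec_b["name"]) >= threshold
    return same_postcode and similar_name


john = {"name": "John Smith", "postcode": "EC3R 8AH"}
jon = {"name": "Jon Smith", "postcode": "EC3R 8AH"}
jon_elsewhere = {"name": "Jon Smith", "postcode": "B1 1AA"}

print(hybrid_match(john, jon))
print(hybrid_match(john, jon_elsewhere))
```

A phonetic algorithm such as Soundex could replace or complement the fuzzy comparison in the same structure.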
It is unlikely that a single algorithm will meet all of an organisation’s requirements, so a mix of these algorithms will be used, and one tool will be stronger in one type of algorithm than another tool. When selecting the appropriate tool, organisations should compare tools based on some of the following criteria:
Speed of matching: How quickly does the tool match and merge a given set of records? It is worth thinking ahead, since an organisation’s data volumes are likely to increase significantly in the future.
Accuracy of matching: Although the majority of matching should be done automatically, it would be good if the tool allowed a semi-matching approach, i.e. the user should be able to intervene and inspect and verify the matches recommended by the tool before the merging is done.
Consider organisation’s custom business rules: Often organisations have their own business rules which should take precedence over any of the “standard” matching techniques. Again, a semi-matching technique would be a good way to give custom business rules a higher priority during matching.
Uniqueness and key generation: The tool should be able to identify unique sets of records and cluster these into so-called match groups. Each match group, which identifies a unique customer, should be allocated a globally unique identifier by the tool.
Flexibility: The matching and merging tool should be flexible enough to make changes to the logic according to changing business requirements and needs.
Easy to implement and administer: The matching and merging tool should make it easy to add new matching rules or change existing rules. The tool itself should be able to integrate within the existing infrastructure without requiring any special equipment like a supercomputer. Similarly, it should integrate with existing software and hardware.
Scalability: The tool should be able to provide consistent performance despite new data sources, additional concurrent users, new business and matching rules.
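The "uniqueness and key generation" criterion above can be sketched as follows. Given pairs of records that a matching step has already linked, the records are clustered into match groups (here with a simple union-find) and each group receives a UUID. The record identifiers are hypothetical, and a UUID is only one possible form of globally unique identifier.

```python
import uuid


def build_match_groups(record_ids, matched_pairs):
    """Cluster linked records and return {record_id: match_group_id}."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    group_ids = {}
    assignment = {}
    for rid in record_ids:
        root = find(rid)
        if root not in group_ids:
            group_ids[root] = str(uuid.uuid4())  # one new identifier per group
        assignment[rid] = group_ids[root]
    return assignment


records = ["crm-1", "dw-7", "web-3", "crm-2"]
pairs = [("crm-1", "dw-7"), ("dw-7", "web-3")]
groups = build_match_groups(records, pairs)
for rid in records:
    print(rid, groups[rid])
```

Because matches are transitive here, "crm-1" and "web-3" end up in the same match group even though they were never compared directly.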
Getting the matching and merging right is crucial, and part of this is selecting the right tool to assist with it. Hence it is important to give the tool decision the time and thought it deserves in order to get the benefits the tool has to offer.
Reference master data is master data which is used across the organisation by various departments. The data itself is reference data, also known as code tables or look-up data. Within a bank with multiple branches across the country, for example, reference master data may include currencies, branch locations, geographic divisions, region names etc. As these examples show, some reference data may be used to build a hierarchy of reference master data such as countries, regions, cities, city areas and branch locations. From a database perspective, there will be tens or hundreds of reference master data tables, which are usually smaller than other tables within the same database. A reference table mainly consists of two key attributes/data items:
Code: This is a unique key which identifies each reference master data item. Its format varies between reference master data hubs: it may be purely numeric such as 1, 2, 3 etc. or alphanumeric such as LON1, LON2, LON3 etc., where the characters could have a meaning such as ‘LON’ referring to ‘London’ and the number identifying the particular London-based branch.
Description: The value of the data item such as London branch, Birmingham branch etc. In the above example of LON1, LON2, LON3 etc. it may be the names of London branches such as LON1 = London Monument, LON2 = London Bank, LON3 = London Victoria etc.
Reference master data is generic in its nature i.e. it can be used by various types of systems within the organisation including master data repositories such as customer, product, supplier master data etc. Given its wide use across the organisation it is vital that the reference master data is kept accurate and complete.
Managing the reference data within an organisation effectively will offer the following key benefits:
Reference master data may be managed manually by organisations. For instance, when a new branch is added to an organisation’s network, the branch details need to be added across multiple systems such as CRM, data warehouse, reports, operational systems, website, published company brochures etc. Without going into the details of the business process, in summary multiple employees who work on these systems and are responsible for the accuracy of the data held within them will be involved, and will communicate information through various channels including phone, e-mail and/or face-to-face meetings. The relevant data elements required by each person and system are entered manually, and very often this introduces data input issues.
For example, one person may enter LON0022 as the new branch number in the data warehouse whereas another person responsible for the CRM system may enter LON22. Because of this small, simple error the two records would not match, which would lead to further problems until the issue is discovered and raised by a person affected by the accuracy of the data. For example, when it comes to reporting sales levels from within the data warehouse, the sales figures at the end of the month may not be an accurate reflection because the operational system records sales using the branch code LON22 whereas the data warehouse equivalent is actually LON0022. Hopefully the issue will be discovered and rectified; if not, it could create the perception that a branch is not generating enough revenue, which may lead to redundancies and maybe closure of the branch, resulting in a loss on the balance sheet. Similarly, if the organisation is subject to regulatory requirements, this could result in hefty fines and damage the organisation’s reputation in the market.
To manage reference master data, organisations should therefore use reference master data management tools. These may be available within an MDM tool, although some vendors provide specific tools just to manage reference master data. Whichever option is purchased from a vendor, you should ensure that the tool offers at least some of the following capabilities:
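The branch code mismatch described above is the kind of error that a check against a single reference table would catch. The sketch below validates the codes used by each system against one agreed code table. The table contents, the system names and the description for LON22 are assumptions for illustration.

```python
# Hypothetical reference table: code -> description.
REFERENCE_BRANCHES = {
    "LON1": "London Monument",
    "LON2": "London Bank",
    "LON3": "London Victoria",
    "LON22": "London (new branch)",  # description invented for this example
}


def unknown_codes(codes, reference):
    """Return the codes that do not exist in the reference table, sorted."""
    return sorted(set(codes) - set(reference))


operational_codes = ["LON1", "LON22"]
warehouse_codes = ["LON1", "LON0022"]  # manual entry error

print(unknown_codes(operational_codes, REFERENCE_BRANCHES))
print(unknown_codes(warehouse_codes, REFERENCE_BRANCHES))
```

Run on every load, a check like this would surface the mistyped code immediately instead of at month-end reporting.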
Auditing and history to enable tracking and recording of changes made to the reference data over time: who made the changes, when they were made, why etc. Audit trails and a maintained history of data changes are particularly important for meeting regulatory requirements such as Solvency II.
Allow assignment of data stewards responsible for managing certain reference data domains. Similarly, the tool should allow access controls to be assigned to each reference data domain to control who can and cannot read or change the data.
Creating and managing hierarchies of reference data in order to support and represent the organisation’s business structure. Hierarchy management can be quite simple, but may also be extremely complex. So ensure that the hierarchy management features meet your present as well as foreseeable future requirements.
Integration with all the existing systems which need to source reference master data from this data hub.
Easy to use so that you do not have to spend a lot of time and money training up the users.
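Two of the capabilities listed above, audit history and steward-based access control, can be sketched in a few lines. This is a toy in-memory model under assumed names (`ReferenceDomain`, `AuditEntry`, the user names), not the design of any real reference data product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set


@dataclass
class AuditEntry:
    code: str
    old_value: Optional[str]
    new_value: str
    changed_by: str
    changed_at: datetime
    reason: str


@dataclass
class ReferenceDomain:
    name: str
    stewards: Set[str]
    values: Dict[str, str] = field(default_factory=dict)
    history: List[AuditEntry] = field(default_factory=list)

    def set_value(self, code: str, description: str, user: str, reason: str) -> None:
        """Add or change a code, recording who changed it, when and why."""
        if user not in self.stewards:
            raise PermissionError(f"{user} is not a steward of {self.name}")
        self.history.append(
            AuditEntry(
                code=code,
                old_value=self.values.get(code),
                new_value=description,
                changed_by=user,
                changed_at=datetime.now(timezone.utc),
                reason=reason,
            )
        )
        self.values[code] = description


branches = ReferenceDomain(name="branches", stewards={"alice"})
branches.set_value("LON22", "London (new branch)", "alice", "Branch opened")
print(branches.values)
print(len(branches.history))
```

A user who is not a steward of the domain gets a `PermissionError`, and every successful change leaves an entry in `history`.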
By implementing a suitable reference master data solution organisations can become more operationally efficient by means of automation and communicating through a single channel (i.e. the reference master data tool).
Costs are also reduced. Previously, when information about data updates/additions was exchanged, it would be provided by one person and entered into the system by another, so the cost was borne by two people. The reference master data management solution allows the “data provider” to enter the details into the system immediately, saving the cost of one person and leaving less room for misspelling or miscommunicating any information about the data. Without such a solution, multiple people are involved in updating or adding a data element, because they hold different information about the reference data and require different information for their own systems. The reference master data system therefore provides a single, complete and consistent view of the data in question, in place of the previously fragmented pieces of information communicated over phone, e-mail etc.
Customer service and retention are also improved, because reference master data for customer-related data elements is more likely to be accurate and correct. A critical mistake in the information could leave the customer frustrated with the organisation and therefore less happy. In the worst case, if the data error affects thousands of customers, retention rates could decrease almost immediately, causing further damage to the organisation’s revenue and position in the market.
Furthermore, if an organisation has a data governance council which oversees data standards they can easily monitor if data standards are correctly followed and make any corrections immediately through a clear view of all reference master data via the tool. Some of the workload can perhaps be reduced by implementing data validation rules within the reference master data tool.
Overall, reference master data can be a very good place to start any kind of MDM implementation. There are several reasons for this, which perhaps deserve a separate discussion, but in brief a reference master data solution is quicker and easier to implement. It can therefore be used as a starting point for MDM to demonstrate a quick ROI, mostly within 6 to 12 months depending on business complexity, data sources and the reference master data you wish to manage.
Data models in general should be built to be flexible enough that future changes are quick and easy to implement, with no or minimal impact on components including systems, processes, data stores or users. This is particularly important for CDI solutions, because businesses have to be very agile these days and adjust the way they do business very quickly, e.g. in order to be one of the first companies offering a new service or product. To do so, the organisation has to be able to put in place the necessary infrastructure quickly, including new IT systems or changes to existing systems. Every system and process relies on some kind of underlying data, which in turn is mostly stored in a database built using a data model.
When building a CDI solution, organisations can reuse their existing data model if it is deemed good enough to meet the requirements. Alternatively, a new data model might be built from scratch. Sometimes organisations purchase industry standard data models such as the Teradata FS-LDM (Logical Data Model for the Financial Services industry) or SAS DDS (which has various ready-made data model components, e.g. for different financial services institutions such as insurance companies and banks). Some of the off-the-shelf MDM/CDI products on the market also come with a basic out-of-the-box data model which can be customised to suit the requirements. Overall, no matter which MDM architecture is selected or how complex the requirements are, every CDI data model shares at least the following components:
The top two data components will need to be customised for the specific industry sector an organisation operates in, whereas the bottom two data components provide the reassurance that the data used in the CDI solution supports building a complete, consistent, reliable, accurate set of customer records.
On the whole, building a data model is not a straightforward task and requires skill and knowledge. Data modeling should start at the same time as the system requirements are being gathered. Since CDI or more generally MDM solutions are enterprise-wide solutions which require data integration from various data sources, creating a data model which will satisfy all users’ needs can be a challenging and very political task to get right. Creating a data model which meets everyone’s requirements means that you will create a data model which is too sophisticated to use. A simple data model aimed at satisfying only a small number of users’ requirements means that other users will not benefit from the CDI solution. So getting the balance right is very important and is something which data and business analysts can assist with.
CDI solutions allow organisations to collect, identify and aggregate relationship data from the various data sources within the enterprise’s systems and applications. CDI is a key enabler which allows organisations to gather relationship information, analyse and interpret it and then use it to their advantage to grow revenues, the customer base, identify previously unknown new customer segments etc.
Essentially, customer data tends to be duplicated across multiple systems within an organisation, and CDI has the power to identify the various records belonging to a single customer and group them in a so-called Match Group. Through this feature CDI is able to give an insight into all of the relationships the customer has with the organisation. For example, a particular customer may not have had many interactions with a bank and little information might therefore exist about the customer. However, knowing that the customer has some relatives who are also customers of the same bank and hold several savings and investment accounts might imply that the customer could actually be very valuable to the bank (subject to further analysis). It is this capability of CDI which allows links and relationships amongst customers to be identified, making the organisation more effective in achieving its various goals and strategies.
Overall, relationships do not just apply to customers but also product data, reference data, hierarchies etc where relationship discoveries can be beneficial to organisations in their own ways.
To represent and capture all the relationship and customer-related data, a data model is required which meets current requirements and is flexible enough to be altered in the future in response to changing requirements, business processes etc. Any changes should have no or minimal impact on other systems, processes and users.
23 Apr 2013
The main challenge with customer data is the fact that it changes very frequently and there are multiple reasons for this such as the following
There is a statistic which says that about 2% of customer details change every month. An organisation with 1 million customers would therefore have to deal with about 20,000 customer data changes every month. This estimate is only valid for existing customers; for organisations which are growing rapidly, the number of customer record changes will grow proportionally.
Promotions, marketing campaigns, price changes, new product introductions etc. are all driven by accurate and reliable customer data, and it is this data which will help an organisation to achieve greater revenue and more customers than its competitors operating in the same business area.
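The arithmetic above is simple enough to check in a few lines. The sketch below reproduces the 2% figure for 1 million customers and then shows how the monthly workload scales with the customer base. The 10% monthly growth rate is a made-up value used only to show the proportional effect.

```python
def expected_monthly_changes(customers: int, monthly_change_rate: float = 0.02) -> int:
    """Expected number of customer records that change in one month."""
    return round(customers * monthly_change_rate)


def projected_changes(customers, monthly_growth_rate, months, monthly_change_rate=0.02):
    """Expected changes per month while the customer base keeps growing."""
    result = []
    current = float(customers)
    for _ in range(months):
        result.append(round(current * monthly_change_rate))
        current *= 1 + monthly_growth_rate
    return result


print(expected_monthly_changes(1_000_000))   # 20000
print(projected_changes(1_000_000, 0.10, 3))  # [20000, 22000, 24200]
```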
The challenge of course is to aggregate, cleanse, standardise and make this data available in a timely manner so that it is still usable before it becomes obsolete. Standardising, matching and merging, de-duplicating and rationalising the customer data are the main challenges related to customer data.
Aggregating the relevant customer data is not an easy task, especially given how organisations commonly develop and grow over time, i.e. in a siloed fashion. Customer data is stored in multiple systems and often duplicated, and formats and levels of granularity tend to vary from system to system and department to department.
Resolving all the data quality issues is no straightforward task: it usually takes months and, depending on the size of the organisation and the breadth of the data quality issues, maybe even years in the most extreme cases. However, once the data quality issues have been ironed out, it can lead to increased revenue, reduced costs, greater efficiencies, compliance with regulatory requirements, better customer service and satisfaction levels, reduced customer attrition, more effective marketing campaigns etc.
On the whole, challenges to integrate customer data are due to the following reasons:
Assuming that all or at least the majority of customer data has been cleansed, it is also important that you establish a process of monitoring, reporting and assessing the data quality on an ongoing basis. After all, having made a substantial investment in cleansing the data, you do not want to have to repeat a similar exercise 6 or 12 months down the line. Therefore ensure that as soon as you notice a drop in data quality, the issues are resolved immediately before they reach an unmanageable state and get out of control again.
In order to get an MDM project going within an organisation and to gain commitment in terms of resources and finance from senior management, i.e. the budget holders, the reason for starting an MDM project must be justified through a solid business case. The business case is aimed at the IT and business audience and senior management.
In order to create a good business case, it is important that thorough research and analysis is carried out across the enterprise with the aim of identifying as many interested people/departments as possible who would benefit from an MDM solution. Identifying all or at least most interested people/departments not only helps to strengthen the case for MDM, but also avoids costly changes and U-turns later on.
Although a business case can be created for the whole enterprise, it is better to create a consolidated business case document structured by department or line of business (LOB). Structuring the business case by departments/LOBs helps to improve the readability of the business case document and allows the various departments/LOBs to focus immediately on the MDM part which is relevant to them. The business case document should capture the following information by department/LOB:
The business case document can later be used to evaluate whether the project has been a success. It also allows the level of success to be measured by comparing the implemented solution against the above proposal.
The business case document is a key document which will be read by various people involved in the project. It is the document which allows new joiners to gain a quick overview of why MDM is required, what benefits it will deliver, the scale of change and in what order the project will be executed. Hence it is also important that it is written such that it can be easily understood by a broad audience.
To test MDM implementations effectively, you should take a scenario-based testing approach. With the typical automated or manual testing approach, there is very little value in testing the MDM implementation against thousands of values. Instead, the approach should be to have a selective set of a few hundred test scenarios to test against. Obviously this depends on the complexity of the MDM solution, and the number of test cases may go into the thousands, but this is rare. The scope of testing, resource availability, the level of detail required and time constraints also play a part. Fewer than a thousand test scenarios would therefore be the usual case when testing an MDM solution.
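As one possible form of the ongoing monitoring described above, the sketch below computes a simple completeness score (the share of required fields that are populated) and flags it when it drops below a threshold. The field names, sample records and the 0.95 threshold are assumptions; real monitoring would track several quality dimensions over time.

```python
def completeness(records, required_fields):
    """Share of required fields that are populated, between 0.0 and 1.0."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for name in required_fields
        if record.get(name) not in (None, "")
    )
    return filled / total


def needs_attention(score, threshold=0.95):
    """True when the quality score has dropped below the agreed threshold."""
    return score < threshold


customers = [
    {"name": "John Smith", "postcode": "EC3R 8AH", "email": "john@example.com"},
    {"name": "Jane Doe", "postcode": "", "email": "jane@example.com"},
]
score = completeness(customers, ["name", "postcode", "email"])
print(round(score, 3), needs_attention(score))
```

Scheduling a check like this after each data load gives an early warning before quality degrades to the point where another large cleansing exercise is needed.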
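A scenario-based test suite can be kept as a small table of named scenarios, each with inputs and an expected outcome, rather than thousands of raw values. The sketch below assumes a hypothetical `records_match` rule as the thing under test; in a real project the scenarios would exercise the MDM hub itself.

```python
def standardise(value: str) -> str:
    """Upper-case the value and collapse repeated whitespace."""
    return " ".join(value.upper().split())


def records_match(a: dict, b: dict) -> bool:
    # Hypothetical stand-in for the MDM hub's matching rule.
    return all(standardise(a[k]) == standardise(b[k]) for k in ("name", "postcode"))


# Each scenario: (title, record A, record B, expected result).
SCENARIOS = [
    (
        "identical records are matched",
        {"name": "John Smith", "postcode": "EC3R 8AH"},
        {"name": "John Smith", "postcode": "EC3R 8AH"},
        True,
    ),
    (
        "case and spacing differences are ignored",
        {"name": "john  smith", "postcode": "ec3r 8ah"},
        {"name": "JOHN SMITH", "postcode": "EC3R 8AH"},
        True,
    ),
    (
        "different postcode is not matched",
        {"name": "John Smith", "postcode": "EC3R 8AH"},
        {"name": "John Smith", "postcode": "B1 1AA"},
        False,
    ),
]


def run_scenarios(scenarios):
    """Return the titles of all scenarios whose outcome differs from expected."""
    failures = []
    for title, a, b, expected in scenarios:
        if records_match(a, b) != expected:
            failures.append(title)
    return failures


print(run_scenarios(SCENARIOS))
```

Each scenario documents one business expectation, which keeps the suite small and makes a failure easy to interpret.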
During the test phase all aspects of the MDM solution such as the following should be covered:
MDM implementations require a variety of tools. Which specific tools are required for a given implementation depends on the scope, requirements and objectives of the MDM initiative. These are the tools which may be used during an MDM implementation:
ETL Tool: This will be used for the data integration part of the implementation to acquire, transform, standardise and load the data into the MDM data hub. Some MDM vendors have the ETL toolset bundled into their MDM application, whilst others have it as a separate application which can be integrated with the MDM tool.
Data Quality Tool: The Data Quality tool should not be confused with the ETL tool. Even though ETL tools come with features which can cleanse and standardise data in a similar way to data quality tools, ETL is more about data integration/migration, i.e. moving data from one or more sources to one or more destinations. Data quality tools help to profile, cleanse, standardise, parse and enrich source data. Data quality tools can integrate with external data service providers such as Dun & Bradstreet to enrich, validate and cross-reference source data.
Data Profiling Tool: If the MDM implementation is using a Data Quality tool then a separate Data Profiling tool should not be required. However, it is still mentioned here as part of the MDM tool set in case a Data Quality Tool is not used. This may be the case if the data quality is of an acceptable standard (most likely this will not be the case).
Database: This is probably the most important core component which will store the master data.
Core MDM Data Hub with Matching Component: This is the MDM data hub which will support the linking and matching of entities, support hierarchy management, versioning etc.
Enterprise Information Integration (EII) Tool: Depending on the MDM architecture style, an EII tool may be required particularly in case of the Registry Style or Co-existence Style (a.k.a. Hybrid or Reconciliation Style). EII tools are capable of aggregating distributed, small amounts of data which is in a non-persistent format or memory. With the Registry and Hybrid Style if a lot of complex queries are made from multiple source systems then this will impact the performance. To avoid this, EII tools should be used.
Web Server: The Core MDM Data Hub and other custom-built applications will sit on a web server.
Messaging Server (Enterprise Service Bus): If the MDM solution requires messaging and data synchronisation in real-time then an ESB component will be required as part of the MDM solution architecture.
Test Management Server: This is used to create the test cases to test the MDM solution, execute the test cases and manage defects.
Workflow Tool: For data governance purposes and business process management, the MDM system must be capable of being integrated with a workflow tool. Some MDM tools come with a built in workflow tool. It is therefore worth comparing the MDM tool’s workflow capabilities against the existing one. If the workflow tool inside the MDM tool is better compared to the existing tool, then the next question will be how easily and quickly the existing workflows can be migrated to the new workflow.
Even though MDM architectures will vary from one solution to another, generally the following technical capabilities are represented by MDM solutions:
Data Acquisition
The data is profiled, acquired, cleansed and transformed using Data Integration/Consolidation and Data Quality tools.
Data Delivery
On the other hand, the data must be delivered from the MDM Data Hub to the consuming systems and applications. The data delivery component includes interfaces to facilitate communication between the MDM Data Hub and consuming applications and systems.
MDM Data Hub
The MDM Data Hub sits between the Data Acquisition and Delivery components. It includes the following capabilities: Reference Master Data, MDM Architecture, Master Data Model, Hierarchy Management, Business Rules, Metadata Management and the Master Data Repository. The business rules ensure consistency and completeness of the master data. Some MDM tools come with a pre-configured data model, but in most cases this requires customisation specific to the requirements.
Access Control
This component ensures that master data can only be accessed by authorised users, systems and applications.
Master Data Services
The MDM Data Hub offers a number of data services which can be accessed by the downstream systems. Examples of such master data services include Update customer name, Deactivate customer account etc.
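The two example services named above could look like this from a downstream system's point of view. The sketch is an in-memory stand-in with assumed names (`CustomerHub`, `get_customer`, the customer identifier); a real hub would expose such services over an API and persist the changes.

```python
class CustomerHub:
    """Toy in-memory stand-in for an MDM data hub's customer services."""

    def __init__(self):
        self._customers = {}

    def add_customer(self, customer_id, name):
        self._customers[customer_id] = {"name": name, "active": True}

    def update_customer_name(self, customer_id, new_name):
        """Service: Update customer name."""
        self._customers[customer_id]["name"] = new_name

    def deactivate_customer_account(self, customer_id):
        """Service: Deactivate customer account."""
        self._customers[customer_id]["active"] = False

    def get_customer(self, customer_id):
        # Return a copy so callers cannot change master data directly.
        return dict(self._customers[customer_id])


hub = CustomerHub()
hub.add_customer("C-001", "Jon Smith")
hub.update_customer_name("C-001", "John Smith")
hub.deactivate_customer_account("C-001")
print(hub.get_customer("C-001"))
```

Routing every change through named services like these keeps the hub as the single place where master data is modified.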
Data Governance and Stewardship
The overall MDM solution is supported by Data Governance and Stewardship to manage and monitor the quality of data and resolve any data quality issues when required.
Data Synchronisation
This is one of the most important capabilities MDM has, and it ensures that data is kept in sync between the MDM Data Hub and upstream and downstream systems.
The purpose of this blog is for the author to share his knowledge and experience in various Data Management domains with the aim of helping readers to learn/expand on their existing knowledge in the area. This blog shares knowledge on Master Data Management, Data Quality, Data Governance, Data Integration, Data Analysis, Data Profiling, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Metadata Management etc.
I am Manjeet Singh Sawhney and work as an Information/Data Architect, Manager for Direct Line Group in London and Bromley (UK). Prior to this, I have worked for Accenture, Tibco Software, Initiate Systems (now IBM) and Tata Consultancy Services. My areas of expertise are Master Data Management (Customer, Product, Reference), Metadata Management, Data Governance, Data Quality, Data Integration, Data Migration, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Data Profiling and Data Analysis. I am using this blog to share my knowledge and experience in these areas and hope that you will find it useful.