Following on from the previous post, Hadoop Introduction – Part 1, this post explains the history of Hadoop: how it was born and how it reached its current level of popularity.
Hadoop’s history is an interesting one. You could say that the original idea came from Google, which published a paper about its file system in 2003 and another on MapReduce in 2004. At that time, the Apache Nutch developers saw the opportunity to port Nutch to a cluster consisting of tens of computers rather than running it on a single computer.
Yahoo! wanted to rebuild its web search infrastructure and so used Nutch’s storage and processing architecture. This effectively led to the creation of the core of Hadoop.
However, Hadoop at that time was not capable of processing the volumes of data required by Yahoo!’s search. Eric Baldeschwieler, co-founder and CEO of Hortonworks, who was then VP of Hadoop software development at Yahoo!, told GigaOM that Hadoop was limited at the time to running on five to twenty computers.
Yahoo!’s CTO Raymie Stata realised that it would take a lot of time to develop Hadoop to the point where it could give Yahoo! the kind of scalability it required. And so work began at Yahoo! to make Hadoop more scalable. To achieve this goal, Yahoo! set up a “research grid” for its data scientists, which helped the team gradually scale Hadoop up to more and more computers.
Finally, in 2008, Yahoo! was able to run its web search engine on Hadoop, using a cluster of around 10,000 computers. The effort paid off: search speed increased. Furthermore, because processing was distributed across a cluster, reliability also improved in the event of failures at individual computer nodes.
Three years later, in 2011, Yahoo! was running its search engine across 42,000 computers. Since then, Hadoop has gained a lot of attention worldwide, particularly from the development community, which helps it continuously evolve.
In my blog post next week, I will talk about how to install Hadoop. So make sure you visit my blog next week.
This is the first blog post of a set of posts I am going to write over the next few months about Hadoop, including HDFS and MapReduce. The aim of these posts is to share my current knowledge with you, and hopefully you will find them useful. I am going to start with a high-level introduction to Hadoop before going into the details in subsequent posts.
What is Hadoop?
Hadoop is an open source Apache project capable of storing and processing large volumes of data; these volumes may run into terabytes or even petabytes. Hadoop is strongly associated with the concept of Big Data, and a lot of additional frameworks have been built on top of it to complement its features and functionality.
Key points about Hadoop
Other projects related to Hadoop
Hadoop has received a lot of attention since about 2011. This has resulted in multiple related projects being spun off from it. Some of the most popular related projects include:
There are many more projects related to Hadoop, but the above highlights some of the most popular ones which you will hear about more often than others.
How does Hadoop work?
Hadoop runs code across a cluster of computers. This process includes the following core tasks which Hadoop performs:
There are many other tasks Hadoop performs, but the above outlines the core ones and at the same time provides an overview of how Hadoop operates.
If you look into the details of Hadoop, you will see that it is actually quite a complex framework, but from a developer’s perspective you only need to provide Hadoop with some code and data and it will take care of all the complexities for you in the background.
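To make that concrete, here is a minimal, hypothetical sketch of the kind of code you might hand to Hadoop. It assumes Hadoop Streaming, which lets you write the map and reduce steps as plain scripts that read from stdin and write to stdout; the word-count task, file names and paths are purely illustrative.

```python
#!/usr/bin/env python3
# mapper.py - emits "<word>\t1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sums the counts for each word.
# Hadoop sorts the mapper output by key before it reaches the reducer,
# so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You would submit the two scripts to the cluster via the Hadoop Streaming jar, pointing it at the mapper, the reducer and the input and output locations in HDFS; Hadoop then handles splitting the input, scheduling the work across the cluster, sorting the mapper output by key and re-running tasks on failed nodes.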
An interesting fact about Hadoop
The largest Hadoop clusters hold up to 25 petabytes of data distributed across 4,500 computers, which is an extremely powerful capability. All of this can be achieved using commodity hardware, thereby keeping hardware and data processing costs low.
In my blog post next week, I will talk about the history of Hadoop. So make sure you visit my blog next week to learn about how Hadoop was born and reached its state where it is today.
Managing data effectively within an organisation is a big task, especially for organisations which have either been in business for a long time or are very large in terms of size or market coverage. These organisations typically suffer from siloed data sources, where data is stored across multiple systems and often managed in different ways by different users. Storing data across multiple systems leads to data duplication, redundant storage, inconsistencies, storage in different formats and so on. A wide range of data management tools and technologies exists in the market to help overcome such data management issues within organisations. The following sections describe the six key attributes that enterprise data must have:
Enterprise data must be accessible by authorised users and applications regardless of the source, data structure or means of access. Employees, such as sales people, do not always have a computer or laptop with them. These employees may interact with enterprise data whilst they are on a client site and may access it through a tablet or smartphone. Given the variety of channels through which employees, as well as customers and suppliers, may access enterprise data, it is important that the data can be accessed through a variety of means.
Users and applications must be able to access the data whenever and wherever it is required. This is particularly important for large organisations which span multiple time zones and require access to data around the clock.
The data must be complete, accurate, consistent and trustworthy if it is to be used across the enterprise. To achieve this, a mix of technologies and processes may be required, depending on the organisation’s requirements and currently available tools. If current tools can be leveraged, that is fine; otherwise new tools will probably need to be procured.
Since data is usually spread across multiple departments, applications and data stores, it is very easy for data to become inconsistent if no synchronisation method is in place. To ensure all users have access to the same information and are therefore on the same “knowledge wave”, data consistency is crucial. Consistent data is important not only to support daily operations but also for strategic decision making and reporting purposes.
Auditability means that it is possible to see an audit trail of what changes were made, who made them, when they were made and why. Changes could include additions, edits or deletions of data. Auditability is typically supported through versioning features provided as part of Reference Data Management, Metadata Management and MDM tools. Keeping a log of data changes is particularly important for complying with regulatory requirements, but it also supports other data management initiatives within organisations such as Data Governance, Data Warehousing, Business Intelligence and MDM.
All enterprise data must be protected from unauthorised use or access. In large organisations with a global presence this can be a big task. For instance, Marks and Spencer is a large retailer founded in the UK which today also has a presence in many other countries across the world. An organisation such as Marks and Spencer has a website allowing customers to purchase clothing, food, furniture and so on, has a physical presence in the form of stores, and also has kiosks inside those stores. Its user base includes customers, employees, suppliers and partners such as affiliates. Given the variety of users and channels for data access, security must be set up appropriately for each of these users and channels.
Data profiling tools help to identify relationships amongst data elements and also allow you to perform various statistical analyses on data, such as frequency counts, numbers of null data items, mean, median etc. Data profiling tools are often used for data quality and data modelling purposes. This article explains how data profiling tools can be used in both of these areas.
Data profiling tools are great at generating metadata. Sometimes organisations do not have documentation available for all of their applications and databases. This is where data profiling tools can help to discover the various data items and their metadata, i.e. their data characteristics. Similarly, if you doubt whether your metadata is accurate or complete, you can use a data profiling tool to verify it.
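As a rough, hypothetical illustration of the kind of statistics such a tool produces, the sketch below profiles every column of a table using pandas rather than a commercial profiling product; the file name and data are made up.

```python
import pandas as pd

# Load a table to be profiled (file and columns are hypothetical).
df = pd.read_csv("customers.csv")

for column in df.columns:
    series = df[column]
    print(f"Column: {column}")
    print(f"  data type      : {series.dtype}")
    print(f"  null count     : {series.isna().sum()}")
    print(f"  distinct values: {series.nunique()}")
    # Frequency counts of the most common values.
    print(f"  top values     : {series.value_counts().head(3).to_dict()}")
    # Mean and median only make sense for numeric columns.
    if pd.api.types.is_numeric_dtype(series):
        print(f"  mean / median  : {series.mean()} / {series.median()}")
```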
Domain Discovery and Analysis
Another application of data profiling tools is domain discovery and analysis. For instance, you can find out whether a certain set of values defines a domain (a set of valid values) or perhaps represents reference data. In addition, you can discover ranges of values using a data profiling tool.
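A small sketch of how that might look in practice, again using pandas as a stand-in for a real profiling tool; the column names and the expected domain are assumptions made for the example.

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical table

# Does the column's set of values form a known domain (e.g. reference data)?
known_statuses = {"NEW", "SHIPPED", "CANCELLED"}
observed = set(df["order_status"].dropna().unique())
print("Values outside the expected domain:", observed - known_statuses)

# Discover the range of a numeric column.
print("order_amount range:", df["order_amount"].min(), "to", df["order_amount"].max())
```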
Anomaly analysis is the most common area where data profiling tools are used in order to help identify data quality issues. Anomaly analysis can be performed through one of the following ways:
Column analysis: Performs statistical analysis on a particular column
Cross-column analysis: Performs analysis across columns within the same table. This includes ensuring that column data is not duplicated, checking that the current primary key really identifies each record uniquely, checking if any mappings/associations exist between columns and checking if there are any functional dependencies between the primary key and non-primary key columns.
Cross-table analysis: This helps to identify any anomalies between columns across tables and includes referential integrity checks and checking for duplicate columns (different names but same column values).
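As an illustration of the cross-table checks just described, the following sketch performs a simple referential integrity check and looks for a potentially duplicated column across two hypothetical tables:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical child table
customers = pd.read_csv("customers.csv")  # hypothetical parent table

# Referential integrity: every customer_id in orders should exist in customers.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(f"{len(orphans)} order rows reference a non-existent customer")

# Duplicate columns across tables: same values stored under different names.
if set(orders["branch_code"].dropna()) == set(customers["home_branch"].dropna()):
    print("orders.branch_code and customers.home_branch may hold the same reference data")
```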
The first two applications of data profiling tools are mainly used for data modelling purposes, although the findings would obviously also feed into a data quality programme. Anomaly analysis, on the other hand, is mainly used for data quality purposes, but any information discovered is also very beneficial for data modelling and can help to improve existing data models.
Data quality is not optional any more, since businesses rely on accurate information for various purposes such as running their transactional systems (where multiple systems source data from the same system), analysis, regulatory reporting etc. There are many tools out there which can help cleanse the data, then match and merge it to remove duplicates, thereby helping to reduce storage space and data anomalies. Matching and merging capabilities are bundled either into data quality tools or into MDM products. Although these tools try to distinguish themselves in various respects and vary in capability, the three main aspects on which they compete are:
Although lots of algorithms have been developed, these can be classified into the following categories:
The following sections will explain each algorithm in more detail.
This is based on a predictable comparison technique where each attribute in one record is compared with the corresponding attribute in another record. It is a very simple technique and very fast, although it requires that the data has already been cleansed and standardised.
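A minimal sketch of this kind of exact, attribute-by-attribute comparison (assuming, as noted above, that the records are already cleansed and standardised; the field names are illustrative):

```python
def exact_match(record_a: dict, record_b: dict,
                fields=("first_name", "last_name", "dob", "postcode")) -> bool:
    """Two records match only if every compared attribute is identical."""
    return all(record_a.get(f) == record_b.get(f) for f in fields)

a = {"first_name": "JOHN", "last_name": "SMITH", "dob": "1980-01-01", "postcode": "SW1A 1AA"}
b = {"first_name": "JOHN", "last_name": "SMITH", "dob": "1980-01-01", "postcode": "SW1A 1AA"}
# Returns True only because both records have already been standardised.
print(exact_match(a, b))
```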
This type of algorithm matches and merges similar-looking records based on probability theories. Over time, the algorithm is able to develop and improve its capability, such that records which initially had to be matched manually, or corrected manually in the case of false positives, no longer require manual intervention thanks to its self-learning capability.
Machine learning algorithms
Using this technique, records are matched and merged based on machine learning and artificial intelligence. It compares records the way a human being would and improves its capabilities over time.
As the name says, these algorithms are hybrids, i.e. combinations of the above algorithms plus others such as matching based on Soundex, phonetics, fuzzy logic etc.
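To give a flavour of the fuzzy element such a hybrid might include, the sketch below combines an exact rule on date of birth with a name-similarity score based on Python's standard difflib module; real products use far more sophisticated phonetic and fuzzy techniques, and the 0.85 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough fuzzy similarity between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def hybrid_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    # Deterministic rule: dates of birth must match exactly...
    if rec_a["dob"] != rec_b["dob"]:
        return False
    # ...combined with a fuzzy rule on the full name.
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold

print(hybrid_match({"name": "Jon Smyth", "dob": "1980-01-01"},
                   {"name": "John Smith", "dob": "1980-01-01"}))
```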
It is unlikely that a single algorithm will meet an organisation’s requirements, and therefore a mix of these algorithms will typically be used; one tool will also be stronger in one type of algorithm than another tool. When selecting the appropriate tool, organisations should compare tools based on some of the following criteria:
Speed of matching: How quickly does the tool match and merge a given set of records? It is worth thinking ahead to a future in which the organisation’s data volumes will have increased significantly.
Accuracy of matching: Although the majority of matching should be done automatically, it would be good if the tool allowed a semi-automated approach, i.e. the user should be able to intervene and perhaps inspect and verify the matches recommended by the tool before the merging is done.
Support for the organisation’s custom business rules: Organisations often have their own business rules which should take precedence over any of the “standard” matching techniques. Again, a semi-automated approach would be a good way to give custom business rules a higher priority in the matching.
Uniqueness and key generation: The tool should be able to identify unique sets of records and cluster them into so-called match groups. Each match group, identifying a unique customer, should be allocated a globally unique identifier by the tool (a small sketch of this idea follows the list below).
Flexibility: The matching and merging tool should be flexible enough to make changes to the logic according to changing business requirements and needs.
Ease of implementation and administration: The matching and merging tool should make it easy to add new matching rules or change existing ones. The tool itself should integrate into the existing infrastructure without requiring any special equipment such as a supercomputer, and it should similarly integrate with existing software and hardware.
Scalability: The tool should be able to provide consistent performance despite new data sources, additional concurrent users, new business and matching rules.
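Returning to the uniqueness and key generation point above, here is a minimal sketch of how matched record pairs could be clustered into match groups, each allocated a globally unique identifier; the record ids and pairs are made up for the example.

```python
import uuid
from collections import defaultdict

# Hypothetical output of the matching step: pairs of record ids judged to
# refer to the same customer.
matched_pairs = [("r1", "r2"), ("r2", "r5"), ("r3", "r4")]

# Union-find structure to cluster matched records into match groups.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in matched_pairs:
    union(a, b)

groups = defaultdict(list)
for record in parent:
    groups[find(record)].append(record)

# Allocate a globally unique identifier to each match group.
for members in groups.values():
    print(uuid.uuid4(), members)
```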
Getting the matching and merging right, and as part of this selecting the right tool to assist with it, is crucial. Hence it is important to make the right decision about the tool and to give it the right amount of thought in order to get the benefits it has to offer.
Reference master data is master data which is used across the organisation by various departments. The data itself is reference data, also known as code tables or look-up data. Examples of reference master data, for instance within a bank with multiple branches across the country, may include currencies, branch locations, geographic divisions, region names etc. As you can see from the examples, some reference data may actually be used to build a hierarchy of reference master data, such as countries, regions, cities, city areas and branch locations. From a database perspective, there will be tens or hundreds of reference master data tables, which are usually smaller in size than the other tables in the same database. A reference table mainly consists of two key attributes/data items:
Code: This is like a unique key to identify each reference master data item. It varies between reference master data hubs and may therefore be purely numeric, such as 1, 2, 3 etc, or alphanumeric, such as LON1, LON2, LON3 etc, where the characters can carry meaning, for example ‘LON’ referring to ‘London’ and the number indicating which London-based branch.
Description: The value of the data item, such as London branch, Birmingham branch etc. In the above example of LON1, LON2, LON3 etc, these may be the names of London branches, such as LON1 = London Monument, LON2 = London Bank, LON3 = London Victoria.
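In its simplest form, then, a reference table is just a lookup from code to description; the sketch below is a hypothetical branch reference table (the Birmingham entry is invented for the example):

```python
# A hypothetical branch reference table: code -> description.
branch_reference = {
    "LON1": "London Monument",
    "LON2": "London Bank",
    "LON3": "London Victoria",
    "BIR1": "Birmingham branch",
}

def describe(code: str) -> str:
    # Fall back to a marker so unknown codes are easy to spot downstream.
    return branch_reference.get(code, f"UNKNOWN CODE: {code}")

print(describe("LON2"))   # London Bank
print(describe("LON99"))  # UNKNOWN CODE: LON99
```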
Reference master data is generic in nature, i.e. it can be used by various types of systems within the organisation, including master data repositories such as customer, product and supplier master data. Given its wide use across the organisation, it is vital that the reference master data is kept accurate and complete.
Managing the reference data within an organisation effectively will offer the following key benefits:
Reference master data may be managed by organisations manually. For instance, when a new branch is added to an organisation’s network, the branch details need to be added across multiple systems such as the CRM, the data warehouse, reports, operational systems, the website, published company brochures etc. Without going into the details of the business process: multiple employees, each working on these systems and responsible for the accuracy of the data held within them, will be involved and will communicate information through various channels including phone, e-mail and/or face-to-face meetings. The relevant data elements required by each person and system are entered manually, and this very often introduces data input errors. For example, one person may enter LON22 as the new branch code in the data warehouse whereas the person responsible for the CRM system may enter LON0022.

Due to this small, simple error, the two records will not match, which leads to further problems until the issue is discovered and raised by someone impacted by the inaccuracy. For example, when reporting sales from the data warehouse, the month-end sales figures may not be an accurate reflection because the operational system records sales using the branch code LON22 whereas the data warehouse equivalent is actually LON0022. Hopefully the issue will be discovered and rectified; if not, it could create the perception that the branch is not generating enough revenue, which may lead to redundancies and perhaps closure of the branch, resulting in a loss on the balance sheet. Similarly, if the organisation is subject to regulatory requirements, such an error could result in hefty fines and damage to the organisation’s reputation in the market.
So, to manage reference master data, organisations should use reference master data management tools. These may be available within an MDM tool, although some vendors provide specific tools just for managing reference master data. Whichever option is purchased from a vendor, you have to ensure that the tool offers at least some of the following capabilities:
Auditing and history, to enable tracking and recording of changes made to the reference data over time: who made the changes, when they were made, why etc. Audit trails and maintaining a history of data changes are particularly important for meeting regulatory requirements such as Solvency II (a small sketch of an audit entry follows this list).
Allow the assignment of data stewards responsible for managing certain reference data domains. Similarly, the tool should allow access controls to be assigned to each reference data domain to control who can and cannot change or read the data.
Creating and managing hierarchies of reference data in order to support and represent the organisation’s business structure. Hierarchy management can be quite simple but may also be extremely complex, so ensure that the hierarchy management features meet your present as well as foreseeable future requirements.
Integration with all the existing systems which need to source reference master data from this data hub.
Easy to use so that you do not have to spend a lot of time and money training up the users.
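To make the auditing and history capability mentioned above concrete, here is a hedged sketch of the kind of entry such a tool might record for every change; the field names and values are purely illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    """One entry in the audit trail of a reference data change (hypothetical structure)."""
    entity: str        # e.g. "branch"
    code: str          # e.g. "LON22"
    action: str        # "add", "edit" or "delete"
    old_value: str
    new_value: str
    changed_by: str    # data steward responsible for the change
    reason: str
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

audit_log = []
audit_log.append(AuditEntry("branch", "LON22", "add", "", "London branch 22",
                            "j.bloggs", "New branch opened"))
print(audit_log[-1])
```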
By implementing a suitable reference master data solution, organisations can become more operationally efficient by means of automation and by communicating through a single channel (i.e. the reference master data tool).
Costs are also reduced. Previously, when exchanging information about any data updates or additions, the details would be provided by one person and then entered by another person responsible for putting the data into the system, so the cost was borne by two people. A reference master data management solution allows the “data provider” to enter the details into the system immediately, saving the cost of one person and also reducing the room for misspelling or miscommunicating any information about the data. Multiple people would otherwise be involved in updating or adding a data element, because different people hold different information about the reference data and require different information for their respective systems. The reference master data system therefore provides a single, complete and consistent view of the data in question, compared to the previously fragmented pieces of information communicated over phone, e-mail etc.
Customer service and retention are also improved, because customer-related reference master data is more likely to be accurate and correct. A critical mistake in the information could cause the customer frustration with the organisation and therefore result in a less happy customer. In the worst case, if a data error affects thousands of customers, retention rates could decrease almost immediately, causing further damage to the organisation’s revenue and position in the market.
Furthermore, if an organisation has a data governance council which oversees data standards, it can easily monitor whether those standards are being followed correctly and make any corrections immediately, thanks to a clear view of all reference master data via the tool. Some of this workload can perhaps be reduced by implementing data validation rules within the reference master data tool.
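As a rough illustration of such a validation rule, the sketch below enforces one consistent branch code format (the pattern itself is an assumption) so that variants like LON22 and LON0022 cannot both be entered:

```python
import re

# Hypothetical standard: three uppercase letters followed by exactly two digits.
BRANCH_CODE_PATTERN = re.compile(r"^[A-Z]{3}\d{2}$")

def validate_branch_code(code: str) -> bool:
    """Return True only if the code conforms to the agreed format."""
    return bool(BRANCH_CODE_PATTERN.match(code))

print(validate_branch_code("LON22"))    # True - conforms to the standard
print(validate_branch_code("LON0022"))  # False - rejected before it causes mismatches
```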
Overall, a reference master data solution can be a very good starting point for any kind of MDM implementation. There are several reasons for this, which perhaps deserve a separate discussion, but in brief a reference master data solution is quicker and easier to implement. It can therefore be used as a starting point for MDM to demonstrate a quick ROI, in most cases within 6 to 12 months depending on business complexity, data sources and the reference master data you wish to manage.
Data models in general should be built to be flexible enough that future changes are quick and easy to implement, with no or minimal impact on components including systems, processes, data stores and users. This is particularly important for CDI solutions, because businesses have to be very agile these days and adjust the way they do business very quickly, e.g. in order to be one of the first companies offering a new service or product. Putting in place the necessary infrastructure, including new IT systems or changes to existing ones, requires the organisation to be very agile. Every system and process relies on some kind of underlying data, which in turn is mostly stored in a database built using a data model.
When building a CDI solution, organisations can reuse their existing data models if they are deemed good enough to meet the requirements. Alternatively, a new data model might be built from scratch. Organisations also sometimes purchase industry-standard data models such as the Teradata FS-LDM (Logical Data Model for the Financial Services industry) or the SAS DDS (which has various ready-made data model components for different financial services institutions such as insurance companies and banks). Some of the off-the-shelf MDM/CDI products on the market also come with a basic out-of-the-box data model which can be customised to suit the requirements. Overall, no matter which MDM architecture is selected or how complex the requirements are, every CDI data model shares at least the following components:
The top two data components will need to be customised for the specific industry sector an organisation operates in, whereas the bottom two provide the reassurance that the data used in the CDI solution supports building a complete, consistent, reliable and accurate set of customer records.
On the whole, building a data model is not a straightforward task and requires skill and knowledge. Data modelling should start at the same time as the system requirements are being gathered. Since CDI, or more generally MDM, solutions are enterprise-wide solutions which require data integration from various data sources, creating a data model which will satisfy all users’ needs can be a challenging and very political task to get right. Trying to create a data model which meets everyone’s requirements is likely to produce a model that is too sophisticated to use, while a simple data model aimed at satisfying only a small number of users’ requirements means that other users will not benefit from the CDI solution. Getting the balance right is therefore very important and is something data and business analysts can assist with.
CDI solutions allow organisations to collect, identify and aggregate relationship data from the various data sources within the enterprise’s systems and applications. CDI is a key enabler which allows organisations to gather relationship information, analyse and interpret it, and then use it to their advantage to grow revenues and the customer base, identify previously unknown customer segments etc.
Essentially, customer data tends to be duplicated across multiple systems within an organisation, and CDI has the ability to identify and group the various records belonging to a single customer into a group called a Match Group. Through this feature, CDI is able to give an insight into all of the relationships the customer has with the organisation. For example, a particular customer may not have had many interactions with a bank, so little information might exist about that customer. However, knowing that the customer has relatives who are also customers of the same bank, holding several savings and investment accounts, might imply that the customer could actually be very valuable to the bank (subject to further analysis). It is this capability of CDI which allows organisations to identify links and relationships amongst customers and therefore be more effective in achieving various organisational goals and strategies.
Overall, relationships do not just apply to customers but also to product data, reference data, hierarchies etc, where relationship discoveries can be beneficial to organisations in their own ways.
To represent and capture all the relationship and customer-related data, a data model is required which meets current requirements and is also flexible enough to be altered in the future, subject to changing requirements, business processes etc. Any changes should have no or minimal impact on other systems, processes and users.
The main challenge with customer data is the fact that it changes very frequently, and there are multiple reasons for this, such as the following:
There is a statistic which says that about 2% of customer details change every month. Therefore an organisation with 1 million customers would have to deal with about 20,000 customer data changes every month. This estimate only covers existing customers, so for organisations which are growing rapidly the number of customer record changes will grow proportionally.
Promotions, marketing campaigns, price changes, new product introductions etc are all driven by having accurate and reliable customer data, and it is this data which helps an organisation achieve greater revenue and more customers than its competitors operating in the same business area.
The challenge of course is to aggregate, cleanse, standardise and make this data available in a timely manner so that it is still usable before it becomes obsolete. Standardising, matching and merging, de-duplicating and rationalising the customer data are the main challenges related to customer data.
Aggregating the relevant customer data is not an easy task, especially given the common theme often seen across organisations of how they developed and grew over time, i.e. in a siloed fashion. Customer data is often stored in multiple systems and frequently duplicated, and formats and levels of granularity tend to vary from system to system and department to department.
Resolving all the data quality issues is no straightforward task; it usually takes months and, depending on the size of the organisation and the breadth of the data quality issues, maybe even years in the most extreme cases. However, once the data quality issues have been ironed out, the result can be increased revenue, reduced costs, greater efficiency, compliance with regulatory requirements, better customer service and satisfaction levels, reduced customer attrition, more effective marketing campaigns etc.
On the whole, challenges to integrate customer data are due to the following reasons:
Assuming that all, or at least the majority, of the customer data has been cleansed, it is also important to establish a process for monitoring, reporting on and assessing data quality on an ongoing basis. After all, having made a substantial investment in cleansing the data, you do not want to find yourself 6 or 12 months down the line having to do a similar exercise again. Therefore, ensure that as soon as you notice a drop in data quality the issues are resolved immediately, before they reach an unmanageable state and get out of control again.
The purpose of this blog is for the author to share his knowledge and experience in various Data Management domains with the aim of helping readers to learn/expand on their existing knowledge in the area. This blog shares knowledge on Master Data Management, Data Quality, Data Governance, Data Integration, Data Analysis, Data Profiling, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Metadata Management etc.
I am Manjeet Singh Sawhney and I work as a Data Architect for Direct Line Group in Bromley (UK). Prior to this, I worked for Accenture, Tibco Software, Initiate Systems (now IBM) and Tata Consultancy Services. My areas of expertise are Master Data Management (Customer, Product, Reference), Metadata Management, Data Governance, Data Quality, Data Integration, Data Migration, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Data Profiling and Data Analysis. I am using this blog to share my knowledge and experience in these areas and hope that you will find it useful.
If you wish to advertise (banner or text links) on this blog or sponsor any blog posts then contact the author using the 'Contact the Author' form below. Please include some details about your advertisement / sponsorship request and a response will be sent to you within 24 hours.