Dealing with Imperfect Data? 5 Things to Consider

When customers are looking for ways of dealing with imperfect (substitute your own expletive here) data, there are five major factors that should be considered. You will find many competing claims about approaches and algorithms, but at the end of the day, my (completely unbiased) view about evaluation criteria is:

1. First and foremost, it’s all about accuracy. Here we could talk about specificity and sensitivity analysis and other statistical mumbo-jumbo, but for simplicity, let’s just focus on accuracy – that is, measuring how close the system can come to reaching the same conclusions as a domain expert when faced with the same data.

2. It’s about scalability – dealing with big data. How easily can the system you select deal with increasingly large volumes of data and workloads?

3. To any organization doing business across national or cultural boundaries, being able to deal with any type of data in any language should be a key criterion. Systems are global and need to deal with data about many different types of entities – not just customer and product data – and do so in a way that is independent of language.

4. Data comes from so many different systems and sources that being able to easily configure requests to deal with whatever comes along is a must-have. So make sure you review the options provided for fine-tuning requests so that they achieve the desired results.

5. Finally, of course, see how easily the system can be integrated with the existing and future applications, processes and tools that run your business. This involves two main questions: What native language support is provided? How is that integration achieved? Then make sure it will work effectively with your ESB, SOA, BPM and CEP products.
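To make the first criterion a little more concrete, here is a minimal sketch of how agreement with a domain expert could be scored. It assumes each record pair has been given a yes/no "match" decision by both the system under evaluation and the expert; the function name `agreement_metrics` and the sample decisions are made up for illustration and do not come from any particular product.

```python
def agreement_metrics(system, expert):
    """Compare a system's match decisions with a domain expert's decisions.

    Both arguments are equal-length lists of booleans, one per record pair:
    True means "these records match", False means "they do not".
    The expert's decisions are treated as the reference.
    """
    if len(system) != len(expert):
        raise ValueError("system and expert must cover the same record pairs")

    pairs = [(bool(s), bool(e)) for s, e in zip(system, expert)]
    tp = sum(1 for s, e in pairs if s and e)          # both say match
    tn = sum(1 for s, e in pairs if not s and not e)  # both say no match
    fp = sum(1 for s, e in pairs if s and not e)      # system matched, expert did not
    fn = sum(1 for s, e in pairs if not s and e)      # system missed an expert match

    total = len(pairs)
    return {
        # Share of all decisions where system and expert agree.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of the expert's matches that the system also found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of the expert's non-matches that the system also rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical decisions for eight record pairs.
expert = [True, True, True, False, False, False, False, True]
system = [True, True, False, False, False, True, False, True]

print(agreement_metrics(system, expert))
```

In this made-up sample the system misses one of the expert's four matches and wrongly matches one of the four non-matches, so all three figures come out at 0.75. On real data the three numbers usually differ, which is why accuracy alone can hide a system that misses matches or over-matches.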

Blame It on Ted

Imperfect data – A historic perspective

Our world of computing in 1969 was very different from today. In 1969, Dr. E.F. (Ted) Codd published his first internal IBM paper, “Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks”, followed in 1970 by the ACM publication, “A Relational Model of Data for Large Shared Data Banks” – the birth of relational databases as we know them today.

Organizations used to have complete control of their data. With just a few systems (usually to automate back-office functions), there was no concept of customer self-service, or integrated supply chains, or third-party data feeds, or just about anything we take for granted today.

Data was generated by professional data entry staff; they took pride in getting the data entry right, with very low error rates. Data was processed sequentially, tapes spinning round and lights flashing brightly; often you could tell what job was being processed by the noises in the computer room.

What’s changed?

What’s changed over 40 years? Today the typical organization runs hundreds, if not thousands, of systems spread across large data centers. Many of these applications share data with external sources, the supply chain and external data feeds and, of course, we are constantly trying to get our customers to do as much for themselves as we can. When you add up 40+ years’ worth of growth and change, you can see how organizations have come to have such volumes of “imperfect” data to deal with – data that is full of errors, inaccuracies and inconsistencies.