Data Quality
Amadeus have many years of experience in implementing Data Quality systems, and these are just some of the typical data quality challenges that you may also face:
- Complete missing information by looking across fields and even between systems;
- Gain consistent data definitions in a unified storage structure;
- Make data comply with standard definitions and corporate governance;
- Control data access and usage;
- Profile data to understand structures and improve integrity from operational systems onwards;
- Use sophisticated text parsing (with Perl Regular Expressions) to achieve standardisation;
- Limit costs and optimise processing time.
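As an illustration of standardisation by text parsing, here is a minimal sketch using Python's re module, whose regular expression syntax is largely Perl-style. The postcode pattern and the function name standardise_postcode are assumptions made for this example; the pattern is deliberately simplified and does not validate against the full official postcode format.

```python
import re

# Simplified UK postcode pattern (an illustrative assumption, not the official rules):
# outward code (1-2 letters, a digit, optional letter or digit), then inward code
# (a digit followed by two letters).
POSTCODE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)([0-9][A-Z]{2})$")


def standardise_postcode(raw):
    """Return the postcode as 'OUTWARD INWARD' in upper case, or None if it does not match."""
    # Remove all whitespace and upper-case the input before matching.
    cleaned = re.sub(r"\s+", "", raw).upper()
    match = POSTCODE.match(cleaned)
    if match is None:
        return None  # leave unrecognised values for manual review
    return f"{match.group(1)} {match.group(2)}"


if __name__ == "__main__":
    for value in ["ox28 6fa", "sw1a1aa", " M1  1AE ", "not a postcode"]:
        print(repr(value), "->", standardise_postcode(value))
```

Returning None rather than guessing keeps doubtful values visible, so they can be corrected at source instead of flowing silently into downstream processing.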
Looking After The Pennies
It was a phrase you'd hear quite often a couple of generations ago, and perhaps the phrase will be heard a little more whilst the economy slowly regains its feet: "Look after the pennies, and the pounds will look after themselves."
It's a simple enough concept: watching where small amounts are wasted and avoiding unnecessary expense will, when it is all added up, reap larger dividends. How many companies apply the same principle to their incoming information?
The time to catch most data errors is on the way into the system. The acronym "GIGO" (Garbage In, Garbage Out) will remain relevant as long as there is data to process. Yet much development effort concentrates so heavily on the process that businesses waste time and money, because the process itself has to cope with poor-quality data. Conversely, a great deal of time can be saved by investing in robust methods that catch, match and dispatch duplicates and eliminate errors as data is extracted from its sources.
As Bill Gates wrote: "The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency."
Where input data is of poor quality, the processing of that data becomes unnecessarily inefficient and has to work that much harder to achieve the same ends. Like arteries clogged with cholesterol, the input data streams are full of rubbish and the heart of the business becomes strained. Eventually, either productive output slows to a trickle or the heart simply gives up, with catastrophic consequences. Processes take more and more time, and the output becomes less and less usable. The business finds it harder and harder to climb the next set of commercial stairs.
Strings and Sounds
So much for the analogies. There are usually two outcomes: either more pounds are spent on bigger machines to handle all that inefficient input, or the system falls into disuse or never realises its original goal. The key to success here is to look after the "pennies" of data, which will eventually become "pounds" of useful information.
Much can be achieved with Base SAS using string handling functions and techniques (as simple examples, see "Useful String Functions: strip, scanq, substrn" in the Tips and Techniques section). Good coding practice will ensure that important information such as names and addresses is presented consistently. More complex functions are also available, such as SOUNDEX, which encodes a word according to how it sounds so that words which sound similar but are spelled differently receive the same code. Hence the surnames "Read", "Reed" and "Reid" all produce the same SOUNDEX code, and a match of this kind may be a clue to a duplicate record caused by a previous misspelling.
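To show how phonetic matching of this kind works, here is a minimal Python sketch of the classic four-character Soundex algorithm. It is an illustration only: the function name soundex is our own, and the SAS SOUNDEX function may differ in details such as the length of the code it returns.

```python
def soundex(name):
    """Return the classic four-character Soundex code for a name (illustrative sketch)."""
    # Map each coded consonant to its Soundex digit; vowels, H, W and Y have no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    # Keep only the letters A-Z, in upper case.
    letters_only = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters_only:
        return ""

    first = letters_only[0]
    result = first
    previous = codes.get(first, "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets this, so a repeated code after a vowel is kept

    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


if __name__ == "__main__":
    for surname in ["Read", "Reed", "Reid", "Robert"]:
        print(surname, soundex(surname))
```

Records whose surnames share a code are only candidates for duplication; other fields such as address or date of birth would normally be compared before two records are merged.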
For more powerful data quality tools, SAS provide dfPower Studio ("dfPS") from DataFlux, and SAS Data Quality Server. For example, Data Quality functions can determine, with a degree of confidence, whether a name is that of a company or an individual and, if an individual, whether they are male or female.
SAS can also interface to specialised address verification software such as QAS.
Quality Help For Quality Data
For the examples above we have considered a common problem that most people can relate to: matching names and addresses. But the issues of data quality run far wider. Spotting outliers in sample readings, alerting users to inconsistent inputs so that datasets are not contaminated, and filtering large quantities of text by keyword are all data quality issues, or are affected by them. The longer such issues remain unaddressed, the less efficient the processes intended to use the data become.
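As one example of spotting outliers in sample readings, here is a minimal Python sketch that flags values lying far from the median, measured in units of the median absolute deviation. The function name flag_outliers, the threshold of 3.5 and the sample readings are all assumptions chosen for illustration; a real project would pick a method and threshold to suit the data.

```python
from statistics import median


def flag_outliers(readings, threshold=3.5):
    """Return the readings whose modified z-score exceeds the threshold (illustrative sketch)."""
    if not readings:
        return []
    centre = median(readings)
    # Median absolute deviation: a measure of spread that is itself resistant to outliers.
    mad = median(abs(x - centre) for x in readings)
    if mad == 0:
        return []  # no measurable spread, so this simple method cannot rank the readings
    # 0.6745 is the constant conventionally used in the modified z-score,
    # which makes it comparable to a standard z-score for normally distributed data.
    return [x for x in readings if 0.6745 * abs(x - centre) / mad > threshold]


if __name__ == "__main__":
    sample = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]  # hypothetical readings
    print(flag_outliers(sample))
```

Flagged readings are best queried with whoever supplied them rather than deleted automatically, since an unusual value is sometimes correct.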
Amadeus can demonstrate successful projects where good data quality slashed processing times and improved the quality of the data delivered to end-users. By spending pennies on the pennies of data, companies have saved pounds and found the information that they hold is worth so much more. If this is an issue that your organisation faces, call Amadeus on 01993 848010 and ask for the Business Development Director.
References: DataFlux white papers.

