When migrating from one information system to another, it is often tempting to take the opportunity to improve data quality. Although that is a project in its own right, this may well be the "right time", provided we have the resources for it: human resources, of course, but also enough time to develop a methodology that can actually guarantee better quality.
This year some students embarked on this adventure for their graduation projects, which involved migrating library catalogs. One particular problem came up quickly: the presence of duplicates in the authority lists. Catalogs are excellent places to measure how creative the human mind can be when it comes to not following the rules ;-)
In a library catalog we have, for example, a list of publishers. Depending on the creativity of the cataloguers, but also on the number of people involved in managing these publishers, we will find more or fewer variants:
- Ed O'Reilly
- O'Reilly;
- O'Reilly
- E. O'Reilly
- etc.
The ideal would therefore be to find all the duplicates and replace them with the correct form. But how do we find all these duplicates? Without a computer, the task is tedious:
- browse the authority list;
- draw up a list of duplicates;
- choose the "correct" form;
- make all necessary changes.
Faced with such situations, my first instinct is often to work out what I can computerize, or even automate. Here, the ideal would be to obtain a list of "possible duplicates", i.e. expressions that are close enough to each other to arouse our suspicion. For this, computing offers several techniques:
- the Levenshtein distance, that is, the number of elementary operations (single-character insertions, deletions and substitutions) needed to turn a word M into a word P. With this algorithm we can compare each entry in the authority list with all the others and keep the pairs whose Levenshtein distance is small (the threshold obviously has to be tuned, like the mesh size of a fishing net). Of course, this algorithm is available on CPAN: Text::Levenshtein for a pure-Perl version, and Text::LevenshteinXS for a version in C;
- other techniques derived from the previous one, for example the algorithm behind agrep, a grep that allows approximate matching when searching strings. There is a Perl module that reproduces this behavior: String::Approx (also an XS module, and thus based on C).
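As a rough illustration of the first technique, here is a minimal sketch that compares every entry of a list with all the others and keeps the pairs under a distance threshold. The function names `levenshtein` and `possible_duplicates`, the threshold of 3 and the sample entries (including "Addison-Wesley") are assumptions made for the example. The distance is hand-written here so that the script runs with core Perl only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Levenshtein distance by dynamic programming, keeping only two rows.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distances from "" to each prefix of $t
    for my $i (1 .. scalar @s) {
        my @cur = ($i);             # distance from a prefix of $s to ""
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitution
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;  # deletion
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return [entry, other entry, distance] for every pair of entries
# whose distance does not exceed $max_distance.
sub possible_duplicates {
    my ($max_distance, @entries) = @_;
    my @pairs;
    for my $i (0 .. $#entries - 1) {
        for my $j ($i + 1 .. $#entries) {
            my $d = levenshtein($entries[$i], $entries[$j]);
            push @pairs, [ $entries[$i], $entries[$j], $d ]
                if $d <= $max_distance;
        }
    }
    return @pairs;
}

# Illustrative data only: the variants listed above plus an unrelated publisher.
my @publishers = ("Ed O'Reilly", "O'Reilly;", "O'Reilly", "E. O'Reilly", "Addison-Wesley");
printf "%-12s ~ %-12s (distance %d)\n", @$_
    for possible_duplicates(3, @publishers);
```

Every entry is compared with every other, so the running time grows with the square of the list size. On a large authority list it would probably be worth replacing the hand-written function with the `distance` function exported on request by Text::Levenshtein or Text::LevenshteinXS.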
So the techniques exist; all that remains is to... search CPAN, for example with the keywords "group" or "similarity". That search led me to String::Similarity::Group, which is built on String::Similarity, a module that relies on a significantly different algorithm but on the same principles as those explained above. In short, once again CPAN saves me time and lets me develop a prototype quickly (without this module, I would have had to implement the grouping myself, which is certainly not complicated but does take some thought).
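To give an idea of the bookkeeping that a grouping module saves, here is a minimal sketch of building groups by hand. It assumes a list of candidate pairs has already been obtained with one of the techniques described earlier. It is not the prototype announced below, nor a description of how String::Similarity::Group works internally. The helper `group_pairs` and the sample pairs are hypothetical, and only the first two elements of each pair are used.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Merge candidate pairs into groups: two entries land in the same group
# as soon as a chain of pairs links them (connected components).
sub group_pairs {
    my (@pairs) = @_;
    my %parent;

    # Follow parent links up to the representative of an entry's group.
    my $find = sub {
        my ($x) = @_;
        $parent{$x} //= $x;
        $x = $parent{$x} while $parent{$x} ne $x;
        return $x;
    };

    # Attach the group of the second entry to the group of the first.
    for my $pair (@pairs) {
        my ($ra, $rb) = map { $find->($_) } @{$pair}[0, 1];
        $parent{$rb} = $ra if $ra ne $rb;
    }

    # Collect the members of each group under its representative.
    my %members;
    push @{ $members{ $find->($_) } }, $_ for sort keys %parent;
    return map { $members{$_} } sort keys %members;
}

# Illustrative pairs only.
my @groups = group_pairs(
    [ "O'Reilly",       "O'Reilly;" ],
    [ "Ed O'Reilly",    "E. O'Reilly" ],
    [ "O'Reilly;",      "Ed O'Reilly" ],
    [ "Addison Wesley", "Addison-Wesley" ],
);
print join(" | ", @$_), "\n" for @groups;
```

One consequence of this design is that groups are chained transitively: two entries that are not close to each other can still end up together if intermediate variants link them. So the resulting groups remain "possible duplicates" to be checked by a human.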
Here is the prototype in question:
#!/usr/bin/env perl
use strict;
use warnings;