Use this identifier to link to or cite this item:
http://hdl.handle.net/20.500.12128/21844
Title: | Miary podobieństw łańcuchów znakowych a deduplikacja rekordów w bibliograficznych bazach danych [String similarity measures and the deduplication of records in bibliographic databases] |
Author: | Kamińska, Anna Małgorzata |
Keywords: | Bibliographic databases; Deduplication of records; String similarity; Records linkage; Bibliograficzne bazy danych; Deduplikacja rekordów; Podobieństwo łańcuchów znakowych; Scalanie rekordów |
Date of publication: | 2017 |
Source: | "Przegląd Biblioteczny" (2017), no. 4, pp. 477-495 |
Abstract: | Thesis/Objective: The article presents a method for deduplicating and linking bibliographic
records in databases, based on string similarity metrics. The proposal draws on the author's
own experience of building a bibliographic database and of conducting bibliometric research
on data from publicly available bibliographic databases. The formal description of the method
is illustrated with data obtained from the CYTBIN database. Research methods: Developing the
method required a review of the information architecture of selected Polish bibliographic
databases and the identification of the problems that affect them, which stem not only from
their data models but also from the design of their graphical user interfaces. Several string
similarity metrics were analyzed, and some of them were used as components of the compound
method finally proposed. The method evaluates the similarity of bibliographic records on the
basis of their attributes. Results: Results obtained for data from the CYTBIN database allowed
the proposed method to be verified empirically. In addition, the author analyzed the
distribution of similarity among bibliographic records from the CYTBIN database, calculated
both with the proposed method and with the Jaro-Winkler algorithm applied to the titles of
bibliographic units. Conclusions: Once its parameters have been adjusted to the specifics of
a given bibliographic database, the proposed method can be used to improve the quality of
bibliographic data. Depending on the performance of the computer system, it can be implemented
in a proactive model (verification before a record is added to the database), a reactive model
(verification of all records, or only recently added ones, performed for instance daily during
periods of low system load), or both. |
URI: | http://hdl.handle.net/20.500.12128/21844 |
ISSN: | 0033-202X |
Appears in collections: | Artykuły (W.Hum.)
|
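
The abstract names the Jaro-Winkler algorithm, applied to titles, as the baseline against which the proposed compound method is compared. As an illustration only, the following is a minimal sketch of the standard Jaro-Winkler similarity in Python. It is not the author's compound method, which this record does not specify. The helper `is_probable_duplicate`, its case-folding normalization, and the threshold of 0.9 are hypothetical choices made for this example, not parameters taken from the article.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in the range [0.0, 1.0]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def is_probable_duplicate(title_a: str, title_b: str,
                          threshold: float = 0.9) -> bool:
    """Hypothetical helper: flag two titles as likely duplicates.

    The threshold and the normalization are assumptions for this sketch.
    """
    a = " ".join(title_a.casefold().split())
    b = " ".join(title_b.casefold().split())
    return jaro_winkler(a, b) >= threshold
```

In a proactive model such a check would run before a new record is saved; in a reactive model it would run in a batch over existing records. Either way, a title-only comparison like this one covers a single attribute, whereas the abstract describes a method that evaluates records on the basis of several attributes.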