String similarity measures and the deduplication of records in bibliographic databases

Kamińska, Anna Małgorzata

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12128/21844

Title:	Miary podobieństw łańcuchów znakowych a deduplikacja rekordów w bibliograficznych bazach danych [String similarity measures and the deduplication of records in bibliographic databases]
Author:	Kamińska, Anna Małgorzata
Keywords:	Bibliographic databases; Deduplication of records; String similarity; Records linkage; Bibliograficzne bazy danych; Deduplikacja rekordów; Podobieństwo łańcuchów znakowych; Scalanie rekordów
Issue date:	2017
Source:	"Przegląd Biblioteczny" (2017), no. 4, pp. 477-495
Abstract:	Thesis/Objective – The article presents a method for deduplicating/linking bibliographic records in databases based on string similarity metrics. The proposal draws on the author's own experience of building a bibliographic database and conducting bibliometric research on data acquired from publicly available bibliographic databases. The formal description of the method is illustrated with data obtained from the CYTBIN database. Research methods – Developing the method required a review of the information architecture of selected Polish bibliographic databases and identification of the problems that affect them, which result not only from their data models but also from the design of their graphical user interfaces. Several string similarity metrics were analyzed, and some of them were used as components of the compound method finally proposed. The method makes it possible to evaluate the similarity of bibliographic records based on their attributes. Results – The results, presented using data acquired from the CYTBIN database, enabled empirical verification of the proposed method. In addition, the author analyzed the distribution of similarity among bibliographic records from the CYTBIN database, calculated both with the proposed method and with the Jaro-Winkler algorithm applied to the titles of bibliographic units. Conclusions – The proposed method, once its parameters are adjusted to the specifics of a given bibliographic database, can be used to improve the quality of bibliographic data. Depending on the performance of the computer system, a proactive model (verification before a given record is added to the database) and/or a reactive model (verification of all records or only recently added ones, performed for instance daily during periods of low system load) can be implemented.
URI:	http://hdl.handle.net/20.500.12128/21844
ISSN:	0033-202X
Appears in collections:	Articles (W.Hum.)
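
The abstract mentions the Jaro-Winkler algorithm applied to titles and a compound, attribute-based record similarity. The article's actual formulas and parameters are not reproduced in this record, so what follows is only a minimal sketch of the general idea: a textbook Jaro-Winkler implementation combined with a weighted average over record attributes. The function `record_similarity`, the field names, the weights, and the sample records are all hypothetical and chosen purely for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def record_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Hypothetical compound score: weighted mean of per-attribute Jaro-Winkler.

    This is NOT the formula from the article; the weights are assumptions.
    Two fields that are both empty are treated as identical.
    """
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        a = rec_a.get(field, "").casefold().strip()
        b = rec_b.get(field, "").casefold().strip()
        score += weight * jaro_winkler(a, b)
    return score / total


if __name__ == "__main__":
    # Illustrative records and weights only, not data from CYTBIN.
    weights = {"title": 0.6, "author": 0.3, "year": 0.1}
    a = {"title": "Deduplikacja rekordów", "author": "Kowalski, Jan", "year": "2017"}
    b = {"title": "Deduplikacja rekordow", "author": "Kowalski, J.", "year": "2017"}
    print(round(record_similarity(a, b, weights), 3))
```

In a proactive model such a score would be computed for a candidate record against existing records before insertion, while a reactive model would run the same comparison in batches. The decision threshold above which two records count as duplicates would have to be tuned for each database.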

Files in this item:

File	Description	Size	Format
Kaminska_miary_podobienstw_lancuchow_znakowych.pdf		700.46 kB	Adobe PDF

License: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Poland