论文原标题:BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering
说明:你提供的是论文 PDF 的长段落文本摘录(含摘要、引言、相关工作、架构流程、评测、结论与参考文献等)。下文为严格基于该摘录内容的中英文双栏对照;图注与表格按文本中出现的内容保留。若你希望把 PDF 全文(含未出现在摘录里的段落、脚注、版面信息)也做双栏对照,需要你再提供更完整的原文内容或允许我访问 PDF 全文。
标题与作者信息(对照)
English
中文
BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering
BookReconciler:用于元数据增补与作品层聚类的开源工具
Matthew Miller, Library of Congress, Washington, DC, USA. thisismattmiller@gmail.com
Matthew Miller,美国国会图书馆,华盛顿特区,美国。thisismattmiller@gmail.com
Dan Sinykin, Emory University, Atlanta, Georgia, USA. daniel.sinykin@emory.edu
Dan Sinykin,埃默里大学,亚特兰大,乔治亚州,美国。daniel.sinykin@emory.edu
Melanie Walsh, University of Washington Information School, Seattle, Washington, USA. 0000-0003-4558-3310
Melanie Walsh,华盛顿大学信息学院,西雅图,华盛顿州,美国。0000-0003-4558-3310
Abstract—We present BookReconciler, an open-source tool for enhancing and clustering book data. BookReconciler allows users to take spreadsheets with minimal metadata, such as book title and author, and automatically 1) add authoritative, persistent identifiers like ISBNs, and 2) cluster related Expressions and Manifestations of the same Work, e.g., different translations or editions. This enhancement makes it easier to combine related collections and analyze books at scale. The tool is currently designed as an extension for OpenRefine—a popular software application—and connects to major bibliographic services including the Library of Congress, VIAF, OCLC, HathiTrust, Google Books, and Wikidata. Our approach prioritizes human judgment. Through an interactive interface, users can manually evaluate matches and define the contours of a Work (e.g., to include translations or not). We evaluate reconciliation performance on datasets of U.S. prize-winning books and contemporary world fiction. BookReconciler achieves near-perfect accuracy for U.S. works but lower performance for global texts, reflecting structural weaknesses in bibliographic infrastructures for non-English and global literature. Overall, BookReconciler supports the reuse of bibliographic data across domains and applications, contributing to ongoing work in digital libraries and digital humanities.
摘要——我们提出 BookReconciler,一款用于增强与聚类图书数据的开源工具。BookReconciler 允许用户输入仅含极少元数据的电子表格(例如书名与作者),并自动完成两类任务:1)补充权威且可持久引用的标识符(如 ISBN);2)将同一“作品”(Work)的相关“表达”(Expression)与“体现”(Manifestation)进行聚类,例如不同译本或版本。这样的增强使得合并相关收藏与开展规模化图书分析更为容易。该工具目前设计为 OpenRefine(一个流行的软件应用)的扩展,并连接到主要的书目服务,包括美国国会图书馆、VIAF、OCLC、HathiTrust、Google Books 与 Wikidata。我们的方法优先强调人的判断。通过交互式界面,用户可以手动评估匹配结果,并定义一个作品(Work)的边界(例如是否将译本纳入)。我们在两类数据集上评估了对账(reconciliation)性能:美国获奖图书数据集与当代世界小说数据集。BookReconciler 在美国作品上实现近乎完美的准确率,但在全球文本上表现较低,这反映了非英语与全球文学在书目基础设施中的结构性薄弱。总体而言,BookReconciler 支持跨领域与跨应用复用书目数据,并为数字图书馆与数字人文学科的持续工作作出贡献。
Index Terms / 索引词(对照)
English
中文
Index Terms—bibliographic data, metadata, FRBR, digital humanities, reconciliation, linked data
索引词——书目数据、元数据、FRBR、数字人文学、对账、关联数据
I. INTRODUCTION / I. 引言(对照)
English
中文
In many settings, people work with only minimal bibliographic metadata, often just a book’s title and author—for example, “The Book of Salt” by “Monique Truong” (Fig. 1). Think of a humanities researcher compiling a list of prize-winning novels; an archivist stewarding an underdescribed collection; or a journalist assembling a dataset of banned books.
在许多情境中,人们手里只有极少的书目元数据,往往只有一本书的标题与作者——例如,“The Book of Salt”,作者“Monique Truong”(见图 1)。你可以想象:一位人文学研究者在编制获奖小说清单;一位档案员在管理一批描述不足的馆藏;或一位记者在整理被禁图书的数据集。
While basic information about these books may suffice for certain purposes, enriched metadata may be necessary for others. For example, if users want to analyze genre or time period; connect to other data sources or library systems; or identify related editions, they will need to add subject headings, publication dates, persistent identifiers, and more. What is the best way to enrich and cluster book metadata, especially at scale?
虽然这些图书的基本信息对某些用途已经足够,但其他用途可能需要更丰富的元数据。例如,若用户希望分析体裁或时代、连接其他数据源或图书馆系统、或识别相关版本,就需要补充主题词、出版日期、持久标识符等。那么,增补并聚类图书元数据(尤其是规模化地)的最佳方式是什么?
This challenging question has come to the fore in the digital humanities, where researchers increasingly curate and publish bibliographic data, and where they often focus on books at the most abstract Work level—in the sense of the Functional Requirements for Bibliographic Records (FRBR) model. Examples include datasets of major U.S. literary prize winners [1], bestselling novels [2], [3], anthologies of African American literature [4], and works of futuristic fiction [5].
这个棘手问题在数字人文学中变得格外突出。研究者日益策划并发布书目数据,而且他们常常在最抽象的“作品”(Work)层面上研究图书——即功能需求书目记录(FRBR)模型中的 Work 意义。相关的数据集例子包括:美国重要文学奖得主数据集[1]、畅销小说[2]、[3]、非裔美国文学选集[4]、未来主义小说作品[5]。
Despite their great scholarly value, these sorts of datasets remain difficult to build upon because they often include minimal and inconsistent metadata. While on an individual basis we can see that “The Book of Salt” by “Monique Truong” refers to the same entity as “The Book of Salt: A Novel” by “Truong, Monique,” such discrepancies quickly become unwieldy at scale and cannot be resolved even by using computational text similarity approaches.
尽管这类数据集具有显著学术价值,但它们往往只包含稀少且不一致的元数据,因此很难被他人继续利用与复用。个别情况下我们能看出,“The Book of Salt”(Monique Truong)与“The Book of Salt: A Novel”(Truong, Monique)指向同一实体;但在规模化时,这类差异会迅速变得难以处理,甚至无法仅靠计算性的文本相似度方法解决。
To address these challenges, we introduce BookReconciler, built as an extension for the widely used OpenRefine.
为应对这些挑战,我们提出 BookReconciler,它被构建为广泛使用的 OpenRefine 的扩展。
Figure 1 caption / 图 1 图注(对照)
English
中文
Fig. 1. A conceptual demonstration of the BookReconciler workflow. A user can submit a dataset with minimal bibliographic metadata, such as book title and author, and enrich the data with ISBNs, subject headings, VIAF identifiers, and more for related editions and formats—what we call a Work cluster. The tool can be used to reconcile sources from the Library of Congress, Google Books, OCLC, HathiTrust, Wikidata, and VIAF.
图 1:BookReconciler 工作流程的概念性演示。用户可提交仅含少量书目元数据的数据集(如书名与作者),并为相关版本与格式(我们称之为一个作品簇 Work cluster)增补 ISBN、主题词、VIAF 标识符等。该工具可用于对接美国国会图书馆、Google Books、OCLC、HathiTrust、Wikidata 与 VIAF。
II. BACKGROUND & RELATED WORK / II. 背景与相关工作(对照)
English
中文
In the digital humanities, new data collectives and …
III. OPENREFINE BACKGROUND AND MOTIVATION / III. OpenRefine 背景与动机(对照)
English
中文
The software application OpenRefine has been in active development since around 2008. This web-browser based tool is popular with information professionals and researchers for data cleaning and manipulation. The tool works on tabular data, so any dataset that can be represented as a spreadsheet can be inserted into its workflow.
OpenRefine 软件应用自约 2008 年起持续开发。这款基于网页浏览器的工具深受信息专业人员与研究者欢迎,常用于数据清洗与处理。它处理表格数据,因此只要数据集能表示为电子表格,就能纳入其工作流。
OpenRefine also provides a mechanism to add reconciliation services that follow the W3C Reconciliation Service API standard. By designing BookReconciler as an extension of OpenRefine, we can offer automated reconciliation with a non-technical, user-friendly interface that is familiar to many in the digital humanities, libraries, and elsewhere.
OpenRefine 还提供了一种机制,可添加遵循 W3C Reconciliation Service API 标准的对账服务。将 BookReconciler 设计为 OpenRefine 扩展后,我们就能用熟悉、非技术性的界面提供自动化对账,这对数字人文学、图书馆界以及其他场景的用户尤为重要。
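上述 W3C Reconciliation Service API 的查询/响应形态,可用如下最小示意来说明(字段名遵循该标准;服务返回的具体取值与 "author" 属性 id 均为示意性假设,并非 BookReconciler 的实际实现):

```python
import json

# 一个对账批次:每个键("q0"、"q1"……)对应待对账的一行。
# "query" 携带主列的值(书名);其余列作为 "properties" 传入。
queries = {
    "q0": {
        "query": "The Book of Salt",
        "properties": [
            {"pid": "author", "v": "Monique Truong"},  # 属性 id 为示意
        ],
    }
}

# OpenRefine 以名为 "queries" 的表单字段将其 POST 到服务地址。
payload = {"queries": json.dumps(queries)}

# 符合标准的服务按查询键返回打分后的候选项,形如:
response = {
    "q0": {
        "result": [
            {"id": "works/123", "name": "The Book of Salt: A Novel",
             "score": 92, "match": True},
        ]
    }
}

best = response["q0"]["result"][0]
print(best["name"], best["score"])
```

实际服务地址与候选项打分由各服务实现决定,这里只展示标准规定的 JSON 形态。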
IV. BOOKRECONCILER OVERVIEW / IV. BookReconciler 概览(对照)
English
中文
When reconciling resources, it is best to cast a wide net. This improves the chances of correctly matching a resource and aggregating more identifiers. BookReconciler supports data services including the Library of Congress, Google Books, VIAF, OCLC, Wikidata, and the HathiTrust Digital Library. We select these services because they are among the most widely used, authoritative, and interoperable sources of book metadata available today.
进行资源对账时,最好的策略是“撒大网”,这会提升正确匹配资源的机会,并聚合更多标识符。BookReconciler 支持的数据服务包括:美国国会图书馆、Google Books、VIAF、OCLC、Wikidata 与 HathiTrust 数字图书馆。之所以选择它们,是因为它们是当下最广泛使用、权威且互操作性强的图书元数据来源之一。
These systems vary widely in the types of metadata they store and return, and in their access. We summarize key characteristics of the supported data services as follows:
这些系统在存储与返回的元数据类型、以及访问方式上差异很大。我们概括其关键特征如下:
• Library of Congress (id.loc.gov): Public API access. Provides Work-level search. Works are narrowly scoped. Returns ISBN, LCCN, OCLC Numbers, LC Work URI, and other metadata such as Subject Headings and Genres.
• 美国国会图书馆(id.loc.gov):公开 API。提供作品层搜索。作品范围较窄。返回 ISBN、LCCN、OCLC 号、LC Work URI,以及主题词与体裁等其他元数据。
• Google Books: Public API access. Provides Manifestation-level search. Returns ISBN and other metadata such as Description, Language, and Page Count.
• Google Books:公开 API。提供体现层搜索。返回 ISBN 及简介、语言、页数等元数据。
• VIAF (viaf.org): Public API access. Provides cluster search for Works (Name/Title) and Personal Names. Works are broadly scoped. Returns VIAF Work Identifiers.
• VIAF(viaf.org):公开 API。提供作品(名称/题名)与个人名称的聚类检索。作品范围较宽。返回 VIAF 作品标识符。
• OCLC WorldCat Metadata: API key required. Provides Manifestation-level search, but includes a Work identifier to group related resources. Returns ISBN, OCLC numbers, LCCN, OCLC Work IDs, Dewey (DDC), and other metadata such as Subject Headings, Genres, and Language.
• OCLC WorldCat 元数据:需要 API key。提供体现层搜索,但包含作品标识符用于聚类相关资源。返回 ISBN、OCLC 号、LCCN、OCLC Work ID、杜威分类(DDC),以及主题词、体裁、语言等其他元数据。
• Wikidata: Public API and SPARQL endpoints provide Work-level search. Works are broadly scoped. Returns Work IDs and links to external identifiers for enrichment.
• Wikidata:公开 API 与 SPARQL 端点,提供作品层搜索。作品范围较宽。返回作品 ID 及外部标识符链接以用于增补。
• HathiTrust: No public API, but regular database dumps are available for local querying. Provides Manifestation-level search. Returns ISBN, OCLC, LCCN, HathiTrust Volume IDs, and other metadata such as Earliest Publication Date, Latest Publication Date, and Thumbnail Image.
• HathiTrust:无公开 API,但提供定期数据库转储以供本地查询。提供体现层搜索。返回 ISBN、OCLC 号、LCCN、HathiTrust 卷标识符,以及最早出版日期、最晚出版日期、缩略图等其他元数据。
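作为上文作品层检索的一个具体示意,下面构造一条针对 Wikidata 公开 SPARQL 端点的查询字符串(P31“隶属于”、P279“上位类”、P50“作者”与 Q7725634“文学作品”是真实的 Wikidata 标识符;整体查询写法是我们的示意草稿,并非论文工具的实际代码):

```python
def build_work_query(title: str, author: str) -> str:
    """按英文题名与作者标签查找 Wikidata 作品项的 SPARQL 查询(示意)。"""
    return f'''
SELECT ?work ?workLabel WHERE {{
  ?work wdt:P31/wdt:P279* wd:Q7725634 ;   # 文学作品(含其子类)
        rdfs:label "{title}"@en ;
        wdt:P50 ?person .                  # P50 = 作者
  ?person rdfs:label "{author}"@en .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}'''

query = build_work_query("The Book of Salt", "Monique Truong")
print(query)
```

实际使用时可将该字符串提交到 https://query.wikidata.org/sparql;此处只构造查询文本,不发起网络请求。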
For all data services, reconciliation begins with a query—book title or author information—to return a matching result set. The tool attempts to cluster together resources, from the result set, that belong to the same Work or author. Clustering is enabled by default, but users can configure the tool to reconcile only a single best match. This is useful in cases where precise matching is required, such as reconciling an exact list of publications from a specific year.
对所有数据服务而言,对账都始于一次查询(书名或作者信息),以返回匹配结果集。工具会尝试把结果集中属于同一作品或同一作者的资源聚为一组。聚类默认开启,但用户也可将工具配置为仅对账单个最佳匹配。这在需要精确匹配的场景中很有用,例如对账某一特定年份的精确出版物清单。
A key limitation of this tool is its reliance on external APIs. Since we do not have access to the full underlying databases, reconciliation is limited to the records returned in each API response.
该工具的关键限制在于依赖外部 API。由于我们无法访问底层完整数据库,对账只能在各 API 响应返回的记录范围内进行。
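上文“撒大网”式的多源查询与标识符聚合流程,可用如下示意勾勒(其中的桩服务、返回字段与标识符取值均为演示用的假设占位,并非真实数据):

```python
# 两个桩服务,模拟不同数据源返回的记录形态(字段为示意)。
def stub_loc(title, author):
    return [
        {"lccn": "LCCN-EXAMPLE-1", "isbn": ["ISBN-A"]},
        {"lccn": "LCCN-EXAMPLE-2", "isbn": ["ISBN-B"]},
    ]

def stub_google_books(title, author):
    return [{"isbn": ["ISBN-C"]}]

SERVICES = {"loc": stub_loc, "google_books": stub_google_books}

def reconcile(title, author, cluster=True):
    """跨服务聚合标识符;cluster=False 时每个服务只保留单个最佳匹配。"""
    merged = {"isbn": set(), "lccn": set()}
    for name, search in SERVICES.items():
        results = search(title, author)
        if not cluster:
            results = results[:1]  # 仅取排名第一的结果
        for rec in results:
            merged["isbn"].update(rec.get("isbn", []))
            if "lccn" in rec:
                merged["lccn"].add(rec["lccn"])
    return merged

ids = reconcile("The Book of Salt", "Monique Truong")
print(sorted(ids["isbn"]))
```

这说明了多源对账的补位效果:任何单一来源缺失的标识符,可由其他来源补齐;而论文指出的限制也在此可见——能聚合到什么,取决于各 API 返回了什么。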
V. ARCHITECTURE & WORKFLOW / V. 架构与工作流(对照)
English
中文
To begin reconciliation, the user first selects the column they wish to reconcile in OpenRefine—for example, the “title” column for books. Next, they select a reconciliation service, such as OCLC, Google Books, or HathiTrust. Then, they choose any additional columns—such as author/contributor name or publication date—to add as additional “Properties,” which can improve match accuracy. Finally, the user launches the reconciliation process with a single click.
开始对账时,用户首先在 OpenRefine 中选择要对账的列——例如图书的“title”列。接着选择一个对账服务,如 OCLC、Google Books 或 HathiTrust。然后可以选择额外的列(如作者/贡献者姓名或出版日期)作为附加“属性”(Properties),以提升匹配准确率。最后,用户一键启动对账流程。
BookReconciler normalizes the submitted metadata and queries the selected API or data source to retrieve candidate matches. It ranks these potential matches using Levenshtein distance (computed after tokenizing the text and sorting the tokens alphabetically), selecting the most likely result for each row. The Levenshtein score is a similarity ratio from 0 to 100; the acceptance threshold can be customized by the user and is set at 80 by default. If the service provides native Work identifiers (as in the case of the OCLC WorldCat Metadata API), the tool uses the top-ranked result’s identifier to cluster together additional resources that share the same Work value.
BookReconciler 会规范化提交的元数据,并查询所选 API 或数据源以检索候选匹配。它使用 Levenshtein 距离(对文本进行分词并按字母排序后计算)对潜在匹配排序,为每行选择最可能的结果。Levenshtein 度量是 0 到 100 的比例值,用户可自定义阈值,默认设为 80。若服务提供原生作品标识符(例如 OCLC WorldCat 元数据 API),工具会用排名第一的结果的作品标识符,将共享同一作品值的其他资源聚类起来。
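上述排序步骤(分词、按字母排序、计算 0–100 的 Levenshtein 相似度比例,默认阈值 80)可作如下最小示意;论文只描述了方法,下面的实现是我们自己的草稿:

```python
import re

def levenshtein(a: str, b: str) -> int:
    """经典动态规划编辑距离。"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # 删除
                           cur[j - 1] + 1,            # 插入
                           prev[j - 1] + (ca != cb))) # 替换
        prev = cur
    return prev[-1]

def token_sort_ratio(a: str, b: str) -> float:
    """去标点、分词并按字母排序后,计算 0–100 的相似度比例。"""
    norm = lambda s: " ".join(sorted(re.sub(r"[^\w\s]", " ", s).lower().split()))
    a, b = norm(a), norm(b)
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))

score = token_sort_ratio("Truong, Monique", "Monique Truong")
print(score >= 80)  # 默认阈值 80:两种作者名写法被判为同一实体
```

分词加字母排序使 “Truong, Monique” 与 “Monique Truong” 这类词序差异在打分前即被消解,这正是该排序策略的意义所在。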
The tool provides an interactive web interface that enables users to inspect and manually curate matches and Work clusters. Users can hover over any reconciled cell to see a preview of matched resources, and can click to explore a more detailed view (in a separate Flask-based interface). For example, a user reconciling The Book of Salt may choose to include or exclude translations from its Work cluster, depending on their goals. Additionally, users can navigate to the original source metadata, offering full transparency and provenance.
该工具提供交互式 Web 界面,便于用户检查并手动策划匹配与作品簇。用户把鼠标悬停在任何已对账单元格上即可看到匹配资源预览,点击后还能在更详细视图中探索(在一个基于 Flask 的独立界面中)。例如,用户对账《The Book of Salt》时,可根据自己的研究目标选择是否将译本纳入作品簇。此外,用户还能跳转到原始来源元数据,获得完全透明的来源与溯源信息。
This workflow balances automation with human oversight. Users maintain control over how Works are defined and clustered, which is particularly important given the diversity of bibliographic practices and the imperfections of even well-structured metadata systems.
这一工作流在自动化与人工监督之间取得平衡。用户始终掌控作品(Work)如何被定义与聚类;考虑到书目实践的多样性,以及即便结构良好的元数据系统也存在缺陷,这一点尤为重要。
Once reconciliation is complete, users can import additional fields such as ISBNs, genres, subject headings, or descriptions using OpenRefine’s “Data Extension” service. When fields contain multiple values (e.g., multiple ISBNs), the tool provides configuration options: values can be joined into a single cell with a delimiter, or exploded into multiple rows.
对账完成后,用户可通过 OpenRefine 的“Data Extension”服务导入额外字段,如 ISBN、体裁、主题词或描述。当字段包含多个值(例如多个 ISBN)时,工具提供配置选项:既可用分隔符将多个值合并进单个单元格,也可将其展开为多行。
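上文提到的多值字段两种处理方式(用分隔符合并进单元格,或展开为多行)可作如下示意;行结构与字段名为演示用假设,并非 OpenRefine 的内部表示:

```python
def join_values(row: dict, field: str, delimiter: str = "|") -> dict:
    """将多值字段合并为单个带分隔符的单元格。"""
    out = dict(row)
    out[field] = delimiter.join(row[field])
    return out

def explode_values(row: dict, field: str) -> list:
    """将一行按多值字段展开为多行,其余字段原样复制。"""
    return [dict(row, **{field: v}) for v in row[field]]

row = {"title": "The Book of Salt", "isbn": ["ISBN-A", "ISBN-B"]}
print(join_values(row, "isbn")["isbn"])  # ISBN-A|ISBN-B
print(len(explode_values(row, "isbn")))  # 2
```

合并保持行数不变、便于人工浏览;展开则便于后续逐标识符的程序化处理。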
VI. EVALUATION / VI. 评测(对照)
English
中文
We evaluate BookReconciler on two datasets: books that won “major” (more than $10,000) U.S. prizes between 1918 and 2020 (n = 691) [1] and contemporary world fiction published between 2012 and 2023 (n = 1,139) [6]. The prize winners include novels and poetry, as well as collections of essays and short stories. Some of the books are now canonical, but others are much more obscure. Poetry makes up 37% of the total. The world fiction draws from 13 countries and 9 languages.
我们在两个数据集上评测 BookReconciler:1918 至 2020 年间获得“重要”(奖金超过 1 万美元)美国奖项的图书(n = 691)[1],以及 2012 至 2023 年间出版的当代世界小说(n = 1,139)[6]。获奖图书包括长篇小说与诗歌,以及随笔集与短篇小说集。其中一些图书如今已成经典,另一些则相当冷门。诗歌占总数的 37%。世界小说数据来自 13 个国家、9 种语言。
We attempt to reconcile each book with each bibliographic service. To provide a baseline comparison, we also test the general Wikidata reconciliation included in OpenRefine by default. We pass in the title and the author’s full name (not standardized) as an additional property.
我们尝试用每个书目服务对每本书进行对账。为提供基线比较,我们还测试了 OpenRefine 默认内置的通用 Wikidata 对账。我们传入书名,并将作者全名(未标准化)作为附加属性。
For the U.S. dataset, BookReconciler correctly matches 98% of titles with Google Books and 99% when using all services. We find that performance degrades with poetry, and that variations in author name representation (e.g. “W.S. Merwin” vs “William Stanley Merwin”) can also degrade matching quality depending on the service. On contemporary world literature, the highest accuracy (Google Books) drops to 63%, with very low performance among other services. These results indicate that genre, metadata formatting, language, and region are significant contributing factors to reconciliation performance.
对美国数据集而言,BookReconciler 使用 Google Books 正确匹配 98% 的标题;使用所有服务时达到 99%。我们发现:加入诗歌会降低表现;作者名表示的差异(例如 “W.S. Merwin” 与 “William Stanley Merwin”)也会因服务不同而降低匹配质量。对当代世界文学,最高准确率(Google Books)降至 63%,而其他服务表现很低。这些结果表明:体裁、元数据格式、语言与地域是影响对账性能的重要因素。
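上文的各服务准确率,原则上可按如下方式由人工核对的金标准标签计算得到(样例数据为虚构的演示值,并非论文的评测数据):

```python
def accuracy(predictions: list, gold: list) -> float:
    """逐行比较对账结果与金标准标识符,返回百分比准确率。"""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

gold = ["work-1", "work-2", "work-3", "work-4"]
predicted = ["work-1", "work-2", "work-X", "work-4"]
print(f"{accuracy(predicted, gold):.0f}%")  # 75%
```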
VII. AVAILABILITY / VII. 可用性(对照)
English
中文
We release BookReconciler under an open-source license (MIT), allowing researchers, librarians, and developers to freely use, adapt, and extend the tool. The tool is currently available on GitHub and can be installed as a one-click application or with Docker. We also make available a video tutorial and walk-through demonstration [13]. In the near term, maintenance will be supported by the Post45 Data Collective. Long-term sustainability and new development will require broader community contributions and/or external funding.
我们以 MIT 开源许可证发布 BookReconciler,允许研究者、图书馆员与开发者自由使用、改编与扩展。工具目前在 GitHub 上可用,并可作为一键应用或通过 Docker 安装。我们也提供视频教程与逐步演示[13]。短期维护将由 Post45 Data Collective 支持;长期可持续性与新开发需要更广泛的社区贡献和/或外部资金支持。
VIII. CONCLUSION & FUTURE WORK / VIII. 结论与未来工作(对照)
English
中文
We introduce BookReconciler, an open-source reconciliation tool that extends OpenRefine to support metadata enrichment and clustering. Our evaluation shows that the tool achieves near-perfect accuracy on U.S. prize literature (1918-2020) but performs less well on contemporary world literature (2012-2023). Progress in this area will require major authority services to improve multilingual coverage. The tool would also benefit from integrating additional international authority services, such as data.bnf.fr (France), Trove (Australia), NDL Linked Open Data (Japan), and others. Looking ahead, we see potential in using large language models as an additional layer to assess ambiguous matches, provided they are used thoughtfully and in combination with human judgment.
我们介绍 BookReconciler:一个扩展 OpenRefine 的开源对账工具,用于元数据增补与作品层聚类。评测显示,它在美国获奖文学(1918-2020)上达到近乎完美的准确率,但在当代世界文学(2012-2023)上表现较差。要在这一方向取得进展,需要主要权威服务改进多语言覆盖。该工具也将受益于集成更多国际权威服务,如 data.bnf.fr(法国)、Trove(澳大利亚)、NDL Linked Open Data(日本)等。展望未来,我们看到将大语言模型作为额外一层来评估模糊匹配的潜力,但前提是谨慎使用,并与人工判断结合。
REFERENCES / 参考文献
[1] C. Grossman, J. Spahr, and S. Young, “The Index of Major Literary Prizes in the US,” Post45 Data Collective, Dec. 2022. Available: https://doi.org/10.18737/CNJV1733p4520221212
[2] J. Pruett, “New York Times Hardcover Fiction Bestsellers (1931–2020),” Post45 Data Collective, Feb. 2022. Available: https://doi.org/10.18737/CNJV1733p4520220211
[3] S. DiLeonardi, B. Cohen, and D. Sinykin, “International Bestsellers: The Dataset,” Post45 Data Collective, Jul. 2025. Available: https://doi.org/10.18737/386521
[4] A. E. Earhart, “DALA, The Database of African American and Predominantly White American Literature Anthologies,” Journal of Open Humanities Data, vol. 11, no. 1, Apr. 2025. Available: https://doi.org/10.5334/johd.298
[5] G. Wythoff and T. Leane, “Time Horizons of Futuristic Fiction,” Post45 Data Collective, Jun. 2025. Available: https://data.post45.org/posts/futuristic-fiction/
[6] A. Piper et al., “Mini Worldlit: A Dataset of Contemporary Fiction from 13 Countries, Nine Languages, and Five Continents,” Journal of Open Humanities Data, vol. 11, no. 1, Jan. 2025. Available: https://doi.org/10.5334/johd.248
[7] B. Tillett, “What is FRBR? A conceptual model for the bibliographic universe,” The Australian Library Journal, vol. 54, no. 1, pp. 24–30, Feb. 2005. Available: https://doi.org/10.1080/00049670.2005.10721710
[8] IFLA, Functional Requirements for Bibliographic Records. De Gruyter Saur, 1998, vol. 19.
[9] K. Coyle, “FRBR, Twenty Years On,” Cataloging & Classification Quarterly, vol. 53, no. 3–4, pp. 265–285, May 2015.
[10] R. Bennett, B. F. Lavoie, and E. T. O’Neill, “The concept of a work in WorldCat: an application of FRBR,” Library Collections, Acquisitions, & Technical Services, vol. 27, no. 1, pp. 45–59, Mar. 2003. Available: https://doi.org/10.1080/14649055.2003.10765895
[11] D. Vizine-Goetz, “Classify: a FRBR-based research prototype for applying classification numbers,” OCLC NextSpace, vol. 14, pp. 14–15, 2010.
[12] Library of Congress, “BIBFRAME - Bibliographic Framework Initiative (Library of Congress).” Available: https://www.loc.gov/bibframe/
[13] M. Miller, “BookReconciler demo video,” Sep. 2025. Available: https://www.youtube.com/watch?v=V9ZJoFowRJM
需要提醒的是:论文也坦率承认了外部 API 依赖带来的上限——你对账到什么程度,取决于 API 能返回什么、覆盖了什么、以及你是否能拿到密钥(例如 OCLC)。这意味着在实际使用中,研究者需要把“工具性能”与“数据源结构”区分开来评估,并在跨语种研究中更早准备补充策略(例如引入更多国家级权威服务、或者本地化索引与转储数据)。
三、问答(提出10个相关问题,并解答)
1)BookReconciler 的核心创新是什么? 它的核心创新不是提出新理论,而是把“元数据增补 + 作品层聚类”做成一个 OpenRefine 扩展:输入极简表格(书名、作者),输出权威标识符与可扩展字段,并提供交互式作品簇审查界面,让用户以研究目的为准定义 Work 的边界。这让数字人文常见的“从名单到可分析数据集”的过程显著降本提速。
2)它为什么要接入多个书目服务,而不是只用一个(比如 Google Books)? 因为“撒大网”能同时提升召回与增补字段的丰富度:不同服务覆盖的作品、语言、字段类型差异很大。单一来源可能对某类书很好,但缺少另一类书的权威 ID 或主题词。多源对账还能在某个服务缺失时由其他服务补位,并汇聚更多标识符,增强互操作性。
3)论文里说的“作品(Work)层聚类”具体指什么? 指把同一智识作品的多个体现或表达聚为一组,比如不同版本、不同译本、不同再版。在 FRBR 术语里,这是把 Manifestation/Expression 归入同一 Work 的过程。论文强调 Work 的边界并非统一标准,因此需要用户在界面里决定是否纳入译本等。