BHL Participates in the Global Names Workshop – Biodiversity Heritage Library
Participants at the Global Names Project workshop discuss progress in a morning “stand up” briefing. Photo by Deborah Paul (iDigBio).
The Global Names Project held a workshop on 17-19 June 2019 on the campus of the University of Illinois at Urbana-Champaign. The workshop was titled “Scientific names indexing and data mobilization of Biodiversity Heritage Library using tools from Global Names project” and was hosted by the Species File Group at the Illinois Natural History Survey. Eighteen people attended, representing a variety of organizations interested in BHL content: Global Names Architecture, iDigBio, TaxonWorks, UIUC Species File Group, the Illinois Library, Encyclopedia of Life, the DINA Project, the Catalogue of Life, GBIF, Species File Group Argentina, the HathiTrust Research Center, and Global Biotic Interactions.
The workshop was organized as an unconference/hackathon, in which the meeting is planned by all participants at the workshop. We each began by proposing topics we were individually interested in exploring; these were our “selfish goals”. In an exercise at the workshop, those goals were grouped into similar or related topics. The most popular topics (see the sticky notes on the wall in the background of the photo) became the focus of “pitches”, i.e. challenges that we could address at the workshop. We self-organized into working groups, one under the banner of each pitch, and got to work.
Note that at a hackathon, the goal is that you are always either “doing or learning.” For example, some of us learned how to mine BHL content using the Developer and Data Tools. If you’d like to try it, you too can install and use the gnfinder and gnparser tools. The gnparser tool breaks scientific name-strings into their semantic elements, while gnfinder searches text output (such as OCR) for names.
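To give a rough sense of what “breaking a name-string into semantic elements” means, here is a minimal Python sketch. It is not gnparser and does not use its API; the pattern and the function name `split_name_string` are hypothetical, and the sketch only handles simple binomials with an optional author and year.

```python
import re

# Hypothetical, simplified pattern: "Genus species [Authorship][, Year]".
# Real parsers such as gnparser cover many more cases (hybrids, infraspecific
# ranks, parenthesized authorship, and so on); this is only an illustration.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<species>[a-z]+)"
    r"(?:\s+(?P<authorship>[^,\d]+?))?"
    r"(?:,?\s*(?P<year>\d{4}))?$"
)


def split_name_string(name_string):
    """Split a simple binomial name-string into labeled elements, or return None."""
    match = NAME_PATTERN.match(name_string.strip())
    if match is None:
        return None
    # Keep only the elements that were actually present in the string.
    return {key: value for key, value in match.groupdict().items() if value}


if __name__ == "__main__":
    print(split_name_string("Homo sapiens Linnaeus, 1758"))
```

For real work, the gnparser tool itself is the better choice, since scientific name-strings vary far more than a single regular expression can capture.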
Overall, the activities of the workshop centered around further improving the information that we can extract from the OCR (optical character recognition) content that is generated from the page images in BHL, including improving that OCR content itself.
One group focused on attempting to find Species Identification Keys in BHL. Using a versioned, citable, and verifiable snapshot of the BHL OCR text corpus [1], the group discovered that the variety of ways in which a species identification key is labeled in the text, combined with the natural inaccuracies of OCR, makes the task of identifying a heading for a key challenging [2].
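One way to picture the difficulty is to try matching OCR lines against a short list of heading labels while tolerating a few character errors. The sketch below is not the group's method; the label list, the threshold, and the function name `looks_like_key_heading` are all assumptions made for illustration, and it uses only Python's standard `difflib` module.

```python
import difflib

# Hypothetical label variants; the labels actually used in BHL texts vary widely.
KEY_LABELS = [
    "key to the species",
    "key to species",
    "key to the genera",
    "identification key",
]


def looks_like_key_heading(line, threshold=0.85):
    """Return True if an OCR line starts with something close to a known label."""
    # Normalize case and whitespace, and drop surrounding punctuation.
    normalized = " ".join(line.lower().split()).strip(" .:;-")
    for label in KEY_LABELS:
        # Compare only the leading part of the line, since a heading often
        # continues with a taxon name (e.g. "Key to the species of Lucia").
        prefix = normalized[: len(label)]
        ratio = difflib.SequenceMatcher(None, prefix, label).ratio()
        if ratio >= threshold:
            return True
    return False


if __name__ == "__main__":
    for sample in ["KEY TO THE SPECIES OF LUCIA", "Kev to the Specics"]:
        print(sample, looks_like_key_heading(sample))
```

A fixed label list with a similarity threshold will miss keys that carry unusual headings or no heading at all, which is consistent with the challenge the group describes.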
Another group worked on connecting the APIs of TaxonWorks, Global Names, and BHL. Their goal was to integrate information and resources from all three in a single interface that highlighted the BHL pages that species were originally described on. This group managed to wrap all three APIs in a single place (a “Task” in TaxonWorks), but problems with matching citation data across platforms prevented them from truly “closing the loop”.
Finally, the largest group focused on extracting different entities from the OCR content of the BHL, for example geographic names, people names, and organizations. This group experimented with a variety of natural language techniques and tools, including the Edinburgh Geoparser, IBM Watson, Microsoft Azure Cognitive Services, and LingPipe, and identified some additional challenges to extracting such entities from BHL. Not surprisingly, there is some overlap between place names and taxon names. For example, “St. Lucia” can be conflated with the genus “Lucia” (a type of butterfly), which certainly adds a hurdle for accurate entity identification.
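The “St. Lucia” versus “Lucia” overlap can be illustrated with a small context check. This sketch is not taken from any of the tools named above; the prefix list and the function name `classify_mentions` are hypothetical, and the rule (look at the word just before the candidate name) is only one simple heuristic among many.

```python
import re

# Hypothetical rule: a name preceded by a place-style prefix such as "St." or
# "Mount" is treated as part of a place name rather than as a genus.
PLACE_PREFIX = re.compile(r"\b(?:St\.?|Saint|Mt\.?|Mount|Lake|Cape)\s+$")


def classify_mentions(text, genus):
    """Label each occurrence of `genus` in `text` as a 'place' or 'taxon' candidate."""
    labels = []
    for match in re.finditer(r"\b" + re.escape(genus) + r"\b", text):
        # Inspect the text immediately before this occurrence.
        preceding = text[: match.start()]
        kind = "place" if PLACE_PREFIX.search(preceding) else "taxon"
        labels.append((match.start(), kind))
    return labels


if __name__ == "__main__":
    sample = "Collected on St. Lucia; Lucia limbaria was not seen."
    print(classify_mentions(sample, "Lucia"))
```

A rule like this only helps when the OCR preserves the prefix and its punctuation, so it would complement, not replace, the entity extraction tools the group tried.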
The results of the workshop are being integrated into a wiki that contains our initial goals and invites other stakeholders to get involved. One direct outcome of the workshop is that the BHL will move to provide quarterly exports of the OCR, available to anyone, to mine and experiment with. Previously, this content was not easily downloadable. The workshop discussions and hacking drove home the point that this corpus is a key element for future developments. Many other broader topics were also raised throughout the meeting. In particular, we explored the idea of opening a worldwide biodiversity informatics channel to facilitate communication and share ideas among interested parties in real time. This could be done using Slack.
Many thanks to Global Names and the Illinois Natural History Survey for hosting, and especially to Dima Mozzherin for all of his work on the Global Names name-finding algorithm, which has opened the door to moving BHL’s content into the next decade.
References
[1] Poelen, Jorrit H. (2019). A biodiversity dataset graph: Biodiversity Heritage Library (BHL) (Version 0.0.1) [Data set]. Zenodo.
[2] Poelen, Jorrit H., Schulz, Katja, Trei, Kelli J., & Rees, Jonathan A. (2019, July 10). Finding Identification of Keys in the Biodiversity Heritage Library (Version 1.1). Zenodo.
Data Mining
Global Names
technical development
technical team
Workshops
July 15, 2019
Written by Joel Richard, Matt Yoder, Deborah Paul, Jorrit Poelen, and Mike Lichtenberg
Joel is the Technical Coordinator for BHL. When he's not serving as the Head of Web and IT for the Smithsonian Libraries and Archives, he's also working on BHL's Macaw software.
Matt Yoder is a Biological Informatician at the Illinois Natural History Survey at the University of Illinois at Urbana-Champaign.
Debbie Paul is the Digitization and Workforce Development Manager at iDigBio.
Jorrit Poelen, an independent software engineer, lives and works in Oakland, where he uses frugal methods to link and preserve biodiversity data.
Mike Lichtenberg is the Lead Software Engineer for the Biodiversity Heritage Library.