Building up new collections for digital libraries is a demanding task. Available data sets have to be extracted, which is usually done with the help of software developers because it involves custom data handlers or conversion scripts. In cases where the desired data is only available on the data provider's website, custom web scrapers are needed. This may be the case for small to medium-size publishers, research institutes, or funding agencies. As data curation is a typical task done by people with a library and information science background, these people are usually proficient with XML technologies but are not full-stack programmers. Therefore we would like to present a web scraping tool that does not require digital library curators to program custom web scrapers from scratch. We present the open-source tool OXPath, an extension of XPath, that allows the user to define data to be extracted from websites in a declarative way. By taking one of our own use cases as an example, we guide you in more detail through the process of creating an OXPath wrapper for metadata harvesting. We also point out some practical things to consider when creating a web scraper (with OXPath). On top of that, we present a syntax highlighting plugin for the popular text editor Atom that we developed to further support OXPath users and to simplify the authoring process.

By Mandy Neumann, Jan Steinberg, and Philipp Schaer

1. Introduction and Motivation

Building up new collections for digital libraries is an expensive and demanding task. Not only do digital content curators need to assess many different data sources intellectually, they also need to invest a lot of time and effort to extract the available data sets. Usually this is done by coding custom data handlers or conversion scripts in languages like Perl or Python. While this might be a trivial task for programmers, librarians and content curators are likely to be overwhelmed by such a task, its complexity, and its pitfalls. One of the largest digital libraries leading the way in digitizing this data extraction process is the dblp computer science bibliography, which built its process chain to rely heavily on automatic metadata extraction from many different sources. Ley (2009) gave an excellent overview of, and insight into, the traps one might fall into.

Other examples are disciplinary open access repositories like the Social Science Open Access Repository (SSOAR), which gather available full text items from different partner organizations such as publishers, research institutes, and individuals. While some of these partners or content providers are technically and organizationally able to provide a clean set of parsable metadata, many do not have the technical manpower needed to prepare these metadata sets. Ironically, many small and medium-size publishers do have a web page or an online catalogue. The same is true for many research institutes and funding agencies. Although managing and preparing high quality metadata is not their core business, they are still able to present their content online in a rather structured form.

Data curation is a typical task done by people with a library and information science background. These people are well aware of markup languages like XML or HTML and may know a little about DOMs or about processing XML resources with XPath or XSLT. But to code a full-blown web scraper for various sites, they need the support of professional programmers. To bypass this involvement of software developers and to empower librarians and documentalists, web scraping toolkits are an excellent way to gather content without diving deep into programming issues like development environments or dependency management. This is an opportunity we would like to make use of: we present a web scraping tool that does not require digital library curators to program custom web scrapers from scratch, but instead lets them use a powerful yet lightweight toolkit that does not force them to learn to program.