Adding Devon content

This new DA website shouldn’t be just for existing members, nor should it be solely about matters of relevance to the Association. I plan to expand its content to make it a useful web resource for anyone who’s interested in the county of Devon – at least the aspects that fall within our remit. Of course, persuading more people to visit our website will benefit us too: it will increase recognition of the Association, and we can anticipate that a few of the new visitors will want to become members.

Part of title page of the 1906 volume of Transactions

Historically, we have published very little content about Devon on our website – mainly the abstracts of papers published in Transactions since 1999 and a series of PDF reports produced by the Buildings Section since 1996. We also included on the website a PDF copy of the latest issue of DA News, in which we published the annual specialist reports about the locations of our annual Conferences. I’ve included these reports back to 2011 (Axminster).

We do, however, have access to a vast quantity of information about Devon: the 2,500 (and counting) papers and a large number of Section reports that together make up our series of Transactions, of which the 150th volume will be published late in 2018. The early volumes (up to 1920) are considered to be out of copyright and most of them have been scanned and OCR’ed, notably by Google as part of their Google Books project. These are available online via the Internet Archive and we have a list of them here which includes links to the scanned copies.

A typical poor scan showing shadow of finger

Although these early Transactions are 100 to 150 years old, some of their content is still of value to researchers and scholars. Whilst they can be read online or downloaded as PDF files, the scanning process is imperfect, so not all of the content is always present. For example, some pages are scanned poorly (I’ve seen several intruding fingers!) and some of the original documents are themselves faulty, with poorly printed, torn or missing pages. Illustrations are often not reproduced well, and fold-out pages were usually left folded during scanning, so they are only partly present or omitted completely; where they overlap a page of print, only a narrow strip of the text beneath is scanned.

On top of these problems, the subsequent optical character recognition (OCR) process, which converts the scanned images into text and makes the content searchable, is also prone to errors. Poor scans upset it completely. Manuscript annotations, underlining, foxing, etc. cause problems, and if errors occur in uncommon words – the words that researchers will want to search for – then just one error can mean that an entire relevant article is overlooked.
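To illustrate how a single misread letter can hide an article from a search, here is a minimal Python sketch. The sample sentence, the misreading of "kistvaen" as "kistvacn", and the function names are all invented for illustration; this is not how the Internet Archive or our own site search works. It contrasts an exact word search with an approximate one built on the standard library's difflib module.

```python
import difflib
import re


def exact_hit(text, term):
    """Return True if `term` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def fuzzy_hits(text, term, cutoff=0.8):
    """Return the distinct words in `text` that closely resemble `term`."""
    words = sorted(set(re.findall(r"[A-Za-z]+", text.lower())))
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)


# Hypothetical OCR output: the "e" in "kistvaen" has been misread as "c".
ocr_text = "A kistvacn was found on the moor near Chagford"
corrected_text = "A kistvaen was found on the moor near Chagford"

print(exact_hit(ocr_text, "kistvaen"))        # the exact search misses the article
print(fuzzy_hits(ocr_text, "kistvaen"))       # the approximate search finds the misread word
print(exact_hit(corrected_text, "kistvaen"))  # after proofreading, the exact search works
```

Approximate matching of this kind can soften the problem, but ordinary search engines generally look for the words as typed, which is why correcting the text itself is the more dependable remedy.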

We should be grateful to those organisations that have made this content freely available to all, but we can do better! By comparing the OCR’ed text against the original scan (or, if necessary, the original document) and correcting the text to match, we can ensure that nothing in the originals is missed. Rescanning significant illustrations will be useful too. I’ve made a tiny start by completing the process for a few early papers: see the Historical content section on the Devon page. The corrected text will be indexed by the internet search engines and will appear in search results. We can search within the site too (click on the magnifying glass at the top right of every page).

I intend to continue proofreading old content on an ad hoc basis – if I did one paper a day it would only take about 8 years to do them all! So if anyone wants to help, do let me know (email). Problems like scanned fingers are not frequent, and a typical OCR’ed article takes maybe an hour or two to correct. Although not seen by some as an enthralling task, one does learn new things (maybe leading to new avenues to explore), and I think it’s a more fulfilling use of time than watching another repeat of Midsomer Murders…

If you have any ideas as to which types of content are most likely to be of use to today’s researchers, please let me know so that the proofreading can be directed appropriately. I believe that reports on flora and fauna may be particularly welcome, as indicators of the changing environment.

More on this subject will follow.

Website Admin
(updated after go-live)

1 Response

  1. admin says:

    I’ve just come across a section of the important British History Online website that explains that most of their content (transcriptions of old documents) is produced either by double re-keying or by manually checking and correcting OCR’ed documents, and it has led to an estimated accuracy of greater than 99.995%. See for details.
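
    The idea behind double re-keying is that two people type the same document independently and the two versions are then compared, with every disagreement checked against the original. As a rough sketch only (the function name and sample sentences are invented, and this is not British History Online's actual process), the comparison step might look like this in Python:

```python
import difflib


def keying_disagreements(first, second):
    """Return (first_version, second_version) pairs where two keyings differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the stretches where the typists disagree
    ]


# Two hypothetical keyings of the same sentence.
typist_one = "The church at Widecombe was struck by lightning in 1638"
typist_two = "The church at Widecombe was struck by lightening in 1638"

for pair in keying_disagreements(typist_one, typist_two):
    print(pair)  # each pair needs checking against the original page
```

    The method relies on two typists being unlikely to make the same mistake in the same place, so anything that slips through must have been mistyped identically by both.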