

Optical Character Recognition (OCR) is an essential resource for cultural heritage institutes working to make their text content available to users. OCR in this regard is, in its simplest terms, the process of converting digital scans of historical documents into full text. It then allows for full-text search, making material much easier for users to explore and helping them find what they are looking for. Yet despite how far OCR has come, institutes still face many challenges, including non-Latin alphabets and article detection. With this issue of EuropeanaTech Insight we highlight three use cases related to OCR and show the types of problems different organizations encounter and how they work to solve them.

The inspiration for this issue of EuropeanaTech Insight came from the 2019 DATeCH conference, attended by EuropeanaTech Steering Group chair Clemens Neudecker. The event was an incredible showcase of projects working to improve the creation, transformation and exploitation of historical documents in digital form.

For Europeana, OCR has been integral, perhaps most visibly in the Europeana Newspapers and DM2E (Digital Manuscripts to Europeana) projects. Both projects delivered millions of text records to Europeana, and each encountered many OCR-related challenges, including simply understanding how accurate the automated OCR actually is. If the OCR is inaccurate, additional features such as Named Entity Recognition become increasingly difficult. But the consortium has continued to grow and improve on the initial work done by the Europeana Newspapers project, resulting in a new presentation of this material online. Just over a month ago, the new Europeana Newspapers Collection was published on Europeana. Users can explore millions of pages from hundreds of newspapers from around Europe. Most of the texts have been OCRed, making it very easy for users to search and browse through the portal, or for researchers to get full-text extractions via the API.

We hope you find the articles below insightful. If you or your institute are working on OCR or on the presentation of and access to textual resources, let us know.

The digitisation of books, newspapers, journals and so forth must no longer stop with the scanning of said documents. Nowadays users and researchers demand access to the full text, not just for search and retrieval but increasingly also for purposes of text and data mining. However, Optical Character Recognition (OCR), the process of having software automatically detect and recognize text from an image, is still a complex and error-prone task, especially for historical documents with their idiosyncrasies and wide variability of font, layout, language and orthography. In the past, various projects and initiatives have worked on improving OCR for historical documents, e.g. the METAe and IMPACT projects in the EU or the eMOP project in the US.
