
Browser Screenshot Comparison Tool


The browser-shots tool is developed by Internet Memory in the context of the SCAPE project, as part of the preservation and watch (PW) sub-project. The goal of this tool is to perform automatic visual comparisons in order to detect rendering issues in archived Web pages.

From the tools developed in the scope of the project (in the preservation components sub-project), we selected the MarcAlizer tool, developed by UPMC, which performs the visual comparison between two web pages. In a second phase, the renderability analysis will also include a structural comparison of the pages, which is implemented by the new Pagelyser tool.
Since the core renderability analysis is thus performed by an external tool, the overall performance of the browser-shots tool will be tied to this external dependency. We will keep integrating the latest releases from the MarcAlizer development, as well as updates resulting from more specific training of the tool.

The detection of rendering issues is done in the following three steps:

    1. Screenshots of the Web pages are taken automatically using the Selenium framework, for different browser versions (a minimal sketch follows after this list).
    2. Pairs of screenshots are compared visually using the MarcAlizer tool (recently replaced by the PageAlizer tool, which also includes a structural comparison).
    3. Rendering issues in the Web pages are detected automatically, based on the comparison results.
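
As a rough illustration of step 1, here is a minimal sketch using Selenium's Python bindings; the URL list, browser choice and output file names are illustrative assumptions, not the project's actual code:

    from selenium import webdriver

    urls = ["http://example.org/"]          # pages to capture (illustrative)

    for i, url in enumerate(urls):
        driver = webdriver.Firefox()        # one driver per browser version under test
        try:
            driver.get(url)
            # Save a PNG screenshot of the rendered page
            driver.save_screenshot("firefox_%d.png" % i)
        finally:
            driver.quit()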

Initial Implementation

The browser-shots tool is developed as a wrapper application that orchestrates the main building blocks (Selenium instances and MarcAlizer comparators) and performs large-scale experiments on archived Web content.
The browser versions currently supported and tested are: Firefox (all available releases), Chrome (latest version only), Opera (official versions 11 and 12) and Internet Explorer (still to be fixed).

The initial, sequential implementation of the tool consists of several Python scripts running on a Debian Squeeze (64-bit) platform. This version of the tool was released on GitHub and we received some valuable feedback from the sub-project partners:
https://github.com/crawler-IM/browser-shots-tool

For the preliminary rounds of tests, we deployed the browser-shots tool on three nodes of IM's cluster and performed automated comparisons for around 440 pairs of URLs. The average processing time was about 16 seconds per pair of Web pages. These results showed that the existing solution is suitable for small-scale analysis only. Most of the processing time is actually spent on IO operations and disk access to the binary screenshot files. Taking the screenshots proved to be very time consuming, so if this solution is to be deployed at large scale, it needs to be further optimised and parallelised.

These results also showed that a serious performance bottleneck is the passing of intermediate data between the modules. More precisely, materialising the screenshots as binary files on disk is a very time-consuming operation, especially when considering large-scale experiments on a large number of Web pages.

We therefore have to move to a different implementation of the tool, which will use an optimised version of MarcAlizer. The Web page screenshots taken with Selenium will be passed directly to the MarcAlizer comparator using streams, and the new implementation of the browser-shots tool will be a MapReduce job running on a Hadoop cluster. Based on this framework, the current rounds of tests can be extended to a much higher number of URL pairs.

In the second round the browser-shot comparison tool is implemented as a MapReduce job to parallelise the processing of the input. The input in this case is a list of URLs together with a list of browser versions that are used to render the screenshots. Note the difference from the former version, where the input consisted of pairs of URLs that were rendered using one common browser version and then compared.
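
Conceptually, the map phase fans each input URL out over the list of browser versions, so that the individual (URL, browser) screenshot tasks can be processed in parallel. Below is a minimal Hadoop Streaming-style mapper sketch in Python; it is illustrative only, the actual tool is implemented as a Java MapReduce job around MarcAlizer:

    import sys

    BROWSERS = ["firefox-3.6", "firefox-25", "opera-12"]   # illustrative browser list

    # Mapper: read one URL per line from stdin, emit one task per (URL, browser) pair.
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        for browser in BROWSERS:
            print("%s\t%s" % (url, browser))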

Optimizations

[Figure: LA Times browser screenshot, resized]

In order to achieve acceptable running times, a newer version of the MarcAlizer comparison tool was integrated. The major improvement is the possibility of feeding the tool with in-memory objects instead of pointers to files on disk. This improvement and the elimination of unnecessary IO operations lead to the following average times for the individual steps of the shot comparison:

1) browser shot acquisition: ~2 s

2) MarcAlizer comparison: ~2 s

Note that the time to render and capture a screenshot using a browser depends mainly on the size of the rendered page; for instance, capturing a wsj.com page takes about 15 s on the IM machine, where the resulting PNG image is several MB.

MapReduce

As you can see, the operations on the screenshots are very expensive (remember that the list of tested browsers can be very long, and one screenshot operation is needed for each browser). Therefore we need to parallelise the tool across several machines working on the input list of URLs. To facilitate this, we have employed Hadoop MapReduce, which is part of the SCAPE platform.

The result of the comparisons is then materialised in a set of XML files, where each file represents the comparison of one pair of browser shots. To alleviate the problem of having large numbers of small files, these files are automatically bundled together into one ZIP file. A C3P0 adapter has been implemented by TU Wien so the result can be processed and passed on to Scout.

Tests

At the moment, we have run preliminary tests on the currently supported browser versions - Firefox and Opera. The list of URLs to test is about 13,000 entries long. We are using the IM central instance for these tests, which currently has two worker nodes (so we can roughly halve the processing time by running in parallel).

 

 


FIDO News


Here's a little news bulletin about FIDO, the OPF's open source file format identification tool.

It seems that the use of FIDO has grown over the last few months. I am getting responses by e-mail and through the GitHub issue tracker from all over the world, ranging from requests for help and suggestions for improvement to even some bug fixes. Thanks, and please keep them coming!

RECENT CHANGES

Most important change currently is the versioning schema of tagged releases.
If you forked FIDO or are watching the tags for updates, please note that the versioning schema has changed from [major].[minor].[patch] to [major].[minor].[patch]-[PRONOM version number].
The reason for this is that from time to time there is a new PRONOM version available but no code changes to commit. As it is bad practice to update a tagged release, this was the only reasonable way to fix this.

For example, release 1.3.1 has PRONOM version 70 distributed with it and is tagged '1.3.1-70'.
If a PRONOM update is available but there are no code changes, the consecutive tag will be '1.3.1-71'. Please note that this is only reflected in release tags; FIDO itself will still report only its own version number, without the PRONOM version number.
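
As a small illustration (not part of FIDO itself), a tag following this schema splits cleanly into its two components:

    def split_release_tag(tag):
        """Split a release tag like '1.3.1-70' into (FIDO version, PRONOM version)."""
        fido_version, pronom_version = tag.rsplit("-", 1)
        return fido_version, int(pronom_version)

    print(split_release_tag("1.3.1-70"))   # ('1.3.1', 70)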

Currently I am also working on the FIDO usage guide. It is still a work in progress, but it could help you on your way using FIDO.

FUTURE

I'll be the first to admit that FIDO is still far from being "the perfect file format identification tool". Although it is quite stable and many things have been improved or fixed lately, such as the handling of files passed to STDIN or the possibility to use only the official PRONOM signatures, it still needs improvement on many levels.

Recently Carl Wilson (OPF technical lead) and I started thinking about what needs to change for FIDO version 2. This second generation of FIDO will not differ much in functionality from the current version 1 generation, but the way we plan on doing things will make a big difference. For starters, we will be creating unit tests for every function of FIDO. The second important thing is unit testing of individual PRONOM signatures and PRONOM container signatures. With each update of PRONOM we will run unit tests using corpora files.

But the biggest change of all will be the way we build FIDO. It will no longer be just "a script", but rather an API. The "fido.py" script will then merely function as a prototype of how to build your "own" FIDO into your workflow systems. It will also no longer output to STDOUT and STDERR but will return results in a more Pythonic way. You will read more about all this in a later post.

In the meanwhile I (with a little help from you) will continue improving version 1 where possible. If you have any questions or suggestions about any of the above, please let me know.

FIDO @ Open Planets Github
FIDO releases @ Open Planets Github
FIDO usage guide

POSTPONED Digital Preservation Without Tears


The ‘Digital Preservation Without Tears’ Mash-up will appeal to collection owners and developers.  The programme offers two connected strands – a hack and a sprint.

  • In the hack, developers will have two days to develop, test and enhance practical tools for digital preservation. Collection owners will be invited to bring problem elements of their digital collections for analysis using the latest digital forensic and characterisation tools.  This will help the collection owners develop practical workflows for management and preservation while helping developers spot and refine solutions that will enable better tools.
  • In the sprint, collection owners will examine current thinking on digital preservation policy and planning in their organisations.  Collections owners will present their own digital preservation policies and will be invited to assess these against each other and against emerging good practice, providing a managed environment for policy development and peer review.  Collection owners will then be invited to pool their wisdom to create a Digital Preservation Policy Building Toolkit that can be shared.

This mashup will:

  • Provide a forum for practical problem solving for analysis of digital collection
  • Provide a forum for discussion, review and development of digital preservation policy
  • Bring together developers and collection owners from across the DPC and OPF to address shared challenges
  • Extend and enhance the corpus of digital preservation tools
  • Deliver a simple beginners’ guide for the development of digital preservation policies

This event will be of interest to:

  • Collections managers, librarians, curators and archivists and policy makers in all institutions with an interest in digital preservation
  • Techies, tools developers, IT officers, database managers and systems analysts with an interest in long term data management
  • Innovators and researchers in digital preservation
  • Vendors and providers of digital preservation services
  • CEOs, CTOs and CIOs seeking to develop institutional capacity for digital preservation

Everyone coming needs to bring a laptop computer.  In addition:

  • Collection owners will need to bring a data set that is giving them trouble in terms of characterisation or identification and be prepared to present their institutional policy on digital preservation
  • Techies will need to tell us about the skills they have and bring a knowledge of existing digital forensic and characterisation tools

Also, because elements of the mash-up include peer-review of existing practice, participants need to understand and consent to working under ‘Chatham House Rules’ for parts of the programme.

Places are strictly limited and should be booked in advance.  Priority will be given to DPC and OPF members, who can attend at no cost.  Non-members are welcome at a cost of £150 per person. Lunch and refreshments are provided on three days and dinner on the first night.  Accommodation will be recommended but is not included in the cost.  Register online at: http://www.dpconline.org/events

Can’t make it?

Parts of the event will be available as a webcast. We’ll publish the slides after each event and will tweet live from the event using the hashtag #DPnoTears.  

Location: 
Innovation Centre, University of York
Innovation Way, Heslington
York YO10 5DG
United Kingdom
53° 56' 53.0772" N, 1° 2' 49.074" W

Measuring Bigfoot


My previous blog Assessing file format risks: searching for Bigfoot? resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to it directly using the comment fields. However, my reply turned out to be rather lengthier than I intended, so I decided to turn it into a separate blog entry.

Numbers first?

Ross's overall point is that we need the numbers first; he makes a plea for collecting more format-related data, and adding numbers to these. Although these data do not directly translate into risks, Ross argues that it might be possible to use them to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there's a pretty fundamental problem, which I'll try to explain below. To avoid any confusion, I will be speaking of "format risk" here in the sense used by Graf & Gordea, which follows from the idea of "institutional obsolescence" (which is probably worth a blog post by itself, but I won't go into that here).

The risk model

Graf & Gordea define institutional obsolescence in terms of "the additional effort required to render a file beyond the capability of a regular PC setup in particular institution". Let's call this effort E. Now the aim is to arrive at an index that has some predictive power of E. Let's call this index RE. For the sake of the argument it doesn't matter how RE is defined precisely, but it's reasonable to assume it will be proportional to E (i.e. as the effort to render a file increases, so does the risk):

RE ∝ E

The next step is to find a way to estimate RE (the dependent variable) as a function of a set of potential predictor variables:

RE = f(S, P, C, ... )

where S = software count, P = popularity, C = complexity, and so on. To establish the predictor function we have two possibilities:

  1. use a statistical approach (e.g. multiple regression or something more sophisticated);
  2. use a conceptual model that is based on prior knowledge of how the predictor variables affect RE.

The first case (statistical approach) is only feasible if we have actual data on E. For the second case we also need observations on E, if only to be able to say anything about the model's ability to predict RE (verification).
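
To make the statistical route concrete, here is a minimal sketch (the data and predictor values are hypothetical, purely for illustration): fitting any such model requires observed values of E for a set of formats, which is exactly what is missing.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical predictor matrix: one row per format, columns are
    # software count (S), popularity (P) and complexity (C).
    X = np.array([[12, 0.9, 2],
                  [ 3, 0.1, 7],
                  [ 8, 0.5, 4]])

    # Observed rendering effort E per format -- this is the data that does not
    # exist in practice; without it the model can neither be fitted nor verified.
    E = np.array([1.0, 9.0, 3.5])

    model = LinearRegression().fit(X, E)
    print(model.predict([[5, 0.3, 5]]))   # estimate of RE for a new format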

No observed data on E!

Either way, the problem here is that there's an almost complete lack of any data on E. Although we may have a handful of isolated 'war stories', these don't even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model [1]. So how are we going to model a quantity for which we do not have any observed data in the first place? Or am I overlooking something here?

Looking at Ross's suggestions for collecting more data, all of the examples he provides fall into the potential (!) predictor variables category. For instance, prompted by my observation on compression in PDF, Ross suggests analysing large collections of PDFs to establish patterns in the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don't tell you whether PDF is "riskier" than another format, but he argues that:

once we've got them [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.

Aside from the fact that it's debatable whether, in practical terms, the use of compression is really a risk (is there any evidence to back up this claim?), there's a more fundamental issue here. Bearing in mind that, ultimately, the thing we're really interested in here is E, how could collecting more data on potential predictor variables of E ever help here in the near absence of any actual data on E? No amount of clever maths or statistics can compensate for that! Meanwhile, ongoing work on the prediction of E mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross's suggestions), even though the purpose of these efforts remains largely unclear.

Within this context I was quite intrigued by the grant proposal mentioned by Andrea Goethals which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on E (although like Andy Jackson said here I'm also wondering whether this may be too ambitious).

Obsolescence-related risks versus format instance risks

On a final note, Ross makes the following remark about the role of tools:

[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats - and more and more should appear over time.

This is true to some extent, but a tool like jpylyzer only provides information on format instances (i.e. features of individual files); it doesn't say anything about preservation risks of the JP2 format in general. The same applies to tools that are able to detect features in individual PDF files that are risky from a long-term preservation point of view. Such risks affect file instances of current formats, and this is an area that is covered by the OPF File Format Risk Registry that is being developed within SCAPE (it only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain that is being addressed by FFMA. This distinction is important, because both types of risks need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross's suggestion on compression in PDF entails (if I'm understanding him correctly) the analysis of large volumes of PDFs in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for 'risky' features.


  1. On a side note even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically-derived models. 

Scalable Environments for File Format Identification and Characterisation


This webinar provides an introduction to file format identification and characterisation tools which have been developed or extended as part of the SCAPE Project.

It covers the basic principles of file format identification, and shows how format information drives digital preservation workflows.

Participants will be given an overview of file format registries and their role in digital preservation, and will see demonstrations of identification and characterisation tools, including FIDO and Tika.

We will provide a Virtual Machine image with sample files and step-by-step worksheets to allow participants to try out these exercises for themselves after the webinar, with support.

Learning outcomes (by the end of the webinar and exercises, participants will be able to):

  • Distinguish between different file types and identify the requirements for characterising each of them.
  • Carry out identification and characterisation experiments on example files.
  • Compare characterisation and identification tools and understand their advantages and disadvantages when used in different scenarios.


Session Lead: Carl Wilson, OPF
Date: Friday 25 October
Time: 12 noon BST / 13:00 CET
Duration: 1 hour (please note this includes the presentation and demonstrations. Practical exercises can be carried out after the webinar).

There are 25 places available which will be allocated on a first come, first served basis.

Date: 
Friday, 25 October 2013
Location: 
United Kingdom

Fund it, Solve it, Keep it (with SPRUCE)


How to fund and solve your digital preservation challenges

 

What will the event do for me?

This event will help to make your digital preservation more effective by demonstrating the best community focused approaches and results from the JISC funded SPRUCE Project. You'll be hearing from the SPRUCE Team experts and from the practitioners and developers who have been tackling digital preservation challenges in targeted SPRUCE Award projects. We'll also be hearing from you, so we can take on board what you need from our future work.

  • If you're taking your first steps in preserving your digital assets we will demonstrate how to get started, where to get help, and how to make the case to resource your work more effectively.
  • If you're already engaged in digital preservation we'll show how your efforts can be supported more effectively with help from the community.

Key topics we will be covering include:

  • Securing funding for your digital preservation activities with the Digital Preservation Business Case Toolkit
  • Community approaches to solving digital preservation challenges
  • SPRUCE guides on how to assess your digital collections
  • Stabilising data stored on obsolete hand-held media
  • Results from the SPRUCE Award Projects

Who is this for?

Practitioners, developers and middle managers who are engaged (or would like to be engaged) in preserving their organisation's digital assets.

When, where and how do I register?

The free event will take place at 11am on the 25th November at the brand new Library of Birmingham. Register your attendance here. Please note that anyone who registers for the event and then fails to attend without giving at least one week's notice will be liable for a £50 cancellation charge. Places are limited, so please don't waste them!

Date: 
Monday, 25 November 2013
Location: 
Library of Birmingham
Centenary Square, Broad Street
Birmingham B1 2ND
United Kingdom
52° 28' 45.6924" N, 1° 54' 29.3328" W

SCAPE/OPF Continuous Integration update


As previously blogged about by Carl, we now have virtually all SCAPE and OPF projects in continuous integration, building and unit testing in both Travis CI and Jenkins.

  • Travis compiles the projects and executes unit tests whenever a new commit is pushed to GitHub, or when a pull request is submitted to the project.
  • Jenkins builds are generally scheduled once per day.  After a build, the software has its code quality analysed by Sonar.

Complete details of how to build each non-Java project are contained within the .travis.yml files that are found in the project directories.  As a side effect of this work the .travis.yml files can be used as instructions for independently building the projects.

Matchbox, Xcorrsound and Jpylyzer have CI builds that are capable of generating an installable Debian package, which we are aiming to publish.  Java projects have had their Maven GroupId and package names changed to the appropriate SCAPE names so we can publish binary snapshots.

The daily Maven snapshots of code built in Jenkins are now (or soon will be) published to https://oss.sonatype.org/content/repositories/snapshots/eu/scape-project/ and can be used by adding this repository to your pom.xml:

<parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>7</version>
</parent>

What you can do for your project

  1. Maintain your .travis.yml file if project dependencies change
  2. Ensure code matches the SCAPE/OPF functional review criteria – correct Java package names and Maven GroupIds are essential to be able to publish snapshots
  3. Ensure your project has an up to date README that contains details of how to build and run your software (including dependencies)
  4. Very importantly ensure that your project has (at the very least) a top level LICENSE, ideally  source files should each contain a license header
  5. Add unit tests for your project
  6. Ensure that unit tests for your project can easily be run using standard dependencies. Relying on your particular installation for unit tests to pass means that they cannot be successfully run by Travis/Jenkins and show as test failures.  Whilst it might not always be possible to have unit tests that can be run independently, if there have to be test dependencies then please document how these should be set up!
  7. Check your project at http://projects.opf-labs.org/

The CI days are generally about once a month.  If you are interested in joining us do let us know as we could always do with more help.  It’s an opportunity for you to work on CI with Travis/Jenkins, and do other work that is interesting (and rewarding), such as Debian packaging, that you might not normally get to work on.

OPF Webinar: Securing funding for your digital preservation, with SPRUCE


Making the case to your organisation's management, or to external funders, to adequately resource your digital preservation activities is not an easy task. Digital preservation is not always a straightforward sell. In this financial climate the justification for spending money has to be compelling and watertight. In this webinar Paul Wheatley will describe how to make the case for funding your digital preservation, with reference to the SPRUCE Project's Digital Preservation Business Case Toolkit.

* Making a compelling case to fund digital preservation
* The Digital Preservation Business Case Toolkit from SPRUCE
* Getting started
* Other resources

There are twenty-five places available on a first come, first served basis.

Date: Wednesday 27 November
Time: 14:00 GMT / 15:00 CET
Duration: 1 hour
Session Lead: Paul Wheatley, SPRUCE Project Manager, University of Leeds
Date: 
Wednesday, 27 November 2013
Location: 
Online
United Kingdom

Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna


More than 20 developers visited the ‘Hadoop-driven digital preservation Hackathon’ in Vienna which took place in the baroque room called "Oratorium" of the Austrian National Library from 2nd to 4th of December 2013. It was really exciting to hear people vividly talking about Hadoop, Pig, Hive, HBase followed by silent phases of concentrated coding accompanied by the background noise of mouse clicks and keyboard typing.

There were Hadoop newbies, people from the SCAPE Project with some knowledge of Apache Hadoop related technologies, and, finally, Jimmy Lin, who currently works as an associate professor at the University of Maryland and was previously employed as chief data scientist at Twitter. There is no doubt that his profound knowledge of using Hadoop in an 'industrial' big data context gave this event that certain something.

The topic of this Hackathon was large-scale digital preservation in the web archiving and digital books quality assurance domains.  People from the Austrian National Library presented application scenarios and challenges and introduced the sample data which was provided for both areas on a virtual machine together with a pseudo-distributed Hadoop-installation and some other useful tools from the Apache Hadoop ecosystem.

I am sure that Jimmy's talk about Hadoop was the reason why so many participants became curious about Apache Pig, a powerful tool which was humorously characterised by Jimmy as the tool for lazy pigs aiming for hassle-free MapReduce. Jimmy gave a live demo running some Pig scripts on the cluster at his university, explaining how Pig can be used to find out which links point to each web page in a web archive data sample from the Library of Congress. Asking Jimmy about his opinion on Pig and Hive as two alternatives for data scientists to choose from, I found it interesting that he did not seem to have a strong preference for Pig. If an organisation has a lot of experienced SQL experts, he said, Hive is a very good choice. On the other hand, from the perspective of the data scientist, Pig offers a more flexible, procedural approach for manipulating data and doing data analysis.

Towards the end of the first day we started to split up. People gathered ideas in a brainstorming session, which in the end led to the following groups:

·  Cropping error detection

·  Full-text search on top of warcbase

·  Hadoop-based Identification and Characterisation

·  OCR Quality

·  PIG User Defined Functions to operate on extracted web content

·  PIG User Defined Functions to operate on METS

Many participants made their first steps in Pig scripting during the event, so it is clear that one cannot expect code that is ready to be used in a production environment, but we can see many points to start from when we do the planning of projects with similar requirements.

On the second day, there was another talk by Jimmy about HBase and his project WarcBase, which looks like a very promising approach to providing a scalable HBase storage backend with a very responsive user interface that offers the basic functionality of what the Wayback Machine does for rendering ARC and WARC web archive container files. In my opinion, the upside of his talk was to see HBase as a tremendously powerful database on top of Hadoop's distributed file system (HDFS), with Jimmy brimming over with ideas about possible use cases for scalable content delivery using HBase.  The downside was to hear his experiences of how complex the administration of a large HBase cluster can become. First, in addition to the Hadoop administration tasks, it is necessary to keep additional daemons (ZooKeeper, RegionServer) up and running, and he explained how the need for compacting data stored in HFiles, once you believe that the HBase cluster is well balanced, can lead to what the community calls a "compaction storm" that blows up your cluster - luckily this only manifests itself as endless Java stack traces.

One group provided a full-text search for WarcBase; they picked up the core ideas from the developer groups and presentations to build a cutting-edge environment where the web archive content was indexed by the Terrier search engine and the index was enriched with metadata from the Apache Tika MIME type and language detection. There were two ways to add metadata to the index. The first option was to run a pre-processing step that uses a Pig user defined function to output the metadata of each document. The second option was to use Apache Tika during indexing to detect both the MIME type and language. In my view, this group won the prize for the fanciest set-up, sharing resources and daemons running on their laptops.

I was impressed how in the largest working group the outcomes were dynamically shared between developers: One implemented a Pig user defined function (UDF) making use of Apache Tika’s language detection API (see section MIME type detection) which the next developer used in a Pig script for mime type and language detection. Also Alan Akbik, SCAPE project member, computer linguist and Hadoop researcher from the University of Berlin, was reusing building blocks from this group to develop Pig Scripts for old German language analysis using dictionaries as a means to determine the quality of noisy OCRed text. As an experienced Pig scripter he produced impressive results and deservedly won the Hackathon’s competition for the best presentation of outcomes.

The last group was experimenting with the functionality of classical digital preservation tools for file format identification, like Apache Tika, DROID, and Unix file, and looking into ways to improve their performance on the Hadoop platform. It's worth highlighting that digital preservation guru Carl Wilson found a way to replace the command line invocation of Unix file in FITS with a Java API invocation, which proved to be far more efficient.

Finally, Roman Graf, researcher and software developer from the Austrian Institute of Technology, took images from the Austrian Books Online project in order to develop Python scripts which can be used to detect page cropping errors and which were especially designed to run on a Hadoop platform.

On the last day, we had a panel session with people talking about experiences regarding the day-to-day work with Hadoop clusters and the plans that they have for the future of their cluster infrastructure.

I really enjoyed these three days and I was impressed by the knowledge and ideas that people brought to this event.

SCAPE Training - Preserving Your Preservation Tools


Overview

Learning to Think Like a Package Maintainer

Lots of great digital preservation applications and services exist; however, very few are actively maintained and thus preserved! This is a big problem! By introducing the steps needed to develop maintained packages and engage the support of the community, this training course looks at what can be done to improve this situation. Specifically, it looks at how to prepare packages for submission into the very heart of many digital environments: the operating system and directly associated "app stores". Attendees will be given hands-on experience with developing and maintaining packages rather than software, and the key differences will be discussed and evaluated. Better preservation of preservation tools means better preservation of our digital history.

Learning Outcomes (by the end of the training event the attendees will be able to):

  1. Understand the complexities of package management and distinguish between the different practices relating to both package objectives and chosen programming language. 
  2. Be able to carry out advanced package management operations in order to critically appraise current packages and propose changes. 
  3. Understand the importance of clearly defined versioning and licenses and the role of clear documentation and examples. 
  4. Apply best practice techniques in order to create a simple package suitable for long term maintenance. 
  5. Evaluate a number of options for managing package configuration and behavior relating to package installation, removal, upgrade and re-installation. 
  6. Analyse opportunities for automating package management and releases, maintaining a clear focus on the user and not the developer. 
  7. Critically evaluate opportunities to generalise package management to allow the easy building and maintenance of packages on multiple platforms.
  8. Assess the potential to apply package management techniques in your own environment. 

Delegates will receive a certificate of attendance for the training course.

The agenda can be seen here: http://wiki.opf-labs.org/display/SP/Agenda+-+Preserving+Your+Preservation+Tools.

Registration will open in early 2014.

Date: 
Wednesday, 26 March 2014 to Thursday, 27 March 2014
Location: 
The National Library of the Netherlands
Prins Willem-Alexanderhof 5
2595 BE The Hague
Netherlands

OPF Webinar - From the Preservation Toolkit: JHOVE2


This webinar will give an overview of JHOVE2, the free and open-source tool for characterizing digital objects.  It will cover the motivation for creating a second-generation version of JHOVE and some of the new features of the tool, including the ability to perform not just format identification, validation, and feature extraction, but also assessment (a policy-based determination of the acceptability of a format instance, regardless of its validity).  It will discuss JHOVE2's more sophisticated data model of a format instance, embracing complex digital objects that can be composed of more than one file, each of a possibly different format.  It will provide pointers on JHOVE2's setup and use.  It will briefly introduce the tool's architecture, and ways in which it has been and can continue to be extended to include more formats, building on existing libraries and tools.

There are twenty-five places available on a first come, first served basis.

Date: Friday 31 January
Time: 09:00 EST / 14:00 GMT / 15:00 CET
Duration: 1 hour
Session Lead: Sheila Morrissey, Portico

 

Date: 
Friday, 31 January 2014
Location: 
United Kingdom

Standing on the Shoulders of Your Peers


In December last year I attended a Hadoop Hackathon in Vienna. A hackathon that has been written about before by other participants: Sven Schlarb's Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna and Clemens and René's The Elephant Returns to the Library…with a Pig!. Like these other participants I really came home from this event with a lot of enthusiasm and fingers itching to continue the work I started there.

As Clemens and René write in their blog post on this event, collaboration had, without really being stated explicitly, taken centre stage, and that in itself was a good experience.

For the hackathon Jimmy Lin from the University of Maryland had been invited to present Hadoop and some of the adjoining technologies to the participants. We all came with the hope of seeing cool and practical usages of Hadoop for digital preservation. He started his first session by surprising us all with a talk titled: Never Write Another MapReduce Job. It later became clear that Jimmy enjoys this kind of gentle provocation, as in his 2012 article titled If all you have is a hammer, throw away everything that is not a nail. Jimmy, of course, did not want us to throw away Hadoop. Instead he gave a talk on how to get rid of all the tediousness and boilerplate necessary when writing MapReduce jobs in Java. He showed us how to use Pig Latin, a language which can be described as an imperative-like, SQL-like DSL for manipulating data structured as lists. It is very concise and expressive and soon became a new shiny tool for us developers.

During the past year or so Jimmy had been developing a Hadoop-based tool for harvesting web sites into HBase. This tool also had its own piggy bank, which is what you call a library of user defined functions (UDFs) for Pig Latin. So, to cut a corner, those of us who wanted to hack in Pig Latin cloned that tool from GitHub: warcbase. As a bonus this tool also had a UDF for reading ARC files, which was nice as we had a lot of test data in that format, some provided by ONB and some brought from home.

As an interesting side-note, the warcbase tool actually leverages another recently developed digital preservation tool, namely JWAT, developed at the Danish Royal Library.

As Clemens and René write in their blog post, they created two UDFs using Apache Tika: one UDF for detecting which language a given text-based ARC record was written in and another for identifying which MIME type a given record had. Meanwhile another participant, Alan Akbik from Technische Universität Berlin, showed Jimmy how to easily add Pig Latin unit tests to a project. This resulted in an actual commit to warcbase during the hackathon, adding unit tests to the previously implemented UDFs.

Given those unit tests I could then implement such tests for the two Tika UDFs that Clemens and René had written. These days unit tests are almost ubiquitous when collaborating on writing software. Apart from their primary role of ensuring the continued correctness of refactored code, they have another advantage. For years I've preferred an exploratory development style using REPL-like environments. This is hard to do using Java, but the combination of unit tests and a good IDE gives you a little of that dynamic feeling.

With all the above in place I decided to write a new UDF. This UDF should use the UNIX file tool to identify records in an ARC file. This task would combine the ARC reader UDF by Jimmy, the Pig unit tests by Alan and, lastly, a Java/JNA library written by Carl Wilson, who adapted it from another digital preservation tool called JHOVE2. This library is available as libmagic-jna-wrapper. I would, of course, also rely heavily on the two Tika UDFs by Clemens and René and the unit tests I wrote for those.

Old Magic

The "file" tool and its accompanying library "libmagic" is used in every Linux and BSD distribution on the planet, it was born in 1987, and is still the most used file format identification tool. It would be sensible to employ such a robust and widespread tool in any file identification environment especially as it is still under development. As of this writing, the latest commit to "file" was five days ago!

The "file" tool is available on Github as glenc/file.

"file" and the "ligmagic" library are developed in C. To employ this we therefore need to have a JNA interface and this is exactly what Carl finished during the hackathon.

Maven makes it easy to use that library:

<dependency>
    <groupId>org.opf-labs</groupId>
    <artifactId>lib-magic-wrapper</artifactId>
    <version>0.0.1-SNAPSHOT</version>
</dependency>

which gives access to "libmagic" from a Java program:

    import org.opf_labs.LibmagicJnaWrapper;

    ...

    LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
    String magicFile = "/usr/share/file/magic.mgc";
    jnaWrapper.load(magicFile);
    String mimeType = jnaWrapper.getMimeType(is);

    ...

There is one caveat in using a C library like this from Java: it often requires platform-specific configuration, in this case the full path to the "magic.mgc" file. This file contains the signatures (byte sequences) used when identifying the formats of the unknown files. In this implementation the UDF takes this path as a parameter to the constructor of the UDF class.

Magic UDF

With the above in place it is very easy to implement the UDF, which in its entirety is as simple as:

package org.warcbase.pig.piggybank;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.opf_labs.LibmagicJnaWrapper;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DetectMimeTypeMagic extends EvalFunc<String> {
    private static String MAGIC_FILE_PATH;

    public DetectMimeTypeMagic(String magicFilePath) {
        MAGIC_FILE_PATH = magicFilePath;
    }

    @Override
    public String exec(Tuple input) throws IOException {
        String mimeType;

        if (input == null || input.size() == 0 || input.get(0) == null) {
            return "N/A";
        }
        //String magicFile = (String) input.get(0);
        String content = (String) input.get(0);

        InputStream is = new ByteArrayInputStream(content.getBytes());
        if (content.isEmpty()) return "EMPTY";

        LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
        jnaWrapper.load(MAGIC_FILE_PATH);

        mimeType = jnaWrapper.getMimeType(is);

        return mimeType;
    }
}

Github: DetectMimeTypeMagic.java

Magic Pig Latin

A Pig Latin script utilising the new magic UDF on an example ARC file. The script measures the distribution of MIME types in the input files.

    register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

    -- The '50' argument is explained in the last section
    define ArcLoader50k org.warcbase.pig.ArcLoader('50'); 

    -- Detect the mime type of the content using magic lib
    -- On MacOS X using Homebrew the magic file is located at
    -- /usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc
    define DetectMimeTypeMagic org.warcbase.pig.piggybank.DetectMimeTypeMagic('/usr/local/Cellar/libmagic/5.16/share/misc/magic.mgc');

    -- Load arc file properties: url, date, mime, and 50kB of the content
    raw = load 'example.arc.gz' using ArcLoader50k() as (url: chararray, date:chararray, mime:chararray, content:chararray);

    a = foreach raw generate url,mime, DetectMimeTypeMagic(content) as magicMime;

    -- magic lib includes "; <char set>" which we are not interested in
    b = foreach a {
        magicMimeSplit = STRSPLIT(magicMime, ';');
        GENERATE url, mime, magicMimeSplit.$0 as magicMime;
    }

    -- bin the results
    magicMimes      = foreach b generate magicMime;
    magicMimeGroups = group magicMimes by magicMime;
    magicMimeBinned = foreach magicMimeGroups generate group, COUNT(magicMimes);

    store magicMimeBinned into 'magicMimeBinned';

This script can be modified a bit for usage with this unit test

    @Test
    public void testDetectMimeTypeMagic() throws Exception {
        String arcTestDataFile;
        arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();

        String pigFile = Resources.getResource("scripts/TestDetectMimeTypeMagic.pig").getPath();
        String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows ?

        PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location});

        Iterator <Tuple> ts = test.getAlias("magicMimeBinned");
        while (ts.hasNext()) {
            Tuple t = ts.next(); // t = (mime type, count)
            String mime = (String) t.get(0);
            System.out.println(mime + ": " + t.get(1));
            if (mime != null) {
                switch (mime) {
                    case                         "EMPTY": assertEquals(  7L, (long) t.get(1)); break;
                    case                     "text/html": assertEquals(139L, (long) t.get(1)); break;
                    case                    "text/plain": assertEquals( 80L, (long) t.get(1)); break;
                    case                     "image/gif": assertEquals( 29L, (long) t.get(1)); break;
                    case               "application/xml": assertEquals( 11L, (long) t.get(1)); break;
                    case           "application/rss+xml": assertEquals(  2L, (long) t.get(1)); break;
                    case         "application/xhtml+xml": assertEquals(  1L, (long) t.get(1)); break;
                    case      "application/octet-stream": assertEquals( 26L, (long) t.get(1)); break;
                    case "application/x-shockwave-flash": assertEquals(  8L, (long) t.get(1)); break;
                }
            }
        }
    }

Github: TestArcLoaderPig.java

The modified Pig Latin script is at TestDetectMimeTypeMagic.pig

¡Hasta la Vista!

During this event we had a lot of synergy through collaboration; shouting over the tables, showing code to each other, running each other's code on non-public test data, presenting results on projectors, and so on. Even late night discussions added significant energy to this synergy. All this is not possible without people actually meeting each other face to face for a couple of days, showing up with great intentions for sharing, learning and teaching.

So, I do hope to see you all soon somewhere in Europe for some great hacking.

Epilogue: Out of heap space

A couple of weeks ago I was more or less done with all of the above, including this blog post. Then something happened that required us to upgrade our version of Cloudera to 4.5. This again resulted in us changing the basic cluster architecture, and then the UDFs stopped working due to heap space out of memory errors. I traced those out of memory errors to the ArcLoader class, which is why I implemented the "READ_SIZE" class field. This field is set when instantiating the class to some reasonable number of kB. It forces the ArcLoader to only read a certain amount of payload data, just enough for Tika and libmagic to complete their format identifications while ensuring we don't get hundreds-of-megabytes sized strings being passed around.

This doesn't address the problem of why it worked before and why it doesn't now. It also doesn't address the loss of generality. The ArcLoader can no longer provide an ARC container format abstraction in every case. It only works when the job can make do with only a part of the payload of the ARC records. I.e. the given solution would not work for a Pig script that needs to extract the audio parts of movie files provided as ARC records.

As this work has primarily been a learning experience I will stop here — for now. Still, I'm certain that I'll revisit these issues somewhere down the road as they are both interesting and the solutions will be relevant for our work.

Identification of PDF preservation risks: analysis of Govdocs selected corpus


This blog follows up on three earlier posts about detecting preservation risks in PDF files. In part 1 I explored to what extent the Preflight component of the Apache PDFBox library can be used to detect specific preservation risks in PDF documents. This was followed up by some work during the SPRUCE Hackathon in Leeds, which is covered by this blog post by Peter Cliff. Then last summer I did a series of additional tests using files from the Adobe Acrobat Engineering website. The main outcome of this more recent work was that, although showing great promise, Preflight was struggling with many of the more complex PDFs. Fast-forward another six months and, thanks to the excellent response of the Preflight developers to our bug reports, the most serious of these problems are now largely solved [1]. So, time to move on to the next step!

Govdocs Selected

Ultimately, the aim of this work is to be able to profile large PDF collections for specific preservation risks, or to verify that a PDF conforms to an institution-specific policy before ingest. To get a better idea of how that might work in practice, I decided to do some tests with the Govdocs Selected dataset, which is a subset of the Govdocs1 corpus. As a first step I ran the latest version of Preflight on every PDF in the corpus (about 15 thousand) [2].

Validation errors

As I was curious about the most common validation errors (or, more correctly, violations of the PDF/A-1b profile), I ran a little post-processing script on the output files to calculate error occurrences. The following table lists the results. For each Preflight error (which is represented as an error code), the table shows the number of PDFs for which the error was reported (expressed as a percentage) [3].
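
A minimal sketch of such a post-processing step (the XML element names are assumptions made for illustration; Preflight's actual report schema may differ):

    import glob
    from collections import Counter
    from xml.etree import ElementTree

    counts = Counter()
    n_files = 0

    for report in glob.glob("preflight-output/*.xml"):
        n_files += 1
        tree = ElementTree.parse(report)
        # Assumed layout: one <error> element per violation, with a <code> child.
        codes = {e.findtext("code") for e in tree.iter("error")}
        counts.update(codes)   # count each error code at most once per file

    for code, n in counts.most_common():
        print("%s\t%.1f%%" % (code, 100.0 * n / n_files))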

Error code | % PDFs reported | Description (from Preflight source code)
2.4.3 | 79.5 | color space used in the PDF file but the DestOutputProfile is missing
7.1 | 52.5 | Invalid metadata found
2.4.1 | 39.1 | RGB color space used in the PDF file but the DestOutputProfile isn't RGB
1.2.1 | 38.8 | Error on the object delimiters (obj / endobj)
1.4.6 | 34.3 | ID in 1st trailer and the last is different
1.2.5 | 32.1 | The length of the stream dictionary and the stream length is inconsistent
7.11 | 31.9 | PDF/A Identification Schema not found
3.1.2 | 31.6 | Some mandatory fields are missing from the FONT Descriptor Dictionary
3.1.3 | 29.4 | Error on the "Font File x" in the Font Descriptor (ed.: font not embedded?)
3.1.1 | 27.2 | Some mandatory fields are missing from the FONT Dictionary
3.1.6 | 17.1 | Width array and Font program Width are inconsistent
5.2.2 | 13 | The annotation uses a flag which is forbidden
2.4.2 | 12.8 | CMYK color space used in the PDF file but the DestOutputProfile isn't CMYK
1.2.2 | 12 | Error on the stream delimiters (stream / endstream)
1.2.12 | 9.5 | The stream uses a filter which isn't defined in the PDF Reference document
1.4.1 | 9.3 | ID is missing from the trailer
3.1.11 | 8.4 | The CIDSet entry is mandatory for a subset of composite font
1.1 | 8.3 | Header syntax error
1.2.7 | 7.5 | The stream uses an invalid filter (The LZW)
3.1.5 | 7.3 | Encoding is inconsistent with the Font
2.3 | 6.7 | A XObject has an unexpected key defined
Exception | 6.6 | Preflight raised an exception
3.1.9 | 6.1 | The CIDToGID is invalid
3.1.4 | 5.7 | Charset declaration is missing in a Type 1 Subset
7.2 | 5 | Metadata mismatch between PDF Dictionary and XMP
7.3 | 4.3 | Description schema required not embedded
2.3.2 | 4.2 | A XObject has an unexpected value for a defined key
7.1.1 | 3.3 | Unknown metadata
3.3.1 | 3.1 | a glyph is missing
1.4.8 | 2.6 | Optional content is forbidden
2.2.2 | 2.4 | A XObject SMask value isn't None
1.0.14 | 2.1 | An object has an invalid offset
1.4.10 | 1.6 | Last %%EOF sequence is followed by data
2.2.1 | 1.6 | A Group entry with S = Transparency is used or the S = Null
1 | 1.6 | Syntax error
5.2.3 | 1.5 | Annotation uses a Color profile which isn't the same as the profile contained by the OutputIntent
1.0.6 | 1.2 | The number is out of Range
5.3.1 | 1.1 | The AP dictionary of the annotation contains forbidden/invalid entries (only the N entry is authorized)
6.2.5 | 1 | An explicitly forbidden action is used in the PDF file
1.4.7 | 1 | EmbeddedFile entry is present in the Names dictionary

This table does look a bit intimidating (but see this summary of Preflight errors); nevertheless it is useful to point out a couple of general observations:

  • Some errors are really common; for instance, error 2.4.3 is reported for nearly 80% of all PDFs in the corpus!
  • Errors related to color spaces, metadata and fonts are particularly common.
  • File structure errors (1.x range) are reported quite a lot as well. Although I haven't looked at this in any detail, I expect that for some files these errors truly reflect a deviation from the PDF/A-1 profile, whereas in other cases these files may simply not be valid PDF (which would be more serious).
  • About 6.5% of all analysed files raised an exception in Preflight, which could either mean that something is seriously wrong with them, or alternatively it may point to bugs in Preflight.

Policy-based assessment

Although it's easy to get overwhelmed by the Preflight output above, we should keep in mind here that the ultimate aim of this work is not to validate against PDF/A-1, but to assess arbitrary PDFs against a pre-defined technical profile. This profile may reflect an institution's low-level preservation policies on the requirements a PDF must meet to be deemed suitable for long-term preservation. In SCAPE such low-level policies are called control policies, and you can find more information on them here and here.

To illustrate this, I'll be using a hypothetical control policy for PDF that is defined by the following objectives:

  1. File must not be encrypted or password protected
  2. Fonts must be embedded and complete
  3. File must not contain JavaScript
  4. File must not contain embedded files (i.e. file attachments)
  5. File must not contain multimedia content (audio, video, 3-D objects)
  6. File should be valid PDF

Preflight's output contains all the information that is needed to establish whether each objective is met (except objective 6, which would need a full-fledged PDF validator). By translating the above objectives into a set of Schematron rules, it is pretty straightforward to assess each PDF in our dataset against the control policy. If that sounds familiar: this is the same approach that we used earlier for assessing JP2 images against a technical profile. A schema that represents our control policy can be found here. Note that this is only a first attempt, and it may well need some further fine-tuning (more about that later).
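
As a sketch of how such an assessment can be wired up with the lxml library (the file names are hypothetical; the real schema is the one linked above):

    from lxml import etree
    from lxml.isoschematron import Schematron

    # Schematron schema encoding the control policy, and one Preflight report.
    schema = Schematron(etree.parse("pdf_policy.sch"), store_report=True)
    report = etree.parse("preflight-output/some_document.xml")

    if schema.validate(report):
        print("Pass: document meets the control policy")
    else:
        print("Fail:")
        print(schema.error_log)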

Results of assessment

As a first step I validated all Preflight output files against this schema. The result is rather disappointing:

Outcome | Number of files | %
Pass | 3973 | 26
Fail | 11120 | 74

So, only 26% of all PDFs in Govdocs Selected meet the requirements of our control policy! The figure below gives us some further clues as to why this is happening:

Here each bar represents the occurrences of individual failed tests in our schema.

Font errors galore

What is clear here is that the majority of failed tests are font-related. The Schematron rules that I used for the assessment currently include all font errors that are reported by Preflight. Perhaps this is too strict for objective 2 ("Fonts must be embedded and complete"). A particular difficulty here is that it is often hard to envisage the impact of particular font errors on the rendering process. On the other hand, the results are consistent with the outcome of a 2013 survey by the PDF Association, which showed that its members see fonts as the most challenging aspect of PDF, both for processing and writing (source: this presentation by Duff Johnson). So, the assessment results may simply reflect that font problems are widespread [4]. One should also keep in mind that Govdocs Selected was created by selecting on unique combinations of file properties from files in Govdocs1. As a result, one would expect this dataset to be more heterogeneous than most 'typical' PDF collections, and this would also influence the results. For instance, the Creating Program selection property could result in a relative over-representation of files that were produced by some crappy creation tool. Whether this is really the case could easily be tested by repeating this analysis for other collections.

Other errors

Only a small number of PDFs with encryption, JavaScript, embedded files and multimedia content were detected. I should add here that the occurrence of JavaScript is probably underestimated due to a pending Preflight bug. A major limitation is that there are currently no reliable tools that are able to test overall conformity to PDF. This problem (and a hint at a solution) is also the subject of a recent blog post by Duff Johnson. In the current assessment I've taken the occurrence of Preflight exceptions (and general processing errors) as an indicator of non-validity. This is a pretty crude approximation, because some of these exceptions may simply indicate a bug in Preflight (rather than a faulty PDF). One of the next steps will therefore be a more in-depth look at some of the PDFs that caused an exception.

Conclusions

These preliminary results show that policy-based assessment of PDF is possible using a combination of Apache Preflight and Schematron. However, dealing with font issues appears to be a particular challenge. Also, the lack of reliable tools to test for overall conformity to PDF (e.g. ISO 32000) is still a major limitation. Another limitation of this analysis is the lack of ground truth, which makes it difficult to assess the accuracy of the results.

Demo script and data downloads

For those who want to have a go at the analyses that I've presented here, I've created a simple demo script here. The raw output data of the Govdocs selected corpus can be found here. This includes all Preflight files, the Schematron output and the error counts. A download link for the Govdocs selected corpus can be found at the bottom of this blog post.

Acknowledgements

Apache Preflight developers Eric Leleu, Andreas Lehmkühler and Guillaume Bailleul are thanked for their support and prompt response to my questions and bug reports.

Related blog posts


  1. This was already suggested by this re-analysis of the Acrobat Engineering files that I did in November. 

  2. This selection was only based on file extension, which introduces the possibility that some of these files aren't really PDFs. 

  3. Errors that were reported for less than 1% of all analysed PDFs are not included in the table. 

  4. In addition to this, it seems that Preflight sometimes fails to detect fonts that are not embedded, so the number of PDFs with font issues may be even greater than this test suggests. 


Why can't we have digital preservation tools that just work?


One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!

FITS 0.8

Fast-forward 2.5 years: this week I saw the announcement of the latest FITS release. This got me curious, also because of the recent work on this tool as part of the FITS Blitz. So I downloaded FITS 0.8, installed it in a directory called c:\fits\ on my Windows PC, and then typed (while in directory f:\myData\):

f:\myData>c:\fits\fits

Instead of the expected help message I ended up with this:

The system cannot find the path specified.
Error: Could not find or load main class edu.harvard.hul.ois.fits.Fits

Hang on, I've seen this before ... don't tell me this is the same bug that I already reported 2.5 years ago? Well, turns out it is after all!

This got me curious about the status of the other tools that had similar problems in 2011, so I started downloading the latest versions of DROID, JHOVE2 and Fido. As I was on a roll anyway, I gave JHOVE a try as well (even though it was not part of the 2011 evaluation). The objective of the test was simply to run each tool and get some screen output (e.g. a help message), nothing more. I did these tests on a PC running Windows 7 with Java version 1.7.0_25. Here are the results.

DROID 6.1.3

First I installed DROID in a directory C:\droid\. Then I executed it using:

f:\myData>c:\droid\droid

This started up a Java Virtual Machine Launcher that showed this message box:

The Running DROID text document that comes with DROID says:

To run DROID on Windows, use the "droid.bat" file. You can either double-click on this file, or run it from the command-line console, by typing "droid" when you are in the droid installation folder.

So, no progress on this for DROID either, then. I was able to get DROID running by circumventing the launcher script like this:

java -jar c:\droid\droid-command-line-6.1.3.jar

This resulted in the following output:

No command line options specified

This isn't particularly helpful. There is a help message, but you only get to see it by giving the -h flag on the command line, and you only find out about the -h flag from the help message. Catch 22 anyone?

JHOVE2-2.1.0

After installing JHOVE2 in c:\jhove2\, I typed:

f:\myData>c:\jhove2\jhove2

This gave me 1393 (yes, you read that right: 1393!) Java deprecation warnings, each along the lines of:

16:51:02,702 [main] WARN  TypeConverterDelegate : PropertyEditor [com.sun.beans.editors.EnumEditor]
found through deprecated global PropertyEditorManager fallback - consider using a more isolated 
form of registration, e.g. on the BeanWrapper/BeanFactory! 

This was eventually followed by the (expected) JHOVE2 help message, and a quick test on some actual files confirmed that JHOVE2 does actually work. Nevertheless, by the time the tsunami of warning messages is over, many first-time users will have started running for the bunkers!

Fido 1.3.1

Fido doesn't make use of any launcher scripts any more, and the default way to run it is to use the Python script directly. After installing in c:\fido\ I typed:

f:\myData>c:\fido\fido.py

Which resulted in ..... (drum roll) ... a nicely formatted Fido help message, which is exactly what I was hoping for. Beautiful!

JHOVE 1.11

I installed JHOVE in c:\jhove\ and then typed:

f:\myData>c:\jhove\jhove 

Which resulted in this:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/harvard/hul/ois/j
hove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultCon
figFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.Co
nfigWindow
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more

Ouch!

Final remarks

I limited my tests to a Windows environment only, and results may well be better under Linux for some of these tools. Nevertheless, I find it nothing less than astounding that so many of these (often widely cited) preservation tools fail to even execute on today's most widespread operating system. Granted, in some cases there are workarounds, such as tweaking the launcher scripts, or circumventing them altogether. However, this is not an option for less tech-savvy users, who will simply conclude "Hey, this tool doesn't work", give up, and move on to other things. Moreover, this means that much of the (often huge) amounts of development effort that went into these tools will simply fail to reach its potential audience, and I think this is a tremendous waste. I'm also wondering why there's been so little progress on this over the past 2.5 years. Is it really that difficult to develop preservation tools with command-line interfaces that follow basic design conventions that have been ubiquitous elsewhere for more than 30 years? Tools that just work?

EDRMS across New Zealand’s Government – Challenges with even the most managed of records management systems!

A while back I wrote a blog post, MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system - without an Electronic Document and Records Management System (EDRMS). I also questioned if we were doing enough with EDRMS by way of collecting data. Following that blog we sought out the help of a student from the local region’s university to begin looking at EDRMS systems, to understand what metadata they collected, and how to collect additional ‘technical’ metadata using the tools often found in the digital preservation toolkit.   
 
Sarah McKenzie is a student at Victoria University. She has been working at Archives New Zealand on a 400-hour Summer Scholarship Programme that takes place during the university's summer break. Our department submitted three research proposals to the School of Engineering and Computer Science and out of them Sarah selected the EDRMS-focussed project. She began work in December and her scholarship is set to be completed mid-February. 
 
To add further detail, the title and focus of the project is as follows:
 
Mechanism to connect the tools in the digital preservation toolset to content management and database systems for metadata extraction and generation
 
Electronic document and records management systems (EDRMS) are the only legitimate mechanism for storing electronic documents with sufficient organisational context to develop archival descriptions, but they are not necessarily suited, at the point a record is created, to storing important technical information. Since they sit atop database management technology, we are keen to understand mechanisms for generating this technical metadata before ingest into a digital archive. 
 
We are keen to understand the challenge of developing this metadata from an EDRMS and DBMS perspective, where it is appreciated that mechanisms of access may vary from one system to another. In the DBMS context, technical metadata and contextual organisational metadata may be entirely non-existent. 
 
With interfaces to popular characterization tools biased towards the file system, it is imperative that we create mechanisms to use tools central to the preservation workflow in alternative ways. This project will ask students to develop an interface to EDRMS and DBMS systems that can support characterization using multiple digital preservation tools.
 
Metadata we’re seeking to gather includes format identification, characterisation reports along with other such data as SHA-1 checksums. Tools typical to the digital preservation workflow include DROID, JHOVE, FITS and TIKA.
The blog continues with Sarah writing for the OPF on behalf of Archives New Zealand. She provides some insight into her work thus far, and into her own methods of research and discovery within a challenging government environment.
 
EDRMS Systems
 
An EDRMS is a system for controlling and tracking the creation of documents from the point they are made, through publication, and possibly even destruction. They function as a form of version control for text documents, providing a way to accomplish a varying range of tasks in the management of documents. Some examples of tasks an EDRMS can perform are:
 
  • Tracking creation date
  • Changes and publication status
  • Keeping a record of who has accessed the documents.
 
EDRMS stores are the individual databases of documents that are maintained for management. They are usually in a proprietary format, and interfacing directly with them means having access to the appropriate Application Programming Interface (API) and Software Development Kit (SDK). In some cases these are merged together, requiring only one package. The actual structure of the store varies from system to system. Some use the directory structure that is part of the computer's file system and provide an interface from there. Others utilise a database for storing the documents.
 
Most EDRMS run on a client/server architecture.
 
Currently Archives New Zealand has dealt with three different EDRMS stores: 
 
  • IBM Notes (formerly called Lotus Notes)
 
‘Notes’ has a publicly available API and the latest version is built in Java, which makes it easier to use with the metadata extraction tools used in the digital preservation community (the majority of which I have found to be written in Java). There are many EDRMS systems, and it's simply not possible to code a tool enabling our preservation toolkit to interact with all of them without a comprehensive review of all New Zealand government agencies and their IT suites. 
 
A survey has been partially completed by Archives New Zealand. The large number of systems suggested a more focused approach in my research project, i.e. a particular instance of EDRMS, over multiple systems. 
 
Gathering Information on Systems in Use
 
Within New Zealand, the Office of the Government Chief Information Officer (OGCIO) had already conducted a survey of electronic document management systems currently used by government agencies. This survey did not cover all government agencies, but with 113 agencies replying it was considered a large enough sample to understand the most widely used systems across government. Out of the 113, some agencies did not provide any information, leaving only 69 cases where a form of EDRMS was explicitly named. These results were then turned into an alphabetical table listing:
 
  • EDRMS names
  • The company that created them
  • Any notes on the entry
  • A list of agencies using them
 
In addition to the information provided by the OGCIO survey, some investigative work was done in looking through the records of the Archives' own document management system to find any reference to other EDRMS in use across government. Other active EDRMS systems were uncovered.
 
For the purposes of this research it was assumed that if an agency has ever used a given EDRMS, it is still relevant to the work of Archives New Zealand, and it is considered ‘in-use’ until it is verified that no document stores from that particular system remain that have not been archived, migrated to a new format, or destroyed.
 
Obstacles were encountered in the process of converting the information into a reference table useful for this project. Some agencies provided the names of companies that built their EDRMS. This is understandable to some extent, since there has been a vanity in the software industry where companies name their flagship product after the company (or vice versa). However, in some cases it was difficult to discern what was meant because the company that made the original software had been bought out and their product was still being sold by the new owner under the same name – or the name had been turned into a brand for an arm of the new parent company which deals with all their EDRMS software (e.g. Autonomy Corporation has now become HP Autonomy, Hewlett-Packard's EDRMS branch). 
 
In addition, sometimes there were multiple software packages for document management with the same name. While it was possible to deduce what some of these names meant, it was not possible to find all of them. In these cases the name provided by the agency was listed with a note explaining it was not possible to conclude what they meant, and some suggestions for further inquiry. Vendor acquisitions were listed to provide a path through to newer software packages that possibly have compatibility with the old software, and also provide a way to quickly track down current owners of an older piece of software.
 
The varying needs of different agencies mean there is no one-size-fits-all EDRMS (e.g. a system designed for legal purposes may offer specialised features that one for general document handling wouldn't have). But since there has been no overarching standard for EDRMS, and it was assumed that agencies would make their own choices based on their business needs, there turned out to be a large number of systems in use, some of them obscure or old. The oldest system that could reasonably be verified as having been used was a 1990s version of a program originally created in the late 1980s, called Paradox. At the time the document mentioning it was written, this was in the process of being upgraded and the data migrated to a system called Radar, but there was no clear note of this having been completed.
 
At the time of writing it had been established that there were approximately 44 EDRMS ‘in-use’.
 
With 44 systems in use it was considered unfeasible to investigate the possibility of automating metadata extraction from all of them at this time. It was decided to set some boundaries for starting points. One starting point was to ask which EDRMS is the most used. According to the information gathered, the most common appeared to be Microsoft SharePoint, with perhaps 24 agencies using it, while Objective Corporation's Objective was associated with at least 12 agencies.
 
A second way to view this was to ask, ‘which systems have been recommended for use going forward?’ Archives New Zealand’s parent department The Department of Internal Affairs (DIA) has created a three-supplier panel for providing enterprise content management solutions to government agencies. Those suppliers are:
 
 
With two weeks remaining in the scholarship, and with work already completed to connect a number of digital preservation tools together in a middle abstraction layer that provides a broad range of metadata for our digital archivists, it was decided that testing of the tool (that is, connecting it to an EDRMS and extracting technical metadata) would best be done on a working, in-use EDRMS from the proposed DIA supplier panel, one that would continue to add value to Archives New Zealand's work in the future.
 
Getting Things Out of an EDRMS
 
The following tools were considered to be a good set to start examining extraction of metadata from files:
 
  • DROID
  • JHOVE
  • Tika
 
The tools have been linked together via a Java application that uses each tool's command-line interface to run them in turn. The files are identified first by DROID, and then each tool is run over the file to produce a collection of all available metadata in Comma Separated Values (CSV) format. This showed that the tools extract information in different ways (date formatting, for instance, is not consistent), and that some tools can read data that others cannot; for example, due to a character encoding issue the Title, Author and Creator fields of a particular PDF were not readable in JHOVE but were read correctly by Tika and NLMET, while JHOVE still extracts information those tools do not.
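
To make the wiring a little more concrete, here is a minimal, hypothetical sketch (not the project's actual code) of running one command-line tool from Java and capturing its standard output, which is the basic building block of such a unifier. The jar names and paths in main() are placeholders; the -m and -d flags are Tika's metadata and detect flags.

   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import java.util.ArrayList;
   import java.util.List;

   public class ToolRunner {
      // Run a single command-line tool against one file and return its stdout lines.
      public static List<String> run(String... commandAndArgs) throws Exception {
         ProcessBuilder pb = new ProcessBuilder(commandAndArgs);
         pb.redirectErrorStream(true);   // fold stderr into stdout for simplicity
         Process p = pb.start();
         List<String> lines = new ArrayList<String>();
         BufferedReader reader = new BufferedReader(
               new InputStreamReader(p.getInputStream()));
         String line;
         while ((line = reader.readLine()) != null) {
            lines.add(line);
         }
         reader.close();
         p.waitFor();
         return lines;
      }

      public static void main(String[] args) throws Exception {
         String file = args[0];
         // Hypothetical invocations; real jar paths depend on the local install.
         List<String> tikaMetadata = run("java", "-jar", "tika-app.jar", "-m", file);
         List<String> tikaType     = run("java", "-jar", "tika-app.jar", "-d", file);
         // ...each tool's output would then be merged into one CSV row per file...
      }
   }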
 
When a tool sends its output to standard out it's a simple matter of working with the text output as it's fed back to the calling function from the process. In some cases a tool produces an output file which had to be read back in. In the case of the NLMET, a handler for the XML format had to be built. Since the XML schema had separate fields for date and time of creation and modification, the opportunity was taken to collate those into two single date-time fields so they would better fit into a schema.
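
As a small illustration of that collation step (a sketch only, assuming the date and time arrive as ISO-formatted strings; the real field names and formats in the NLMET output may differ):

   import java.time.LocalDate;
   import java.time.LocalDateTime;
   import java.time.LocalTime;

   public class DateCollator {
      // Combine separate date and time fields into a single date-time value.
      public static LocalDateTime collate(String dateField, String timeField) {
         return LocalDateTime.of(LocalDate.parse(dateField),   // e.g. "2014-02-10"
                                 LocalTime.parse(timeField));  // e.g. "14:32:05"
      }
   }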
 
The goal with the collated outputs is to have domain experts check over them to verify which tools produce the information they want, and once that is done a schema for which piece of data to get from which tool can be introduced to the program so it can create and populate the Archives metadata schema for the files it analyses.
 
The ideal goal for this tool is to connect it to an EDRMS system via an API layer, enabling the extraction of metadata from the files within a store without having to export the files. For that purpose the next stage in this research is to set up a test example of one of DIA’s proposed EDRMS solutions and try to access it with the tool unifier. It is hoped that this will provide an approach that can be applied to other document management systems moving forward.
 

SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine


The Web is constantly evolving over time. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page.

 

We want to underline that the aim is not to compare two different web pages (although the tool can also do that), but rather two versions of the same web page.

An efficient change detection approach is important for several reasons:

 

  • Crawler optimization

  • Discovering new crawl strategies e.g. based on patterns

  • Quality assurance for crawlers, for example, by comparing the live version of the page with the just crawled one.

  • Detecting format obsolescence as technologies evolve, for example by checking whether web pages render identically when viewed with different versions of a browser or with different browsers

  • Archive maintenance: operations such as format migration can change how archived versions render.

Pagelyzer is a tool built around a supervised framework that decides whether two web page versions are similar or not. It takes as input two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based). If the browser types are not set, it uses Firefox by default.

It is based on two different technologies:

 

1 – Web page segmentation (let's keep the details for another blog post)

2 – Supervised Learning with Support Vector Machine (SVM).

 

In this blog, I will try to explain simply (without any equations) what SVM does, specifically for Pagelyzer. You have two URLs, let's say url1 and url2, and you would like to know whether they are similar (1) or dissimilar (0).

 

You calculate the similarity (or distance) as a vector, based on the comparison type. If it is image-based, your vector will contain features related to image similarities (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarities (e.g. Jaccard distance for links, images and words). To better explain how it works, let's assume that we have two dimensions: SIFT similarity and HSV similarity.
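
As a toy illustration of one such content-based feature (not Pagelyzer's actual code), a Jaccard similarity over the sets of link URLs extracted from the two versions could be computed like this:

   import java.util.HashSet;
   import java.util.Set;

   public class JaccardSimilarity {
      // |A ∩ B| / |A ∪ B| over, for example, the sets of link URLs in two page versions.
      public static double jaccard(Set<String> a, Set<String> b) {
         if (a.isEmpty() && b.isEmpty()) {
            return 1.0;   // two empty pages are trivially identical
         }
         Set<String> intersection = new HashSet<String>(a);
         intersection.retainAll(b);
         Set<String> union = new HashSet<String>(a);
         union.addAll(b);
         return (double) intersection.size() / union.size();
      }
   }

The closer the value is to 1, the more the two versions share the same links; the same idea applies to the sets of words or image URLs.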

 

To make your system learn, you first provide it with annotated data. In our case, we need a list of URL pairs <url1,url2> annotated manually as similar or not similar. For Pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). You train your system with a part of this dataset.

 

Let's start training:

First, you put all your vectors in the input space. As this data is annotated, you know which pairs are similar (in green) and which are dissimilar (in red).

You find the optimal decision boundary (hyperplane) in input space. Anything above the decision boundary should have label 1 (similar). Similarly, anything below the decision boundary should have label 0 (dissimilar).

 

Let's classify:

 

Your system is intelligent now! When you have a new pair of URLs without any annotation, you can say whether they are similar or not based on the decision boundary.

The pair of URLs in blue will be considered dissimilar, and the one in black will be considered similar by Pagelyzer.
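
To make the classification step concrete, the sketch below shows what the decision amounts to once a linear boundary has been learned: compute a weighted sum of the features and check its sign. The weights and bias are invented numbers, purely for illustration; the real values come out of the SVM training described above.

   public class PairClassifier {
      // Illustrative weights for a two-dimensional feature vector
      // (SIFT similarity, HSV similarity) and a bias term, as if learned during training.
      private static final double W_SIFT = 2.1;
      private static final double W_HSV  = 1.4;
      private static final double BIAS   = -1.8;

      // Returns true ("similar") if the pair falls on the positive side of the boundary.
      public static boolean isSimilar(double siftSimilarity, double hsvSimilarity) {
         double score = W_SIFT * siftSimilarity + W_HSV * hsvSimilarity + BIAS;
         return score >= 0;
      }

      public static void main(String[] args) {
         System.out.println(isSimilar(0.9, 0.8));   // well above the boundary: similar
         System.out.println(isSimilar(0.1, 0.2));   // well below the boundary: dissimilar
      }
   }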

When you choose different types of comparison, you choose different types of features and dimensions. The current version of Pagelyzer uses an SVM trained on 202 pairs of web pages provided by IMF, of which 147 are in the positive class and 55 in the negative class. As it is a supervised system, increasing the size of the training set should lead to better results.

An image to show what happens when you have more than two dimensions:

 

From www.epicentersoftware.com

 

 

References

Structural and Visual Comparisons for Web Page Archiving
M. T. Law, N. Thome, S. Gançarski, M. Cord
12th edition of the ACM Symposium on Document Engineering (DocEng) 2012

 

Structural and Visual Similarity Learning for Web Page Archiving
M. T. Law, C. Sureda Gutierrez, N. Thome, S. Gançarski, M. Cord
10th workshop on Content-Based Multimedia Indexing (CBMI) 2012

 

Block-o-Matic: a Web Page Segmentation Tool and its Evaluation

Sanoja A., Gançarski S.

BDA. Nantes, France. 2013. http://hal.archives-ouvertes.fr/hal-00881693/

 

Yet another Web Page Segmentation Tool

Sanoja A., Gançarski S.

Proceedings iPRES 2012. Toronto. Canada, 2012

 

Understanding Web Pages Changes.

Pehlivan Z., Saad M.B. , Gançarski S.

International Conference on Database and Expert Systems Applications DEXA (1) 2010: 1-15

 

 

 


A Nailgun for the Digital Preservation Toolkit


Fifteen days was the estimate I gave for completing an analysis on roughly 450,000 files we were holding at Archives New Zealand. Approximately three seconds per file for each round of analysis:

   3 x 450,000 = 1,350,000 seconds
   1,350,000 seconds = 15.625 days

My bash script included three calls to Java applications. Apache Tika (1.3 at the time) was called twice, running the -m and -d flags:

   -m or --metadata Output only metadata
   -d or --detect Detect document type

It also made a call to Jhove 1.11 in standard mode. The script also calculates SHA-1 for de-duplication purposes and to match Archives New Zealand's chosen fixity standard, computes a V4 UUID per file, and outputs the result of the Linux file command in two separate modes: standard, and with the -i flag to attempt to identify the MIME type.

Each application receives a path to a single file as an argument from a directory manifest. The script outputs five CSV files that can be further analysed.

The main function used in the script is as follows:

   dp_analysis ()
   {
      FUID=$(uuidgen)
      DIRN=$(dirname "$file")
      BASN=$(basename "$file")

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"file-5.11"'\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}file-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-md"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}tika-md-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-type"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}tika-type-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"jhove-1_11"'\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}jhove-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"sha-1-8.20"'\t' $(sha1sum -b "$file") >> ${LOGNAME}sha-one-analysis.log
   }

What I hadn't anticipated was the expense of starting the Java Virtual Machine (JVM) three times each loop, 450,000 times. The performance hit is prohibitive, so I immediately set out to find a solution: either cut down the number of tools I was using, or figure out how to avoid starting the JVM each time. Fortunately a Google search led me to a solution, and a phrase, that I had heard before: Nailgun.

It has been mentioned on various forums, including comments on various OPF blogs, and it is even found in the Fits release notes. The phrase resonated, and it turned out that Nailgun provides a simple and accessible approach to doing exactly what we need.

One of the things that we haven't seen yet is a guide on using it within the digital preservation workflow. I'll describe how to make best use of this tool, and try and demonstrate its benefits during the remainder of this blog.

For testing purposes we will be generating statistics on a laptop that has the following specification:

   Product: Acer Aspire V5-571PG (Aspire V5-571PG_072D_2.15)
   CPU:     Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
   Width:   64 bits
   Memory:  8GiB

   OS:       Ubuntu 13.10
   Release:  13.10
   Codename: saucy

   Java version "1.7.0_21"
   Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
   Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

JVM Startup Time

First, let's demonstrate the startup cost of the JVM. If we take two functionally equivalent programs, the first in Java and the second in C++, we can look at the time taken to run them 1000 times consecutively.

The purpose of each application is to run and then exit with a return code of zero.

Java: SysExitApp.java:

   public class SysExitApp {
      public static void main(String[] args) {
         System.exit(0);
      }
   }

C++: SysExitApp.cpp:

   int main()
   {
      return(0);
   }

The script to run both, and output the timing for each cycle, is as follows:

   #!/bin/bash

   time (for i in {1..1000}
   do
      java -jar SysExitApp.jar
   done)

   time (for i in {1..1000}
   do
      ./SysExitApp.bin
   done)

The source code can be downloaded from GitHub. Further information about how to build the C++ and Java applications is available in the README file. The output of the script is as follows:

   real 1m26.898s
   user 1m14.302s
   sys  0m13.297s

   real 0m0.915s
   user 0m0.093s
   sys  0m0.854s

With the C++ binary, the average time taken per execution is 0.915ms (0.915 seconds of real time over 1000 runs). For the Java application this rises to 86.898ms on average. One can reasonably put the difference down to the cost of the JVM startup.

Both C++ and Java are compiled languages. C++ compiles down to machine code: instructions that can be executed directly by the CPU (Central Processing Unit). Java compiles down to bytecode. Bytecode lends itself to portability across many devices, with the JVM providing an abstraction layer that handles differences in hardware configuration before interpreting the bytecode down to machine code.

A good proportion of the tools in the digital preservation toolkit are implemented in Java, e.g. DROID, Jhove, Tika, Fits. As such, we currently have to take this performance hit, and optimizations must focus on handling that effectively.

Enter Nailgun

Nailgun is a client/server application that removes the overhead of starting the JVM by running a single, long-lived JVM in the server, within which all command-line based Java applications can run. The Nailgun client passes each invocation to that server: the command line (and stdin) one normally associates with a particular application, e.g. running Tika with the -m flag and passing it a reference to a file. The application runs on the server, and Nailgun directs its stdout and stderr back to the client, which then outputs them to the console.

With the exception of the command line being executed within a call to the Nailgun client, behaviour remains consistent with that of the standalone Java application. The Nailgun background information page provides a more detailed description of the process.

How to build Nailgun

Before running Nailgun it needs to be downloaded from GitHub and built using Apache Maven to build the server, and the GNU Make utility to build the client. The instructions in the Nailgun README describe how this is done.

How to start Nailgun

Once compiled the server needs to be started. The command line to do this looks like this:

   java -cp /home/digital/dp-toolkit/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar -server com.martiansoftware.nailgun.NGServer

The classpath needs to include the path to the Nailgun server Jar file. The command to start the server can be expanded to include any further application classes you want to run. There are other ways it can be modified as well. For further information please refer to the Nailgun Quick Start Guide. For simplicity we start the server using the basic startup command.

Loading the tools (Nails) into Nailgun

As mentioned above, the tools you want to run can be loaded into Nailgun at startup. For my purposes, and to provide a useful and simple overview for all, I found it easiest to load them via the client application.

Applications loaded into Nailgun need to have a main class. It is possible to find out if the application has a main class by opening the Jar in an archive manager capable of opening Jars, such as 7-Zip. Locate the META-INF folder, and within that the MANIFEST.MF file. This will contain a line similar to this example from the Tika Jar’s MANIFEST.MF in tika-app-1.5.jar.

   Main-Class: org.apache.tika.cli.TikaCLI

Confirmation of a main class means that we can load Tika into Nailgun with the command:

   ng ng-cp /home/digital/dp-toolkit/tika-1.5/tika-app-1.5.jar

Before working with our digital preservation tools, we can try running the Java application created to baseline the JVM startup time alongside the functionally comparable C++ application.

MANIFEST.MF within the SysExitApp.jar file reads as follows:

   Manifest-Version: 1.0
   Created-By: 1.7.0_21 (Oracle Corporation)
   Main-Class: SysExitApp

As it has a main class we can load it into the Nailgun server with the following command:

   ng ng-cp /home/digital/Desktop/dp-testing/nailgun-timing/exit-apps/SysExitApp.jar

The command ng-cp tells Nailgun to add it to its classpath. We provide an absolute path to the Jar we want to execute. We can then call its main class from the Nailgun client.

Calling a Nail from the Command Line

Following that, we want to call our application from within the terminal. Previously we have used the command:

   java -jar SysExitApp.jar

This calls Java directly and thus the JVM. We can replace this with a call to the Nailgun client and our application's main class:

   ng SysExitApp

We don't expect to see any output at this point: provided no error occurs, it will simply return a new input line on the terminal. On the server, however, we will see the following:

   NGSession 1: 127.0.0.1: SysExitApp exited with status 0

And that's it. Nailgun is up and running with our application!

We can begin to see the performance improvement gained by removing the expense of the JVM startup when we execute this command using our 1000 loop script. We simply add the following lines:

   time (for i in {1..1000}
   do
      ng SysExitApp
   done)

This generates the output:

   real 0m2.457s
   user 0m0.157s
   sys  0m1.312s

Compare that to running the Jar, and compiled binary files before:

   real 1m26.898s
   user 1m14.302s
   sys  0m13.297s

   real 0m0.915s
   user 0m0.093s
   sys  0m0.854s

It is not as fast as the compiled C++ code but it represents an improvement of well over a minute compared to calling the JVM each loop.

The Digital Preservation Toolkit Comparison Script

Up and running we can now baseline Nailgun with the script used to run our digital preservation analysis tools.

We define two functions: one that calls the Jars we want to run without Nailgun, and the other to call the same classes, with Nailgun:

   dp_analysis_no_ng ()
   {
      FUID=$(uuidgen)
      DIRN=$(dirname "$file")
      BASN=$(basename "$file")

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"file-5.11"'\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-md"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}${TIKAMDLOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-type"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}${TIKATYPELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"jhove-1_11"'\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}${JHOVELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"sha-1-8.20"'\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
   }

   dp_analysis_ng ()
   {
      FUID=$(uuidgen)
      DIRN=$(dirname "$file")
      BASN=$(basename "$file")

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"file-5.11"'\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-md"'\t' $(ng org.apache.tika.cli.TikaCLI -m "$file") >> ${LOGNAME}${TIKAMDLOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-type"'\t' $(ng org.apache.tika.cli.TikaCLI -d "$file") >> ${LOGNAME}${TIKATYPELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"jhove-1_11"'\t' $(ng Jhove "$file") >> ${LOGNAME}${JHOVELOG}  

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"sha-1-8.20"'\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
   }

Before we define the functions, we load the applications into Nailgun with the following commands:

   #Load JHOVE and TIKA into Nailgun CLASSPATH
   $(ng ng-cp ${JHOVE_HOME}/bin/JhoveApp.jar)
   $(ng ng-cp ${TIKA_HOME}/tika-app-1.5.jar)

The complete script can be found on GitHub with more information in the README file.

Results

For the purpose of this blog I reworked the Open Planets Foundation Test Corpus and have used a branch of that to run the script across. There are 324 files in the corpus with numerous different formats. The script produces the following results:

   real 13m48.227s
   user 26m10.540s
   sys  0m59.861s

   real 1m32.801s
   user 0m4.548s
   sys  0m16.847s

Stderr is redirected to a file called errorlog.txt using the ‘2>’ syntax, to enable me to capture the output of all the tools and to avoid any expense of the tools printing to the screen. Errors shown in the log relate to the tools' ability to parse certain files in the corpus rather than to Nailgun. The errors should be reproducible with the same format corpus and tool set.

There is a marked difference in performance when running the script with Nailgun and without. Running the tools as-is we find that each pass takes approximately 2.52 seconds per file on average.

Using Nailgun this is reduced to approximately 0.28 seconds per file on average.

Conclusion

The timing results collected here will vary quite widely on different systems, and even on the same system. The disparity between starting a new JVM for each call and running the applications under Nailgun should, however, be just as stark wherever the tests are run.

While I hope this blog provides a useful Nailgun tutorial, the concern I have, after working in anger with the tools we talk about in the digital preservation community on a daily basis, is understanding what smaller institutions with smaller IT departments, and potentially fewer IT capabilities, are doing, and whether they are even able to make use of the tools out there given the overheads described.

It is possible to throw more technology and more resources at this issue but it can't be expected that this will always be possible. The reason I sought this workaround is that I can't see that capability being developed at Archives New Zealand without significant time and investment, and that capability can't always be delivered in short-order within the constraints of working within government. My analysis, on a single collection of files, needs to be complete within the next few weeks. I need tools that are easily accessible and far more efficient to be able to do this. 

It is something I'll have to think about some more. 

Nailgun gives me a good short-term solution, and hopefully this blog opens it up as a solution that will prove useful to others too.

It will be interesting to learn, following this work, how others have conquered similar problems, or equally interesting, if they are yet to do so.  

--

Notes:

Loops: I experimented with various loops for recursing the directories in the opf-format-corpus expecting to find differences in performance within each. Using the Linux time command I was unable to find any material difference in either loop. The script used for testing is available on GitHub. The loop executes a function that calls two Linux commands, ‘sha1sum’ and ‘file’. A larger test corpus may help to reveal differences in either approach. I opted to stick with iterating over a manifest as this is more likely to mirror processes within our organization.

Optimization: I recognize a certain naivety in my script, which was produced to collect quick and dirty results from a test set that I only have available for a short period of time. The first surprise running the script was the expense of the JVM startup. After finding a workaround for that I now need to look at other optimizations to continue to approach the analysis this way. Failing that, I need to understand from others why this approach might not be appropriate, and/or sustainable. Comments and suggestions along those lines as part of this blog are very much appreciated.

And Finally...

All that glisters is not gold: Nailgun comes with its own overhead. Running the tool on a server at work with the following specification:

   Product: HP ProLiant ML310 G3
   CPU:     Intel(R) Pentium(R) 4 CPU 3.20GHz
   Width:   64 bits
   Memory:  5GiB
   
   OS:        Ubuntu 10.04.4 LTS
   Release:   10.04
   Codename:  lucid
   
   Java version "1.7.0_51"
   Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
   Java HotSpot(TM) Client VM (build 24.51-b03, mixed mode)

We find it running out of heap space around the 420,000th call to the server, with the following message:

   java.lang.OutOfMemoryError: Java heap space

If we look at the system monitor we can see that the server has maxed out the amount of RAM it can address. The amount of memory it is using grows with each call to the server. I haven't a mechanism to avoid this at present, other than chunking the file set and restarting the server periodically. Users adopting Nailgun might want to take note of this issue up-front. Throwing memory at the problem will help to some extent but a more sustainable solution is needed, and indeed welcomed. This might require optimizing Nailgun, or instead, further optimization of the digital preservation tools that we are using.

 

SCAPE Webinar: ToMaR – The Tool-to-MapReduce Wrapper: How to Let Your Preservation Tools Scale


Overview

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.
ToMaR offers the possibility to use existing command-line tools and Java applications in Hadoop’s distributed environment in much the same way as on a desktop computer. By utilizing SCAPE tool specification documents, ToMaR allows users to specify complex command-line patterns as simple keywords, which can be executed on a computer cluster or a single machine. ToMaR is a generic MapReduce application which does not require any programming skills.

This webinar will introduce you to the core concepts of Hadoop and ToMaR and show you by example how to apply it to the scenario of file format migration.

Learning outcomes

1. Understand the basic principles of Hadoop
2. Understand the core concepts of ToMaR
3. Apply knowledge of Hadoop and ToMaR to the file format migration scenario

Who should attend?

Practitioners and developers who are:

• dealing with command line tools (preferably of the digital preservation domain) in their daily work
• interested in Hadoop and how it can be used for binary content and 3rd-party tools

Session Lead: Matthias Rella, Austrian Institute of Technology

Time: 10:00 GMT / 11:00 CET

Duration: 1 hour

Date: Friday, 21 March 2014
Location: United Kingdom

A Tika to ride; characterising web content with Nanite


This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid  
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop, to enable it to directly process ARC and WARC files from HDFS without an intermediate processing step.  The initial part of a Nanite-Hadoop run is a test to check that the input files are valid gz files.  This is very quick (it takes seconds) and ensures that there are no invalid files that could crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
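
A cheap check along those lines can be done by simply trying to open each file as a gzip stream and reading from it, something like the sketch below (a hedged illustration only, not necessarily how Nanite-Hadoop implements it):

   import java.io.FileInputStream;
   import java.io.IOException;
   import java.util.zip.GZIPInputStream;

   public class GzCheck {
      // Returns true if the file at least starts with a readable gzip header.
      public static boolean looksLikeGzip(String path) {
         try (FileInputStream fis = new FileInputStream(path);
              GZIPInputStream gz = new GZIPInputStream(fis)) {
            gz.read();   // constructing the stream and reading a byte forces header parsing
            return true;
         } catch (IOException e) {
            return false;
         }
      }
   }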

We have been working on Nanite to add different characterisation libraries and to improve their coverage.  As the tools used are all Java, or use native library calls, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster.  This is less than 1ms/file.  Libraries can be turned on/off relatively easily by editing the source or FormatProfiler.properties in the jar.

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to have been files in that corpus that were corrupt or otherwise broken that would cause crashes in Tika or its dependencies. 

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that the existence of one bad file of a particular format meant that no characterisation of that format could be carried out.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc were caught.  Then I created a TimeoutParser that parsed the files in a background Thread and forcibly stopped the Tika parser after a time limit had been exceeded.  This worked ok, however, it made use of Thread.stop() – a deprecated API call to stop a Java Thread.  Use of this API call is thoroughly not recommended as it may corrupt the internal state of the JVM or produce other undesired effects.  Details about this can be read in an issue on the Tika bug tracker.  Since I did not want to risk a corruption of the JVM I did not pursue this further. 
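
Reconstructed from the description above (so a sketch rather than the actual Nanite code), the TimeoutParser idea looks roughly like this: run the parse in its own thread, wait up to a time limit, and forcibly stop the worker if it is still running, which is exactly where the deprecated Thread.stop() comes in.

   import java.io.InputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.sax.BodyContentHandler;

   public class TimeoutParserSketch {
      // Parse the stream with Tika, giving up (and killing the worker) after timeoutMillis.
      @SuppressWarnings("deprecation")
      public static Metadata parseWithTimeout(final InputStream stream, long timeoutMillis)
            throws InterruptedException {
         final Metadata metadata = new Metadata();
         Thread worker = new Thread(new Runnable() {
            public void run() {
               try {
                  new AutoDetectParser().parse(stream, new BodyContentHandler(-1),
                                               metadata, new ParseContext());
               } catch (Exception e) {
                  // a failed parse simply leaves the metadata incomplete
               }
            }
         });
         worker.start();
         worker.join(timeoutMillis);
         if (worker.isAlive()) {
            worker.stop();   // deprecated, and the root of the problem described above
         }
         return metadata;
      }
   }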

I should note that it has subsequently been suggested that an alternative to using Thread.stop() is to just leave the stuck thread alone for the JVM to deal with and create a new Thread.  This is a valid method of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue, linked above, I produced a library that abstracted a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read, although such errors should be logged by Nanite in the Hadoop Reducer output.

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by the length of time that process is used for.  Aside from the cost of transferring files to the Tika-server, the overhead is a larger jar file, longer initial start-up time for Mappers and additional time for restarts of the Tika-server on failed files.  Given that the average runtime per WARC is slightly over 5 minutes, the few additional seconds added by using a process-isolated Tika are not a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with the file, Tika detect() is first run on the file and that mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST as it would be if we called them directly from the Java code.

Another approach could have been to use Nailgun as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them) and if a Tika parser crashed or caused the JVM to hang then it may corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’ experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 cpu/map slot per node); the results are as follows:

   Identification tools used:                    Nanite-core (Droid), Tika detect() (mimetype only), ProcessIsolatedTika parsers
   WARC files:                                   1000
   Total WARC size:                              59.4GB (63,759,574,081 bytes)
   Total files in WARCs (# input records):       7,612,660
   Runtime (hh:mm:ss):                           03:06:24
   Runtime/file:                                 1.47ms
   Throughput:                                   19.1GB/hour
   Total Tika parser output size (compressed):   765MB (801,740,734 bytes)
   Tika parser failures/crashes:                 15
   Misc failures:                                Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into c3po but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed pdf version detection within the pdf parser.

As an aside, if FITS offered a REST interface then the ProcessIsolatedTika code could easily be modified to replace Tika with FITS. This is worth considering, if there were interest and someone were to create such a REST interface.

Apologies for the puns.

CSV Validator - beta releases


For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against user-defined schemas.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under the Mozilla Public Licence version 2.0.

For more details, links to the source code repository, release code on Maven Central, instructions and schema specification, see http://digital-preservation.github.io/csv-validator/

Feedback is welcome.  When we make the formal version 1.0 release there will be a fuller blog post on The National Archives blog.
