
Browser Screenshot Comparison Tool


The browser-shots tool is developed by Internet Memory in the context of the SCAPE project, as part of the preservation and watch (PW) sub-project. The goal of this tool is to perform automatic visual comparisons in order to detect rendering issues in archived Web pages.

From the tools developed in the scope of the project (in the preservation components sub-project), we selected the MarcAlizer tool, developed by UPMC, which performs the visual comparison between two web pages. In a second phase, the renderability analysis will also include a structural comparison of the pages, which is implemented by the new Pagelyser tool.
Since the core renderability analysis is thus performed by an external tool, the overall performance of the browser-shots tool will be tied to this external dependency. We will keep integrating the latest releases from the MarcAlizer development, as well as updates resulting from more specific training of the tool.

The detection of rendering issues is done in the following three steps:

    1. Screenshots of the Web pages are taken automatically using the Selenium framework, for different browser versions (a minimal sketch follows after this list).
    2. Pairs of screenshots are compared visually using the MarcAlizer tool (recently replaced by the PageAlizer tool, which also includes a structural comparison).
    3. Rendering issues in the Web pages are detected automatically, based on the comparison results.
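
As a rough illustration of step 1, here is a minimal sketch using Selenium's Python bindings; the URL list, browser choice and output file names are illustrative assumptions, not the project's actual code:

    from selenium import webdriver

    urls = ["http://example.org/"]          # pages to capture (illustrative)

    for i, url in enumerate(urls):
        driver = webdriver.Firefox()        # one driver per browser version under test
        try:
            driver.get(url)
            # Save a PNG screenshot of the rendered page
            driver.save_screenshot("firefox_%d.png" % i)
        finally:
            driver.quit()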

Initial Implementation

The browser-shots tool is developed as a wrapper application that orchestrates the main building blocks (Selenium instances and MarcAlizer comparators) and performs large-scale experiments on archived Web content.
The browser versions currently supported and tested are: Firefox (all available releases), Chrome (latest version only), Opera (official versions 11 and 12) and Internet Explorer (still to be fixed).

The initial, sequential implementation of the tool consists of several Python scripts running on a Debian Squeeze (64-bit) platform. This version of the tool was released on GitHub and we received some valuable feedback from the sub-project partners:
https://github.com/crawler-IM/browser-shots-tool

For the preliminary rounds of tests, we deployed the browser-shots tool on three nodes of IM's cluster and performed automated comparisons for around 440 pairs of URLs. The average processing time was about 16 seconds per pair of Web pages. These results showed that the existing solution is suitable for small-scale analysis only. Most of the processing time is actually spent on IO operations and disk access to the binary screenshot files. Taking the screenshots proved to be very time consuming, so if this solution is to be deployed at large scale, it needs to be further optimised and parallelised.

These results also showed that a serious performance bottleneck is the passing of intermediate data between the modules. More precisely, materialising the screenshots as binary files on disk is a very time-consuming operation, especially when considering large-scale experiments on a large number of Web pages.

We therefore have to move to a different implementation of the tool, which will use an optimised version of MarcAlizer. The Web page screenshots taken with Selenium will be passed directly to the MarcAlizer comparator using streams, and the new implementation of the browser-shots tool will be a MapReduce job running on a Hadoop cluster. Based on this framework, the current rounds of tests can be extended to a much higher number of URL pairs.

In the second round the browser-shot comparison tool is implemented as a MapReduce job to parallelise the processing of the input. The input in this case is a list of URLs together with a list of browser versions that are used to render the screenshots. Note the difference from the former version, where the input consisted of pairs of URLs that were rendered using one common browser version and then compared.
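
Conceptually, the map phase fans each input URL out over the list of browser versions, so that the individual (URL, browser) screenshot tasks can be processed in parallel. Below is a minimal Hadoop Streaming-style mapper sketch in Python; it is illustrative only, the actual tool is implemented as a Java MapReduce job around MarcAlizer:

    import sys

    BROWSERS = ["firefox-3.6", "firefox-25", "opera-12"]   # illustrative browser list

    # Mapper: read one URL per line from stdin, emit one task per (URL, browser) pair.
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        for browser in BROWSERS:
            print("%s\t%s" % (url, browser))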

Optimizations

[Figure: LA Times browser screenshot, resized]

In order to achieve acceptable running times, a newer version of the MarcAlizer comparison tool was integrated. The major improvement is the possibility of feeding the tool with in-memory objects instead of pointers to files on disk. This improvement and the elimination of unnecessary IO operations lead to the following average times for the individual steps of the shot comparison:

1) browser shot acquisition: ~2 s

2) MarcAlizer comparison: ~2 s

Note that the time to render and capture a screenshot using a browser depends mainly on the size of the rendered page; for instance, capturing a wsj.com page takes about 15 s on the IM machine, where the resulting PNG image is several MB.

MapReduce

As you can see, the operations on the screenshots are very expensive (remember that the list of tested browsers can be very long, and one screenshot operation is needed for each browser). Therefore we need to parallelise the tool across several machines working on the input list of URLs. To facilitate this, we have employed Hadoop MapReduce, which is part of the SCAPE platform.

The result of the comparisons is then materialised in a set of XML files, where each file represents the comparison of one pair of browser shots. To alleviate the problem of having large numbers of small files, these files are automatically bundled together into one ZIP file. A C3P0 adapter has been implemented by TU Wien so the result can be processed and passed on to Scout.

Tests

At the moment, we have run preliminary tests on the currently supported browser versions - Firefox and Opera. The list of URLs to test is about 13,000 entries long. We are using the IM central instance for these tests, which currently has two worker nodes (so we can roughly halve the processing time by running in parallel).

 

 


FIDO News


Here's a little news bulletin about FIDO, the OPF's open source file format identification tool.

It seems that the use of FIDO has grown over the last few months. I am getting responses by e-mail and through the GitHub issue tracker from all over the world, ranging from requests for help and suggestions for improvement to even some bug fixes. Thanks, and please keep them coming!

RECENT CHANGES

Most important change currently is the versioning schema of tagged releases.
If you forked FIDO or are watching the tags for updates, please note that the versioning schema has changed from [major].[minor].[patch] to [major].[minor].[patch]-[PRONOM version number].
The reason for this is that from time to time there is a new PRONOM version available but no code changes to commit. As it is bad practice to update a tagged release, this was the only reasonable way to fix this.

For example, release 1.3.1 has PRONOM version 70 distributed with it and is tagged '1.3.1-70'.
If a PRONOM update is available but there are no code changes, the consecutive tag will be '1.3.1-71'. Please note that this is only reflected in release tags; FIDO itself will still report only its own version number, without the PRONOM version number.
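
As a small illustration (not part of FIDO itself), a tag following this schema splits cleanly into its two components:

    def split_release_tag(tag):
        """Split a release tag like '1.3.1-70' into (FIDO version, PRONOM version)."""
        fido_version, pronom_version = tag.rsplit("-", 1)
        return fido_version, int(pronom_version)

    print(split_release_tag("1.3.1-70"))   # ('1.3.1', 70)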

Currently I am also working on the FIDO usage guide. It is still a work in progress, but it could help you on your way using FIDO.

FUTURE

I'll be the first to admit that FIDO is still far from being "the perfect file format identification tool". Although it is quite stable and many things have been improved or fixed lately, such as the handling of files passed to STDIN or the possibility to use only the official PRONOM signatures, it still needs improvement on many levels.

Recently Carl Wilson (OPF technical lead) and I started thinking about what needs to change for FIDO version 2. This second generation of FIDO will not differ much in functionality from the current version 1 generation, but the way we plan on doing things will make a big difference. For starters, we will be creating unit tests for every function of FIDO. The second important thing is unit testing of individual PRONOM signatures and PRONOM container signatures. With each update of PRONOM we will run unit tests using corpora files.

But the biggest change of all will be the way we build FIDO. It will no longer be just "a script", but rather an API. The "fido.py" script will then merely function as a prototype of how to build your "own" FIDO into your workflow systems. It will also no longer output to STDOUT and STDERR but will return results in a more Pythonic way. You will read more about all this in a later post.

In the meanwhile I (with a little help from you) will continue improving version 1 where possible. If you have any questions or suggestions about any of the above, please let me know.

FIDO @ Open Planets Github
FIDO releases @ Open Planets Github
FIDO usage guide

POSTPONED Digital Preservation Without Tears


The ‘Digital Preservation Without Tears’ Mash-up will appeal to collection owners and developers.  The programme offers two connected strands – a hack and a sprint.

  • In the hack, developers will have two days to develop, test and enhance practical tools for digital preservation. Collection owners will be invited to bring problem elements of their digital collections for analysis using the latest digital forensic and characterisation tools.  This will help the collection owners develop practical workflows for management and preservation while helping developers spot and refine solutions that will enable better tools.
  • In the sprint, collection owners will examine current thinking on digital preservation policy and planning in their organisations.  Collections owners will present their own digital preservation policies and will be invited to assess these against each other and against emerging good practice, providing a managed environment for policy development and peer review.  Collection owners will then be invited to pool their wisdom to create a Digital Preservation Policy Building Toolkit that can be shared.

This mashup will:

  • Provide a forum for practical problem solving for analysis of digital collection
  • Provide a forum for discussion, review and development of digital preservation policy
  • Bring together developers and collection owners from across the DPC and OPF to address shared challenges
  • Extend and enhance the corpus of digital preservation tools
  • Deliver a simple beginners’ guide for the development of digital preservation policies

This event will be of interest to:

  • Collections managers, librarians, curators and archivists and policy makers in all institutions with an interest in digital preservation
  • Techies, tools developers, IT officers, database managers and systems analysts with an interest in long term data management
  • Innovators and researchers in digital preservation
  • Vendors and providers of digital preservation services
  • CEOs, CTOs and CIOs seeking to develop institutional capacity for digital preservation

Everyone coming needs to bring a laptop computer.  In addition:

  • Collection owners will need to bring a data set that is giving them trouble in terms of characterisation or identification and be prepared to present their institutional policy on digital preservation
  • Techies will need to tell us about the skills they have and bring a knowledge of existing digital forensic and characterisation tools

Also, because elements of the mash-up include peer-review of existing practice, participants need to understand and consent to working under ‘Chatham House Rules’ for parts of the programme.

Places are strictly limited and should be booked in advance.  Priority will be given to DPC and OPF members, who can attend at no cost.  Non-members are welcome at a cost of £150 per person. Lunch and refreshments are provided on three days and dinner on the first night.  Accommodation will be recommended but is not included in the cost.  Register online at: http://www.dpconline.org/events

Can’t make it?

Parts of the event will be available as a webcast. We’ll publish the slides after each event and will tweet live from the event using the hashtag #DPnoTears.  

Location: 
Innovation Centre, University of York
Innovation Way, Heslington
York YO10 5DG
United Kingdom
53° 56' 53.0772" N, 1° 2' 49.074" W

Measuring Bigfoot


My previous blog Assessing file format risks: searching for Bigfoot? resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to it directly using the comment fields. However, my reply turned out to be rather lengthier than I intended, so I decided to turn it into a separate blog entry.

Numbers first?

Ross's overall point is that we need the numbers first; he makes a plea for collecting more format-related data, and adding numbers to these. Although these data do not directly translate into risks, Ross argues that it might be possible to use them to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there's a pretty fundamental problem, which I'll try to explain below. To avoid any confusion, I will be speaking of "format risk" here in the sense used by Graf & Gordea, which follows from the idea of "institutional obsolescence" (which is probably worth a blog post by itself, but I won't go into that here).

The risk model

Graf & Gordea define institutional obsolescence in terms of "the additional effort required to render a file beyond the capability of a regular PC setup in particular institution". Let's call this effort E. Now the aim is to arrive at an index that has some predictive power of E. Let's call this index RE. For the sake of the argument it doesn't matter how RE is defined precisely, but it's reasonable to assume it will be proportional to E (i.e. as the effort to render a file increases, so does the risk):

RE ∝ E

The next step is to find a way to estimate RE (the dependent variable) as a function of a set of potential predictor variables:

RE = f(S, P, C, ... )

where S = software count, P = popularity, C = complexity, and so on. To establish the predictor function we have two possibilities:

  1. use a statistical approach (e.g. multiple regression or something more sophisticated);
  2. use a conceptual model that is based on prior knowledge of how the predictor variables affect RE.

The first case (statistical approach) is only feasible if we have actual data on E. For the second case we also need observations on E, if only to be able to say anything about the model's ability to predict RE (verification).
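
To make the statistical route concrete, here is a minimal sketch (the data and predictor values are hypothetical, purely for illustration): fitting any such model requires observed values of E for a set of formats, which is exactly what is missing.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical predictor matrix: one row per format, columns are
    # software count (S), popularity (P) and complexity (C).
    X = np.array([[12, 0.9, 2],
                  [ 3, 0.1, 7],
                  [ 8, 0.5, 4]])

    # Observed rendering effort E per format -- this is the data that does not
    # exist in practice; without it the model can neither be fitted nor verified.
    E = np.array([1.0, 9.0, 3.5])

    model = LinearRegression().fit(X, E)
    print(model.predict([[5, 0.3, 5]]))   # estimate of RE for a new format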

No observed data on E!

Either way, the problem here is that there's an almost complete lack of any data on E. Although we may have a handful of isolated 'war stories', these don't even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model [1]. So how are we going to model a quantity for which we do not have any observed data in the first place? Or am I overlooking something here?

Looking at Ross's suggestions for collecting more data, all of the examples he provides fall into the potential (!) predictor variables category. For instance, prompted by my observation on compression in PDF, Ross suggests analysing large collections of PDFs to establish patterns in the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don't tell you whether PDF is "riskier" than another format, but he argues that:

once we've got them [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.

Aside from the fact that it's debatable whether, in practical terms, the use of compression is really a risk (is there any evidence to back up this claim?), there's a more fundamental issue here. Bearing in mind that, ultimately, the thing we're really interested in here is E, how could collecting more data on potential predictor variables of E ever help here in the near absence of any actual data on E? No amount of clever maths or statistics can compensate for that! Meanwhile, ongoing work on the prediction of E mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross's suggestions), even though the purpose of these efforts remains largely unclear.

Within this context I was quite intrigued by the grant proposal mentioned by Andrea Goethals which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on E (although like Andy Jackson said here I'm also wondering whether this may be too ambitious).

Obsolescence-related risks versus format instance risks

On a final note, Ross makes the following remark about the role of tools:

[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats - and more and more should appear over time.

This is true to some extent, but a tool like jpylyzer only provides information on format instances (i.e. features of individual files); it doesn't say anything about preservation risks of the JP2 format in general. The same applies to tools that are able to detect features in individual PDF files that are risky from a long-term preservation point of view. Such risks affect file instances of current formats, and this is an area that is covered by the OPF File Format Risk Registry that is being developed within SCAPE (it only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain that is being addressed by FFMA. This distinction is important, because both types of risks need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross's suggestion on compression in PDF entails (if I'm understanding him correctly) the analysis of large volumes of PDFs in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for 'risky' features.


  1. On a side note even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically-derived models. 

Scalable Environments for File Format Identification and Characterisation


This webinar provides an introduction to file format identification and characterisation tools which have been developed or extended as part of the SCAPE Project.

It covers the basic principles of file format identification, and shows how format information drives digital preservation workflows.

Participants will be given an overview of file format registries and their role in digital preservation, and will see demonstrations of identification and characterisation tools, including FIDO and Tika.

We will provide a Virtual Machine image with sample files and step-by-step worksheets to allow participants to try out these exercises for themselves after the webinar, with support.

Learning outcomes (by the end of the webinar and exercises, participants will be able to):

  • Distinguish between different file types and identify the requirements for characterising each of them.
  • Carry out identification and characterisation experiments on example files.
  • Compare characterisation and identification tools and understand their advantages and disadvantages when used in different scenarios.


Session Lead: Carl Wilson, OPF
Date: Friday 25 October
Time: 12 noon BST / 13:00 CET
Duration: 1 hour (please note this includes the presentation and demonstrations. Practical exercises can be carried out after the webinar).

There are 25 places available which will be allocated on a first come, first served basis.

Date: 
Friday, 25 October 2013
Location: 
United Kingdom

Fund it, Solve it, Keep it (with SPRUCE)


How to fund and solve your digital preservation challenges

 

What will the event do for me?

This event will help to make your digital preservation more effective by demonstrating the best community focused approaches and results from the JISC funded SPRUCE Project. You'll be hearing from the SPRUCE Team experts and from the practitioners and developers who have been tackling digital preservation challenges in targeted SPRUCE Award projects. We'll also be hearing from you, so we can take on board what you need from our future work.

  • If you're taking your first steps in preserving your digital assets we will demonstrate how to get started, where to get help, and how to make the case to resource your work more effectively.
  • If you're already engaged in digital preservation we'll show how your efforts can be supported more effectively with help from the community.

Key topics we will be covering include:

  • Securing funding for your digital preservation activities with the Digital Preservation Business Case Toolkit
  • Community approaches to solving digital preservation challenges
  • SPRUCE guides on how to assess your digital collections
  • Stabilising data stored on obsolete hand-held media
  • Results from the SPRUCE Award Projects

Who is this for?

Practitioners, developers and middle managers who are engaged (or would like to be engaged) in preserving their organisation's digital assets.

When, where and how do I register?

The free event will take place at 11am on the 25th November at the brand new Library of Birmingham. Register your attendance here. Please note that anyone who registers for the event and then fails to attend without giving at least one week's notice will be liable for a £50 cancellation charge. Places are limited, so please don't waste them!

Date: 
Monday, 25 November 2013
Location: 
Library of Birmingham
Centenary Square, Broad Street
Birmingham B1 2ND
United Kingdom
52° 28' 45.6924" N, 1° 54' 29.3328" W

SCAPE/OPF Continuous Integration update


As previously blogged about by Carl, we now have virtually all SCAPE and OPF projects in continuous integration, building and unit testing in both Travis CI and Jenkins.

  • Travis compiles the projects and executes unit tests whenever a new commit is pushed to GitHub, or when a pull request is submitted to the project.
  • Jenkins builds are generally scheduled once per day.  After a build, the software has its code quality analysed by Sonar.

Complete details of how to build each non-Java project are contained within the .travis.yml files that are found in the project directories.  As a side effect of this work the .travis.yml files can be used as instructions for independently building the projects.

Matchbox, Xcorrsound and Jpylyzer have CI builds that are capable of generating an installable Debian package, which we are aiming to publish.  Java projects have had their Maven GroupId and package names changed to the appropriate SCAPE names so we can publish binary snapshots.

The daily Maven snapshots of code built in Jenkins are now (or soon will be) published to https://oss.sonatype.org/content/repositories/snapshots/eu/scape-project/ and can be used by adding this repository to your pom.xml:

<parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>7</version>
</parent>

What you can do for your project

  1. Maintain your .travis.yml file if project dependencies change
  2. Ensure code matches the SCAPE/OPF functional review criteria – correct Java package names and Maven GroupIds are essential to be able to publish snapshots
  3. Ensure your project has an up to date README that contains details of how to build and run your software (including dependencies)
  4. Very importantly ensure that your project has (at the very least) a top level LICENSE, ideally  source files should each contain a license header
  5. Add unit tests for your project
  6. Ensure that unit tests for your project can easily be run using standard dependencies. Relying on your particular installation for unit tests to pass means that they cannot be successfully run by Travis/Jenkins and show as test failures.  Whilst it might not always be possible to have unit tests that can be run independently, if there have to be test dependencies then please document how these should be set up!
  7. Check your project at http://projects.opf-labs.org/

The CI days are generally about once a month.  If you are interested in joining us do let us know as we could always do with more help.  It’s an opportunity for you to work on CI with Travis/Jenkins, and do other work that is interesting (and rewarding), such as Debian packaging, that you might not normally get to work on.

OPF Webinar: Securing funding for your digital preservation, with SPRUCE


Making the case to your organisation's management, or to external funders, to adequately resource your digital preservation activities is not an easy task. Digital preservation is not always a straightforward sell. In this financial climate the justification for spending money has to be compelling and watertight. In this webinar Paul Wheatley will describe how to make the case for funding your digital preservation, with reference to the SPRUCE Project's Digital Preservation Business Case Toolkit.

* Making a compelling case to fund digital preservation
* The Digital Preservation Business Case Toolkit from SPRUCE
* Getting started
* Other resources

There are twenty-five places available on a first come, first served basis.

Date: Wednesday 27 November
Time: 14:00 GMT / 15:00 CET
Duration: 1 hour
Session Lead: Paul Wheatley, SPRUCE Project Manager, University of Leeds
Date: 
Wednesday, 27 November 2013
Location: 
Online
United Kingdom

Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna


More than 20 developers visited the ‘Hadoop-driven digital preservation Hackathon’ in Vienna which took place in the baroque room called "Oratorium" of the Austrian National Library from 2nd to 4th of December 2013. It was really exciting to hear people vividly talking about Hadoop, Pig, Hive, HBase followed by silent phases of concentrated coding accompanied by the background noise of mouse clicks and keyboard typing.

There were Hadoop newbies, people from the SCAPE Project with some knowledge of Apache Hadoop related technologies, and, finally, Jimmy Lin, who currently works as an associate professor at the University of Maryland and was previously employed as chief data scientist at Twitter. There is no doubt that his profound knowledge of using Hadoop in an 'industrial' big data context gave this event that certain something.

The topic of this Hackathon was large-scale digital preservation in the web archiving and digital books quality assurance domains.  People from the Austrian National Library presented application scenarios and challenges and introduced the sample data which was provided for both areas on a virtual machine together with a pseudo-distributed Hadoop-installation and some other useful tools from the Apache Hadoop ecosystem.

I am sure that Jimmy's talk about Hadoop was the reason why so many participants became curious about Apache Pig, a powerful tool which was humorously characterised by Jimmy as the tool for lazy pigs aiming for hassle-free MapReduce. Jimmy gave a live demo running some Pig scripts on the cluster at his university, explaining how Pig can be used to find out which links point to each web page in a web archive data sample from the Library of Congress. Asking Jimmy about his opinion on Pig and Hive as two alternatives for data scientists to choose from, I found it interesting that he did not seem to have a strong preference for Pig. If an organisation has a lot of experienced SQL experts, he said, Hive is a very good choice. On the other hand, from the perspective of the data scientist, Pig offers a more flexible, procedural approach for manipulating data and doing data analysis.

Towards the end of the first day we started to split up. People gathered ideas in a brainstorming session, which in the end led to the following groups:

·  Cropping error detection

·  Full-text search on top of warcbase

·  Hadoop-based Identification and Characterisation

·  OCR Quality

·  PIG User Defined Functions to operate on extracted web content

·  PIG User Defined Functions to operate on METS

Many participants made their first steps in Pig scripting during the event, so it is clear that one cannot expect code that is ready to be used in a production environment, but we can see many points to start from when we do the planning of projects with similar requirements.

On the second day, there was another talk by Jimmy about HBase and his project WarcBase, which looks like a very promising approach to providing a scalable HBase storage backend with a very responsive user interface that offers the basic functionality of what the Wayback Machine does for rendering ARC and WARC web archive container files. In my opinion, the upside of his talk was to see HBase as a tremendously powerful database on top of Hadoop's distributed file system (HDFS), with Jimmy brimming over with ideas about possible use cases for scalable content delivery using HBase.  The downside was to hear his experiences of how complex the administration of a large HBase cluster can become. First, in addition to the Hadoop administration tasks, it is necessary to keep additional daemons (ZooKeeper, RegionServer) up and running, and he explained how the need for compacting data stored in HFiles, once you believe that the HBase cluster is well balanced, can lead to what the community calls a "compaction storm" that blows up your cluster - luckily this only manifests itself as endless Java stack traces.

One group provided a full-text search for WarcBase; they picked up the core ideas from the developer groups and presentations to build a cutting-edge environment where the web archive content was indexed by the Terrier search engine and the index was enriched with metadata from the Apache Tika MIME type and language detection. There were two ways to add metadata to the index. The first option was to run a pre-processing step that uses a Pig user defined function to output the metadata of each document. The second option was to use Apache Tika during indexing to detect both the MIME type and language. In my view, this group won the prize for the fanciest set-up, sharing resources and daemons running on their laptops.

I was impressed how in the largest working group the outcomes were dynamically shared between developers: One implemented a Pig user defined function (UDF) making use of Apache Tika’s language detection API (see section MIME type detection) which the next developer used in a Pig script for mime type and language detection. Also Alan Akbik, SCAPE project member, computer linguist and Hadoop researcher from the University of Berlin, was reusing building blocks from this group to develop Pig Scripts for old German language analysis using dictionaries as a means to determine the quality of noisy OCRed text. As an experienced Pig scripter he produced impressive results and deservedly won the Hackathon’s competition for the best presentation of outcomes.

The last group was experimenting with the functionality of classical digital preservation tools for file format identification, like Apache Tika, DROID, and Unix file, and looking into ways to improve their performance on the Hadoop platform. It's worth highlighting that digital preservation guru Carl Wilson found a way to replace the command line invocation of Unix file in FITS with a Java API invocation, which proved to be far more efficient.

Finally, Roman Graf, researcher and software developer from the Austrian Institute of Technology, took images from the Austrian Books Online project in order to develop Python scripts which can be used to detect page cropping errors and which were especially designed to run on a Hadoop platform.

On the last day, we had a panel session with people talking about experiences regarding the day-to-day work with Hadoop clusters and the plans that they have for the future of their cluster infrastructure.

I really enjoyed these three days and I was impressed by the knowledge and ideas that people brought to this event.

SCAPE Training - Preserving Your Preservation Tools


Overview

Learning to Think Like a Package Maintainer

Lots of great digital preservation applications and services exist; however, very few are actively maintained and thus preserved! This is a big problem! By introducing the steps needed to develop maintained packages and engage the support of the community, this training course looks at what can be done to improve this situation. Specifically, it looks at how to prepare packages for submission into the very heart of many digital environments: the operating system and directly associated "app stores". Attendees will be given hands-on experience with developing and maintaining packages rather than software, and the key differences will be discussed and evaluated. Better preservation of preservation tools means better preservation of our digital history.

Learning Outcomes (by the end of the training event the attendees will be able to):

  1. Understand the complexities of package management and distinguish between the different practices relating to both package objectives and chosen programming language. 
  2. Be able to carry out advanced package management operations in order to critically appraise current packages and propose changes. 
  3. Understand the importance of clearly defined versioning and licenses and the role of clear documentation and examples. 
  4. Apply best practice techniques in order to create a simple package suitable for long term maintenance. 
  5. Evaluate a number of options for managing package configuration and behavior relating to package installation, removal, upgrade and re-installation. 
  6. Analyse opportunities for automating package management and releases, maintaining a clear focus on the user and not the developer. 
  7. Critically evaluate opportunities to generalise package management to allow the easy building and maintenance of packages on multiple platforms.
  8. Assess the potential to apply package management techniques in your own environment. 

Delegates will receive a certificate of attendance for the training course.

The agenda can be seen here: http://wiki.opf-labs.org/display/SP/Agenda+-+Preserving+Your+Preservation+Tools.

Registration will open in early 2014.

Date: 
Wednesday, 26 March 2014 to Thursday, 27 March 2014
Location: 
The National Library of the Netherlands
Prins Willem-Alexanderhof 5
2595 BE The Hague
Netherlands

OPF Webinar - From the Preservation Toolkit: JHOVE2


This webinar will give an overview of JHOVE2, the free and open-source tool for characterizing digital objects.  It will cover the motivation for creating a second-generation version of JHOVE and some of the new features of the tool, including the ability to perform not just format identification, validation, and feature extraction, but also assessment (a policy-based determination of the acceptability of a format instance, regardless of its validity).  It will discuss JHOVE2's more sophisticated data model of a format instance, embracing complex digital objects that can be composed of more than one file, each of a possibly different format.  It will provide pointers on JHOVE2's setup and use.  It will briefly introduce the tool's architecture, and ways in which it has been and can continue to be extended to include more formats, building on existing libraries and tools.

There are twenty-five places available on a first come, first served basis.

Date: Friday 31 January
Time: 09:00 EST / 14:00 GMT / 15:00 CET
Duration: 1 hour
Session Lead: Sheila Morrissey, Portico

 

Date: 
Friday, 31 January 2014
Location: 
United Kingdom

Standing on the Shoulders of Your Peers


In December last year I attended a Hadoop Hackathon in Vienna. A hackathon that has been written about before by other participants: Sven Schlarb's Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna and Clemens and René's The Elephant Returns to the Library…with a Pig!. Like these other participants I really came home from this event with a lot of enthusiasm and fingers itching to continue the work I started there.

As Clemens and René write in their blog post on this event, collaboration had, without really being stated explicitly, taken centre stage, and that in itself was a good experience.

For the hackathon Jimmy Lin from the University of Maryland had been invited to present Hadoop and some of the adjoining technologies to the participants. We all came with the hope of seeing cool and practical usages of Hadoop for digital preservation. He started his first session by surprising us all with a talk titled: Never Write Another MapReduce Job. It later became clear that Jimmy enjoys this kind of gentle provocation, as in his 2012 article titled If all you have is a hammer, throw away everything that is not a nail. Jimmy, of course, did not want us to throw away Hadoop. Instead he gave a talk on how to get rid of all the tediousness and boilerplate necessary when writing MapReduce jobs in Java. He showed us how to use Pig Latin, a language which can be described as an imperative-like, SQL-like DSL for manipulating data structured as lists. It is very concise and expressive and soon became a new shiny tool for us developers.

During the past year or so Jimmy had been developing a Hadoop-based tool for harvesting web sites into HBase. This tool also had its own piggy bank, which is what you call a library of user defined functions (UDFs) for Pig Latin. So, to cut a corner, those of us who wanted to hack in Pig Latin cloned that tool from GitHub: warcbase. As a bonus this tool also had a UDF for reading ARC files, which was nice as we had a lot of test data in that format, some provided by ONB and some brought from home.

As an interesting side-note, the warcbase tool actually leverages another recently developed digital preservation tool, namely JWAT, developed at the Danish Royal Library.

As Clemens and René write in their blog post, they created two UDFs using Apache Tika: one UDF for detecting which language a given text-based ARC record was written in and another for identifying which MIME type a given record had. Meanwhile another participant, Alan Akbik from Technische Universität Berlin, showed Jimmy how to easily add Pig Latin unit tests to a project. This resulted in an actual commit to warcbase during the hackathon, adding unit tests to the previously implemented UDFs.

Given those unit tests I could then implement such tests for the two Tika UDFs that Clemens and René had written. These days unit tests are almost ubiquitous when collaborating on writing software. Apart from their primary role of ensuring the continued correctness of refactored code, they have another advantage. For years I've preferred an exploratory development style using REPL-like environments. This is hard to do using Java, but the combination of unit tests and a good IDE gives you a little of that dynamic feeling.

With all the above in place I decided to write a new UDF. This UDF should use the UNIX file tool to identify records in an ARC file. This task would combine the ARC reader UDF by Jimmy, the Pig unit tests by Alan and, lastly, a Java/JNA library written by Carl Wilson, who adapted it from another digital preservation tool called JHOVE2. This library is available as libmagic-jna-wrapper. I would, of course, also rely heavily on the two Tika UDFs by Clemens and René and the unit tests I wrote for those.

Old Magic

The "file" tool and its accompanying library "libmagic" is used in every Linux and BSD distribution on the planet, it was born in 1987, and is still the most used file format identification tool. It would be sensible to employ such a robust and widespread tool in any file identification environment especially as it is still under development. As of this writing, the latest commit to "file" was five days ago!

The "file" tool is available on Github as glenc/file.

"file" and the "ligmagic" library are developed in C. To employ this we therefore need to have a JNA interface and this is exactly what Carl finished during the hackathon.

Maven makes it easy to use that library:

<dependency>
    <groupId>org.opf-labs</groupId>
    <artifactId>lib-magic-wrapper</artifactId>
    <version>0.0.1-SNAPSHOT</version>
</dependency>

which gives access to "libmagic" from a Java program:

    import org.opf_labs.LibmagicJnaWrapper;

    ...

    LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
    String magicFile = "/usr/share/file/magic.mgc";
    jnaWrapper.load(magicFile);
    String mimeType = jnaWrapper.getMimeType(is);

    ...

There is one caveat in using a C library like this from Java: it often requires platform-specific configuration, in this case the full path to the "magic.mgc" file. This file contains the signatures (byte sequences) used when identifying the formats of the unknown files. In this implementation the UDF takes this path as a parameter to the constructor of the UDF class.

Magic UDF

With the above in place it is very easy to implement the UDF, which in its entirety is as simple as:

package org.warcbase.pig.piggybank;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.opf_labs.LibmagicJnaWrapper;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DetectMimeTypeMagic extends EvalFunc<String> {
    private static String MAGIC_FILE_PATH;

    public DetectMimeTypeMagic(String magicFilePath) {
        MAGIC_FILE_PATH = magicFilePath;
    }

    @Override
    public String exec(Tuple input) throws IOException {
        String mimeType;

        if (input == null || input.size() == 0 || input.get(0) == null) {
            return "N/A";
        }
        //String magicFile = (String) input.get(0);
        String content = (String) input.get(0);

        InputStream is = new ByteArrayInputStream(content.getBytes());
        if (content.isEmpty()) return "EMPTY";

        LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
        jnaWrapper.load(MAGIC_FILE_PATH);

        mimeType = jnaWrapper.getMimeType(is);

        return mimeType;
    }
}

Github: DetectMimeTypeMagic.java

Magic Pig Latin

A Pig Latin script utilising the new magic UDF on an example ARC file. The script measures the distribution of MIME types in the input files.

    register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

    -- The '50' argument is explained in the last section
    define ArcLoader50k org.warcbase.pig.ArcLoader('50'); 

    -- Detect the mime type of the content using magic lib
    -- On MacOS X using Homebrew the magic file is located at
    -- /usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc
    define DetectMimeTypeMagic org.warcbase.pig.piggybank.DetectMimeTypeMagic('/usr/local/Cellar/libmagic/5.16/share/misc/magic.mgc');

    -- Load arc file properties: url, date, mime, and 50kB of the content
    raw = load 'example.arc.gz' using ArcLoader50k() as (url: chararray, date:chararray, mime:chararray, content:chararray);

    a = foreach raw generate url,mime, DetectMimeTypeMagic(content) as magicMime;

    -- magic lib includes "; <char set>" which we are not interested in
    b = foreach a {
        magicMimeSplit = STRSPLIT(magicMime, ';');
        GENERATE url, mime, magicMimeSplit.$0 as magicMime;
    }

    -- bin the results
    magicMimes      = foreach b generate magicMime;
    magicMimeGroups = group magicMimes by magicMime;
    magicMimeBinned = foreach magicMimeGroups generate group, COUNT(magicMimes);

    store magicMimeBinned into 'magicMimeBinned';

This script can be modified a bit for usage with this unit test

    @Test
    public void testDetectMimeTypeMagic() throws Exception {
        String arcTestDataFile;
        arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();

        String pigFile = Resources.getResource("scripts/TestDetectMimeTypeMagic.pig").getPath();
        String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows ?

        PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location});

        Iterator <Tuple> ts = test.getAlias("magicMimeBinned");
        while (ts.hasNext()) {
            Tuple t = ts.next(); // t = (mime type, count)
            String mime = (String) t.get(0);
            System.out.println(mime + ": " + t.get(1));
            if (mime != null) {
                switch (mime) {
                    case                         "EMPTY": assertEquals(  7L, (long) t.get(1)); break;
                    case                     "text/html": assertEquals(139L, (long) t.get(1)); break;
                    case                    "text/plain": assertEquals( 80L, (long) t.get(1)); break;
                    case                     "image/gif": assertEquals( 29L, (long) t.get(1)); break;
                    case               "application/xml": assertEquals( 11L, (long) t.get(1)); break;
                    case           "application/rss+xml": assertEquals(  2L, (long) t.get(1)); break;
                    case         "application/xhtml+xml": assertEquals(  1L, (long) t.get(1)); break;
                    case      "application/octet-stream": assertEquals( 26L, (long) t.get(1)); break;
                    case "application/x-shockwave-flash": assertEquals(  8L, (long) t.get(1)); break;
                }
            }
        }
    }

Github: TestArcLoaderPig.java

The modified Pig Latin script is at TestDetectMimeTypeMagic.pig

¡Hasta la Vista!

During this event we had a lot of synergy through collaboration; shouting over the tables, showing code to each other, running each other's code on non-public test data, presenting results on projectors, and so on. Even late night discussions added significant energy to this synergy. All this is not possible without people actually meeting each other face to face for a couple of days, showing up with great intentions for sharing, learning and teaching.

So, I do hope to see you all soon somewhere in Europe for some great hacking.

Epilogue: Out of heap space

A couple of weeks ago I was more or less done with all of the above, including this blog post. Then something happened that required us to upgrade our version of Cloudera to 4.5. This again resulted in us changing the basic cluster architecture, and then the UDFs stopped working due to heap space out of memory errors. I traced those out of memory errors to the ArcLoader class, which is why I implemented the "READ_SIZE" class field. This field is set when instantiating the class to some reasonable number of kB. It forces the ArcLoader to only read a certain amount of payload data, just enough for Tika and libmagic to complete their format identifications while ensuring we don't get hundreds-of-megabytes sized strings being passed around.

This doesn't address the problem of why it worked before and why it doesn't now. It also doesn't address the loss of generality. The ArcLoader can no longer provide an ARC container format abstraction in every case. It only works when the job can make do with only a part of the payload of the ARC records. I.e. the given solution would not work for a Pig script that needs to extract the audio parts of movie files provided as ARC records.

As this work has primarily been a learning experience I will stop here — for now. Still, I'm certain that I'll revisit these issues somewhere down the road as they are both interesting and the solutions will be relevant for our work.

Identification of PDF preservation risks: analysis of Govdocs selected corpus


This blog follows up on three earlier posts about detecting preservation risks in PDF files. In part 1 I explored to what extent the Preflight component of the Apache PDFBox library can be used to detect specific preservation risks in PDF documents. This was followed up by some work during the SPRUCE Hackathon in Leeds, which is covered by this blog post by Peter Cliff. Then last summer I did a series of additional tests using files from the Adobe Acrobat Engineering website. The main outcome of this more recent work was that, although showing great promise, Preflight was struggling with many of the more complex PDFs. Fast-forward another six months and, thanks to the excellent response of the Preflight developers to our bug reports, the most serious of these problems are now largely solved [1]. So, time to move on to the next step!

Govdocs Selected

Ultimately, the aim of this work is to be able to profile large PDF collections for specific preservation risks, or to verify that a PDF conforms to an institution-specific policy before ingest. To get a better idea of how that might work in practice, I decided to do some tests with the Govdocs Selected dataset, which is a subset of the Govdocs1 corpus. As a first step I ran the latest version of Preflight on every PDF in the corpus (about 15 thousand) [2].

Validation errors

As I was curious about the most common validation errors (or, more correctly, violations of the PDF/A-1b profile), I ran a little post-processing script on the output files to calculate error occurrences. The following table lists the results. For each Preflight error (which is represented as an error code), the table shows the number of PDFs for which the error was reported (expressed as a percentage) [3].
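
A minimal sketch of such a post-processing step (the XML element names are assumptions made for illustration; Preflight's actual report schema may differ):

    import glob
    from collections import Counter
    from xml.etree import ElementTree

    counts = Counter()
    n_files = 0

    for report in glob.glob("preflight-output/*.xml"):
        n_files += 1
        tree = ElementTree.parse(report)
        # Assumed layout: one <error> element per violation, with a <code> child.
        codes = {e.findtext("code") for e in tree.iter("error")}
        counts.update(codes)   # count each error code at most once per file

    for code, n in counts.most_common():
        print("%s\t%.1f%%" % (code, 100.0 * n / n_files))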

Error code | % PDFs reported | Description (from Preflight source code)
2.4.3 | 79.5 | color space used in the PDF file but the DestOutputProfile is missing
7.1 | 52.5 | Invalid metadata found
2.4.1 | 39.1 | RGB color space used in the PDF file but the DestOutputProfile isn't RGB
1.2.1 | 38.8 | Error on the object delimiters (obj / endobj)
1.4.6 | 34.3 | ID in 1st trailer and the last is different
1.2.5 | 32.1 | The length of the stream dictionary and the stream length is inconsistent
7.11 | 31.9 | PDF/A Identification Schema not found
3.1.2 | 31.6 | Some mandatory fields are missing from the FONT Descriptor Dictionary
3.1.3 | 29.4 | Error on the "Font File x" in the Font Descriptor (ed.: font not embedded?)
3.1.1 | 27.2 | Some mandatory fields are missing from the FONT Dictionary
3.1.6 | 17.1 | Width array and Font program Width are inconsistent
5.2.2 | 13 | The annotation uses a flag which is forbidden
2.4.2 | 12.8 | CMYK color space used in the PDF file but the DestOutputProfile isn't CMYK
1.2.2 | 12 | Error on the stream delimiters (stream / endstream)
1.2.12 | 9.5 | The stream uses a filter which isn't defined in the PDF Reference document
1.4.1 | 9.3 | ID is missing from the trailer
3.1.11 | 8.4 | The CIDSet entry is mandatory for a subset of composite font
1.1 | 8.3 | Header syntax error
1.2.7 | 7.5 | The stream uses an invalid filter (The LZW)
3.1.5 | 7.3 | Encoding is inconsistent with the Font
2.3 | 6.7 | A XObject has an unexpected key defined
Exception | 6.6 | Preflight raised an exception
3.1.9 | 6.1 | The CIDToGID is invalid
3.1.4 | 5.7 | Charset declaration is missing in a Type 1 Subset
7.2 | 5 | Metadata mismatch between PDF Dictionary and XMP
7.3 | 4.3 | Description schema required not embedded
2.3.2 | 4.2 | A XObject has an unexpected value for a defined key
7.1.1 | 3.3 | Unknown metadata
3.3.1 | 3.1 | a glyph is missing
1.4.8 | 2.6 | Optional content is forbidden
2.2.2 | 2.4 | A XObject SMask value isn't None
1.0.14 | 2.1 | An object has an invalid offset
1.4.10 | 1.6 | Last %%EOF sequence is followed by data
2.2.1 | 1.6 | A Group entry with S = Transparency is used or the S = Null
1 | 1.6 | Syntax error
5.2.3 | 1.5 | Annotation uses a Color profile which isn't the same as the profile contained by the OutputIntent
1.0.6 | 1.2 | The number is out of Range
5.3.1 | 1.1 | The AP dictionary of the annotation contains forbidden/invalid entries (only the N entry is authorized)
6.2.5 | 1 | An explicitly forbidden action is used in the PDF file
1.4.7 | 1 | EmbeddedFile entry is present in the Names dictionary

This table does look a bit intimidating (but see this summary of Preflight errors); nevertheless it is useful to point out a couple of general observations:

  • Some errors are really common; for instance, error 2.4.3 is reported for nearly 80% of all PDFs in the corpus!
  • Errors related to color spaces, metadata and fonts are particularly common.
  • File structure errors (1.x range) are reported quite a lot as well. Although I haven't looked at this in any detail, I expect that for some files these errors truly reflect a deviation from the PDF/A-1 profile, whereas in other cases these files may simply not be valid PDF (which would be more serious).
  • About 6.5% of all analysed files raised an exception in Preflight, which could either mean that something is seriously wrong with them, or alternatively it may point to bugs in Preflight.

Policy-based assessment

Although it's easy to get overwhelmed by the Preflight output above, we should keep in mind here that the ultimate aim of this work is not to validate against PDF/A-1, but to assess arbitrary PDFs against a pre-defined technical profile. This profile may reflect an institution's low-level preservation policies on the requirements a PDF must meet to be deemed suitable for long-term preservation. In SCAPE such low-level policies are called control policies, and you can find more information on them here and here.

To illustrate this, I'll be using a hypothetical control policy for PDF that is defined by the following objectives:

  1. File must not be encrypted or password protected
  2. Fonts must be embedded and complete
  3. File must not contain JavaScript
  4. File must not contain embedded files (i.e. file attachments)
  5. File must not contain multimedia content (audio, video, 3-D objects)
  6. File should be valid PDF

Preflight's output contains all the information that is needed to establish whether each objective is met (except objective 6, which would need a full-fledged PDF validator). By translating the above objectives into a set of Schematron rules, it is pretty straightforward to assess each PDF in our dataset against the control policy. If that sounds familiar: this is the same approach that we used earlier for assessing JP2 images against a technical profile. A schema that represents our control policy can be found here. Note that this is only a first attempt, and it may well need some further fine-tuning (more about that later).
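
As a sketch of how such an assessment can be wired up with the lxml library (the file names are hypothetical; the real schema is the one linked above):

    from lxml import etree
    from lxml.isoschematron import Schematron

    # Schematron schema encoding the control policy, and one Preflight report.
    schema = Schematron(etree.parse("pdf_policy.sch"), store_report=True)
    report = etree.parse("preflight-output/some_document.xml")

    if schema.validate(report):
        print("Pass: document meets the control policy")
    else:
        print("Fail:")
        print(schema.error_log)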

Results of assessment

As a first step I validated all Preflight output files against this schema. The result is rather disappointing:

Outcome | Number of files | %
Pass | 3973 | 26
Fail | 11120 | 74

So, only 26% of all PDFs in Govdocs Selected meet the requirements of our control policy! The figure below gives us some further clues as to why this is happening:

Here each bar represents the occurrences of individual failed tests in our schema.

Font errors galore

What is clear here is that the majority of failed tests are font-related. The Schematron rules that I used for the assessment currently include all font errors that are reported by Preflight. Perhaps this is too strict for objective 2 ("Fonts must be embedded and complete"). A particular difficulty here is that it is often hard to envisage the impact of particular font errors on the rendering process. On the other hand, the results are consistent with the outcome of a 2013 survey by the PDF Association, which showed that its members see fonts as the most challenging aspect of PDF, both for processing and writing (source: this presentation by Duff Johnson). So, the assessment results may simply reflect that font problems are widespread [4]. One should also keep in mind that Govdocs Selected was created by selecting on unique combinations of file properties from files in Govdocs1. As a result, one would expect this dataset to be more heterogeneous than most 'typical' PDF collections, and this would also influence the results. For instance, the Creating Program selection property could result in a relative over-representation of files that were produced by some crappy creation tool. Whether this is really the case could easily be tested by repeating this analysis for other collections.

Other errors

Only a small number of PDFs with encryption, JavaScript, embedded files and multimedia content were detected. I should add here that the occurrence of JavaScript is probably underestimated due to a pending Preflight bug. A major limitation is that there are currently no reliable tools that are able to test overall conformity to PDF. This problem (and a hint at a solution) is also the subject of a recent blog post by Duff Johnson. In the current assessment I've taken the occurrence of Preflight exceptions (and general processing errors) as an indicator of non-validity. This is a pretty crude approximation, because some of these exceptions may simply indicate a bug in Preflight (rather than a faulty PDF). One of the next steps will therefore be a more in-depth look at some of the PDFs that caused an exception.

Conclusions

These preliminary results show that policy-based assessment of PDF is possible using a combination of Apache Preflight and Schematron. However, dealing with font issues appears to be a particular challenge. Also, the lack of reliable tools to test for overall conformity to PDF (e.g. ISO 32000) is still a major limitation. Another limitation of this analysis is the lack of ground truth, which makes it difficult to assess the accuracy of the results.

Demo script and data downloads

For those who want to have a go at the analyses that I've presented here, I've created a simple demo script here. The raw output data of the Govdocs selected corpus can be found here. This includes all Preflight files, the Schematron output and the error counts. A download link for the Govdocs selected corpus can be found at the bottom of this blog post.

Acknowledgements

Apache Preflight developers Eric Leleu, Andreas Lehmkühler and Guillaume Bailleul are thanked for their support and prompt response to my questions and bug reports.

Related blog posts


  1. This was already suggested by this re-analysis of the Acrobat Engineering files that I did in November. 

  2. This selection was only based on file extension, which introduces the possibility that some of these files aren't really PDFs. 

  3. Errors that were reported for less than 1% of all analysed PDFs are not included in the table. 

  4. In addition to this, it seems that Preflight sometimes fails to detect fonts that are not embedded, so the number of PDFs with font issues may be even greater than this test suggests. 


Why can't we have digital preservation tools that just work?


One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!

FITS 0.8

Fast-forward 2.5 years: this week I saw the announcement of the latest FITS release. This got me curious, also because of the recent work on this tool as part of the FITS Blitz. So I downloaded FITS 0.8, installed it in a directory called c:\fits\ on my Windows PC, and then typed (while in directory f:\myData\):

f:\myData>c:\fits\fits

Instead of the expected help message I ended up with this:

The system cannot find the path specified.
Error: Could not find or load main class edu.harvard.hul.ois.fits.Fits

Hang on, I've seen this before ... don't tell me this is the same bug that I already reported 2.5 years ago? Well, turns out it is after all!

This got me curious about the status of the other tools that had similar problems in 2011, so I started downloading the latest versions of DROID, JHOVE2 and Fido. As I was on a roll anyway, I gave JHOVE a try as well (even though it was not part of the 2011 evaluation). The objective of the test was simply to run each tool and get some screen output (e.g. a help message), nothing more. I did these tests on a PC running Windows 7 with Java version 1.7.0_25. Here are the results.

DROID 6.1.3

First I installed DROID in a directory C:\droid\. Then I executed it using:

f:\myData>c:\droid\droid

This started up a Java Virtual Machine Launcher that showed this message box:

The Running DROID text document that comes with DROID says:

To run DROID on Windows, use the "droid.bat" file. You can either double-click on this file, or run it from the command-line console, by typing "droid" when you are in the droid installation folder.

So, no progress on this for DROID either, then. I was able to get DROID running by circumventing the launcher script like this:

java -jar c:\droid\droid-command-line-6.1.3.jar

This resulted in the following output:

No command line options specified

This isn't particularly helpful. There is a help message, but you only get to see it by giving the -h flag on the command line, and you only find out about the -h flag from the help message. Catch 22 anyone?

JHOVE2-2.1.0

After installing JHOVE2 in c:\jhove2\, I typed:

f:\myData>c:\jhove2\jhove2

This gave me 1393 (yes, you read that right: 1393!) Java deprecation warnings, each along the lines of:

16:51:02,702 [main] WARN  TypeConverterDelegate : PropertyEditor [com.sun.beans.editors.EnumEditor]
found through deprecated global PropertyEditorManager fallback - consider using a more isolated 
form of registration, e.g. on the BeanWrapper/BeanFactory! 

This was eventually followed by the (expected) JHOVE2 help message, and a quick test on some actual files confirmed that JHOVE2 does actually work. Nevertheless, by the time the tsunami of warning messages is over, many first-time users will have started running for the bunkers!

Fido 1.3.1

Fido doesn't make use of any launcher scripts any more, and the default way to run it is to use the Python script directly. After installing in c:\fido\ I typed:

f:\myData>c:\fido\fido.py

Which resulted in ..... (drum roll) ... a nicely formatted Fido help message, which is exactly what I was hoping for. Beautiful!

JHOVE 1.11

I installed JHOVE in c:\jhove\ and then typed:

f:\myData>c:\jhove\jhove 

Which resulted in this:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/harvard/hul/ois/j
hove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultCon
figFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.Co
nfigWindow
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more

Ouch!

Final remarks

I limited my tests to a Windows environment only, and results may well be better under Linux for some of these tools. Nevertheless, I find it nothing less than astounding that so many of these (often widely cited) preservation tools fail to even execute on today's most widespread operating system. Granted, in some cases there are workarounds, such as tweaking the launcher scripts, or circumventing them altogether. However, this is not an option for less tech-savvy users, who will simply conclude "Hey, this tool doesn't work", give up, and move on to other things. Moreover, this means that much of the (often huge) amounts of development effort that went into these tools will simply fail to reach its potential audience, and I think this is a tremendous waste. I'm also wondering why there's been so little progress on this over the past 2.5 years. Is it really that difficult to develop preservation tools with command-line interfaces that follow basic design conventions that have been ubiquitous elsewhere for more than 30 years? Tools that just work?

EDRMS across New Zealand’s Government – Challenges with even the most managed of records management systems!

A while back I wrote a blog post, MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system - without an Electronic Document and Records Management System (EDRMS). I also questioned if we were doing enough with EDRMS by way of collecting data. Following that blog we sought out the help of a student from the local region’s university to begin looking at EDRMS systems, to understand what metadata they collected, and how to collect additional ‘technical’ metadata using the tools often found in the digital preservation toolkit.   
 
Sarah McKenzie is a student at Victoria University. She has been working at Archives New Zealand on a 400-hour Summer Scholarship Programme that takes place during the university's summer break. Our department submitted three research proposals to the School of Engineering and Computer Science and out of them Sarah selected the EDRMS-focussed project. She began work in December and her scholarship is set to be completed mid-February. 
 
To add further detail, the title and focus of the project is as follows:
 
Mechanism to connect the tools in the digital preservation toolset to content management and database systems for metadata extraction and generation
 
Electronic document and records management systems (EDRMS) are the only legitimate mechanism for storing electronic documents with sufficient organisational context to develop archival descriptions, but they are not necessarily suited, at the point a record is created, to storing important technical information. Since they sit atop database management technology, we are keen to understand mechanisms for generating this technical metadata before ingest into a digital archive. 
 
We are keen to understand the challenge of developing this metadata from an EDRMS and DBMS perspective, where it is appreciated that mechanisms of access may vary from one system to another. In the DBMS context, technical metadata and contextual organisational metadata may be entirely non-existent. 
 
With interfaces to popular characterization tools biased towards the file system, it is imperative that we create mechanisms to use tools central to the preservation workflow in alternative ways. This project will ask students to develop an interface to EDRMS and DBMS systems that can support characterization using multiple digital preservation tools.
 
Metadata we’re seeking to gather includes format identification, characterisation reports along with other such data as SHA-1 checksums. Tools typical to the digital preservation workflow include DROID, JHOVE, FITS and TIKA.
The blog continues with Sarah writing for the OPF on behalf of Archives New Zealand. She provides some insight into her work thus far, and into her own methods of research and discovery within a challenging government environment.
 
EDRMS Systems
 
An EDRMS is a system for controlling and tracking the creation of documents from the point they are made, through publication, and possibly even destruction. They function as a form of version control for text documents, providing a way to accomplish a varying range of tasks in the management of documents. Some examples of tasks an EDRMS can perform are:
 
  • Tracking creation date
  • Changes and publication status
  • Keeping a record of who has accessed the documents.
 
EDRMS stores are the individual databases of documents that are maintained for management. They are usually in a proprietary format, and interfacing directly with them means having access to the appropriate Application Programming Interface (API) and Software Development Kit (SDK). In some cases these are merged together, requiring only one package. The actual structure of the store varies from system to system. Some use the directory structure that is part of the computer's file system and provide an interface from there. Others utilise a database for storing the documents.
 
Most EDRMS run on a client/server architecture.
 
Currently Archives New Zealand has dealt with three different EDRMS stores: 
 
  • IBM Notes (formerly called Lotus Notes)
 
‘Notes’ has a publicly available API and the latest version is built in Java, which makes it easier to use with the metadata extraction tools used in the digital preservation community (the majority of which I have found to be written in Java). There are many EDRMS systems, and it's simply not possible to code a tool enabling our preservation toolkit to interact with all of them without a comprehensive review of all New Zealand government agencies and their IT suites. 
 
A survey has been partially completed by Archives New Zealand. The large number of systems suggested a more focused approach in my research project, i.e. a particular instance of EDRMS, over multiple systems. 
 
Gathering Information on Systems in Use
 
Within New Zealand, the Office of the Government Chief Information Officer (OGCIO) had already conducted a survey of electronic document management systems currently used by government agencies. This survey did not cover all government agencies, but with 113 agencies replying it was considered a large enough sample to understand the most widely used systems across government. Out of the 113, some agencies did not provide any information, leaving only 69 cases where a form of EDRMS was explicitly named. These results were then turned into an alphabetical table listing:
 
  • EDRMS names
  • The company that created them
  • Any notes on the entry
  • A list of agencies using them
 
In addition to the information provided by the OGCIO survey, some investigative work was done in looking through the records of the Archives' own document management system to find any reference to other EDRMS in use across government. Other active EDRMS systems were uncovered.
 
For the purposes of this research it was assumed that if an agency has ever used a given EDRMS, it is still relevant to the work of Archives New Zealand, and it is considered ‘in-use’ until it is verified that no document stores from that particular system remain that have not been archived, migrated to a new format, or destroyed.
 
Obstacles were encountered in the process of converting the information into a reference table useful for this project. Some agencies provided the names of companies that built their EDRMS. This is understandable to some extent, since there has been a vanity in the software industry where companies name their flagship product after the company (or vice versa). However, in some cases it was difficult to discern what was meant because the company that made the original software had been bought out and their product was still being sold by the new owner under the same name – or the name had been turned into a brand for an arm of the new parent company which deals with all their EDRMS software (e.g. Autonomy Corporation has now become HP Autonomy, Hewlett-Packard's EDRMS branch). 
 
In addition, sometimes there were multiple software packages for document management with the same name. While it was possible to deduce what some of these names meant, it was not possible to find all of them. In these cases the name provided by the agency was listed with a note explaining it was not possible to conclude what they meant, and some suggestions for further inquiry. Vendor acquisitions were listed to provide a path through to newer software packages that possibly have compatibility with the old software, and also provide a way to quickly track down current owners of an older piece of software.
 
The varying needs of different agencies mean there is no one-size-fits-all EDRMS (e.g. a system designed for legal purposes may offer specialised features that one for general document handling wouldn't have). But since there has been no overarching standard for EDRMS, and it was assumed that agencies would make their own choices based on their business needs, there turned out to be a large number of systems in use, some of them obscure or old. The oldest system that could reasonably be verified as having been used was a 1990s version of a program originally created in the late 1980s, called Paradox. At the time the document mentioning it was written, this was in the process of being upgraded and the data migrated to a system called Radar, but there was no clear note of this having been completed.
 
At the time of writing it had been established that there were approximately 44 EDRMS ‘in-use’.
 
With 44 systems in use it was considered unfeasible to investigate the possibility of automating metadata extraction from all of them at this time. It was decided to set some boundaries for starting points. One starting point was to ask which EDRMS is the most used. According to the information gathered, the most common appeared to be Microsoft SharePoint, with perhaps 24 agencies using it, while Objective Corporation's Objective was associated with at least 12 agencies.
 
A second way to view this was to ask, ‘which systems have been recommended for use going forward?’ Archives New Zealand’s parent department The Department of Internal Affairs (DIA) has created a three-supplier panel for providing enterprise content management solutions to government agencies. Those suppliers are:
 
 
With two weeks remaining in the scholarship, and with work already completed to connect a number of digital preservation tools together in a middle abstraction layer that provides a broad range of metadata for our digital archivists, it was decided that testing of the tool (that is, connecting it to an EDRMS and extracting technical metadata) would best be done on a working, in-use EDRMS from the proposed DIA supplier panel, one that would continue to add value to Archives New Zealand's work in the future.
 
Getting Things Out of an EDRMS
 
The following tools were considered to be a good set to start examining extraction of metadata from files:
 
  • DROID
  • JHOVE
  • Tika
 
The tools have been linked together via a Java application that uses each tool's command-line interface to run them in turn. The files are identified first by DROID, and then each tool is run over the file to produce a collection of all available metadata in Comma Separated Values (CSV) format. This showed that the tools extract information in different ways (date formatting, for instance, is not consistent), and that some tools can read data that others cannot; for example, due to a character encoding issue the Title, Author and Creator fields of a particular PDF were not readable in JHOVE but were read correctly by Tika and NLMET, while JHOVE still extracts information those tools do not.
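
To make the wiring a little more concrete, here is a minimal, hypothetical sketch (not the project's actual code) of running one command-line tool from Java and capturing its standard output, which is the basic building block of such a unifier. The jar names and paths in main() are placeholders; the -m and -d flags are Tika's metadata and detect flags.

   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import java.util.ArrayList;
   import java.util.List;

   public class ToolRunner {
      // Run a single command-line tool against one file and return its stdout lines.
      public static List<String> run(String... commandAndArgs) throws Exception {
         ProcessBuilder pb = new ProcessBuilder(commandAndArgs);
         pb.redirectErrorStream(true);   // fold stderr into stdout for simplicity
         Process p = pb.start();
         List<String> lines = new ArrayList<String>();
         BufferedReader reader = new BufferedReader(
               new InputStreamReader(p.getInputStream()));
         String line;
         while ((line = reader.readLine()) != null) {
            lines.add(line);
         }
         reader.close();
         p.waitFor();
         return lines;
      }

      public static void main(String[] args) throws Exception {
         String file = args[0];
         // Hypothetical invocations; real jar paths depend on the local install.
         List<String> tikaMetadata = run("java", "-jar", "tika-app.jar", "-m", file);
         List<String> tikaType     = run("java", "-jar", "tika-app.jar", "-d", file);
         // ...each tool's output would then be merged into one CSV row per file...
      }
   }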
 
When a tool sends its output to standard out it's a simple matter of working with the text output as it's fed back to the calling function from the process. In some cases a tool produces an output file which had to be read back in. In the case of the NLMET, a handler for the XML format had to be built. Since the XML schema had separate fields for date and time of creation and modification, the opportunity was taken to collate those into two single date-time fields so they would better fit into a schema.
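
As a small illustration of that collation step (a sketch only, assuming the date and time arrive as ISO-formatted strings; the real field names and formats in the NLMET output may differ):

   import java.time.LocalDate;
   import java.time.LocalDateTime;
   import java.time.LocalTime;

   public class DateCollator {
      // Combine separate date and time fields into a single date-time value.
      public static LocalDateTime collate(String dateField, String timeField) {
         return LocalDateTime.of(LocalDate.parse(dateField),   // e.g. "2014-02-10"
                                 LocalTime.parse(timeField));  // e.g. "14:32:05"
      }
   }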
 
The goal with the collated outputs is to have domain experts check over them to verify which tools produce the information they want, and once that is done a schema for which piece of data to get from which tool can be introduced to the program so it can create and populate the Archives metadata schema for the files it analyses.
 
The ideal goal for this tool is to connect it to an EDRMS system via an API layer, enabling the extraction of metadata from the files within a store without having to export the files. For that purpose the next stage in this research is to set up a test example of one of DIA’s proposed EDRMS solutions and try to access it with the tool unifier. It is hoped that this will provide an approach that can be applied to other document management systems moving forward.
 

SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine


The Web is constantly evolving over time. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page.

 

We want to underline that the aim is not to compare two different web pages (although the tool can also do that), but rather two versions of the same web page.

An efficient change detection approach is important for several reasons:

 

  • Crawler optimization

  • Discovering new crawl strategies e.g. based on patterns

  • Quality assurance for crawlers, for example, by comparing the live version of the page with the just crawled one.

  • Detecting format obsolescence as technologies evolve, for example by checking whether web pages render identically when viewed with different versions of a browser or with different browsers

  • Archive maintenance: operations such as format migration can change how archived versions render.

Pagelyzer is a tool built around a supervised framework that decides whether two web page versions are similar or not. It takes as input two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based). If the browser types are not set, it uses Firefox by default.

It is based on two different technologies:

 

1 – Web page segmentation (let's keep the details for another blog post)

2 – Supervised Learning with Support Vector Machine (SVM).

 

In this blog, I will try to explain simply (without any equations) what SVM does, specifically for Pagelyzer. You have two URLs, let's say url1 and url2, and you would like to know whether they are similar (1) or dissimilar (0).

 

You calculate the similarity (or distance) as a vector, based on the comparison type. If it is image-based, your vector will contain features related to image similarities (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarities (e.g. Jaccard distance for links, images and words). To better explain how it works, let's assume that we have two dimensions: SIFT similarity and HSV similarity.
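
As a toy illustration of one such content-based feature (not Pagelyzer's actual code), a Jaccard similarity over the sets of link URLs extracted from the two versions could be computed like this:

   import java.util.HashSet;
   import java.util.Set;

   public class JaccardSimilarity {
      // |A ∩ B| / |A ∪ B| over, for example, the sets of link URLs in two page versions.
      public static double jaccard(Set<String> a, Set<String> b) {
         if (a.isEmpty() && b.isEmpty()) {
            return 1.0;   // two empty pages are trivially identical
         }
         Set<String> intersection = new HashSet<String>(a);
         intersection.retainAll(b);
         Set<String> union = new HashSet<String>(a);
         union.addAll(b);
         return (double) intersection.size() / union.size();
      }
   }

The closer the value is to 1, the more the two versions share the same links; the same idea applies to the sets of words or image URLs.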

 

To make your system learn, you first provide it with annotated data. In our case, we need a list of URL pairs <url1,url2> annotated manually as similar or not similar. For Pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). You train your system with a part of this dataset.

 

Let's start training:

First, you put all your vectors in the input space. As this data is annotated, you know which pairs are similar (in green) and which are dissimilar (in red).

You find the optimal decision boundary (hyperplane) in input space. Anything above the decision boundary should have label 1 (similar). Similarly, anything below the decision boundary should have label 0 (dissimilar).

 

Let's classify:

 

Your system is intelligent now! When you have a new pair of URLs without any annotation, you can say whether they are similar or not based on the decision boundary.

The pair of URLs in blue will be considered dissimilar, and the one in black will be considered similar by Pagelyzer.
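
To make the classification step concrete, the sketch below shows what the decision amounts to once a linear boundary has been learned: compute a weighted sum of the features and check its sign. The weights and bias are invented numbers, purely for illustration; the real values come out of the SVM training described above.

   public class PairClassifier {
      // Illustrative weights for a two-dimensional feature vector
      // (SIFT similarity, HSV similarity) and a bias term, as if learned during training.
      private static final double W_SIFT = 2.1;
      private static final double W_HSV  = 1.4;
      private static final double BIAS   = -1.8;

      // Returns true ("similar") if the pair falls on the positive side of the boundary.
      public static boolean isSimilar(double siftSimilarity, double hsvSimilarity) {
         double score = W_SIFT * siftSimilarity + W_HSV * hsvSimilarity + BIAS;
         return score >= 0;
      }

      public static void main(String[] args) {
         System.out.println(isSimilar(0.9, 0.8));   // well above the boundary: similar
         System.out.println(isSimilar(0.1, 0.2));   // well below the boundary: dissimilar
      }
   }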

When you choose different types of comparison, you choose different types of features and dimensions. The current version of Pagelyzer uses an SVM trained on 202 pairs of web pages provided by IMF, of which 147 are in the positive class and 55 in the negative class. As it is a supervised system, increasing the size of the training set should lead to better results.

An image to show what happens when you have more than two dimensions:

 

From www.epicentersoftware.com

 

 

References

Structural and Visual Comparisons for Web Page Archiving
M. T. Law, N. Thome, S. Gançarski, M. Cord
12th edition of the ACM Symposium on Document Engineering (DocEng) 2012

 

Structural and Visual Similarity Learning for Web Page Archiving
M. T. Law, C. Sureda Gutierrez, N. Thome, S. Gançarski, M. Cord
10th workshop on Content-Based Multimedia Indexing (CBMI) 2012

 

Block-o-Matic: a Web Page Segmentation Tool and its Evaluation

Sanoja A., Gançarski S.

BDA. Nantes, France. 2013. http://hal.archives-ouvertes.fr/hal-00881693/

 

Yet another Web Page Segmentation Tool

Sanoja A., Gançarski S.

Proceedings iPRES 2012. Toronto. Canada, 2012

 

Understanding Web Pages Changes.

Pehlivan Z., Saad M.B. , Gançarski S.

International Conference on Database and Expert Systems Applications DEXA (1) 2010: 1-15

 

 

 


A Nailgun for the Digital Preservation Toolkit


Fifteen days was the estimate I gave for completing an analysis on roughly 450,000 files we were holding at Archives New Zealand. Approximately three seconds per file for each round of analysis:

   3 x 450,000 = 1,350,000 seconds
   1,350,000 seconds = 15.625 days

My bash script included three calls to Java applications. Apache Tika (1.3 at the time) was called twice, running the -m and -d flags:

   -m or --metadata Output only metadata
   -d or --detect Detect document type

It also made a call to Jhove 1.11 in standard mode. The script also calculates SHA-1 for de-duplication purposes and to match Archives New Zealand's chosen fixity standard, computes a V4 UUID per file, and outputs the result of the Linux file command in two separate modes: standard, and with the -i flag to attempt to identify the MIME type.

Each application receives a path to a single file as an argument from a directory manifest. The script outputs five CSV files that can be further analysed.

The main function used in the script is as follows:

   dp_analysis ()
   {
      FUID=$(uuidgen)
      DIRN=$(dirname "$file")
      BASN=$(basename "$file")

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"file-5.11"'\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}file-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-md"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}tika-md-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-type"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}tika-type-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"jhove-1_11"'\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}jhove-analysis.log

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"sha-1-8.20"'\t' $(sha1sum -b "$file") >> ${LOGNAME}sha-one-analysis.log
   }

What I hadn't anticipated was the expense of starting the Java Virtual Machine (JVM) three times each loop, 450,000 times. The performance hit is prohibitive, so I immediately set out to find a solution: either cut down the number of tools I was using, or figure out how to avoid starting the JVM each time. Fortunately a Google search led me to a solution, and a phrase, that I had heard before: Nailgun.

It has been mentioned on various forums, including comments on various OPF blogs, and it is even found in the Fits release notes. The phrase resonated, and it turned out that Nailgun provides a simple and accessible approach to doing exactly what we need.

One of the things that we haven't seen yet is a guide on using it within the digital preservation workflow. I'll describe how to make best use of this tool, and try and demonstrate its benefits during the remainder of this blog.

For testing purposes we will be generating statistics on a laptop that has the following specification:

   Product: Acer Aspire V5-571PG (Aspire V5-571PG_072D_2.15)
   CPU:     Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
   Width:   64 bits
   Memory:  8GiB

   OS:       Ubuntu 13.10
   Release:  13.10
   Codename: saucy

   Java version "1.7.0_21"
   Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
   Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

JVM Startup Time

First, let's demonstrate the startup cost of the JVM. If we take two functionally equivalent programs, the first in Java and the second in C++, we can look at the time taken to run them 1000 times consecutively.

The purpose of each application is to run and then exit with a return code of zero.

Java: SysExitApp.java:

   public class SysExitApp {
      public static void main(String[] args) {
         System.exit(0);
      }
   }

C++: SysExitApp.cpp:

   int main()
   {
      return(0);
   }

The script to run both, and output the timing for each cycle, is as follows:

   #!/bin/bash

   time (for i in {1..1000}
   do
      java -jar SysExitApp.jar
   done)

   time (for i in {1..1000}
   do
      ./SysExitApp.bin
   done)

The source code can be downloaded from GitHub. Further information about how to build the C++ and Java applications is available in the README file. The output of the script is as follows:

   real 1m26.898s
   user 1m14.302s
   sys  0m13.297s

   real 0m0.915s
   user 0m0.093s
   sys  0m0.854s

With the C++ binary, the average time taken per execution is 0.915ms (0.915 seconds of real time over 1000 runs). For the Java application this rises to 86.898ms on average. One can reasonably put the difference down to the cost of the JVM startup.

Both C++ and Java are compiled languages. C++ compiles down to machine code: instructions that can be executed directly by the CPU (Central Processing Unit). Java compiles down to bytecode. Bytecode lends itself to portability across many devices, with the JVM providing an abstraction layer that handles differences in hardware configuration before interpreting the bytecode down to machine code.

A good proportion of the tools in the digital preservation toolkit are implemented in Java, e.g. DROID, Jhove, Tika, Fits. As such, we currently have to take this performance hit, and optimizations must focus on handling that effectively.

Enter Nailgun

Nailgun is a client/server application that removes the overhead of starting the JVM by running a single, long-lived JVM in the server, within which all command-line based Java applications can run. The Nailgun client passes each invocation to that server: the command line (and stdin) one normally associates with a particular application, e.g. running Tika with the -m flag and passing it a reference to a file. The application runs on the server, and Nailgun directs its stdout and stderr back to the client, which then outputs them to the console.

With the exception of the command line being executed within a call to the Nailgun client, behaviour remains consistent with that of the standalone Java application. The Nailgun background information page provides a more detailed description of the process.

How to build Nailgun

Before running Nailgun it needs to be downloaded from GitHub and built using Apache Maven to build the server, and the GNU Make utility to build the client. The instructions in the Nailgun README describe how this is done.

How to start Nailgun

Once compiled the server needs to be started. The command line to do this looks like this:

   java -cp /home/digital/dp-toolkit/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar -server com.martiansoftware.nailgun.NGServer

The classpath needs to include the path to the Nailgun server Jar file. The command to start the server can be expanded to include any further application classes you want to run. There are other ways it can be modified as well. For further information please refer to the Nailgun Quick Start Guide. For simplicity we start the server using the basic startup command.

Loading the tools (Nails) into Nailgun

As mentioned above, the tools you want to run can be loaded into Nailgun at startup. For my purposes, and to provide a useful and simple overview for all, I found it easiest to load them via the client application.

Applications loaded into Nailgun need to have a main class. It is possible to find out if the application has a main class by opening the Jar in an archive manager capable of opening Jars, such as 7-Zip. Locate the META-INF folder, and within that the MANIFEST.MF file. This will contain a line similar to this example from the Tika Jar’s MANIFEST.MF in tika-app-1.5.jar.

   Main-Class: org.apache.tika.cli.TikaCLI

Confirmation of a main class means that we can load Tika into Nailgun with the command:

   ng ng-cp /home/digital/dp-toolkit/tika-1.5/tika-app-1.5.jar

Before working with our digital preservation tools, we can try running the Java application created to baseline the JVM startup time alongside the functionally comparable C++ application.

MANIFEST.MF within the SysExitApp.jar file reads as follows:

   Manifest-Version: 1.0
   Created-By: 1.7.0_21 (Oracle Corporation)
   Main-Class: SysExitApp

As it has a main class we can load it into the Nailgun server with the following command:

   ng ng-cp /home/digital/Desktop/dp-testing/nailgun-timing/exit-apps/SysExitApp.jar

The command ng-cp tells Nailgun to add it to its classpath. We provide an absolute path to the Jar we want to execute. We can then call its main class from the Nailgun client.

Calling a Nail from the Command Line

Following that, we want to call our application from within the terminal. Previously we have used the command:

   java -jar SysExitApp.jar

This calls Java directly and thus the JVM. We can replace this with a call to the Nailgun client and our application's main class:

   ng SysExitApp

We don't expect to see any output at this point: provided no error occurs, it will simply return a new input line on the terminal. On the server, however, we will see the following:

   NGSession 1: 127.0.0.1: SysExitApp exited with status 0

And that's it. Nailgun is up and running with our application!

We can begin to see the performance improvement gained by removing the expense of the JVM startup when we execute this command using our 1000 loop script. We simply add the following lines:

   time (for i in {1..1000}
   do
      ng SysExitApp
   done)

This generates the output:

   real 0m2.457s
   user 0m0.157s
   sys  0m1.312s

Compare that to running the Jar, and compiled binary files before:

   real 1m26.898s
   user 1m14.302s
   sys  0m13.297s

   real 0m0.915s
   user 0m0.093s
   sys  0m0.854s

It is not as fast as the compiled C++ code but it represents an improvement of well over a minute compared to calling the JVM each loop.

The Digital Preservation Toolkit Comparison Script

Up and running we can now baseline Nailgun with the script used to run our digital preservation analysis tools.

We define two functions: one that calls the Jars we want to run without Nailgun, and the other to call the same classes, with Nailgun:

   dp_analysis_no_ng ()
   {
      FUID=$(uuidgen)
      DIRN=$(dirname "$file")
      BASN=$(basename "$file")

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"file-5.11"'\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-md"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}${TIKAMDLOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-type"'\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}${TIKATYPELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"jhove-1_11"'\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}${JHOVELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"sha-1-8.20"'\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
   }

   dp_analysis_ng ()
   {
      FUID=$(uuidgen)
      DIRN=$(dirname "$file")
      BASN=$(basename "$file")

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"file-5.11"'\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-md"'\t' $(ng org.apache.tika.cli.TikaCLI -m "$file") >> ${LOGNAME}${TIKAMDLOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"tika-1.5-type"'\t' $(ng org.apache.tika.cli.TikaCLI -d "$file") >> ${LOGNAME}${TIKATYPELOG}

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"jhove-1_11"'\t' $(ng Jhove "$file") >> ${LOGNAME}${JHOVELOG}  

      echo -e ${FUID} '\t'"$file"'\t' ${DIRN} '\t' ${BASN} '\t'"sha-1-8.20"'\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
   }

Before we define the functions, we load the applications into Nailgun with the following commands:

   #Load JHOVE and TIKA into Nailgun CLASSPATH
   $(ng ng-cp ${JHOVE_HOME}/bin/JhoveApp.jar)
   $(ng ng-cp ${TIKA_HOME}/tika-app-1.5.jar)

The complete script can be found on GitHub with more information in the README file.

Results

For the purpose of this blog I reworked the Open Planets Foundation Test Corpus and have used a branch of that to run the script across. There are 324 files in the corpus with numerous different formats. The script produces the following results:

   real 13m48.227s
   user 26m10.540s
   sys  0m59.861s

   real 1m32.801s
   user 0m4.548s
   sys  0m16.847s

Stderr is redirected to a file called errorlog.txt using the ‘2>’ syntax, to enable me to capture the output of all the tools and to avoid any expense of the tools printing to the screen. Errors shown in the log relate to the tools' ability to parse certain files in the corpus rather than to Nailgun. The errors should be reproducible with the same format corpus and tool set.

There is a marked difference in performance when running the script with Nailgun and without. Running the tools as-is we find that each pass takes approximately 2.52 seconds per file on average.

Using Nailgun this is reduced to approximately 0.28 seconds per file on average.

Conclusion

The timing results collected here will vary quite widely on different systems, and even on the same system. The disparity between starting a new JVM for each call and running the applications under Nailgun should, however, be just as stark wherever the tests are run.

While I hope this blog provides a useful Nailgun tutorial, the concern I have, after working in anger with the tools we talk about in the digital preservation community on a daily basis, is understanding what smaller institutions with smaller IT departments, and potentially fewer IT capabilities, are doing, and whether they are even able to make use of the tools out there given the overheads described.

It is possible to throw more technology and more resources at this issue but it can't be expected that this will always be possible. The reason I sought this workaround is that I can't see that capability being developed at Archives New Zealand without significant time and investment, and that capability can't always be delivered in short-order within the constraints of working within government. My analysis, on a single collection of files, needs to be complete within the next few weeks. I need tools that are easily accessible and far more efficient to be able to do this. 

It is something I'll have to think about some more. 

Nailgun gives me a good short-term solution, and hopefully this blog opens it up as a solution that will prove useful to others too.

It will be interesting to learn, following this work, how others have conquered similar problems, or equally interesting, if they are yet to do so.  

--

Notes:

Loops: I experimented with various loops for recursing the directories in the opf-format-corpus expecting to find differences in performance within each. Using the Linux time command I was unable to find any material difference in either loop. The script used for testing is available on GitHub. The loop executes a function that calls two Linux commands, ‘sha1sum’ and ‘file’. A larger test corpus may help to reveal differences in either approach. I opted to stick with iterating over a manifest as this is more likely to mirror processes within our organization.

Optimization: I recognize a certain naivety in my script, which was produced to collect quick and dirty results from a test set that I only have available for a short period of time. The first surprise running the script was the expense of the JVM startup. After finding a workaround for that I now need to look at other optimizations to continue to approach the analysis this way. Failing that, I need to understand from others why this approach might not be appropriate, and/or sustainable. Comments and suggestions along those lines as part of this blog are very much appreciated.

And Finally...

All that glisters is not gold: Nailgun comes with its own overhead. Running the tool on a server at work with the following specification:

   Product: HP ProLiant ML310 G3
   CPU:     Intel(R) Pentium(R) 4 CPU 3.20GHz
   Width:   64 bits
   Memory:  5GiB
   
   OS:        Ubuntu 10.04.4 LTS
   Release:   10.04
   Codename:  lucid
   
   Java version "1.7.0_51"
   Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
   Java HotSpot(TM) Client VM (build 24.51-b03, mixed mode)

We find it running out of heap space around the 420,000th call to the server, with the following message:

   java.lang.OutOfMemoryError: Java heap space

If we look at the system monitor we can see that the server has maxed out the amount of RAM it can address. The amount of memory it is using grows with each call to the server. I haven't a mechanism to avoid this at present, other than chunking the file set and restarting the server periodically. Users adopting Nailgun might want to take note of this issue up-front. Throwing memory at the problem will help to some extent but a more sustainable solution is needed, and indeed welcomed. This might require optimizing Nailgun, or instead, further optimization of the digital preservation tools that we are using.

 

SCAPE Webinar: ToMaR – The Tool-to-MapReduce Wrapper: How to Let Your Preservation Tools Scale


Overview

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.
ToMaR offers the possibility to use existing command-line tools and Java applications in Hadoop’s distributed environment in much the same way as on a desktop computer. By utilizing SCAPE tool specification documents, ToMaR allows users to specify complex command-line patterns as simple keywords, which can be executed on a computer cluster or a single machine. ToMaR is a generic MapReduce application which does not require any programming skills.

This webinar will introduce you to the core concepts of Hadoop and ToMaR and show you by example how to apply it to the scenario of file format migration.

Learning outcomes

1. Understand the basic principles of Hadoop
2. Understand the core concepts of ToMaR
3. Apply knowledge of Hadoop and ToMaR to the file format migration scenario

Who should attend?

Practitioners and developers who are:

• dealing with command line tools (preferably of the digital preservation domain) in their daily work
• interested in Hadoop and how it can be used for binary content and 3rd-party tools

Session Lead: Matthias Rella, Austrian Institute of Technology

Time: 10:00 GMT / 11:00 CET

Duration: 1 hour

Date: Friday, 21 March 2014
Location: United Kingdom

A Tika to ride; characterising web content with Nanite


This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid  
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop, to enable it to directly process ARC and WARC files from HDFS without an intermediate processing step.  The initial part of a Nanite-Hadoop run is a test to check that the input files are valid gz files.  This is very quick (it takes seconds) and ensures that there are no invalid files that could crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
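
A cheap check along those lines can be done by simply trying to open each file as a gzip stream and reading from it, something like the sketch below (a hedged illustration only, not necessarily how Nanite-Hadoop implements it):

   import java.io.FileInputStream;
   import java.io.IOException;
   import java.util.zip.GZIPInputStream;

   public class GzCheck {
      // Returns true if the file at least starts with a readable gzip header.
      public static boolean looksLikeGzip(String path) {
         try (FileInputStream fis = new FileInputStream(path);
              GZIPInputStream gz = new GZIPInputStream(fis)) {
            gz.read();   // constructing the stream and reading a byte forces header parsing
            return true;
         } catch (IOException e) {
            return false;
         }
      }
   }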

We have been working on Nanite to add different characterisation libraries and to improve their coverage.  As the tools used are all Java, or use native library calls, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster.  This is less than 1ms/file.  Libraries can be turned on/off relatively easily by editing the source or FormatProfiler.properties in the jar.

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to have been files in that corpus that were corrupt or otherwise broken that would cause crashes in Tika or its dependencies. 

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that the existence of one bad file of a particular format meant that no characterisation of that format could be carried out.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc were caught.  Then I created a TimeoutParser that parsed the files in a background Thread and forcibly stopped the Tika parser after a time limit had been exceeded.  This worked ok, however, it made use of Thread.stop() – a deprecated API call to stop a Java Thread.  Use of this API call is thoroughly not recommended as it may corrupt the internal state of the JVM or produce other undesired effects.  Details about this can be read in an issue on the Tika bug tracker.  Since I did not want to risk a corruption of the JVM I did not pursue this further. 
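
Reconstructed from the description above (so a sketch rather than the actual Nanite code), the TimeoutParser idea looks roughly like this: run the parse in its own thread, wait up to a time limit, and forcibly stop the worker if it is still running, which is exactly where the deprecated Thread.stop() comes in.

   import java.io.InputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.sax.BodyContentHandler;

   public class TimeoutParserSketch {
      // Parse the stream with Tika, giving up (and killing the worker) after timeoutMillis.
      @SuppressWarnings("deprecation")
      public static Metadata parseWithTimeout(final InputStream stream, long timeoutMillis)
            throws InterruptedException {
         final Metadata metadata = new Metadata();
         Thread worker = new Thread(new Runnable() {
            public void run() {
               try {
                  new AutoDetectParser().parse(stream, new BodyContentHandler(-1),
                                               metadata, new ParseContext());
               } catch (Exception e) {
                  // a failed parse simply leaves the metadata incomplete
               }
            }
         });
         worker.start();
         worker.join(timeoutMillis);
         if (worker.isAlive()) {
            worker.stop();   // deprecated, and the root of the problem described above
         }
         return metadata;
      }
   }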

I should note that it has subsequently been suggested that an alternative to using Thread.stop() is to just leave the stuck thread alone for the JVM to deal with and create a new Thread.  This is a valid method of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue, linked above, I produced a library that abstracted a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read, although such errors should be logged by Nanite in the Hadoop Reducer output.

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by the length of time that process is used for.  Aside from the cost of transferring files to the Tika-server, the overhead is a larger jar file, longer initial start-up time for Mappers and additional time for restarts of the Tika-server on failed files.  Given that the average runtime per WARC is slightly over 5 minutes, the few additional seconds added by using a process-isolated Tika are not a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with the file, Tika detect() is first run on the file and that mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST as it would be if we called them directly from the Java code.

Another approach could have been to use Nailgun as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them) and if a Tika parser crashed or caused the JVM to hang then it may corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’ experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 cpu/map slot per node); the results are as follows:

   Identification tools used:                    Nanite-core (Droid), Tika detect() (mimetype only), ProcessIsolatedTika parsers
   WARC files:                                   1000
   Total WARC size:                              59.4GB (63,759,574,081 bytes)
   Total files in WARCs (# input records):       7,612,660
   Runtime (hh:mm:ss):                           03:06:24
   Runtime/file:                                 1.47ms
   Throughput:                                   19.1GB/hour
   Total Tika parser output size (compressed):   765MB (801,740,734 bytes)
   Tika parser failures/crashes:                 15
   Misc failures:                                Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into c3po but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed pdf version detection within the pdf parser.

As an aside, if FITS offered a REST interface then the ProcessIsolatedTika code could easily be modified to replace Tika with FITS. This is worth considering, if there were interest and someone were to create such a REST interface.

Apologies for the puns.

CSV Validator - beta releases


For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against user-defined schemas.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under the Mozilla Public Licence version 2.0.

For more details, links to the source code repository, release code on Maven Central, instructions and schema specification, see http://digital-preservation.github.io/csv-validator/

Feedback is welcome.  When we make the formal version 1.0 release there will be a fuller blog post on The National Archives blog.
