Ben's Journal: Jedediah Hotchkiss' Sketchbook | From Library of Congress image gallery to mobile friendly PDF

Monday, November 12, 2018

I believe that Jedediah Hotchkiss' Civil War sketchbook would make for interesting reading. While this work is published on the Library of Congress's (LoC) website, it runs to 224 pages, and I wanted a more convenient way of reading it than clicking through an online image gallery.

Here's how I arrived at a single PDF file containing every page of Jed's personal sketchbook.

Step 1: I viewed the source of the LoC page and noted a link tag with rel="alternate".

Step 2: curling this URL returned a wealth of interesting information:

$ curl -s  'https://www.loc.gov/item/2005625258/?fo=json' | jq .
{
  "articles_and_essays": [
    {
      "site": [
        "lcweb"
      ],
      "contributor": [
        "potter, abbey"
      ],
      "original-format": [
        "web page"
      ],
      "partof": [
...
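
The fo=json parameter in the curl call above is what asks loc.gov for JSON rather than HTML. As a small sketch, a helper along these lines appends it to any LoC URL, using & when the URL already has a query string. The name json_url is invented for illustration; it is not something LoC provides.

```bash
#!/bin/bash
# json_url is a hypothetical helper name, not part of any LoC tooling.
# It appends fo=json to a URL, picking ? or & depending on whether
# the URL already carries a query string.
json_url() {
  case "$1" in
    *fo=json*) printf '%s\n' "$1" ;;          # already requests JSON
    *\?*)      printf '%s\n' "$1&fo=json" ;;  # has other parameters
    *)         printf '%s\n' "$1?fo=json" ;;  # no query string yet
  esac
}

json_url 'https://www.loc.gov/item/2005625258/'
json_url 'https://www.loc.gov/resource/g3880m.gcwh0001/?c=200&st=slideshow'
```

The first call produces the same URL that was passed to curl in step 2; the result can be piped through jq in the same way.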

Step 3: rather than reading the details of this JSON format, I poked around until I found this critical block:

    "resources": [
      {
        "files": 117,
        "captions": "http://cdn.loc.gov/service/gmd/gmd388m/g3880m/g3880m/gcwh0001/captions.txt",
        "image": "http://cdn.loc.gov/service/gmd/gmd388m/g3880m/g3880m/gcwh0001/ca000001.gif",
        "url": "http://www.loc.gov/resource/g3880m.gcwh0001/"
      }
    ],

Step 4: between curl and my browser, I was able to write the following code which pulls down all 117 images associated with this LoC entry:

#!/bin/bash

##
## Grab content from the Library of Congress
##
## For example:
##  locget 'https://www.loc.gov/resource/g3880m.gcwh0001/?c=200&fo=json&st=slideshow'
##

usage() {
  echo "$(basename "$0") {gallery-url}" >&2
  exit 1
}

if [ -z "$1" ]; then
  usage
fi

resource_url="$1"

# Fetch the JSON once and pull both URLs out of the same response.
json=$(curl -s "$resource_url")
captions_url=$(jq -r '.resources[0].captions' <<< "$json")
image_url=$(jq -r '.resources[0].image' <<< "$json")

# Turn the image's CDN directory into a colon-separated IIIF path.
path=$(dirname "$image_url" | sed -e 's|http://cdn.loc.gov/||' \
                                  -e 's|/|:|g')

# The third whitespace-separated field of each captions row names an image.
curl -s "$captions_url" | while read -r _ _ file _ ; do
  if [ -n "$file" ] ; then
    curl -s "http://tile.loc.gov/image-services/iiif/$path:$file/full/pct:100/0/default.jpg" > "$file.jpg"
  fi
done

Note the call to tile.loc.gov to pick up the image files. By setting pct:100, I'm able to request full-size images. It's also possible to provide a value like pct:50 to pick up images that are half size.
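
To make the URL pattern concrete, here is a sketch that only builds the tile.loc.gov address and downloads nothing. The function name iiif_url and the sample identifier ca000001 are assumptions for illustration: the identifier mirrors the file name in the image URL from step 3, while the real ones come from captions.txt. The optional third argument is the percentage that ends up after pct:.

```bash
#!/bin/bash
# iiif_url is a hypothetical helper that mirrors the URL built in the script.
#   $1: the "image" URL from the resources block
#   $2: an image identifier (assumed to match a captions.txt entry)
#   $3: scale percentage, defaults to 100 (full size)
iiif_url() {
  local image_url="$1" file="$2" pct="${3:-100}"
  local path
  # Strip the CDN host and turn the directory separators into colons.
  path=$(dirname "$image_url" | sed -e 's|http://cdn.loc.gov/||' -e 's|/|:|g')
  printf 'http://tile.loc.gov/image-services/iiif/%s:%s/full/pct:%s/0/default.jpg\n' \
    "$path" "$file" "$pct"
}

# Half-size variant of the (assumed) first image:
iiif_url 'http://cdn.loc.gov/service/gmd/gmd388m/g3880m/g3880m/gcwh0001/ca000001.gif' ca000001 50
```

Printing the URL first makes it easy to paste one into a browser and compare sizes before committing to a full download.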

Step 5: with step 4 complete, I had a full set of images locally. However, each image contains both a left- and a right-hand page. To split the pages into separate files, I used my good friend ImageMagick:

$ mkdir pages
$ cd pages
$ for f in ../*.jpg ; \
   do echo "$f" ; convert "$f" -crop 50%x100% +repage "$(basename "$f")" ; \
   done
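
Because the crop yields two tiles per input, ImageMagick appends -0 and -1 to the output name, so an image called NAME.jpg should end up as NAME-0.jpg and NAME-1.jpg. The sketch below lists any halves that did not get written, which is a quick sanity check before building the PDF. The helper name missing_pages is made up, and it assumes exactly that two-tile naming.

```bash
#!/bin/bash
# missing_pages is a hypothetical helper. It assumes each source image
# NAME.jpg was split into NAME-0.jpg and NAME-1.jpg in the pages directory.
#   $1:   directory holding the split pages
#   $2..: the original two-page images
missing_pages() {
  local dir="$1" f base n
  shift
  for f in "$@"; do
    base=$(basename "$f" .jpg)
    for n in 0 1; do
      # Print the name of every expected half that is absent.
      [ -e "$dir/$base-$n.jpg" ] || printf '%s\n' "$base-$n.jpg"
    done
  done
}

# From inside pages, as in the session above:
#   missing_pages . ../*.jpg
```

No output means every image produced both of its halves.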

Step 6: Finally, I created a single (massive) PDF file by running the command:

$ convert *.jpg master.pdf

You can download the generated PDF here.

And here are a few screenshots of me scrolling through Jed's sketchbook on my Galaxy S9+:

The formatting isn't perfect, and the PDF file is massive. But still, I'm able to scroll through the pages with ease, and I can view detail by simply zooming in.

If I had a horse, I could peruse the content from the same perspective Jedediah created it. Though, even I admit that's probably excessive.