In my last post I convinced myself that a standard USGS topo map contains a heap of image data which, if removed, would yield a smaller map file with no loss of functionality. Now it's time to make that happen.
My first attempt was to follow this recipe, which suggested using PyMuPDF to redact every image in the document. While functionally promising, the result was a miss. First, the redaction process takes a significant amount of time, given that there are over 170 images in a single map. More importantly, the redacted images are replaced with text that leaks outside the 'Images' map layer. The result was a map covered in redaction text, which, as you can imagine, was useless.
Looking at the PyMuPDF docs, I found Page.delete_image. Apparently, I had been overthinking this. It looked like the image removal process was going to be as simple as:
```python
for page_num in range(len(doc)):
    page = doc[page_num]
    for img in doc.get_page_images(page_num):
        xref = img[0]
        page.delete_image(xref)
```
That is, for each page of the document, loop through every image on that page. For each of these images, call delete_image. Alas, when I tried this, delete_image triggered an error message:
```
File "/Users/ben/Library/Python/3.9/lib/python/site-packages/fitz/utils.py", line 255, in replace_image
    if not doc.is_image(xref):
AttributeError: 'Document' object has no attribute 'is_image'
```
Looking at the source code, the error message is right: Document doesn't have an is_image method on it. This looks like a bug in PyMuPDF.
Fortunately, what was broken about delete_image was a pre-check I didn't need. The code that does the work of removing the image appears to be functional, so I grabbed it and used it directly.
Deleting an image is now accomplished with this code:
```python
# A 1x1 transparent pixmap to stand in for each removed image
# (this is what PyMuPDF's own delete_image builds internally)
pix = fitz.Pixmap(fitz.csGRAY, (0, 0, 1, 1), 1)
pix.clear_with()

for page_num in range(len(doc)):
    page = doc[page_num]
    for img in doc.get_page_images(page_num):
        xref = img[0]
        # Overwrite the image's xref with the tiny pixmap, then blank
        # the contents stream that painted it
        new_xref = page.insert_image(page.rect, pixmap=pix)
        doc.xref_copy(new_xref, xref)
        last_contents_xref = page.get_contents()[-1]
        doc.update_stream(last_contents_xref, b" ")

doc.save(output_file, deflate=True, garbage=3)
```
The doc.save arguments of deflate=True and garbage=3 ensure that space is reclaimed from the removed images.
Given my newfound knowledge, I enhanced pdfimages to support -a remove, which removes all images in a PDF.
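The option-parsing side of that enhancement is nothing fancy. Here's a hypothetical sketch of the interface using argparse; the real script's options and defaults may differ:

```python
import argparse

# Hypothetical sketch of the pdfimages CLI surface.
parser = argparse.ArgumentParser(prog="pdfimages")
parser.add_argument("-a", "--action", choices=["info", "extract", "remove"],
                    default="info", help="what to do with the PDF's images")
parser.add_argument("-i", "--input", required=True, help="input PDF")
parser.add_argument("-o", "--output", help="output PDF (used by remove)")

# Simulated invocation: pdfimages -a remove -i map.pdf -o compressed/map.pdf
args = parser.parse_args(["-a", "remove", "-i", "map.pdf", "-o", "compressed/map.pdf"])
print(args.action)  # remove
```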
Here's my script in action:
```shell
# 4 freshly downloaded USGS Topo Maps
$ ls -lh *.pdf
-rw-------@ 1 ben  staff  53M Sep 23 00:20 VA_Bon_Air_20220920_TM_geo.pdf
-rw-------  1 ben  staff  56M Sep 17 00:17 VA_Chesterfield_20220908_TM_geo.pdf
-rw-------  1 ben  staff  48M Sep 23 00:21 VA_Drewrys_Bluff_20220920_TM_geo.pdf
-rw-------  1 ben  staff  48M Feb  8 08:05 VA_Drewrys_Bluff_20220920_TM_geo.with_images.pdf
-rw-------  1 ben  staff  51M Sep 23 00:22 VA_Richmond_20220920_TM_geo.pdf

# Remove their images
$ for pdf in *.pdf; do \
    pdfimages -a remove -i $pdf -o compressed/$pdf ; \
  done

# And we're smaller! From 50meg to 6meg. Not bad.
$ ls -lh compressed/
total 69488
-rw-------  1 ben  staff  6.7M Feb  9 07:47 VA_Bon_Air_20220920_TM_geo.pdf
-rw-------  1 ben  staff  8.0M Feb  9 07:47 VA_Chesterfield_20220908_TM_geo.pdf
-rw-------  1 ben  staff  6.4M Feb  9 07:47 VA_Drewrys_Bluff_20220920_TM_geo.pdf
-rw-------  1 ben  staff  6.4M Feb  9 07:47 VA_Drewrys_Bluff_20220920_TM_geo.with_images.pdf
-rw-------  1 ben  staff  6.3M Feb  9 07:47 VA_Richmond_20220920_TM_geo.pdf

# Are the PDF layers still intact? They are.
$ python3 ~/dt/i2x/src/trunk/tools/bash/bin/pdflayers -l compressed/VA_Richmond_20220920_TM_geo.pdf
off:230:Labels
on:231:Map Collar
on:232:Map Elements
on:233:Map Frame
on:234:Boundaries
on:235:Federal Administrated Lands
on:236:National Park Service
on:237:National Cemetery
on:238:Jurisdictional Boundaries
on:239:County or Equivalent
on:240:State or Territory
on:241:Woodland
on:242:Terrain
off:243:Shaded Relief
on:244:Contours
on:245:Hydrography
on:246:Wetlands
on:247:Transportation
on:248:Airports
on:249:Railroads
on:250:Trails
on:251:Road Features
on:252:Road Names and Shields
on:253:Structures
on:254:Geographic Names
on:255:Projection and Grids
off:256:Images
on:257:Orthoimage
on:258:Barcode
```
My script shrinks a USGS PDF from 50'ish megs to 7'ish. That means I can now store the 1,697 map files for Virginia in 11.8 gigs of disk space, instead of 84.8 gigs. That's quite an improvement for a script that was relatively easy to write, and fast to execute.
The question remains: does the modified PDF remain a valid GeoPDF? Will Avenza Maps treat it like a location aware document? I loaded up one of my newly compressed maps into Avenza to confirm this:
Success! As you can see, Avenza is able to detect the coordinates on the map, as well as measure distances and bearings. The image-less maps are more compact, and completely functional.
You'll notice in the above screenshot that there are no street names printed on the map. That's by design. I turned off the layer that displays this information to verify that OCGs are still being respected. They are.
Time to start filling up Micro SD cards with collections of maps.