better pdfs on linux

June 29, 2019

Preface: Not that long ago I received a couple of textbooks and learning materials as a RAR archive of black-and-white TIFF files and decided I could surely do better. I obtained and scanned the physical textbooks I managed to find, and the one I couldn’t re-scan myself went through a rebirth: its TIFF files were mercilessly cut, recompressed and compiled to PDF in GIMP. While the PDF creation process in GIMP is far from perfect, it is easy to use, and you can make a pretty good PDF from other document formats or raw image scans.

But organizing the document’s structure is another matter. While these instructions aren’t perfect, they can give you a hint of what to expect and save you some time looking for the perfect software (especially if you decide it’s not worth it).

Software I used: Simple Scan, MuPDF (mupdf-tools package), jpdftweak, LibreOffice Calc, PDF Tricks. To check the result I used various readers: Evince, Okular and pdf.js on desktop, and Adobe Reader and the MuPDF viewer on mobile.

1. Scanning — Simple Scan.

Take a look at your source to decide what kind of scan you want. For (plain-text) textbooks the default setting of 300 dpi is a bit too much, unless the pages are small or there is a lot of small text. Personally I’ve gotten pleasant results with 150 dpi, but if your source is monochrome you can go for 300 or higher and enjoy really crisp characters. If you’re going to scan a book, select “multiple pages from flatbed” and set a delay between scans; if you can’t keep up with your scanner, simply pause the scan and increase the delay. After you’re done scanning, you can re-check the scanned pages and apply basic manipulations such as crop and rotate. Don’t worry about spreads (double pages): just make sure you crop them so that the dividing line sits in the middle. If there are some blurry pages, delete them and re-scan (I haven’t checked whether the new page is placed after the selection or at the end, so verify the page order). Then save the result in the desired format (to follow this guide further, save it as a PDF document). Take a sip of tea and look at the result.
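To get a feel for what a resolution setting means in practice: the pixel size of a scanned page is simply its physical size in inches multiplied by the dpi. A minimal sketch, assuming an A4 page (210 × 297 mm); the function name is made up for this example.

```python
MM_PER_INCH = 25.4

def scan_size_px(width_mm, height_mm, dpi):
    """Return the (width, height) in pixels of a page scanned at the given dpi."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

# A4 is an assumption here; substitute your book's real page size.
for dpi in (150, 300):
    print(dpi, scan_size_px(210, 297, dpi))
```

Doubling the dpi roughly quadruples the number of pixels per page, which is worth keeping in mind before scanning a whole book at the higher setting.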

2. Splitting double pages (or “Guillotining”, in French-revolutionary tradition).

The most convenient way I’ve found is to use mutool from the MuPDF software suite. Although it is a CLI program, its usage is extremely straightforward:

usage: mutool poster [options] input.pdf [output.pdf]
    -p -    password
    -x      x decimation factor
    -y      y decimation factor

If your scan has two book pages side by side on every PDF page, enter:

    mutool poster -x 2 scanraw.pdf scansplit.pdf

Then open the new file to see it properly split, with its number of pages now doubled. For re-printing or reading on another device, these two steps may suffice.
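If you have several scans to split, the same command can be scripted. A minimal sketch, assuming mutool is installed and on your PATH; the helper names are invented for this example, and only the -x option from the usage text is used.

```python
import subprocess

def poster_command(src, dst, x=2):
    """Build the mutool command that cuts every page into x vertical strips."""
    return ["mutool", "poster", "-x", str(x), src, dst]

def expected_pages(pages_before, x=2):
    """Each input page becomes x output pages."""
    return pages_before * x

def split_spreads(src, dst, x=2):
    """Run the split; raises CalledProcessError if mutool reports a failure."""
    subprocess.run(poster_command(src, dst, x), check=True)

# Building the command does not run anything yet.
print(poster_command("scanraw.pdf", "scansplit.pdf"))
print(expected_pages(120))
```

Calling split_spreads("scanraw.pdf", "scansplit.pdf") then does the actual work; comparing the page count of the output with expected_pages is a quick sanity check.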

If you plan to make heavy use of the file, jumping between its parts and chapters, you may want to create a table of contents [3], or maybe just make the file more lightweight to save space [4].

3. Writing the contents — jpdftweak.

For any textbook or guideline this is essential. I did it using jpdftweak. You will need Java and patience.

Before you continue, open the PDF in a viewer of your choice and check whether the page numbering displayed in the viewer corresponds to the page numbers printed in your document. You can always fix extra or missing pages later, but it’s better to do it before writing the contents, so that you can check that every entry points to the right part of the document. Special numbering (I, II, III; 1-130; A, B) is a particular caveat: while jpdftweak supports it, the pages you enter in the contents must still follow the raw page numbering, disregarding this feature.
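The raw page number is just the printed number shifted by however many PDF pages come before that numbering range. A small sketch of the conversion, assuming a made-up book with 8 Roman-numbered front pages followed by the main text; the names and numbers are illustrative only.

```python
def raw_page(printed, pages_before):
    """Convert a printed page number to the raw (1-based) PDF page number.

    pages_before is how many PDF pages precede printed page 1 of this range.
    """
    return printed + pages_before

FRONT_MATTER = 8  # hypothetical: pages I-VIII come first

# The front matter starts the file, so nothing precedes it.
print(raw_page(3, 0))              # printed "III" -> 3
# The main text starts after the 8 front-matter pages.
print(raw_page(1, FRONT_MATTER))   # printed "1"   -> 9
print(raw_page(42, FRONT_MATTER))  # printed "42"  -> 50
```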

Now open your document in jpdftweak and select the Bookmarks tab. Check “Change chapter bookmarks”. The structure is simple: in many cases you will only need to edit “Depth”, “Title” and “Page”, and maybe the font options.

Note: Different viewers support contents features differently. Evince (GNOME Document Viewer), for example, doesn’t care about font options (I guess the same goes for many people), and some software auto-expands all levels. So if you’re targeting a specific use case or viewer, take that into account and consider whether the time you’re going to spend on the look of the contents is worth it.

Try filling in a couple of lines in jpdftweak’s editor and see if you’re ready to continue. The GUI is really simple and lacks support for many hotkeys, so the experience is far from pleasant, especially if your table of contents includes hundreds of entries. In that case, the import and export feature comes to the rescue.

Warning: in jpdftweak, levels must deepen strictly one step at a time: 1→2→3→4. A structure like 1→2→4, which skips depth level 3, will result in an error. Going the other way is fine: a first- or second-level entry works after any deeper level. So make sure there are no jumps that jpdftweak can’t handle, and everything will proceed smoothly.
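This depth rule is easy to check mechanically before importing a long list. A minimal sketch, assuming you already have the depth column as a plain list of integers (how you pull it out of your file is up to you); the function name is hypothetical.

```python
def find_depth_jumps(depths):
    """Return (row, previous_depth, depth) for every entry that skips a level.

    Rows are 1-based. The first entry is compared against depth 0, so an
    outline that does not start at depth 1 is reported too (an assumption:
    the warning only describes jumps between entries).
    """
    problems = []
    previous = 0
    for row, depth in enumerate(depths, start=1):
        if depth > previous + 1:
            problems.append((row, previous, depth))
        previous = depth
    return problems

print(find_depth_jumps([1, 2, 3, 1, 2, 2, 1]))  # []
print(find_depth_jumps([1, 2, 4, 1]))           # [(3, 2, 4)]
```

Going back up by any number of levels is never reported, which matches the behaviour described in the warning.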

Enter the basic structure of your contents (for example, a couple of parts with different levels) and export it to .csv, or use [mine], and open it with LibreOffice Calc or another editor of your choice. Now you can see the structure, and editing becomes more straightforward: with technical documents you mostly copy and paste repeating elements. When you finish, open the .csv in jpdftweak and check that everything is in order. See if there are any other options you would like to edit with jpdftweak (like document metadata, tags or the above-mentioned special numbering, if you need it) and export to a new file; I recommend using the options set below.
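Repeating elements can also be generated instead of pasted. A rough sketch, assuming every chapter contains the same subsections at the same page offsets (rarely exactly true, so check the result); the titles, page numbers and function name are invented, and the rows still have to be arranged in whatever column order your own exported .csv uses.

```python
def expand_outline(chapters, subsections):
    """Build (depth, title, page) rows for chapters with identical subsections.

    chapters:    list of (title, first_page)
    subsections: list of (title, offset_from_chapter_start)
    """
    rows = []
    for chapter_title, first_page in chapters:
        rows.append((1, chapter_title, first_page))
        for sub_title, offset in subsections:
            rows.append((2, sub_title, first_page + offset))
    return rows

# Hypothetical textbook layout, for illustration only.
chapters = [("Unit 1", 5), ("Unit 2", 17)]
subsections = [("Vocabulary", 0), ("Grammar", 4), ("Exercises", 8)]

for row in expand_outline(chapters, subsections):
    print(row)
```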

4. Compressing — PDF Tricks.

PDF Tricks is the most efficient easy-to-use app I’ve found; you can install it from Flathub. Open the file you need compressed, select the level of compression and, after a minute or so, open the new file with the _compressed suffix. The approach is simple: try and see. Usually the default level shows whether compression is worth applying and whether this type of document benefits from it. With the above-mentioned set of programs, the low compression method provided no reduction in file size, and in most cases high compression created unpleasant JPEG-like artifacts.
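For the try-and-see approach it helps to put a number on the result. A minimal sketch that compares the sizes of the original and compressed files; the helpers are not part of PDF Tricks, and their names are made up for this example.

```python
import os

def reduction(original_bytes, compressed_bytes):
    """Return the fraction of the original size that compression saved.

    0.25 means the compressed file is 25% smaller; zero or a negative
    value means compression did not help.
    """
    return (original_bytes - compressed_bytes) / original_bytes

def file_reduction(original_path, compressed_path):
    """Same as reduction(), but reads the sizes from two files on disk."""
    return reduction(os.path.getsize(original_path),
                     os.path.getsize(compressed_path))

# Example with made-up sizes: 1000 bytes shrunk to 750 bytes.
print(f"{reduction(1000, 750):.0%}")
```

With real files you would call file_reduction on the original and on the file with the _compressed suffix, once per compression level, and keep the level that saves the most without visible artifacts.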

Keep in mind that figures, plans or characters that need to be very precise (like CJK characters or other writing systems you aren’t very familiar with but will have to read) may not take compression well and often become blurry as the scan resolution decreases and the compression level gets higher.

To share your experience or give feedback on this guide (for example, if anything is amiss or grossly imprecise in the instructions), write to gokigenⒶⓉdisroot.org. Any suggestions on methods, language, software and settings are welcome.