Help:DjVu files - Wikisource, the free online library
DjVu files
This page explains how to create, use, and upload files in the DjVu format, which groups scanned images into a single container format.
This page in a nutshell: Please use the Internet Archive (see the section "The Internet Archive") and the IA Upload tool to get and produce DjVu files; this is the recommended and simplest method in most cases.
Specific trials (mostly obsoleted by new tools)
OCR with Tesseract (obsoleted by ocrodjvu, see below)
See also
Wikimedia Commons:Commons:DjVu
Image extraction
DjVu files generally have very heavy image compression that is optimised for text. This results in severe damage to image quality for illustrations and photographs. In general, it is better not to extract images from DjVu files and instead use more original files, for example, the page JP2s at the Internet Archive.
Help:Image extraction contains more guidance.
Conversion
Images to DjVu
Windows
DjvuToy is a program that provides several functions:
make a DjVu file
merge DjVu files
split DjVu files
edit DjVu files
generate a bundled file
export from DjVu to another file format
extract text from DjVu
extract DjVu file structure information (e.g. OCR coordinates)
Images → virtual printer → DjVu
If the page scans are made available as a PDF file, e.g. Google Books scans, then this can be directly converted into a DjVu file using one of the following:
The free Any2DjVu online service; this can also OCR the text and embed it in the .djvu file.
The freeware Pdf To Djvu GUI. Note that this requires the installation of the Cygwin environment as a prerequisite to its own installation.
The freeware Djvu-Spec Pdf 2 Djvu Converter, a command-line tool with a GUI for Windows, from the djvu-spec.narod.ru software page. This tool offers many settings to change the quality and size of the resulting DjVu file.
The free software command-line tool pdf2djvu (available in repositories, also for Linux), which is usually as simple as
pdf2djvu -o output.djvu input.pdf
There's also a GUI available.
If you need to crop the PDF document, you can use pdfcrop.pl (see below) for black margins or the freeware Govert's PDF Cropper for white margins (it requires Ghostscript and .NET 2.0).
If the scanned images are made available as individual images, then the easiest option is to print them to a PDF document via one of the many "virtual printer" tools, such as the free PDFCreator; then convert the PDF document to DjVu as described above.
Note that there are many other options for converting pages to .djvu. One could convert using PostScript or multipage TIFF as the intermediate format, rather than PDF, but this would of course require different conversion tools. It is also possible to convert from .pdf or .ps to .djvu with the DjVuLibre software and its GSDjVu plug-in, but due to licensing restrictions installing the plug-in is a fairly intricate process that involves compiling a patched version of Ghostscript.
Another free Windows tool that can come in handy for the images-to-pdf-to-djvu process is ConcatPDF, a GUI tool that permits easy splitting and merging of PDF files. This tool can also be used online. An example of how ConcatPDF might be used: if a 100-page document has previously been scanned and converted to .djvu and the single page #42 needs to be re-scanned, ConcatPDF would allow that one page to be inserted into the intermediate .pdf file without tracking down the other page images and re-composing the entire document. Installing ConcatPDF version 1.1 requires that the free Microsoft program libraries Microsoft .NET Framework Version 1 and the corresponding Visual J# .NET Redistributable Package be installed beforehand.
Images directly to DjVu
However, a far higher quality document can be achieved using the DjVuLibre software library. JPEG images can be directly encoded into individual DjVu pages using the c44 encoder. Images in lossless formats such as PNG should be converted to PPM (for colour scans) or PGM (for greyscale scans), then encoded using c44. For bitonal (i.e. black-and-white) scans, such as most page text images, a smaller DjVu file can be obtained by converting the page images to the monochrome PBM format, then encoding to DjVu using the cjb2 encoder. All of these image format conversions can be performed by the free ImageMagick library (in batch, with mogrify). Individual DjVu pages can be aggregated into a multi-page DjVu using the djvm program; this program can also be used to insert or delete pages from a DjVu file.
An important caveat of this process is that high quality scans come at the cost of larger files, and there is currently a 100 MB limit on uploads to Commons. The size can be substantially reduced by applying foreground/background separation with didjvu and/or minidjvu.
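The choice of encoder described in this section can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name encode_page and the override variables CJB2 and C44 are hypothetical (they exist so the function can be tried without running the real encoders), file extensions are assumed to be lowercase, and the cjb2 and c44 programs from DjVuLibre are assumed to be installed.

```shell
# Encode one page image into a single-page DjVu file next to it.
# Bitonal PBM goes to cjb2; JPEG, PPM and PGM go to c44.
# CJB2 and C44 are hypothetical override hooks, not DjVuLibre features.
encode_page() {
    src=$1
    out="${src%.*}.djvu"          # page-001.pbm -> page-001.djvu
    case "$src" in
        *.pbm)
            ${CJB2:-cjb2} "$src" "$out" ;;
        *.jpg|*.jpeg|*.ppm|*.pgm)
            ${C44:-c44} "$src" "$out" ;;
        *)
            # PNG, TIFF etc. must first be converted, e.g. with ImageMagick
            echo "convert $src to PBM, PGM or PPM first" >&2
            return 1 ;;
    esac
}
```

Calling encode_page on each file of a directory and then collecting the resulting pages with djvm gives the multi-page document.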
Scripting DjVuLibre
This script allows you to take a whole directory of image files (JPG, PNG, GIF, TIFF, and any file that ImageMagick can convert to PPM) and convert and collate them automatically into a DjVu file. Currently this script is for Windows, but it can be easily converted for Linux. To use it, you will need Python, ImageMagick and DjVuLibre.
Linux
See also: User:GrafZahl/How to digitalise works for Wikisource
Method 0 - converting graphic files with foreground/background separation
Just use didjvu. You may consider preprocessing the scans with Scan Tailor.
Method 1 - page at a time with DjVuLibre
You need the djvu software, which includes a viewer and some tools for creating and handling DjVu files. You will probably also need the ImageMagick software for converting scans from one format to another:
The tool cjb2 is used to create a DjVu file from a (bitonal) PBM or TIFF file.
The tool c44 is used to create a DjVu file from a PNM or JPEG file. This handles colour images, but the compression is lower.
Therefore you need to convert your scans if they are not already in one of these formats.
Conversion to intermediate format
The DjVu encoders cannot use JP2 or PNG as an input format, so next you need to convert to a format that they accept. Options include PBM (turns all pixels black or white, no shades of grey); PGM (greyscale, lossless); or JPEG (lossy compression optimized for photographs).
Conversion from PNG format to PBM format with the convert tool from ImageMagick:
convert filename-000.png filename-000.pbm
Depending on the quality of the original scans, you may find it useful to process them with the unpaper utility, which deletes black borders around the pages and aligns the scanned text squarely on the page. Unpaper is also capable of extracting two separate page images where facing pages of a book have been scanned into a single image. Other utilities are mkbitmap and pdfcrop.pl (Perl-based free software; on Ubuntu it requires Ghostscript and texlive-extra-utils; it uses the BoundingBox and can crop a whole multipage PDF document in a single pass). PDFCrop (yet another tool) deletes white margins.
Conversion to DJVU page file
Creation of a DjVu file from a PBM file (this command will not work for PGM or JPEG):
cjb2 -clean filename-000.pbm filename-000.djvu
Creation of a DjVu file from a PGM or JPEG file:
c44 -dpi 300 p100.jpg p100.djvu
(In this example, the image is specified to use a resolution of 300 dpi. The -dpi argument may be left out; the default value is 100.)
Creating final DJVU document
Adding the DJVU file to the final document
djvm -i filename.djvu filename-000.djvu
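You need to repeat these steps for each page of the book, which is best done with a script. Example:
#!/bin/bash
# $1 and $2: first and last page numbers of the filename-NNN.png scans
for n in $(seq -f '%03g' "$1" "$2")
do
    i="filename-$n.png"
    j=$(basename "$i" .png)
    convert "$i" "$j.pbm"                 # PNG -> bitonal PBM
    cjb2 -clean "$j.pbm" "$j.djvu"        # PBM -> single-page DjVu
    if [ -f filename.djvu ]; then
        djvm -i filename.djvu "$j.djvu"   # append the page to the document
    else
        djvm -c filename.djvu "$j.djvu"   # first page: create the document
    fi
done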
There is also another way to add all the *.djvu parts into one:
djvm -c filename.djvu filename-000.djvu filename-001.djvu filename-002.djvu
See the following section for an automated process for multiple pages.
Method 2 - PDF to DjVu bash script
Use this script, which converts a PDF document (multiple or single page) into images, automatically crops them with ImageMagick, converts them to DjVu and bundles them.
This is very slow (a large PDF document can require days) but a little more efficient than the following method.
The resulting DjVu document is quite big and low-quality, probably because of poor font recognition, which may be fixed by newer versions of poppler (the library used): the version available in repositories is usually several months old.
You can also remove the pdftoppm part and use the script to convert multiple images directly into a multi-page DjVu document. If the images are not in PBM format, you can convert them with a single command using mogrify from ImageMagick.
Method 3 - pdf2djvu
Simply install the pdf2djvu tool from your repository to directly convert a PDF document (single or multiple pages) into DjVu.
If the document contains the results of OCR (as is the case e.g. with FineReader output), they are preserved in the DjVu document as the hidden text layer. Some other properties of the source document, including metadata, are also preserved. The quality and the size of the output depend primarily on the features of the source document but can also be controlled with several program parameters, such as the resolution of foreground and background.
The program can use several threads to speed up the conversion.
As of 2019, file size on Wikimedia Commons is less important than image quality (although PDFs around 1 GiB in size can have problems with thumbnails). The simplest way to increase quality is to change --bg-subsample (default 3, max 12) to 2 or 1 (best quality).
An example command may therefore be:
pdf2djvu -j0 --bg-subsample=1 -o output.djvu input.pdf
Note on cropping
With pdf2djvu, you need to crop the PDF directly before the conversion. On Linux this may be quite difficult. You could use ImageMagick convert -crop, but beware: with a big multi-page PDF document, this can take several GB of memory (the limit is 16 TB!) and bring your computer to a halt if you don't use the -limit area 1 option directly after -crop. This makes the conversion very slow.
When using ImageMagick, the resulting PDF document is increased in size and reduced in quality because of rasterisation.
See other crop tools above.
Method 4 - DjVuDigital
Use djvudigital, which like pdf2djvu converts PDF directly to DjVu.
There are licensing problems: because the GSDjVu library has a different license, you'll need to compile it yourself; the included utilities make this step quite easy, but still long (about 1 hour) and a bit annoying.
After that, however, you can convert a PDF document into DjVu with a single command (see the previous section for cropping). The conversion is slow (I find it will complete a 300-page PDF document in about 30-40 minutes). The resulting DjVu is of higher quality and lower file size compared to both the previous two methods.
Additionally, DjVuDigital can handle JPEG2000 (aka JPX) files embedded in PDF documents, which is a feature of many Google books. pdf2djvu, Any2Djvu and Internet Archive conversions all fail to convert these files, leaving blank pages in the output.
DjVuDigital has many advanced options to improve results, but they can be difficult to master.
In general, altering the --dpi option can give you a quick reduction in file size without too much fiddling.
Online ([almost] all systems)
Any2Djvu
Another method to convert the images to DjVu is to zip them and use the Any2Djvu site to create the DjVu file. Any2Djvu will extract the images in the zip and create an OCRed DjVu. OCR works only with English text.
Any2Djvu cannot handle huge files. Big files are best dealt with if you upload them by URL (e.g. by entering a link like ftp://ftp.bnf.fr/005/N0051165_PDF_1_-1DM.pdf). Conversion can take several hours. Any2Djvu will sometimes run out of memory on large or highly-detailed files and fail. It will also not convert "JPX" images embedded into PDF documents, which are common in Google Books scans.
The Internet Archive
See also: Help:Internet Archive
Another method is to upload a PDF document (or archive of image files) to the Internet Archive. You need to log in (don't use OpenID, it won't work).
This page in a nutshell: Just upload your scan as a single PDF file or a series of images compressed in a ZIP file with a filename ending in _images.zip. You'll get a nice PDF file with OCR! DjVu is usually available for older files as well.
Uploading
Click "Upload" at the top-right corner. The Flash upload (standard "Share" button) won't work with Firefox (use Opera or Internet Explorer instead [10]) or on Linux. You can use the standard JavaScript non-Flash method (although there's a file size limit of 2 GB with Firefox, but not with Chromium); FTP upload is deprecated because it's slower and crash-prone, but it is the only easy-to-learn option if you have to upload many files (which shouldn't be the case here).
OCR tricks
When the upload has been completed, the Internet Archive will start the "derive" work: OCR to create an XML document of detected text based on the uploaded PDF file, then conversion of that to a DjVu file with embedded text, creation of a plain text-only dump file, among others. [11] Don't forget to set the correct language in the metadata before starting the derive (which is run automatically after upload if there's something to derive), otherwise the OCR language will be set to English and results will be poor for works in any other language. It's not possible to set multiple OCR languages, but you can upload the same book twice, once with each language, to get two OCRs. [12]
The length of processing time depends on the size and complexity of your file, as well as the current Internet Archive backlog of conversion tasks. [13] You can check your progress in the queue here and more detailed information about jobs you submitted here (must be logged in).
The Internet Archive uses professional, proprietary, commercial ABBYY software [14] with quite good image and OCR output in many languages and fonts and an aggressive compression [15] which maintains a high quality of the final DjVu file.
However, the Internet Archive sometimes produces over-compressed DjVu files with poor quality. If this happens, you can often download a PDF document and convert manually.
You can reduce the resolution the derivation aims at, which is normally set automatically by some "guessing", via the fixed-ppi field, setting it to 300 (dpi) or lower to reduce sizes, processing time and (sometimes) errors.
Image formats
Book scans split into several TIFF, JPG or JP2 images (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive. [16]
It's usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you won't be able to emulate without a lot of effort.
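Packing page images for upload can be sketched as follows. This is a hedged illustration, not the Internet Archive's own tooling: the function name make_images_zip and the ZIP_CMD override variable are hypothetical, the accepted extensions (lowercase tif, jpg, jp2) are an assumption drawn from the advice on this page, and the zip program is assumed to be installed. The archive name follows the _images.zip convention described in this section's nutshell.

```shell
# Pack page images into NAME_images.zip, refusing unexpected extensions.
# ZIP_CMD is a hypothetical override hook used for dry runs.
make_images_zip() {
    name=$1
    shift
    for f in "$@"; do
        case "$f" in
            *.tif|*.jpg|*.jp2) ;;   # lowercase extensions only (assumption)
            *)
                echo "unexpected extension: $f" >&2
                return 1 ;;
        esac
    done
    ${ZIP_CMD:-zip} "${name}_images.zip" "$@"
}
```

For example, make_images_zip mybook page-*.jpg would produce mybook_images.zip in the current directory.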
Troubleshooting
If you have severe problems with your deriving process and you need admin intervention (tasks shown in red in your tasks list), ask for help at info@archive.org; they're usually amazingly helpful. General requests for help should be placed in the forums, though: don't bother them for nothing!
DjVu to text
OCR via Any2DjVu
The OCR option available at the free conversion service Any2DjVu does OCR the scanned image, but the resulting text is embedded within the .djvu file itself and must be extracted so it can be used on Wikisource.
One way to do this is to use the DjVuLibre software to extract the text, via a command like
djvused myfile.djvu -e 'print-pure-txt' > myfile.txt
or
djvutxt myfile.djvu > myfile-ocr.txt
JVbot can automatically upload the text layer of a DjVu file to the pages on Wikisource. For example, Robert the Bruce and the struggle for Scottish independence - 1909.
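Because Wikisource transcribes works page by page, it can be handy to extract the text layer one page at a time rather than as a single dump. The following is a minimal sketch under stated assumptions: the function name extract_page_texts, the output naming scheme and the DJVUTXT override variable are hypothetical, and it relies on the -page option of djvutxt from DjVuLibre.

```shell
# Write the text layer of each page to its own file: book-001.txt, book-002.txt, ...
# Arguments: the DjVu file and its number of pages.
# DJVUTXT is a hypothetical override hook used for dry runs.
extract_page_texts() {
    doc=$1
    pages=$2
    n=1
    while [ "$n" -le "$pages" ]; do
        out=$(printf '%s-%03d.txt' "${doc%.djvu}" "$n")
        ${DJVUTXT:-djvutxt} -page="$n" "$doc" "$out"
        n=$((n + 1))
    done
}
```

The page count can be taken from djvused, for instance extract_page_texts myfile.djvu "$(djvused myfile.djvu -e 'n')".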
OCR via the Internet Archive
See above: if you upload a DjVu file, the derive process will OCR it.
OCR with Tesseract
OCR can be done with Tesseract, free OCR software, and a script:
OCR with Tesseract: Perl script.
OCR with Tesseract (Python): slightly more user-friendly Python script, based on the Perl script.
OCR with Tesseract 3.x and other free OCR engines
Use ocrodjvu.
DjVu to Images
Linux
To extract images from a DjVu file, you can use ddjvu:
ddjvu -page=8 -format=tiff myfile.djvu myfile.tif
If you have converted all the pages (by omitting the -page option), you can split the multi-page TIFF into single-page PNG files (or any other format):
convert -limit area 1 myfile.tif myfile.png
Extract all pages to single-page TIFF files with 80% quality:
ddjvu -format=tiff -eachpage -quality=80 myfile.djvu myfile-%03d.tiff
Manipulating
There's some advice elsewhere about manipulating DjVu files, or images to be used to generate DjVu:
#Method 1 - page at a time with DjVuLibre (second bullet point)
User:GrafZahl/How to digitalise works for Wikisource/pbmextract.c
Help:DjVu files/other pages
Help:DjVu files/other pages#double pages in djvu
fr:Aide:Créer_un_fichier_DjVu/Linux#Script_Bash_de_conversion_PDF.E2.86.92Djvu
Splitting DjVu files
DjVu documents come in two flavours: bundled and unbundled (indirect); the latter format stores every page in a separate file. The comment below made by the original author concerns only bundled documents, which should be avoided.
Large works cannot be uploaded to Wikimedia servers, which have a 100 MB upload limit. To split the DjVu, use DjVuLibre "Save as", and specify a page range which will produce a file small enough to be uploaded. Some trial and error may be necessary.
The easiest way to split DjVu files from the command line is with djvmcvt:
mkdir mydoc/ &&
djvmcvt -i 'mydoc.djvu' 'mydoc/' 'new-mydoc-index.djvu'
Alternatively, djvused can be used from the command line:
djvused myfile.djvu -e 'select 10; save-page-with p10.djvu'
This can be done for every page. To find the number of pages in the file:
djvused myfile.djvu -e 'n'
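The per-page djvused command and djvm -c can be combined to cut a page range out of a large document from the command line. This is a rough sketch, not a tested recipe for every file: the function name split_range, the pNNNN.djvu naming of the intermediate pages and the DJVUSED and DJVM override variables are hypothetical, and the intermediate page files are left in the current directory.

```shell
# Save pages FIRST..LAST of DOC as single-page files, then bundle them into OUT.
# DJVUSED and DJVM are hypothetical override hooks used for dry runs.
split_range() {
    doc=$1
    first=$2
    last=$3
    out=$4
    pages=""
    n=$first
    while [ "$n" -le "$last" ]; do
        page=$(printf 'p%04d.djvu' "$n")
        ${DJVUSED:-djvused} "$doc" -e "select $n; save-page-with $page"
        pages="$pages $page"
        n=$((n + 1))
    done
    # $pages is left unquoted on purpose so that each page is a separate argument
    ${DJVM:-djvm} -c "$out" $pages
}
```

For example, split_range myfile.djvu 1 200 myfile-part1.djvu would write the first 200 pages to a new file; the right range size still has to be found by trial and error against the upload limit.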
Removing a copyright page
Many of the already-created DjVu files available at archive.org and other sites have the Google copyright page attached to the front of the document. Wikimedia policy, based on an analysis of the underlying law, does not accept that copyright is established on a public domain work simply by scanning or copying it or taking a two-dimensional photograph that faithfully represents its subject. See Wikimedia Commons for more information about scans, artwork and the position of the WMF.
Such copyright pages and other extraneous material can be removed with DjVuLibre, an open source program maintained by the inventors of DjVu under the GNU General Public License. Binaries are available for Windows, Mac, Linux, Solaris, and IRIX. It includes djvm.exe, which is run as a command-line utility. If you cannot figure out how to do this, you can message Mkoyle (talk), and he will do it for your file and email the file to you for upload. The command line to delete (-d) the first page (1) is as follows:
djvm -d filename.djvu 1
Inserting new pages (e.g. a placeholder)
If a DjVu file is missing pages, you can insert placeholders, so that if the pages are found and inserted later, existing pages won't need to be moved. You can use File:Generic placeholder page.djvu for the placeholder. Pass the position at which the placeholder should be inserted as the last argument:
djvm -i main_document.djvu placeholder_file.djvu page_number
Note: work backwards from the last missing page in the file, to avoid having to recalculate the page numbers as you insert pages.
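The advice to work backwards can be automated by sorting the missing page numbers in descending order before inserting. The following is a small sketch under stated assumptions: the function name insert_placeholders and the DJVM override variable are hypothetical, and it relies on the optional page-number argument of djvm -i.

```shell
# Insert a placeholder page at each given position, highest position first,
# so that earlier insertions do not shift the positions still to be processed.
# DJVM is a hypothetical override hook used for dry runs.
insert_placeholders() {
    doc=$1
    placeholder=$2
    shift 2
    for n in $(printf '%s\n' "$@" | sort -rn); do
        ${DJVM:-djvm} -i "$doc" "$placeholder" "$n"
    done
}
```

For example, insert_placeholders main_document.djvu placeholder_file.djvu 12 45 7 inserts at position 45 first, then 12, then 7.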
Realigning shifted OCR
It often happens that the text layers of some pages of a DjVu file are invalid; the way that MediaWiki gets the DjVu text layer then causes the text of all pages after the invalid one to be shifted towards the beginning of the file, which makes it useless. To solve this, first identify the invalid page. You can do that with
djvused file.djvu -e "output-all" > file.dsed
If the OCR is shifted, this should output an error. Look at file.dsed: the last page number (indicated with # page) is the last valid page. The invalid page is the one after it.
To fix this issue, remove the text of the invalid page, like so:
djvused file.djvu -e "select [invalid page number]; remove-txt; save"
(This will change file.djvu.) The OCR should now be realigned (check with another output-all; if it still produces an error, rinse and repeat).
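Finding the invalid page can be scripted by reading the last "# page" marker from the .dsed file. This is a hedged sketch: the function names last_valid_page and remove_first_invalid_text and the DJVUSED override variable are hypothetical, and it assumes that every page in the output-all dump is introduced by a line containing "# page N".

```shell
# Print the number of the last page that appears in a truncated output-all dump
# (0 if no "# page N" marker is found).
last_valid_page() {
    last=$(sed -n 's/.*# page \([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo "${last:-0}"
}

# Remove the text layer of the page following the last valid one.
# This modifies the DjVu file in place, so keep a backup.
# DJVUSED is a hypothetical override hook used for dry runs.
remove_first_invalid_text() {
    doc=$1
    dsed=$2
    bad=$(( $(last_valid_page "$dsed") + 1 ))
    ${DJVUSED:-djvused} "$doc" -e "select $bad; remove-txt; save"
}
```

After running remove_first_invalid_text file.djvu file.dsed, regenerate file.dsed with output-all and repeat until no error is reported.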
See also: phab:T219376
Displaying a particular page
The [[File:...]] link tag accepts a named parameter "page" so that, for example, this wiki code displays the image of page 164 of the file Emily Dickinson Poems (1890).djvu on the right, 150 pixels wide (the rear cover of the book, containing no text):
[[File:Emily Dickinson Poems (1890).djvu|right|150px|page=164]]
The page image can be displayed in the book's Wikisource main space, as with Personal Recollections of Joan of Arc/Book I/Chapter 2, using:
[[File:Personal_Recollections_of_Joan_of_Arc.djvu|page=27|right|thumbnail|200px|THE FAIRY TREE]]
Notes
1. Example: this 205 MB PDF document of a 1691 book from Gallica is converted by the pdf2djvu.sh script into a barely readable 382.4 MB DjVu, by djvudigital into a somewhat more readable 316.7 MB DjVu, and by the Internet Archive into a better quality 51.3 MB DjVu.
2. The defaults are sensible for most cases: --dpi=300 (but requires the metadata about size to be correct) and --bg-slices=72+11+10+10, which the c44 manual recommends for higher quality photography: «74+13+10, for instance, would be appropriate for compressing a photographic image with three progressive refinements. More quality and more refinements can be obtained with option -slice 72+11+10+10.» (Checked in DjVuLibre 3.5.27.)
3. From the linked source: «BSF (Background Subsample Factor): The ratio of the foreground layer geometrical storage size (in pixels) to the background one (in DjVu). Ranges from 1 to 12. E.g. the background layer may be stored in a DjVu file downsampled to 1..12 times. [...] I recommend you to play only with BSF and not to touch the Background quality (because the latter almost doesn't make sense).»
4. For instance, this 55 MB PDF document, when cropped with ImageMagick, gives a 100 MB PDF document which, converted with pdf2djvu, gives an 86.2 MB DjVu, while the Internet Archive directly gives a 10.1 MB DjVu of better quality.
5. Man page.
6. A comparison here.
7. Complete instructions here.
8. Moreover, they can require the proprietary msepdjvu library instead of csepdjvu: see "superhero pres: is it independently reproducible?".
9. See forums: "Authentication error; not a valid OpenID" and "Login problems when I click "Share"".
10. See forum.
11. If your original PDF has no embedded text layer, the derive process will automatically create a second, text-rich PDF for you by applying the same OCR-generated text.
Please note, however, that if your PDF comes from Google Books and has a first-page disclaimer notice, the derive process will detect the disclaimer page's hidden text layer, assume that the rest of the pages in the PDF also have embedded hidden text layers (when they never do) and skip the automatic creation of the second PDF file altogether. Keeping the disclaimer page but stripping it of all hidden text is the optimal approach, for reasons having to do with the complimentary creation of a DjVu file at the same time; swapping it with a suitable null or blank page will do just as well, and the last resort is deletion of the disclaimer page.
12. See forum.
13. Example: Vocabolario degli accademici della Crusca, 1691, took 5.1 days to derive.
14. Version 9.0 since 2013.
15. In the example, the size is 1/6 of the djvudigital output.
16. FAQ; documentation of the format to use. Remember: put extensions in lowercase everywhere, use tif with a single f, put the ppi value of the images in the metadata. If your archive of images is not recognized as such, it may help to edit the metadata and set its format as "Single Page Processed TIFF ZIP" (even if it's a TAR) and so on. You should probably try the _images.zip archive format first.
See also
Wikisource:DjVu vs. PDF
Help:Beginner's guide to Index: files
List of DjVu programs
Help:Proofread
Convert djvu to pdf on a mac
DjVu Resources and Online Converter