-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In article <4033ab00$0$5691$45beb828@newscene.com>,
Michael Maginnis  <michaelmaginnis@yahoo.com> wrote:
>wseehorn@earthlink.net wrote:
>> I do scanning of PDF in the WinDoze world. In that version of Acrobat
>> if you use PDF distiller or PDF writer (from the Print menu) & select
>> screen resolution you cut down the file size considerably.
>
>Still can't get the file size per page down below 1 MB or so. Too large 
>for this project.

What bit depth are you using?  24 bpp is severe overkill.  8 bpp (grayscale)
is usually excessive as well.  For most printed matter (halftone instead of
continuous-tone), 1 bpp at 300-600 dpi will do, though you might have to
tweak the thresholds a bit.  What I've done in the past is scan grayscale
and use netpbm to convert to black-and-white.  Multiple pages can then be
stuffed into a single TIFF file with lossless compression of some sort (such
as group 3 fax).  A full page at 300 dpi and 1 bpp is about 1 MB
uncompressed, so this should work out to well under that when compressed.
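
A minimal sketch in Python of the arithmetic and the thresholding step,
assuming a US Letter page (8.5 x 11 in) and a fixed cutoff of 128.
Neither figure comes from the scans discussed here, the function names
are made up for illustration, and netpbm does the real conversion:

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


def threshold_row(gray_row, cutoff=128):
    """Map 8-bit gray samples to 1 (black ink) or 0 (white paper).

    The cutoff is the knob to tweak; 128 is only a starting assumption.
    """
    return [1 if g < cutoff else 0 for g in gray_row]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        chunk = chunk + [0] * (8 - len(chunk))  # pad the last byte with white
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    for bpp in (24, 8, 1):
        print(bpp, "bpp:", raw_page_bytes(8.5, 11, 300, bpp), "bytes")
```

At 1 bpp the raw page is roughly 1 MB before any compression, against
about 25 MB at 24 bpp, which is why bit depth is the first thing to check.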

  _/_   Scott Alfter (address in header doesn't receive mail)
 / v \  send mail to $firstname@$lastname.us
(IIGS(  http://alfter.us/            Top-posting!
 \_^_/  rm -rf /bin/laden            >What's the most annoying thing on Usenet?

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Linux)

iD8DBQFANQneVgTKos01OwkRAuOfAKDdF+cH7xkYHToRaS00invVBIR7LgCfURXg
wbl3ULaAta2aWniRqlQu1l8=
=Epnl
-----END PGP SIGNATURE-----


Mike Maginnis wrote:

> On Fri, 15 Apr 2005 12:19:20 -0700, "Michael J. Mahon"
> <mjmahon@aol.com> wrote:
> 
> 
>>Mike Maginnis wrote:
> 
> 
> --<snip>--
> 
> 
>>A searchable PDF sounds like Nirvana--but experience indicates that
>>OCR'd documents inevitably have errors that make it through the
>>proofing process--and the process is much more difficult as well
>>as error-prone.
>>
>>Scanning is a fine solution.
>>
>>You didn't mention what tools you have available for compression.
>>I find that Photoshop 7's "Save for Web..." option is very versatile
>>and effective.
>>
>>For text, I find pre-processing to increase contrast and drop out
>>the background "white" noise, followed by .gif compression with 4
>>levels (black, 2 grays, and white), to be excellent.
>>
>>If you can still see the paper grain in the white background,
>>then you cannot achieve good compression--noise doesn't compress!
>>
>>-michael
>>
>>8-voice music synthesizer using NadaNet networking!
>>Home page:  http://members.aol.com/MJMahon/
> 
> 
> I've got Photoshop CS, but I've not played with image manipulation
> levels or compression, so I'm in over my head with that.  I made the
> unfortunate decision to purchase a SnapScan One-Touch scanner a while
> back, so I'm limited in my image acquiring methods; the One-Touch
> won't talk to anything but its own rather rudimentary software.

Good TWAIN scanners are down in the $50 range, so that shouldn't
be too much of a problem.  Or if your scanner can save in .TIF or
any other "standard" format, that could work, too.

Just load the image into Photoshop, adjust levels (primarily by
dropping the white level below the "grayest" parts of the paper,
and raising the black level above the "grayest" parts of the black
ink).  Leave the mid-tones, since they provide useful information
on letter shape.

Then in the File menu, select "Save for Web..." and select GIF format
with, say, 4 grayscale values (so a couple of mid-tones are preserved).
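
To see why the levels step matters, here is a rough Python sketch of the
same idea outside Photoshop.  Everything in it is an assumption made for
illustration: the synthetic "scan", the black and white points (70 and
190), and the use of zlib (DEFLATE) as a stand-in for GIF's LZW
compression.  It is not what Photoshop does internally.

```python
import random
import zlib


def apply_levels(gray, black_point, white_point):
    """Clip samples at or below black_point to 0 and at or above
    white_point to 255, stretching the mid-tones linearly in between."""
    span = white_point - black_point
    out = []
    for g in gray:
        if g <= black_point:
            out.append(0)
        elif g >= white_point:
            out.append(255)
        else:
            out.append((g - black_point) * 255 // span)
    return out


def quantize(gray, levels=4):
    """Snap each sample to the nearest of `levels` evenly spaced grays."""
    step = 255 / (levels - 1)
    return [round(round(g / step) * step) for g in gray]


def fake_scan(n_pixels, seed=0):
    """Synthetic page: noisy light paper with runs of noisy dark ink."""
    rng = random.Random(seed)
    pixels = []
    for i in range(n_pixels):
        base = 40 if (i // 50) % 10 == 0 else 225  # about 10% ink
        pixels.append(max(0, min(255, base + rng.randint(-20, 20))))
    return pixels


if __name__ == "__main__":
    scan = fake_scan(100_000)
    cleaned = quantize(apply_levels(scan, black_point=70, white_point=190))
    print("noisy:  ", len(zlib.compress(bytes(scan))), "bytes")
    print("cleaned:", len(zlib.compress(bytes(cleaned))), "bytes")
```

The cleaned version compresses far better because the paper and ink
noise is gone, while any samples between the two points still land on
one of the two middle grays.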

You will see a preview of the compressed picture with its compressed
size, and you can play with the parameters to see what works best.

It is a joy to use compared to "cut and try" methods.

Photoshop is one of the finest justifications for owning a computer,
so it's worthwhile to "break the ice".  ;-)

> Most of the errors in an OCR'd document could probably be caught with a
> spellchecker - the hard part comes with character-by-character
> proofing the sections of programming code.

You might think so, but most computer documentation is loaded with
peculiar abbreviations and words set off by special fonts.  OCR is
a real bear.  It's enough to give you a new appreciation of the
human eye-brain combination.  ;-)

-michael

8-voice music synthesizer using NadaNet networking!
Home page:  http://members.aol.com/MJMahon/


Mike Maginnis <maginnis@tarnover.org> writes:
> Don't know if this has been addressed before. I'm planning to re-scan
> the Computist magazines that I've already done so far, and get moving
> on finishing the rest sometime this decade.
> 
> Any thoughts on the best way to scan these?  I can't seem to get a
> decent image down to less than ~200K.  Anything smaller and the image
> incurs serious quality loss.

For pages that only contain text and line art, *please* use Group 4 fax
lossless compression.  This is natively supported in TIFF,
PostScript, and PDF files, and results in 30K-50K sizes for a typical
page (though more for pages with lots of small type or fine drawing
details).  Any good scanning software can do it.  Since it's lossless,
the quality is identical to the original scan.
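
As a rough illustration of why run-based coding suits text pages, here
is a Python sketch of plain run-length coding of one bilevel scan line.
This is NOT G4 itself: real G4 codes each line relative to the line
above it and uses variable-length codes, so it does much better than
this.  The function names are made up for the example.

```python
def run_lengths(row):
    """Encode a row of 0/1 pixels as alternating run lengths.

    The first run is always white (0); a row that begins with black
    gets a leading white run of length 0.
    """
    runs = []
    current = 0  # colour of the run being counted, starting with white
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    runs.append(count)
    return runs


def expand(runs):
    """Inverse of run_lengths: rebuild the row of 0/1 pixels."""
    row = []
    color = 0
    for n in runs:
        row.extend([color] * n)
        color ^= 1  # alternate white/black
    return row


if __name__ == "__main__":
    # One 2550-pixel scan line: mostly white margin with two black strokes.
    line = [0] * 1000 + [1] * 12 + [0] * 40 + [1] * 8 + [0] * 1490
    runs = run_lengths(line)
    print(len(line), "pixels ->", len(runs), "runs:", runs)
    assert expand(runs) == line  # lossless round trip
```

A line of text is mostly long white runs with short black ones, so a
handful of numbers replaces thousands of pixels, and nothing is lost.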

However, G4 compression is not good for photographs or other continuous-tone
images, because that's not what it's designed for.  Pages that have such
images (other than relatively unimportant incidental ones) should be
compressed with JPEG compression, which is what it sounds like you've
been doing.  Unfortunately JPEG compression blurs text and line art.

Ideally, for pages that mix text and line art with continuous-tone
images, you'd use a graphic editor (e.g., Gimp or Photoshop) to separate
the continuous-tone images into a separate layer, apply JPEG compression
to those only, and G4 to the text and line art.  Then you can composite
the result in either PostScript or PDF.

That process yields really nice looking results, but takes a lot of
work.
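
If you wanted to automate a first pass at the separation, one
hypothetical heuristic (my assumption, not a tested tool, and no
substitute for doing it by eye in an editor) is to look at how many
mid-gray pixels a region of the grayscale scan contains.  Text and line
art are almost all near-black or near-white; photographs are not.  The
thresholds and names below are invented for the sketch:

```python
def midtone_fraction(tile, low=64, high=192):
    """Fraction of 8-bit gray samples that are neither near-black
    nor near-white.  `tile` is a flat list of samples from one region."""
    if not tile:
        return 0.0
    mid = sum(1 for g in tile if low <= g <= high)
    return mid / len(tile)


def choose_codec(tile, max_midtone=0.25):
    """Return 'g4' for text/line-art-like regions and 'jpeg' for
    photo-like ones.  The 0.25 cutoff is an arbitrary starting point."""
    return "jpeg" if midtone_fraction(tile) > max_midtone else "g4"


if __name__ == "__main__":
    text_tile = [255] * 90 + [0] * 10      # white paper, a little ink
    photo_tile = list(range(256))          # smooth gradient
    print("text  ->", choose_codec(text_tile))
    print("photo ->", choose_codec(photo_tile))
```

Anti-aliased or blurry text will push the mid-tone count up, so expect
to tune the cutoff and to check the borderline regions by hand.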

Eric