Ripping books from UMDL Text: Leta S. Hollingworth's Gifted children, their nature and nurture

Jun 13, 2013

http://quod.lib.umich.edu/g/genpub/AGE2118.0001.001?view=toc

This book keeps coming up in conversations about very smart people, so it seems worth reading. It is really old and should obviously be out of copyright (thanks, Disney!), but it possibly isn't, and in any case I couldn't find a usable PDF.

I did, however, find the above. It seems to lack a download-all function, and downloading all 398 page images manually is too much of a hassle. The images also lack OCR. So I set out to write a Python script to download them.

First I had to find a function to actually download files. Until now I had only read the source of web pages and worked with that. This time I needed to save a file (a picture) repeatedly.

Googling gave me: urllib.urlretrieve
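
For reference, here is a minimal sketch of what that call looks like. It is not the code from downloader.py. It assumes Python 3, where the function lives at urllib.request.urlretrieve (urllib.urlretrieve is the Python 2 name), and save_file is just a name I made up for illustration.

```python
from urllib.request import urlretrieve  # Python 2: from urllib import urlretrieve


def save_file(url, filename):
    """Download url and write it to filename; return the local path."""
    # urlretrieve returns a (local_path, headers) tuple.
    local_path, _headers = urlretrieve(url, filename)
    return local_path
```

Calling save_file with an image URL and a target filename writes the image to disk in one step.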

So far so good, right? It should be easy to just write a for loop and get it over with.

Not entirely. It turns out that the pictures are only stored temporarily in the website cache after one visits the page associated with the picture. So I had to make the script load the page before getting the picture. That slows it down a bit, but it's not too much trouble.
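
The page-first trick can be sketched roughly like this. Again, this is an illustration under assumptions, not the original script: the function name is hypothetical, and the page and image URLs are passed in as parameters because the exact URL patterns of the site are not given here.

```python
from urllib.request import urlopen, urlretrieve


def fetch_image_via_page(page_url, image_url, filename):
    """Request the viewer page first, then save the image to filename."""
    # Loading the page is what prompts the server to put the image in its cache.
    with urlopen(page_url) as response:
        response.read()
    # Only now is the image expected to be available for download.
    urlretrieve(image_url, filename)
    return filename
```

The page content itself is thrown away; the request is only made for its side effect on the server.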

Next problem: sometimes a picture wasn't downloaded correctly for some reason. The file size, however, was a useful proxy for whether a download had succeeded, so I had to find a way to get the file size. Google gave me os.stat (import os). Problem solved.
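
A size check with a retry might look something like the sketch below. The 1000-byte threshold, the retry count, and the function names are my assumptions for illustration; the post does not say which values the script used.

```python
import os
from urllib.request import urlretrieve

MIN_BYTES = 1000  # assumed threshold; not taken from the original script


def looks_complete(filename, min_bytes=MIN_BYTES):
    """Treat a file as a good download if it exists and is big enough."""
    return os.path.exists(filename) and os.stat(filename).st_size >= min_bytes


def download_with_retry(url, filename, attempts=3, min_bytes=MIN_BYTES):
    """Download url up to `attempts` times until the file size looks right."""
    for _ in range(attempts):
        urlretrieve(url, filename)
        if looks_complete(filename, min_bytes):
            return True
    return False
```

A suspiciously tiny file usually means the server sent an error page or an empty response instead of the picture, which is why size works as a cheap check.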

Even after that, some pictures were still not being downloaded correctly. Weird. After debugging, it turned out that some of the pictures were not .gif but .jpg files, located in a slightly different place. So I had to modify the code to handle that as well.
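
One way to handle the two variants is to try a list of candidate URLs in order and keep the first one that produces a plausible file. This is a sketch, not the original code: the function name is hypothetical, the candidate URLs are supplied by the caller, each URL is assumed to end in its file extension, and the size threshold is the same assumed value as before.

```python
import os
from urllib.error import URLError
from urllib.request import urlretrieve


def download_first_available(candidate_urls, basename, min_bytes=1000):
    """Try each URL in order; return the saved filename, or None if all fail."""
    for url in candidate_urls:
        extension = os.path.splitext(url)[1]  # e.g. ".gif" or ".jpg"
        filename = basename + extension
        try:
            urlretrieve(url, filename)
        except URLError:
            # This variant does not exist (or the request failed); try the next.
            continue
        if os.stat(filename).st_size >= min_bytes:
            return filename
        os.remove(filename)  # too small to be a real picture; discard it
    return None
```

For each page, the caller would pass the .gif location first and the .jpg location second, so the common case still costs only one request.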

Finally, it worked for all the pictures.

I OCR'd it with ABBYY FineReader (best OCR on the market AFAIK).

Enjoy:

GIFTED CHILDREN THEIR NATURE AND NURTURE - Leta S. Hollingworth

The Python code is here: downloader.py

Just Emil Kirkegaard Things