Proposed method for estimating how much a person has read using an electronic library

Mar 01, 2012

Introduction

Many studies show that self-assessment is hard and often biased in a positive direction, i.e., people tend to overestimate themselves. Therefore, if one wants an accurate self-assessment, one needs either to adjust for the bias or to develop some 'objective' method for measuring the attribute in question. In the course of my own self-assessment I have come across the problem of assessing how much I have read, counted in books and articles (journal, Wikipedia, and others). There are various methods for doing this, which I discuss in this essay.

Measurement of books and files in a folder

Measuring number of books, information in those books etc.

For books, the method I have developed is simply a spreadsheet listing all the books I have read, with their publication dates, authors and some keywords. This provides an imperfect but fairly good measure of how many books I have read, and therefore also of how much information I have acquired from books. As a method for estimating total information acquired from books, it suffers from a number of problems: 1) one could have forgotten to add a particular book to the list, either because one forgot to add it after reading it, or because one read it before one started keeping the list and never added it later; 2) it does not currently include the length of each book; 3) even if it did include the length in pages, that would be an imperfect measure, because the amount of text per page differs between books; 4) even if one normalized the page length to some set number of words or symbols (as is customary at many universities), material with longer words may be more informative than material with shorter words and the same number of symbols; 5) which parts of the book should be counted? The easiest answer is “all of it”, but that would favor books with huge lists of references, although this is perhaps not a bad thing; 6) which books should be included on the list? I currently include only non-fiction books.

Supposing that one tries really hard to recall all the books one has read, and remembers to note each new book upon finishing it, the first problem will not be serious unless one starts doing this late in life.

I propose that one counts all non-fiction books and, for each book, counts all of its symbols and pages. This requires a technical solution: a program that, to begin with, can parse .pdf files. One might as well add support for .txt and similar easy-to-parse formats, because of their use on e-book readers.
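A minimal sketch of the counting step, in Python, might look as follows. It handles only .txt files; .pdf support would need a text-extraction library and is left out here. The function names and the choice to count a "symbol" as any non-whitespace character are my assumptions, not a settled definition.

```python
from pathlib import Path

# Assumption: only plain-text files are measured in this sketch.
# PDF files would need a separate text-extraction step, not shown here.
TEXT_SUFFIXES = {".txt"}


def measure_text(text: str) -> dict:
    """Count symbols (non-whitespace characters) and words in a text."""
    words = text.split()
    return {"symbols": sum(len(w) for w in words), "words": len(words)}


def measure_library(folder: Path) -> dict:
    """Measure every supported file under `folder`, keyed by file path."""
    results = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            # Undecodable bytes are replaced rather than aborting the run.
            text = path.read_text(encoding="utf-8", errors="replace")
            results[str(path)] = measure_text(text)
    return results
```

Calling measure_library(Path("reading")) on a folder (the folder name is hypothetical) would give one entry per .txt file, each with its symbol and word counts.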

Measuring other files on the computer

Next up is the task of measuring how much information one has acquired from sources other than books. To begin with, I started noting every journal article that I could remember having read, including marking which of the articles in the reading folder on my computer I had read. After that, I would have needed to remember to note each new article as I finished it. This proved to be too much work, so I stopped.

I think this can be solved in a way similar to the first method. First, one indexes all the files in the electronic library. At a minimum this means noting the number of files and the length of each, but it should also include the title of each work, which can be extracted from the file itself (this can be tricky), from the file name, or from some combination of the two. Then, because the electronic library might (and probably will) include files that one has not read, one needs some way of recording in the system which works one has read and which one has not. If the number of files is very large, this becomes a practical problem (it is for me). Sampling solves this problem. The program selects a sample from the entire electronic library that is large enough to be representative but small enough to be manageable, and the user marks which of the sampled files he has read (ideally also marking partially read files with a percentage). The program then uses the sample to estimate how much of the total electronic library the person has read.
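The sampling step could be sketched roughly as below. This is only one way to do it: it assumes the library index is a mapping from file name to length in symbols, that the user's answers are fractions between 0 and 1, and that the share read in the sample (weighted by length) carries over to the whole library. The function names are hypothetical.

```python
import random


def draw_sample(files, sample_size, seed=None):
    """Pick a simple random sample of file names, without replacement."""
    rng = random.Random(seed)
    names = list(files)
    return rng.sample(names, min(sample_size, len(names)))


def estimate_symbols_read(library, read_fractions):
    """Scale the share read in the sample up to the whole library.

    library: dict mapping file name -> length in symbols (all files).
    read_fractions: dict mapping sampled file name -> fraction read (0 to 1).
    """
    sample_total = sum(library[name] for name in read_fractions)
    if sample_total == 0:
        raise ValueError("the sample contains no text")
    sample_read = sum(library[name] * frac
                      for name, frac in read_fractions.items())
    share_read = sample_read / sample_total
    return share_read * sum(library.values())
```

For example, if the sample holds 4000 symbols of which 2500 were read, and the whole library holds 10000 symbols, the estimate is 6250 symbols read. Weighting by length means a long file that was half read counts for more than a short one that was fully read.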

There is one problem with the electronic-library method compared with the list-of-books method: paper books that one has read but has no ebook version of are not included. There are, however, ways to solve this: 1) acquire an ebook version of the book, 2) enter the book manually in the program, using its page count as a guide to its length (since counting the symbols by hand would be impractical). (1) is by far the best method. (2) can work by combining the page count with data from the electronic library to estimate the probable number of symbols per page. One could do this by taking the library-wide average of symbols per page or, slightly more sophisticated, by averaging the symbols per page of only those files in the library that are of similar length.
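Method (2) could be sketched as follows. I assume here that each library entry is a (symbols, pages) pair, that the average is pooled (total symbols divided by total pages) rather than a mean of per-file ratios, and that "similar length" means a page count within some relative tolerance of the paper book's. All names and the tolerance parameter are hypothetical.

```python
def symbols_per_page(entries):
    """Pooled average: total symbols divided by total pages."""
    total_pages = sum(pages for _, pages in entries)
    if total_pages == 0:
        raise ValueError("no pages to average over")
    return sum(symbols for symbols, _ in entries) / total_pages


def estimate_paper_book(pages, entries, tolerance=None):
    """Estimate the symbol count of a paper book from its page count.

    entries: list of (symbols, pages) pairs from the electronic library.
    tolerance: if given, only use files whose page count lies within
    this relative distance of `pages` (e.g. 0.25 for 25%). Falls back
    to the whole library when no file is similar enough.
    """
    if tolerance is not None:
        similar = [(s, p) for s, p in entries
                   if abs(p - pages) <= tolerance * pages]
        if similar:
            entries = similar
    return pages * symbols_per_page(entries)
```

With a library of three files at 2000, 3000 and 2500 symbols per page over 100, 300 and 600 pages, a 250-page paper book is estimated at 650000 symbols using the whole library, or 750000 symbols using only the 300-page file as the similar one.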

Statistical analysis of the data

The program should offer various statistical tools for analyzing the final data. For instance, it should compute the average word length and the average length of files.
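The two statistics mentioned could be computed along these lines. This is a sketch that assumes the texts are already extracted into a mapping from file name to text; words are split on whitespace, so attached punctuation counts toward word length. The function names are hypothetical.

```python
from statistics import mean


def average_word_length(text):
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    return mean(len(w) for w in words) if words else 0.0


def summarize(texts):
    """Summary statistics for a dict mapping file name -> extracted text."""
    all_words = [w for text in texts.values() for w in text.split()]
    symbols_per_file = [sum(len(w) for w in text.split())
                        for text in texts.values()]
    return {
        "files": len(texts),
        "average_symbols_per_file":
            mean(symbols_per_file) if symbols_per_file else 0.0,
        "average_word_length":
            mean(len(w) for w in all_words) if all_words else 0.0,
    }
```

Note that the average word length is taken over all words in the library, so long files weigh more than short ones; averaging the per-file averages would be a different, equally defensible statistic.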

Looking on to non-electronic works

Paper articles

If one has used the method described above, then one should have a fairly good idea of how much one has read. However, there are still many other ways to read. An obvious example is articles that one has read on paper. This can be handled by: 1) acquiring electronic copies of the papers, 2) entering the papers manually. The first works, but it would still be a lot of work if one has read a large number of papers on paper. The second has the same problems as the manual entry of paper books described above for the electronic-library method.

Online articles

At other times one has read an article online and never saved it in the electronic library on one's computer, perhaps because it was short or uninteresting, because one was using someone else's computer, or some combination of these. However, it should be possible to measure online articles in much the same way as before. One can create a browser plug-in that monitors every article one looks at on, e.g., Wikipedia and records it in the database. Since one does not read every Wikipedia page one looks at in full, one should use the sampling method described earlier to determine how many of the Wikipedia pages in the database one has actually read and, for partially read pages, how much.

What about other sites? It would be a real hassle to write specialized code for every site that one reads. I suggest that one writes specialized code only for the sites one reads a lot (Wikipedia, etc.) and uses a manual system for sites that have no specialized code. This could be done with a simple text-marking tool: the user marks the text he has read, and the tool measures its length (in symbols and words) and sends that information to the database.
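The database side of such a text-marking tool could be sketched as below, using SQLite from the Python standard library. The browser part that captures the marked text is not shown. The table layout, the function names and the choice to count symbols as non-whitespace characters are all my assumptions.

```python
import sqlite3
from datetime import datetime, timezone


def open_database(path=":memory:"):
    """Open (or create) the reading database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "url TEXT NOT NULL, symbols INTEGER NOT NULL, "
        "words INTEGER NOT NULL, noted_at TEXT NOT NULL)"
    )
    return conn


def note_marked_text(conn, url, marked_text):
    """Measure the marked text and store its length in the database."""
    words = marked_text.split()
    symbols = sum(len(w) for w in words)  # non-whitespace characters
    noted_at = datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
                 (url, symbols, len(words), noted_at))
    conn.commit()
    return symbols, len(words)


def totals(conn):
    """Total symbols and words noted so far."""
    row = conn.execute(
        "SELECT COALESCE(SUM(symbols), 0), COALESCE(SUM(words), 0) "
        "FROM readings"
    ).fetchone()
    return row[0], row[1]
```

Only the lengths are stored, not the text itself, which keeps the database small; storing the URL makes it possible to spot pages that were noted twice.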

Just Emil Kirkegaard Things
