Ripping threads from able2know forums

Jan 18, 2013

So, I thought it would be a good idea to keep a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding. After spending a couple of hours, I now have a crawler that reads lines from a txt file and downloads the pages of each thread into the folders named in that file.
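
For reference, the list file is read as alternating lines: a folder name, then the thread URL that goes with it. A minimal Python 3 sketch of just that pairing step, without any downloading, might look like this. `parse_thread_list` and the example.com URLs are made up for illustration and are not part of the script; the script itself also trims one more character off each URL before appending page numbers.

```python
def parse_thread_list(lines):
    """Pair up alternating lines: folder name, then thread URL."""
    # Drop trailing linebreaks and ignore completely blank lines.
    cleaned = [line.rstrip("\r\n") for line in lines]
    cleaned = [line for line in cleaned if line.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of lines (folder, URL pairs)")
    # Even positions are folder names, odd positions are URLs.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Hypothetical file contents, as readlines() would return them.
sample = [
    "Interesting thread\n",
    "http://example.com/topic/123-1\n",
    "Another thread\n",
    "http://example.com/topic/456-1\n",
]
pairs = parse_thread_list(sample)
```

Raising an error on an odd number of lines is a choice made for this sketch; the script simply ends without crawling a folder line that has no URL after it.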

Code is here: Forum crawler.py

Or here:

import urllib2
import re
import os
def isodd(num):
    #true if num is odd
    return num % 2 == 1
#open data about what to rip and read all lines into a list
#(the with block closes the file again afterwards)
with open("Forum threads/threads.txt",'r') as file0:
    data0 = file0.readlines()
#line counter
number = 0
#lines alternate: even lines hold a folder name, odd lines hold a thread url
for line in data0:
    #if it is even, then set the folder name, and continue to the next line
    if not isodd(number):
        outputfolder = line.rstrip("\n")
        number = number+1
        continue
    #odd line: get thread url, remove the linebreak and one more trailing char
    threadurl = line.rstrip("\n")[:-1]
    number = number+1
    print "starting to crawl thread "+threadurl
    #create folder
    if not os.path.isdir("Forum threads/"+outputfolder):
        os.makedirs("Forum threads/"+outputfolder)
    #placeholder values, overwritten as soon as pages are read
    lastdata2 = "ijdsfijkds"
    lastdata = "kjdsfsa"
    #looping over all the pages
    for page in range(999):
        #range starts at 0, so +1 is needed
        response = urllib2.urlopen(threadurl+str(page+1))
        #assign the data to a var
        page_source = response.read()
        #used for detection of identical output:
        #move the previous page into lastdata2
        lastdata2 = lastdata
        #load the new page into lastdata
        lastdata = page_source
        #check if they are identical
        if page>0:
            if lastdata == lastdata2:
                print "data identical, stopping loop"
                break
        #alternative check, check len
        if page>0:
            if len(lastdata) == len(lastdata2):
                print "length identical, stopping loop"
                break
        #create a file for each page, and save the data in it
        #(the with block closes the file once the page is written)
        with open("Forum threads/"+outputfolder+"/"+str(page+1)+".html",'w') as output:
            output.write(page_source)
        #progress
        print "wrote page "+str(page+1)+" in "+outputfolder+"/"
        print "length of file is "+str(len(page_source))
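
The script above is Python 2. As a rough Python 3 sketch of the same page loop, the downloading can be passed in as a function so the stop rule can be tried without touching a real forum. Everything named here (`save_thread_pages`, `fetch_page`, `fake_fetch`, the demo folder) is an assumption for illustration, not the script's own code. It keeps only the "identical content" check and leaves out the length check, and like the script it assumes the forum serves the last page again when asked for a page past the end.

```python
import os
import tempfile


def save_thread_pages(fetch_page, folder, max_pages=999):
    """Save pages 1, 2, 3, ... until a page repeats the previous one.

    fetch_page is a hypothetical callable: page number -> bytes.
    Returns the number of pages written.
    """
    os.makedirs(folder, exist_ok=True)
    previous = None
    written = 0
    for page in range(1, max_pages + 1):
        source = fetch_page(page)
        # Same idea as the script: identical content means we ran past the end.
        if previous is not None and source == previous:
            break
        with open(os.path.join(folder, "%d.html" % page), "wb") as output:
            output.write(source)
        written += 1
        previous = source
    return written


# Stand-in for a three-page thread; requests past the end repeat page 3.
fake_pages = {
    1: b"<html>page one</html>",
    2: b"<html>page two!</html>",
    3: b"<html>page three</html>",
}


def fake_fetch(page):
    return fake_pages[min(page, 3)]


demo_folder = os.path.join(tempfile.mkdtemp(), "demo thread")
pages_written = save_thread_pages(fake_fetch, demo_folder)
```

For real use, `fetch_page` could be something like `lambda page: urllib.request.urlopen(threadurl + str(page)).read()` after `import urllib.request`, with `threadurl` prepared the same way as in the script.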

Just Emil Kirkegaard Things
