Ripping threads from able2know forums
So, I thought it would be a good idea to keep a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding and made a crawler. After spending a couple of hours on it, I now have a crawler that reads folder names and thread URLs from a txt file, and downloads each thread's pages into its folder.
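To make the input format concrete: judging from the script, the txt file alternates between a folder name on one line and a thread URL on the next. Here is a minimal Python 3 sketch of reading such a list. The function name `parse_thread_list` and the example.com URLs are made up for illustration, and stripping whitespace is my assumption (the script itself slices off the trailing characters instead).

```python
def parse_thread_list(lines):
    """Pair up alternating lines: folder name, then thread URL."""
    # Strip whitespace/line breaks and skip blank lines (an assumption;
    # the original script slices off trailing characters instead).
    cleaned = [line.strip() for line in lines if line.strip()]
    # Even positions are folder names, odd positions are thread URLs.
    # zip() silently drops a trailing folder name that has no URL.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Hypothetical file contents, for illustration only.
example = [
    "some thread\n",
    "http://example.com/topic/123-\r\n",
    "another thread\n",
    "http://example.com/topic/456-\r\n",
]
for folder, url in parse_thread_list(example):
    print(folder, "->", url)
```

The page number gets appended to the URL later, which is why the example URLs end right where the number should go.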
code is here: Forum crawler.py
import urllib2
import re
import os

def isodd(num):
    return num & 1 and True or False

#open data about what to rip
file0 = open("Forum threads/threads.txt",'r')
#assign data to var
data0 = file0.readlines()
#hard var
number = 0
#skip odd numbers
for line in data0:
    #if it is even, then set first var, and continue
    if not isodd(number):
        outputfolder = line
        outputfolder = outputfolder[:-1]
        number = number+1
        continue
    #get thread url and remove last chars (linebreak +
    if isodd(number):
        threadurl = line
        threadurl = threadurl[:-2]
        number = number+1
    print "starting to crawl thread "+threadurl
    #create folder
    if not os.path.isdir("Forum threads/"+outputfolder):
        os.makedirs("Forum threads/"+outputfolder)
    #var introduction
    lastdata2 = "ijdsfijkds"
    lastdata = "kjdsfsa"
    #looping over all the pages
    for page in range(999):
        #range starts at 0, so +1 is needed
        response = urllib2.urlopen(threadurl+str(page+1))
        #assign the data to a var
        page_source = response.read()
        #used for detection of identical output
        #replace data in var2
        lastdata2 = lastdata
        #load new data into var1
        lastdata = page_source
        #check if they are identical
        if page>0:
            if lastdata == lastdata2:
                print "data identical, stopping loop"
                break
        #alternative check, check len
        if page>0:
            if len(lastdata) == len(lastdata2):
                print "length identical, stopping loop"
                break
        #used for detection of identical output
        #create a file for each page, and save the data in it
        output = open("Forum threads/"+outputfolder+"/"+str(page+1)+".html",'w')
        output.write(page_source)
        #progress
        print "wrote page "+str(page+1)+" in "+outputfolder+"/"
        print "length of file is "+str(len(page_source))
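The stopping rule is the interesting part: keep requesting page 1, 2, 3, ... and stop once a page comes back identical to the previous one. Below is a rough Python 3 sketch of just that loop, not a replacement for the script above. The names `crawl_thread` and `fake_fetch` are hypothetical, the download is passed in as a function so the example runs without a network, and it assumes the forum serves the last page again for any higher page number. It also drops the length-only check, since two different pages could happen to have the same length.

```python
import os
import tempfile


def crawl_thread(fetch_page, thread_url, out_dir, max_pages=999):
    """Save pages 1..N of a thread, stopping when a page repeats.

    fetch_page is any callable that takes a URL and returns bytes.
    """
    os.makedirs(out_dir, exist_ok=True)
    previous = None
    saved = 0
    for page in range(1, max_pages + 1):
        page_source = fetch_page(thread_url + str(page))
        # Assumption: past the last page the forum serves the same
        # content again, so an exact repeat means we are done.
        if page_source == previous:
            break
        # 'with' closes the file even if the write fails.
        with open(os.path.join(out_dir, str(page) + ".html"), "wb") as output:
            output.write(page_source)
        previous = page_source
        saved += 1
    return saved


# Stand-in for the network: a hypothetical 3-page thread that keeps
# returning page 3 for any higher page number.
def fake_fetch(url):
    page = int(url.rsplit("-", 1)[1])
    return ("page %d" % min(page, 3)).encode()


with tempfile.TemporaryDirectory() as tmp:
    count = crawl_thread(fake_fetch, "http://example.com/topic/123-", tmp)
    files = sorted(os.listdir(tmp))
print(count, files)  # 3 ['1.html', '2.html', '3.html']
```

For real downloads, something like `lambda url: urllib.request.urlopen(url).read()` could be passed as `fetch_page`. Pages with content that changes on every request (timestamps, ads) would defeat the exact-match check, so this only works on forums where the repeated page really is byte-for-byte the same.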