

Ripping threads from able2know forums
So, I thought it would be a good idea to take a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding and made a crawler. After spending a couple of hours, I now have a crawler that reads lines from a txt file and downloads the pages of each listed thread into its own folder.
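Judging from the script further down, the txt file alternates between a folder name and a thread URL, one per line. A minimal Python 3 sketch of reading such a file into pairs might look like this. The function name `read_thread_pairs` and the sample lines are made up for illustration, and unlike the script it only strips whitespace and leaves the URL otherwise untouched.

```python
# Sketch (Python 3): pair up alternating lines of a threads file.
# read_thread_pairs and the sample data are hypothetical, for illustration only.

def read_thread_pairs(lines):
    """Return (folder, url) tuples from alternating folder/URL lines."""
    # Drop surrounding whitespace and skip blank lines.
    cleaned = [line.strip() for line in lines if line.strip()]
    # Even-indexed lines are folder names, odd-indexed lines are URLs.
    return list(zip(cleaned[0::2], cleaned[1::2]))


sample = [
    "Philosophy thread\n",
    "http://example.org/topic/123-1\n",
    "Science thread\n",
    "http://example.org/topic/456-1\n",
]
pairs = read_thread_pairs(sample)
for folder, url in pairs:
    print(folder, "->", url)
```

With a real file, the same function could be fed `open(path).readlines()` in place of `sample`.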
The full crawler code is here: Forum crawler.py
or here:
import urllib2
import re
import os

def isodd(num):
    return num % 2 == 1

#open data about what to rip
file0 = open("Forum threads/threads.txt",'r')
#assign data to var
data0 = file0.readlines()
file0.close()
#line counter
number = 0
#lines alternate: even lines hold the folder name, odd lines hold the thread url
for line in data0:
    #if it is even, then set first var, and continue
    if not isodd(number):
        outputfolder = line
        #remove the linebreak
        outputfolder = outputfolder[:-1]
        number = number+1
        continue
    #get thread url and remove last chars (linebreak +
    if isodd(number):
        threadurl = line
        threadurl = threadurl[:-2]
        number = number+1
    print "starting to crawl thread "+threadurl
    #create folder
    if not os.path.isdir("Forum threads/"+outputfolder):
        os.makedirs("Forum threads/"+outputfolder)
    #dummy values, so the first comparison has something to compare against
    lastdata2 = "ijdsfijkds"
    lastdata = "kjdsfsa"
    #looping over all the pages
    for page in range(999):
        #range starts at 0, so +1 is needed
        response = urllib2.urlopen(threadurl+str(page+1))
        #assign the data to a var
        page_source = response.read()
        #used for detection of identical output
        #move the previous page into var2
        lastdata2 = lastdata
        #load new data into var1
        lastdata = page_source
        #check if they are identical
        if page>0:
            if lastdata == lastdata2:
                print "data identical, stopping loop"
                break
        #alternative check, check len
        if page>0:
            if len(lastdata) == len(lastdata2):
                print "length identical, stopping loop"
                break
        #create a file for each page, and save the data in it
        output = open("Forum threads/"+outputfolder+"/"+str(page+1)+".html",'w')
        output.write(page_source)
        output.close()
        #progress
        print "wrote page "+str(page+1)+" in "+outputfolder+"/"
        print "length of file is "+str(len(page_source))
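
The script above is Python 2 (urllib2, print statements). For anyone on Python 3, here is a rough sketch of the same page loop: request page 1, 2, 3 and so on, and stop when a page comes back identical to the previous one. This is not the original code. The names `crawl_thread`, `http_fetch` and `fake_fetch` and the example.org URL are made up for illustration, and the sketch assumes the forum serves the last page again when asked for a page that does not exist. It only uses the exact comparison, not the length check, since two different pages can happen to have the same length.

```python
# Sketch (Python 3): fetch numbered pages until the server repeats itself.
# crawl_thread, http_fetch and fake_fetch are hypothetical names for illustration.
import os
import tempfile
from urllib.request import urlopen


def http_fetch(url):
    """Download a URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def crawl_thread(base_url, folder, fetch=http_fetch, max_pages=999):
    """Save base_url + page number for pages 1, 2, ... into folder.

    Stops when a page is byte-identical to the previous one, which is
    assumed to mean the forum served the last page again.
    Returns the number of pages written.
    """
    os.makedirs(folder, exist_ok=True)
    previous = None
    written = 0
    for page in range(1, max_pages + 1):
        data = fetch(base_url + str(page))
        if data == previous:
            # Same content as the previous page: assume the thread has ended.
            break
        path = os.path.join(folder, str(page) + ".html")
        with open(path, "wb") as output:
            output.write(data)
        previous = data
        written += 1
    return written


# Offline demonstration with a fake fetcher: 3 distinct pages, then repeats.
def fake_fetch(url):
    page = int(url.rsplit("-", 1)[1])
    return ("page %d" % min(page, 3)).encode("utf-8")


demo_folder = tempfile.mkdtemp()
pages_written = crawl_thread("http://example.org/topic/123-", demo_folder, fetch=fake_fetch)
print(pages_written)
```

Passing the fetch function as a parameter keeps the loop testable without touching the network. For real use, leave `fetch` at its default.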