Just Emil Kirkegaard Things

Ripping threads from able2know forums

Emil O. W. Kirkegaard
Jan 18, 2013

So, I thought it would be a good idea to keep a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding and made a crawler. After spending a couple of hours on it, I now have a crawler that reads lines from a txt file and downloads the pages of each thread into its own folder.
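For reference, here is a small sketch of the txt file layout, as far as it can be inferred from the script in this post: lines alternate between a folder name and a thread URL. The function name `read_thread_list`, the sample folder names and the example.com URLs are made up for illustration. The sketch is written for Python 3 and only strips whitespace and skips blank lines, which is an assumption; the script in this post also cuts one extra trailing character off each URL.

```python
def read_thread_list(lines):
    """Pair each folder-name line with the thread URL line that follows it."""
    # Drop line breaks and skip blank lines (an assumption, see above).
    cleaned = [line.strip() for line in lines if line.strip()]
    # Even positions are folder names, odd positions are thread URLs.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Hypothetical contents of "Forum threads/threads.txt".
sample = [
    "Interesting thread\n",
    "http://example.com/topic/123-\n",
    "Another thread\n",
    "http://example.com/topic/456-\n",
]

for folder, url in read_thread_list(sample):
    print(folder, "->", url)
```

Each pair gives the folder to save into and the URL prefix that the page number gets appended to.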

code is here: Forum crawler.py

or here:

# Python 2 script: urllib2 does not exist in Python 3
import urllib2
import os

def isodd(num):
    return num % 2 == 1

# open data about what to rip
file0 = open("Forum threads/threads.txt", 'r')
# assign data to var, then close the file
data0 = file0.readlines()
file0.close()
# line counter
number = 0
# even lines hold the folder name, odd lines hold the thread url
for line in data0:
    # if it is even, then set the output folder, and continue
    if not isodd(number):
        outputfolder = line[:-1]
        number = number+1
        continue
    # get thread url and remove last chars (linebreak +
    threadurl = line[:-2]
    number = number+1
    print "starting to crawl thread "+threadurl
    # create folder
    if not os.path.isdir("Forum threads/"+outputfolder):
        os.makedirs("Forum threads/"+outputfolder)
    # var introduction (dummy value, replaced by the first page)
    lastdata = "kjdsfsa"
    # looping over all the pages
    for page in range(999):
        # range starts at 0, so +1 is needed
        response = urllib2.urlopen(threadurl+str(page+1))
        # assign the data to a var
        page_source = response.read()
        # used for detection of identical output:
        # keep the previous page in lastdata2, load the new page into lastdata
        lastdata2 = lastdata
        lastdata = page_source
        # check if they are identical
        if page>0:
            if lastdata == lastdata2:
                print "data identical, stopping loop"
                break
        # alternative check, check len
        if page>0:
            if len(lastdata) == len(lastdata2):
                print "length identical, stopping loop"
                break
        # create a file for each page, save the data in it, and close it
        output = open("Forum threads/"+outputfolder+"/"+str(page+1)+".html",'w')
        output.write(page_source)
        output.close()
        # progress
        print "wrote page "+str(page+1)+" in "+outputfolder+"/"
        print "length of file is "+str(len(page_source))
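
Since urllib2 only exists in Python 2, here is a rough Python 3 sketch of the same page loop, not a drop-in replacement for the script above. It keeps the same stop rule (a page with the same length as the previous one means the forum has started repeating the last page). The names `fetch`, `crawl_thread`, `fetch_page` and `max_pages` are made up for this sketch, and passing the download function in as a parameter is my own choice so the loop can be tried without hitting a real forum.

```python
import os
import urllib.request


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def crawl_thread(thread_url, folder, fetch_page=fetch, max_pages=999):
    """Save pages 1, 2, 3, ... of a thread until a page repeats.

    Returns the number of pages written.
    """
    os.makedirs(folder, exist_ok=True)
    previous = None
    written = 0
    for page in range(1, max_pages + 1):
        source = fetch_page(thread_url + str(page))
        # Stop rule from the original script: same length as the previous
        # page is treated as "the forum served the last page again".
        if previous is not None and len(source) == len(previous):
            break
        # Write bytes as-is so the saved file matches what was downloaded.
        with open(os.path.join(folder, str(page) + ".html"), "wb") as output:
            output.write(source)
        previous = source
        written += 1
    return written
```

The length check is only a heuristic: two different pages that happen to have the same size would also stop the loop, just like in the original script.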