Many of us are familiar with the ticalc.org CD from a while ago. With the magic of modern technology, you too can build a CD (well, DVD) containing the ticalc.org file archives.
The main thing enabling this is the WARC of the site I built a few months ago. I used the following short Python script to extract files from the WARC:
Code:
(Requires the IA WARC library.)
import os
import sys

import warc

# URL prefixes to keep: the file archives plus the assets the pages rely on
GROUPS = ('/pub/', '/archives/files/', '/images/', '/style.css', '/global.js', '/mfunctions.js')
GROUPS = ['http://www.ticalc.org' + s for s in GROUPS]

def extract_files(filename):
    f = warc.WARCFile(filename)
    for record in f:
        if record.type == 'response':
            if any(map(record.url.startswith, GROUPS)) or record.url == 'http://www.ticalc.org/':
                relpath = record.url[len('http://www.ticalc.org/'):]
                # The first line of the payload is the HTTP status line
                response_code = record.payload.readline()
                # Skip redirects and whatnot
                if not response_code.startswith('HTTP/1.1 200'):
                    continue
                print(relpath)
                # Directory-style URLs (and the site root) are saved as index.html
                if len(relpath) == 0 or relpath[-1] == '/':
                    relpath += 'index.html'
                (dirpath, _) = os.path.split(relpath)
                dirpath = 'out/' + dirpath
                if not os.path.exists(dirpath):
                    os.makedirs(dirpath)
                # Skip past response header (it ends at the first blank line)
                for line in record.payload:
                    if line == '\r\n':
                        break
                # Write response body
                with open('out/' + relpath, 'wb') as out:
                    out.write(record.payload.read())

if __name__ == '__main__':
    extract_files(sys.argv[1])
Running this script emits about 170 thousand files into the 'out' directory, containing the file archives and a few top-level pages. Because I don't rewrite any links, you'll want to browse it with a web server (rather than directly from the filesystem), such as Python's http.server module.
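For anyone who would rather start the server from a script than from the command line, here is a minimal sketch using http.server on Python 3.7 or newer. The function name make_server, the loopback address and port 8000 are my own choices for illustration, not anything the extraction script requires.

```python
import functools
import http.server


def make_server(directory='out', port=8000):
    """Build an HTTP server that serves files from `directory` on localhost."""
    # SimpleHTTPRequestHandler accepts a `directory` argument since Python 3.7
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    # Bind to the loopback interface only; passing port 0 picks a free port
    return http.server.ThreadingHTTPServer(('127.0.0.1', port), handler)
```

Calling make_server().serve_forever() and then pointing a browser at http://127.0.0.1:8000/ should show the extracted index page. Running python3 -m http.server --directory out from a shell does much the same thing.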
Building an actual disc image from that directory is pretty straightforward, for example with mkisofs:
Code:
$ mkisofs -iso-level 3 -udf -V "ticalc.org August 2014" -o ticalc.org.iso out
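Before burning, it may be worth checking that the tree actually fits on the target disc. The sketch below sums the file sizes under a directory and compares the total against a nominal 4.7 GB single-layer DVD. That capacity figure and the helper names tree_size and fits_on_disc are assumptions of mine, and the real image will be somewhat larger than the raw total because of filesystem overhead.

```python
import os

# Nominal single-layer DVD capacity in bytes (an assumption; real media vary slightly)
DVD_CAPACITY_BYTES = 4700000000


def tree_size(root):
    """Return (file_count, total_bytes) for every file below `root`."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
            count += 1
    return count, total


def fits_on_disc(total_bytes, capacity=DVD_CAPACITY_BYTES):
    """True if the raw data size is within the given capacity."""
    return total_bytes <= capacity


if __name__ == '__main__':
    count, total = tree_size('out')
    print('%d files, %d bytes, fits: %s' % (count, total, fits_on_disc(total)))
```

This only counts raw file contents, so treat a result close to the limit as "probably too big".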