Some weeks ago I took on a project.
The project was to rebuild a major blog with next to nothing to start with. The blog had been running for years and had an enormous following. The author had been diagnosed with cancer some years ago but had continued writing. Last December she decided to discontinue the site, as her health had seriously deteriorated by that stage. She erased the site. Not long after, she expressed a sincere regret that the site was gone, and it was one of her dying wishes that somehow the site be restored. Last August, she died without seeing that wish fulfilled.
I assumed there were some backups lying around and offered to restore the site to its former glory. Unfortunately my assumption was wrong. I contacted the company which had hosted the site. The answer: “Sorry! We don’t keep backups that long.” So the call went out to see if anyone had miraculously got a backup of their own.
The first to respond was the blogger’s husband who was as anxious to restore the site as his wife had been. He announced that there were backups on the author’s PC. I duly received 1,419 files (one per post) to find that they were all in MHT format. MHT contains all the information for an individual page including HTML, JS, CSS and images. However, MHT is not a universally recognised format so putting them on-line would only work in Internet Explorer (or in some other browsers with a specific add-on installed). I had to find a way of extracting them, each into its component parts.
After some searching, I came across MHT2HTM which happily extracts the required contents and works on Linux. I installed it and ran it. It worked perfectly.
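For anyone curious about what that extraction involves: an MHT file is a MIME multipart message, the same container format used for email, with the page and each of its resources stored as separate parts. The following is only a rough Python sketch of the idea, not how MHT2HTM works internally; the function name split_mht is made up for illustration.

```python
import email
from email import policy


def split_mht(raw):
    """Return (content_type, content_location, payload_bytes) for each part.

    Hypothetical helper: it only lists the parts. A real extractor would
    also write them to disk and rewrite the links inside the HTML.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the container, keep only the leaf parts
        parts.append((
            part.get_content_type(),
            part.get("Content-Location"),
            part.get_payload(decode=True),  # undoes base64 / quoted-printable
        ))
    return parts
```

Each tuple could then be written out as its own file, which is roughly what ends up in the per-post folders.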
So I now had 1,419 folders, each containing everything required to reconstruct the posts. However, files in folders – 59,337 files weighing 834 MB – are not very efficient as an archive. They are bulky in the extreme, aren’t searchable and can’t be listed by anything other than folder name. The answer was to extract the information from each folder, to move the image files to a central point and to populate a database with the correct information.
I set about writing a programme (in PHP). The programme was simple enough in concept – it had to cycle through each folder, reading the HTML file and extracting all the required information to place in the database. This meant stripping all the surrounding HTML off the file, leaving just two parts – the post content and the comments. These two were joined so that the comments became part of the post. That was moved to the database along with other information which had been extracted, such as the author (there had been quite a few contributors to the site), the title, the name (slightly different to the title as far as the database is concerned) and the date of posting.
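To give a flavour of the extraction step, here is a simplified sketch in Python rather than the PHP I actually used. It assumes, purely for illustration, that the post body and the comments each sit in a div with a known class; the class names entry-content and comments are guesses, not the real markup. It also keeps only the plain text, whereas a real import would probably keep the inner HTML.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect the text found inside <div> elements with a wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # section currently being captured, if any
        self.depth = 0       # <div> nesting depth inside that section
        self.sections = {name: [] for name in wanted}

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.current is not None:
            self.depth += 1  # nested div inside a captured section
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.current, self.depth = name, 1
                break

    def handle_endtag(self, tag):
        if tag == "div" and self.current is not None:
            self.depth -= 1
            if self.depth == 0:
                self.current = None  # left the captured section

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.sections[self.current].append(data.strip())


def extract_post(html, content_class="entry-content", comments_class="comments"):
    """Return the post text followed by the comments text (class names assumed)."""
    parser = SectionExtractor([content_class, comments_class])
    parser.feed(html)
    parser.close()
    body = " ".join(parser.sections[content_class])
    comments = " ".join(parser.sections[comments_class])
    return (body + "\n\n" + comments).strip()
```

Everything outside those two regions (menus, sidebars, footers) is simply ignored, which is the point of the exercise.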
In the meantime I had taken over the domain name and had set up a server with a WordPress site. I uploaded the database I had created and the first generation of the archive was live.
Also in the meantime I had received a few (!) more files from others who had had the foresight to back up some of the original.
One batch of 1,462 files was also in MHT format. I ran that through MHT2HTM and I now had 1,462 folders. Unfortunately the two archives had been saved using slightly different parameters so I couldn’t compare the two directly. It would have to be done automatically.
I took my home-grown PHP setup and modified the code. It now had to make sure that a record didn’t already exist from the first run. This wasn’t quite so simple because the different parameters had produced some variations in post titles. I had to generate a unique key for each post (which meant running the first batch again) and compare keys. The key was simple enough – strip everything except letters, digits and spaces from the title, then replace the spaces with dashes.
Next I received some more copies of the site. A couple had used WGET to download the site and a couple had used HTTrack. These of course gave me a load of files, all with different layouts and small but important differences in the HTML coding.
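As a rough illustration of the key idea, here it is sketched in Python rather than my PHP. The function names are invented for the example, and lower-casing the title and collapsing repeated spaces are my own assumptions about what makes two variants of a title match.

```python
import re


def post_key(title):
    """Reduce a post title to a comparison key (hypothetical helper)."""
    # Drop everything except ASCII letters, digits and whitespace.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", title)
    # Assumption: lower-case, and turn each run of whitespace into one dash.
    return "-".join(cleaned.lower().split())


def new_posts(existing_keys, titles):
    """Return the titles whose key is not already present in existing_keys."""
    seen = set(existing_keys)
    fresh = []
    for title in titles:
        key = post_key(title)
        if key not in seen:
            seen.add(key)  # also guards against duplicates within the batch
            fresh.append(title)
    return fresh
```

Two titles that differ only in punctuation or spacing then produce the same key, so the second batch can be checked against the first without comparing them by eye.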
Once again I rewrote my PHP programme and updated the database. I was now on a third and fourth generation of the site, each time adding a few more missing posts.
I had been dipping into the Wayback Machine, which had also sporadically taken copies of the site. Then I discovered wayback-machine-downloader. I downloaded and installed it and did a test run. It seemed to do the job so I did a live run. About eighteen hours later it finished.
Naturally, the downloader had dumped all its results into a series of folders with yet another layout. This time I tried a different approach – I used “locate” to find the index files and piped the result into a file. I discovered that if I split each path in that file into an array using “explode” with “/” as the delimiter, the required files were the ones whose array had a count of 12. This made the job easier. Once again, I used a modified version of the PHP programme to generate database entries.
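The path trick is easier to show than to describe. This is a Python sketch of the same idea, not the PHP I ran; the function name is made up, and the depth of 12 is simply what happened to fit my folder layout.

```python
def select_post_files(listing, depth=12):
    """Keep only the paths that split into exactly `depth` pieces on '/'.

    `listing` is the text saved from the file search, one path per line.
    """
    paths = [line.strip() for line in listing.splitlines() if line.strip()]
    # As with PHP's explode, a leading "/" yields an empty first element,
    # and that element is included in the count.
    return [path for path in paths if len(path.split("/")) == depth]
```

Anything shallower or deeper (category listings, feeds, attachments and so on) falls out of the list without needing a pattern for each case.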
I think I have gone as far as I can. There are still some small tweaks to be made to the site, including a few duplicate posts that somehow crept in. I still have to find some lost images as well, and maybe sort out additions to the various categories.
The result of all this effort can be found here:
The Anna Raccoon Archive.