Linux | A Pipe and a Keyboard

Rebuilding a WordPress site from nothing

Posted on September 19, 2017 by Richard

Some weeks ago I took on a project.

The project was to rebuild a major blog with next to nothing to start with. The blog had been running for years and had an enormous following. The author had been diagnosed with cancer some years ago but had continued her writings. Last December she decided to discontinue the site, as her health had seriously deteriorated by that stage. She erased the site. Not long after, she expressed a sincere regret that the site was gone, and it was one of her dying wishes that somehow the site be restored. Last August, she died without seeing that wish fulfilled.

I assumed there were some backups lying around and offered to restore the site to its former glory. Unfortunately my assumption was wrong. I contacted the company which had hosted the site, and the answer was “Sorry! We don’t keep backups that long.” So the call went out to see if anyone had miraculously got a backup of their own.

The first to respond was the blogger’s husband who was as anxious to restore the site as his wife had been. He announced that there were backups on the author’s PC. I duly received 1,419 files (one per post) to find that they were all in MHT format. MHT contains all the information for an individual page including HTML, JS, CSS and images. However, MHT is not a universally recognised format so putting them on-line would only work in Internet Explorer (or in some other browsers with a specific add-on installed). I had to find a way of extracting them, each into its component parts.

After some searching, I came across MHT2HTM which happily extracts the required contents and works on Linux. I installed it and ran it. It worked perfectly.

So I now had 1,419 folders, each containing everything required to reconstruct the posts. However, files in folders – 59,337 files weighing 834MB – are not very efficient as an archive. They are bulky in the extreme, aren’t searchable and can’t be listed by anything other than folder name. The answer was to extract the information from each folder, to move the image files to a central point and to populate a database with the correct information.

I set about writing a programme (in PHP). The programme was simple enough in concept – it had to cycle through each folder, reading the HTML file and extracting all the required information to place in the database. This meant stripping all the HTML off the file leaving just two parts – the post content and the comments. These two were joined together so that the comments became part of the post. That was moved to the database along with other information which had been extracted such as the author (there had been quite a few contributors to the site), the title, the name (slightly different to the title as far as the database is concerned) and the date of posting.
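The folder walk is easy to picture in miniature. The real programme was written in PHP and is not reproduced here; what follows is only a rough shell sketch of the same idea. It assumes each extracted folder holds one HTML file directly inside it and that the post title sits in a single-line title tag. The function name extract_titles and that layout are illustrative assumptions, not the actual code.

```shell
#!/bin/sh
# Hypothetical sketch: walk every post folder under a root directory and
# print "folder<TAB>title" for each one. Assumes one .htm/.html file per
# folder and a <title>...</title> tag that fits on a single line.
extract_titles() {
    root=$1
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue          # the glob matched nothing
        html=$(find "$dir" -maxdepth 1 -type f -name '*.htm*' | sort | head -n 1)
        [ -n "$html" ] || continue         # folder without an HTML file
        title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$html" | head -n 1)
        printf '%s\t%s\n' "${dir%/}" "$title"
    done
}
```

Pulling out the post body and comments would need a proper HTML parser, which is presumably why the job was done in PHP rather than with sed.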

In the meantime I had taken over the domain name and had set up a server with a WordPress site. I uploaded the database I had created and the first generation of the archive was live.

Also in the meantime I had received a few (!) more files from others who had had the foresight to back up some of the original.

One batch of 1,462 files was also in MHT format. I ran that through MHT2HTM and I now had 1,462 folders. Unfortunately the two archives had been saved using slightly different parameters so I couldn’t compare the two directly. It would have to be done automatically.

I took my home-grown PHP setup and modified the code. It now had to make sure that the record didn’t already exist from the first run. This wasn’t quite so simple because the different parameters had produced some variations in post titles. I had to generate a unique key for each post (which meant running the first batch again) and compare keys. The key was simple enough – just strip all non-alphanumeric characters from the title and replace spaces with dashes.
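As an illustration of that key, here is a small shell sketch (the original was PHP, and this is not that code). It reads the rule as "drop everything except letters, digits and spaces, then turn spaces into dashes". Collapsing a run of several spaces into one dash is my own assumption, as is the function name make_key. Letter case is left alone because the post doesn't mention it.

```shell
#!/bin/sh
# Sketch of the dedup key: keep only ASCII letters, digits and spaces,
# then turn each run of spaces into a single dash (an assumption).
make_key() {
    printf '%s\n' "$1" | LC_ALL=C sed -e 's/[^A-Za-z0-9 ]//g' -e 's/  */-/g'
}

make_key 'Hello, World!'    # prints Hello-World
```

Two titles that differ only in punctuation then produce the same key, which is what makes the comparison between batches possible.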

Next I received some more copies of the site. A couple had used WGET to download the site and a couple had used HTTrack. These of course gave me a load of files, all with different layouts and small but important differences in the HTML coding.

Once again I rewrote my PHP programme and updated the database. I was now on a third and fourth generation of the site, each time adding a few more missing posts.

I had been dipping through the WayBack machine which had also sporadically taken copies of the site. Then I discovered wayback-machine-downloader. I downloaded and installed it and did a test run. It seemed to do the job so I did a live run. About eighteen hours later it finished.

Naturally, the downloader had dumped all its results into a series of folders with yet another layout. This time I tried a different approach – I used “locate” to find the index files and piped the result into a file. I discovered that if I split each path in that file into an array using “explode” on “/”, the required files were the ones whose array had a count of 12. This made the job easier. Once again, I used a modified version of the PHP programme to generate database entries.
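The same depth test can be sketched in shell with awk, which splits on “/” much as PHP's explode does (a leading slash gives an empty first piece in both). This is only an equivalent illustration, not the PHP that was actually used; the function name paths_with_count is made up, and the count is passed in rather than fixed at 12.

```shell
#!/bin/sh
# Print only the paths that split into exactly N pieces on "/".
# A leading "/" yields an empty first piece, which is counted too.
# Usage: paths_with_count FILE N
paths_with_count() {
    awk -F/ -v n="$2" 'NF == n' "$1"
}

# Example, assuming the locate output was saved to paths.txt:
# paths_with_count paths.txt 12
```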

I think I have gone as far as I can. There are still some small tweaks to be made to the site, including a few duplicate posts that somehow crept in. I still have to find some lost images as well, and maybe sort out additions to the various categories.

The result of all this effort can be found here:

The Anna Raccoon Archive.

Bad interpreter error

Posted on October 28, 2016 by Richard (updated November 26, 2016)

I use shell scripts for a few jobs.

Recently I rebuilt a laptop and installed a few scripts from a working machine. The other day I went to run one of the scripts from a command line and got the following:

/bin/bash^M: bad interpreter: No such file or directory

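Now the new machine had an identical setup to the old, and the script I was trying to run had the right permissions (and of course ran perfectly on the old machine), so the problem had to lie with the formatting of the script file (and the ^M was a bit of a hint too). The ^M is a carriage return: the file had picked up Windows-style (CRLF) line endings, so the system was looking for an interpreter literally named /bin/bash followed by a carriage return. Somehow the file transfer had corrupted the script file and trying to edit it didn't fix it.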

A simple fix.

I used the following:

sed -i -e 's/\r$//' myfile.sh

The file then ran perfectly.
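For anyone wanting to confirm the diagnosis before changing anything, here is a small sketch that checks a file for carriage returns at line ends and only then applies the same sed fix. The helper name has_crlf and the throwaway demo file are my own inventions for illustration, and like the command above it relies on GNU sed.

```shell
#!/bin/sh
# has_crlf FILE: succeed if any line in FILE ends with a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "${cr}\$" "$1"
}

# Demo on a throwaway script written with Windows (CRLF) line endings.
demo=$(mktemp)
printf '#!/bin/bash\r\necho hello\r\n' > "$demo"

if has_crlf "$demo"; then
    echo "CRLF line endings found"
    sed -i -e 's/\r$//' "$demo"    # the same fix as above
fi
```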

Install Google Earth with photos on Linux Mint 18 (64bit)

Posted on July 3, 2016 by Richard (updated December 6, 2017)

Latest: with Mint 18.3, either Mint or Google appears to have sorted out the problems.
However, there is still an issue with some video drivers.
————————-

Now that Mint 18 has been released, the first thing I noticed was that Google Earth is not available in the repositories and requires a manual download and install.

~~Before doing that, make sure Mint updates have been applied as the LSB libraries are missing from the installation disk, but are now available as an update.~~

First open a terminal and enter the following –

sudo apt-get install lsb-core -y -f

Then download Google Earth.

Finally, in Terminal, run –

sudo dpkg -i google-earth-stable_current_amd64.deb

The problem with the photographs not displaying still exists, so I have created a small script.

Please note – run the script at your own risk. It does however run perfectly on my setup.

Download it here.

Open a terminal in the folder where you have saved the download –

chmod +x GEImages.sh

sudo ./GEImages.sh

Once it has finished, run Google Earth and enjoy!

Running a live image on the desktop

Posted on April 28, 2016 by Richard

Things have been somewhat quiet here of late.

Occasionally I like to set myself little mental exercises to keep the brain ticking over. My latest bright idea was to replace my desktop background wallpaper with a live [or nearly live] image from the Internet.

There is a site with a live webcam pointing to a view that I love. The webcam updates roughly once a minute producing a JPG image. My task was to use that image as a desktop background.

For the purposes of illustration, I have used imaginary URLs but you can take it from me that it works.

The first part was to write a shell script to retrieve the file from the website. This was short and simple –

#!/bin/bash
# retrieves newest image from web camera
# and sets this as desktop background in cinnamon
wget "http://webcamerasite.com/webcams/cameraoutput.jpg" -O /home/username/Images/camera.jpg
gsettings set org.cinnamon.desktop.background picture-uri "file:///home/username/Images/camera.jpg"

Note that the final line is for Linux Mint Cinnamon edition. Modify it for other versions.

Save the script and make it executable. Running the script will replace the background image with the latest from the website.
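One thing to watch when this runs unattended: wget -O truncates the target file before it starts downloading, so a failed fetch can leave an empty or partial image as the wallpaper. A slightly more defensive variant might download to a temporary file first and only swap it in on success. This is a sketch rather than the script I actually run; the function name update_wallpaper and the temporary-file approach are additions for illustration, and the URL is the same imaginary one.

```shell
#!/bin/sh
# Sketch: fetch the image to a temporary file first and only replace the
# wallpaper if the download succeeded and is not empty.
# Usage: update_wallpaper URL DEST   (DEST must be an absolute path)
update_wallpaper() {
    url=$1
    dest=$2
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if wget -q -O "$tmp" "$url" && [ -s "$tmp" ]; then
        mv "$tmp" "$dest"
        gsettings set org.cinnamon.desktop.background picture-uri "file://$dest"
    else
        rm -f "$tmp"    # keep the previous image in place
        return 1
    fi
}

# Example call, using the same imaginary URL and path as above:
# update_wallpaper "http://webcamerasite.com/webcams/cameraoutput.jpg" /home/username/Images/camera.jpg
```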

The next task is to set up a Cron job to run the script.

In Terminal, type crontab -e

Select the editor (I chose 2 – Nano).

Add the following line at the bottom –

* * * * * /home/username/location of shell script

This will run the script once a minute.

Press Ctrl-o to save the file, press Enter to confirm the file name and Ctrl-x to exit.

Your desktop will now refresh every minute with the latest view.

The latest image on my desktop

Auto mounting an NTFS partition in Linux

Posted on December 3, 2015 by Richard

I have a simple setup on this machine.

It has a 1TB hard disk so I allocated 500GB for Windows and 500GB for Linux [Mint 17.3 Cinnamon].

My problem was that I wanted permanent access to the Windows NTFS partition from Linux. There were some files that I wanted to be able to edit from either OS, and it made sense to keep those files on NTFS where they can be accessed seamlessly from both.

So how do I get Linux to automatically mount the NTFS partition on boot?

Simple.

In Linux, select Menu -> Preferences -> Disks.

Select the NTFS partition and then click on the cog icon below the selection (not the one at the top right corner).

Select "Edit Mount Options…"

Simply set the Automatic Mount Options to "OFF", and make sure Mount at startup is selected.

Reboot and the partition is mounted.
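To confirm after the reboot that the partition really was mounted, the kernel's mount table can be checked from a terminal. The sketch below is an assumption-laden helper, not part of the steps above. The name list_ntfs_mounts is made up, and it assumes NTFS volumes show up in /proc/mounts with a type of fuseblk (as they do under the ntfs-3g driver, though other FUSE filesystems use that type too) or a type beginning with ntfs.

```shell
#!/bin/sh
# List "device mountpoint type" for likely NTFS mounts.
# Reads /proc/mounts by default; another file in the same format can be
# given as the first argument.
list_ntfs_mounts() {
    awk '$3 == "fuseblk" || $3 ~ /^ntfs/ { print $1, $2, $3 }' "${1:-/proc/mounts}"
}
```

Running list_ntfs_mounts with no arguments should print one line for the Windows partition if the automatic mount worked.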
