What's the best way of archiving articles on a website, by the way? I should... probably save a copy of all my published Waypoint work.
Ideally something like a script that I can just give URLs to, and it spits out local files
Gut instinct is "print to PDF" in your browser, as it'll preserve the formatting and presentation in one document, as opposed to downloading an HTML file.
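If you want to script that instead of clicking through the print dialog, headless Chromium can produce the same kind of PDF from the command line (the output name and URL here are just placeholders):

  chromium --headless --print-to-pdf=article.pdf "https://example.com/some-article"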
Depends on what you need, I think. If it's just text, then dropping a URL into wget will probably do it. If it's everything, then you're going to have more of a challenge. As the other comment said, printing is straightforward and you get what you see. You could also try recursive wget with a fake user agent, although that will still be incomplete for most modern sites.
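For reference, a single-page grab and a recursive one might look something like this (the URL and user-agent string are placeholders; -p fetches page requisites like images and CSS, -k rewrites links to point at the local copies, -E adds .html extensions):

  wget -p -k -E --user-agent="Mozilla/5.0" "https://example.com/some-article"
  wget -r -l 2 -p -k -E --user-agent="Mozilla/5.0" "https://example.com/"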
If you're looking to also grab transitively-linked image, JS, and CSS files then the best/easiest option might be a crawler. It's hitting an ant with a sledgehammer, but there's a menagerie of open-source web crawlers out there in various languages.
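To give a sense of scale, a bare-minimum single-page version of that in Python might look like the sketch below. This isn't any of the open-source crawlers, the URL is a placeholder, and it assumes the requests and beautifulsoup4 packages are installed.

  import os
  from urllib.parse import urljoin, urlparse

  import requests
  from bs4 import BeautifulSoup

  def save_page_with_assets(url, out_dir="archive"):
      """Save one page's HTML plus its directly linked images, scripts, and stylesheets."""
      os.makedirs(out_dir, exist_ok=True)
      headers = {"User-Agent": "Mozilla/5.0"}  # fake user agent, per the wget suggestion above
      html = requests.get(url, headers=headers, timeout=30).text
      soup = BeautifulSoup(html, "html.parser")
      # Collect asset URLs from <img src>, <script src>, and <link href>.
      assets = set()
      for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
          for node in soup.find_all(tag):
              if node.get(attr):
                  assets.add(urljoin(url, node[attr]))
      for asset_url in assets:
          name = os.path.basename(urlparse(asset_url).path) or "asset"
          try:
              data = requests.get(asset_url, headers=headers, timeout=30).content
          except requests.RequestException:
              continue  # skip anything that fails rather than aborting the whole save
          with open(os.path.join(out_dir, name), "wb") as f:
              f.write(data)
      with open(os.path.join(out_dir, "page.html"), "w", encoding="utf-8") as f:
          f.write(html)

  save_page_with_assets("https://example.com/some-article")

A real crawler would also rewrite the links in the saved HTML to point at the local copies and follow links recursively, which is where the sledgehammer earns its keep.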
Depending on the size of the work involved, another option would be getting a packet sniffer like Charles Proxy or Wireshark and manually going through the site with session recording enabled. Extracting the work from the saved session is left as an exercise for the reader.
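If you'd rather script the recording side than click around a GUI, mitmproxy's mitmdump (a different tool from the two above, mentioned only because it's scriptable) can write a whole browsing session to a flow file while you read through the articles with your browser pointed at its local proxy:

  mitmdump -w session.flows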
On Remap Radio's article about Vice closing, Patrick and Rob mention a custom scraper that Matthew Gault's wife and friend built, which pulled high-quality PDF copies of their work from Waypoint -- one of them might still have it?
I've been meaning to try out https://archivebox.io/, which saves webpages as HTML, PDF, and a few other formats. They've also got a long list of similar tools linked on their GitHub.
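For anyone else who wants to try it, basic usage is roughly the following, if their docs haven't changed (the directory name and URL are placeholders):

  pip install archivebox
  mkdir waypoint-archive && cd waypoint-archive
  archivebox init
  archivebox add "https://example.com/some-article"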
https://web.archive.org/save has an option to email you the archived page alongside adding it to the Wayback Machine
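The same endpoint seems to work from the command line too, though the email option is only on the form itself; something like this should trigger a snapshot (the URL after /save/ is a placeholder):

  curl -s "https://web.archive.org/save/https://example.com/some-article"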