Day 5: Simple Wayback Machine Archiving

I want to share lists of links, but make them readable and archived
Close-up photo of keyboard keys. ("TYPE" by SarahDeer, licensed under CC BY 2.0.)

Project Scope and ToDos

  1. Take a link and turn it into an oEmbed/Open Graph style share card.
  2. Take a link and archive it in the most reliable way.
  3. When the link is a tweet, display the tweet but also the whole tweet thread.
  4. When the link is a tweet, archive the tweets, and display them if the live ones are not available.
  5. Capture any embedded retweets in the thread. Capture their thread if one exists.
  6. Capture any links in the tweet.
  7. Create the process as an abstract function that returns the data in a savable way.
  • Archive links on Archive.org and save the resulting archival links.
  • Create link IDs that can be used to cache related content.
  • Integrate it into the site to be able to make context pages here.
  • Check if a link is still available at build time and, if not, rebuild the block to point at the archived copy.

Day 5

Ok, the Archive.is stuff isn't working, for no clear reason. Let's step back and try Archive.org. First, I want to standardize on a single set of finalized meta values. I built a function that works through the possible sources in priority order: basic meta tags, then OpenGraph, then JSON-LD, letting JSON-LD (where present) win, since it's the most likely to have accurate metadata.
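In other words, it's a layered merge where the most trustworthy source overwrites the rest. A minimal sketch of the idea, assuming the three sources have already been parsed into flat objects (finalizeMeta and the exact field names are my illustration here, not the actual implementation):

const finalizeMeta = ({ meta = {}, openGraph = {}, jsonLd = {} }) => {
  // Later spreads win: JSON-LD overrides OpenGraph, which overrides
  // plain <meta> tags, matching the priority order described above.
  const merged = { ...meta, ...openGraph, ...jsonLd };
  // Keep only the fields a share card needs.
  const { title, description, image, url } = merged;
  return { title, description, image, url };
};

module.exports = finalizeMeta;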

Ok, let's look at the Web Archive documentation and some related resources:

  • Internet Archive, on saving pages: "In recent days many people have shown interest in making sure the Wayback Machine has copies of the web pages they care about most. These saved pages can be cited, shared, linked to – and they will continue to exist even after the original page changes or is removed from the web. There are several ways to save pages and […]"
  • Archive Team on GitHub: "We Are Going To Rescue Your Shit." 588 repositories.
  • The ArchiveTeam wiki: "Fork me on GitHub! File and triage issues, fix bugs, refactor code, submit pull requests… all welcome! Discussion in #archiveteam-dev (on hackint)."
  • ArchiveTeam/seesaw-kit: a reusable toolkit for writing seesaw scripts.
  • ArchiveTeam/grab-site: "The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns."
  • Uploading to archive.org: "You can upload movies, audio, texts, software, images, and other formats."
  • On item metadata: "When creating an item good metadata is important for archival purposes, informing users and allowing it to be found easier in search results."
  • Wikipedia, on citing archives: "The Wayback Machine is a service which can be used to cite archived copies of web pages used by articles. This is useful if a web page has changed, moved, or disappeared; links to the original content can be retained. This process can be performed automatically, using the web interface for User:InternetArchiveBot."

It can definitely get complicated depending on how complex we want our archiving process to be. But let's start with a very basic version. It looks like I should just be able to send a request and start the archiving process off? Let's try setting up a basic fetch.
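The most basic version looks like a single GET request to the Save Page Now endpoint: web.archive.org/save/ with the target URL appended. A quick sketch of that first attempt (saveToWayback is my own name for it):

const fetch = (...args) =>
  import("node-fetch").then(({ default: fetch }) => fetch(...args));

// Requesting web.archive.org/save/<url> asks the Wayback Machine to capture <url>.
const saveToWayback = async (url) => {
  const response = await fetch(`https://web.archive.org/save/${url}`);
  return response.ok; // a 200 suggests the capture request was accepted
};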

Hmm, there's a bunch of fetch boilerplate I'd otherwise be repeating. Let's pull it out into its own file. Now I can use it here and also reuse it for my link archiver functions:

// Using suggestion from the docs - https://www.npmjs.com/package/node-fetch#loading-and-configuring-the-module
const fetch = (...args) =>
  import("node-fetch").then(({ default: fetch }) => fetch(...args));

const ua =
  "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)";

const getRequestHeaders = () => {
  return {
    cookie: "",
    "Accept-Language": "en-US,en;q=0.8",
    "User-Agent": ua,
  };
};

class HTTPResponseError extends Error {
  constructor(response, ...args) {
    super(
      `HTTP Error Response: ${response.status} ${response.statusText}`,
      ...args
    );
    this.response = response;
  }
}

const checkStatus = (response) => {
  if (response.ok) {
    // response.status >= 200 && response.status < 300
    return response;
  } else {
    throw new HTTPResponseError(response);
  }
};

// Pass `true` as `headers` for the default set, or an object to override it.
const fetchUrl = async (url, options = false, headers = true) => {
  let response = false;
  const finalOptions = options ? options : { method: "get" };
  if (headers) {
    finalOptions.headers = headers === true ? getRequestHeaders() : headers;
  }
  try {
    response = await fetch(url, finalOptions);
    // Throws for non-2xx responses so they land in the catch below
    response = checkStatus(response);
  } catch (e) {
    if (e instanceof HTTPResponseError) {
      console.error("Fetch Error in response", await e.response.text());
    } else if (e.code === "ENOTFOUND") {
      console.error("URL Does Not Exist", e);
    }
    return false;
  }
  return response;
};

module.exports = fetchUrl;
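With the helper in place, sending a link off to the Wayback Machine is a one-liner. A usage sketch (archiveOnWayback and the require path are my own names for illustration):

const fetchUrl = require("./fetch-url"); // path is an assumption

// The save endpoint takes the target URL appended as-is, no encoding needed.
// fetchUrl returns the response on success and false on failure, so callers
// can branch without their own try/catch.
const archiveOnWayback = (url) =>
  fetchUrl(`https://web.archive.org/save/${url}`);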

Ok, that works! I'm getting a 200 back, which suggests the page is being archived. And sure enough, when I check the archive page for my test link, the capture is there!

git commit -am "Set up for further archiving and abstract fetch tools. Send links to Wayback Machine"