Project Scope and ToDos
- Take a link and turn it into an oEmbed/Open Graph style share card
- Take a link and archive it in the most reliable way
- When the link is a tweet, display the tweet but also the whole tweet thread.
- When the link is a tweet, archive the tweets, and display them if the live ones are not available.
- Capture any embedded retweets in the thread. Capture their thread if one exists
- Capture any links in the Tweet
- Create the process as an abstract function that returns the data in a savable way
- Archive links on Archive.org and save the resulting archival links
- Create link IDs that can be used to cache related content
- Integrate it into the site to be able to make context pages here.
- Archive linked YouTube videos
Day 1
Ok, so this is a thing that happens a lot. I collect a bunch of links on a particular topic, and I want to share them. But it's hard to read a bunch of links, so how do I make them more readable?
I thought through some scope requirements and to-dos and put them at the top of this page first. My first goal is to take a list of links and turn them into something easier to read. I think the best way is by creating Open Graph style share cards for each link and replacing the link in place with those cards. So let's handle that request process.
Selecting test tool
I think the easiest way to move forward is to build some test processes first so that I can run links through the function I'm building and test my outputs. I've now done tests with Jest and Mocha. Another popular library is Chai, so let's try that.
Archiving Tools Reference
It's also worthwhile to do exactly the sort of thing I'm talking about here and record some info about archiving links.
Save My News: A personal, permanent clipping service - GitHub - palewire/savemy.news
A simple Python wrapper for the archive.is capturing service - GitHub - palewire/archiveis
Perma.cc helps scholars, journals, courts, and others create permanent records of the web sources they cite.
Collect and revisit web pages — Free, open-source web archiving service.
The Wayback Machine is a service which can be used to cite archived copies of web pages used by articles. This is useful if a web page has changed, moved, or disappeared; links to the original content can be retained. This process can be performed automatically, using the web interface for User:InternetArchiveBot.
Many people have shown interest in making sure the Wayback Machine has copies of the web pages they care about most. These saved pages can be cited, shared, linked to – and they will continue to exist even after the original page changes or is removed from the web.
This document provides information about the Memento compliant archive.is.
Last updated: January 19, 2015
If you have any issues or feedback, see the AT #warrior IRC channel on hackint.
The WARC Ecosystem has information on tools to create, read and process WARC files.
freeyourstuff.cc - universal content liberation. Contribute to eloquence/freeyourstuff.cc development by creating an account on GitHub.
A list of tools related to W(eb)ARC(hive). Contribute to dhamaniasad/WARCTools development by creating an account on GitHub.
Learn to generate a Puppeteer PDF document from a heavily styled React page using Node.js, headless Chrome and Docker.
Convert any html content or html page to PDF. Latest version: 1.0.8, last published: 2 months ago. Start using html-pdf-node in your project by running `npm i html-pdf-node`. There are 6 other projects in the npm registry using html-pdf-node.
A standalone version of the readability lib. Contribute to mozilla/readability development by creating an account on GitHub.
This is all much more extensive than I want to do for my first run at this project, but it is good to have a list. To start, let's turn link lists into HTML cards.
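As a rough idea of where this is headed, here is a minimal sketch of turning link metadata into an HTML card. Nothing here is final: the function names (escapeHtml, renderCard, renderCards), the class name, and the shape of the meta object (url, title, description, image) are all assumptions loosely modeled on common Open Graph properties, and the metadata fetch itself isn't built yet.

```javascript
// Hypothetical helper: escape text before placing it in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical card renderer. The `meta` shape is an assumption,
// not a finished data model.
function renderCard(meta) {
  // Fall back to the URL when no title is available.
  const title = escapeHtml(meta.title || meta.url);
  const description = meta.description
    ? `<p>${escapeHtml(meta.description)}</p>`
    : "";
  const image = meta.image
    ? `<img src="${escapeHtml(meta.image)}" alt="">`
    : "";
  return (
    `<article class="link-card">` +
    image +
    `<h3><a href="${escapeHtml(meta.url)}">${title}</a></h3>` +
    description +
    `</article>`
  );
}

// Turn a list of metadata objects into a block of cards, one per line.
function renderCards(metaList) {
  return metaList.map(renderCard).join("\n");
}
```

Escaping everything that comes from a remote page seems like the safe default, since titles and descriptions are untrusted input.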
Sanitizing the URL
Ok, first thing is to sanitize the URL.
There's a fairly popular Node sanitization library, so I'll start there.
I'll pull the regex WordPress uses to clean URLs, as I've used that in PHP and it's fairly reliable.
Finally, I want to strip marketing params that are commonly used in links. I could make my own code here, but a quick search around has revealed that someone built some good regexes to handle this.
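To make the shape of this step concrete, here is a minimal sketch that uses only Node's built-in URL parser. It is not the library, the WordPress regex, or the marketing-param regexes mentioned above. The function name sanitizeLink, the list of tracking parameters, and every error message except "Invalid Mailto Link" are assumptions for illustration.

```javascript
// Query parameters treated as marketing noise. This list is an assumption,
// not the regex set referred to in the text.
const TRACKING_PARAM = /^(utm_[a-z_]+|fbclid|gclid|mc_cid|mc_eid)$/i;

// Hypothetical sanitizer: returns a cleaned URL string or throws.
function sanitizeLink(rawLink) {
  const trimmed = String(rawLink).trim();

  let parsed;
  try {
    parsed = new URL(trimmed); // WHATWG URL parser built into Node
  } catch (err) {
    throw new Error("Invalid Link");
  }

  // mailto links parse fine as URLs, so they are rejected explicitly.
  if (parsed.protocol === "mailto:") {
    throw new Error("Invalid Mailto Link");
  }
  // Anything that is not plain web traffic is rejected too.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("Invalid Link Protocol");
  }

  // Collect the keys first so nothing is deleted while iterating.
  const keys = Array.from(parsed.searchParams.keys());
  for (const key of keys) {
    if (TRACKING_PARAM.test(key)) {
      parsed.searchParams.delete(key);
    }
  }

  return parsed.toString();
}
```

For example, sanitizeLink("https://example.com/post?utm_source=news&id=7") keeps the id parameter and drops utm_source, while a mailto link or a javascript: link throws.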
Ok, this makes for a good test setup. It looks like Chai is an assertion library that runs on top of a test runner such as Mocha, so let's install that too.
Ok, it looks like Chai has a suite of assertion styles. The major ones are should, expect and assert.
Ok, let's make some bad links.
I want to invalidate mailto links also. So let's see if I can throw an error and capture it in Chai.
I should be able to capture the error with .should.Throw and expect(fn).to.throw(new Error('text'))
Hmm, that's not working.
Ok, it looks like it has a different format and requires that we put the error-throwing function inside another function, presumably so that Chai can call it and catch the error itself. I also can't use a fresh error object, just the error text. That's also unclear from the docs.
it("should throw on mailto links", () => {
  expect(() => {
    linkModule("mailto:test@example.com?subject=hello+world");
  }).to.throw("Invalid Mailto Link");
});
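To convince myself why the wrapper function matters, here is a small sketch in plain JavaScript with no Chai involved. This is not Chai's implementation. The expectToThrow helper and the stand-in linkModule are hypothetical, written only to show that a throw assertion has to receive a function so it can run it inside try/catch.

```javascript
// Hypothetical stand-in for the module under test.
function linkModule(link) {
  if (link.startsWith("mailto:")) {
    throw new Error("Invalid Mailto Link");
  }
  return link;
}

// Hypothetical helper: a stripped-down version of what a throw assertion
// has to do. It needs the function itself so it can call it in try/catch.
function expectToThrow(fn, expectedMessage) {
  try {
    fn();
  } catch (err) {
    if (!err.message.includes(expectedMessage)) {
      throw new Error(`Expected "${expectedMessage}" but got "${err.message}"`);
    }
    return true; // the expected error was thrown
  }
  throw new Error("Expected function to throw, but it did not");
}

// Works: the call is deferred until the helper runs it.
expectToThrow(() => linkModule("mailto:test@example.com"), "Invalid Mailto Link");

// Does not work: linkModule would run, and throw, while the arguments are
// being evaluated, before expectToThrow is ever called.
// expectToThrow(linkModule("mailto:test@example.com"), "Invalid Mailto Link");

// Two separately constructed errors are never strictly equal, which is one
// plausible reason matching against a fresh `new Error('text')` fails.
const sameObject = new Error("text") === new Error("text"); // false
```

So the extra arrow function is just a way of handing over the call without making it yet.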
Ok, my sanitizer looks good and I think that I have some good coverage. Next step will be handling the Fetch step and building out the data model. But this is a good place to stop.