Introduction to Clipper.js
Clipper.js is a powerful Node.js command-line tool designed to make the process of clipping content from web pages and converting it to Markdown remarkably easy. The tool leverages the capabilities of Mozilla's Readability library to effectively parse web page content and uses Turndown to convert it into Markdown format.
This is particularly useful for those who want to save snippets of web content for personal use, such as note-taking or archiving, without relying on browser extensions like Evernote Web Clipper or Notion Web Clipper. Notably, Clipper.js operates entirely within the terminal environment, eliminating the need for additional software installations or account sign-ups.
Installation
To get started with Clipper.js, users simply need to run the following command to install it globally via npm:
npm install -g @philschmid/clipper
Note that the crawling feature additionally requires installing playwright and its browser dependencies.
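A typical setup follows Playwright's standard install flow; the commands below are the usual Playwright setup steps, not taken from the Clipper.js documentation, so adjust them to your environment:

```shell
# Install the Playwright library globally, matching the global clipper install.
npm install -g playwright
# Download the browser binaries Playwright drives; on Linux, --with-deps also
# installs the OS-level libraries those browsers need.
npx playwright install --with-deps chromium
```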
Using Clipper.js
Clipping Content
Clipper.js offers several options for clipping content from various inputs:
- -i, --input <file> | <directory>: Specifies an input file (in HTML format) or a directory from which content is to be clipped. If a directory is given, all files it contains will be clipped.
- -u, --url <url>: Provides a URL from which to clip content.
- -f, --format <format>: Defines the output format, either Markdown or JSON, with Markdown as the default.
- -o, --output <file>: Sets the output file for the clipped content, defaulting to output.md.
Example Commands:
- Clip from a URL:
  clipper clip -u <url>
- Clip from a file:
  clipper clip -i <file>
- Clip from a directory and save as JSON:
  clipper clip -i <directory> -f json -o dataset.jsonl
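Because JSON Lines stores one JSON object per line, the resulting dataset is easy to inspect with standard shell tools. A minimal sketch follows; the field names (url, markdown) are an assumption for illustration and are not confirmed by the Clipper.js docs:

```shell
# Illustrative only: the exact JSONL field names are an assumption.
printf '%s\n' \
  '{"url":"https://example.com/a","markdown":"# Page A"}' \
  '{"url":"https://example.com/b","markdown":"# Page B"}' > dataset.jsonl

# One JSON object per line, so counting lines counts clipped pages.
wc -l < dataset.jsonl
```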
Crawling Websites
Warning: Crawling websites can be resource-intensive. Make sure you understand the risks involved and proceed with caution.
- -u, --url <url>: The URL at which to begin the crawl.
- -g, --glob <glob>: A glob pattern to match URLs for crawling.
- -o, --output <file>: The file where the crawled content will be saved, defaulting to dataset.jsonl.
Example Command:
clipper crawl -u <url>
This command will crawl the specified site and store all content in the dataset.jsonl file.
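Combining the options above, a crawl can be restricted to one part of a site with a glob pattern. The URL and pattern below are purely illustrative:

```shell
# Crawl only pages under /docs on example.com and write them to docs.jsonl.
clipper crawl -u https://example.com/docs -g "https://example.com/docs/**" -o docs.jsonl
```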
Alternative Use Cases
Converting PDF to Markdown
Clipper.js can transform a PDF into Markdown by first converting the PDF to HTML using a tool like Poppler, and then using Clipper.js for the Markdown conversion:
pdftohtml -c -s -noframes test.pdf test.html
clipper clip -i test.html
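The same two-step pipeline can be applied to a whole folder of PDFs. This is a sketch assuming pdftohtml (from Poppler) and clipper are both on PATH:

```shell
# Convert every PDF in the current directory to Markdown via HTML.
for pdf in *.pdf; do
  base="${pdf%.pdf}"
  pdftohtml -c -s -noframes "$pdf" "$base.html"  # PDF -> single HTML page
  clipper clip -i "$base.html" -o "$base.md"     # HTML -> Markdown
done
```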
Local Development and Testing
Developing Clipper.js locally involves cloning the repository, installing its dependencies, and testing functionality such as clipping and crawling; the repository also documents how to test the project and build it for production.
Credits and Licensing
Clipper.js relies on several open-source libraries to function, specifically Mozilla Readability for parsing, Turndown for converting content to Markdown, and Crawlee for crawling tasks. The project is available under the Apache 2.0 license, making it free and open for modification and distribution.
Publishing to npm
To release a new version of Clipper.js to npm, follow the steps of cleaning up old build files, updating the version, building the project, and publishing it to npm, in addition to creating a release on GitHub. This ensures that the latest version is accessible to the user community.
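The release steps described above can be sketched as a shell sequence. The build-output directory and script names are assumptions about a typical npm project, not taken from the Clipper.js repository:

```shell
rm -rf dist                  # clean up old build artifacts (assumed output dir)
npm version patch            # bump the package version and create a git tag
npm run build                # build the project (assumed script name)
npm publish --access public  # publish the scoped package to npm
```

A GitHub release for the new tag is then created through the GitHub UI or CLI.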