Building a Personal Digital Archive Bot

From Ephemeral to Eternal
My journey building an automated personal archive bot to preserve digital media forever

Why Build an Archive Bot?

The internet is incredible: a boundless treasure trove of information, entertainment, and culture. But it's also fleeting. I hate trying to revisit an old favorite video, X.com post, or website, only to discover that it has vanished into digital oblivion.

📝Did You Know?

Nearly two thousand years ago, in 79 AD, the eruption of Mount Vesuvius buried a library in the Roman town of Herculaneum. Its scrolls, now known as the Herculaneum papyri, were carbonized but survived, and in 2023 researchers managed to read the first words from inside a still-rolled scroll! Ironically, they're probably more permanent than the digital content we produce today.

This realization inspired me to build my personal archive bot, so that the media I value remains accessible to me, my children, and future generations.

How the Archive Bot Works

I built a straightforward TypeScript bot deployed on a DigitalOcean Droplet. Here's the simplified workflow:

  • I send a URL to my Telegram bot.
  • The server downloads the media using yt-dlp for videos or a headless Chrome browser for other content.
  • The downloaded file is uploaded to DigitalOcean Spaces (a cloud storage solution similar to Amazon S3).
  • The bot sends back a confirmation message via Telegram, reporting success or any issues encountered (so I can SSH into the server and fix them).
  • The temporary files are deleted from the server after being uploaded, keeping everything tidy.
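
The steps above can be sketched as a single function. This is a minimal sketch rather than the bot's actual code: `ArchiveDeps` and `archiveUrl` are hypothetical names, and each dependency is a stand-in for a real step (the yt-dlp download, the Spaces upload, the temp-file removal, and the Telegram reply).

```typescript
// Hypothetical bundle of the four steps; each one is injected so the
// orchestration can be tested without a network connection.
interface ArchiveDeps {
  download: (url: string) => Promise<string>; // resolves to a local file path
  upload: (filePath: string) => Promise<void>;
  cleanup: (filePath: string) => Promise<void>;
  notify: (message: string) => Promise<void>;
}

// Runs one archive job end to end and reports the outcome.
// Returns true on success and false if any step failed.
async function archiveUrl(url: string, deps: ArchiveDeps): Promise<boolean> {
  try {
    const filePath = await deps.download(url);
    await deps.upload(filePath);
    // Only delete the temporary file once the upload has succeeded.
    await deps.cleanup(filePath);
    await deps.notify(`Archived: ${url}`);
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    await deps.notify(`Failed to archive ${url}: ${reason}`);
    return false;
  }
}
```

In this sketch a failed job leaves the downloaded file on disk, which makes it easier to inspect over SSH afterwards.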

Project Setup and Implementation

The core functionality involves:

  • Receiving URLs through Telegram.
  • Validating and downloading content using yt-dlp.
  • Automatically uploading to DigitalOcean Spaces.
  • Persistent, unattended operation on a cloud VPS.
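
As an illustration of the validation step, here is one way to pull a usable URL out of an incoming message. It's a sketch under assumptions: `extractUrl` is a hypothetical helper, and accepting only `http:` and `https:` links is my choice for the example, not something the bot is known to enforce.

```typescript
// Returns the first http(s) URL found in a message, or null if there is none.
function extractUrl(text: string): string | null {
  for (const token of text.trim().split(/\s+/)) {
    try {
      const url = new URL(token);
      if (url.protocol === "http:" || url.protocol === "https:") {
        return url.href;
      }
    } catch {
      // Not a parseable URL; try the next token.
    }
  }
  return null;
}
```

Rejecting everything else up front means that a stray message such as "hello" never reaches the downloader.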

Setting up the Telegram Bot

First of all: why Telegram? Well, I want to be able to communicate with my bot from any of my devices. Telegram has a great mobile app and a great web app. And it's free! I've built some Telegram bots in the past, so I'm already familiar with how they work. Using Telegram as the "frontend" to my bot allows me to send a simple text message with a URL from any device, even when I'm on the go. Then, the bot will take care of automatically archiving the media and texting me back with status updates.

I used Telegraf, a lightweight framework, to handle interactions via Telegram. Creating a bot through @BotFather provided the required API token.

Here's a simplified snippet of the bot's logic:

tsx
import { Telegraf } from "telegraf";

// Both values come from the environment; the token is issued by @BotFather.
const TELEGRAM_TOKEN = process.env.TELEGRAM_TOKEN!;
const ALLOWED_USER_ID = Number(process.env.ALLOWED_USER_ID);

const bot = new Telegraf(TELEGRAM_TOKEN);

bot.on("text", async (ctx) => {
  // Only respond to messages from my own Telegram account.
  if (ctx.from?.id !== ALLOWED_USER_ID) {
    await ctx.reply("Unauthorized user.");
    return;
  }

  // do some stuff
  await ctx.reply("Stuff was done!");
});

bot.launch();
console.log("Bot is running...");
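
One thing worth guarding against in the snippet above: if `ALLOWED_USER_ID` is missing, `Number(undefined)` evaluates to `NaN`, which never equals any user ID, so the bot would reply "Unauthorized user." to everyone, me included, without any hint as to why. A small validation helper can fail fast instead. This is a sketch; `BotConfig` and `loadConfig` are hypothetical names, not part of Telegraf.

```typescript
interface BotConfig {
  telegramToken: string;
  allowedUserId: number;
}

// Reads and validates the two settings the bot takes from the environment.
// `env` is passed in (rather than read globally) so the function is easy to test.
function loadConfig(env: Record<string, string | undefined>): BotConfig {
  const telegramToken = env["TELEGRAM_TOKEN"]?.trim();
  if (!telegramToken) {
    throw new Error("TELEGRAM_TOKEN is not set");
  }
  const rawId = env["ALLOWED_USER_ID"]?.trim() ?? "";
  // Telegram user IDs are positive integers; reject anything else.
  if (!/^\d+$/.test(rawId)) {
    throw new Error("ALLOWED_USER_ID must be a numeric Telegram user ID");
  }
  return { telegramToken, allowedUserId: Number(rawId) };
}
```

Calling `loadConfig(process.env)` once at startup would surface a misconfigured server immediately rather than on the first message.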

Downloading Content with yt-dlp

The bot leverages yt-dlp to handle video downloads.

⚠️Disclaimer

Make sure to follow all applicable copyright laws in your country when downloading content.

Since yt-dlp is an executable, the bot has to run it as a child process. Here's a simplified example of how we can do that, using spawn from Node's child_process module, which Bun also supports. One interesting detail is that we need to listen to stdout to figure out the name of the downloaded file. We also capture stderr so that, if something goes wrong, I can report the error to myself via a Telegram message.

tsx
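import { spawn } from "child_process";

async function downloadVideo(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    let stdout = "";
    let stderr = "";
    // Named `child` to avoid shadowing the global `process` object.
    const child = spawn("yt-dlp", [url, "-o", "./downloads/%(title)s.%(ext)s", "--restrict-filenames"]);

    child.stdout.on("data", (data) => (stdout += data.toString()));
    child.stderr.on("data", (data) => (stderr += data.toString()));

    // Fires if yt-dlp could not be started at all (e.g. it is not installed).
    child.on("error", reject);

    child.on("close", (code) => {
      if (code !== 0) {
        // Include stderr so the caller can forward it via Telegram.
        reject(new Error(`Download failed (exit code ${code}): ${stderr.trim()}`));
        return;
      }
      // The brackets must be escaped; unescaped, [download] is a character class.
      const match = stdout.match(/\[download\] Destination: (.+)/);
      if (match) {
        resolve(match[1].trim());
      } else {
        reject(new Error("Could not find the destination file in yt-dlp's output"));
      }
    });
  });
}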
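
Parsing the filename out of stdout gets trickier when yt-dlp downloads video and audio separately and then merges them: there are several "Destination" lines, and the final file is named on a later "Merging formats into" line. Here's a sketch of a parser that handles both cases. `parseOutputPath` is a hypothetical helper, and the two line formats it matches are assumptions based on yt-dlp's log output, which is not a stable interface.

```typescript
// Extracts the final output path from yt-dlp's log output.
// Prefers the merged file if a merge happened; otherwise falls back to
// the last "[download] Destination: ..." line. Returns null if neither exists.
function parseOutputPath(stdout: string): string | null {
  let lastDestination: string | null = null;
  let merged: string | null = null;
  // Progress updates can be separated by "\r" instead of "\n", so split on both.
  for (const line of stdout.split(/[\r\n]+/)) {
    const mergeMatch = line.match(/^\[Merger\] Merging formats into "(.+)"$/);
    if (mergeMatch) {
      merged = mergeMatch[1].trim();
      continue;
    }
    const destMatch = line.match(/^\[download\] Destination: (.+)$/);
    if (destMatch) {
      lastDestination = destMatch[1].trim();
    }
  }
  return merged ?? lastDestination;
}
```

If scraping log lines proves too fragile, yt-dlp's `--print after_move:filepath` option may be a more robust way to get the final path, though I haven't verified how it interacts with the rest of this setup.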

Uploading with Bun's Native S3 Support

Using Bun's native S3 client streamlined uploads to DigitalOcean Spaces:

tsx
import { S3Client } from "bun";

// DigitalOcean Spaces is S3-compatible, so Bun's S3 client can talk to it.
const s3Client = new S3Client({
  endpoint: process.env.S3_ENDPOINT!,
  bucket: process.env.S3_BUCKET!,
  accessKeyId: process.env.S3_ACCESS_KEY_ID!,
  secretAccessKey: process.env.S3_SECRET_ACCESS_KEY!,
});

async function uploadToSpaces(filePath: string): Promise<void> {
  const fileName = filePath.split("/").pop();
  if (!fileName) throw new Error("Invalid file path");

  // Store every upload under the "archive/" prefix in the bucket.
  const s3file = s3Client.file(`archive/${fileName}`);
  await Bun.write(s3file, Bun.file(filePath));
}
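
Because the object key above is derived from the filename alone, two downloads with the same title would overwrite each other in the bucket. One possible way around that is to add a date-based prefix. The helper below is purely illustrative: `buildObjectKey` and the year/month layout are my assumptions, not the bot's actual scheme.

```typescript
// Builds a dated object key such as "archive/2024/05/clip.mp4".
// The date is passed in explicitly so the result is deterministic in tests.
function buildObjectKey(filePath: string, now: Date): string {
  // Accept both "/" and "\" as path separators.
  const fileName = filePath.split(/[\\/]/).pop();
  if (!fileName) throw new Error("Invalid file path");
  const year = now.getUTCFullYear();
  const month = String(now.getUTCMonth() + 1).padStart(2, "0");
  return `archive/${year}/${month}/${fileName}`;
}
```

A key built this way still collides if the same title is archived twice in one month, so it narrows the problem rather than eliminating it.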

Deployment Steps

Deployment involved setting up a DigitalOcean Droplet, installing dependencies (Bun, yt-dlp, ffmpeg), and securely transferring files from my local machine.

To keep the bot running continuously, I used tmux:

bash
# Start tmux session
tmux new -s archive_bot

# Run bot inside tmux
bun run main.ts

# Detach from session (bot continues running): press Ctrl+B, then D

# Reattach later
tmux attach -t archive_bot

Monitoring and Debugging

bash
# Check sessions
tmux ls

# Capture logs (the pane's full scrollback history)
tmux capture-pane -p -S - > session.log

# Real-time logging (append all new pane output to a file)
tmux pipe-pane -o "cat >> bot.log"

Final Thoughts

💡Digital Preservation

Building this archive bot not only solved my practical need to preserve digital content but also deepened my understanding of modern tools like Bun and yt-dlp. In an era where digital content disappears without warning, it feels reassuring to build a small piece of digital permanence, just as the Herculaneum scrolls preserved history for future generations.

🔗Code

I've made the full source code for this archive bot available on GitHub: