Building a Personal Digital Archive Bot
Why Build an Archive Bot?
The internet is incredible: a boundless treasure trove of information, entertainment, and culture. But it's also fleeting. I hate going back to an old favorite video, X.com post, or website, only to discover it has vanished into digital oblivion.
Two thousand years ago, the eruption of Mount Vesuvius buried an ancient library whose carbonized scrolls are now known as the Herculaneum Papyri. The scrolls miraculously survived, and in 2023 the first passages were finally decoded. Ironically, they're probably more permanent than the digital content we produce today.
This realization inspired me to build my personal archive bot: ensuring that the media I value remains accessible for me, my children, and future generations.
How the Archive Bot Works
I built a straightforward TypeScript bot deployed on a DigitalOcean Droplet. Here's the simplified workflow:
- I send a URL to my Telegram bot.
- The server downloads the media using yt-dlp for videos or a headless Chrome browser for other content.
- The downloaded file is uploaded to DigitalOcean Spaces (a cloud storage solution similar to Amazon S3).
- The bot sends back a confirmation message via Telegram, indicating success or reporting any issues encountered (so I can SSH into the server and fix things).
- The temporary files are deleted from the server after being uploaded, keeping everything tidy.
Project Setup and Implementation
The core functionality involves:
- Receiving URLs through Telegram.
- Validating and downloading content using yt-dlp (a minimal validation sketch follows after this list).
- Automatically uploading to DigitalOcean Spaces.
- Persistent, unattended operation on a cloud VPS.
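The validation step is the lightest of these. Something along the following lines (a minimal sketch; `isValidUrl` is an illustrative helper, not code from the actual bot) is enough to reject anything that isn't an http(s) URL before it ever reaches yt-dlp:

```ts
// Minimal sanity check before handing a message to yt-dlp (illustrative helper)
function isValidUrl(text: string): boolean {
  try {
    const url = new URL(text);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false; // new URL() throws on anything that isn't a parseable URL
  }
}
```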
Setting up the Telegram Bot
First of all: why Telegram? Well, I want to be able to communicate with my bot from any of my devices. Telegram has a great mobile app and a great web app. And it's free! I've built some Telegram bots in the past, so I'm already familiar with how they work. Using Telegram as the "frontend" to my bot allows me to send a simple text message with a URL from any device, even when I'm on the go. Then, the bot will take care of automatically archiving the media and texting me back with status updates.
I used Telegraf, a lightweight framework, to handle interactions via Telegram. Creating a bot through @BotFather provided the required API token.
Here's a simplified snippet of the bot's logic:
```ts
import { Telegraf } from "telegraf";

const TELEGRAM_TOKEN = process.env.TELEGRAM_TOKEN!;
const ALLOWED_USER_ID = Number(process.env.ALLOWED_USER_ID);

const bot = new Telegraf(TELEGRAM_TOKEN);

bot.on("text", async (ctx) => {
  if (ctx.from?.id !== ALLOWED_USER_ID) {
    await ctx.reply("Unauthorized user.");
    return;
  }

  // do some stuff
  await ctx.reply("Stuff was done!");
});

bot.launch();
console.log("Bot is running...");
```
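One small thing worth adding in practice (not shown in the snippet above, but a standard Telegraf pattern) is shutting the bot down cleanly when the process receives a termination signal:

```ts
// Stop polling gracefully when the process is asked to shut down
process.once("SIGINT", () => bot.stop("SIGINT"));
process.once("SIGTERM", () => bot.stop("SIGTERM"));
```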
Downloading Content with yt-dlp
The bot leverages yt-dlp to handle video downloads.
Make sure to follow all applicable copyright laws in your country when downloading content.
Since yt-dlp is an executable, we need to spawn it as a child process (Bun implements Node's `child_process` API, so `spawn` works as usual). Here's a simplified example. One interesting note: we have to listen to `stdout` to figure out the name of the downloaded file, and we also capture `stderr` so that if something goes wrong, I can report the error to myself via a Telegram message.
```ts
import { spawn } from "node:child_process";

async function downloadVideo(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    let stdout = "";
    let stderr = "";
    const proc = spawn("yt-dlp", [url, "-o", "./downloads/%(title)s.%(ext)s", "--restrict-filenames"]);

    proc.stdout.on("data", (data) => (stdout += data.toString()));
    proc.stderr.on("data", (data) => (stderr += data.toString()));

    proc.on("close", (code) => {
      if (code === 0) {
        // yt-dlp logs the output path as "[download] Destination: <file>"
        const match = stdout.match(/\[download\] Destination: (.+)/);
        resolve(match ? match[1].trim() : "");
      } else {
        // Pass the captured stderr along so it can be relayed via Telegram
        reject(new Error(`Download failed: ${stderr.trim()}`));
      }
    });
  });
}
```
Uploading with Bun's Native S3 Support
Using Bun's native S3 client streamlined uploads to DigitalOcean Spaces:
```ts
import { S3Client } from "bun";

const s3Client = new S3Client({
  endpoint: process.env.S3_ENDPOINT!,
  bucket: process.env.S3_BUCKET!,
  accessKeyId: process.env.S3_ACCESS_KEY_ID!,
  secretAccessKey: process.env.S3_SECRET_ACCESS_KEY!,
});

async function uploadToSpaces(filePath: string): Promise<void> {
  const fileName = filePath.split("/").pop();
  if (!fileName) throw new Error("Invalid file path");

  const s3file = s3Client.file(`archive/${fileName}`);
  await Bun.write(s3file, Bun.file(filePath));
}
```
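For completeness, here's roughly how the pieces fit together. This is a hedged sketch of what replaces the `// do some stuff` placeholder in the earlier Telegram handler, reusing `downloadVideo` and `uploadToSpaces` from the previous sections along with the illustrative `isValidUrl` helper from earlier:

```ts
import { unlink } from "node:fs/promises";

// Sketch: the bot.on("text") handler fleshed out with the
// download → upload → cleanup → confirm flow.
bot.on("text", async (ctx) => {
  if (ctx.from?.id !== ALLOWED_USER_ID) {
    await ctx.reply("Unauthorized user.");
    return;
  }

  const url = ctx.message.text.trim();
  if (!isValidUrl(url)) {
    await ctx.reply("That doesn't look like a URL I can archive.");
    return;
  }

  try {
    const filePath = await downloadVideo(url);
    await uploadToSpaces(filePath);
    await unlink(filePath); // remove the temp file once it's safely in Spaces
    await ctx.reply(`Archived: ${filePath.split("/").pop()}`);
  } catch (err) {
    // Report the failure so I know to SSH in and investigate
    await ctx.reply(`Archive failed: ${(err as Error).message}`);
  }
});
```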
Deployment Steps
Deployment involved setting up a DigitalOcean droplet, installing dependencies (Bun, yt-dlp, ffmpeg), and securely transferring files from my local machine.
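The exact commands depend on the droplet image, but on a fresh Ubuntu box the setup looks roughly like this (a sketch with hypothetical paths and hostnames, using the documented install scripts for Bun and yt-dlp):

```bash
# Install Bun via its official install script
curl -fsSL https://bun.sh/install | bash

# Install ffmpeg from apt and grab the latest yt-dlp release binary
sudo apt update && sudo apt install -y ffmpeg
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp

# Copy the project over from my local machine (hypothetical paths)
rsync -avz --exclude node_modules ./ user@droplet-ip:~/archive-bot/

# Install dependencies on the server
cd ~/archive-bot && bun install
```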
To keep the bot running continuously, I used tmux:
```bash
# Start tmux session
tmux new -s archive_bot

# Run bot inside tmux
bun run main.ts

# Detach from session (bot continues running): Ctrl+B, then D

# Reattach later
tmux attach -t archive_bot
```
Monitoring and Debugging
A few tmux commands cover most of the day-to-day monitoring:

```bash
# List running sessions to confirm the bot is still up
tmux ls

# Dump the pane's entire scrollback to a file for inspection
tmux capture-pane -p -S - > session.log

# Continuously append the bot's output to a log file
tmux pipe-pane -o "cat >> bot.log"
```
Final Thoughts
Building this archive bot not only solved my practical need to preserve digital content but also deepened my understanding of modern tools like Bun and yt-dlp. In an era where digital content disappears without warning, it feels reassuring to build a small piece of digital permanence, just as the Herculaneum scrolls preserved history for future generations.
I've made the full source code for this archive bot available on GitHub: