From Tedious Text to Polished HTML: Automating Your Blog with a Local LLM
I publish my posts on Medium first, then mirror them on my website so non-Medium members can read them. My site is a static, plain-HTML setup. The problem? Medium doesn’t give you a quick way to export a single article as HTML. My old workflow was clumsy: copy the article, lose all the formatting, then manually rebuild the HTML and fix it up.
This post is about how I killed that manual work by automating the process with local Ollama and small, efficient models like gemma3n and phi-3. If you’ve ever poured hours into writing a great article, only to waste more time wrangling formatting, you’ll know exactly why I built this.
The Problem: The Manual Formatting Grind
Let’s break down the manual work involved in taking a draft to a final HTML page:
- Basic Tagging: Adding <h1>, <h2>, <p>, etc.
- Code Blocks: Carefully wrapping code in <pre><code> tags, ensuring indentation is preserved, and maybe adding language classes for syntax highlighting.
- Hyperlinking: Finding all URLs and turning them into <a> tags.
- Value-Add Linking: Spotting a mention of “Microsoft,” “Apache Parquet,” or “React,” then opening a new tab, searching for the official website, and adding the hyperlink. This is valuable for SEO and reader experience but is incredibly time-consuming.
- Ad Placement: Inserting ad blocks consistently under certain headings.
The Old Solution (And Its Failings)
A programmer’s first instinct is often to reach for regular expressions (regex). We could try to write a script that:
- Finds lines that look like headings and wraps them.
- Finds blocks of indented text and assumes they are code.
- Uses a regex to find URLs.
But this approach is brittle and quickly becomes a nightmare:
- Complexity: The regex for handling all edge cases becomes monstrous and unmaintainable.
- Lack of Context: How does a regex know the difference between “Apple” the company and “apple” the fruit? How can it find the “legitimate official website” for a tool it has never heard of? It can’t.
- Inflexibility: What if you want to change the rules slightly? You’re back to wrestling with complex code instead of just stating your new requirement.
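To make the brittleness concrete, here is a minimal sketch of two such heuristics. The pattern and the function names (`link_urls`, `looks_like_heading`) are illustrative assumptions, not code from an actual script; they only show how quickly simple rules misfire.

```python
import re

# Naive pattern: "http://" or "https://" followed by any run of non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def link_urls(text: str) -> str:
    """Wrap every regex match in an anchor tag."""
    return URL_PATTERN.sub(
        lambda m: f'<a href="{m.group(0)}" target="_blank">{m.group(0)}</a>',
        text,
    )


def looks_like_heading(line: str) -> bool:
    """Heuristic: a short line that does not end with sentence punctuation."""
    stripped = line.strip()
    return 0 < len(stripped) < 60 and not stripped.endswith((".", "?", "!", ":"))


# The sentence-ending period is swallowed into the link target.
print(link_urls("Docs live at https://parquet.apache.org/."))

# A real heading passes, but so does the first line of a code block.
print(looks_like_heading("The Old Solution (And Its Failings)"))  # True
print(looks_like_heading("payload = {"))  # True
```

Each misfire can be patched with another special case, which is exactly how the pattern list grows out of control.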
But What About WYSIWYG Editors?
You might be thinking, “Why not just use a WYSIWYG (What You See Is What You Get) editor like the one in WordPress or other CMS platforms?” While these editors are great for simple posts, they fall short when it comes to specialized, high-volume, or complex content for a few key reasons:
- The “Value-Add” Bottleneck: A WYSIWYG editor can’t perform intelligent tasks like identifying a tool name (“Pandas”) and automatically linking to its official documentation. This crucial, time-consuming step remains entirely manual. You’re still stuck with the copy-paste-link workflow for every single entity.
- Lack of Batch Automation: While you can format text, you’re still doing it block-by-block. Our script processes the entire document and applies dozens of rules (headings, code blocks, entity links, ad placements) in a single, automated pass.
- Inconsistency: Manual formatting, even in a nice editor, is prone to human error. Did you remember to set target="_blank" on every external link? Is every ad block placed correctly? Automation applies the same rules on every run, so these details are far less likely to slip.
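The target="_blank" question is also easy to check mechanically once the HTML exists. Below is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and treating any `http://` or `https://` href as "external" is a simplifying assumption.

```python
from html.parser import HTMLParser


class ExternalLinkChecker(HTMLParser):
    """Collect external links that are missing target="_blank"."""

    def __init__(self):
        super().__init__()
        self.missing_target = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        # Assumption: absolute http(s) URLs count as external links.
        is_external = href.startswith(("http://", "https://"))
        if is_external and attr_map.get("target") != "_blank":
            self.missing_target.append(href)


def find_links_missing_target(html: str) -> list:
    """Return the hrefs of external links without target="_blank"."""
    checker = ExternalLinkChecker()
    checker.feed(html)
    checker.close()
    return checker.missing_target


sample_html = (
    '<p><a href="https://pandas.pydata.org/" target="_blank">Pandas</a> and '
    '<a href="https://spark.apache.org/">Apache Spark</a></p>'
)
print(find_links_missing_target(sample_html))  # ['https://spark.apache.org/']
```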
The New Solution: A Local LLM as Your Formatting Engine
Instead of writing complex logic to parse the text, what if we could just describe the final output we want? This is where LLMs shine. By giving the model a set of plain English instructions, we can delegate all the complex, context-aware formatting tasks.
Here’s a peek at the Python payload that puts it all together:
payload = {
    "model": "gemma3n:e4b",
    "prompt": prompt,
    "stream": True,
    "options": {
        "temperature": 0.2,
        "num_ctx": 16384
    }
}
And here is the prompt it carries:
prompt = """
You are an expert HTML formatter. Convert the following article text into clean, semantic HTML. Follow these rules precisely.
**Formatting Rules:**
- The first line of the text is the main title, wrap it in an <h1> tag.
- Identify all subsequent headings and use appropriate heading tags (<h2>, <h3>, etc.).
- Wrap all paragraphs in <p> tags.
- Identify blocks of code. Wrap them in <pre><code> tags. It is crucial that you preserve all original indentation and line breaks within these code blocks. If you can identify the language, add a class like `class="language-python"` to the <code> tag.
- Find all URLs (like http://example.com) and convert them into anchor tags (<a href="URL" target="_blank">URL</a>).
**Entity Linking Rule:**
- Identify names of companies, tools, frameworks, and organizations.
- For each *unique* entity, hyperlink its *first occurrence* to its legitimate official website.
- Do NOT add links inside <h1>, <h2>, <h3>, <pre>, or <code> tags.
- **Here are some examples of how to apply this rule:**
- If you see “Apache Parquet”, the first time it appears it should become <a href="https://parquet.apache.org/" target="_blank">Apache Parquet</a>.
- If you see “Apache Spark”, the first time it appears it should become <a href="https://spark.apache.org/" target="_blank">Apache Spark</a>.
- If you see “Pandas”, the first time it appears it should become <a href="https://pandas.pydata.org/" target="_blank">Pandas</a>.
- If you see “AWS Athena”, the first time it appears it should become <a href="https://aws.amazon.com/athena/" target="_blank">AWS Athena</a>.
- Use your knowledge to find the official websites for other tools and frameworks mentioned, even if they are not in these examples.
"""
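For readers who want to see how a payload like this might reach the model, here is a minimal sketch that posts it to Ollama's `/api/generate` endpoint on the default local port and stitches the streamed chunks back together. The helper names (`build_payload`, `collect_stream`, `generate_html`) and the way the article text is appended to the prompt are assumptions for illustration, not a copy of the actual script.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, article_text: str) -> dict:
    """Append the article to the instructions and wrap it in the request body."""
    return {
        "model": "gemma3n:e4b",
        # Assumption: the article simply follows the rules in the same prompt.
        "prompt": f"{prompt}\n\nArticle text:\n{article_text}",
        "stream": True,
        "options": {"temperature": 0.2, "num_ctx": 16384},
    }


def collect_stream(lines) -> str:
    """Join the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate_html(prompt: str, article_text: str) -> str:
    """Send the request to a running Ollama server and return the generated HTML."""
    body = json.dumps(build_payload(prompt, article_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        # Iterating the response yields one JSON line per streamed chunk.
        return collect_stream(response)
```

With an Ollama server running and the model pulled, something like `generate_html(prompt, draft_text)` would return the formatted article, where `draft_text` is the plain text copied from the draft. The low temperature keeps the output close to deterministic, and the larger `num_ctx` leaves room for the whole article plus the rules.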
The End Result: Time Saved, Quality Gained
What was once a 30-minute manual task of copying, pasting, and formatting is now a 1-minute automated script. We execute python process_draft.py, and out comes a beautiful, fully-formed draft.html file with:
- Semantic <h1>, <h2>, and <p> tags.
- Perfectly preserved <pre><code> blocks with language detection.
- Context-aware hyperlinks to official company and tool websites.
- Strategically placed ad blocks.
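Since a small model can occasionally drift from its instructions, it may be worth spot-checking the output against the rules in the prompt. The sketch below tests one of them, "no links inside headings, <pre>, or <code>", using the standard `html.parser`. The names are hypothetical, and it assumes the generated HTML is reasonably well-formed.

```python
from html.parser import HTMLParser

# Tags in which the prompt forbids hyperlinks.
NO_LINK_TAGS = {"h1", "h2", "h3", "pre", "code"}


class NestedLinkChecker(HTMLParser):
    """Record anchors that open inside a heading, <pre>, or <code> element."""

    def __init__(self):
        super().__init__()
        self.open_restricted = []  # stack of restricted tags currently open
        self.violations = []       # (enclosing tag, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag in NO_LINK_TAGS:
            self.open_restricted.append(tag)
        elif tag == "a" and self.open_restricted:
            href = dict(attrs).get("href")
            self.violations.append((self.open_restricted[-1], href))

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost restricted tag.
        if self.open_restricted and tag == self.open_restricted[-1]:
            self.open_restricted.pop()


def find_nested_links(html: str) -> list:
    """Return (enclosing tag, href) for every link placed where it should not be."""
    checker = NestedLinkChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


sample = (
    '<h2><a href="https://spark.apache.org/">Apache Spark</a></h2>'
    '<p><a href="https://pandas.pydata.org/">Pandas</a></p>'
)
print(find_nested_links(sample))  # [('h2', 'https://spark.apache.org/')]
```

An empty list means the rule held for that draft; anything else points at the exact link to fix by hand or to address with a tighter prompt.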
By offloading the complex, nuanced work to a locally-run LLM, we’ve created a tool that is not only more powerful and reliable than a regex-based script but also infinitely easier to update. If we want to change a rule, we just edit a line of English in our prompt. This is a game-changer for content creation workflows.