Armaan Bhojwani


Server-Side Syntax Highlighting with Pygments

I recently implemented entirely server-side syntax highlighting on this site using the Pygments syntax highlighter. Here is an example of what syntax-highlighted Python code looks like. This code is from commit fe0ed8d to Phrases. The theme is Monokai, and the font is your browser's default fixed-width font.

#!/usr/bin/env python3
# Extract Latin famous phrases from wikipedia
# Armaan Bhojwani 2020

import argparse
import sqlite3
import requests
from bs4 import BeautifulSoup

def parse_args():
    parser = argparse.ArgumentParser(
        description="Generate SQLite db of Latin famous phrases from Wikipedia.")
    parser.add_argument("-o", "--output",
                        default="phrases.db",
                        help="set custom output file location")
    parser.add_argument("-v", "--version",
                        action="store_true",
                        help="print script version")
    return parser.parse_args()

def get_html(url):
    print("downloading webpage")
    return BeautifulSoup(requests.get(url).content, "html.parser")

def prep_database():
    print("prepping database")
    c.execute("DROP TABLE IF EXISTS phrases")
    c.execute("""CREATE TABLE phrases(
              id INTEGER,
              latin TEXT,
              english TEXT,
              notes TEXT,
              length INTEGER)""")

def fill_database(list_table):
    i = 0 # phrase id
    print("iterating through tables")
    for table in list_table:
        for row in table.tbody.find_all("tr", recursive=False):
            cell = row.find_all("td", recursive=False)
            if len(cell) > 2:
                print(i, end="\r")

                latin = (cell[0].get_text(" ", strip=True)).rstrip()
                english = (cell[1].get_text(" ", strip=True)).rstrip()
                notes = (cell[2].get_text(" ", strip=True)).rstrip()
    
                c.execute("""INSERT INTO phrases
                         (id, latin, english, notes, length)
                         VALUES(?, ?, ?, ?, ?)""",
                         (i, latin, english, notes, len(latin)))
                conn.commit()
            i = i + 1

def get_tables():
    url = ("""https://en.wikipedia.org/w/index.php?title=List_of_Latin_phrases_(
          full)&oldid=986793908""")
    return get_html(url).find_all("table", attrs={"class":"wikitable"})

def main():
    if args.version:
        print(version)
    else:
        prep_database()
        fill_database(get_tables())

if __name__ == "__main__":
    version = "phrases extract.py 1.0.1"
    args = parse_args()
    conn = sqlite3.connect(args.output)
    c = conn.cursor()
    main()


The highlighting isn't perfect, but it's much better than just plain white text.

Why Pygments

There are many syntax highlighters available for the web. Unfortunately, most of them are written in JavaScript and would not work given the design constraints I have for this website (no JavaScript and only server-side computation, among other things). I needed something that works server-side and is configurable and flexible. This quickly led me to Pygments and Chroma. Chroma is written in Go and heavily inspired by Pygments. However, it does not include as many lexers and is still fairly new software.

I chose Pygments for this project because I already know Python and am not interested in learning Go just for this. I also appreciated the maturity of Pygments compared to Chroma. In the end, though, I just used the Pygments CLI rather than importing Pygments into a Python script as I had originally intended, so the language factor was a nonissue.

Implementation

The first hurdle in implementing syntax highlighting on the site was making 80 columns of code fit on the screen without needing to scroll or making the text unreadably small. I did this by increasing the width of the pages from 650px to 750px, which I'm not super happy about, but it was needed to make this work. Unfortunately, on a screen narrower than 750px, you will still have to scroll side to side. I also reduced the font size in code blocks to 0.8em, which was about as small as I could make it while maintaining good readability.

I then had to figure out how to extract the code to be highlighted, use Pygments to convert it to colored HTML, and insert the formatted HTML back into the page. I settled on keeping the code snippets in a separate directory from the Markdown that I write these posts in, using a shell script to convert the snippets to HTML through Pygments, and then using the Caddy server's templating functionality to include the formatted HTML in the Markdown. The source code for this post shows this in action.

This is the short script that I use to call Pygments. It's just one component in a suite of shell scripts I use to generate this site.

#!/usr/bin/env sh

find posts/code -type f -not -name "*.html" | \
while read i ; do
  base=$(basename ${i} | cut -d '.' -f 1)
  outp=posts/code/${base}.html
  pygmentize -f html ${i} > $outp
  sed -i 's/<\/div>/<\/pre>/g' $outp
  echo "<link rel='stylesheet' href='/resources/css/syntax.min.css'>" >> $outp
done


It first finds all of the files in the code directory that are not HTML, then converts each one, overwriting the output file if it already exists. It then uses sed to replace the closing </div> with </pre>, which fixes a small tag issue, and appends a link to the syntax stylesheet.
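
To make the naming step concrete, here is a small Python sketch of how `basename ${i} | cut -d '.' -f 1` maps a snippet path to its output path. The function name `output_path` and the sample file names are hypothetical and used only for illustration.

```python
import posixpath


def output_path(snippet):
    """Mirror `basename ${i} | cut -d '.' -f 1` from the shell script."""
    # basename: drop the directory part of the path
    name = posixpath.basename(snippet)
    # cut -d '.' -f 1: keep only the text before the first dot
    base = name.split(".", 1)[0]
    return "posts/code/" + base + ".html"


# Hypothetical snippet names, for illustration only
for snippet in ["posts/code/extract.py",
                "posts/code/extract.sh",
                "posts/code/archive.tar.gz"]:
    print(snippet, "->", output_path(snippet))
```

Because everything after the first dot is dropped, two snippets that share a stem (such as the hypothetical `extract.py` and `extract.sh`) would be written to the same `extract.html`, so snippet names need unique stems under this scheme.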

The HTML that Pygments generates is pretty ugly, with almost every word wrapped in a <span> element. For example, here is the raw HTML of the snippet above:

<div class="highlight"><pre><span></span><span class="ch">#!/usr/bin/env sh</span>

find posts/code -type f -not -name <span class="s2">&quot;*.html&quot;</span> <span class="p">|</span> <span class="se">\</span>
<span class="k">while</span> <span class="nb">read</span> i <span class="p">;</span> <span class="k">do</span>
  <span class="nv">base</span><span class="o">=</span><span class="k">$(</span>basename <span class="si">${</span><span class="nv">i</span><span class="si">}</span> <span class="p">|</span> cut -d <span class="s1">&#39;.&#39;</span> -f <span class="m">1</span><span class="k">)</span>
  <span class="nv">outp</span><span class="o">=</span>posts/code/<span class="si">${</span><span class="nv">base</span><span class="si">}</span>.html
  pygmentize -f html <span class="si">${</span><span class="nv">i</span><span class="si">}</span> &gt; <span class="nv">$outp</span>
  sed -i <span class="s1">&#39;s/&lt;\/div&gt;/&lt;\/pre&gt;/g&#39;</span> <span class="nv">$outp</span>
  <span class="nb">echo</span> <span class="s2">&quot;&lt;link rel=&#39;stylesheet&#39; href=&#39;/resources/css/syntax.min.css&#39;&gt;&quot;</span> &gt;&gt; <span class="nv">$outp</span>
<span class="k">done</span>
</pre></pre>
<link rel='stylesheet' href='/resources/css/syntax.min.css'>


Each span's style is defined in another stylesheet, also generated with Pygments. I load in the stylesheet separately because relatively few pages have syntax highlighting, so it would be a waste to include it in the main stylesheet which every page loads.
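
For reference, a stylesheet like this can be produced from Python with Pygments' `HtmlFormatter.get_style_defs`. This is a minimal sketch, not necessarily how the stylesheet for this site was generated: the `.highlight` selector matches the wrapper class in the HTML above, while printing the result to standard output is an assumption made for the example.

```python
from pygments.formatters import HtmlFormatter

# Monokai is the theme named at the top of this post.
formatter = HtmlFormatter(style="monokai")

# Scope every rule to the `.highlight` wrapper that Pygments emits,
# so the rules do not apply to other elements on the page.
css = formatter.get_style_defs(".highlight")

print(css)
```

The command-line equivalent should be along the lines of `pygmentize -S monokai -f html -a .highlight`, with the output redirected to a CSS file and minified separately.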

If this system becomes unwieldy as this website scales, I will rewrite the syntax-gen script in Python, importing Pygments as a library instead of using its command-line interface. Doing this would allow a few changes, like including the syntax stylesheet in the head of the document and extracting the code directly from code blocks within the Markdown files. That is not to say that this is impossible to do in a shell script, but it certainly is not something I would like to take on.
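
As a rough idea of what the core of such a rewrite might look like, here is a minimal sketch that highlights a snippet with Pygments as a library. It is not this site's actual code: the function name `highlight_snippet` and the sample snippet are hypothetical, while `highlight`, `get_lexer_for_filename`, and `HtmlFormatter` are part of Pygments' public API.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_for_filename


def highlight_snippet(filename, code):
    """Return highlighted HTML for `code`, choosing the lexer from `filename`."""
    # Pick a lexer based on the file name (raises ClassNotFound if none matches).
    lexer = get_lexer_for_filename(filename)
    # The default HtmlFormatter wraps its output in
    # <div class="highlight"><pre> ... </pre></div>.
    return highlight(code, lexer, HtmlFormatter())


# Hypothetical snippet, for illustration only
html = highlight_snippet("hello.py", 'print("hello")\n')
print(html)
```

Working at this level, the generated markup is an ordinary Python string, so it could be adjusted or placed into a page template in the same script rather than patched afterwards with `sed`.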

The only major downside of this method that I can see is that it creates large HTML files. This page is about 10 KB when compressed and minified, and 34 KB uncompressed. That makes it by far the biggest page on this website so far, although it is still minuscule relative to the average website.
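
The gap between the two numbers comes largely from how repetitive the generated markup is. A minimal sketch using only Python's standard library shows the effect. The sample line is copied from the raw HTML above. Repeating it 200 times and using gzip are assumptions made for illustration, so the sizes will not match the figures for this page, and real pages compress less well than a single repeated line.

```python
import gzip

# One highlighted line, taken from the raw HTML shown earlier in this post.
line = ('<span class="k">while</span> <span class="nb">read</span> i '
        '<span class="p">;</span> <span class="k">do</span>\n')

# Assumption: repeat the line to stand in for a long highlighted page.
page = (line * 200).encode("utf-8")

# Assumption: gzip stands in for whatever compression the server applies.
compressed = gzip.compress(page)

print("uncompressed bytes:", len(page))
print("gzip bytes:", len(compressed))
```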