The sitemap.xml file tells search engines which pages your website contains so that they can be indexed. You can create a sitemap automatically using the official Jekyll sitemap plugin. I keep my blog in a git repository, though, so I thought I could do better at updating the <lastmod> tag, which indicates when a page was last changed. The official plugin needs an explicit date in the front matter (which is tedious to maintain correctly); otherwise it falls back to the file modification time. Git does not preserve file modification times, so that fallback yields wrong dates as soon as the repository is checked out on another computer. Instead, I wanted to use the git metadata, which records more accurately when each file was last changed.

Getting the data from git

As a first step, let us get the date of the very latest change to the git repository:

git log -1 --format="%ct"

The -1 tells git to show only the log entry for the last commit, and the %ct format outputs the so-called “committer date” of that commit as a Unix timestamp. The committer date is set when the commit is created and is updated whenever the commit is rewritten, for example by a rebase. It thus represents the latest change to the whole repository. We will use it as a fallback if no better date is available.

For individual files, though, we want to know when the commit that last touched the file was originally made. We do not care about rebases or similar operations, which leave the content intact. We thus use the “author date” and restrict the log to the file in question:

git log -1 --format="%at" -- <filename>

Both commands output a single Unix timestamp, which can be parsed easily.
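As a minimal sketch of that parsing step in Ruby: git prints the timestamp followed by a newline, and String#to_i simply ignores the newline. The helper name format_timestamp and the sample value are made up for this illustration and are not part of the plugin.

```ruby
# Hypothetical helper: turn git's raw output (a Unix timestamp
# followed by a newline) into a UTC time string.
def format_timestamp(raw)
  seconds = raw.to_i # ignores the trailing newline; "" becomes 0
  Time.at(seconds).utc.strftime("%Y-%m-%dT%H:%M:%SZ")
end

puts format_timestamp("1700000000\n") # prints 2023-11-14T22:13:20Z
```

Note that an empty string parses to 0, which is what makes a "no output" result from git easy to detect later on.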

Jekyll plugin

With the above, there is not much left to do to generate a sitemap.xml file with a Jekyll plugin. First, we define some functions to execute git. The first one, exec_git, ensures that git runs correctly even when called from a git hook, where GIT_* environment variables are set (see my blog post for details):

def exec_git(cmdline)
  env = ENV.to_hash
  # Remove all GIT_* env variables to ensure clean execution!
  env.delete_if { |key, _value| key.start_with?("GIT_") }
  # Run git with exactly this environment and return its stdout
  IO.popen(env, cmdline, unsetenv_others: true) { |stdout| stdout.read }
end

def get_timestamp_global
  exec_git(["git", "log", "-1", "--format=%ct"]).to_i
end

def get_timestamp(fname)
  exec_git(["git", "log", "-1", "--format=%at", "--", fname]).to_i
end
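To illustrate what the environment cleanup in exec_git does, here is the same filter applied to a hand-written hash instead of the real ENV. The variable values are invented for the example. Inside a hook, git exports variables such as GIT_DIR, and these would otherwise be inherited by the child git process.

```ruby
# Example environment as it might look inside a git hook
# (values are invented for illustration).
hook_env = {
  "PATH"           => "/usr/bin:/bin",
  "HOME"           => "/home/user",
  "GIT_DIR"        => ".git",
  "GIT_INDEX_FILE" => ".git/index",
}

# Same condition as in exec_git: drop every GIT_* variable.
clean_env = hook_env.reject { |key, _value| key.start_with?("GIT_") }

p clean_env.keys # prints ["PATH", "HOME"]
```

Because exec_git passes unsetenv_others, the child process sees only the variables that survive this filter.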

Next, get the “global” modification time in UTC:

sitetime = Time.at(get_timestamp_global)
               .utc
               .strftime "%Y-%m-%dT%H:%M:%SZ"

Then iterate over all posts, pages, and static files, skip everything that is not an HTML page, and try to get the git timestamp of each source file. If a file does not have one, use the global timestamp:

sitemap = []
[site.posts.docs, site.pages, site.static_files].each do |l|
  l.each do |page|
    # Skip all non-html pages.
    next unless page.url.end_with?("/", ".html")
    # Try to find source file.
    src_file = page.path
    # Get time.
    timestamp = get_timestamp(src_file)
    if timestamp == 0
      # If there is no source file, git's output is empty.
      # This yields an output of 0 from get_timestamp().
      # In this case, we use sitetime.
      time_str = sitetime
    else
      time_str = Time.at(timestamp)
                     .utc
                     .strftime "%Y-%m-%dT%H:%M:%SZ"
    end
    # We got it.
    sitemap.push( { "loc"     => page.url,
                    "lastmod" => time_str } )
  end
end
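The fallback logic of this loop can be tried without a Jekyll site or a git repository. The following is only a sketch: build_entries, page_stub, and the lookup lambda are hypothetical stand-ins for the loop body, Jekyll's page objects, and get_timestamp, and the sample dates are invented.

```ruby
# Hypothetical helper mirroring the loop body; `lookup` stands in for
# get_timestamp so the example runs without a git repository.
def build_entries(pages, sitetime, lookup)
  html = pages.select { |page| page.url.end_with?("/", ".html") }
  html.map do |page|
    timestamp = lookup.call(page.path)
    lastmod = if timestamp.zero?
                sitetime # unknown to git: fall back to the site-wide date
              else
                Time.at(timestamp).utc.strftime("%Y-%m-%dT%H:%M:%SZ")
              end
    { "loc" => page.url, "lastmod" => lastmod }
  end
end

# Stand-in for a Jekyll page: only the two attributes the loop reads.
page_stub = Struct.new(:url, :path)
pages = [
  page_stub.new("/about.html", "about.md"),      # committed file
  page_stub.new("/tags/", "generated"),          # no source file in git
  page_stub.new("/assets/site.css", "site.css"), # not HTML, skipped
]
known = { "about.md" => 1_700_000_000 }
lookup = ->(path) { known.fetch(path, 0) }

entries = build_entries(pages, "2024-01-01T00:00:00Z", lookup)
entries.each { |entry| puts "#{entry["loc"]} #{entry["lastmod"]}" }
```

The committed page gets its own author date, the generated page inherits the site-wide date, and the stylesheet does not appear at all.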

We will use a generator plugin to store the contents of the sitemap variable in the sitemap page’s entries variable and render the data using the following layout:

---
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% for entry in page.entries %}  <url>
    <loc>{{ entry.loc | prepend: site.baseurl
                      | prepend: site.url
                      | xml_escape }}</loc>
    <lastmod>{{ entry.lastmod }}</lastmod>
  </url>
{% endfor %}</urlset>
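The generator that fills page.entries is not shown in this post. As a rough sketch of what it could look like, assuming the sitemap page is a regular Jekyll page named sitemap.xml: the class name GitSitemapGenerator, the helper attach_entries, and the file name are all assumptions for this example and may differ from the actual plugin. The class is only defined when Jekyll is loaded, so the snippet also runs as a plain Ruby script with stub objects.

```ruby
# Hypothetical helper: copy the collected entries into the data hash
# of the sitemap page. Works on any object that responds to `name`
# and `data`, as Jekyll pages do.
def attach_entries(pages, entries, name = "sitemap.xml")
  target = pages.find { |page| page.name == name }
  target.data["entries"] = entries if target
  target
end

# Only define the generator when Jekyll is loaded.
if defined?(Jekyll::Generator)
  class GitSitemapGenerator < Jekyll::Generator
    def generate(site)
      sitemap = [] # would be filled by the loop shown earlier
      attach_entries(site.pages, sitemap)
    end
  end
end

# Stand-alone demonstration with stub pages instead of a real site.
page_stub = Struct.new(:name, :data)
pages = [page_stub.new("index.html", {}), page_stub.new("sitemap.xml", {})]
attach_entries(pages, [{ "loc" => "/", "lastmod" => "2024-01-01T00:00:00Z" }])
p pages.last.data["entries"].length # prints 1
```

Values stored in a page's data hash are available in Liquid as page.<key>, which is why the layout can loop over page.entries.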

The full version of the current plugin code can be found on my “code for this blog” page.