Creating sitemap.xml with Jekyll using git data
The sitemap.xml file tells search engines which pages your website
contains so that they can index them. You can create a sitemap
automatically using the official Jekyll sitemap plugin. I keep my blog
in a git repository, so I thought I could do better at keeping the
<lastmod> tag up to date, which indicates when the page was last
changed. The
official plugin needs an explicit date in the front matter (which is
tedious to maintain correctly) or it will use the file modification
time. Git does not preserve file modification times, so this method
will get confused when checking out the repository on another
computer. Instead, I wanted to use the git metadata, which more
accurately records when the file was last changed.
Getting the data from git
As a first step, let us get the date of the very latest change to the git repository:
git log -1 --format="%ct"
The -1 tells git to only show a log entry for the last commit and
the %ct format outputs the so-called “committer date” for the last
commit. This is the last modification date of a commit, which is
updated not only when the commit is made, but also on rebasing the
commit, for example. It thus represents the latest change to the whole
repository. We will use it as a fallback if no better date is
available.
For individual files, though, we want to know when the commit was originally made and do not care about rebases or similar operations, which leave the content intact. We thus use the “author date”:
git log -1 --format="%at" -- <filename>
Both commands output a single Unix timestamp, which can be parsed easily.
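As a small illustration of that parsing step, the sketch below turns git's raw output into the UTC datetime string used for <lastmod>. The helper name lastmod_from_git_output is hypothetical and not part of the plugin; returning nil for empty output is an assumption made here to mark files that git does not know about.

```ruby
# Hypothetical helper, not part of the plugin: convert the raw output of
# `git log -1 --format=%at -- <filename>` into a UTC datetime string.
def lastmod_from_git_output(output)
  # String#to_i ignores the trailing newline and returns 0 for empty output.
  timestamp = output.to_s.strip.to_i
  # Assumption: treat 0 as "no commit found" and let the caller fall back.
  return nil if timestamp.zero?

  Time.at(timestamp).utc.strftime("%Y-%m-%dT%H:%M:%SZ")
end

puts lastmod_from_git_output("1700000000\n")
```

A caller would substitute the repository-wide timestamp whenever the helper returns nil.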
Jekyll plugin
With the above, there is not much left to do to generate a sitemap.xml
file with a Jekyll plugin. First we define some functions to execute
git. The first one will ensure that git runs correctly even when called
from a git hook (see my blog post for details):
def exec_git(cmdline)
  env = ENV.to_hash
  # Remove all GIT_* env variables to ensure clean execution!
  env.delete_if { |key, _value| key.start_with?("GIT_") }
  # Execute and return stdout
  IO.popen(env, cmdline,
           :unsetenv_others=>true) { |stdout| stdout.read }
end

def get_timestamp_global
  exec_git(["git", "log", "-1", "--format=%ct"]).to_i
end

def get_timestamp(fname)
  exec_git(["git", "log", "-1", "--format=%at", "--", fname]).to_i
end
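To make the effect of the environment filtering concrete, here is a minimal sketch of the same filter as a pure function. Git exports variables such as GIT_DIR while it runs a hook, and a nested git call would inherit them. The name clean_git_env and the sample values are made up for illustration.

```ruby
# Hypothetical stand-alone version of the filter used in exec_git:
# return a copy of the environment without any GIT_* variables.
def clean_git_env(env)
  env.reject { |key, _value| key.start_with?("GIT_") }
end

# Illustrative environment, similar to what a hook might see.
hook_env = {
  "GIT_DIR" => ".git",
  "GIT_INDEX_FILE" => ".git/index",
  "PATH" => "/usr/bin:/bin"
}

p clean_git_env(hook_env)
```

Unlike delete_if, reject leaves the original hash untouched, which is convenient when experimenting outside the plugin.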
Next, get the “global” modification time in UTC:
sitetime = Time.at(get_timestamp_global)
               .utc
               .strftime "%Y-%m-%dT%H:%M:%SZ"
Then iterate over all HTML pages and try to get their git timestamp. If they do not have one, use the global timestamp:
sitemap = []
[site.posts.docs, site.pages, site.static_files].each do |l|
  l.each do |page|
    # Skip all non-html pages.
    next if !(page.url.end_with?("/") || page.url.end_with?(".html"))
    # Try to find source file.
    src_file = page.path
    # Get time.
    timestamp = get_timestamp(src_file)
    if timestamp == 0
      # If there is no source file, git's output is empty.
      # This yields an output of 0 from get_timestamp().
      # In this case, we use sitetime.
      time_str = sitetime
    else
      time_str = Time.at(timestamp)
                     .utc
                     .strftime "%Y-%m-%dT%H:%M:%SZ"
    end
    # We got it.
    sitemap.push( { "loc" => page.url,
                    "lastmod" => time_str } )
  end
end
We will use a generator plugin to store the contents of the sitemap
variable in the sitemap page’s entries variable and render the data
using the following layout:
---
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% for entry in page.entries %}  <url>
    <loc>{{ entry.loc | prepend: site.baseurl
                      | prepend: site.url
                      | xml_escape }}</loc>
    <lastmod>{{ entry.lastmod }}</lastmod>
  </url>
{% endfor %}</urlset>
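The generator step mentioned before the layout could look roughly like the sketch below. It is not the actual plugin code: the names attach_sitemap_entries and GitSitemapGenerator are hypothetical, and it assumes the layout lives in a source file called sitemap.xml that Jekyll loads as a regular page. Values stored in a page's data hash are available to Liquid as page.<key>, which is how the layout can read page.entries.

```ruby
# Hypothetical helper: find the page named "sitemap.xml" and store the
# collected entries in its data hash, so Liquid sees them as page.entries.
def attach_sitemap_entries(pages, sitemap, name = "sitemap.xml")
  target = pages.find { |page| page.name == name }
  target.data["entries"] = sitemap unless target.nil?
  target
end

# Only define the generator when running inside Jekyll, so that the helper
# above can also be tried out in plain Ruby.
if defined?(Jekyll)
  class GitSitemapGenerator < Jekyll::Generator
    # Run late, after other generators may have added pages.
    priority :lowest

    def generate(site)
      sitemap = [] # Assumption: filled by the loop over posts and pages.
      attach_sitemap_entries(site.pages, sitemap)
    end
  end
end
```

Keeping the lookup in a plain function is a design choice for this sketch only; it allows checking the behaviour with simple stand-in objects instead of a full Jekyll site.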
The full version of the current plugin code can be found on my “code for this blog” page.