How to create Build Pipelines in Scala

Posted 2019-08-02

Build pipelines are a common pattern, where you have files and assets you want to process but want to do so efficiently and incrementally. Usually that means only re-processing files when they change, and otherwise re-using the already-processed assets as much as possible. This blog post will walk through how to use the Mill build tool to set up these build pipelines, using a real-world use case, and demonstrate the advantages a build pipeline gives you over a naive build script.

About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the content on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming.

Scala Scripts

As a worked example for this blog post, we will be starting with the simple single-file static site generator discussed here:

Scala Scripting and the 15 Minute Blog Engine

The full 50 line build.sc script is as follows:

// build.sc
import $ivy.`com.lihaoyi::scalatags:0.7.0`
import $ivy.`com.atlassian.commonmark:commonmark:0.5.1`
import scalatags.Text.all._

interp.watch(os.pwd / 'post)
val postInfo = os
  .list(os.pwd / 'post)
  .filter(_.last.contains(" - "))
  .map(p => p.last.split(" - ") match{ case Array(prefix, suffix) => (prefix, suffix, p)})
  .sortBy(_._1.toInt)

def mdNameToHtml(name: String) = name.stripSuffix(".md").replace(" ", "-").toLowerCase + ".html"

val bootstrapCss = link(
  rel := "stylesheet",
  href := "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
)

os.remove.all(os.pwd / "out" / "post")
os.makeDir.all(os.pwd / "out" / "post")
for((_, suffix, path) <- postInfo) {
  val parser = org.commonmark.parser.Parser.builder().build()
  val document = parser.parse(os.read(path))
  val renderer = org.commonmark.html.HtmlRenderer.builder().build()
  val output = renderer.render(document)
  os.write(
    os.pwd / "out" / 'post / mdNameToHtml(suffix),
    html(
      head(bootstrapCss),
      body(
        h1(a("Haoyi's Blog", href := "../index.html")),
        h1(suffix.stripSuffix(".md")),
        raw(output)
      )
    ).render
  )
}

os.write(
  os.pwd / "out" / "index.html",
  html(
    head(bootstrapCss),
    body(
      h1("Haoyi's Blog"),
      for((_, suffix, _) <- postInfo)
      yield h2(a(suffix, href := ("post/" + mdNameToHtml(suffix))))
    )
  ).render
)

To run this script:

First install Ammonite:

sudo sh -c '(echo "#!/usr/bin/env sh" && curl -L https://github.com/lihaoyi/Ammonite/releases/download/1.6.9/2.13-1.6.9) > /usr/local/bin/amm && chmod +x /usr/local/bin/amm' && amm

Second, create a post/ folder and put your .md files inside, each named in the form <number> - <title>.md, which is what the script's split(" - ") expects
Lastly, run the script using amm build.sc

This will generate an out/index.html file with the listing of all the blog posts, and an out/post/ folder containing one HTML file for each input .md. You can open these files in the browser and interact with them.
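To make the naming scheme concrete, here is a minimal standalone sketch of how mdNameToHtml from the script maps Markdown file names to output file names. The slugify variant is not part of the original script: it is a hypothetical stricter alternative, shown only because characters such as ? and ' survive mdNameToHtml and so end up in file names and link URLs.

```scala
object PostNames {
  // Copied from the script above: drop ".md", replace spaces with dashes, lower-case.
  def mdNameToHtml(name: String): String =
    name.stripSuffix(".md").replace(" ", "-").toLowerCase + ".html"

  // Hypothetical alternative (an assumption, not in the original script):
  // collapse every run of characters outside a-z and 0-9 into a single dash.
  def slugify(name: String): String = {
    val slug = name
      .stripSuffix(".md")
      .toLowerCase
      .replaceAll("[^a-z0-9]+", "-")
      .stripPrefix("-")
      .stripSuffix("-")
    slug + ".html"
  }
}
```

For example, PostNames.mdNameToHtml("What's in a Build Tool?.md") keeps the apostrophe and question mark, while PostNames.slugify on the same input yields what-s-in-a-build-tool.html. Whether that trade-off is worth it depends on how the files are served.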

Problems with Scala Scripts

While the above build.sc script works fine in small cases, there is one big limitation: the entire script runs every time. Even if you only change one blog post's .md file, every file will need to be re-processed. This is wasteful, and can be slow as the number of blog posts grows. On this blog, re-processing every post can take upwards of 20-30 seconds: a long time to wait every time you tweak some wording!

While it is possible to manually keep track of which .md file was converted into which .html file, and thus avoid re-processing .md files unnecessarily, this kind of book-keeping is tedious and easy to get wrong. This is especially true if we want to add more steps to the build process. For example, here are some possible extensions we may want to add to this build script:

Download the bootstrap.min.css file at build time and bundle it with the static site, to avoid a dependency on the third party service
Extract the first paragraph of each blog post and include it on the home page
Use git log to find when each blog post was first written, and include it on both that blog's page as well as on the home page
Deploy the completed static site to the web and make it available to the public

Each of these additional steps is something you would have to execute, cache, and decide when to re-execute. We'll now see how we can use the Mill build tool to do this automatically.
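To give a feel for the manual book-keeping described above, here is a minimal sketch of a hand-rolled cache in plain Scala. Everything in it (the ManualCache name, the use of hashCode as a change detector, the in-memory map) is an assumption made for illustration; it is not how Mill works internally, and a real version would also need to persist the cache to disk and track dependencies between steps.

```scala
import scala.collection.mutable

// Hypothetical hand-rolled cache: remembers a hash of each input together
// with the output that was last computed from it.
final class ManualCache(render: String => String) {
  private val cache = mutable.Map.empty[String, (Int, String)]

  // Counts how often the expensive render function actually ran.
  var renderCount: Int = 0

  def get(name: String, contents: String): String = {
    val hash = contents.hashCode
    cache.get(name) match {
      // Input unchanged since the last call: reuse the stored output.
      case Some((oldHash, out)) if oldHash == hash => out
      // New or changed input: re-render and remember the result.
      case _ =>
        renderCount += 1
        val out = render(contents)
        cache(name) = (hash, out)
        out
    }
  }
}
```

Even this toy version needs care to get right, and every additional build step would need its own variant of it, which is the kind of work a build tool can take over.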

Conversion to a Mill Build Pipeline

To begin with, let's install Mill:

curl -L https://github.com/lihaoyi/mill/releases/download/0.5.0/0.5.0 > mill && chmod +x mill

This makes the ./mill executable available in the current directory for you to use.

Now, we can convert the above build.sc file into a Mill build pipeline:

// build.sc
import $ivy.`com.lihaoyi::scalatags:0.7.0`
import $ivy.`com.atlassian.commonmark:commonmark:0.5.1`
import mill._, scalatags.Text.all._

interp.watch(os.pwd / 'post)
val postInfo = os
  .list(os.pwd / 'post)
  .filter(_.last.contains(" - "))
  .map(p => p.last.split(" - ") match{ case Array(prefix, suffix) => (prefix, suffix, p)})
  .sortBy(_._1.toInt)

def mdNameToHtml(name: String) = name.stripSuffix(".md").replace(" ", "-").toLowerCase + ".html"

val bootstrapCss = link(
  rel := "stylesheet",
  href := "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
)

object post extends Cross[PostModule](postInfo.map(_._1):_*)
class PostModule(number: String) extends Module{
  val Some((_, suffix, path)) = postInfo.find(_._1 == number)
  def srcPath = T.sources(path)
  def render = T{
    val Seq(src) = srcPath()
    val parser = org.commonmark.parser.Parser.builder().build()
    val document = parser.parse(os.read(src.path))
    val renderer = org.commonmark.html.HtmlRenderer.builder().build()
    val output = renderer.render(document)
    os.write(
      T.ctx().dest /  mdNameToHtml(suffix),
      html(
        head(bootstrapCss),
        body(
          h1(a("Haoyi's Blog", href := "../index.html")),
          h1(suffix.stripSuffix(".md")),
          raw(output)
        )
      ).render
    )
    PathRef(T.ctx().dest / mdNameToHtml(suffix))
  }
}

def index = T{
  os.write(
    T.ctx().dest / "index.html",
    html(
      head(bootstrapCss),
      body(
        h1("Haoyi's Blog"),
        for ((_, suffix, _) <- postInfo)
          yield h2(a(suffix, href := ("post/" + mdNameToHtml(suffix))))
      )
    ).render
  )

  PathRef(T.ctx().dest / "index.html")
}

val posts = mill.define.Task.sequence(postInfo.map(_._1).map(post(_).render))

def dist = T {
  for (post <- posts()) {
    os.copy(post.path, T.ctx().dest / 'post / post.path.last, createFolders = true)
  }
  os.copy(index().path, T.ctx().dest / "index.html")

  PathRef(T.ctx().dest)
}

Here, we are defining a cross-build of PostModules, one for each post in the post folder, each with a render target that parses the markdown into HTML, writes it to disk and returns a PathRef to the generated file. We then combine these into a single posts target containing all the generated files, and make use of that in dist, which copies them into a single folder together with the index.html generated by the index target, which contains links to the individual blog posts.

Given the following posts:

$ tree .
├── post
│   ├── 1 - Automatic Binary Serialization in uPickle 0.7.md
│   ├── 2 - Benchmarking Scala Collections.md
│   ├── 3 - What's in a Build Tool?.md
│   └── ...
└── build.sc

We can build this blog using

$ ./mill dist

And see the folder that's generated using

$ ./mill show dist
"ref:b33a3c95:/Users/lihaoyi/Github/blog/out/dist/dest"

We can list the contents using tree:

$ tree /Users/lihaoyi/Github/blog/out/dist/dest
/Users/lihaoyi/Github/blog/out/dist/dest
├── post
│   ├── automatic-binary-serialization-in-upickle-0.7.html
│   ├── benchmarking-scala-collections.html
│   ├── what's-in-a-build-tool?.html
│   └── ...
└── index.html

And open the index.html in our browser to view the blog.

Every time you run ./mill dist, Mill will only re-process the blog posts that have changed since you last ran it. You can also use ./mill --watch dist or ./mill -w dist to have Mill watch the filesystem and automatically re-process the files every time they change.

How it works

Now that we've seen working code, let us walk through the example step by step to understand it.

import $ivy.`com.lihaoyi::scalatags:0.7.0`
import $ivy.`com.atlassian.commonmark:commonmark:0.5.1`
import mill._, scalatags.Text.all._

To begin with, we import the same third-party libraries as we did in our original Scala Script: Scalatags to render HTML, and CommonMark to parse Markdown. In addition to that we import mill._, which brings the Mill-related functions into scope to build our pipelines.

interp.watch(os.pwd / 'post)
val postInfo = os
  .list(os.pwd / 'post)
  .filter(_.last.contains(" - "))
  .map(p => p.last.split(" - ") match{ case Array(prefix, suffix) => (prefix, suffix, p)})
  .sortBy(_._1.toInt)

def mdNameToHtml(name: String) = name.stripSuffix(".md").replace(" ", "-").toLowerCase + ".html"

val bootstrapCss = link(
  rel := "stylesheet",
  href := "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
)

This section is roughly the same as the Ammonite script we saw earlier: listing the markdown files and extracting their index, suffix and path to use later. Note that the interp.watch(os.pwd / 'post) is necessary because our build pipeline is dynamic: the number of post[n] modules depends on the list of files in the post/ folder, and so we need to use interp.watch to ensure Mill knows to re-compute the post[n] modules every time that folder changes.
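The file-name parsing can be tried out on its own. The sketch below is a standalone approximation of the postInfo logic, with two deliberate differences that are assumptions rather than part of the build file: it returns the prefix as an Int, and it uses split(" - ", 2) with a fallback case, so that a title which itself contains " - " or a non-numeric prefix is handled instead of failing the two-element pattern match.

```scala
import scala.util.Try

object PostInfoParsing {
  // Split "<number> - <title>.md" into its number and the rest of the name.
  // Returns None for names that do not fit that shape.
  def parse(fileName: String): Option[(Int, String)] =
    fileName.split(" - ", 2) match {
      case Array(prefix, suffix) =>
        Try(prefix.trim.toInt).toOption.map(n => (n, suffix))
      case _ => None
    }

  // Sort numerically, which is why the build file calls toInt before sorting:
  // as strings, "10" would sort before "2".
  def sortedPosts(fileNames: Seq[String]): Seq[(Int, String)] =
    fileNames.flatMap(parse).sortBy(_._1)
}
```

Files without a numeric prefix, such as a README.md sitting in the same folder, are simply skipped by this version.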

object post extends Cross[PostModule](postInfo.map(_._1):_*)
class PostModule(number: String) extends Module{
  val Some((_, suffix, path)) = postInfo.find(_._1 == number)
  def srcPath = T.sources(path)
  def render = T{
    val Seq(src) = srcPath()
    val parser = org.commonmark.parser.Parser.builder().build()
    val document = parser.parse(os.read(src.path))
    val renderer = org.commonmark.html.HtmlRenderer.builder().build()
    val output = renderer.render(document)
    os.write(
      T.ctx().dest /  mdNameToHtml(suffix),
      html(
        head(bootstrapCss),
        body(
          h1(a("Haoyi's Blog", href := "../index.html")),
          h1(suffix.stripSuffix(".md")),
          raw(output)
        )
      ).render
    )
    PathRef(T.ctx().dest / mdNameToHtml(suffix))
  }
}

This is the first part of Mill-specific functionality: we define a class PostModule extends Module, each instance of which has a def render = T{...} target, and populate a post extends Cross[PostModule] object with the indices of all the markdown files from postInfo.

Each post has a srcPath defined using T.sources; this tells the build pipeline that those specified files are inputs. When the files change, Mill then knows to invalidate the downstream build steps and re-evaluate them to keep the results up to date.

Each post's render function, rather than writing the output to a global output folder, writes it to the T.ctx().dest folder. This ensures each post gets a unique working folder and avoids conflicts. We can render each post individually using ./mill post[n].render:

$ ./mill post[1].render

$ ./mill show post[1].render
"ref:c53cc5ae:out/post/1/render/dest/code-reviewing-my-earliest-surviving-program.html"

$ ./mill show post[2].render
"ref:99e5ad6d:out/post/2/render/dest/strategic-scala-style:-principle-of-least-power.html"
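The dest folders in the output above follow a visible pattern: the target's selector, with the cross value as its own path segment, under out/ and ending in dest. The helper below is a hypothetical sketch that only mirrors the paths shown in this post (out/post/1/render/dest and out/dist/dest); it is not Mill's own logic and makes no attempt to handle cross values that contain dots or brackets.

```scala
object DestLayout {
  // Hypothetical helper: turn a target selector such as "post[1].render"
  // into the relative dest folder seen in the `./mill show` output.
  def destFolder(selector: String): String = {
    val segments = selector.replace("[", ".").replace("]", "").split('.').toSeq
    (Seq("out") ++ segments ++ Seq("dest")).mkString("/")
  }
}
```

Because every target gets its own folder under this scheme, two posts can write files with the same name without overwriting each other.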

We then prepare the index.html file, in an index target using def index = T{...}:

def index = T{
  os.write(
    T.ctx().dest / "index.html",
    html(
      head(bootstrapCss),
      body(
        h1("Haoyi's Blog"),
        for ((_, suffix, _) <- postInfo)
        yield h2(a(suffix, href := ("post/" + mdNameToHtml(suffix))))
      )
    ).render
  )

  PathRef(T.ctx().dest / "index.html")
}

This index file simply has a title ("Haoyi's Blog") and a list of links to each of the individual posts.

Lastly, we define the dist target, which assembles the blog posts and the index.html file into a single folder:

val posts = mill.define.Task.sequence(postInfo.map(_._1).map(post(_).render))

def dist = T {
  for (post <- posts()) {
    os.copy(post.path, T.ctx().dest / 'post / post.path.last, createFolders = true)
  }
  os.copy(index().path, T.ctx().dest / "index.html")

  PathRef(T.ctx().dest)
}

Here we use Task.sequence to convert the Seq[T[PathRef]] into a T[Seq[PathRef]], and make use of that in the dist target. In dist, we simply copy the already-generated HTML files for each blog post into the T.ctx().dest folder, along with the index.html file produced by the index target, and we're done. Essentially, we have defined the following pipeline: each post[n].srcPath feeds its post[n].render, the render targets are collected into posts, and posts and index both feed into dist.

We can now run ./mill dist to build the dist target, and assemble the output into a folder to use: either browsing locally, or for deployment.
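As an aside on the Task.sequence step used for posts: the idea of turning a sequence of tasks into a single task of a sequence can be shown with a toy model in plain Scala. ToyTask below is a made-up stand-in for illustration only; it has none of Mill's caching or dependency tracking, and its names are assumptions.

```scala
// Toy stand-in for a build target: a deferred computation producing an A.
// This is an illustration only, not Mill's Task type.
final class ToyTask[A](val run: () => A)

object ToyTask {
  // Turn a list of tasks into one task that yields all results, in order.
  def sequence[A](tasks: Seq[ToyTask[A]]): ToyTask[Seq[A]] =
    new ToyTask(() => tasks.map(_.run()))
}

object SequenceDemo {
  // One toy "render" task per post, each yielding the name of its output file.
  val renders: Seq[ToyTask[String]] =
    Seq("a.html", "b.html").map(name => new ToyTask(() => name))

  // A single task whose result is the list of all rendered file names.
  val all: ToyTask[Seq[String]] = ToyTask.sequence(renders)
}
```

Running SequenceDemo.all.run() evaluates each toy task in order and collects the results, which is the shape of value that dist consumes through posts().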

Extending The Blog

Now that we've defined a simple pipeline, let's consider two of the four extensions we mentioned earlier:

Download the bootstrap.min.css file at build time and bundle it with the static site, to avoid a dependency on the third party service
Extract the first paragraph of each blog post and include it on the home page

Bundling Bootstrap

Bundling bootstrap is simple. We simply define a bootstrap target to download the file:

- val bootstrapCss = link(
-   rel := "stylesheet",
-   href := "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
- )
+ def bootstrap = T{
+   os.write(
+     T.ctx().dest / "bootstrap.min.css",
+     requests.get("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css").text()
+   )
+   PathRef(T.ctx().dest / "bootstrap.min.css")
+ }

Include it in our dist:

os.copy(bootstrap().path, T.ctx().dest / "bootstrap.min.css")

And then update our two bootstrapCss links to use a local URL, first in index.html and then in the per-post pages, which live one folder deeper:

- head(bootstrapCss)
+ head(link(rel := "stylesheet", href := "bootstrap.min.css"))

- head(bootstrapCss)
+ head(link(rel := "stylesheet", href := "../bootstrap.min.css"))

Now, when you run ./mill dist, you can see that the bootstrap.min.css file is downloaded and bundled with your dist folder:

$ tree out/dist/dest/
out/dist/dest/
├── post
│   ├── automatic-binary-serialization-in-upickle-0.7.html
│   ├── what's-functional-programming-all-about?.html
│   ├── what's-in-a-build-tool?.html
│   └── ...
├── bootstrap.min.css
└── index.html

And we can see in the browser that we are now using a locally-bundled version of Bootstrap.

Since it does not depend on any T.sources, the bootstrap = T{} target never invalidates, which is usually what you want when depending on a stable URL like bootstrap/3.3.6.

We now have the same build pipeline as before, with an additional bootstrap target feeding into dist.

The code now looks like this:

// build.sc
import $ivy.`com.lihaoyi::scalatags:0.7.0`
import $ivy.`com.atlassian.commonmark:commonmark:0.5.1`
import mill._, scalatags.Text.all._

interp.watch(os.pwd / 'post)
val postInfo = os
  .list(os.pwd / 'post)
  .filter(_.last.contains(" - "))
  .map(p => p.last.split(" - ") match{ case Array(prefix, suffix) => (prefix, suffix, p)})
  .sortBy(_._1.toInt)

def mdNameToHtml(name: String) = name.stripSuffix(".md").replace(" ", "-").toLowerCase + ".html"

def bootstrap = T{
  os.write(
    T.ctx().dest / "bootstrap.min.css",
    requests.get("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css").text()
  )
  PathRef(T.ctx().dest / "bootstrap.min.css")
}

object post extends Cross[PostModule](postInfo.map(_._1):_*)
class PostModule(number: String) extends Module{
  val Some((_, suffix, path)) = postInfo.find(_._1 == number)
  def srcPath = T.sources(path)
  def render = T{
    val Seq(src) = srcPath()
    val parser = org.commonmark.parser.Parser.builder().build()
    val document = parser.parse(os.read(src.path))
    val renderer = org.commonmark.html.HtmlRenderer.builder().build()
    val output = renderer.render(document)
    os.write(
      T.ctx().dest /  mdNameToHtml(suffix),
      html(
        head(link(rel := "stylesheet", href := "../bootstrap.min.css")),
        body(
          h1(a("Haoyi's Blog", href := "../index.html")),
          h1(suffix.stripSuffix(".md")),
          raw(output)
        )
      ).render
    )
    PathRef(T.ctx().dest / mdNameToHtml(suffix))
  }
}

val posts = mill.define.Task.sequence(post.itemMap.values.map(_.render).toSeq)

def dist = T {
  for (post <- posts()) {
    os.copy(post.path, T.ctx().dest / 'post / post.path.last, createFolders = true)
  }

  os.copy(bootstrap().path, T.ctx().dest / "bootstrap.min.css")

  os.write(
    T.ctx().dest / "index.html",
    html(
      head(link(rel := "stylesheet", href := "bootstrap.min.css")),
      body(
        h1("Haoyi's Blog"),
        for ((_, suffix, _) <- postInfo)
        yield h2(a(suffix, href := ("post/" + mdNameToHtml(suffix))))
      )
    ).render
  )

  PathRef(T.ctx().dest)
}

First Paragraph Preview

To render a paragraph preview of each blog post in the index.html page, the first step is to generate such a preview for each PostModule:

class PostModule(number: String) extends Module{
  val Some((_, suffix, path)) = postInfo.find(_._1 == number)
  def srcPath = T.sources(path)
  def preview = T{
    val Seq(src) = srcPath()
    val parser = org.commonmark.parser.Parser.builder().build()
    val firstPara = os.read.lines(src.path).takeWhile(_.nonEmpty)
    val document = parser.parse(firstPara.mkString("\n"))
    val renderer = org.commonmark.html.HtmlRenderer.builder().build()
    val output = renderer.render(document)
    output
  }
  def render = T{
    ...
  }
}

Here we simply return the preview as a String (the output value) rather than writing it to a file and returning a PathRef.
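The first-paragraph extraction is easy to experiment with outside of Mill. The sketch below reproduces the takeWhile(_.nonEmpty) approach on a plain String, using String.split in place of os.read.lines so that it runs without os-lib. The firstParagraphLenient variant is a hypothetical addition, not part of the build file, for inputs that begin with a blank line or separate paragraphs with whitespace-only lines.

```scala
object Preview {
  // Same approach as the preview target: keep lines up to the first empty one.
  def firstParagraph(markdown: String): String =
    markdown.split("\n").toSeq.takeWhile(_.nonEmpty).mkString("\n")

  // Hypothetical more forgiving variant (an assumption, not in the build file):
  // skip leading blank lines and treat whitespace-only lines as paragraph breaks.
  def firstParagraphLenient(markdown: String): String =
    markdown.split("\n").toSeq
      .dropWhile(_.trim.isEmpty)
      .takeWhile(_.trim.nonEmpty)
      .mkString("\n")
}
```

With the strict version, a post whose first line is empty produces an empty preview, which may or may not be what you want.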

Next, we need to aggregate the previews the same way we aggregated the renders earlier:

val previews = mill.define.Task.sequence(post.itemMap.values.map(_.preview).toSeq)

Lastly, where we render index.html, zip the previews together with the postInfo in order to render them:

- for ((_, suffix, _) <- postInfo)
- yield h2(a(suffix, href := ("post/" + mdNameToHtml(suffix))))
+ for (((number, suffix, _), preview) <- postInfo.zip(previews()))
+ yield frag(
+   h2(a(suffix, href := ("post/" + mdNameToHtml(suffix)))),
+   raw(preview)
+ )

Now we get pretty previews in index.html!
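The zip above pairs posts and previews purely by position, so it relies on both sequences being in the same order and of the same length. The following standalone sketch uses made-up sample data to show the pairing; checkedZip is a hypothetical helper, not something the build file uses, that fails loudly if the two sides ever drift apart instead of silently dropping entries.

```scala
object PreviewZip {
  // Hypothetical sample data in a similar shape to postInfo: (number, file name).
  val postInfo: Seq[(String, String)] =
    Seq(("1", "First Post.md"), ("2", "Second Post.md"))

  // Hypothetical rendered previews, in the same order as postInfo.
  val previews: Seq[String] =
    Seq("<p>Intro to first</p>", "<p>Intro to second</p>")

  // Positional pairing, in the same style as the for-comprehension above.
  val paired: Seq[(String, String)] =
    for (((_, suffix), preview) <- postInfo.zip(previews)) yield (suffix, preview)

  // zip truncates to the shorter side, so a length check can catch a mismatch.
  def checkedZip[A, B](as: Seq[A], bs: Seq[B]): Seq[(A, B)] = {
    require(as.length == bs.length, s"length mismatch: ${as.length} vs ${bs.length}")
    as.zip(bs)
  }
}
```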

The build pipeline now has both post[n].preview and post[n].render targets for each post, with the preview targets being used in index to generate the home page and the render targets only being used in the final dist.

And here's the complete code:

// build.sc
import $ivy.`com.lihaoyi::scalatags:0.7.0`
import $ivy.`com.atlassian.commonmark:commonmark:0.5.1`
import mill._, scalatags.Text.all._

interp.watch(os.pwd / 'post)
val postInfo = os
  .list(os.pwd / 'post)
  .filter(_.last.contains(" - "))
  .map(p => p.last.split(" - ") match{ case Array(prefix, suffix) => (prefix, suffix, p)})
  .sortBy(_._1.toInt)

def mdNameToHtml(name: String) = name.stripSuffix(".md").replace(" ", "-").toLowerCase + ".html"

def bootstrap = T{
  os.write(
    T.ctx().dest / "bootstrap.min.css",
    requests.get("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css").text()
  )
  PathRef(T.ctx().dest / "bootstrap.min.css")
}

object post extends Cross[PostModule](postInfo.map(_._1):_*)
class PostModule(number: String) extends Module{
  val Some((_, suffix, path)) = postInfo.find(_._1 == number)
  def srcPath = T.sources(path)
  def renderMarkdown(s: String) = {
    val parser = org.commonmark.parser.Parser.builder().build()

    val document = parser.parse(s)
    val renderer = org.commonmark.html.HtmlRenderer.builder().build()
    renderer.render(document)
  }
  def preview = T{
    val Seq(src) = srcPath()
    val firstPara = os.read.lines(src.path).takeWhile(_.nonEmpty)
    renderMarkdown(firstPara.mkString("\n"))
  }
  def render = T{
    val Seq(src) = srcPath()
    val output = renderMarkdown(os.read(src.path))
    os.write(
      T.ctx().dest /  mdNameToHtml(suffix),
      html(
        head(link(rel := "stylesheet", href := "../bootstrap.min.css")),
        body(
          h1(a("Haoyi's Blog", href := "../index.html")),
          h1(suffix.stripSuffix(".md")),
          raw(output)
        )
      ).render
    )
    PathRef(T.ctx().dest / mdNameToHtml(suffix))
  }
}

val posts = mill.define.Task.sequence(post.itemMap.values.map(_.render).toSeq)
val previews = mill.define.Task.sequence(post.itemMap.values.map(_.preview).toSeq)

def index = T{
  os.write(
    T.ctx().dest / "index.html",
    html(
      head(link(rel := "stylesheet", href := "bootstrap.min.css")),
      body(
        h1("Haoyi's Blog"),
        for (((number, suffix, _), preview) <- postInfo.zip(previews()))
        yield frag(
          h2(a(suffix, href := ("post/" + mdNameToHtml(suffix)))),
          raw(preview)
        )
      )
    ).render
  )
  PathRef(T.ctx().dest / "index.html")
}

def dist = T {
  for (post <- posts()) {
    os.copy(post.path, T.ctx().dest / 'post / post.path.last, createFolders = true)
  }

  os.copy(bootstrap().path, T.ctx().dest / "bootstrap.min.css")
  os.copy(index().path, T.ctx().dest / "index.html")

  PathRef(T.ctx().dest)
}

Exercises

In the interest of time, this blog post only walks through the first two of the four extensions we mentioned earlier. That leaves the remaining two:

Use git log to find when each blog post was first written, and include it on both that blog's page as well as on the home page
Deploy the completed static site to the web and make it available to the public

These would require two more concepts we haven't seen so far:

T.input tasks, letting your build react to arbitrary changes in the system
T.command tasks, which can perform arbitrary actions without caching.

Implementing those extensions using these Mill features is left as an exercise to the reader!

Conclusion

In this blog post, we have seen how to take a simple Scala script that generates a static website, and convert it into a Mill build pipeline. Unlike a naive script, this pipeline allows fast incremental updates whenever the underlying sources change. We have also seen how to extend the Mill build pipeline with additional build steps, such as bundling CSS files or showing post previews, all while preserving the ability to do fast incremental updates.

While the Mill build tool is often used to compile Java and Scala source code into executables, it can also be used to create general-purpose build pipelines for all sorts of data transformations. You simply specify what each build step needs as input and what computation it performs, and Mill will handle all the ordering, caching and invalidation for you, giving you blazing fast incremental builds without any manual effort.

This blog post is just a quick introduction to the ideas and concepts behind the Mill build tool. For a more thorough reference, take a look at the Mill documentation:

Mill Documentation