Pretty-printing hierarchical data into a human-readable form is a common thing to do. While a computer doesn't care exactly how you format the same textual data, a human would want to view data that is nicely laid out and indented, yet compact enough to make full use of the width of the output window and avoid wasting horizontal screen space. This post presents an algorithm which achieves optimal usage of horizontal space, predictable layout, and good runtime characteristics: peak heap usage linear in the width of the output window, the ability to start and stop the pretty-printing to print any portion of it, with total runtime linear in the portion of the structure printed.
About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the contents on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming
Many data formats are hierarchical: whether textual formats like JSON or YAML, binary formats like MessagePack or Protobuf, or even program source code in common languages like Java or Python. However, the same textual data can be formatted a variety of ways: perhaps it was written without strong style conventions, or minified to send over the wire. Binary data, of course, needs to be converted to textual data for human readability. While some data formats have interactive GUIs to let you explore them, in most cases that job falls to the pretty-printer to convert the data structure into a plain-text string that is sufficiently nicely formatted that someone can skim over it and find what they want without undue difficulty.
Let us consider two samples of hierarchical data, formatted to fit within a 50 character wide screen. A JSON blob:
{
"person1": {
"name": "Alice",
"favoriteColors": ["red", "green", "blue"]
},
"person2": {
"name": "Bob",
"favoriteColors": [
"cyan",
"magenta",
"yellow",
"black"
]
}
}
And a Python code snippet:
model = Sequential(
Dense(512, activation=relu),
Dense(10, activation=softmax)
)
These two examples (both simplified from real code) are both roughly formatted to fit nicely within a 50 character wide output. Note how both examples have a mix of horizontally and vertically laid out structures: e.g. ["red", "green",
"blue"]
is horizontally laid out because it can fit within our 50 character max width, while ["cyan", "magenta", "yellow", "black"]
is vertically laid out because if laid out horizontally it would overshoot. This layout makes maximal use of the horizontal space available while also formatting things vertically where necessary, resulting in compact output that is easy to read.
While there are some variety in exactly how things should be formatted - e.g. some people prefer closing braces on the same line as the enclosing statement - this post will walk through the algorithm for the choice of pretty-printing given above. Adapting it to other styles is straightforward.
The way you pretty-print a structure depends on how wide you want it to be: for example, above we assumed a target width of 50 characters. If it was narrower, we would likely want to spread things out more vertically to avoid overshooting our target width. Here is the Javascript example formatted for a target width of 30 characters:
{
"person1": {
"name": "Alice",
"favoriteColors": [
"red",
"green",
"blue"
]
},
"person2": {
"name": "Bob",
"favoriteColors": [
"cyan",
"magenta",
"yellow",
"black"
]
}
}
Here, we can see the ["red", "green", "blue"]
array that was previously laid out horizontally now is laid out vertically to avoid hitting the line limit. And the Python code snippet:
model = Sequential(
Dense(
512,
activation=relu
),
Dense(
10,
activation=softmax
)
)
Here, we see the Dense
expressions are laid out vertically, to fit within the 30 character width limit.
Output widths can also be wider, rather than narrower. If we expand our output to 80 characters, we would expect to see more things laid out horizontally to take advantage of the added space to the right:
{
"person1": {"name": "Alice", "favoriteColors": ["red", "green", "blue"]},
"person2": {
"name": "Bob",
"favoriteColors": ["cyan", "magenta", "yellow", "black"]
}
}
model = Sequential(Dense(512, activation=relu), Dense(10, activation=softmax))
Here, we see more items laid out horizontally: the entire "person1"
in the JSON example fits on one line, as does the entire model
in the Python example, while "person2"
is still too long to fit within the 80 character limit without wrapping.
The last requirement that is often seen in pretty-printing is that of efficiency:
Constant heap usage: I would want to be able to pretty print large data structures without needing to load it all into memory at the same time for manipulation.
Linear execution time: pretty-printing should take time proportional to the thing you are trying to print, and shouldn't scale quadratically or exponentially when that thing gets bigger
Laziness: I should be able to print only the part of the structure that I want, and then stop. For large structures, you often only need to see part of it (e.g. first 100 lines) to figure out what you need, and you shouldn't need to pay to pretty-print the entire structure.
Note that most common pretty-printers fail these requirements. Specifically, even something like a .toString
function fails (1) because it has to materialize the whole thing thing in memory, and (3) because it has to construct the entire output string before you can see any output at all. This post will show you how to do better.
I will present the algorithm written in Scala, but it should be easy to understand for anyone with familiarity with programming and trivially convertible to any language of your choice.
Now that we know what our requirements are, let us define the interface of our pretty-print function:
def prettyprint(t: T, maxWidth: Int, indent: Int): Iterator[String]
In detail:
t
is the thing being pretty-printed, of type T
. That would be either our JSON data structure, our Python syntax tree, or something else
maxWidth
is the width the pretty-printer will try to avoid breaching.
indent
is how far to indent nested parts of the pretty-printed output. In the above examples, this would be 4 spaces.
The Iterator[String]
being returned represents the chunks of the pretty-printed output, that the caller of the function can stream on-demand and handle individually (writing them to a file, writing to stdout, ...) with the ability to stop early at any time and without ever materializing the whole output in memory.
One comon use case that we leave out here is a maxHeight: Int
flag: while it is very common to want to see the first N lines of the pretty-printed output without evaluating the whole thing, this is trivially implementable on top of the Iterator[String]
that prettyprint
returns, and so implementing a maxHeight
flag is left as an exercise to the reader.
To begin with, I will define a Tree
data structure: this will be used to represent the thing we are trying to print in a hierarchical fashion:
sealed trait Tree
// Foo(aa, bbb, cccc)
case class Nested(prefix: String,
children: Iterator[Tree],
sep: String,
suffix: String) extends Tree
// xyz
case class Literal(body: String) extends Tree
This defines a type Tree
with two subclasses: a Tree
is either an Nested
node representing something with a prefix/children/separator/suffix such as Foo(aa, bbb, cccc)
or (123 456 789)
, or a Literal
node representing a simple string. Note that the children
of an Nested
node is a one-shot Iterator[Tree]
, rather than a concrete Array[Tree]
: we can define a Tree
to mirror our data structure without actually materializing the whole thing in memory, as long as we only need to iterate over the tree once.
This is a relatively minimal representation, and is simplified for the sake of this blog post: it does not have handling for infix operators LHS op RHS
, any sort of terminal nodes beyond a literal string, or anything else. Nevertheless, it is enough to handle many common formats, including the examples above. the Python example:
model = Sequential(
Dense(512, activation=relu),
Dense(10, activation=softmax)
)
Can be represented as:
def python = Nested(
"model = Sequential(",
Iterator(
Nested(
"Dense(",
Iterator(Literal("512"), Literal("activation=relu")),
", ",
")"
),
Nested(
"Dense(",
Iterator(Literal("100"), Literal("activation=softmax")),
", ",
")"
),
),
", ",
")"
)
This is a bit of a mouthful, but it represents the entirety of the Tree
that can be constructed from your Python syntax tree. Note that in real code, this would be constructed from an existing structure, rather than laid out literally as above.
Similarly the JSON snippet:
{
"person1": {
"name": "Alice",
"favoriteColors": ["red", "green", "blue"]
},
"person2": {
"name": "Bob",
"favoriteColors": [
"cyan",
"magenta",
"yellow",
"black"
]
}
}
Can be represented as:
def json = Nested(
"{",
Iterator(
Nested(
"\"person1\": {",
Iterator(
Literal("\"name\": \"Alive\""),
Nested(
"\"favoriteColors\": [",
Iterator(
Literal("\"red\""),
Literal("\"green\""),
Literal("\"blue\"")
),
", ",
"]"
)
),
", ",
"}"
),
Nested(
"\"person2\": {",
Iterator(
Literal("\"name\": \"Bob\""),
Nested(
"\"favoriteColors\": [",
Iterator(
Literal("\"cyan\""),
Literal("\"magenta\""),
Literal("\"yellow\""),
Literal("\"black\"")
),
", ",
"]"
)
),
", ",
"}"
)
),
", ",
"}"
)
Again, this would typically be constructed from your JSON data structure programmatically, and because Nested
's children
is an Iterator
we do not need to materialize the entire Tree
in memory at the same time.
Now that we have an iterator-based Tree
representation, let's change the signature of our pretty-printing function slightly to take Tree
s instead of T
s:
def prettyprint(t: Tree, maxWidth: Int, indent: Int): Iterator[String]
We leave out the code to go from JSON => Tree
, or from Python Syntax Tree =>
Tree
, since that would depend on the exact API of the data structure you are trying to pretty-print. For now, we will simple assume that such * => Tree
functions exist.
The basic approach we will take with prettyprint
is:
Recurse over the Tree
, keeping track of the current left offset at every node
For each node, return a multiLine: Boolean
of whether the current node's pretty-printing is multiple lines long, and an chunks: Iterator[String]
of the chunks of pretty-printed output
For Literal
nodes, this is trivial: if the body contains a newline, it is multiple lines long, and the chunks is an iterator whose sole contents is the body
For Nested
nodes, this is more involved: to decide whether something is multiLine
or not, we buffer up the pretty-printed chunks of its children and use those chunks to decide.
If we exhaust all the children, then return multiLine = false
and an iterator over all the buffered chunks, each child's chunks joined by the separator
If a child returns multiLine = true
or we hit the maxWidth
limit, stop buffering and return multiLine = true
and a combined iterator of the buffered chunks and the iterator of remaining not-yet-buffered chunks, each child's chunks joined by both the separator and the current left offset and a newline
Essentially, each Nested
buffers up some amount of chunks from its children, enough to be able to decide whether or not it needs to be printed vertically (multiLine = true
) or there is space to be printed horizontally (multiLine =
false
).
One detail not specified above is how insertion of indentation/newlines works.
Indentation offsets are propagated recursively: every Nested
receives the current leftOffset: Int
, and passes leftOffset + indent
to its children
Actual insertion of indentation & newlines are only a matter for Nested
nodes with multiLine = true
. Such a Nested
node is only responsible for indentation & newlines between its prefix
and suffix
: it's enclosing parent (if any) will handle the indentation & newlines before prefix
and after suffix
.
In code, this looks like:
import collection.mutable.Buffer
def prettyprint(t: Tree, maxWidth: Int, indent: Int): Iterator[String] = {
def recurse(current: Tree, leftOffset: Int, enclosingSepWidth: Int): (Boolean, Iterator[String]) = {
current match{
case Literal(body) =>
val multiLine = body.contains('\n')
val chunks = Iterator(body)
(multiLine, chunks)
case Nested(prefix, children, sep, suffix) =>
var usedWidth = leftOffset + prefix.length + suffix.length + enclosingSepWidth
var multiLine = usedWidth > maxWidth
val allChildChunks = Buffer[Iterator[String]]()
// Prepare all child iterators, but do not actually consume them
for(child <- children){
val (childMultiLine, childChunks) = recurse(
child,
leftOffset + indent,
if (children.hasNext) sep.trim.length else 0
)
if (childMultiLine) multiLine = true
allChildChunks += childChunks
}
val bufferedChunks = Buffer[Buffer[String]]()
val outChunkIterator = allChildChunks.iterator
var remainingIterator: Iterator[String] = Iterator.empty
// Buffer child node chunks, until they run out or we become multiline
while(outChunkIterator.hasNext && !multiLine){
bufferedChunks.append(Buffer())
val childIterator = outChunkIterator.next()
if (outChunkIterator.hasNext) usedWidth += sep.length
while (childIterator.hasNext && !multiLine){
val chunk = childIterator.next()
bufferedChunks.last.append(chunk)
usedWidth += chunk.length
if (usedWidth > maxWidth) {
remainingIterator = childIterator
multiLine = true
}
}
}
def joinIterators(separated: Iterator[TraversableOnce[String]],
sepChunks: Seq[String]) = {
separated.flatMap(sepChunks ++ _).drop(1)
}
val middleChunks =
if (!multiLine) {
// If not multiline, just join all child chunks by the separator
joinIterators(bufferedChunks.iterator, Seq(sep))
} else{
// If multiline, piece back together the last half-consumed iterator
// of the last child we half-buffered before we stopped buffering.
val middleChildChunks = bufferedChunks.lastOption.map(_.iterator ++ remainingIterator)
// Splice it in between the chunks of the fully-buffered children
// and the not-at-all buffered children, joined with separators
joinIterators(
separated = bufferedChunks.dropRight(1).iterator ++
middleChildChunks ++
outChunkIterator,
sepChunks = Seq(sep.trim, "\n", " " * (leftOffset + indent))
) ++ Iterator("\n", " " * leftOffset)
}
val chunks = Iterator(prefix) ++ middleChunks ++ Iterator(suffix)
(multiLine, chunks)
}
}
val (_, chunks) = recurse(t, 0, 0)
chunks
}
There's a bit of messiness in keeping track of the usedWidth
, joining child iterators by the separator, and mangling the half-buffered-half-not-buffered middleChildChunks
, but otherwise it should be relatively clear what this code is doing.
We can run this on the example JSON and Python Tree
s above, using a loop to only print out the maxWidth
s at which the pretty-printing changes:
var last = ""
for(i <- 0 until 100){
val current = prettyprint(python, maxWidth = i, indent = 4).mkString
if (current != last){
println("width: " + i)
println(current)
last = current
}
}
for(i <- 0 until 200){
val current = prettyprint(json, maxWidth = i, indent = 4).mkString
if (current != last){
println("width: " + i)
println(current)
last = current
}
}
Here's the output for pretty-printing the Python source code:
width: 0
model = Sequential(
Dense(
512,
activation=relu
),
Dense(
100,
activation=softmax
)
)
width: 32
model = Sequential(
Dense(512, activation=relu),
Dense(
100,
activation=softmax
)
)
width: 34
model = Sequential(
Dense(512, activation=relu),
Dense(100, activation=softmax)
)
width: 79
model = Sequential(Dense(512, activation=relu), Dense(100, activation=softmax))
Here, we can see every width at which the pretty-printing changes:
Dense
call can be one-linedDense
call can be one-linedWe can also see an identical sort of progression in the pretty-printed JSON, with it starting off totally vertically expanded but taking advantage of the horizontal space to one-line parts of the JSON as we provide it a wider and wider acceptable width:
width: 0
{
"person1": {
"name": "Alive",
"favoriteColors": [
"red",
"green",
"blue"
]
},
"person2": {
"name": "Bob",
"favoriteColors": [
"cyan",
"magenta",
"yellow",
"black"
]
}
}
width: 50
{
"person1": {
"name": "Alive",
"favoriteColors": ["red", "green", "blue"]
},
"person2": {
"name": "Bob",
"favoriteColors": [
"cyan",
"magenta",
"yellow",
"black"
]
}
}
width: 64
{
"person1": {
"name": "Alive",
"favoriteColors": ["red", "green", "blue"]
},
"person2": {
"name": "Bob",
"favoriteColors": ["cyan", "magenta", "yellow", "black"]
}
}
width: 77
{
"person1": {"name": "Alive", "favoriteColors": ["red", "green", "blue"]},
"person2": {
"name": "Bob",
"favoriteColors": ["cyan", "magenta", "yellow", "black"]
}
}
width: 88
{
"person1": {"name": "Alive", "favoriteColors": ["red", "green", "blue"]},
"person2": {"name": "Bob", "favoriteColors": ["cyan", "magenta", "yellow", "black"]}
}
width: 160
{"person1": {"name": "Alive", "favoriteColors": ["red", "green", "blue"]}, "person2": {"name": "Bob", "favoriteColors": ["cyan", "magenta", "yellow", "black"]}}
You can copy-paste the code snippets above into any Scala program, or Scala REPL, and should see the output as shown above.
Earlier, we claimed the following properties of our pretty-printing algorithm:
Let us look at these in turn.
Our recurse
function walks over the Tree
structure in order to return the pretty-printed Iterator[String]
. One thing of note is that the Tree
nodes are "lazy": Nested
contains an iterator of children, not a concrete array of children, and so at no point is the entire tree materialized at the same time.
At any point in time, there are a number of recurse
calls on the stack linear in the depth of the tree we're printing (if it's roughly balanced, that means roughly O(log n)
the total tree size). Within each of those recurse
calls, we buffer up some number of chunks: however, the total number of buffered chunks by all calls in the call-stack cannot exceed the maxWidth
value, since each call subtracts the width of it's prefix whenever it recurses into a child.
Note that siblings in the Tree
each have separate buffers, which each can be up to maxWidth
in size. However, as siblings they are never active on the call stack at the same time: the first sibling's returned Iterator[String]
must be exhausted before the second sibling starts evaluating, so the peak heap usage is still limited by maxWidth
.
Strictly speaking, the total heap usage is O(max-width + tree-depth +
biggest-literal)
. This is much better than algorithms that involve materializing the entire O(size-of-tree)
data structure in memory to manipulate.
Since the output of prettyprint
is an Iterator[String]
, we can choose to consume as much or as little of the output as we want, resulting in a corresponding amount of computation happening. We do not need to wait for the entire pretty-printing to complete before we start receiving chunks.
Because we do not construct the entire pretty-printed output up-front, we also do not need to pay for the portions of the Tree
structure that we did not print! This means that prettyprint
ing the first few lines of a huge data structure is very fast, where-as calling a .toString
that materializes the whole output in memory could easily take a long time.
The key insight behind this prettyprint
algorithm is that for this common kind of indentation-based mixed horizontal/vertical layout, you can make layout decisions in a streaming fashion with only a bounded amount of buffering:
When you need buffer all of a child's chunks, it is to verify that a child can fit on one line, and thus buffering everything is cheap.
When a child's output is very large, we can make that determination after buffering at most one line of chunks, so that is cheap as well.
Unlike other algorithms which run a graph-search or constraint-minimization to try and find a good layout, this implementation of prettyprint
generates the layout in a single depth-first traversal of the input data. This gives the pretty-printer very predictable runtime characteristics, and allows it to be used in a streaming fashion on very large data structures without accumulating a corresponding very-large-string in memory.
The described prettyprint
algorithm is currently being used in my Scala PPrint library, though extended in a few incidental ways (colored output, infix nodes, etc.). It is also used to display values in the Ammonite Scala REPL, and I have used it to prettyprint huge data-structures. Both the quality of pretty-printed output and convenient runtime characteristics behave exactly as described.
This blog post extracts the core pretty-printing algorithm, separates it out from all the incidental concerns and documents it it for posterity. While the initial goal was to pretty-print sourcecode representations of Scala data structures, this blog post demonstrates how the core pretty-printing algorithm can be applied to any hierarchical data formats comprised of (1) terminal nodes with a plain string representation and (2) nested nodes, with a prefix, suffix, and a list of children with separators. This means simple formats like JSON or the subset of Python shown are trivially supported, while the core algorithm can be easily extended with additional Tree
subclasses to support additional syntax such as infix operators, bin-packed lists, vertical-aligned indentation, or other constructs.
About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the contents on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming