Working with files and the filesystem is one of the most common things you do when programming. This tutorial will walk through how to easily work with files in the Scala programming language, in a way that scales from interactive usage in the REPL, to your first Scala scripts, to usage in a production system or application.
About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the contents on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming.
The easiest way to work with the filesystem in Scala is through the OS-Lib filesystem library. OS-Lib is available on Maven Central for you to use with any version of Scala:
// SBT
"com.lihaoyi" %% "os-lib" % "0.2.7"
// Mill
ivy"com.lihaoyi::os-lib:0.2.7"
OS-Lib also comes bundled with Ammonite, and can be used within the REPL and *.sc script files.
All functionality within this library comes from the os package, e.g. os.Path, os.read, os.list, and so on. To begin with, I will install Ammonite:
$ sudo sh -c '(echo "#!/usr/bin/env sh" && curl -L https://github.com/lihaoyi/Ammonite/releases/download/1.6.7/2.12-1.6.7) > /usr/local/bin/amm && chmod +x /usr/local/bin/amm'
And open the Ammonite REPL, using os.<tab> to see the list of available operations:
$ amm
Loading...
Welcome to the Ammonite Repl 1.6.7
(Scala 2.12.8 Java 11.0.2)
@ os.<tab>
/ RelPath list
BasePath ResourceNotFoundException makeDir
BasePathImpl ResourcePath move
BasicStatInfo ResourceRoot mtime
Bytes SeekableSource owner
CommandResult SegmentedPath perms
FilePath Shellable proc
...
From there, we can begin our tutorial.
Most operations we will be working with involve filesystem paths: we read data from a path, write data to a path, copy files from one path to another, or list a folder path to see what files are inside of it. This is represented by the os.Path type.
By default, you have a few paths available: os.pwd, os.root, os.home:
@ os.pwd
res0: os.Path = root/'Users/'lihaoyi/'Github/'blog
@ os.root
res1: os.Path = root
@ os.home
res2: os.Path = root/'Users/'lihaoyi
These refer to your process working directory, filesystem root, and user home folder respectively. To refer to paths relative to an existing path, you can use the / operator to add additional path segments:
@ os.pwd / "post"
res3: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post
@ os.home / "Github" / "blog"
res4: os.Path = root/'Users/'lihaoyi/'Github/'blog
Note that you can only append single path segments to a path using the / operator on strings, e.g. this is not allowed:
@ os.home / "Github/blog"
os.PathError$InvalidSegment: [Github/blog] is not a valid path segment.
[/] is not a valid character to appear in a path segment. If you want to parse
an absolute or relative path that may have multiple segments, e.g. path-strings
coming from external sources use the Path(...) or RelPath(...) constructor calls
to convert them.
As the error message suggests, you need to use the os.RelPath constructor in order to construct a relative path of more than one segment. This helps avoid confusion between working with individual path segments (as Strings) and working with more general relative paths (as os.RelPaths).
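As a small sketch of that distinction (the segment names are only illustrative, and OS-Lib is assumed to be on the classpath, as it is inside Ammonite), a multi-segment string can be parsed into an os.RelPath before it is appended. The result should equal the path built one segment at a time:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// A string containing "/" must be parsed into an os.RelPath before appending.
val viaRelPath = os.home / os.RelPath("Github/blog")

// Appending single segments one at a time builds the same absolute path.
val viaSegments = os.home / "Github" / "blog"
```

Keeping the two forms separate means a stray "/" in a single segment fails loudly instead of silently creating nested folders.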
You can also use the special os.up path segment to move up one level:
@ os.pwd
res5: os.Path = root/'Users/'lihaoyi/'Github/'blog
@ os.pwd / os.up
res6: os.Path = root/'Users/'lihaoyi/'Github
@ os.pwd / os.up / os.up
res7: os.Path = root/'Users/'lihaoyi
@ os.pwd / os.up / os.up / os.up
res8: os.Path = root/'Users
@ os.pwd / os.up / os.up / os.up / os.up
res9: os.Path = root
You can construct os.Paths from strings:
@ os.Path("/")
res10: os.Path = root
@ os.Path("/Users/lihaoyi")
res11: os.Path = root/'Users/'lihaoyi
This is helpful when paths are coming in from elsewhere, e.g. read from a file or command-line arguments.
Note that by default this only allows absolute paths:
@ os.Path("post")
java.lang.IllegalArgumentException: requirement failed: post is not an absolute path
If you want to take in a path that is relative, you have to provide a base path from which that relative path will begin:
@ os.Path("post", base = os.pwd)
res13: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post
@ os.Path("../Ammonite", base = os.pwd)
res14: os.Path = root/'Users/'lihaoyi/'Github/'Ammonite
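One place this comes up is when handling path strings supplied by a user. The sketch below wraps the base-path constructor in a helper; the name resolveArg is hypothetical, and it relies on the assumption that os.Path returns an already-absolute string unchanged and ignores the base in that case:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// Hypothetical helper: turn a user-supplied path string into an absolute os.Path.
// Relative strings are resolved against `base`; absolute strings are assumed
// to be returned as-is, with `base` ignored.
def resolveArg(raw: String, base: os.Path = os.pwd): os.Path =
  os.Path(raw, base = base)

val exampleBase = os.root / "Users" / "lihaoyi"
val child = resolveArg("Github/blog", exampleBase)   // nested relative path
val sibling = resolveArg("../someone", exampleBase)  // ".." is resolved on construction
```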
If you want to model relative paths, you want an os.RelPath:
@ os.RelPath("post")
res20: os.RelPath = 'post
@ os.RelPath("../hello/world")
res21: os.RelPath = up/'hello/'world
This helps ensure you do not mix up what you are working with: os.Paths are always absolute, and os.RelPaths are always relative. To convert a relative path to an absolute path, you can use the same / operator:
@ val postFolder = os.RelPath("post")
postFolder: os.RelPath = 'post
@ os.pwd / postFolder
res23: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post
@ val helloWorldFolder = os.RelPath("../hello/world")
helloWorldFolder: os.RelPath = up/'hello/'world
@ os.home / helloWorldFolder
res25: os.Path = root/'Users/'hello/'world
If you want the relative path between two absolute paths, you can use .relativeTo:
@ val githubPath = os.Path("/Users/lihaoyi/Github")
githubPath: os.Path = root/'Users/'lihaoyi/'Github
@ val usersPath = os.Path("/Users")
usersPath: os.Path = root/'Users
@ githubPath.relativeTo(usersPath)
res36: os.RelPath = 'lihaoyi/'Github
@ usersPath.relativeTo(githubPath)
res37: os.RelPath = up/up
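A common use of .relativeTo is mirroring a path from one folder tree into another. The following is a minimal sketch with illustrative folder names; no files are touched, only path values are computed:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// Illustrative folder names; nothing is read from or written to disk here.
val mirrorSrc = os.Path("/Users/lihaoyi/Github/blog/post")
val mirrorDest = os.Path("/Users/lihaoyi/Github/blog/post-copy")
val original = mirrorSrc / "Reimagining" / "GithubSearch.png"

// Strip the source prefix, then re-attach the remainder under the destination.
val remainder = original.relativeTo(mirrorSrc)
val mirrored = mirrorDest / remainder
```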
os.Paths always resolve any .. segments:
@ val githubPathOne = os.Path("/Users/lihaoyi/Github/../Github")
githubPathOne: os.Path = root/'Users/'lihaoyi/'Github
@ val githubPathTwo = os.Path("/Users/lihaoyi/Github/../Github/../Github")
githubPathTwo: os.Path = root/'Users/'lihaoyi/'Github
@ githubPathOne == githubPathTwo
res17: Boolean = true
They also collapse redundant/unnecessary /s, whether in the middle of a path or trailing:
@ os.Path("/Users/lihaoyi////Github/")
res18: os.Path = root/'Users/'lihaoyi/'Github
@ os.Path("/Users/lihaoyi/Github") == os.Path("/Users/lihaoyi////Github/")
res19: Boolean = true
Thus, you can be sure that an os.Path is always in its canonical representation, and can be easily printed, compared, sorted, de-duplicated, etc.
Relative os.RelPaths are also canonical:
@ val helloPathOne = os.RelPath("../hello/world")
helloPathOne: os.RelPath = up/'hello/'world
@ val helloPathTwo = os.RelPath("../hello/../hello/world//../world")
helloPathTwo: os.RelPath = up/'hello/'world
@ helloPathOne == helloPathTwo
res29: Boolean = true
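Because paths are normalized when they are parsed, de-duplicating differently spelled inputs needs no extra work. Here is a small sketch using made-up input strings in the style of the examples above:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// Several spellings of two folders, as they might arrive from user input.
val rawInputs = Seq(
  "/Users/lihaoyi/Github",
  "/Users/lihaoyi/Github/../Github",
  "/Users/lihaoyi////Github/",
  "/Users/lihaoyi"
)

// Parsing normalizes each string, so equal locations become equal values
// and ordinary collection operations like .distinct can be used directly.
val uniquePaths = rawInputs.map(os.Path(_)).distinct
```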
Given an absolute path and a relative path:
@ val githubPath = os.Path("/Users/lihaoyi/Github")
githubPath: os.Path = root/'Users/'lihaoyi/'Github
@ val postPath = os.RelPath("post")
postPath: os.RelPath = 'post
You can extend an absolute path with a relative path:
@ githubPath / postPath
res32: os.Path = root/'Users/'lihaoyi/'Github/'post
Or a relative path with another relative path:
@ postPath / postPath
res33: os.RelPath = 'post/'post
But you cannot extend an absolute path with an absolute path:
@ githubPath / githubPath
cmd34.sc:1: type mismatch;
found : os.Path
required: os.RelPath
val res34 = githubPath / githubPath
^
Compilation Failed
Or a relative path with an absolute path:
@ postPath / githubPath
cmd34.sc:1: type mismatch;
found : os.Path
required: os.RelPath
val res34 = postPath / githubPath
^
Compilation Failed
It basically never makes sense to extend something with an absolute path, and the os.Path type makes sure you do not do so by accident.
The first thing you may want to do is see what's available in a particular folder, which you can do using os.list:
@ os.list(os.pwd)
res38: WrappedArray[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/".gitignore",
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
root/'Users/'lihaoyi/'Github/'blog/'post,
)
os.walk for a recursive listing:
@ os.walk(os.pwd)
res40: IndexedSeq[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/'post,
root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining,
root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubSearch.png",
...
You can also use os.stat, os.isFile, os.size, etc. to read metadata of individual files or folders.
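As a rough illustration of these metadata calls, the sketch below creates a throwaway folder with os.temp.dir() and queries it; the file and folder names are invented for the example:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// Work in a temporary folder so the example does not touch real files.
val metaDir = os.temp.dir()
os.write(metaDir / "hello.txt", "Hello World") // 11 bytes of ASCII text
os.makeDir(metaDir / "sub")

val helloIsFile = os.isFile(metaDir / "hello.txt")
val subIsDir = os.isDir(metaDir / "sub")
val helloSize = os.size(metaDir / "hello.txt")      // size in bytes, as a Long
val missingExists = os.exists(metaDir / "nope.txt") // nothing was created here
```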
os.read to read a file:
@ os.read(os.pwd / ".gitignore")
res39: String = """target/
scratch/
*.iml
.idea
.settings
.classpath
.project
.cache
.sbtserver
project/.sbtserver
tags
"""
os.write to write a file:
@ os.write(os.pwd / "new.txt", "Hello World")
@ os.list(os.pwd)
res42: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/".gitignore",
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
root/'Users/'lihaoyi/'Github/'blog/"new.txt",
root/'Users/'lihaoyi/'Github/'blog/'post,
)
@ os.read(os.pwd / "new.txt")
res43: String = "Hello World"
os.move to move a file:
@ os.move(os.pwd / "new.txt", os.pwd / "newer.txt")
@ os.list(os.pwd)
res45: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/".gitignore",
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
root/'Users/'lihaoyi/'Github/'blog/"newer.txt",
root/'Users/'lihaoyi/'Github/'blog/'post,
)
os.copy to copy a file:
@ os.copy(os.pwd / "newer.txt", os.pwd / "newer-2.txt")
@ os.list(os.pwd)
res47: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/".gitignore",
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
root/'Users/'lihaoyi/'Github/'blog/"newer-2.txt",
root/'Users/'lihaoyi/'Github/'blog/"newer.txt",
root/'Users/'lihaoyi/'Github/'blog/'post,
)
os.remove to remove a file:
@ os.remove(os.pwd / "newer.txt")
@ os.list(os.pwd)
res49: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/".gitignore",
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
root/'Users/'lihaoyi/'Github/'blog/"newer-2.txt",
root/'Users/'lihaoyi/'Github/'blog/'post,
)
os.makeDir to create a new folder:
@ os.makeDir(os.pwd / "new-folder")
@ os.list(os.pwd)
res51: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/".gitignore",
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
root/'Users/'lihaoyi/'Github/'blog/"new-folder",
root/'Users/'lihaoyi/'Github/'blog/"newer-2.txt",
root/'Users/'lihaoyi/'Github/'blog/'post,
)
Many of these commands take flags that let you configure the operation (e.g. os.read lets you pass in an offset to read from and a count of characters to read) and have variants like os.read.bytes to read binary data or os.read.lines to read lines. os.makeDir has os.makeDir.all to recursively create necessary folders, os.remove has os.remove.all to recursively remove a folder and its contents, and so on.
The linked documentation for each command goes into more detail of what you can do with each one.
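To make a few of these variants concrete, here is a short sketch that runs inside a temporary folder; the nested folder and file names are invented for illustration:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
val scratch = os.temp.dir()
val nested = scratch / "a" / "b" / "c"

// os.makeDir.all creates any missing parent folders along the way.
os.makeDir.all(nested)
os.write(nested / "notes.txt", "first\nsecond\nthird")

// The same file read as a sequence of lines, and as raw bytes.
val noteLines = os.read.lines(nested / "notes.txt")
val noteBytes = os.read.bytes(nested / "notes.txt")

// os.remove.all deletes the folder together with everything inside it.
os.remove.all(scratch / "a")
```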
Many operations expose a .stream variant, which allows you to process its output in a streaming fashion. This avoids accumulating all the output in memory, letting you process large results without causing memory issues.
For example, os.read.lines.stream to stream the lines of a file:
@ os.read.lines.stream(os.pwd / ".gitignore").foreach(println)
target/
scratch/
*.iml
.idea
.settings
.classpath
.project
.cache
.sbtserver
project/.sbtserver
tags
os.list.stream for streaming the contents of a folder:
@ os.list.stream(os.pwd).foreach(println)
/Users/lihaoyi/Github/blog/build.sc
/Users/lihaoyi/Github/blog/post
/Users/lihaoyi/Github/blog/target
/Users/lihaoyi/Github/blog/favicon.png
/Users/lihaoyi/Github/blog/pages.sc
/Users/lihaoyi/Github/blog/.gitignore
/Users/lihaoyi/Github/blog/new-folder
/Users/lihaoyi/Github/blog/newer-2.txt
/Users/lihaoyi/Github/blog/blog.iml
/Users/lihaoyi/Github/blog/.git
/Users/lihaoyi/Github/blog/pageStyles.sc
/Users/lihaoyi/Github/blog/.idea
*.stream operations return a Generator type. These are similar to iterators, except they ensure that resources are always released after processing. This helps avoid leaking file handles or other filesystem resources. Other than that, most collection operators like .foreach, .map, .filter, .toArray, etc. all apply.
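As a sketch of chaining those operators on a Generator, the example below streams a small ignore-style file and keeps only the meaningful lines. The file contents are made up, and the "#" comment convention is an assumption of this example rather than anything OS-Lib defines:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
val streamDir = os.temp.dir()
val ignoreFile = streamDir / "ignore.txt"
os.write(ignoreFile, "target/\n# comment\n*.iml\n\n.idea\n")

// Lines are pulled from the file one at a time; only the filtered and
// trimmed results are accumulated by the final .toArray call.
val patterns = os.read.lines.stream(ignoreFile)
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .map(_.trim)
  .toArray
```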
Now that we've gone over the basic operations that you can perform on a filesystem, let's walk through a simple use case.
Often when your disk is full, you want to look for the biggest files that you can remove to free up space. We can do this in a few steps:
First we list all the files and folders in a particular folder (for now just using os.pwd):
@ val allPaths = os.walk(os.pwd)
allPaths: IndexedSeq[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/'post,
root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
...
Next, we can filter out the folders so we're only looking at files:
@ val allFiles = allPaths.filter(os.isFile)
allFiles: IndexedSeq[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/"build.sc",
root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
Find out how big each file is by using .map:
@ val sizedFiles = allFiles.map(path => (os.size(path), path))
sizedFiles: IndexedSeq[(Long, os.Path)] = ArrayBuffer(
(8134L, root/'Users/'lihaoyi/'Github/'blog/"build.sc"),
(73028L, root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md"),
(49727L, root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md"),
(17269L, root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md"),
...
Lastly, sort by size and take the last 5, which are the largest:
@ sizedFiles.sortBy(_._1).takeRight(5)
res59: IndexedSeq[(Long, os.Path)] = ArrayBuffer(
(5499949L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'slides/"Why-You-Might-Like-Scala.js.pdf"),
(6008395L, root/'Users/'lihaoyi/'Github/'blog/'post/'SmartNation/"routes.json"),
(6008395L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'SmartNation/"routes.json"),
(6340270L, root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubHistory.gif"),
(6340270L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'Reimagining/"GithubHistory.gif")
)
Here, we can see the 5 largest files: in this folder, it's a number of large GIFs, JSON datasets, and a PDF document. You can do all this in one command using:
@ os.walk(os.pwd).filter(os.isFile).map(path => (os.size(path), path)).sortBy(_._1).takeRight(5)
res60: IndexedSeq[(Long, os.Path)] = ArrayBuffer(
(5499949L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'slides/"Why-You-Might-Like-Scala.js.pdf"),
(6008395L, root/'Users/'lihaoyi/'Github/'blog/'post/'SmartNation/"routes.json"),
(6008395L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'SmartNation/"routes.json"),
(6340270L, root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubHistory.gif"),
(6340270L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'Reimagining/"GithubHistory.gif")
)
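If you run this kind of query often, the one-liner could be wrapped in a small function. The sketch below is one possible shape: the name largestFiles and its parameters are hypothetical, and the demo folder is created only so the example has something to measure:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// Hypothetical helper: the n largest files under dir, smallest first,
// matching the ordering of the one-liner above.
def largestFiles(dir: os.Path, n: Int): Seq[(Long, os.Path)] =
  os.walk(dir)
    .filter(p => os.isFile(p))       // drop folders
    .map(p => (os.size(p), p))       // pair each file with its size in bytes
    .sortBy(_._1)
    .takeRight(n)

// Small demo tree with files of 1, 10 and 100 bytes.
val sizeDemo = os.temp.dir()
os.makeDir(sizeDemo / "nested")
os.write(sizeDemo / "small.txt", "x")
os.write(sizeDemo / "nested" / "medium.txt", "x" * 10)
os.write(sizeDemo / "nested" / "large.txt", "x" * 100)

val top2 = largestFiles(sizeDemo, 2)
```

For very large folder trees, the same pipeline could presumably start from os.walk.stream instead, at the cost of collecting the results before sorting.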
Let's walk through a second use case: write a program that will take a source and destination folder, and efficiently update the destination folder to look like the source folder as files are added to it or modified (for simplicity, we will ignore deletions).
@ val src = os.pwd / "post"; val dest = os.pwd / "post-copy"
src: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post
dest: os.Path = root/'Users/'lihaoyi/'Github/'blog/"post-copy"
Let's also assume that simply deleting the destination and re-copying the source over is too inefficient:
@ os.remove.all(dest)
@ os.copy.all(src, dest)
And we want to do it on a per-file/folder basis.
To begin with, we need to recursively walk all contents of the source folder:
@ val srcContents = os.walk(src)
srcContents: IndexedSeq[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining,
root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubSearch.png",
...
Then, we iterate over every entry, and see if it's a file or folder:
@ for(path <- srcContents) println(os.isDir(path))
false
false
false
true
false
false
For simplicity, we'll ignore the presence of symbolic links, detectable via os.isLink.
We can find the corresponding isDir for the destination path using:
@ for(path <- srcContents) println(os.isDir(dest / path.relativeTo(src)))
false
false
false
false
false
false
For now, the destination folder doesn't exist, so isDir returns false on all of the paths.
Next, we walk over the srcContents and the corresponding paths in dest together, and if they differ, delete the destination sub-path and copy the source sub-path over:
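@ for(srcSubPath <- srcContents) {
    val destSubPath = dest / srcSubPath.relativeTo(src)
    (os.isDir(srcSubPath), os.isDir(destSubPath)) match{
      // one side is a file and the other a folder: replace the destination
      case (false, true) | (true, false) => os.copy.over(srcSubPath, destSubPath)
      // both are files: copy if dest is missing or its contents differ.
      // Arrays compare by reference with !=, so compare the bytes explicitly
      case (false, false)
        if !os.exists(destSubPath)
        || !java.util.Arrays.equals(os.read.bytes(srcSubPath), os.read.bytes(destSubPath)) =>
        os.copy.over(srcSubPath, destSubPath, createFolders = true)
      case _ => // do nothing
    }
  }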
Now, we can walk the dest path and see all our contents in place:
@ os.walk(dest)
res13: IndexedSeq[os.Path] = ArrayBuffer(
root/'Users/'lihaoyi/'Github/'blog/"post-copy"/"9 - Micro-optimizing your Scala code.md",
root/'Users/'lihaoyi/'Github/'blog/"post-copy"/"24 - How to conduct a good Programming Interview.md",
root/'Users/'lihaoyi/'Github/'blog/"post-copy"/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
root/'Users/'lihaoyi/'Github/'blog/"post-copy"/'Reimagining,
root/'Users/'lihaoyi/'Github/'blog/"post-copy"/'Reimagining/"GithubSearch.png",
root/'Users/'lihaoyi/'Github/'blog/"post-copy"/'Reimagining/"GithubBrowsing.gif",
We can wrap this all in a function for easy usage:
@ def sync(src: os.Path, dest: os.Path) = {
    val srcContents = os.walk(src)
    for(srcSubPath <- srcContents) {
      val destSubPath = dest / srcSubPath.relativeTo(src)
      (os.isDir(srcSubPath), os.isDir(destSubPath)) match{
        // one side is a file and the other a folder: replace the destination
        case (false, true) | (true, false) => os.copy.over(srcSubPath, destSubPath)
        // both are files: copy if dest is missing or its contents differ.
        // Arrays compare by reference with !=, so compare the bytes explicitly
        case (false, false)
          if !os.exists(destSubPath)
          || !java.util.Arrays.equals(os.read.bytes(srcSubPath), os.read.bytes(destSubPath)) =>
          os.copy.over(srcSubPath, destSubPath, createFolders = true)
        case _ => // do nothing
      }
    }
  }
defined function sync
To test incremental updates, we can try adding an entry to the src folder:
@ os.write(src / "ABC.txt", "Hello World")
Running the sync:
@ sync(src, dest)
We can then see our file has been synced over to dest:
@ os.exists(dest / "ABC.txt")
res29: Boolean = true
@ os.read(dest / "ABC.txt")
res30: String = "Hello World"
And modifications to that file also get synced over:
@ os.write.append(src / "ABC.txt", "\nI am Cow")
@ sync(src, dest)
@ os.read(dest / "ABC.txt")
res33: String = """Hello World
I am Cow"""
This use case is greatly simplified so it can fit within a blog post: we do not consider deletions, syncing permissions, sub-file level syncing of data (e.g. Dropbox famously syncs in 4MB blocks), or concurrency/parallelism concerns. Nevertheless, it should give you a good sense of how working with the filesystem via Scala's OS-Lib library works, and you can easily extend it if you need more functionality.
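As one example of such an extension, deletions could be handled by a second pass over the destination. The following is only a sketch under simple assumptions (no symbolic links, both folders exist); the name syncDeletions is hypothetical and is not part of OS-Lib or of the sync function above:

```scala
// Assumes OS-Lib is available, e.g. inside the Ammonite REPL or a *.sc script.
// Hypothetical companion to sync: remove entries in dest that are gone from src.
def syncDeletions(src: os.Path, dest: os.Path): Unit = {
  // os.walk returns a fully built list, so deleting while looping is safe.
  for (destSubPath <- os.walk(dest)) {
    val srcSubPath = src / destSubPath.relativeTo(dest)
    // A parent folder may already have been removed earlier in this loop,
    // so check that the destination entry still exists before deleting it.
    if (!os.exists(srcSubPath) && os.exists(destSubPath)) os.remove.all(destSubPath)
  }
}

// Demo: dest contains one file that still exists in src and two stale entries.
val delSrc = os.temp.dir()
val delDest = os.temp.dir()
os.write(delSrc / "keep.txt", "Hello World")
os.write(delDest / "keep.txt", "Hello World")
os.write(delDest / "stale.txt", "old")
os.makeDir(delDest / "old-folder")
os.write(delDest / "old-folder" / "inner.txt", "old")

syncDeletions(delSrc, delDest)
```

Running this after sync would leave dest with only the entries that are still present in src.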
While we have only covered two use cases in this post, the OS-Lib Cookbook has several other use cases you can browse to see how file handling works in a wider variety of situations.
This is only a quick tour of how to work with the filesystem in various ways. The library documentation has a much more thorough reference for all the things you can do and how to do them.
Dealing with files and folders in Scala doesn't need to be difficult or verbose. With the OS-Lib library, querying information about the filesystem is both convenient and safe: you can accomplish what you want in very little code, while the compiler and library help you check your logic and make sure you aren't e.g. messing up your path handling.
While OS-Lib is a third-party library, it is available on Maven Central and easy to use in any Scala environment: whether built using SBT, Maven, Mill, or directly in Ammonite's REPL or Scripts. All systems end up needing to interact with the filesystem for various miscellaneous tasks, and in Scala such interactions can be quick, easy, and safe.