Commit 0cedbaa9 authored by Uli Middelberg's avatar Uli Middelberg

integrated support for aufs & Docker

parent 0fed6ed8
What: /debug/aufs/si_<id>/
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
Under /debug/aufs, a directory named si_<id> is created
per aufs mount, where <id> is a unique id generated
internally.
What: /debug/aufs/si_<id>/plink
Date: Apr 2013
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It has three lines and shows the information about the
pseudo-link. The first line is a single number
representing a number of buckets. The second line is a
number of pseudo-links per buckets (separated by a
blank). The last line is a single number representing a
total number of psedo-links.
When the aufs mount option 'noplink' is specified, it
will show "1\n0\n0\n".
What: /debug/aufs/si_<id>/xib
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It shows the consumed blocks by xib (External Inode Number
Bitmap), its block size and file size.
When the aufs mount option 'noxino' is specified, it
will be empty. About XINO files, see the aufs manual.
What: /debug/aufs/si_<id>/xino0, xino1 ... xinoN
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It shows the consumed blocks by xino (External Inode Number
Translation Table), its link count, block size and file
size.
When the aufs mount option 'noxino' is specified, it
will be empty. About XINO files, see the aufs manual.
What: /debug/aufs/si_<id>/xigen
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It shows the consumed blocks by xigen (External Inode
Generation Table), its block size and file size.
If CONFIG_AUFS_EXPORT is disabled, this entry will not
be created.
When the aufs mount option 'noxino' is specified, it
will be empty. About XINO files, see the aufs manual.
What: /sys/fs/aufs/si_<id>/
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
Under /sys/fs/aufs, a directory named si_<id> is created
per aufs mount, where <id> is a unique id generated
internally.
What: /sys/fs/aufs/si_<id>/br0, br1 ... brN
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It shows the abolute path of a member directory (which
is called branch) in aufs, and its permission.
What: /sys/fs/aufs/si_<id>/brid0, brid1 ... bridN
Date: July 2013
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It shows the id of a member directory (which is called
branch) in aufs.
What: /sys/fs/aufs/si_<id>/xi_path
Date: March 2009
Contact: J. R. Okajima <hooanon05g@gmail.com>
Description:
It shows the abolute path of XINO (External Inode Number
Bitmap, Translation Table and Generation Table) file
even if it is the default path.
When the aufs mount option 'noxino' is specified, it
will be empty. About XINO files, see the aufs manual.
This diff is collapsed.
# Copyright (C) 2005-2014 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
Introduction
----------------------------------------
aufs [ei ju: ef es] | [a u f s]
1. abbrev. for "advanced multi-layered unification filesystem".
2. abbrev. for "another unionfs".
3. abbrev. for "auf das" in German which means "on the" in English.
Ex. "Butter aufs Brot"(G) means "butter onto bread"(E).
But "Filesystem aufs Filesystem" is hard to understand.
AUFS is a filesystem with features:
- multi layered stackable unification filesystem, the member directory
is called as a branch.
- branch permission and attribute, 'readonly', 'real-readonly',
'readwrite', 'whiteout-able', 'link-able whiteout' and their
combination.
- internal "file copy-on-write".
- logical deletion, whiteout.
- dynamic branch manipulation, adding, deleting and changing permission.
- allow bypassing aufs, user's direct branch access.
- external inode number translation table and bitmap which maintains the
persistent aufs inode number.
- seekable directory, including NFS readdir.
- file mapping, mmap and sharing pages.
- pseudo-link, hardlink over branches.
- loopback mounted filesystem as a branch.
- several policies to select one among multiple writable branches.
- revert a single systemcall when an error occurs in aufs.
- and more...
Multi Layered Stackable Unification Filesystem
----------------------------------------------------------------------
Most people already knows what it is.
It is a filesystem which unifies several directories and provides a
merged single directory. When users access a file, the access will be
passed/re-directed/converted (sorry, I am not sure which English word is
correct) to the real file on the member filesystem. The member
filesystem is called 'lower filesystem' or 'branch' and has a mode
'readonly' and 'readwrite.' And the deletion for a file on the lower
readonly branch is handled by creating 'whiteout' on the upper writable
branch.
On LKML, there have been discussions about UnionMount (Jan Blunck,
Bharata B Rao and Valerie Aurora) and Unionfs (Erez Zadok). They took
different approaches to implement the merged-view.
The former tries putting it into VFS, and the latter implements as a
separate filesystem.
(If I misunderstand about these implementations, please let me know and
I shall correct it. Because it is a long time ago when I read their
source files last time).
UnionMount's approach will be able to small, but may be hard to share
branches between several UnionMount since the whiteout in it is
implemented in the inode on branch filesystem and always
shared. According to Bharata's post, readdir does not seems to be
finished yet.
There are several missing features known in this implementations such as
- for users, the inode number may change silently. eg. copy-up.
- link(2) may break by copy-up.
- read(2) may get an obsoleted filedata (fstat(2) too).
- fcntl(F_SETLK) may be broken by copy-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
open(O_RDWR).
Unionfs has a longer history. When I started implementing a stacking filesystem
(Aug 2005), it already existed. It has virtual super_block, inode,
dentry and file objects and they have an array pointing lower same kind
objects. After contributing many patches for Unionfs, I re-started my
project AUFS (Jun 2006).
In AUFS, the structure of filesystem resembles to Unionfs, but I
implemented my own ideas, approaches and enhancements and it became
totally different one.
Comparing DM snapshot and fs based implementation
- the number of bytes to be copied between devices is much smaller.
- the type of filesystem must be one and only.
- the fs must be writable, no readonly fs, even for the lower original
device. so the compression fs will not be usable. but if we use
loopback mount, we may address this issue.
for instance,
mount /cdrom/squashfs.img /sq
losetup /sq/ext2.img
losetup /somewhere/cow
dmsetup "snapshot /dev/loop0 /dev/loop1 ..."
- it will be difficult (or needs more operations) to extract the
difference between the original device and COW.
- DM snapshot-merge may help a lot when users try merging. in the
fs-layer union, users will use rsync(1).
Several characters/aspects of aufs
----------------------------------------------------------------------
Aufs has several characters or aspects.
1. a filesystem, callee of VFS helper
2. sub-VFS, caller of VFS helper for branches
3. a virtual filesystem which maintains persistent inode number
4. reader/writer of files on branches such like an application
1. Callee of VFS Helper
As an ordinary linux filesystem, aufs is a callee of VFS. For instance,
unlink(2) from an application reaches sys_unlink() kernel function and
then vfs_unlink() is called. vfs_unlink() is one of VFS helper and it
calls filesystem specific unlink operation. Actually aufs implements the
unlink operation but it behaves like a redirector.
2. Caller of VFS Helper for Branches
aufs_unlink() passes the unlink request to the branch filesystem as if
it were called from VFS. So the called unlink operation of the branch
filesystem acts as usual. As a caller of VFS helper, aufs should handle
every necessary pre/post operation for the branch filesystem.
- acquire the lock for the parent dir on a branch
- lookup in a branch
- revalidate dentry on a branch
- mnt_want_write() for a branch
- vfs_unlink() for a branch
- mnt_drop_write() for a branch
- release the lock on a branch
3. Persistent Inode Number
One of the most important issue for a filesystem is to maintain inode
numbers. This is particularly important to support exporting a
filesystem via NFS. Aufs is a virtual filesystem which doesn't have a
backend block device for its own. But some storage is necessary to
maintain inode number. It may be a large space and may not suit to keep
in memory. Aufs rents some space from its first writable branch
filesystem (by default) and creates file(s) on it. These files are
created by aufs internally and removed soon (currently) keeping opened.
Note: Because these files are removed, they are totally gone after
unmounting aufs. It means the inode numbers are not persistent
across unmount or reboot. I have a plan to make them really
persistent which will be important for aufs on NFS server.
4. Read/Write Files Internally (copy-on-write)
Because a branch can be readonly, when you write a file on it, aufs will
"copy-up" it to the upper writable branch internally. And then write the
originally requested thing to the file. Generally kernel doesn't
open/read/write file actively. In aufs, even a single write may cause a
internal "file copy". This behaviour is very similar to cp(1) command.
Some people may think it is better to pass such work to user space
helper, instead of doing in kernel space. Actually I am still thinking
about it. But currently I have implemented it in kernel space.
This diff is collapsed.
# Copyright (C) 2005-2014 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
Lookup in a Branch
----------------------------------------------------------------------
Since aufs has a character of sub-VFS (see Introduction), it operates
lookup for branches as VFS does. It may be a heavy work. Generally
speaking struct nameidata is a bigger structure and includes many
information. But almost all lookup operation in aufs is the simplest
case, ie. lookup only an entry directly connected to its parent. Digging
down the directory hierarchy is unnecessary.
VFS has a function lookup_one_len() for that use, but it is not usable
for a branch filesystem which requires struct nameidata. So aufs
implements a simple lookup wrapper function. When a branch filesystem
allows NULL as nameidata, it calls lookup_one_len(). Otherwise it builds
a simplest nameidata and calls lookup_hash().
Here aufs applies "a principle in NFSD", ie. if the filesystem supports
NFS-export, then it has to support NULL as a nameidata parameter for
->create(), ->lookup() and ->d_revalidate(). So the lookup wrapper in
aufs tests if ->s_export_op in the branch is NULL or not.
When a branch is a remote filesystem, aufs basically trusts its
->d_revalidate(), also aufs forces the hardest revalidate tests for
them.
For d_revalidate, aufs implements three levels of revalidate tests. See
"Revalidate Dentry and UDBA" in detail.
Test Only the Highest One for the Directory Permission (dirperm1 option)
----------------------------------------------------------------------
Let's try case study.
- aufs has two branches, upper readwrite and lower readonly.
/au = /rw + /ro
- "dirA" exists under /ro, but /rw. and its mode is 0700.
- user invoked "chmod a+rx /au/dirA"
- the internal copy-up is activated and "/rw/dirA" is created and its
permission bits are set to world readble.
- then "/au/dirA" becomes world readable?
In this case, /ro/dirA is still 0700 since it exists in readonly branch,
or it may be a natively readonly filesystem. If aufs respects the lower
branch, it should not respond readdir request from other users. But user
allowed it by chmod. Should really aufs rejects showing the entries
under /ro/dirA?
To be honest, I don't have a best solution for this case. So aufs
implements 'dirperm1' and 'nodirperm1' and leave it to users.
When dirperm1 is specified, aufs checks only the highest one for the
directory permission, and shows the entries. Otherwise, as usual, checks
every dir existing on all branches and rejects the request.
As a side effect, dirperm1 option improves the performance of aufs
because the number of permission check is reduced when the number of
branch is many.
Loopback Mount
----------------------------------------------------------------------
Basically aufs supports any type of filesystem and block device for a
branch (actually there are some exceptions). But it is prohibited to add
a loopback mounted one whose backend file exists in a filesystem which is
already added to aufs. The reason is to protect aufs from a recursive
lookup. If it was allowed, the aufs lookup operation might re-enter a
lookup for the loopback mounted branch in the same context, and will
cause a deadlock.
Revalidate Dentry and UDBA (User's Direct Branch Access)
----------------------------------------------------------------------
Generally VFS helpers re-validate a dentry as a part of lookup.
0. digging down the directory hierarchy.
1. lock the parent dir by its i_mutex.
2. lookup the final (child) entry.
3. revalidate it.
4. call the actual operation (create, unlink, etc.)
5. unlock the parent dir
If the filesystem implements its ->d_revalidate() (step 3), then it is
called. Actually aufs implements it and checks the dentry on a branch is
still valid.
But it is not enough. Because aufs has to release the lock for the
parent dir on a branch at the end of ->lookup() (step 2) and
->d_revalidate() (step 3) while the i_mutex of the aufs dir is still
held by VFS.
If the file on a branch is changed directly, eg. bypassing aufs, after
aufs released the lock, then the subsequent operation may cause
something unpleasant result.
This situation is a result of VFS architecture, ->lookup() and
->d_revalidate() is separated. But I never say it is wrong. It is a good
design from VFS's point of view. It is just not suitable for sub-VFS
character in aufs.
Aufs supports such case by three level of revalidation which is
selectable by user.
1. Simple Revalidate
Addition to the native flow in VFS's, confirm the child-parent
relationship on the branch just after locking the parent dir on the
branch in the "actual operation" (step 4). When this validation
fails, aufs returns EBUSY. ->d_revalidate() (step 3) in aufs still
checks the validation of the dentry on branches.
2. Monitor Changes Internally by Inotify/Fsnotify
Addition to above, in the "actual operation" (step 4) aufs re-lookup
the dentry on the branch, and returns EBUSY if it finds different
dentry.
Additionally, aufs sets the inotify/fsnotify watch for every dir on branches
during it is in cache. When the event is notified, aufs registers a
function to kernel 'events' thread by schedule_work(). And the
function sets some special status to the cached aufs dentry and inode
private data. If they are not cached, then aufs has nothing to
do. When the same file is accessed through aufs (step 0-3) later,
aufs will detect the status and refresh all necessary data.
In this mode, aufs has to ignore the event which is fired by aufs
itself.
3. No Extra Validation
This is the simplest test and doesn't add any additional revalidation
test, and skip therevalidatin in step 4. It is useful and improves
aufs performance when system surely hide the aufs branches from user,
by over-mounting something (or another method).
# Copyright (C) 2005-2014 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
Branch Manipulation
Since aufs supports dynamic branch manipulation, ie. add/remove a branch
and changing its permission/attribute, there are a lot of works to do.
Add a Branch
----------------------------------------------------------------------
o Confirm the adding dir exists outside of aufs, including loopback
mount.
- and other various attributes...
o Initialize the xino file and whiteout bases if necessary.
See struct.txt.
o Check the owner/group/mode of the directory
When the owner/group/mode of the adding directory differs from the
existing branch, aufs issues a warning because it may impose a
security risk.
For example, when a upper writable branch has a world writable empty
top directory, a malicious user can create any files on the writable
branch directly, like copy-up and modify manually. If something like
/etc/{passwd,shadow} exists on the lower readonly branch but the upper
writable branch, and the writable branch is world-writable, then a
malicious guy may create /etc/passwd on the writable branch directly
and the infected file will be valid in aufs.
I am afraid it can be a security issue, but nothing to do except
producing a warning.
Delete a Branch
----------------------------------------------------------------------
o Confirm the deleting branch is not busy
To be general, there is one merit to adopt "remount" interface to
manipulate branches. It is to discard caches. At deleting a branch,
aufs checks the still cached (and connected) dentries and inodes. If
there are any, then they are all in-use. An inode without its
corresponding dentry can be alive alone (for example, inotify/fsnotify case).
For the cached one, aufs checks whether the same named entry exists on
other branches.
If the cached one is a directory, because aufs provides a merged view
to users, as long as one dir is left on any branch aufs can show the
dir to users. In this case, the branch can be removed from aufs.
Otherwise aufs rejects deleting the branch.
If any file on the deleting branch is opened by aufs, then aufs
rejects deleting.
Modify the Permission of a Branch
----------------------------------------------------------------------
o Re-initialize or remove the xino file and whiteout bases if necessary.
See struct.txt.
o rw --> ro: Confirm the modifying branch is not busy
Aufs rejects the request if any of these conditions are true.
- a file on the branch is mmap-ed.
- a regular file on the branch is opened for write and there is no
same named entry on the upper branch.
# Copyright (C) 2005-2014 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
Policies to Select One among Multiple Writable Branches
----------------------------------------------------------------------
When the number of writable branch is more than one, aufs has to decide
the target branch for file creation or copy-up. By default, the highest
writable branch which has the parent (or ancestor) dir of the target
file is chosen (top-down-parent policy).
By user's request, aufs implements some other policies to select the
writable branch, for file creation two policies, round-robin and
most-free-space policies. For copy-up three policies, top-down-parent,
bottom-up-parent and bottom-up policies.
As expected, the round-robin policy selects the branch in circular. When
you have two writable branches and creates 10 new files, 5 files will be
created for each branch. mkdir(2) systemcall is an exception. When you
create 10 new directories, all will be created on the same branch.
And the most-free-space policy selects the one which has most free
space among the writable branches. The amount of free space will be
checked by aufs internally, and users can specify its time interval.
The policies for copy-up is more simple,
top-down-parent is equivalent to the same named on in create policy,
bottom-up-parent selects the writable branch where the parent dir
exists and the nearest upper one from the copyup-source,
bottom-up selects the nearest upper writable branch from the
copyup-source, regardless the existence of the parent dir.
There are some rules or exceptions to apply these policies.
- If there is a readonly branch above the policy-selected branch and
the parent dir is marked as opaque (a variation of whiteout), or the
target (creating) file is whiteout-ed on the upper readonly branch,
then the result of the policy is ignored and the target file will be
created on the nearest upper writable branch than the readonly branch.
- If there is a writable branch above the policy-selected branch and
the parent dir is marked as opaque or the target file is whiteouted
on the branch, then the result of the policy is ignored and the target
file will be created on the highest one among the upper writable
branches who has diropq or whiteout. In case of whiteout, aufs removes
it as usual.
- link(2) and rename(2) systemcalls are exceptions in every policy.
They try selecting the branch where the source exists as possible
since copyup a large file will take long time. If it can't be,
ie. the branch where the source exists is readonly, then they will
follow the copyup policy.
- There is an exception for rename(2) when the target exists.
If the rename target exists, aufs compares the index of the branches
where the source and the target exists and selects the higher
one. If the selected branch is readonly, then aufs follows the
copyup policy.
# Copyright (C) 2011-2014 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
File-based Hierarchical Storage Management (FHSM)
----------------------------------------------------------------------
Hierarchical Storage Management (or HSM) is a well-known feature in the
storage world. Aufs provides this feature as file-based with multiple
writable branches, based upon the principle of "Colder-Lower".
Here the word "colder" means that the less used files, and "lower" means
that the position in the order of the stacked branches.
These multiple writable branches are prioritized, ie. the topmost one
should be the fastest drive and be used heavily.
o Characters in aufs FHSM story
- aufs itself and a new branch attribute.
- a new ioctl interface to move-down and to establish a connection with
the daemon ("move-down" is a converse of "copy-up").
- userspace tool and daemon.
The userspace daemon establishes a connection with aufs and waits for
the notification. The notified information is very similar to struct
statfs containing the number of consumed blocks and inodes.
When the consumed blocks/inodes of a branch exceeds the user-specified
upper watermark, the daemon activates its move-down process until the
consumed blocks/inodes reaches the user-specified lower watermark.
The actual move-down is done by aufs based upon the request from
user-space since we need to maintain the inode number and the internal
pointer arrays in aufs.
Currently aufs FHSM handles the regular files only. Additionally they
must not be hard-linked nor pseudo-linked.
o Cowork of aufs and the user-space daemon
During the userspace daemon established the connection, aufs sends a
small notification to it whenever aufs writes something into the
writable branch. But it may cost high since aufs issues statfs(2)
internally. So user can specify a new option to cache the
info. Actually the notification is controlled by these factors.
+ the specified cache time.
+ classified as "force" by aufs internally.
Until the specified time expires, aufs doesn't send the info
except the forced cases. When aufs decide forcing, the info is always
notified to userspace.
For example, the number of free inodes is generally large enough and
the shortage of it happens rarely. So aufs doesn't force the
notification when creating a new file, directory and others. This is
the typical case which aufs doesn't force.
When aufs writes the actual filedata and the files consumes any of new
blocks, the aufs forces notifying.
o Interfaces in aufs
- New branch attribute.
+ fhsm
Specifies that the branch is managed by FHSM feature. In other word,
participant in the FHSM.
When nofhsm is set to the branch, it will not be the source/target
branch of the move-down operation. This attribute is set
independently from coo and moo attributes, and if you want full
FHSM, you should specify them as well.
- New mount option.
+ fhsm_sec
Specifies a second to suppress many less important info to be
notified.
- New ioctl.
+ AUFS_CTL_FHSM_FD
create a new file descriptor which userspace can read the notification
(a subset of struct statfs) from aufs.
- Module parameter 'brs'
It has to be set to 1. Otherwise the new mount option 'fhsm' will not
be set.
- mount helpers /sbin/mount.aufs and /sbin/umount.aufs
When there are two or more branches with fhsm attributes,
/sbin/mount.aufs invokes the user-space daemon and /sbin/umount.aufs
terminates it. As a result of remounting and branch-manipulation, the
number of branches with fhsm attribute can be one. In this case,
/sbin/mount.aufs will terminate the user-space daemon.
Finally the operation is done as these steps in kernel-space.
- make sure that,
+ no one else is using the file.
+ the file is not hard-linked.
+ the file is not pseudo-linked.
+ the file is a regular file.
+ the parent dir is not opaqued.
- find the target writable branch.
- make sure the file is not whiteout-ed by the upper (than the target)
branch.
- make the parent dir on the target branch.
- mutex lock the inode on the branch.
- unlink the whiteout on the target branch (if exists).
- lookup and create the whiteout-ed temporary name on the target branch.
- copy the file as the whiteout-ed temporary name on the target branch.
- rename the whiteout-ed temporary name to the original name.
- unlink the file on the source branch.
- maintain the internal pointer array and the external inode number
table (XINO).
- maintain the timestamps and other attributes of the parent dir and the
file.
And of course, in every step, an error may happen. So the operation
should restore the original file state after an error happens.
# Copyright (C) 2005-2014 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
mmap(2) -- File Memory Mapping
----------------------------------------------------------------------
In aufs, the file-mapped pages are handled by a branch fs directly, no
interaction with aufs. It means aufs_mmap() calls the branch fs's
->mmap().
This approach is simple and good, but there is one problem.
Under /proc, several entries show the mmap-ped files by its path (with
device and inode number), and the printed path will be the path on the
branch fs's instead of virtual aufs's.
This is not a problem in most cases, but some utilities lsof(1) (and its
user) may expect the path on aufs.
To address this issue, aufs adds a new member called vm_prfile in struct
vm_area_struct (and struct vm_region). The original vm_file points to
the file on the branch fs in order to handle everything correctly as
usual. The new vm_prfile points to a virtual file in aufs, and the
show-functions in procfs refers to vm_prfile if it is set.
Also we need to maintain several other places where touching vm_file
such like
- fork()/clone() copies vma and the reference count of vm_file is
incremented.
- merging vma maintains the ref count too.
This is not a good approach. It just faking the printed path. But it
leaves all behaviour around f_mapping unchanged. This is surely an
advantage.
Actually aufs had adopted another complicated approach which calls
generic_file_mmap() and handles struct vm_operations_struct. In this
approach, aufs met a hard problem and I could not solve it without
switching the approach.
# Copyright (C) 2014 Junjiro R. Okajima