mount -a remounts tmpfs entries: bug or feature?

Tue Nov 25 04:30:48 PST 2008

On Monday 24 November 2008 17:12:45 Denys Vlasenko wrote:
> On Monday 24 November 2008 14:59, Rob Landley wrote:
> > On Sunday 23 November 2008 09:01:35 Denys Vlasenko wrote:
> > > On Friday 07 November 2008 03:01, busybox at eehouse.org wrote:
> > > > busybox's implementation of mount differs from the standalone version
> >
> > Back in the 1.1 timeframe I rewrote it more or less from scratch,
> > something like 3 times, trying to get it to behave sanely.  (Mount is
> > tricksy.)
> >
> > I see it's been fairly heavily edited then.  Kind of horrible to read
> > through now, actually.
>
> NFS code was merged into mount.c. Somebody asked me to do it
> (it was before my maintainership). In hindsight, that was not
> such a good idea - now it's not readable.
>
> Other than NFS code - what places you don't like Rob?

An #ifdef for _dietlibc_, special casing rootfs, special casing shared subtree 
flags, the mount_option_str separately quoting "\0" (small but ugly, you 
wouldn't do it for \n)...

Some of it might just be that I've gotten used to looking at the toybox 
infrastructure for things like option parsing, so constructs like:

#if ENABLE_BLAH
#define ifBlah (logic)
#else
#define ifBlah 0
#endif

Just look really wrong, although that one was probably my fault once upon a 
time.  Also, I don't #ifdef things out of my shared globals struct on the 
theory that it's probably going to round up to page granularity anyway and 
usually the extra source complexity doesn't even make it pull in fewer cache 
lines, while with the simple way you can just go (CFG_WHATSIS && TT.whatsis) 
in the code and let the compiler drop it out.

Lots of #if ENABLE in general that could be if (ENABLE) instead.  
ENABLE_MOUNT_LABEL in resolve_mount_spec(), for example.  In general a static 
function should be inlineable and optimizable away with gcc 4.x.  I'm looking 
at verbose_mount() here, which only has two callers anyway so guarding it at 
the call sites might be better, although the second call site has comments on 
each _argument_, which is sick...
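
For what it's worth, the if (ENABLE) style can be sketched roughly like this.  It's a minimal illustration, not code from busybox or toybox: ENABLE_FEATURE_VERBOSE, the G struct, and report() are all made-up names, and it assumes the build system defines every ENABLE_* symbol as 0 or 1.

```c
#include <stdio.h>

/* Hypothetical config symbol.  Assumed to always be defined as 0 or 1,
 * so it is safe to test in ordinary C rather than in the preprocessor. */
#ifndef ENABLE_FEATURE_VERBOSE
#define ENABLE_FEATURE_VERBOSE 0
#endif

/* Shared globals: the field stays in the struct even when the feature
 * is configured out, so no #ifdef is needed here either. */
static struct {
    int verbose;
} G;

/* Hypothetical static helper standing in for something like
 * verbose_mount().  Returns the number of messages printed. */
static int report(const char *what)
{
    /* When the symbol is 0 this condition is a compile-time constant
     * false, so the compiler may discard the whole branch. */
    if (ENABLE_FEATURE_VERBOSE && G.verbose) {
        fprintf(stderr, "mount: %s\n", what);
        return 1;
    }
    return 0;
}
```

With the symbol defined as 0, an optimizing compiler is free to drop the fprintf() call and reduce report() to "return 0", yet the body still gets parsed and type checked in every configuration, which #if would not give you.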

Speaking of verbose_mount(), in its second caller you have this:

        rc = verbose_mount(/*source:*/ "", /*target:*/ argv[0],
                /*type:*/ "", /*flags:*/ i, /*data:*/ "");

I would think it would have been more readable as:
	// verbose_mount(source, target, type, flags, data)
	rc = verbose_mount ("", argv[0], "", i, "");

Or just assuming readers could look at the names of the variables in the 
prototype...

> > And kind of broken in several places.  Ooh, ick.
>
> Where? :(

Alas, I no longer remember what specifically that was in response to. (I was 
still recovering from food poisoning and trying not to give myself a 
headache.)  I do remember that "several" in this instance was only 2 or 3, and 
it was more "it won't do the right thing given these arguments" rather than a 
security issue.

A quick glance at the code shows it's got lots of:
        // WARNING. I am not sure this matches util-linux's
        // behavior. It's possible util-linux does not
        // take -o opts from mtab (takes only mount source).

I actually explicitly tested that sort of thing over a period of 3 months and 
worked out what the correct behavior should _be_, and implemented it at the 
time.  (See "needed to write up a spec, didn't manage".  My fault again.  
Mount is only anything like simple and straightforward after you sit down and 
study it for weeks, and the day or two I had set aside for writing down a 
coherent explanation of how mounting should work wasn't nearly enough.)

> > Some filesystem types are per-instance, and some are shared with all
> > instances (most block backed ones, non-containerized versions of /proc
> > and /sys...).
> >
> > Did you ever read the thing I wrote about the four types of filesystems
> > (block backed, ram backed, synthetic, and network)?
>
> No. Do you have an URL?

Nope, I think it might have been on timesys's website, but a lot of their old 
content got closed up or moved around when their engineering department 
disintegrated in late 2006.  (It's all different people now...)

Here's a quick and dirty, unedited, stream of consciousness dump which sounds 
to me like it's coming from Captain Obvious, but on the off chance it might 
prove useful...

Mounting a filesystem just connects a filesystem driver to a directory, and 
the driver can put anything it darn well pleases in that directory, but it 
generally falls into four categories.  Two of them have "backing store" and 
two of them don't:

  - block backed: the classic one everybody thinks of, and which mount is
    actually _designed_ around.  When Unix was young this was the only type of
    filesystem.  Your filesystem driver (specified with -t fsname) acts as a
    lens to look at a specified block device through.  This means you have
    _two_ drivers involved in every read and write from this filesystem: a
    filesystem driver to interpret the format and a block driver to talk to
    the hardware (which is implicit, it's providing the block device you
    pointed the filesystem driver at).  Note that ramdisks are block backed
    filesystems; a ramdisk driver produces a block device out of a chunk of
    memory, and then you format it and look at it through a filesystem driver
    such as ext2.

    Note that "block device" is a specific API, a randomly seekable range of
    bytes with invariant size.  Block backed filesystems _only_ talk to this
    API, to the point that loopback devices exist solely to provide a block
    device API wrapper around normal files (which are perfectly capable of
    providing a range of bytes, but they don't guarantee their length won't
    change while you're using 'em.  I believe attempting to truncate a file
    which a loop device is attached to no longer panics the kernel, but I
    haven't actually tried it).

  - Network filesystems: the first complication, a filesystem that talks to
    something _other_ than a block device for its backing store.  (I think
    these showed up sometime in the early 80's, but am not looking it up right
    now.)  You'd think they'd have been smart and made it talk to a character
    device, but the BSD guys who added networking to Unix didn't give network
    cards /dev entries, and then Sun hired them to produce NFS. (They also
    inflicted vi upon us.)  In general these suckers sort of act like they
    talk to their backing store via a serial protocol that can fit through a
    pipe (or character device, or socket, or...), but you've really got to
    squint and _want_ to see it.  In practice with NFS as one of the early
    models for this (those who do not understand TCP are doomed to reinvent it
    poorly via UDP; you wouldn't _think_ Bill Joy would have been in that
    group, and yet...), the result was a mess of back-channels and
    side-channels and weird overlapping incestuous knowledge of their backing
    store.  This group includes everything from samba through FUSE, and
    outliers you can lump in here include jffs2 (since the backing store isn't
    a normal block device; it _must_ be flash, which the driver has incestuous
    knowledge of).

    With network file systems, the "block device" field mount passes in gets
    treated as an address of the backing store, but how to interpret that
    address (a URL?  Flash memory range?  Cookie to look up in a database?)
    is up to the driver.  You can identify these because the filesystem
    holds arbitrary files, it has a persistent backing store, and that backing
    store is something _other_ than a normal block device.  (The sane ones of
    these at least still have a separate driver or program or something
    handling the backing store, but they're not all cleanly separated.  Case
    in point, jffs2 again, which has code in it to erase flash banks, and thus
    has to know about NAND vs NOR and thus
    http://www.linuxdevices.com/news/NS7386103729.html is news and... ugh.
    Clean orthogonal separation is a good thing.  In the network filesystem
    space, Linux finally gave us a universal API, and it's called FUSE.)

  - ram backed: Now we get weird: filesystems that store arbitrary files, but
    have no persistent backing store.  Really this abuses the disk cache to
    act like a filesystem, by plugging it up so the cached data has nowhere to
    go and just stays in the cache instead.  The implementation is very small
    and very simple because the page and dentry caches already _exist_ as
    common code in the VFS layer, so it only takes a ~100 line driver to
    stub out a few things and give you a temporary filesystem.

    Linus Torvalds invented this approach in April 2000:
    http://kernel-traffic.org/kernel-traffic/kt20000424_64.html#1

    Linus wanted ramfs kept simple, so another variant (tmpfs) was invented
    that allows size limits and swapping out the pages.  (Ordinarily,
    swapping out disk cache is counterproductive: you tell the filesystem
    driver to get rid of the page, it writes the page to backing store if
    need be, and then it frees the memory, since it can read the data in
    again from backing store or, in the case of synthetic filesystems,
    generate new contents.  Ramfs is almost unique in that the data has
    nowhere to go and _can't_ be freed.  tmpfs shuffles cache pages into
    anonymous pages as if they belonged to a process, and lets 'em get
    swapped out.  It uses the swap partition as a transient backing store,
    but it still goes away when you reboot.)

    The next fun thing was rootfs, which is an instance of ramfs (alas, _not_
    tmpfs) that gets auto-created at boot and populated from a cpio archive.
    Remember how ramdisks are really block backed filesystems?  That means
    they need two device drivers (the block driver and the filesystem format
    interpreting driver that turns the data in the block device into files and
    directories and writes it all back again in the right places as
    necessary).  And to boot, they need to be statically linked.  Plus the
    data is copied from the block device into the page cache, so you have
    _two_ copies of the data when the files are in use.  Rootfs needs none
    of that.  I don't have to sell this crowd on why this is cool, but it's
    also _simple_.  The trick to making the problems of turning a chunk of
    memory into a block device that could be used as a block backed
    filesystem go away was _not_to_do_that_, and as far as I can tell it
    simply hadn't occurred to anybody that you could get _away_ with not
    doing that until Linus did it.  Obvious in retrospect, of course, but
    most good ideas are. :)

  - synthetic filesystems.  Here we really go off into the weeds, filesystem
    drivers that don't even store arbitrary files.  The files here are just a
    way of communicating with the driver; writing to them provides information
    for the driver to act upon to perform special effects, and reading from
    them lets the driver supply information back to userspace.  The driver
    isn't "storing" information like a normal filesystem, it's eating what you
    write into it and hallucinating any darn contents it feels like in return.

    Examples include sysfs, proc, debugfs, usbfs, the late unlamented devfs,
    and more. The first synthetic filesystem was /proc, which was invented to
    show information about processes (so ps didn't have to try to parse
    /dev/mem to find internal kernel structures; yes, that's how it used to do
    it, you may barf now).  For a while it was the only synthetic
    filesystem, so every time somebody wanted to pass any _other_
    info to userspace (like /proc/version) they added it to /proc until it
    became a horrible compost heap.  And then libfs was invented, as described
    in http://lwn.net/Articles/57369/ and go read that instead because I'm
    falling asleep.
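
One place the kernel exposes part of this split is /proc/filesystems, which tags drivers that don't take a block device with "nodev".  The sketch below is only an illustration: fs_wants_blockdev() is a hypothetical helper, and note that nodev lumps the network, ram backed, and synthetic categories together, so it only separates block backed filesystems from the rest.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan a stream in /proc/filesystems format, where
 * each line is either "nodev<TAB>name" or "<TAB>name".
 * Returns 1 if fstype is listed without the nodev tag (it wants a block
 * device), 0 if it is listed as nodev, -1 if it is not listed at all. */
int fs_wants_blockdev(FILE *fp, const char *fstype)
{
    char line[256];

    while (fgets(line, sizeof(line), fp)) {
        char first[64], second[64];
        /* %s skips leading whitespace, so "<TAB>ext2" yields one word
         * and "nodev<TAB>sysfs" yields two. */
        int n = sscanf(line, "%63s %63s", first, second);

        if (n == 2 && !strcmp(first, "nodev") && !strcmp(second, fstype))
            return 0;
        if (n == 1 && !strcmp(first, fstype))
            return 1;
    }
    return -1;
}
```

On a real system you would hand it fopen("/proc/filesystems", "r"); the answer only covers drivers the running kernel currently has registered.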

All of the above was A) much more coherent in the version I actually bothered 
to _edit_, B) actually the _introduction_ to a longer document (it's what I 
wound up writing when I sat down to do a mount spec and decided I needed to 
start with some background).  To be honest, I don't really remember where it 
went from there.  (I vaguely recall a segue into mount parameters, which 
include flags, string flags, and the block device parameter itself.  Was that 
it?  Dunno.)
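
As a rough illustration of those three kinds of parameter: mount(2) takes a source string, a flags word, and a data string, and the comma separated -o list has to be split between the last two.  The sketch below shows the idea, assuming a hypothetical split_mount_opts() helper and a deliberately tiny table of four flag names; it is not how busybox or util-linux actually parse options.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Illustrative subset only: option names that map to mount(2) flag bits. */
struct opt_flag {
    const char *name;
    unsigned long flag;
};

static const struct opt_flag known_opts[] = {
    { "ro",     MS_RDONLY },
    { "nosuid", MS_NOSUID },
    { "nodev",  MS_NODEV  },
    { "noexec", MS_NOEXEC },
};

/* Hypothetical helper: walk a comma separated option list, OR together
 * the flag bits for names found in the table, and copy everything else
 * (the "string flags") into data for the filesystem driver to interpret.
 * data must have room for at least one byte (datalen >= 1). */
unsigned long split_mount_opts(const char *opts, char *data, size_t datalen)
{
    unsigned long flags = 0;
    const char *p = opts;

    data[0] = '\0';
    while (*p) {
        size_t len = strcspn(p, ",");   /* length of the current option */
        size_t i, n = sizeof(known_opts) / sizeof(known_opts[0]);

        for (i = 0; i < n; i++)
            if (strlen(known_opts[i].name) == len
                && !strncmp(p, known_opts[i].name, len))
                break;

        if (i < n) {
            flags |= known_opts[i].flag;
        } else if (len) {
            /* Not a known flag name: append to the data string. */
            size_t used = strlen(data);
            snprintf(data + used, datalen - used, "%s%.*s",
                     used ? "," : "", (int)len, p);
        }
        p += len;
        if (*p == ',')
            p++;
    }
    return flags;
}
```

The two results would then go into the last two arguments of mount(source, target, type, flags, data), with source being the block device (or whatever address the driver expects).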

I go sleep now.

Rob