File System as a Loadable Kernel Module (LKM)
Some revelations I had while porting a file system to Linux.
Why an LKM?
As you can see, the user space idea didn’t pan out too well: http://tekrants.me/2012/05/22/fuse-file-system-port-for-embedded-linux/
A word which defines VFS?
Inode - this represents every file, directory and link in a file system. If your file system is like FAT and lacks an inode, then get ready to write an abstraction layer.
1. Register mount & unmount calls with the VFS.
2. The mount call creates and registers a root directory inode.
3. The root directory inode is the point of entry to the volume. It gives the VFS function pointers to inode operations (like create), file operations (like open, read) and directory operations (like readdir).
With the above three steps your file system driver is all set. Linux now has enough information to translate an “open” call from an application into the file system specific open call, thanks to the function pointers inside the root inode.
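The three steps can be sketched as a toy user-space model in Python. This is not kernel code: every name here (FS_TYPES, register_filesystem, vfs_mount, the ramfs_* functions) is made up for illustration, and dictionaries stand in for the kernel's tables of function pointers.

```python
# Toy user-space model of the three steps; NOT kernel code.
# All names below are hypothetical stand-ins for kernel structures.

FS_TYPES = {}  # stands in for the VFS list of registered file system types


def register_filesystem(name, mount_fn, unmount_fn):
    """Step 1: tell the 'VFS' how to mount and unmount this file system."""
    FS_TYPES[name] = {"mount": mount_fn, "unmount": unmount_fn}


class Inode:
    def __init__(self, is_dir, inode_ops=None, file_ops=None):
        self.is_dir = is_dir
        self.inode_ops = inode_ops or {}  # e.g. create
        self.file_ops = file_ops or {}    # e.g. open/read or readdir
        self.children = {}                # directory contents (name -> Inode)
        self.data = b""                   # file contents


def ramfs_create(dir_inode, name):
    inode = Inode(False, file_ops=RAMFS_FILE_OPS)
    dir_inode.children[name] = inode
    return inode


def ramfs_open(inode):
    return {"inode": inode, "pos": 0}     # a minimal "file pointer"


def ramfs_read(handle, count):
    start = handle["pos"]
    chunk = handle["inode"].data[start:start + count]
    handle["pos"] += len(chunk)
    return chunk


def ramfs_readdir(dir_inode):
    return sorted(dir_inode.children)


RAMFS_FILE_OPS = {"open": ramfs_open, "read": ramfs_read}
RAMFS_DIR_INODE_OPS = {"create": ramfs_create}
RAMFS_DIR_FILE_OPS = {"readdir": ramfs_readdir}


def ramfs_mount():
    """Steps 2 and 3: build the root directory inode carrying the op tables."""
    return Inode(True, inode_ops=RAMFS_DIR_INODE_OPS,
                 file_ops=RAMFS_DIR_FILE_OPS)


def ramfs_unmount(root):
    root.children.clear()


def vfs_mount(fs_name):
    return FS_TYPES[fs_name]["mount"]()


register_filesystem("ramfs", ramfs_mount, ramfs_unmount)
root = vfs_mount("ramfs")

# The "VFS" never calls ramfs_* directly; it only follows the pointers.
foo = root.inode_ops["create"](root, "foo.txt")
foo.data = b"hello"
handle = foo.file_ops["open"](foo)
print(foo.file_ops["read"](handle, 5))   # b'hello'
print(root.file_ops["readdir"](root))    # ['foo.txt']
```

The point of the model is the indirection: the generic side knows only the op tables hanging off an inode, never the file system's own function names.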
What are these dentries?
A dentry is another kernel structure, which exists for every file & directory. For example, accessing the path “/mnt/ramfs” will lead to the creation of two dentries, one each for “mnt” and “ramfs”. Note that the “ramfs” dentry will have a parent pointer to the “mnt” dentry and a pointer to its own VFS inode. A dentry in fact holds attributes like the name & parent of a file system entry. One of the rationales behind separating the inode core from these attributes is the existence of links, where a single inode is shared across multiple dentries.
Opening a file? Easier said than done!
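The name/inode split is easier to see in a small model. The sketch below is plain Python, not kernel code; the Inode and Dentry classes and their fields are simplified stand-ins, and the link count here simply counts names (real directories also count “.” and “..”).

```python
class Inode:
    """The inode core: identity and link count, but no name."""
    def __init__(self, ino):
        self.ino = ino
        self.nlink = 0  # simplified: number of dentries naming this inode


class Dentry:
    """A name, a parent pointer and a pointer to the inode."""
    def __init__(self, name, parent, inode):
        self.name = name
        self.parent = parent
        self.inode = inode
        inode.nlink += 1

    def path(self):
        parts = []
        node = self
        while node.parent is not None:   # stop at the root dentry
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


root = Dentry("/", None, Inode(1))
mnt = Dentry("mnt", root, Inode(2))
ramfs = Dentry("ramfs", mnt, Inode(3))

# Two names (dentries) for one file: a hard link shares the inode.
a = Dentry("a.txt", ramfs, Inode(4))
b = Dentry("b.txt", ramfs, a.inode)

print(a.path(), b.path())        # /mnt/ramfs/a.txt /mnt/ramfs/b.txt
print(a.inode is b.inode)        # True
print(a.inode.nlink)             # 2
```

If the name lived inside the inode, the second link would have nowhere to put “b.txt”.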
a. Consider opening a file with the path “/mnt/ramfs/dir1/dir2/foo.txt”
b. The dentry elements in the above path are “mnt”, “ramfs”, “dir1”, “dir2” & “foo.txt”.
c. The “mnt” dentry will be part of the Linux root file system. All dentries are part of a hash table, and the name “mnt” (hashed together with its parent dentry) will map to a hash table entry giving the dentry pointer. The VFS gets the inode pointer from this dentry, and every directory inode has a function pointer for a look up operation on its contents.
d. Look up called on the “mnt” inode will return the inode for “ramfs” along with a brand new “ramfs” dentry.
e. This is an iterative process, and eventually the VFS will figure out the inodes & dentries of all the elements in the path.
f. The inode of “foo.txt” will give the open function pointer, which leads to the open call specific to the file system driver.
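The walk above can be sketched in a few lines of Python. This is a toy model under loud assumptions: a plain dict keyed by (parent dentry, name) stands in for the dentry hash table, Inode.lookup stands in for the file system's look up callback, and none of the names match real kernel symbols.

```python
class Inode:
    def __init__(self, children=None, open_fn=None):
        self.children = children or {}  # stands in for on-storage directory data
        self.open_fn = open_fn          # file system specific open

    def lookup(self, name):
        """File system specific look up on a directory's contents."""
        return self.children.get(name)


class Dentry:
    def __init__(self, name, parent, inode):
        self.name = name
        self.parent = parent
        self.inode = inode


dcache = {}    # (parent dentry, name) -> Dentry; toy dentry hash table
lookups = []   # names that had to go to the file system (cache misses)


def walk(root_dentry, path):
    """Resolve a path one component at a time, filling the dentry cache."""
    current = root_dentry
    for name in path.strip("/").split("/"):
        key = (current, name)
        dentry = dcache.get(key)
        if dentry is None:
            lookups.append(name)
            inode = current.inode.lookup(name)   # ask the file system
            if inode is None:
                raise FileNotFoundError(path)
            dentry = Dentry(name, current, inode)
            dcache[key] = dentry
        current = dentry
    return current


foo = Inode(open_fn=lambda: "ramfs specific open called")
dir2 = Inode({"foo.txt": foo})
dir1 = Inode({"dir2": dir2})
ramfs = Inode({"dir1": dir1})
mnt = Inode({"ramfs": ramfs})
root = Dentry("/", None, Inode({"mnt": mnt}))

target = walk(root, "/mnt/ramfs/dir1/dir2/foo.txt")
print(target.inode.open_fn())
print(lookups)   # five misses on the first walk
```

Walking the same path a second time finds every component in the cache, so the file system's look up is not called again.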
A file system ported to Linux is expected to populate the fields of VFS data structures like inodes and dentries so that Linux can understand the file attributes and contents and convey them to the user. The obvious differentiating factor across file systems is the set of algorithms which define their device access. How dentries and inodes are represented on and accessed from storage is specific to each file system, and this inherently defines its strengths and weaknesses. In crude terms, a file system in Linux comprises a set of callbacks for managing VFS data structures: an inode data structure with its inode operations, a file pointer data structure with its file operations, a dentry data structure with its dentry operations, and so on.
The crux of a Linux file system is its ability to talk in the Linux language of inodes and dentries. Unless it’s a read only volume, this interpretation needs to be done in reverse too: when the user makes changes to a file, the file system needs to comprehend the Linux talk and translate those changes into whatever representation it keeps on the storage. Undoubtedly, comprehending the Linux VFS demands a deep understanding of kernel data structures, which might mean that a file system writer needs a kernel specific layer in the file system code. This undesirable complexity is greatly reduced by the use of kernel library functions. Functions whose names start with “generic_” can usually be read as such helpers, which abstract the kernel specifics away from a kernel module. They are widely used for file system operations like “read”, “write” and even unmount. The use of generic helper functions within a kernel module can be confusing when studying the code, because they tend to blur the boundary between the kernel and the module. This overlap is a convoluted but extremely effective way of avoiding kernel dependencies.
Image Source : http://wiki.osdev.org/images/e/e5/Vfs_diagram.png
Some philosophy perhaps?
The design thought behind the Linux VFS seems to be one of “choice”. Beyond a bare minimum, the kernel hacker always has a choice regarding the implementation of an interface: he or she can use a generic function, create a custom one or simply set it to NULL. This is an ideal approach for supporting a plethora of widely differing devices and file systems. It is quite visible in the code: ext4 uses the buffer cache abstraction over the page cache, UBIFS uses the page cache sans buffers, and JFFS2 goes directly to the device. Accommodating all these widely varying designs needs a flexible, ‘rule of law’ driven framework where everyone can find their niche and yet not impinge on the free will of another kernel module.
It has been quite some time since the last post. Life has been busy, thanks to the NAND flash chips from Toshiba & Samsung. Ironically enough, their seemingly naive data sheets introduce NAND as an angelic technology: simple protocols, an even simpler hardware interface, and a totally reasonable requirement placed on the driver to correct one bit errors and detect two bit errors (which are not supposed to happen, but for some unknown reason vendors mention this requirement too; if anyone can help me, I would be ecstatic to know why). A touch of complexity is felt only when bad blocks are encountered, which is totally fair considering the cost effectiveness of NANDs.
My initial impression of NAND as a fairly simple, fixed, hassle free storage medium was progressively crushed to shreds during the last one year of NAND torments. I worked only on SLC NANDs from Toshiba & Samsung used on mobile handset platforms (MLC is an unknown inferno to me). I hope that the points below might spare posterity from enduring the same crisis. Always remember to religiously follow the data sheet (henceforth referred to as “the book” ;-) ) for NAND salvation.
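The three choices (generic, custom, NULL) can be modelled with an operations table whose slots may be filled or left empty. This is a user-space Python sketch, not kernel code: generic_read, custom_read and vfs_read are hypothetical stand-ins, and raising EINVAL for an empty slot is an assumption of this model rather than a statement about every kernel interface.

```python
import errno


def generic_read(file):
    """Stand-in for a helper supplied by the framework."""
    return file["data"]


def custom_read(file):
    """Stand-in for a file system's own implementation."""
    return file["data"].upper()


def vfs_read(ops, file):
    """Framework side: dispatch through the slot, or refuse if it is empty."""
    handler = ops.get("read")
    if handler is None:
        raise OSError(errno.EINVAL, "read not supported")
    return handler(file)


FS_GENERIC = {"read": generic_read}   # choice 1: reuse the generic helper
FS_CUSTOM = {"read": custom_read}     # choice 2: bring your own
FS_NO_READ = {"read": None}           # choice 3: leave the slot empty

f = {"data": "hello"}
print(vfs_read(FS_GENERIC, f))        # hello
print(vfs_read(FS_CUSTOM, f))         # HELLO
try:
    vfs_read(FS_NO_READ, f)
except OSError as e:
    print("refused:", errno.errorcode[e.errno])
```

The framework code stays the same for all three file systems; only the table contents differ.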
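For what “correct one bit, detect two bits” looks like in practice, here is a toy extended Hamming code over a single 4 bit value, written in Python. It is only an illustration of the principle: real NAND drivers and controllers compute ECC over whole sectors with the layout the data sheet or controller manual prescribes, and the encode/decode names and bit layout below are assumptions of this sketch. One thing the sketch does show is that a single extra overall parity bit is what lets the same code flag a two bit error instead of silently “correcting” it into wrong data.

```python
def encode(nibble):
    """Encode 4 data bits into an 8 bit extended Hamming codeword.

    Positions 1, 2 and 4 hold Hamming parity, positions 3, 5, 6 and 7
    hold data, and position 0 holds an overall parity bit.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1              # syndrome 0 means bit 0 itself flipped
        status = "corrected"
    else:
        return "uncorrectable", None     # two bits flipped: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_error = list(word)
one_error[5] ^= 1
print(decode(one_error))                 # ('corrected', 11)

two_errors = list(word)
two_errors[5] ^= 1
two_errors[6] ^= 1
print(decode(two_errors))                # ('uncorrectable', None)
```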
- Keep innovative operation sequences for hobby projects.
- Do not try stuff like a NAND reset command while the NAND is busy unless the book clearly explains its effect on read, program and erase operations with a CLEAR timing diagram.
- Do NOT use a read back check to detect bad blocks unless the book mentions it as one of the methods.
- MORAL : Follow ONLY what is written in the book, do not infer or even worse assume.
- Read wear leveling cannot prevent bit errors, nor can an erase refresh solve them.
- I have seen bit errors on a Samsung NAND flash when I executed more partial page writes than the maximum specified for a page, along with multiple partial page reads. Interestingly, the single bit errors refused to disappear even after repeated block erases.
- Any deviation from the strict protocol mentioned in the book can result in manifestation of strange symptoms.
- BTW: Read Wear count is a myth unless it is mentioned in the book.
- MORAL : Symptoms and root causes never have a 1:1 ratio.
- Never go back and correct mistakes within a block.
- Samsung NAND flashes “prohibit” going back to a lower numbered page in a block and reprogramming it (e.g. do not program page 10 after programming page 20 within a block). The effect of such an operation is not documented, so you do not know what symptoms may appear, or in what form.
- Go ahead and question the logic of any file system which does random page programming in a block to mark dirty pages. I already did!
- MORAL : Do not question what the book says, just blindly follow it.
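Two of the rules above (never program a lower numbered page after a higher one, never exceed the partial page program limit) are easy to enforce in software before a command ever reaches the chip. The Python sketch below shows one way a driver-side guard could look. The class name, the 64 pages per block and the limit of 4 partial programs are placeholders of this sketch; the real numbers must come from the book.

```python
class BlockProgramGuard:
    """Tracks programming history of one block and rejects illegal requests."""

    def __init__(self, pages_per_block=64, max_partial_programs=4):
        # Both defaults are placeholders; take the real values from the data sheet.
        self.pages_per_block = pages_per_block
        self.max_partial_programs = max_partial_programs
        self.erase()

    def erase(self):
        """A block erase resets the bookkeeping for every page."""
        self.highest_page = -1
        self.program_count = [0] * self.pages_per_block

    def check_program(self, page):
        """Call before issuing a program command; raises if it breaks a rule."""
        if not 0 <= page < self.pages_per_block:
            raise ValueError("page %d out of range" % page)
        if page < self.highest_page:
            raise ValueError("page %d is below already programmed page %d"
                             % (page, self.highest_page))
        if self.program_count[page] >= self.max_partial_programs:
            raise ValueError("page %d reached the partial program limit" % page)
        self.program_count[page] += 1
        self.highest_page = page


guard = BlockProgramGuard()
guard.check_program(0)
guard.check_program(20)
try:
    guard.check_program(10)          # going back inside the block
except ValueError as e:
    print("rejected:", e)
guard.erase()
guard.check_program(10)              # fine again after an erase
```

A guard like this only catches protocol violations made by your own software; it says nothing about what the chip does once a violation has already happened.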
I intend to keep posting the dogmas I discovered during interactions with NAND.