Filesystem Networking Performance Security Architecture Infrastructure Syscalls Features Tools Manifest Data Structures

The Book

Some things in Nanos are set in stone and others are not. In general security and performance are top of mind and we abide by KISS principles.

Quick FYI: This site is mainly for Nanos specific information. If you are an end-user and you just want more "getting started" docs please check out the DOCS on OPS.CITY which are substantial.

This site is a WIP (work in progress).

Filesystem Networking Performance Security Architecture Infrastructure Syscalls Features Tools Manifest Data Structures

Filesystem

The filesystem currently used by Nanos is TFS. Nanos isn't opposed to other file systems but hasn't identified a large need yet either. As with most of these sections if your team requires different filesystem support please reach out to the NanoVMs team for a support subscription.

For more info on the TFS filesystem.

Nanos supports the following storage drivers:
virtio_blk
virtio_scsi
pvscsi
nvme
ata_pci
storvsc
xenblk

Storvsc is used on Hyper-v/Azure. The drivers virtio_blk and virtio_scsi are used in QEMU/KVM cloud-based like AWS, GCE, Vultr, Digital Ocean and Oracle cloud. The xenblk driver supports the xenblk device in Xen. The ata_pci driver is supported in QEMU/KVM. The pvscsi, which is the driver for VMware paravirtualized scsi devices, is used in ESX instances. The driver for nvme can be used in clouds like GCE, AWS, Digital Ocean and Vultr.

Networking

Nanos supports both IPV4 and IPV6. For more information on configuring things like VPCs, firewalls and the like please consult the OPS networking config pages for your specific cloud.

Performance

Requests/Second

Not a lot of benchmarking and tuning has been done yet, however, there is plenty of potential. Currently, our naive tests can push 2X the amount of requests/second for Go webservers. This website is hosted on a Go webserver running a recent 0.1.27 version of Nanos. We've also seen up to 3X improvements on AWS.

Boot Time

Using the method described here with an un-stripped kernel we are seeing boot times of ~195ms from top of MBR bootloader to userspace frame return. However if you use a stripped kernel we can see 72ms. It should be noted that both infrastructure provider and application payload are going to significantly alter your boot time. For instance booting under firecracker with your own hardware is going to vastly out-perform booting on Azure. Likewise, booting a small c webserver is going to be much faster than booting a full blown Rails or JVM application. In short, the biggger the filesystem payload expect to pay more in boot time.

Minimum Memory Utilization

Currently the minimum memory utilization we have seen operate under the following conditions:

Qemu:	C 26 MB	Go 40 MB
Firecracker:	C 24 MB	Go 37 MB
VirtualBox:	C 24 MB	Go 36 MB

Security

Nanos has an opionated view of security. Users and their associated permissions are not supported. Nanos is also a single process (but multi-threaded) system. This means there is no support for SSH, shells or any other interactive multiple command/program running. While this prevents quite a few security issues extra precaution should be taken for things such as RFI style attacks. For instance you wouldn't want to leak your SSL private key or database credentials.

Similarily, just cause you can't create a new process doesn't mean an attacker couldn't inject their process.

Nanos employs various forms of security measures found in other general purpose operating systems including ASLR and respects page protections that compilers produce.

Nanos, unlike other general purpose operating systems, only provision what is necessary on the filesystem to run an application so most filesystems will have a few to maybe 10 libraries and many applications might have filesystems with only a handful of files on them.

Nanos's kernel lives on a different partition and is separated from the user-viewable partition. Nanos goes further with the idea of exec protection with an optional exec_protection flag available in the manifest. When this is enabled the application cannot modify the executable files and cannot create new executable files. For further information check out this PR.

For more info: more info
Nanos reduces its attack surface through a variety of thrusts. Compared to a normal Ubuntu or Debian instance has multiple orders of magnitude less lines of code, libraries only that are needed by an application and thousands of less executables - in fact it only can run one.
Nanos employs the following:

ASLR:

Stack Randomization
Heap Randomization
Library Randomization
Binary Randomization

Page Protections:

Stack Execution off by Default
Heap Execution off by Default
Null Page is Not Mapped
Stack Cookies/Canaries
Rodata no execute
Text no write

Architecture-level Security Technologies:

Supervisor Mode Execution Protection (SMEP)
User Mode Instruction Prevention (UMIP)

In addition to storing global constants in read-only pages at link time, Nanos gathers globals that should only be set at initialization time and stores them in a special section. The pages of this section are then rendered immutable after initialization is complete.

Code pages in Nanos have writes disabled, and execution of code in stack pages is forbidden. In the few cases that Nanos generates executable instructions (for vdso and vsyscall use), such code is similarly protected from writes. Closures do not contain jumps or other executable instructions.

Nanos has an 'exec protection' feature that prevents the kernel from executing any code outside the main executable and other 'trusted' files explicitly marked. The program is further limited from modifying the executable file and creating new ones. This flag may also be used on individual files within the children tuple. This prevents the application from exec-mapping anything that is not explicitly mapped as executable. This is not on by default, however, as many JITs won't work with it turned on. You can turn this on via the ops flag:

{
  "ManifestPassthrough": {
    "exec_protection": "t"
  }
}

Architecture

Currently Nanos supports X86-64, ARM64 (specifically for rpi4, graviton 2/3 instances and has SMP) and has limited RISC-V support.

The POWER family of architectures have been asked for but so far there is no roadmap for it. If you are interested in getting that sooner reach out to the NanoVMs team.

Nanos is always deployed as a guest VM directly on top of a hypervisor. Unlike Linux that runs many different applications on top of it Nanos molds the system and application into one discrete unit. Unlike Containers that duplicate storage and networking layers with an orchestrator in between Linux and the application Nanos relies on the native storage and networking layers present in the hypervisor of choice.

Infrastructure

Nanos can currently deploy to the following public cloud providers:

→ Google Cloud
→ Amazon Web Services
→ Digital Ocean
→ Vultr
→ Microsoft Azure
→ Oracle Cloud
→ UpCloud

→ OpenStack

→ ProxMox

Nanos can also deploy to the following hypervisors:

→ KVM
→ Xen
Bhyve

→ ESX
FireCracker
VirtualBox
→ Hyper-V

Nanos can even run on K8S.

Syscalls

Supported:

socket
bind
listen
accept
accept4
connect
sendto
sendmsg
sendmmsg
recvfrom
recvmsg
setsockopt
getsockname
getpeername
getsockopt
shutdown
futex
clone
arch_prctl
set_tid_address
gettid
timerfd_create
timerfd_gettime
timerfd_settime
timer_create
timer_settime
timer_gettime
timer_getoverrun
timer_delete
getitimer
setitimer
alarm
mincore
mmap
mremap
msync
munmap
mprotect
epoll_create
epoll_create1
epoll_ctl
poll
ppoll
select
pselect6
epoll_wait
epoll_pwait
read
pread64
write
pwrite64
open
openat
dup
dup2
dup3
fstat
fallocate
fadvise64
sendfile
stat
lstat
readv
writev
truncate
ftruncate
fdatasync
fsync
sync
syncfs
io_setup
io_submit
io_getevents
io_destroy
access
lseek
fcntl
ioctl
getcwd
symlink
symlinkat
readlink
readlinkat
unlink
unlinkat
rmdir
rename
renameat
renameat2
close
sched_yield
brk
uname
getrlimit
setrlimit
prlimit64
getrusage
getpid
exit_group
exit
getdents
getdents64
mkdir
mkdirat
getrandom
pipe
pipe2
socketpair
eventfd
eventfd2
creat
chdir
fchdir
utime
utimes
newfstatat
sched_getaffinity
sched_setaffinity
capget
prctl
sysinfo
umask
statfs
fstatfs
io_uring_setup
io_uring_enter
io_uring_register
kill
pause
rt_sigaction
rt_sigpending
rt_sigprocmask
rt_sigqueueinfo
rt_tgsigqueueinfo
rt_sigreturn
rt_sigsuspend
rt_sigtimedwait
sigaltstack
signalfd
signalfd4
tgkill
tkill
clock_gettime
clock_nanosleep
gettimeofday
nanosleep
time
times
inotify_init
inotify_add_watch
inotify_rm_watch
inotify_init1

unsupported:

shmget
shmat
shmctl
fork
vfork
execve
wait4, syscall_ignore);
semget
semop
semctl
shmdt
msgget
msgsnd
msgrcv
msgctl
flock, syscall_ignore);
link
chmod, syscall_ignore);
fchmod, syscall_ignore);
fchown, syscall_ignore);
lchown, syscall_ignore);
ptrace
syslog
getgid, syscall_ignore);
getegid, syscall_ignore);
setpgid
getppid
getpgrp
setsid
setreuid
setregid
getgroups
setresuid
getresuid
setresgid
getresgid
getpgid
setfsuid
setfsgid
getsid
mknod
uselib
personality
ustat
sysfs
getpriority
setpriority
sched_setparam
sched_getparam
sched_setscheduler
sched_getscheduler
sched_get_priority_max
sched_get_priority_min
sched_rr_get_interval
mlock, syscall_ignore);
munlock, syscall_ignore);
mlockall, syscall_ignore);
munlockall, syscall_ignore);
vhangup
modify_ldt
pivot_root
_sysctl
adjtimex
chroot
acct
settimeofday
mount
umount2
swapon
swapoff
reboot
sethostname
setdomainname
iopl
ioperm
create_module
init_module
delete_module
get_kernel_syms
query_module
quotactl
nfsservctl
getpmsg
putpmsg
afs_syscall
tuxcall
security
readahead
setxattr
lsetxattr
fsetxattr
getxattr
lgetxattr
fgetxattr
listxattr
llistxattr
flistxattr
removexattr
lremovexattr
fremovexattr
set_thread_area
io_cancel
get_thread_area
lookup_dcookie
epoll_ctl_old
epoll_wait_old
remap_file_pages
restart_syscall
semtimedop
clock_settime
vserver
mbind
set_mempolicy
get_mempolicy
mq_open
mq_unlink
mq_timedsend
mq_timedreceive
mq_notify
mq_getsetattr
kexec_load
waitid
add_key
request_key
keyctl
ioprio_set
ioprio_get
migrate_pages
mknodat
fchownat, syscall_ignore);
futimesat
linkat
fchmodat, syscall_ignore);
faccessat
unshare
set_robust_list
get_robust_list
splice
tee
sync_file_range
vmsplice
move_pages
utimensat
preadv
pwritev
perf_event_open
recvmmsg
fanotify_init
fanotify_mark
name_to_handle_at
open_by_handle_at
clock_adjtime
setns
getcpu
process_vm_readv
process_vm_writev
kcmp
finit_module
sched_setattr
sched_getattr
seccomp
memfd_create
kexec_file_load
bpf
execveat
userfaultfd
membarrier
mlock2, syscall_ignore);
copy_file_range
preadv2
pwritev2
pkey_mprotect
pkey_alloc
pkey_free

Features

→ -d strace
→ ftrace
→ http server dump

Tools

Several tools are packaged inside Nanos: mkfs

➜  ~ ~/.ops/0.1.27/mkfs -help
/Users/eyberg/.ops/0.1.27/mkfs: illegal option -- h
Usage:
mkfs [options] image-file < manifest-file
mkfs [options] -e image-file
Options:
-b boot-image   - specify boot image to prepend
-k kern-image   - specify kernel image
-r target-root  - specify target root
-s image-size   - specify minimum image file size; can be expressed in
bytes, KB (with k or K suffix), MB (with m or M suffix), and GB (with g
or G suffix)
-e              - create empty filesystem

dump

➜  ~ ~/.ops/0.1.27/dump
Usage: dump [OPTION]... 
Options:
  -d        Copy filesystem contents from  into 
  -t                    Display filesystem from  as a tree

There are also development tools available such as plugins for various editors: OPS for Visual Studio
IntelliJ

Manifest

The nanos manifest is an extremely powerful tool as it comes with many different flags and is the synthesis of a filesystem merged with various settings. Most users will never craft their own manifests by hand, opting to use OPS to craft it automatically.

→ futex_trace
→ debugsyscalls
→ fault
→ exec_protect

Data Structures

Nanos uses a variety of internal data structures. This is only a partial list.

Bitmaps

A bitmap is an array of bits to store binary variables. A bitmap is represented with the struct bitmap (see bitmap.h):

typedef struct bitmap {
  u64 maxbits;
  u64 mapbits;
  heap meta;
  heap map;
  buffer alloc_map;
} *bitmap;

maxbits is the number of bits that the bitmap contains
mapbits is the number of bits that has been allocated for this bitmap which is rounded up to the nearest 64 bits
meta is the heap to allocate the bitmap structure
map is the heap to allocate the bitmap buffer
alloc_map is the buffer that contains the actual bitmap

The bitmap length may be arbitrarily sized. The bitmap buffer is allocated in ALLOC_EXTEND_BITS / 8 byte increments as needed. In the following, we present the functions that manipulate bitmaps.

Instantiation

allocate_bitmap allocates and initializes a bitmap structures. The function is defined as follows:

bitmap allocate_bitmap(heap meta, heap map, u64 length)
{
  bitmap b = allocate_bitmap_internal(meta, length);
  if (b == INVALID_ADDRESS)
    return b;
  u64 mapbytes = b->mapbits >> 3;
  b->map = map;
  b->alloc_map = allocate_buffer(map, mapbytes);
  if (b->alloc_map == INVALID_ADDRESS)
    return INVALID_ADDRESS;
  zero(bitmap_base(b), mapbytes);
  buffer_produce(b->alloc_map, mapbytes);
  return b;
}

This function begins by allocating the memory for the bitmap structure by using the heap at meta. This is done by the function allocate_bitmap_internal. The function initializes the heap used by the bitmap to meta. It sets mapbits to lenght by padding to the nearest 64 bit multiple. mapbits can't be greater than ALLOC_EXTEND_BITS or 4096 bits. maxbits is initialized to lenght. The function allocates the memory for the actual bitmap. At line 6, mapbytes contains the size of the bitmap in bytes. In line 8, the function allocate_buffer gets a buffer of size mapbytes. In line 11, this chunk is filling with zeros, and in line 12, the buffer is created. This is needed because the buffer buffer structure tracks the start and end points of the data in the allocated memory. While allocate_buffer allocates the required memory, buffer_produce moves the end pointer to mark that memory as used by the buffer and not available to grow into. Note that the allocation of a bitmap relies on different memory allocators (heaps) optimized for different tasks. One heap is using for metadata, i.e., meta, and one for data, i.e., map. If success, the function returns a pointer to a bitmap structure; otherwise, it returns INVALID_ADDRESS.

bitmap bitmap_clone(bitmap b)

bitmap_clone can be used to allocate a new bitmap from an existing one. The function returns a bitmap which is a copy of the bitmap at b.

To release a bitmap, the caller uses the function deallocate_bitmap, which is defined as follows:

void deallocate_bitmap(bitmap b)
  {
    if (b->alloc_map)
    deallocate_buffer(b->alloc_map);
    deallocate(b->meta, b, sizeof(struct bitmap));
  }

The function first releases the buffer that contains the bitmap, and then, it releases the bitmap structure.

bitmap bitmap_wrap(heap h, u64 * map, u64 length)
void bitmap_unwrap(bitmap b)

bitmap_wrap creates a bitmap structure from an already existing bitmap. This function reuses the bitmap at map so no allocation of a new bitmap is required.

bitmap_unwrap releases the wrapped buffer by releasing the buffer structure and the bitmap structure.

Accessors

u64 bitmap_alloc(bitmap b, u64 size);
u64 bitmap_alloc_within_range(bitmap b, u64 nbits, u64 start, u64 end);

bitmap_alloc searches a bitmap for a range of nbits consecutive bits that are all cleared, and if such a range is found, sets all bits in the range and returns the first bit of the range; if not found, returns INVALID_PHYSICAL. The function bitmap_alloc_within_range behaves similarly than bitmap_alloc but it takes as parameter a range in which the region is searched.

boolean bitmap_dealloc(bitmap b, u64 bit, u64 size);

bitmap_dealloc checks that the range of size bits starting from bit is set in the bitmap b; if this is true, the function clears all the bits in that range and returns true, otherwise it returns false.

static inline void bitmap_set(bitmap b, u64 i, int val);
static inline void bitmap_set_atomic(bitmap b, u64 i, int val);
static inline int bitmap_test_and_set_atomic(bitmap b, u64 i, int val);

bitmap_set changes the value of the ith bit in the bitmap b depending on the value of val. If val is greater than zero, the bit is set; otherwise, the bit is cleared. The function bitmap_set_atomic sets a bit by using atomic operations. bitmap_test_and_set_atomic set or clear a bit depending on val. If true, the bit at i is set; otherwise, the bit is clear. This function is atomic.

static inline boolean bitmap_get(bitmap b, u64 i);

bitmap_get returns the value of the i bit from the bitmap b. The following macros are used to iterate over the elements of a bitmap:

#define bitmap_foreach_word(b, w, offset)				\
  for (u64 offset = 0, * __wp = bitmap_base(b), w = *__wp;		\
  offset < (b)->mapbits; offset += 64, w = *++__wp)

#define bitmap_word_foreach_set(w, bit, i, offset)			\
  for (u64 __w = (w), bit = lsb(__w), i = (offset) + (bit); __w;      \
  __w &= ~(1ull << (bit)), bit = lsb(__w), i = (offset) + (bit))

#define bitmap_foreach_set(b, i)					\
  bitmap_foreach_word((b), _w, s) bitmap_word_foreach_set(_w, __bit, (i), s)

The macro bitmap_foreach_word is used to walk a bitmap in 64 bits chunks by starting from offset. The macro bitmap_word_foreach_set is used to walk a chunk of 64 bits. The macro bitmap_foreach_set is used to walk a bitmap. For example, the following code uses bitmap_foreach_set to walk the bitmap and checks if it has been correctly initialized:

bitmap b = allocate_bitmap(h, h, 4096);
bitmap_foreach_set(b, i) {
  if (i) {
    msg_err("!!! allocation failed for bitmap\n");
    return NULL; 
  }
}

void bitmap_copy(bitmap dest, bitmap src)

bitmap_copy copies the bitmap at src into the bitmap at dest. In case src is larger than dst, i.e., src.maxbits > dst.maxbits, dst is extended. The extended bits are cleared.

boolean bitmap_range_check_and_set(bitmap b, u64 start, u64 nbits, boolean validate, boolean set);

bitmap_range_check_and_set sets a number of nbits by starting from start with the value at set. if validate is true, the function checks whether all bits in the supplied bit range are cleared in the bitmap before attempting to set them; otherwise, this check is skipped and the bits are set regardless of whether they were already set.

u64 bitmap_range_get_first(bitmap b, u64 start, u64 nbits)

bitmap_range_get_first returns the first bit set in a given range, or INVALID_PHYSICAL if no bits are set.

Red/Black Trees

A rbtree (see rbtree.h) is represented with the following structure:

typedef struct rbtree {
  rbnode root;
  u64 count;
  rb_key_compare key_compare;
  rbnode_handler print_key;
  heap h;
} *rbtree

root is a pointer to the root node of tree
count is a counter of the number of nodes in the tree
print_key is a function to print nodes
key_compare is the comparator function to compare nodes
head h is the heap to allocate the nodes

Each node in a rbtree is defined as follows:

typedef struct rbnode *rbnode;
struct rbnode {
  word parent_color;           /* parent used for verification */
  rbnode c[2];
};

parent_color is the color of the parent
c[2] is an array of rbnodes that identifies each child

The color of a node is defines as follows:

#define black 0
#define red 1

Instantiation

Rbtrees are instantiated by using the function allocate_rbtree():

rbtree allocate_rbtree(heap h, rb_key_compare key_compare, rbnode_handler print_key);

The function returns a pointer to a new rbtree. When defining a new rbtree, we require to define the comparator function, e.g., thread_tid_compare, and the function to print keys, e.g., tid_print_key. For example, the following code shows the creation of the rbtree that keeps all the threads of the system:

p->threads = allocate_rbtree(h, closure(h, thread_tid_compare), closure(h, tid_print_key)

We use the macros closure and closure_function for the definition of the functions:

closure_function(0, 2, int, thread_tid_compare,
                 rbnode, a, rbnode, b)
{
  thread ta = struct_from_field(a, thread, n);
  thread tb = struct_from_field(b, thread, n);
  return ta->tid == tb->tid ? 0 : (ta->tid < tb->tid ? -1 : 1);
}
  closure_function(0, 1, boolean, tid_print_key,
                 rbnode, n)
{
  rprintf(" %!d(MISSING)", struct_from_field(n, thread, n)->tid);
  return true;
}

A rbtree can be initialized by using the function init_rbtree():

void init_rbtree(rbtree t, rb_key_compare key_compare, rbnode_handler print_key);

This function is used in the same way than allocate_rbtree():

init_rbtree(&rm->t, closure(h, rmnode_compare), closure(h, print_key));

Accesors

The insertion of a node is done by the function rbtree_insert_node():

boolean rbtree_insert_node(rbtree t, rbnode n);

This function returns true if the rbtree has no root node so node is inserted in the root position; it returns false otherwise. The insertion of a node is based on the the comparator function. The traverse of a rbtree is done by the function rbtree_traverse():

#define RB_INORDER 0
#define RB_PREORDER 1
#define RB_POSTORDER 2
boolean rbtree_traverse(rbtree t, int order, rbnode_handler rh);

This function takes as parameters the rbtree to traverse, the order in which the rbtree is traversed, e.g., INORDER, and the function called for each node. The function rbtree_lookup() is used to look for a node. The function uses the comparator over all the tree's nodes to compare with the rbnode k.

rbnode rbtree_lookup(rbtree t, rbnode k);

If success, the function returns a pointer to a node; or INVALID_ADDRESS otherwise. The function rbtree_remove_by_key() is used to remove a node by a key:

boolean rbtree_remove_by_key(rbtree t, rbnode k);

When success, the function returns true; or false otherwise. The function rbtree_dump() is used to dump a tree. The function is defined as follows:

void rbtree_dump(rbtree t, int order);

The function gets as parameters the rbtree and the order in which the tree is traversed. The function remove_min() allows to get the node with the minimum key:

static rbnode remove_min(rbnode h, rbnode *removed)

The functions rbnode_get_prev() and rbnode_get_next() are meant to get the next and the previous node from a given key:

rbnode rbnode_get_prev(rbnode h);
rbnode rbnode_get_next(rbnode h);

These functions return INVALID_ADDRESS when the node is not found.

Releasing

The function destruct_rbtree() is used to release each node in a rbtree. The function is defined as follows:

void destruct_rbtree(rbtree t, rbnode_handler destructor);

The rbtree is traversed and the destructor is called for each node. The function resets the root node of the rbtree. The function does not release the rbtree itself. The function deallocate_rbtree() is used to release all the nodes and the rbtree. This function is defined as follows:

void deallocate_rbtree(rbtree rb, rbnode_handler destructor);

The function first call destruct_rbtree() and then it releases the rbtree structure.

ID Heap

Nanos allows developers to define different memory allocators named heaps optimized for different tasks. Rather than having a couple of heap allocation functions like Linux's kmalloc or vmalloc, Nanos uses a heap interface with allocate and deallocate function pointers that can be different for each heap structure. This results in a very flexible memory allocation. Developers can create hierarchies of heaps, or trivially change one heap to a special debug heap to help with troubleshooting. This section presents the API to manipulate Id Heaps (see id.h). These are general purpose allocators that carve up number space from a set of ranges that may be specified on initialization.

Instantiation

To create an Id heap, we use the function create_id_heap():

id_heap create_id_heap(heap meta, heap map, u64 base, u64 length, bytes pagesize, boolean locking);

meta is the heap to allocate the metadata for the new heap
map is the heap to allocate a bitmap to store the used elements
base is the starting address of the chunk that is allocated
length is the size of the chunk that is allocated
pagesize is the minimum size of the allocations
locking is a boolean variable that indicates that the heap may be accessed concurrently so accesses must be protected by using locking mechanisms

The function returns an id_heap that can be used to allocate chunks of memory by using the function allocate():

void * addr = allocate(id, PAGE_SIZE);

This returns a pointer to a chunk of memory of PAGE_SIZE size. The size of the allocation must be greater or equal to the page size that is defined when the heap is created. This is not to be confused with the system page size, although allocators meant to serve allocations in page-sized units do indeed have a pagesize value equivalent to the system page size.

The function create_id_heap_backed() allows developers to create hierarchical heaps:

id_heap create_id_heap_backed(heap meta, heap map, heap parent, bytes pagesize, boolean locking)

The function requires the heap parent argument from which the new heap goes to allocate new ranges of address space with allocations being a minimum of the parent heap page size.

Releasing

To de-allocate a chunk of memory, we use the deallocate() function:

deallocate(id, addr, PAGE_SIZE);

We use the function named destroy_heap(id) to release all the resources allocated by the heap.

Page-backed Heap

The page-backed heap allocates in page-size units (see here). A page-backed heap is represented with the page_backed_page structure:

typedef struct page_backed_heap {
    struct backed_heap bh;
    heap physical;
    heap virtual;
    struct spinlock lock;
} *page_backed_heap;

Each address space is represented with a different heap: the physical and the virtual heap. The function allocates from both heaps an creates a mapping between them. The page-backed heap is required for individual pages that need different protection flags or otherwise need to have mappings manipulated for some reason.

Instantiation

The instantiation of a page_backed heap is done by the function:

backed_heap allocate_page_backed_heap(heap meta, heap virtual, heap physical, u64 pagesize, boolean locking)

The function takes three heaps as arguments: one for the meta, one for the virtual address and one for the physical. In this heap, the pagesize corresponds with the page-size units of the architecture.

The page_backed_alloc_map() function is used to allocate a number of bytes from the page-backed heap:

static inline void *page_backed_alloc_map(backed_heap bh, bytes len, u64 *phys)

The number of bytes in len are allocated from both heaps. The function also maps the physical address into the virtual address. In phys, It returns the pointer to the physical address. The function returns a pointer to the virtual address.

The page_backed_alloc_map_locking() function is used when mutual exclusion is required during allocation. The function is declared as follows:

void *page_backed_alloc_map_locking(backed_heap bh, bytes len, u64 *phys)

Releasing

The page_backed_dealloc() function is used for the deallocation. It is based on page_backed_dealloc_unmap() for the unmapping between the virtual and the physical address. The function is declared as follows:

void page_backed_dealloc(heap h, u64 x, bytes length)

When mutual exclusion is required during deallocation, developers can use the page_backed_dealloc_locking() function:

static void page_backed_dealloc_locking(heap h, u64 x, bytes length)

Linear-backed Heap

The linear-backed heap takes advantage of the largest available pagesize on the architecture. The heap maps all physical memory that is allocated and thus memory can be accessed through the linear mapping without the need to set up a mapping for each allocation. A linear-backed heap is represented with the linear_backed_page structure (see here):

typedef struct linear_backed_heap {
    struct backed_heap bh;
    heap meta;
    id_heap physical;
    bitmap mapped;
} *linear_backed_heap;

The physical is the id heap from where the physical chunks are allocated.

Instantiation

The instantiation of a linear-backed heap is done by using the allocate_linear_backed_heap() function:

backed_heap allocate_linear_backed_heap(heap meta, id_heap physical)

The first parameter corresponds with the heap that is used for bitmap allocation. The second parameter corresponds with the id_heap for the physical memory allocation. The function returns a backed_heap structure. During the allocation of the linear-backed heap, the whole physical heap is mapped by relying on the function linear_backed_init_maps(). This function iterates through the phys id heap ranges and adds the required mappings. The physical memory is mapped at a fixed offset for all memory supported by the linear backed heap. This is an advantage over the page-backed heap because a lookup within page table memory is not necessary.

The linear_backed_alloc() function allocates size bytes:

static u64 linear_backed_alloc(heap h, bytes size)

This function returns the physical address of the chunk.

The linear_backed_alloc_map() allocates size bytes and maps it. The function returns the physical address of the new allocation at phys:

static void *linear_backed_alloc_map(backed_heap bh, bytes len, u64 *phys)

Releasing

The deallocation is done by using the linear_backed_dealloc() function:

static void linear_backed_dealloc(heap h, u64 x, bytes size)

The linear_backed_dealloc_unmap() function deallocates and unmaps a chunk of memory of len bytes which is located at the *virt virtual address and the phys physical address:

static void linear_backed_dealloc_unmap(backed_heap bh, void *virt, u64 phys, bytes len)