OpenStack users deploying Ceph for block (Cinder) and object (S3/Swift) storage are unsurprisingly looking at Manila and CephFS to round out a unified storage solution. Ceph is built on a low-level object storage layer called RADOS that serves as the foundation for its object, block, and file services. Manila's file-share-as-a-service model in OpenStack enables a range of container-based use cases with Docker and Kubernetes, but a variety of deployment architectures are possible.
This talk covers the current state of CephFS support in Manila: what is already upstream, what is work in progress, a status update on CephFS itself (including new multi-tenancy support to facilitate cloud deployments), and a discussion of how all of this impacts container deployment scenarios in an OpenStack cloud.
1. THE STATE OF CEPH, MANILA,
AND CONTAINERS IN OPENSTACK
SAGE WEIL
OPENSTACK SUMMIT TOKYO - 2015.10.28
2. 2
OUTLINE
● CephFS
● CephFS status update
● Current Manila landscape
● CephFS native driver
● Better FS plumbing to VMs
● Manila vs Nova responsibilities
● Manila and containers
● Summary
4. 4
WHY USE FILE IN THE CLOUD?
Why file?
● File-based applications aren't
going away
– POSIX is the lingua franca
● Interoperability with other
storage systems and data sets
● Container “volumes” are file
systems
– probably just directories
● Permissions and directories are
useful concepts
Why not block?
● Block is not useful for sharing data
between hosts
– ext4, XFS, etc. assume exclusive
access
● Block devices are not very elastic
– File volumes can grow or shrink
without administrative resizing
5. 5
WHY CEPH?
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● Move beyond legacy approaches
– client/cluster instead of client/server
– avoid ad hoc approaches to HA
6. 6
CEPH COMPONENTS
RGW
A web services gateway
for object storage,
compatible with S3 and
Swift
LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
A software-based, reliable, autonomous, distributed object store comprised of
self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD
A reliable, fully-distributed
block device with cloud
platform integration
CEPHFS
A distributed file system
with POSIX semantics and
scale-out metadata
management
7. 7
CEPHFS DISTRIBUTED FILE SYSTEM
[Diagram: a Linux host running the CephFS kernel module sends metadata operations to the MDSs and reads/writes file data directly in the RADOS cluster]
● CephFS
– scalable data: files are stored directly in RADOS
– scalable metadata: cluster of metadata servers (MDS)
– POSIX: drop-in replacement for any local or network file system
● Multiple clients
– Linux kernel
– ceph-fuse
– libcephfs.so (Samba, Ganesha, Hadoop)
8. 8
CEPHFS DYNAMIC SUBTREE PARTITIONING
[Diagram: the directory hierarchy is partitioned from the root into subtrees served by MDS 0–4; a busy directory is fragmented across many MDSs]
● Scalable
– Arbitrarily partition hierarchy
– 10s to 100s of MDSs
● Adaptive
– Move load from busy to idle servers
– Replicate hot metadata on multiple nodes
9. 9
OTHER GOODIES
● Strongly consistent / coherent client caches
● Recursive accounting
– a directory's reported size is the total amount of data stored beneath it
● Snapshots
– on any directory
● Directory quotas (libcephfs/ceph-fuse only)
– limit by bytes or file count
● xattrs
● ACLs
● Client-side persistent cache (Kernel client only)
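These features are reachable through ordinary file operations and CephFS's virtual xattrs. A rough illustration, assuming a CephFS mount at the hypothetical path /mnt/cephfs (quota xattrs are honored by libcephfs/ceph-fuse clients, as noted above):

    # recursive accounting: virtual xattrs report totals for a whole subtree
    getfattr -n ceph.dir.rbytes /mnt/cephfs/projects
    getfattr -n ceph.dir.rfiles /mnt/cephfs/projects

    # snapshot any directory by creating an entry in its .snap directory
    mkdir /mnt/cephfs/projects/.snap/before-upgrade

    # quotas: limit a directory by bytes or by file count
    setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/projects   # 100 GiB
    setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/projects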
11. 11
ROAD TO PRODUCTION
● Focus on resilience
– handle errors gracefully
– detect and report issues
– provide recovery tools
● Achieve this first with a single-MDS configuration
● CephFS as dog food
– use CephFS internally to run our QA infrastructure
– have found (and fixed) several hard to reproduce client bugs
● “Production-ready” CephFS with Jewel release (Q1 2016)
12. 12
WHAT IS NEW, WORK IN PROGRESS
● Improved health checks for diagnosing problems
– misbehaving clients, OSDs
● Diagnostic tools
– visibility into MDS request processing
– client session metadata (who is mounting what from where)
● Full space handling
● Client management
– evict misbehaving or dead clients
● Continuous verification
– online scrubbing of metadata
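For example, client sessions can be inspected and misbehaving clients evicted via the MDS admin socket; a sketch (the MDS name and client id are placeholders, and exact command names vary by release):

    # who is mounting what, from where
    ceph daemon mds.a session ls

    # evict a misbehaving or dead client by its session/client id
    ceph daemon mds.a session evict 4305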
13. 13
FSCK AND REPAIR
● Repair tools
– loss of data objects (which files are damaged)
– loss (or corruption) of metadata objects (which subtrees are damaged)
● cephfs-journal-tool
– disaster recovery for damaged MDS journal
– repair damaged journal
– recover metadata from damaged or partial journal
● cephfs-table-tool
– adjust/repair/reset session, inode, snap metadata
● cephfs-data-scan
– rebuild metadata (directory hierarchy) from data objects (disaster recovery)
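Roughly, a disaster-recovery session with these tools looks like the following (the data pool name is a placeholder; consult the release's documentation before running any of this against a real file system):

    # inspect the MDS journal and recover what can be recovered from it
    cephfs-journal-tool journal inspect
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset

    # reset per-client session table state
    cephfs-table-tool all reset session

    # rebuild directory metadata by scanning the objects in the data pool
    cephfs-data-scan scan_extents cephfs_data
    cephfs-data-scan scan_inodes cephfs_data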
14. 14
ACCESS CONTROL
● Path-based
– restrict client mount to a subdirectory (e.g., /volumes/foo or /home/user)
– implemented in MDS
– integration with RADOS namespaces is WIP (targeting Jewel)
– (Jashan from GSoC)
● User-based
– mount file system as a single user (or small set of users)
– UID and GID based
– implement Unix-like permission checks at the MDS
– eventual integration with external auth/auth frameworks (e.g., Kerberos/AD)
– (Nishtha from Outreachy)
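A sketch of what a path-restricted client key looks like (the client name, pool, and path are placeholders; the exact cap syntax is still settling as this lands upstream):

    # create a key that can only mount and modify the subtree /volumes/foo
    ceph auth get-or-create client.foo \
        mon 'allow r' \
        mds 'allow rw path=/volumes/foo' \
        osd 'allow rw pool=cephfs_data'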
16. 16
MANILA FILE STORAGE
● Manila manages file volumes (“shares”)
– create/delete, share/unshare
– file server network connectivity
– snapshot management
● Caveats, awkward bits
– Manila also manages (only) part of the connectivity problem
● somewhat limited view of options (network file protocols only)
● manages “share networks” via Neutron
– User has responsibility for the “last mile”
● user must attach guest to share network
● user must mount the share (mount -t …)
● mount mechanism varies with storage type and/or hypervisor (NFS or CIFS)
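For context, the user-facing workflow today looks roughly like this (share name, size, and client network are examples); the last two steps are the "last mile" the user still owns:

    manila create NFS 1 --name myshare
    manila access-allow myshare ip 10.0.0.0/24
    manila show myshare        # export_location tells the user what to mount
    mount -t nfs $EXPORT_LOCATION /mnt/myshare    # run from inside the guest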
17. 17
APPLIANCE DRIVERS
● Appliance drivers
– tell an appliance to export NFS to guest IP
– map appliance IP into tenant network
(Neutron)
– boring (closed, proprietary, expensive, etc.)
● Status
– several drivers from usual suspects
– security punted to vendor
[Diagram: Manila drives the appliance; the tenant VM's NFS client mounts the export over a Neutron network]
18. 18
GENERIC SHARE DRIVER
● Model
– Cinder volume attached to service VM
– local file system (XFS, ext4, btrfs, ...)
– Ganesha NFS server
– Neutron network shared with tenant
● Pros
– built from existing components
– tenant isolation, security
● Cons
– extra hop → higher latency
– service VM consumes resources
– service VM is SPoF
● Status
– reference driver
[Diagram: Manila and Cinder provision a service VM (KVM) running Ganesha over XFS on librbd; the service VM speaks native Ceph to the storage network, and the tenant VM's NFS client mounts from it over a Neutron network]
19. 19
GANESHA + LIBCEPHFS
● Model
– existing Ganesha driver toolkit, currently
used by GlusterFS
– Ganesha's libcephfs FSAL
● Pros
– simple, existing model
– security
● Cons
– extra hop → higher latency
– service VM is SpoF
– service VM consumes resources
● Status
– Manila Ganesha toolkit exists
– used for GlusterFS
– not yet integrated with CephFS
[Diagram: a Ganesha service VM links libcephfs and speaks native Ceph to the RADOS cluster; the tenant VM's NFS client mounts from it over a Neutron network; Manila drives the Ganesha configuration]
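For reference, a minimal Ganesha export using the Ceph FSAL looks roughly like this (export id, paths, and access settings are illustrative):

    cat >> /etc/ganesha/ganesha.conf <<'EOF'
    EXPORT {
        Export_Id = 100;
        Path = "/volumes/share1";      # CephFS subtree to export
        Pseudo = "/share1";            # NFSv4 pseudo-fs path seen by clients
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;               # go through libcephfs, no local mount needed
        }
    }
    EOF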
20. 20
THE PROBLEM WITH SERVICE VMS
● Architecture is limited
– slow: extra hop
– expensive: extra VM
● Current implementation is not highly-available
– need service monitoring, failover
– possibly load balancing
– Manila code assumes a single service endpoint/proxy
● It's a big TODO list. Is it the right end point?
22. 22
CEPH NATIVE DRIVER
● Model
– allow tenant access to storage network
– mount CephFS directly from tenant VM
● Pros
– best performance
– access to full CephFS feature set
– simple
● Cons
– guest must have modern distro/kernel
– exposes tenant to Ceph cluster
– networking currently left to user
– must deliver mount secret to client
[Diagram: the tenant VM mounts CephFS directly with ceph.ko, speaking native Ceph to the RADOS cluster; Manila manages only the shares]
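From inside the tenant VM the share is then just a CephFS mount; assuming the monitors are reachable on the storage network and the per-share secret has been delivered (hostnames, path, and client name are placeholders):

    mount -t ceph mon1,mon2,mon3:/volumes/share1 /mnt/share1 \
        -o name=manila-client,secretfile=/etc/ceph/manila-client.secret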
23. 23
CEPHFS-VOLUME-MANAGER.PY
● cephfs-volume-manager.py → libcephfs.py → libcephfs.so → CephFS
– will be packaged as part of Ceph (with python-cephfs)
● Manila volumes/shares and consistency groups are just CephFS directories
– e.g., /manila/$cg/$volume
● Capture useful CephFS volume management tasks
– create – mkdir /manila/$cg/$volume
– delete (async) – mv /manila/$cg/$volume /manila/.trash
– snapshot volume – mkdir /manila/$cg/$volume/.snapshot/$snapname
– snapshot consistency group – mkdir /manila/$cg/.snapshot/$snapname
– promote snapshot to new volume
● read/write – cp -r ...
● read only – ln -s /manila/$cg/$vol/.snapshot/$snap /manila/$cg/$newvol
● Result is very simple Manila driver
– ~250 lines of code for native driver
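With CephFS mounted on the Manila host (say at /mnt/cephfs; note the kernel client's snapshot directory is .snap by default, while the layout above shows a configured .snapshot name), the whole lifecycle reduces to directory operations along these lines:

    # create a share inside a consistency group
    mkdir -p /mnt/cephfs/manila/$CG/$VOLUME

    # snapshot one share, or the whole consistency group
    mkdir /mnt/cephfs/manila/$CG/$VOLUME/.snap/$SNAPNAME
    mkdir /mnt/cephfs/manila/$CG/.snap/$SNAPNAME

    # asynchronous delete: move the share aside now, purge it later
    mv /mnt/cephfs/manila/$CG/$VOLUME /mnt/cephfs/manila/.trash/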
24. 24
SECURITY
● Tenant has access to the storage network
– Ceph and CephFS are responsible for security isolation between tenants
● Client authentication has been there for years
– modeled after Kerberos (mutual client/server authentication)
● New CephFS path-based authorization
– new in MDS. Now upstream
– missing CephFS support for RADOS namespaces
● needed to restrict client access to CephFS objects using librados API
– will be in Jewel (Q1 2016)
● Is that enough?
– Ceph's security is the only barrier
– DoS potential against cluster
– it depends on the environment...
26. 26
WE WANT
● Better security
– ...like we get with block storage
● Simplicity of configuration and deployment
– ...like with Qemu and librbd
● Good performance
– ...like with CephFS native
27. 27
KVM + 9P/VIRTFS + LIBCEPHFS.SO
● Model
– link libcephfs.so into qemu virtfs layer
– guest OS (Linux) mounts via 9P-2000.L
● Pros
– security: tenant remains isolated from storage
net + locked inside a directory
– extremely simple deployment
● Cons
– requires (modern) Linux guests
– not supported on some distros
– 9p (kernel) and virtfs (qemu) code quality
– 9P isn't the greatest file protocol
● Status
– Prototype from Jevon Qiao, Haomai Wang, et al
● Qemu virtfs + libcephfs
● Manila driver + Nova mods
[Diagram: the guest mounts via 9P; qemu's virtfs layer links libcephfs and speaks native Ceph to the RADOS cluster]
28. 28
KVM + NFS + NFSD/GANESHA + CEPHFS
● Model
– mount CephFS on host
● or Ganesha + libcephfs on host
– export NFS to guest over private net
● Pros
– security: tenant remains isolated from
storage net + locked inside a directory
– NFS is well supported everywhere
– reliable: same HW failure domain as guest
– works for any FS, not just CephFS
● Cons
– NFS has weak caching consistency
– protocol translation will slow us down some
– awkward and/or insecure networking...
[Diagram: the host (KVM) mounts CephFS with ceph.ko and speaks native Ceph to the RADOS cluster; the guest mounts NFS from the host over a private TCP/IP network]
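A sketch of the host-side plumbing for this model (addresses, paths, and key names are examples):

    # on the hypervisor: mount CephFS and re-export it to the guest's private IP
    mount -t ceph mon1:/volumes/share1 /srv/manila/share1 \
        -o name=host,secretfile=/etc/ceph/host.secret
    echo '/srv/manila/share1 192.168.122.12(rw,no_root_squash)' >> /etc/exports
    exportfs -ra

    # in the guest: mount from the host's address on the private network
    mount -t nfs 192.168.122.1:/srv/manila/share1 /mnt/share1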
29. 29
NFS TO HOST: PROBLEMS WITH TCP/IP
● Slightly awkward networking
– add dedicated network device to VM
– configure local subnet and assign IPs on host and guest
– configure NFS export on hypervisor
– mount export from the VM
● Tricky to automate special-purpose network interfaces
● Guest networking infrastructure can disrupt file sharing
– firewalld
– networking restart
– “What is this weird network and interface doing here?”
● Other services on host may inadvertently be exposed to guest
– anything binding to INADDR_ANY (e.g., sshd)
30. 30
AF_VSOCK
● VMware vSockets: a new(-ish) address family / socket type
– designed for communication between VMs and hosts
– stream-based or connectionless datagrams (just like IP)
– address is a simple integer (e.g., vsock:2)
– supported in Linux kernel since v3.9 (2013)
● Zero configuration simplicity
– hypervisor is always address vsock:1
– hypervisor assigns an address >1 to each VM
31. 31
NFS TO HOST: VSOCK
● NFS v4.1 only
– older NFS versions have awkward legacy connectivity/addressing requirements (e.g.,
lockd)
– v4.1 consolidates protocol into a single connection
● Easy to support
– mostly boilerplate to add new address type, as with IPv6 (e.g., parsing)
● Linux kernel NFS client and server
– patches from Stefan Hajnoczi are under review
● Ganesha
– patches from Matt Benjamin are under review
● nfs-utils
– patches from Matt Benjamin are pending review
32. 32
KVM + NFS (VSOCK) + NFSD + CEPHFS.KO
● Model
– mount CephFS on host, knfsd
● or Ganesha + libcephfs on host
– export to VM's VSOCK address
● Pros
– NFSv4.1 is well supported
– security is better...
– simpler configuration...
– more reliable...
● Cons
– VSOCK support for Qemu and NFS
is shiny and new
[Diagram: the host mounts CephFS with ceph.ko and re-exports it via knfsd; the guest mounts NFS over AF_VSOCK; the host speaks native Ceph to the RADOS cluster]
33. 33
WE LIKE THE VSOCK-BASED MODEL
● Security
– tenant remains isolated from storage network
– no shared IP network between guest and host
● avoid INADDR_ANY problem (e.g., by sshd on host)
● Simplicity
– no network configuration beyond VM VSOCK address assignment (on host)
– treats VM as a black box
– no software-defined networking
● Reliability
– no gateway in a separate hardware failure domain
– fewer network traversals
● Performance
– clear win over a service VM
– possible win over TCP/IP to host (but not currently optimized with this in mind!)
34. 34
VSOCK CHALLENGES
● New hotness
– need to get code upstream and into supported distros/products
– Qemu, Linux kernel, Ganesha, nfs-utils
● Host configuration
– someone needs to assign VSOCK addresses to VMs
– someone needs to mount CephFS (or other FS) on the host and reexport NFS
to the guest
● User experience and the last mile
– How does a consumer of the Manila API know how to mount this thing?
– Do they need intimate knowledge of which Manila driver is in use, and what
attachment mechanism is supported by this particular OpenStack instance?
– Can they choose?
36. 36
MANILA VS NOVA
● Manila manages shares/volumes
● Nova manages the VMs
● Cinder manages block volumes
● Nova manages VMs
● Nova attaches Cinder volumes to VMs
– mechanism is dependent on the Nova driver (KVM vs Xen vs lxd vs ...)
37. 37
NOVA: ATTACH/DETACH FS API
● Attach or detach a file system
– hypervisor mediates access to Manila shares/volumes
– networking?
● attach to Neutron network
● assign VSOCK address
– gateway/proxy?
● knfsd or Ganesha
– containers?
● Fetch access metadata (e.g., mount command inputs)
– mount protocol and options depend on Nova instance type and share
type
● Now Nova...
– can reattach after reboot
– manage live migration
38. 38
NOVA: ATTACH/DETACH FS
FS access mode | Meaning of attach/detach | Meaning of access metadata
KVM, NFS from guest (e.g., to NetApp) | Attach guest to Manila share's network | Typical NFS mount command: mount -t nfs $filervip:/ ...
KVM, VSOCK, Ganesha, libcephfs/libgfapi | Write share definition to local Ganesha config file for guest's VSOCK addr; start Ganesha | NFS VSOCK mount command: mount -t nfs vsock://1/ ...
KVM, VSOCK, knfsd, cephfs.ko mount | Mount CephFS; write share definition to /etc/exports for guest's VSOCK addr; exportfs -a | NFS VSOCK mount command: mount -t nfs vsock://1/ ...
KVM, NFS to generic share driver | Attach guest to Manila share's network | NFS IP mount command: mount -t nfs $filerip:/ ...
KVM, NFS to Ganesha service VM | Attach guest to Manila share's network | NFS IP mount command: mount -t nfs $filerip:/ ...
KVM or Ironic, native CephFS | No-op (or attach guest to storage network) | CephFS mount requires a secret: mount -t ceph $monip:/ -o secret=X ...
Nova container (lxc, lxd) | Mount remote fs on host; mount --bind share to guest's /dev/manila/$shareid | Bind mount to desired location: mount --bind /dev/manila/$shareid ...
40. 40
(LXC, LXD) + CEPHFS.KO
● Model
– host mounts CephFS (or whatever)
directly
– mount --bind share into container
namespace (/dev/manila/$shareid)
– user does mount --bind to final
location
● Pros
– best performance
– full CephFS semantics
● Cons
– rely on container for security
– need Nova attach/detach API
[Diagram: the host mounts CephFS with ceph.ko and speaks native Ceph to the RADOS cluster; Nova bind-mounts the share into the container; Manila manages the share]
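A sketch of what attach might look like for a container under this model (paths, share id, and the /dev/manila convention follow the slide and are illustrative):

    # host mounts the share once
    mount -t ceph mon1:/volumes/share1 /srv/manila/share1 \
        -o name=host,secretfile=/etc/ceph/host.secret

    # Nova bind-mounts it into the container's namespace at the agreed location
    mkdir -p $CONTAINER_ROOTFS/dev/manila/share1
    mount --bind /srv/manila/share1 $CONTAINER_ROOTFS/dev/manila/share1

    # inside the container, the user bind-mounts it to the final location
    mount --bind /dev/manila/share1 /srv/app/data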
41. 42
SUMMARY
● Ceph native driver should land soon
– and Ceph Jewel (Q1 2016) will have production-ready CephFS!
● Current Manila models are appliance-centric or limited
● NFS over VSOCK to the host is promising
– simplicity, reliability, security, performance
– either kernel NFS server or Ganesha
● We need to sort out the Nova vs Manila interaction
– Nova APIs would help enable
● non-KVM users for Manila (containers, Ironic)
● NFS over VSOCK to a host gateway