Raw notes from the Storage Developers Conference (SDC 2013)

This blog post is a compilation of my raw notes from SNIA’s SDC 2013 (Storage Developers Conference).

Notes and disclaimers:

  • These notes were typed during the talks and they may include typos and my own misinterpretations.
  • The text in the bullets under each talk consists of quotes from the speaker or text from the speaker's slides, not my personal opinions.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent limited time fixing typos or correcting the text after the event. There are only so many hours in a day...
  • I did not attend all sessions (since there are 4 or 5 at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.
  • You can find the event agenda at https://www.snia.org/events/storage-developer2013/agenda2013

SMB3 Meets Linux: The Linux Kernel Client
Steven French, Senior Engineer SMB3 Architecture, IBM

  • The title shown (with strikethrough text) is: CIFS SMB2 SMB2.1 SMB3 SMB3.02 and Linux, a Status Update.
  • How do you use it? What works? What is coming?
  • Who is Steven French: maintainer of the Linux kernel client, SMB3 Architect for IBM Storage
  • Excited about SMB3
  • Why SMB3 is important: cluster friendly, large IO sizes, more scalable.
  • Goals: local/remote transparency, near POSIX semantics to Samba, fast/efficient/full function/secure method, as reliable as possible over bad networks
  • Focused on SMB 2.1, 3, 3.02 (SMB 2.02 works, but lower priority)
  • SMB3 faster than CIFS. SMB3 remote file access near local file access speed (with RDMA)
  • Last year SMB 2.1, this year SMB 3.0 and minimal SMB 3.02 support
  • 308 kernel changes this year, a very active year. More than 20 developers contributed
  • A year ago 3.6-rc5 – now at 3.11 going to 3.12
  • Working on today: copy offload, full Linux xattr support, SMB3 UNIX extension prototyping, recovery of pending locks; starting work on Multichannel
  • Outline of changes in the latest releases (from kernel version 3.4 to 3.12), version by version
  • Planned for kernel 3.13: copy chunk, quota support, per-share encryption, multichannel, considering RDMA (since Samba is doing RDMA)
  • Improvements for performance: large IO sizes, credit-based flow control, improved caching model. Still need to add compounding.
  • Status: can negotiate multiple dialects (SMB 2.1, 3, 3.02)
  • Working well: basic file/dir operations, passes most functional tests, can follow symlinks, can leverage durable and persistent handles, file leases
  • Need to work on: cluster enablement, persistent handles, witness, directory leases, per-share encryption, multichannel, RDMA
  • Plans: SMB 2.1 no longer experimental in 3.12, SMB 2.1 and 3 passing similar set of functional tests to CIFS
  • Configuration hints: adjusting rsize, wsize, max_pending, cache, SMB3 signing, UNIX extensions, nosharelock (a mount sketch follows after this list)
  • UNIX extensions: POSIX pathnames, case sensitive path name, POSIX delete/rename/create/mkdir, minor extensions to stat/statfs, brl, xattr, symlinks, POSIX ACLs
  • Optional POSIX SMB3 features outlined: list of flags used for each capability
  • Question: Encryption: Considering support for multiple algorithms, since AES support just went in the last kernel.
  • Development is active! Would like to think more seriously about NAS appliances. This can be extended…
  • This is a nice, elegant protocol. SMB3 fits well with Linux workloads like HPC, databases. Unbelievable performance with RDMA.
  • Question: Cluster enablement? Durable handle support is in. Pieces missing for persistent handle and witness are small. Discussing option to implement and test witness.
  • Need to look into the failover timing for workloads other than Hyper-V.
  • Do we need something like p-NFS? Probably not, with these very fast RDMA interfaces…
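To make the configuration hints above a bit more concrete, here is a minimal sketch of my own (not from the talk) of mounting an SMB3 share with some cifs.ko mount options. The server name, share, and option values are illustrative assumptions, not recommendations from the speaker.

```python
# Minimal sketch: mounting an SMB3 share with a few of the cifs.ko options
# mentioned in the configuration hints. Server, share and values are
# illustrative assumptions only.
import subprocess

def mount_smb3(server="server", share="data", mountpoint="/mnt/smb3"):
    options = ",".join([
        "vers=3.0",        # negotiate the SMB 3.0 dialect
        "rsize=1048576",   # larger read size, per the "adjust rsize/wsize" hint
        "wsize=1048576",   # larger write size
        "cache=strict",    # strict caching model
        "sec=ntlmssp",
        "username=testuser",
    ])
    subprocess.run(
        ["mount", "-t", "cifs", f"//{server}/{share}", mountpoint, "-o", options],
        check=True,
    )

if __name__ == "__main__":
    mount_smb3()
```

Adjusting rsize/wsize and the cache mode is exactly the kind of tuning the configuration-hints slide was pointing at.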

Mapping SMB onto Distributed Storage
Christopher R. Hertel, Senior Principal Software Engineer, Red Hat
José Rivera, Software Engineer, Red Hat

  • Trying to get SMB running on top of a distributed file system, Gluster
  • Chris and Jose: Both work for RedHat, both part of the Samba team, authors, etc…
  • Metadata: data about data, pathnames, inode numbers, timestamps, permissions, access controls, file size, allocation, quota.
  • Metadata applies to volumes, devices, file systems, directories, shares, files, pipes, etc…
  • Semantics are interpreted in different contexts
  • Behavior: predictable outcomes. Make them the same throughout the environments, even if they are not exactly the same
  • Windows vs. POSIX: different metadata + different semantics = different behavior
  • That’s why we have a plugfest downstairs
  • Long list of things to consider: ADS, BRL, deleteonclose, directory change notify, NTFS attributes, offline ops, quota, etc…
  • Samba is a Semantic Translator. Clients expect Windows semantics from the server, Samba expects POSIX semantics from the underlying file system
  • UNIX extensions for SMB allows POSIX clients to bypass some of this translation
  • If Samba does not properly handle the SMB protocol, we call it a bug. If it cannot handle the POSIX translation, that's also a bug.
  • General Samba approach: Emulate the Windows behavior, translate the semantics to POSIX (ensure other local processes play by similar rules)
  • The Samba VFS layers: SMB Protocol → Initial Request Handling → VFS Layer → Default VFS Layer → actual file system
  • Gluster: Distributed File System, not a cluster file system. Brick → a directory in the underlying file system. Bricks bound together as a volume. Access via SMB, NFS, REST.
  • Gluster can be FUSE mounted. Just another access method. FUSE hides the fact that it’s Gluster underneath.
  • Explaining translations: Samba/Gluster/FUSE. Gluster is adaptable. Translator stack like Samba VFS modules…
  • Can add support for: Windows ACLs, oplocks, leases, Windows timestamps.
  • Vfs_glusterfs: Relatively new code, similar to other Samba VFS modules. Took less than a week to write.
  • Can bypass the lower VFS layers by using libgfapi. All VFS calls must be implemented to avoid errors.
  • CTDB offers three basics services: distributed metadata database (for SMB state), node failure detection/recovery, IP address service failover.
  • CTDB forms a Samba cluster. Separate from the underlying Gluster cluster. May duplicate some activity. Flexible configuration.
  • SMB testing, compared to other access methods: has different usage patterns, has tougher requirements, pushes corner cases.
  • Red Hat using stable versions, kernel 2.x or something. So using SMB1 still…
  • Fixed: Byte range locking. Fixed a bug in F_GETLK to get POSIX byte range locking to work.
  • Fixed: SMB has strict locking and data consistency requirements. Stock Gluster config failed the ping_pong test. Fixed cache bugs → ping_pong passes
  • Fixed: Slow directory lookups. Samba must do extra work to detect and avoid name collisions, because Windows is case-INsensitive and POSIX is case-sensitive (a conceptual sketch follows after this list). Fixed by using vfs_glusterfs.
  • Still working on: CTDB node banning. Under heavy load (FSCT), CTDB permanently bans a running node. Goal: reach peak capacity without node banning. New CTDB versions improved capacity.
  • Still working on: CTDB recovery lock file loss. Gluster is a distributed FS, not a Cluster FS. In replicated mode, there are two copies of each file. If Recovery Lock File is partitioned, CTDB cannot recover.
  • Conclusion: If implementing SMB in a cluster or distributed environment, you should know enough about SMB to know where to look for trouble… Make sure metadata is correct and consistent.
  • Question: Gluster and Ceph have VFS. Is Samba suitable for that? Yes. Richard wrote a guide on how to write a VFS. Discussing a few issues around passing user context.
  • Question: How to change SMB3 to be more distributed? Client could talk to multiple nodes. Gluster working on RDMA between nodes. Protocol itself could offer more about how the cluster is setup.
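Since the case-sensitivity point comes up in the "slow directory lookups" bullet above, here is a conceptual sketch of my own (not Samba's actual code) of why serving case-insensitive Windows semantics on top of a case-sensitive POSIX file system costs extra work: a miss on the exact name forces a scan of the whole directory to check for a differently-cased collision.

```python
# Conceptual illustration (not Samba code): case-insensitive lookup on a
# case-sensitive POSIX file system falls back to an O(n) directory scan.
import os

def case_insensitive_lookup(directory, name):
    exact = os.path.join(directory, name)
    if os.path.exists(exact):            # cheap path: exact-case hit
        return exact
    lowered = name.lower()
    for entry in os.listdir(directory):  # expensive path: scan every entry
        if entry.lower() == lowered:
            return os.path.join(directory, entry)
    return None

print(case_insensitive_lookup("/etc", "HOSTS"))  # finds /etc/hosts on most Linux boxes
```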

Pike - Making SMB Testing Less Torturous
Brian Koropoff, Consulting Software Engineer, EMC Isilon

  • Pike – written in Python – starting with a demo
  • Support for a modest subset of SMB2/3. Currently more depth than breadth.
  • Emphasis on fiddly cases like failover, complex creates
  • Mature solutions largely in C (not convenient for prototyping)
  • Why python: ubiquitous, expressive, flexible, huge ecosystem.
  • Flexibility and ease of use over performance. Convenient abstractions. Extensible, re-usable.
  • Layers: core primitives (abstract data model), SMB2/3 packet definitions, SMB2/3 client model (connection, state, request, response), test harness
  • Core primitives: Cursor (buffer+offset indicating read/write location), frame (packet model), enums, anti-boilerplate magic. Examples.
  • SMB2/SMB3 protocol (pike.smb2): header, request/response, create {request/response} context, concrete frame. Examples (a rough framing illustration follows after this list).
  • SMB2/SMB3 model: SMB3 object model + glue. Future, client, connection (submit, transceive, error handling), session, channel (treeconnect, create, read), tree, open, lease, oplocks.
  • Examples: Connect, tree connect, create, write, close. Oplocks. Leases.
  • Advanced uses. Manually construct and submit exotic requests. Override _encode. Example of a manual request.
  • Test harness (pike.test): quickly establish connection, session and tree connect to server. Host, credentials, share parameters taken from environment.
  • Odds and ends: NT time class, signing, key derivation helpers.
  • Future work: increase breadth of SMB2/3 support. Security descriptors, improvement to mode, NTLM story, API documentation, more tests!
  • https://github.com/emc-isilon/pike - open source, patches are welcome. Has to figure out how to accept contributions with lawyers…
  • Question: Microsoft has a test suite. It’s in C#, doesn’t work in our environment. Could bring it to the plugfest.
  • Question: I would like to work on implementing it for SMB1. What do you think? Not a priority for me. Open to it, but should use a different model to avoid confusion.
  • Example: Multichannel. Create a session, bind another channel to the same session, pretend failover occurred. Write fencing of stable write.
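As a companion to the packet-definition layer described above, here is a rough, hand-rolled illustration of my own (not Pike's actual API) of the kind of buffer/offset framing that Pike's core primitives and pike.smb2 definitions automate. The field layout is the 64-byte SMB2 header from MS-SMB2.

```python
# Not Pike's actual classes -- just a bare-bones look at the framing work
# that pike.smb2 packet definitions hide. Layout per the MS-SMB2 header.
import struct

SMB2_NEGOTIATE = 0x0000  # command code for NEGOTIATE

def smb2_header(command, message_id, session_id=0, tree_id=0, credits=1):
    return struct.pack(
        "<4sHHIHHIIQIIQ16s",
        b"\xfeSMB",     # ProtocolId
        64,             # StructureSize (always 64)
        0,              # CreditCharge
        0,              # ChannelSequence/Status
        command,        # Command
        credits,        # CreditRequest
        0,              # Flags
        0,              # NextCommand (no compounding)
        message_id,     # MessageId
        0,              # Reserved
        tree_id,        # TreeId
        session_id,     # SessionId
        b"\x00" * 16,   # Signature (unsigned)
    )

hdr = smb2_header(SMB2_NEGOTIATE, message_id=0)
assert len(hdr) == 64
```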

Exploiting the High Availability features in SMB 3.0 to support Speed and Scale
James Cain, Principal Software Architect, Quantel Ltd

  • Working with TV/Video production. We only care about speed.
  • RESTful recap. RESTful filesystems talk from SDC 2010. Allows for massive scale by storing application state in the URLs instead of in the servers.
  • Demo (skipped due to technical issues): RESTful SMB3.
  • Filling pipes: Speed (throughput) vs. Bandwidth vs. Latency. Keeping packets back to back on the wire.
  • TCP window size used to limit it. Mitigated by using multiple wires, multiple connections (a back-of-envelope sketch of filling a pipe follows after this list).
  • Filling the pipes: SMB1 – XP era. Filling the pipes required application participation. 1 session could do about 60MBps. Getting Final Cut Pro 7 to lay over SMB1 was hard. No choice to reduce latency.
  • Filling the pipes: SMB 2.0 – Vista era. Added credits, SMB2 server can control overlapped requests using credits. Client application could make normal requests and fill the pipe.
  • Filling the pipes: SMB 2.1 – 7 era. Large MTU helps.
  • Filling the pipes: SMB 3 – 8 era. Multi-path support. Enables: RSS, Multiple NICs, Multiple machines, RDMA.
  • SMB3 added lots of other features for high availability and fault tolerance. SignKey derivation.
  • Filesystem has DirectX GUI :-) - We use GPUs to render, so our SMB3 server has CUDA compute built in too. Realtime visualization tool for optimization.
  • SMB3 Multi-machine with assumed shared state. Single SMB3 client talking to two SMB3 servers. Distributed non-homogeneous storage behind the SMB servers.
  • Second NIC (channel) initiation has no additional CREATE. No distinction on the protocol between single server or multiple server. Assume homogeneous storage.
  • Asking Microsoft to consider “NUMA for disks”. Currently, shared nothing is not possible. Session, trees, handles are shared state.
  • “SMB2++” is getting massive traction. Simple use cases are well supported by the protocol. SMB3 has a high cost of entry, but lower than writing an IFS in kernel mode.
  • There are limits to how far SMB3 can scale due to its model.
  • I know this is not what the protocol is designed to do. But want to see how far I can go.
  • It could be helped by changing the protocol to have duplicate handle semantics associated with the additional channels.
  • The protocol is really, really flexible. But I’m having a hard time doing what I was trying to do.
  • Question: You’re basically trying to do Multichannel to multiple machines. Do you have a use case? I’m experimenting with it, trying to discover new things.
  • Question: You could use CTDB to solve the problem. How much would it slow down? It could be a solution, not an awful lot of state.             
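To put the "filling the pipes" discussion in numbers, here is a back-of-envelope sketch of my own (link speed, RTT and request size are illustrative assumptions) of how many overlapped, credited requests it takes to keep a link busy: the bandwidth-delay product divided by the request size.

```python
# Back-of-envelope: overlapped requests (credits in use) needed to keep a
# link full. Link speed, RTT and request size below are illustrative.
import math

def requests_in_flight(link_gbit_per_s, rtt_ms, request_bytes):
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    bdp = bytes_per_s * (rtt_ms / 1000.0)          # bandwidth-delay product in bytes
    return max(1, math.ceil(bdp / request_bytes))  # requests that must be in flight

# A 10 GbE link, 1 ms round trip, 1 MiB reads:
print(requests_in_flight(10, 1.0, 1024 * 1024))    # -> 2
# The same link with a 10 ms round trip:
print(requests_in_flight(10, 10.0, 1024 * 1024))   # -> 12
```

This is the point of the SMB2 credit mechanism: the server can grant enough credits that an ordinary client keeps the wire saturated without application tricks.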

SMB3 Update
David Kruse, Development Lead, Microsoft

  • SMB 3.02 - Don’t panic! If you’re on the road to SMB3, there are no radical changes.
  • Considered not revving the dialect and doing just capability bits, but thought it would be better to rev the dialect.
  • Dialects vs. Capabilities: Asymmetric Shares, FILE_ATTRIBUTE_INTEGRITY_STREAMS.
  • SMB 2.0 client attempting MC or CA? Consistency/documentation question.
  • A server that receives a request from a client with a flag/option/capability that is not valid for the dialect should ignore it.
  • Showing code on how to mask the capabilities that don’t make sense for a specific dialect (a sketch in the same spirit follows after this list)
  • Read/Write changes: request specific flag for unbuffered IO. RDMA flag for invalidation.
  • Comparing “Traditional” File Server Cluster vs. “Scale-Out” File Server cluster
  • Outlining the asymmetric scale-out file server cluster. Server-side redirection. Can we get the client to the optimal node?
  • Asymmetric shares. New capability in the TREE_CONNECT response. Witness used to notify client to move.
  • Different connections for different shares in the same scale-out file server cluster. Share scope is the unit of resource location.
  • Client processes share-level “move” in the same fashion as a server-level “move” (disconnect, reconnects to IP, rebinds handle).
  • If the cost of accessing the data is the same for all nodes, there is no need to move the client to another node.
  • Move-SmbWitnessClient will not work with asymmetric shares.
  • In Windows, asymmetric shares are typically associated with Mirrored Storage Spaces, not iSCSI/FC uniform deployment. Registry key to override.
  • Witness changes: Additional fields: Sharename, Flags, KeepAliveTimeOutInSeconds.
  • Witness changes: Multichannel notification request. Insight into arrival/loss of network interfaces.
  • Witness changes: Keepalive. Timeouts for async IO are very coarse. Guarantees client and server discover a lost peer in minutes instead of hours.
  • Demos in Jose’s blog. Thanks for the plug!
  • Diagnosability events. New always-on events. Example: failed to reconnect a persistent handle includes previous reconnect error and reason. New events on server and client.
  • If Asymmetric is not important to you, you don’t need to implement it.
  • SMB for IPC (Inter-process communications) – What happened to named pipes?
  • Named pipes over SMB have declined in popularity. Performance concerns with serialized IO. But this is a property of named pipes, not SMB.
  • SMB provides: discovery, negotiation, authentication, authorization, message semantics, multichannel, RDMA, etc…
  • If you can abstract your application as a file system interface, you could extend it to remote access via SMB.
  • First example: Remote Shared Virtual Disk Protocol
  • Second example: Hyper-V Live Migration over SMB. VID issues writes over SMB to target for memory pages. Leverages SMB Multichannel, SMB Direct.
  • Future thoughts on SMB for IPC. Not a protocol change or Microsoft new feature. Just ideas shared as a thought experiment.
    • MessageFs – user-mode client and user-mode server. Named Pipes vs. MessageFs. Each offset marks a distinct transaction, enabling parallel actions.
    • MemFs – Kernel mode component on the server side. Server registers a memory region and clients can access that memory region.
    • MemFs+ - What if we combine the two? Fast exchange for small messages plus high bandwidth, zero copy access for large transfers. Model maps directly to RDMA: send/receive messages, read/write memory access.
  • One last thing… On Windows 8.1, you can actually disable SMB 1.0 completely.
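In the spirit of the capability-masking code David showed (this is my own sketch, not his), here is what "ignore capability bits that are not valid for the dialect" can look like. The capability bits and dialect numbers are the ones defined in MS-SMB2; the per-dialect masking policy below is illustrative.

```python
# Sketch of masking client capability bits by negotiated dialect.
# Bit values and dialect numbers are from MS-SMB2; the policy is illustrative.
SMB2_GLOBAL_CAP_DFS                = 0x00000001
SMB2_GLOBAL_CAP_LEASING            = 0x00000002
SMB2_GLOBAL_CAP_LARGE_MTU          = 0x00000004
SMB2_GLOBAL_CAP_MULTI_CHANNEL      = 0x00000008
SMB2_GLOBAL_CAP_PERSISTENT_HANDLES = 0x00000010
SMB2_GLOBAL_CAP_DIRECTORY_LEASING  = 0x00000020
SMB2_GLOBAL_CAP_ENCRYPTION         = 0x00000040

ALL_SMB3_CAPS = (SMB2_GLOBAL_CAP_DFS | SMB2_GLOBAL_CAP_LEASING |
                 SMB2_GLOBAL_CAP_LARGE_MTU | SMB2_GLOBAL_CAP_MULTI_CHANNEL |
                 SMB2_GLOBAL_CAP_PERSISTENT_HANDLES |
                 SMB2_GLOBAL_CAP_DIRECTORY_LEASING | SMB2_GLOBAL_CAP_ENCRYPTION)

VALID_CAPS = {
    0x0202: SMB2_GLOBAL_CAP_DFS,
    0x0210: SMB2_GLOBAL_CAP_DFS | SMB2_GLOBAL_CAP_LEASING | SMB2_GLOBAL_CAP_LARGE_MTU,
    0x0300: ALL_SMB3_CAPS,
    0x0302: ALL_SMB3_CAPS,   # 3.02 revs the dialect, no new capability bits
}

def effective_caps(dialect, client_caps):
    """Ignore capability bits that are not valid for the negotiated dialect."""
    return client_caps & VALID_CAPS.get(dialect, 0)

# A 2.1 client that sets the multichannel bit gets it silently ignored:
print(hex(effective_caps(0x0210,
                         SMB2_GLOBAL_CAP_LARGE_MTU | SMB2_GLOBAL_CAP_MULTI_CHANNEL)))
```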

Architecting Block and Object Geo-replication Solutions with Ceph
Sage Weil, Founder & CTO, Inktank

  • Impossible to take notes, speaker goes too fast :-)

1 S(a) 2 M 3 B(a) 4
Michael Adam, SerNet GmbH - Delivered by Volker

  • What is Samba? The open source SMB server (Samba3). The upcoming open source AD controller (Samba4). Two different projects.
  • Who is Samba? List of team members. Some 35 or so people… www.samba.org/samba/team
  • Development focus: Not a single concentrated development effort. Various companies (RedHat, SuSE, IBM, SerNet, …) Different interests, changing interests.
  • Development quality: Established. Autobuild selftest mechanism. New voluntary review system (since October 2012).
  • What about Samba 4.0 after all?
    • First (!?) open source Active Directory domain controller
    • The direct continuation of the Samba 3.6 SMB file server
    • A big success in reuniting two de-facto separated projects!
    • Also a big and important file server release (SMB 2.0 with durable handles, SMB 2.1 (no leases), SMB 3.0 (basic support))
  • History. Long slide with history from 2003-06-07 (Samba 3.0.0 beta 1) to 2012-12-11 (Samba 4.0.0). Samba4 switched to using SMB2 by default.
  • What will 4.1 bring? Current 4.1.0rc3 – final planned for 2013-09-27.
  • Samba 4.1 details: mostly stabilization (AD, file server). SMB2/3 support in smbclient, including SMB3 encryption. Server side copy. Removed SWAT.
  • Included in Samba 4.0: SMB 2.0 (durable handles). SMB 2.1 (multi-credit, large MTU, dynamic reauth), SMB 3.0 (signing, encryption, secure negotiate, durable handles v2)
  • Missing in Samba 4.0: SMB 2.1 (leasing*, resilient file handles), SMB 3.0 (persistent file handles, multichannel*, SMB direct*, witness*, cluster features, storage features*, …) *=designed, started or in progress
  • Leases: Oplocks done right. Remove the 1:1 relationship between open and oplock, add a lease/oplock key (a toy sketch follows after this list). https://wiki.samba.org/index.php/Samba3/SMB2#Leases
  • Witness: Explored protocol with Samba rpcclient implementation. Working on pre-req async RPC. https://wiki.samba.org/index.php/Samba3/SMB2#Witness_Notification_Protocol
  • SMB Direct:  Currently approaching from the Linux kernel side. See related SDC talk. https://wiki.samba.org/index.php/Samba3/SMB2#SMB_Direct
  • Multichannel and persistent handles: just experimentation and discussion for now. No code yet.
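To illustrate the "remove the 1:1 relationship between open and oplock" point from the leases bullet above, here is a toy sketch of my own (not Samba code): caching state is keyed by a client-chosen lease key rather than by a single open, so multiple opens of the same file from the same client share one lease.

```python
# Toy illustration (not Samba code): lease state keyed by (file, lease key)
# instead of hanging off one open handle.
from collections import defaultdict

class LeaseTable:
    def __init__(self):
        self.opens_by_lease = defaultdict(set)   # (file_id, lease_key) -> open handles
        self.state_by_lease = {}                 # (file_id, lease_key) -> e.g. "RWH"

    def open_file(self, file_id, lease_key, handle, requested="RWH"):
        key = (file_id, lease_key)
        self.opens_by_lease[key].add(handle)
        # a second open with the same lease key simply shares the existing state
        return self.state_by_lease.setdefault(key, requested)

    def close_handle(self, file_id, lease_key, handle):
        key = (file_id, lease_key)
        self.opens_by_lease[key].discard(handle)
        if not self.opens_by_lease[key]:         # last open gone: lease goes away
            del self.opens_by_lease[key]
            self.state_by_lease.pop(key, None)

table = LeaseTable()
print(table.open_file("foo.txt", "lease-key-A", handle=1))  # RWH
print(table.open_file("foo.txt", "lease-key-A", handle=2))  # shares the same lease
```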

Keynote: The Impact of the NVM Programming Model
Andy Rudoff, Intel

  • Title is Impact of NVM Programming Model (… and Persistent Memory!)
  • What do we need to do to prepare, to leverage persistent memory
  • Why now? Programming model is decades old!
  • What changes? Incremental changes vs. major disruptions
  • What does this mean to developers? This is SDC…
  • Why now?
  • One movement here: Block mode innovation (atomics, access hints, new types of trim, NVM-oriented operations). Incremental.
  • The other movement: Emerging NVM technologies (Performance, performance, perf… okay, Cost)
  • Started talking to companies in the industry → SNIA NVM Programming TWG - https://snia.org/forums/sssi/nvmp
  • NVM TWG: Develop specifications for new software “programming models” as NVM becomes a standard feature of platforms
  • If you don’t build it and show that it works…
  • NVM TWG: Programming Model is not an API. Cannot define those in a committee and push on OSVs. Cannot define one API for multiple OS platforms
  • Next best thing is to agree on an overall model.
  • What changes?
  • Focus on major disruptions.
  • Next generation scalable NVM: Talking about resistive RAM NVM options. 1000x speed-up over NAND, closer to DRAM.
  • Phase Change Memory, Magnetic Tunnel Junction (MT), Electrochemical Cells (ECM), Binary Oxide Filament Cells, Interfacial Switching
  • Timing. Chart showing NAND SATA3 (ONFI2, ONFI3), NAND PCIe Gen3 x4 ONFI3 and future NVM PCIE Gen3 x4.
  • The cost of the software stack is not changing; for the last one (NVM PCIe), software is 60% of the read latency?!
  • Describing Persistent Memory…
  • Byte-addressable (as far as programming model goes), load/store access (not demand-paged), memory-like performance (would stall a CPU load waiting for PM), probably DMA-able (including RDMA)
  • For modeling, think battery-backed RAM. These are clunky and expensive, but it’s a good model.
  • It is not tablet-like memory for the entire system. It is not NAND Flash (at least not directly, perhaps with caching). It is not block-oriented.
  • PM does not surprise the program with unexpected latencies (no major page faults). Does not kick other things out of memory. Does not use page cache unexpectedly.
  • PM stores are not durable until data is flushed. Looks like a bug, but it’s always been like this. Same behavior that’s been around for decades. It’s how physics works.
  • PM may not always stay in the same address (physically, virtually). Different location each time your program runs. Don’t store pointers and expect them to work. You have to use relative pointers. Welcome to the world of file systems…
  • Types of Persistent Memory: Battery-backed RAM. DRAM saved on power failure. NVM with significant caching. Next generation NVM (still quite a bit unknown/emerging here).
  • Existing use cases: From volatile use cases (typical) to persistent memory use case (emerging). NVDIMM, Copy to Flash, NVM used as memory.
  • Value: Data sets with no DRAM footprint. RDMA directly to persistence (no buffer copy required!). The “warm cache” effect. Byte-addressable. Direct user-mode access.
  • Challenges: New programming models, API. It’s not storage, it’s not memory. Programming challenges. File system engineers and database engineers always did this. Now other apps need to learn.
  • Comparing to the change that happened when we switched to parallel programming. Some things can be parallelized, some cannot.
  • Two persistent memory programming models (there are four models, more on the talk this afternoon).
  • First: NVM PM Volume mode. PM-aware kernel module. A list of physical ranges of NVMs (GET_RANGESET).
  • For example, used by file systems, memory management, storage stack components like RAID, caches.
  • Second: NVM PM File. Uses a persistent-memory-aware file system. Open a file and memory-map it, but loads and stores go directly to persistent memory (an mmap sketch follows after this list).
  • Native file APIs and management. Did a prototype on Linux.
  • Application memory allocation. Ptr=malloc(len). Simple, familiar interface. But it’s persistent and you need to have a way to get back to it, give it a name. Like a file…
  • Who uses NVM.PM.FILE? Applications; they must reconnect with blobs of persistence (name, permissions)
  • What does it mean to developers?
  • Mmap() on UNIX, MapViewOfFile() on Windows. Have been around for decades. Present in all modern operating systems. Shared or Copy-on-write.
  • NVM.PM.FILE – surfaces PM to application. Still somewhat raw at this point. Two ways: 1-Build on it with additional libraries. 2-Eventually turn to language extensions…
  • All these things are coming. Libraries, language extensions. But how does it work?
  • Creating resilient data structures. Resilient to a power failure. It will be in state you left it before the power failure. Full example: resilient malloc.
  • In summary: models are evolving. Many companies in the TWG. Apps can make a big splash by leveraging this… Looking forward to libraries and language extensions.
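To ground the NVM.PM.FILE discussion, here is a small sketch of my own using an ordinary mmap() on a regular file: open a file, map it, store into the mapping, then flush explicitly, echoing the point that stores are not durable until flushed and that you should keep offsets rather than raw pointers. On a PM-aware file system the same pattern would hit persistent memory directly; the file name and layout are my own illustrative choices.

```python
# NVM.PM.FILE-style usage pattern shown with ordinary mmap() on a regular
# file. On a PM-aware file system the loads/stores would go straight to
# persistent memory; the flush is what makes the store durable.
import mmap, os, struct

PATH, SIZE = "pmem_demo.bin", 4096

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)

with mmap.mmap(fd, SIZE) as m:
    # keep a counter at a fixed offset -- offsets, not pointers, because the
    # mapping address can be different on every run (as the talk warns)
    counter = struct.unpack_from("<Q", m, 0)[0]
    struct.pack_into("<Q", m, 0, counter + 1)
    m.flush()          # the store is only durable after the flush
os.close(fd)
print("run count:", counter + 1)
```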

Keynote: Windows Azure Storage – Scaling Cloud Storage
Andrew Edwards, Microsoft

  • Turning block devices into very, very large block devices. Overview, architecture, key points.
  • Overview
  • Cloud storage: Blobs, disks, tables and queues. Highly durable, available and massively scalable.
  • 10+ trillion objects. 1M+ requests per second on average. Exposed via easy and open REST APIs
  • Blobs: Simple interface to retrieve files in the cloud. Data sharing, big data, backups.
  • Disks: Built on top of blobs. Mounted disks are VHDs stored in blobs.
  • Tables: Massively scalable key-value pairs. You can do queries, scan. Metadata for your systems.
  • Queues: Reliable messaging system. Deals with failure cases.
  • Azure is spread all over the world.
  • Storage Concepts: Accounts → Containers → Blobs / Tables → Entities / Queues → Messages. URLs to identify.
  • Used by Microsoft (XBOX, SkyDrive, etc…) and many external companies
  • Architecture
  • Design Goals: Highly available with strong consistency. Durability, scalability (to zettabytes). Additional information in the SOSP paper.
  • Storage stamps: Access to the blob via the URL. LB → Front-end → Partition layer → DFS layer. Inter-stamp partition replication.
  • Architecture layer: Distributed file system layer. JBODs, append-only file system, each extent is replicated 3 times.
  • Architecture layer: Partition layer. Understands our data abstractions (blobs, queues, etc). Massively scalable index. Log Structure Merge Tree. Linked list of extents
  • Architecture layer: Front-end layer. REST front end. Authentication/authorization. Metrics/logging.
  • Key Design Points
  • Availability with consistency for writing. All writes we do are to a log. Append to the last extent of the log (a toy model follows after this list).
  • Ordered the same across all 3 replicas. Success only if all 3 replicas are committed. Extents get sealed (no more appends) when they get to a certain size.
  • If you lose a node, seal the old two copies, create 3 new instances to append to. Also make a 3rd copy for the old one.
  • Availability with consistency for reading. Can read from any replica. Send out parallel read requests if the first read is taking longer than the 95th percentile latency.
  • Partition Layer: spread index/transaction processing across servers. If there is a hot node, split that part of the index off. Dynamically load balance. Just the index, this does not move the data.
  • DFS Layer: load balancing there as well. No disk or node should be hot. Applies to both reads and writes. Lazily move replicas around for load balancing.
  • Append only system. Benefits: simple replication, easier diagnostics, erasure coding, keep snapshots with no extra cost, works well with future drive technology. Tradeoff: GC overhead.
  • Our approach to the CAP theorem. Tradeoff in Availability vs. Consistency. Extra flexibility to achieve C and A at the same time.
  • Lessons learned: Automatic load balancing. Adapt to conditions. Tunable and extensible to tune load balancing rules. Tune based on any dimension (CPU, network, memory, tpc, GC load, etc.)
  • Lessons learned: Achieve consistently low append latencies. Ended up using SSD journaling.
  • Lessons learned: Efficient upgrade support. We update frequently, almost constantly. Handle upgrades almost as failures.
  • Lessons learned: Pressure point testing. Make sure we’re resilient despite errors.
  • Erasure coding. Implemented at the DFS Layer. See last year’s SDC presentation.
  • Azure VM persistent disks: VHDs for persistent disks are directly stored in Windows Azure Storage blobs. You can access your VHDs via REST.
  • Easy to upload/download your own VHD and mount them. REST writes are blocked when mounted to a VM. Snapshots and Geo replication as well.
  • Separating compute from storage. Allows them to be scaled separately. Provide flat network storage. Using a Quantum 10 network architecture.
  • Summary: Durability (3 copies), Consistency (commit across 3 copies). Availability (can read from any of the 3 replicas). Performance/Scale.
  • Windows Azure developer website: https://www.windowsazure.com/en-us/develop/net
  • Windows Azure storage blog: https://blogs.msdn.com/b/windowsazurestorage
  • SOSP paper/talk: https://blogs.msdn.com/b/windowsazure/archive/2011/11/21/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx
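Here is a toy model of my own (not Azure code) of the write path described above: an append commits only when all 3 replicas acknowledge it, and a replica failure seals the current extent so new appends continue on a freshly replicated extent. The failure probability is just a stand-in to exercise the sealing path.

```python
# Toy model (not Azure code) of append-only extents replicated 3 ways:
# commit requires all replicas; a failure seals the extent and a new one
# is created for further appends.
import random

class Extent:
    def __init__(self, replicas=3):
        self.records, self.sealed, self.replicas = [], False, replicas

    def append(self, record):
        if self.sealed:
            return False
        # commit only if every replica acknowledges the append
        acks = sum(1 for _ in range(self.replicas) if random.random() > 0.01)
        if acks == self.replicas:
            self.records.append(record)
            return True
        self.sealed = True          # a replica failed: seal, never append here again
        return False

class AppendLog:
    def __init__(self):
        self.extents = [Extent()]

    def append(self, record):
        # on failure, start a freshly replicated extent and retry there
        while not self.extents[-1].append(record):
            self.extents.append(Extent())

log = AppendLog()
for i in range(1000):
    log.append("record-%d" % i)
print("extents used:", len(log.extents))
```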

SMB Direct update
Greg Kramer, Microsoft
Tom Talpey, Microsoft

  • Two parts: 1 - Tom shares Ecosystem status and updates, 2 - Greg shares SMB Direct details
  • Protocols and updates: SMB 3.02 is a minor update. Documented in MS-SMB2 and MS-SMBD. See Dave's talk yesterday.